The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

# Approximate Kernel Selection with Strong Approximate Consistency

Lizhong Ding,1 Yong Liu,3 Shizhong Liao,4 Yu Li,2 Peng Yang,2 Yijie Pan,5 Chao Huang,6 Ling Shao,1 Xin Gao2

1 Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
2 King Abdullah University of Science and Technology (KAUST), Saudi Arabia
3 Institute of Information Engineering, CAS, China
4 Tianjin University, China
5 Ningbo Institute of Computing Technology, CAS, China
6 Ningbo Institute of Information Technology Application, CAS, China

Corresponding authors (email: lizhong.ding@inceptioniai.org, xin.gao@kaust.edu.sa). Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Kernel selection is fundamental to the generalization performance of kernel-based learning algorithms. Approximate kernel selection is an efficient kernel selection approach that exploits the convergence property of the kernel selection criteria and the computational virtue of kernel matrix approximation. The convergence property is measured by the notion of approximate consistency. For the existing Nyström approximations, whose sampling distributions are independent of the specific learning task at hand, it is difficult to establish the strong approximate consistency. They mainly focus on the quality of the low-rank matrix approximation, rather than the performance of the kernel selection criterion used in conjunction with the approximate matrix. In this paper, we propose a novel Nyström approximate kernel selection algorithm by customizing a criterion-driven adaptive sampling distribution for the Nyström approximation, which adaptively reduces the error between the approximate and accurate criteria. We theoretically derive the strong approximate consistency of the proposed Nyström approximate kernel selection algorithm. Finally, we empirically evaluate the approximate consistency of our algorithm as compared to state-of-the-art methods.

## Introduction

Kernel-based learning provides a way to implicitly transform data into a new feature space, which allows the learning of nonlinear functions using linear classifiers or regressors in the kernel-induced feature space. The kernel function determines the reproducing kernel Hilbert space (RKHS), which is the hypothesis space of kernel-based learning and hence has an essential influence on the performance of the resulting hypothesis. Therefore, the selection of the kernel function is a central problem in kernel-based learning (Micchelli and Pontil 2005). The problem of kernel selection is closely linked to the generalization error of kernel-based learning algorithms. The kernel with the smallest generalization error is usually regarded as the optimal kernel (Bartlett, Boucheron, and Lugosi 2002; Liu et al. 2017; Liu and Liao 2015). However, in practice one cannot compute the generalization error because the underlying probability distribution of the data is unknown. It is thus common practice to resort to estimates of the generalization error. One can either use empirical estimates that are based on experiments, or theoretical estimates that are based on upper bounds of the generalization error. Cross-validation (CV) is a well-known empirical estimate of the generalization error. Leave-one-out (LOO), the extreme form of CV, provides an almost unbiased estimate of the generalization error.
However, the naïve kernel selection strategy of CV, exhaustive search in the kernel parameter space, is computationally intensive. To speed up CV-based methods, approximate CV approaches were proposed, such as efficient LOO (Cawley and Talbot 2010) and Bouligand influence function CV (BIFCV) (Liu, Jiang, and Liao 2014; Liu et al. 2018). Employing theoretical estimate bounds of the generalization error as kernel selection criteria is another alternative to experimental methods. Different bound-based criteria introduce different measures of the capacity of the hypothesis space (Bartlett, Boucheron, and Lugosi 2002; Ding et al. 2018), such as Rademacher complexity (Bartlett and Mendelson 2002), local Rademacher complexity (Cortes, Kloft, and Mohri 2013; Li et al. 2018), maximal discrepancy (Bartlett, Boucheron, and Lugosi 2002), maximum mean discrepancy (MMD) (Sriperumbudur et al. 2009; Gretton et al. 2012; Song et al. 2012; Ding et al. 2019), covering number (Ding and Liao 2014b) and effective dimensionality (Zhang 2005). However, the computational complexities of the existing kernel selection criteria are at least quadratic in the number of examples $l$, i.e., $O(l^2)$. This kind of scalability is prohibitive for large-scale problems.

Approximate kernel selection is an emerging and efficient kernel selection approach, which exploits the convergence property of the kernel selection criteria as well as the computational virtue of kernel matrix approximation (Ding and Liao 2011; 2014a; 2017). The basic principle of approximate kernel selection is that it is sufficient to calculate approximate kernel selection criteria, which can discriminate the (nearly) optimal kernel from other candidates with high efficiency. Two theoretical problems are faced by approximate kernel selection: how kernel matrix approximation impacts the kernel selection criterion, and whether this impact can be ignored for a large enough number of examples. In (Ding and Liao 2014a), the approximate consistency was first defined to theoretically answer these questions, by studying under what conditions and at what speed the approximate kernel selection criterion is close to the accurate one, if at all. It is worth mentioning that the approximate consistency is defined for approximate kernel selection algorithms and is different from the classical concept of consistency in learning theory, which is defined for learning algorithms and measures how the learned hypothesis converges to the optimal one that minimizes the expected error in the hypothesis space.

The Nyström approximation (Williams and Seeger 2000; Drineas and Mahoney 2005; Yang et al. 2012; Kumar, Mohri, and Talwalkar 2012; Gittens and Mahoney 2013; Jin et al. 2013; Musco and Musco 2017) is a prevailing low-rank matrix approximation method in the machine learning community. However, even for the Nyström approximation with the best kernel matrix approximation error bound (Gittens and Mahoney 2013), it is difficult to prove the strong approximate consistency of the approximate kernel selection method (Ding and Liao 2014a). It has been proven that the best approximate consistency for Nyström methods is the 1/2-order approximate consistency (weaker than the strong approximate consistency) (Ding and Liao 2014a), which is derived from the Nyström approximation using leverage score sampling (Gittens and Mahoney 2013). Providing the first Nyström approximate kernel selection approach with strong approximate consistency is the goal of this paper.
Sampling distribution is critical to the performance of the Nyström approximation. However, for the existing Nyström methods, the sampling distributions are independent of the specific learning task at hand and focus on the quality of the low-rank matrix approximation, which ignores the performance of the kernel selection criterion used in conjunction with these approximations. The 1/2-order approximate consistency is the best approximate consistency for Nyström methods (Ding and Liao 2014a), which is likely caused by the isolation between the sampling distribution and the kernel selection. In this paper, we customize an adaptive sampling distribution for the Nyström approximation and propose a novel Nyström approximate kernel selection algorithm with strong approximate consistency.

The main contributions of this paper can be summarized as follows. First, a criterion-driven adaptive sampling distribution that iteratively reduces the error between the approximate and accurate criteria is designed for the Nyström approximation for the first time. Second, based on this newly designed sampling distribution, we propose a novel Nyström approximate kernel selection algorithm. Third, we prove the strong approximate consistency of the proposed Nyström approximate kernel selection algorithm. Finally, we conduct empirical evaluations of the approximate consistency of the proposed algorithm as compared to state-of-the-art approximate algorithms.

## Related Work

Kernel-based learning algorithms suffer from high computational and storage complexity due to the use of the kernel matrix. Kernel matrix approximation is adopted to effectively reduce the computational and storage burdens of kernel-based learning. To achieve linear complexity in the number of examples, low-rank approximations from subsets of columns are considered, such as the classical Nyström method with different kinds of sampling strategies (Williams and Seeger 2000; Drineas and Mahoney 2005; Zhang and Kwok 2010; Kumar, Mohri, and Talwalkar 2012; Gittens and Mahoney 2013; Musco and Musco 2017), the modified Nyström method (Wang and Zhang 2013), incomplete Cholesky decomposition (Fine and Scheinberg 2002; Bach and Jordan 2005; Bach 2013), sparse greedy approximations (Smola and Schölkopf 2000), and CUR matrix decomposition (Drineas, Mahoney, and Muthukrishnan 2008).

The Nyström method¹ is an effective low-rank matrix approximation method that has been extensively used in different domains of machine learning, such as Gaussian processes (Williams and Seeger 2000), spectral grouping (Fowlkes et al. 2004), and manifold learning (Talwalkar, Kumar, and Rowley 2008; Zhang and Kwok 2010). To reduce the matrix approximation error, different sampling strategies for the Nyström approximation have been considered and theoretically analyzed, including uniform sampling (Kumar, Mohri, and Talwalkar 2009), column norm-based sampling (Drineas and Mahoney 2005), k-means clustering sampling (Zhang and Kwok 2010; Si, Hsieh, and Dhillon 2017), and leverage score-based sampling (Gittens and Mahoney 2013). We collectively refer to the above sampling as fixed sampling, which determines the distribution of all columns before the sampling procedure. In addition to fixed sampling, an adaptive sampling technique was proposed in (Deshpande et al. 2006), which selects more informative columns in each iteration and iteratively updates the sampling distribution for selecting the next columns.
The adaptive sampling has been extended to the Nyström approximation (Kumar, Mohri, and Talwalkar 2012), the modified Nyström approximation (Wang and Zhang 2013) and the ridge leverage score Nyström approximation (Musco and Musco 2017). However, these existing fixed and adaptive sampling distributions are almost independent of the specific learning task at hand. They focus on the quality of the low-rank approximation, i.e., the kernel matrix approximation error, rather than the performance of the kernel selection criterion used with these approximations. As compared to (Ding and Liao 2011; 2014a; 2017), this paper designs the first criterion-driven adaptive sampling strategy and provides the first approximate kernel selection algorithm with strong approximate consistency for the classical Nyström approximation.

¹ In the rest of this paper, if we mention the Nyström method, we refer to the classical Nyström method, not the modified Nyström method. Although the modified Nyström method has a tighter matrix error bound, its computational burden is higher than that of the classical Nyström method.

## Notations and Preliminaries

We use $\mathcal{X}$ to denote the input space and $\mathcal{Y}$ the output domain. Usually we will have $\mathcal{X} \subseteq \mathbb{R}^d$, with $\mathcal{Y} = \{-1, 1\}$ for binary classification and $\mathcal{Y} = \mathbb{R}$ for regression. We assume $|y| \le M$ for any $y \in \mathcal{Y}$, where $M$ is a constant. The training set is denoted by $S = \{(x_1, y_1), \ldots, (x_l, y_l)\} \in (\mathcal{X} \times \mathcal{Y})^l$. We consider Mercer kernels $\kappa$ in this paper, that is, continuous, symmetric and positive definite functions from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$. The kernel matrix $K = [\kappa(x_i, x_j)]_{i,j=1}^{l}$, defined on a finite set of inputs $\{x_1, \ldots, x_l\} \subset \mathcal{X}$, is symmetric and positive definite (SPD). The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_\kappa$ associated with the kernel $\kappa$ can be defined as $\mathcal{H}_\kappa = \overline{\mathrm{span}}\{\kappa(x, \cdot) : x \in \mathcal{X}\}$, and the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_\kappa}$ on $\mathcal{H}_\kappa$ is determined by $\langle \kappa(x, \cdot), \kappa(x', \cdot) \rangle_{\mathcal{H}_\kappa} = \kappa(x, x')$ for $x, x' \in \mathcal{X}$. We use $\|K\|_2$ and $\|K\|_F$ to denote the spectral and Frobenius norms of $K$, respectively. We use $\lambda_t(K)$ for $t = 1, \ldots, l$ to denote the eigenvalues of $K$ in descending order.

## Approximate Kernel Selection

In this section, we first introduce a kernel selection criterion that is very general in supervised learning, and then give a brief review of approximate kernel selection and approximate consistency. We consider the regularized square loss function, which is very common in the machine learning community,

$$\mathcal{E}(f) = \frac{1}{l} \sum_{i=1}^{l} \left( f(x_i) - y_i \right)^2 + \mu \|f\|_{\mathcal{H}_\kappa}^2,$$

where $\mu$ denotes the regularization parameter. The optimal function is $f_\kappa = \arg\min_{f \in \mathcal{H}_\kappa} \mathcal{E}(f)$. By the representer theorem, we have $f_\kappa = \sum_{i=1}^{l} \alpha_i \kappa(x_i, \cdot)$ with $\alpha = (\alpha_1, \ldots, \alpha_l)^\top = (K + \mu l I)^{-1} y$, where $y = (y_1, \ldots, y_l)^\top$ and $I$ denotes the identity matrix. Writing $K_\mu = K + \mu l I$, we have $\|f_\kappa\|_{\mathcal{H}_\kappa}^2 = \alpha^\top K \alpha = y^\top K_\mu^{-1} K K_\mu^{-1} y$. Denoting $f_\kappa = (f_\kappa(x_1), \ldots, f_\kappa(x_l))^\top$, we have $f_\kappa = K\alpha = K K_\mu^{-1} y$, which implies $f_\kappa - y = K K_\mu^{-1} y - K_\mu K_\mu^{-1} y = -\mu l K_\mu^{-1} y$. Now, we have

$$\mathcal{E}(f_\kappa) = \frac{1}{l} (f_\kappa - y)^\top (f_\kappa - y) + \mu \|f_\kappa\|_{\mathcal{H}_\kappa}^2 = \mu y^\top K_\mu^{-1} y.$$

$\mathcal{E}(f_\kappa)$ is the regularized empirical error of the optimal function $f_\kappa$. For a fixed regularization parameter $\mu$, $\mathcal{E}(f_\kappa)$ only depends on the kernel $\kappa$. It is known that $\kappa$ has a one-to-one correspondence with the RKHS $\mathcal{H}_\kappa$. Different kernels correspond to different RKHSs. In different RKHSs, different optimal functions are derived. We can select the optimal function $f_\kappa$ that makes $\mathcal{E}(f_\kappa)$ the smallest among all optimal functions, and then the corresponding kernel $\kappa$ will be the optimal kernel. We define a kernel selection criterion as

$$C(K) = \mathcal{E}(f_\kappa) = \mu y^\top K_\mu^{-1} y. \quad (1)$$

In the following, we adopt $C(K)$ as a case to show our Nyström approximate kernel selection algorithm.
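To make criterion (1) concrete, the following minimal sketch (Python with numpy, used here purely for illustration; the data sizes, the Gaussian kernel width and the regularization value are made-up choices, not settings from the paper) evaluates $C(K) = \mu y^\top (K + \mu l I)^{-1} y$ with a dense solve, which is exactly the expensive computation that approximate kernel selection is designed to avoid.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def criterion_exact(K, y, mu):
    """C(K) = mu * y^T (K + mu*l*I)^{-1} y via a dense O(l^3) solve."""
    l = K.shape[0]
    return mu * float(y @ np.linalg.solve(K + mu * l * np.eye(l), y))

# Toy data; sizes and parameters are illustrative only.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(rng.standard_normal(200))
print(criterion_exact(gaussian_kernel(X, gamma=0.125), y, mu=0.005))
```

Kernel selection with (1) then amounts to evaluating this quantity for every candidate kernel and keeping the minimizer; the approximate criterion introduced below replaces the dense solve with a far cheaper computation on a low-rank approximation $\tilde{K}$.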
It is worth noting that our algorithm can be generalized to any other kernel selection criterion $C(\cdot)$ that is defined as a function of a kernel matrix. For a prescribed set of kernels $\mathcal{K}$, we can find the optimal kernel by

$$\kappa^* = \arg\min_{\kappa \in \mathcal{K}} C(K) = \arg\min_{\kappa \in \mathcal{K}} \mu y^\top K_\mu^{-1} y.$$

There are three cases for the choice of the kernel set $\mathcal{K}$: (i) $\mathcal{K}$ includes a given type of kernel that has finitely many candidate parameters; (ii) $\mathcal{K}$ includes a given type of kernel that has continuous parameters; (iii) $\mathcal{K}$ is defined as a set of nonnegative combinations of base kernels (Ding and Liao 2017). $C(K)$ can be applied to all three cases. In this paper, we concentrate on the design of the sampling distribution and the approximate kernel selection algorithm, so we only consider the first case and leave the latter two as future work.

Approximate kernel selection was first studied in (Ding and Liao 2011). Suppose that a kernel selection criterion $C(\cdot)$ and a kernel matrix approximation algorithm $A(\cdot)$, which uses the training data $S$ and the kernel $\kappa$ to generate the approximate matrix $\tilde{K}$, are given. Approximate kernel selection is developed to select the kernel $\tilde{\kappa}^*$ as

$$\tilde{\kappa}^* = \arg\min_{\kappa \in \mathcal{K}} C(A(S, \kappa)) = \arg\min_{\kappa \in \mathcal{K}} C(\tilde{K}). \quad (2)$$

Here we denote an approximate kernel selection method $M$ as a 2-tuple: $M = (C(\cdot), A(\cdot))$. The computational cost for $C(K)$ defined in (1) is $O(l^3)$, which is prohibitive for big data. The computation of $C(\tilde{K})$ can be much more efficient than that of $C(K)$ due to the specific structure of $\tilde{K}$. For the Nyström approximation, the Woodbury formula can be used to calculate $C(\tilde{K})$ (Ding and Liao 2012), and for the multilevel circulant matrix approximation (Ding and Liao 2017), the fast Fourier transform (FFT) can be used.

To demonstrate the rationality of approximate kernel selection, the notion of approximate consistency was defined in (Ding and Liao 2014a), which answers the theoretical questions under what conditions and at what speed the approximate kernel selection criterion converges to the accurate one, if at all².

Definition 1. Suppose we are given an approximate kernel selection method $M = (C(\cdot), A(\cdot))$, where $C(\cdot)$ is a kernel selection criterion, and $A(\cdot)$ is a kernel matrix approximation algorithm, which uses $S$ and $\kappa$ to generate the approximate matrix $\tilde{K}$. We say the approximate kernel selection method $M$ is of strong approximate consistency if

$$|C(K) - C(\tilde{K})| \le \varepsilon(l), \quad (3)$$

where $\lim_{l \to \infty} \varepsilon(l) = 0$. We say $M$ is of $p$-order approximate consistency if

$$|C(K) - C(\tilde{K})| \le \varepsilon(l), \quad (4)$$

where $\lim_{l \to \infty} \varepsilon(l)/l^p = 0$.

There are two scenarios: if $A$ is a deterministic algorithm, the approximate consistency is defined deterministically; if $A$ is a stochastic algorithm, (3) or (4) is established under expectation or with high probability.

² In (Ding and Liao 2014a), the approximate consistency is defined for $A(\cdot)$. Here we refine that definition and take approximate consistency as a basic property of $M = (C(\cdot), A(\cdot))$.

## Nyström Approximate Kernel Selection

In this section, we materialize the kernel matrix approximation algorithm $A(\cdot)$ by the Nyström approximation, customize an adaptive sampling strategy for the Nyström approximation and propose a novel Nyström approximate kernel selection algorithm with strong approximate consistency.

Suppose we randomly sample $c$ columns of $K$. Let $C$ denote the $l \times c$ matrix formed by these columns. Let $D$ be the $c \times c$ matrix consisting of the intersection of these $c$ columns with the corresponding $c$ rows of $K$.
The Nyström approximation matrix is

$$\tilde{K} = C D_k^{+} C^\top \approx K,$$

where $D_k$ is the optimal rank-$k$ approximation to $D$ and $D_k^{+}$ is the Moore-Penrose generalized inverse of $D_k$. If we denote the SVD of $D$ as $D = U_D \Sigma_D U_D^\top$, then $D_k^{+} = U_{D,k} \Sigma_{D,k}^{+} U_{D,k}^\top$, where $\Sigma_{D,k}$ and $U_{D,k}$ correspond to the top $k$ singular values and singular vectors of $D$, respectively. We then have

$$\tilde{K} = C U_{D,k} \sqrt{\Sigma_{D,k}^{+}} \left( C U_{D,k} \sqrt{\Sigma_{D,k}^{+}} \right)^\top = V V^\top,$$

where we let $V = C U_{D,k} \sqrt{\Sigma_{D,k}^{+}} \in \mathbb{R}^{l \times k}$.

As shown in (2), approximate kernel selection adopts the approximate criterion $C(\tilde{K})$ to select the optimal kernel. To calculate the value of $C(\tilde{K})$, we need to solve the inverse of $\tilde{K} + \mu l I_l$, where $I_l$ denotes the $l \times l$ identity matrix. Using the Woodbury formula, we obtain

$$\left( \tilde{K} + \mu l I_l \right)^{-1} = \frac{1}{\mu l} \left[ I_l - V \left( \mu l I_k + V^\top V \right)^{-1} V^\top \right].$$

To solve $(\tilde{K} + \mu l I_l)^{-1} y$ in $C(\tilde{K})$, we introduce the vector $u$ and let $(\tilde{K} + \mu l I_l) u = y$. Then we have

$$u = \frac{1}{\mu l} \left[ y - V \left( \mu l I_k + V^\top V \right)^{-1} V^\top y \right]. \quad (5)$$

To efficiently solve (5), we introduce a temporary variable $\omega$ satisfying $(\mu l I_k + V^\top V)\, \omega = V^\top y$, and then $u = \frac{1}{\mu l} (y - V \omega)$. We summarize the above computation of $C(\tilde{K})$ from Step 9 to Step 14 in Algorithm 1.

Before conducting the above approximation procedure, the most important step for the Nyström method is to determine the distribution for sampling $c$ columns from $K$. The sampling distributions of existing Nyström methods are independent of the specific learning task at hand and mainly focus on the kernel matrix approximation error. The independence between the sampling distribution and the learning task is the main source of the weaker approximate consistency. Here we customize the sampling distribution of the Nyström approximation for approximate kernel selection. We adopt adaptive sampling for the Nyström approximation, instead of sampling all columns at one time, to select more informative columns. The adaptive sampling procedure is given by Steps 3 to 8 in Algorithm 1. In each iteration, we only select $s < c$ columns from $K$³. Then, according to the errors incurred by these columns, we update the distribution that will be used for selecting the next $s$ columns.

³ We do not calculate the whole kernel matrix $K$ and then sample. We just calculate the corresponding $s$ columns of $K$.

Algorithm 1: Nyström Approximate Kernel Selection

Require: Training data $S = \{(x_i, y_i)\}_{i=1}^{l}$, candidate kernel set $\mathcal{K} = \{\kappa^{(i)} : i \in [N]\}$, number of columns to be chosen $c$, initial distribution $\mathcal{D}_0$, number of columns selected at each iteration $s$, regularization parameter $\mu$
Ensure: $\kappa^*$
1: Initialize: $C_{\mathrm{opt}} = +\infty$;
2: for each $\kappa \in \mathcal{K}$ do
3:   Sample $s$ indices according to $\mathcal{D}_0$ to form $I$;
4:   $t = c/s - 1$ {number of iterations};
5:   for $i = 1 : t$ do
6:     $\mathcal{D}_i$ = Update_Probability($I$); {Algorithm 2}
7:     $I_i$ = set of $s$ indices sampled according to $\mathcal{D}_i$;
8:     $I = I \cup I_i$;
9:   end for
10:  Form $C$ and $D$ according to $I$;
11:  Calculate the SVD of $D$ as $D = U_D \Sigma_D U_D^\top$;
12:  Let $V = C U_{D,k} \sqrt{\Sigma_{D,k}^{+}}$;
13:  Solve $(\mu l I_k + V^\top V)\, \omega = V^\top y$ to obtain $\omega$;
14:  $u = \frac{1}{\mu l} (y - V \omega)$;
15:  $C(\tilde{K}) = \mu y^\top u$;
16:  if $C(\tilde{K}) \le C_{\mathrm{opt}}$ then
17:    $C_{\mathrm{opt}} = C(\tilde{K})$;
18:    $\kappa^* = \kappa$;
19:  end if
20: end for
21: return $\kappa^*$;

The error used in the adaptive sampling of (Kumar, Mohri, and Talwalkar 2009) is only the matrix approximation error, which is independent of the learning and kernel selection. For the kernel selection criterion $C(K)$, to study the approximate consistency, we need to bound the difference

$$C(K) - C(\tilde{K}) = \mu y^\top K_\mu^{-1} y - \mu y^\top \tilde{K}_\mu^{-1} y,$$

where $K_\mu = K + \mu l I$ and $\tilde{K}_\mu = \tilde{K} + \mu l I$. The tighter the difference between $C(K)$ and $C(\tilde{K})$ can be bounded, the stronger the approximate consistency that can be established. In order to reduce the difference between $C(K)$ and $C(\tilde{K})$, we define the error matrix

$$E = K_\mu^{-1} \circ (y y^\top) - \tilde{K}_\mu^{-1} \circ (y y^\top),$$

where $\circ$ denotes the Hadamard product.
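As a concrete illustration of this pipeline, the sketch below (Python with numpy; the sizes are made up, the columns are drawn uniformly as a stand-in for the adaptive distribution of Algorithm 2, and the full matrix $K$ is formed only so that the exact value can be compared) builds the factor $V$ from $c$ sampled columns, solves the small $k \times k$ system for $\omega$ as in (5), evaluates $C(\tilde{K}) = \mu y^\top u$, and reports the gap $|C(K) - C(\tilde{K})|$ that the criterion-driven sampling is designed to shrink.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

def criterion_nystrom(K_cols, idx, y, mu, k):
    """Evaluate C(K~) from c sampled columns of K via V = C U_{D,k} (Sigma_{D,k}^+)^{1/2}."""
    l = K_cols.shape[0]
    C = K_cols                                   # l x c sampled columns
    D = C[idx, :]                                # c x c intersection block
    w, U = np.linalg.eigh(D)                     # D is SPD, so its SVD is its eigendecomposition
    top = np.argsort(w)[::-1][:k]                # indices of the top-k eigenvalues
    V = C @ U[:, top] / np.sqrt(w[top])          # l x k factor with K~ = V V^T
    omega = np.linalg.solve(mu * l * np.eye(k) + V.T @ V, V.T @ y)
    u = (y - V @ omega) / (mu * l)               # u = (K~ + mu*l*I)^{-1} y, as in (5)
    return mu * float(y @ u)                     # C(K~) = mu * y^T u

rng = np.random.default_rng(0)
l, c, k, mu, gamma = 500, 100, 10, 0.005, 0.125  # illustrative values only
X = rng.standard_normal((l, 5))
y = np.sign(rng.standard_normal(l))
K = gaussian_kernel(X, gamma)                    # formed only to compare with the exact criterion
idx = rng.choice(l, size=c, replace=False)       # uniform sampling as a stand-in
approx = criterion_nystrom(K[:, idx], idx, y, mu, k)
exact = mu * float(y @ np.linalg.solve(K + mu * l * np.eye(l), y))
print(exact, approx, abs(exact - approx))        # the gap |C(K) - C(K~)|
```

Algorithm 1 repeats exactly this evaluation for every candidate kernel, but draws the columns with the adaptively updated, criterion-driven distribution rather than uniformly.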
The error matrix $E$ contains the information of the vector $y$ and the regularized kernel matrix. For each iteration of adaptive sampling, we choose the columns so as to make the error between the approximate and accurate criteria small. However, calculating the error matrix $E$ requires computing the inverse of $K_\mu$, which is of $O(l^3)$ time complexity. We can prove that

$$
\begin{aligned}
|C(K) - C(\tilde{K})| &= \left| \mu y^\top K_\mu^{-1} y - \mu y^\top \tilde{K}_\mu^{-1} y \right| \\
&= \mu \left| y^\top \left( K_\mu^{-1} - \tilde{K}_\mu^{-1} \right) y \right| \\
&= \mu \left| y^\top \left[ (K + \mu l I)^{-1} (\tilde{K} - K) (\tilde{K} + \mu l I)^{-1} \right] y \right| \\
&\le \mu \|y\|_2 \, \|(K + \mu l I)^{-1}\|_2 \, \|K - \tilde{K}\|_2 \, \|(\tilde{K} + \mu l I)^{-1}\|_2 \, \|y\|_2 \\
&\le \frac{\mu \|y\|_2^2 \, \|K - \tilde{K}\|_2}{\lambda_{\min}(K + \mu l I)\, \lambda_{\min}(\tilde{K} + \mu l I)} \\
&\le \frac{1}{\mu l^2} \|y\|_2^2 \, \|K - \tilde{K}\|_2. \quad (6)
\end{aligned}
$$

This upper bound shows that for $C(K)$ we can reduce the error $K_\mu^{-1} - \tilde{K}_\mu^{-1}$ by reducing $K - \tilde{K}$. Now we redefine the error matrix as

$$E = K_\mu \circ (y y^\top) - \tilde{K}_\mu \circ (y y^\top).$$

Computing $E$ requires a full pass over $K_\mu$, which is inefficient for large-scale problems. In order to further reduce the computational burden, we approximate $E$ as follows:

$$\tilde{E} = C \circ (y\, y_\nu^\top) - \tilde{C} \circ (y\, y_\nu^\top),$$

where $C$ is the matrix of previously sampled columns of $K$, $\tilde{C}$ is the matrix of corresponding columns of $\tilde{K}$, $y_\nu$ denotes the first $\nu$ elements of $y$ and $\nu$ is the number of columns in $C$. For classification, we regularize the labels to keep the class information: we use $l_+$ ($l_-$) to denote the number of positive (negative) data points and let $y_i = 1/l_+$ if $y_i = +1$ and $y_i = -1/l_-$ if $y_i = -1$, for $i = 1, \ldots, l$. For regression, we keep the original labels. The error between $C$ and $\tilde{C}$ is always less than the error between $K_\mu$ and $\tilde{K}_\mu$, i.e., $\|C - \tilde{C}\|_F \le \|K_\mu - \tilde{K}_\mu\|_F$. When we theoretically prove the approximate consistency, we can therefore just derive the related upper bound of $\|K_\mu - \tilde{K}_\mu\|_F$. Finally, we define the sampling distribution as

$$\mathcal{D} = \{p_i\}_{i=1}^{l}, \quad p_i = \|\tilde{E}_i\|_2^2 / \|\tilde{E}\|_F^2, \quad i = 1, \ldots, l,$$

where $\tilde{E}_i$ is the $i$-th column of $\tilde{E}$. The procedure for updating the sampling distribution is shown in Algorithm 2. Step 5 and Step 6 in Algorithm 2 imply that our sampling is without replacement.

Algorithm 2: Update_Probability

Require: The index set $I$
Ensure: The distribution $\mathcal{D} = \{p_1, \ldots, p_l\}$
1: Form $C$ and $D$ according to $I$;
2: Construct the Nyström approximation matrix $\tilde{K}$;
3: $\tilde{E} = C \circ (y\, y_\nu^\top) - \tilde{C} \circ (y\, y_\nu^\top)$;
4: for $i = 1 : l$ do
5:   if $i \in I$ then
6:     $p_i = 0$;
7:   else
8:     $p_i = \|\tilde{E}_i\|_2^2$;
9:   end if
10: end for
11: $p_i = p_i / \sum_{i=1}^{l} p_i$ for $i = 1, \ldots, l$;
12: return $\mathcal{D} = \{p_1, \ldots, p_l\}$;

The complete Nyström approximate kernel selection algorithm is shown in Algorithm 1. The main computational cost of Algorithm 1 is from Step 11 to Step 13. The time complexity of the SVD in Step 11 is $O(c^3)$. In Step 12, a matrix multiplication with $O(lck)$ complexity is conducted. In Step 13, the inverse of $\mu l I_k + V^\top V$ is solved by computing its Cholesky factorization with complexity $O(k^3)$; computing the matrix of the linear system takes $O(lk^2)$ multiplications, so the total complexity of Step 13 is $O(lk^2)$. Therefore, the total time complexity of Algorithm 1 is $O(N(c^3 + lck))$, where $N$ is the number of candidate kernels, which is linear in the number of examples $l$. Step 3 of Algorithm 2 can be done in parallel over the columns, so its time complexity is $O(l)$.

Before giving the main theorem, we first introduce two assumptions (Jin et al. 2013).

Assumption 1. For $\rho \in (0, 1/2)$ and the rank parameter $k \le c \ll l$, $\lambda_k(K) = \Omega(l/c^{\rho})$ and $\lambda_{k+1}(K) = O(l/c^{1-\rho})$, where $\rho$ characterizes the eigen-gap.

Assumption 2. We always assume that the rank parameter $k$ is a constant and the sampling size $c$ is a small ratio $r$ of $l$, i.e., $c = rl$.

Assumption 1 states the large eigen-gap in the spectrum of $K$ (Jin et al. 2013), i.e., the first few eigenvalues of $K$ are much larger than the remaining ones, which is not a strong assumption.
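The eigen-gap required by Assumption 1 is easy to inspect empirically. The short sketch below (Python with numpy on made-up data; the sample size, Gaussian width and rank $k$ are illustrative assumptions) prints the leading eigenvalues $\lambda_1(K), \ldots, \lambda_{k+1}(K)$ and the ratio $\lambda_k(K)/\lambda_{k+1}(K)$ of a Gaussian kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
sq = np.sum(X ** 2, axis=1)
K = np.exp(-0.125 * np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues of K in descending order
k = 10
print(eigvals[: k + 1])                          # lambda_1(K), ..., lambda_{k+1}(K)
print(eigvals[k - 1] / eigvals[k])               # eigen-gap ratio lambda_k / lambda_{k+1}
```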
As assumed in (Bach 2013), the eigenvalues of the kernel matrix have polynomial or exponential decay. The eigenvalues of Gaussian kernels have exponential decay (Cortes, Kloft, and Mohri 2013). Assumption 1 is always weaker than exponential decay, even when $\rho$ goes to 0. When $\rho$ is close to $1/2$, Assumption 1 is weaker than polynomial decay. The assumption of a constant rank has been adopted in (Wang and Zhang 2013). Assumption 2 is one of the common settings for the Nyström approximation. The following theorem shows the strong approximate consistency of Algorithm 1, whose proof sketch is given in the Appendix.

Theorem 1. If Assumptions 1 and 2 hold, for the kernel selection criterion $C(K)$ defined in (1), we have

$$\mathbb{E}\, |C(K) - C(\tilde{K})| \le \varepsilon(l),$$

where the calculation of $C(\tilde{K})$ is shown in Algorithm 1,

$$\varepsilon(l) = \frac{\tau}{\mu} \sqrt{\frac{k(k+1)M^8 + sM^4}{s}} \cdot \frac{\sqrt{l-k}}{(rl)^{1-\rho}}$$

for some constant $\tau$, and $\lim_{l \to \infty} \varepsilon(l) = 0$.

The strong approximate consistency reveals the fast convergence of the difference between the approximate kernel selection criterion and the accurate one. When our adaptive Nyström approximate kernel selection algorithm is applied to the kernel selection problem, we can obtain the optimal kernel that is closest to the one produced by accurate kernel selection methods, as compared to other Nyström approximate algorithms with weaker approximate consistency, which shows the appositeness of our adaptive Nyström approximate algorithm for kernel selection.

[Figure 1: Approximate consistency for different kernel matrix approximation algorithms. Each panel plots the values of the criterion against the kernel parameter γ (on a log scale, from 2^-10 up to 2^2, or up to 2^10 for some data sets such as liver disorders and breast cancer) for the compared methods: Original, Opt App, Uniform, Col Norm, Leverage, Adapt Deshp, Adapt Kumar, and Adapt MS.]

## Empirical Studies

Here we empirically evaluate the approximate consistency of the proposed approximate kernel selection algorithm. We compare 8 different methods. The first one adopts the original criterion $C(K)$ in (1) for kernel selection (Original). The second is the optimal rank-$k$ approximation (Opt App). For fixed sampling distributions, we consider the Nyström approximation with uniform sampling (Uniform), column norm-based sampling (Col Norm) (Drineas and Mahoney 2005), and leverage score-based sampling (Leverage) (Gittens and Mahoney 2013). For adaptive sampling distributions, we compare the adaptive sampling in (Deshpande et al. 2006) (Adapt Deshp) and the one in (Kumar, Mohri, and Talwalkar 2012) (Adapt Kumar).
We denote our sampling strategy as Adapt MS, which is short for model selection-based adaptive sampling. We set the sampling size $c = 0.2l$ and the adaptive sampling size $s = 0.1c$. To account for the randomness, we run all methods 10 times. Since Gaussian kernels are universal (Steinwart 2001), we adopt Gaussian kernels $\kappa(x, x') = \exp(-\gamma \|x - x'\|_2^2)$ with variable width $\gamma$ as our candidate kernel set $\mathcal{K}$. The focus of this paper is not on tuning the regularization parameter $\mu$, so we simply set $\mu = 0.005$. Since the regularized kernel matrix is $K_\mu = K + \mu l I$, $\mu = 0.005$ is not too small. All the implementations are in the R language. We conduct experiments on benchmark data sets from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) and LIBSVM Data (http://www.csie.ntu.edu.tw/~cjlin/libsvm). Since the aim of our experiments is to evaluate the theoretical findings on approximate consistency, we do not conduct experiments on very large scale data sets.

For each kernel parameter $\gamma$, we generate the kernel matrix $K$ and then use different approximation methods to produce the approximate kernel matrices $\tilde{K}$. We compare the values of $C(K)$ and $C(\tilde{K})$. The results are shown in Figure 1. We can see that, apart from the optimal rank-$k$ approximation (Opt App), the curves of Adapt MS are closest to the curves of the original criterion $C(K)$ for all data sets, which shows stronger approximate consistency as compared to the other approximation methods. It is worth noting that the complexity of Opt App is $O(l^3)$ for each candidate kernel, whereas the complexity of our algorithm is $O(c^3 + lck)$ with $k \le c \ll l$. These results demonstrate that when we conduct approximate kernel selection, our Nyström approximate kernel selection algorithm can obtain the kernel that is closest to the one produced by the accurate kernel selection algorithm, as compared to the approximate algorithms with weaker approximate consistency. The complexities of Original and Opt App are both $O(l^3)$. Our adaptive Nyström approximate kernel selection algorithm is faster than Opt App, Leverage and Adapt Deshp, close to Adapt Kumar, but slower than Uniform and Col Norm. Adapt Deshp requires a full pass through $K$ in each iteration. Leverage requires an SVD to compute the sampling distribution, with $O(l^3)$ time complexity. Although there are cases where some other approximate algorithms have comparable performance to our proposed algorithm, the strong approximate consistency of our algorithm can guarantee a consistently better performance than the algorithms with weaker approximate consistency.

## Conclusions

In this paper, we proposed a novel Nyström approximate kernel selection method. By introducing a criterion-driven adaptive sampling distribution, we established the first strong approximate consistency of the Nyström approximate kernel selection method. The sampling strategy considered in this paper is different from the existing matrix-error-based sampling strategies and is closely related to the specific learning task at hand. This design for the sampling distribution may open a door for research into learning-error-based kernel matrix approximation. Through empirical studies, we showed the stronger approximate consistency of the proposed adaptive Nyström approximate kernel selection method as compared to the state-of-the-art algorithms. In the future, we will consider the application of our theoretical and algorithmic results to online learning (Yang, Zhao, and Gao 2018) and recommendation (Yang et al. 2018).
## Appendix: Proof Sketch of Theorem 1

The proof is mainly based on the results in (Deshpande et al. 2006; Cortes, Mohri, and Talwalkar 2010; Deshpande and Rademacher 2010). According to (6), we can bound

$$|C(K) - C(\tilde{K})| \le \frac{1}{\mu l^2} \|y\|_2^2 \, \|K - \tilde{K}\|_2.$$

Since $\|y\|_2 \le \sqrt{l}\, M$, we have

$$|C(K) - C(\tilde{K})| \le \frac{M^2}{\mu l} \|K - \tilde{K}\|_2.$$

As discussed in (Gittens and Mahoney 2013), SPSD matrix approximations based on column sampling (such as the Nyström method) and those based on mixtures of columns can both be subsumed in the SPSD sketching model. Here we apply the results for the approximation based on mixtures of columns to the Nyström approximation. The initial $s$ columns are sampled according to efficient volume sampling (Deshpande and Rademacher 2010). We use $C_s$ to denote the matrix formed by the $s$ columns of $K$ and $D_s$ to denote the corresponding intersection matrix. From Theorem 8 in (Deshpande and Rademacher 2010), we obtain

$$\|K - C_s D_{s,k}^{+} C_s^\top\|_F \le \sqrt{k+1}\, \|K - K_k\|_F.$$

According to Theorem 2.1 in (Deshpande et al. 2006), for adaptive sampling, if we further adaptively sample another $s$ columns, we have

$$\mathbb{E}\, \|K - C_{2s} D_{2s,k}^{+} C_{2s}^\top\|_F^2 \le \|K - K_k\|_F^2 + \frac{k}{s} \|\tilde{E}\|_F^2.$$

Here $\tilde{K}$ is $C_{2s} D_{2s,k}^{+} C_{2s}^\top$. If we continue the adaptive sampling, the error will decrease. From the theoretical perspective, we just need to bound the sampling twice. Since $\|y\|_2 \le \sqrt{l}\, M$, we can prove that

$$\|\tilde{E}\|_F^2 \le M^4 \|K - C_s D_{s,k}^{+} C_s^\top\|_F^2.$$

Now, we have

$$\mathbb{E}\, \|K - C_{2s} D_{2s,k}^{+} C_{2s}^\top\|_F^2 \le \frac{k(k+1)M^4 + s}{s} \|K - K_k\|_F^2.$$

Since $\|K - K_k\|_F \le \sqrt{l-k}\, \lambda_{k+1}(K)$ and $\|A\|_2 \le \|A\|_F$ for any matrix $A$, according to the above derived bounds, Assumption 1 and Assumption 2, we can obtain

$$\mathbb{E}\, |C(K) - C(\tilde{K})| \le \varepsilon(l) = \frac{\tau}{\mu} \sqrt{\frac{k(k+1)M^8 + sM^4}{s}} \cdot \frac{\sqrt{l-k}}{(rl)^{1-\rho}}.$$

Since $\rho < \frac{1}{2}$, $\lim_{l \to \infty} \varepsilon(l) = 0$.

## Acknowledgments

This publication is based upon work supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/3007-01-01 and BAS/1/1624-01-01, the National Natural Science Foundation of China (No. 61703396), the National Natural Science Foundation of China (No. 61673293) and the Shenzhen Government (GJHZ20180419190732022).

## References

Bach, F. R., and Jordan, M. I. 2005. Predictive low-rank decomposition for kernel methods. In ICML, 33–40.

Bach, F. 2013. Sharp analysis of low-rank kernel matrix approximations. In COLT, 185–209.

Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3:463–482.

Bartlett, P. L.; Boucheron, S.; and Lugosi, G. 2002. Model selection and error estimation. Machine Learning 48(1-3):85–113.

Cawley, G. C., and Talbot, N. L. C. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11:2079–2107.

Cortes, C.; Kloft, M.; and Mohri, M. 2013. Learning kernels using local Rademacher complexity. In NIPS 26, 2760–2768.

Cortes, C.; Mohri, M.; and Talwalkar, A. 2010. On the impact of kernel approximation on learning accuracy. In AISTATS, 113–120.

Deshpande, A., and Rademacher, L. 2010. Efficient volume sampling for row/column subset selection. In FOCS, 329–338.

Deshpande, A.; Rademacher, L.; Vempala, S.; and Wang, G. 2006. Matrix approximation and projective clustering via volume sampling. Theory of Computing 2:225–247.

Ding, L., and Liao, S. 2011. Approximate model selection for large scale LSSVM. Journal of Machine Learning Research - Proceedings Track 20:165–180.

Ding, L., and Liao, S. 2012. Nyström approximate model selection for LSSVM. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 16th Pacific-Asia Conference (PAKDD), 282–293.

Ding, L., and Liao, S. 2014a.
Approximate consistency: Towards foundations of approximate kernel selection. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 354–369.

Ding, L., and Liao, S. 2014b. Model selection with the covering number of the ball of RKHS. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), 1159–1168.

Ding, L., and Liao, S. 2017. An approximate approach to automatic kernel selection. IEEE Transactions on Cybernetics 47(3):554–565.

Ding, L.; Liao, S.; Liu, Y.; Yang, P.; and Gao, X. 2018. Randomized kernel selection with spectra of multilevel circulant matrices. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2910–2917.

Ding, L.; Liu, Z.; Li, Y.; Liao, S.; Liu, Y.; Yang, P.; Yu, G.; Shao, L.; and Gao, X. 2019. Linear kernel tests via empirical likelihood for high-dimensional data. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI).

Drineas, P., and Mahoney, M. W. 2005. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research 6:2153–2175.

Drineas, P.; Mahoney, M. W.; and Muthukrishnan, S. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30(2):844–881.

Fine, S., and Scheinberg, K. 2002. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research 2:243–264.

Fowlkes, C.; Belongie, S.; Chung, F.; and Malik, J. 2004. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2):214–225.

Gittens, A., and Mahoney, M. W. 2013. Revisiting the Nyström method for improved large-scale machine learning. In ICML, 567–575.

Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Schölkopf, B.; and Smola, A. 2012. A kernel two-sample test. Journal of Machine Learning Research 13(1):723–773.

Jin, R.; Yang, T.; Mahdavi, M.; Li, Y.-F.; and Zhou, Z.-H. 2013. Improved bounds for the Nyström method with application to kernel classification. IEEE Transactions on Information Theory 59(10):6939–6949.

Kumar, S.; Mohri, M.; and Talwalkar, A. 2009. Sampling techniques for the Nyström method. In AISTATS, 304–311.

Kumar, S.; Mohri, M.; and Talwalkar, A. 2012. Sampling methods for the Nyström method. Journal of Machine Learning Research 13:981–1006.

Li, J.; Liu, Y.; Yin, R.; Zhang, H.; Ding, L.; and Wang, W. 2018. Multi-class learning: from theory to algorithm. In NeurIPS 31, 1593–1602.

Liu, Y., and Liao, S. 2015. Eigenvalues ratio for kernel selection of kernel methods. In AAAI, 2814–2820.

Liu, Y.; Liao, S.; Lin, H.; Yue, Y.; and Wang, W. 2017. Infinite kernel learning: generalization bounds and algorithms. In AAAI, 2280–2286.

Liu, Y.; Lin, H.; Ding, L.; Wang, W.; and Liao, S. 2018. Fast cross-validation. In IJCAI, 2497–2503.

Liu, Y.; Jiang, S.; and Liao, S. 2014. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In ICML, 324–332.

Micchelli, C. A., and Pontil, M. 2005. Learning the kernel function via regularization. Journal of Machine Learning Research 6:1099–1125.

Musco, C., and Musco, C. 2017. Recursive sampling for the Nyström method. In NIPS 30, 3836–3848.

Si, S.; Hsieh, C.-J.; and Dhillon, I. S. 2017. Memory efficient kernel approximation. Journal of Machine Learning Research 18:682–713.

Smola, A. J., and Schölkopf, B. 2000. Sparse greedy matrix approximation for machine learning.
In ICML, 911–918.

Song, L.; Smola, A.; Gretton, A.; Bedo, J.; and Borgwardt, K. M. 2012. Feature selection via dependence maximization. Journal of Machine Learning Research 13:1393–1434.

Sriperumbudur, B. K.; Fukumizu, K.; Gretton, A.; Lanckriet, G. R. G.; and Schölkopf, B. 2009. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS 22, 1750–1758.

Steinwart, I. 2001. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research 2:67–93.

Talwalkar, A.; Kumar, S.; and Rowley, H. 2008. Large-scale manifold learning. In CVPR, 1–8.

Wang, S., and Zhang, Z. 2013. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research 14:2729–2769.

Williams, C. K. I., and Seeger, M. 2000. Using the Nyström method to speed up kernel machines. In NIPS 13, 682–688.

Yang, T.; Li, Y.-F.; Mahdavi, M.; Jin, R.; and Zhou, Z.-H. 2012. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS 25, 1060–1068.

Yang, P.; Zhao, P.; Zheng, V. W.; Ding, L.; and Gao, X. 2018. Robust asymmetric recommendation via min-max optimization. In SIGIR, 1077–1080.

Yang, P.; Zhao, P.; and Gao, X. 2018. Bandit online learning on graphs via adaptive optimization. In IJCAI.

Zhang, K., and Kwok, J. T. 2010. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks 21(10):1576–1587.

Zhang, T. 2005. Learning bounds for kernel regression using effective data dimensionality. Neural Computation 17(9):2077–2098.