# Efficient Kernel Selection via Spectral Analysis

Jian Li1,2, Yong Liu1,3,*, Hailun Lin1, Yinliang Yue1, Weiping Wang1
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3Tianjin University
{lijian9026,liuyong,yueyinliang,linhailun,wangweiping}@iie.ac.cn
*Corresponding author

**Abstract**

Kernel selection is a fundamental problem in kernel methods. Existing measures for kernel selection either provide weak theoretical guarantees or have high computational complexity. In this paper, we propose a novel kernel selection criterion based on a newly defined spectral measure of a kernel matrix, which has both a sound theoretical foundation and high computational efficiency. We first show that the spectral measure can be used to derive generalization bounds for some kernel-based algorithms. By minimizing the derived generalization bounds, we then propose a kernel selection criterion based on the spectral measure. Moreover, we demonstrate that the popular minimum graph cut and maximum mean discrepancy are two special cases of the proposed criterion. Experimental results on a large number of data sets show that our criterion not only gives results comparable to the state-of-the-art criteria, but also significantly improves efficiency.

## 1 Introduction

Kernel methods, such as the support vector machine (SVM) and the least squares support vector machine (LSSVM), have been widely used in data mining, pattern recognition and machine learning. The performance of these algorithms greatly depends on the choice of the kernel function. Kernel selection is therefore foundational to kernel methods, and it remains a challenging problem.

The standard technique for kernel selection is cross-validation (CV). It consists in using a subset of the data for training and then testing on the remaining data in order to estimate the generalization error. However, CV requires training the learning algorithm several times, which is computationally intensive. For the sake of efficiency, several approximate CV criteria have been introduced, such as generalized cross-validation (GCV) [Golub et al., 1979], generalized approximate cross-validation (GACV) [Wahba et al., 1999], the span bound [Chapelle et al., 2002], efficient leave-one-out cross-validation (ELOO) [Cawley, 2006], the influence function [Debruyne et al., 2008], and the Bouligand influence function [Liu et al., 2014; Liu and Liao, 2017].

Kernel target alignment (KTA) [Cristianini et al., 2001] is another widely used criterion; it can be computed in $O(n^2)$ time, where $n$ is the size of the data set. Based on the centered kernel matrix, an improved centered KTA (CKTA) was proposed [Cortes et al., 2010], which gives better performance than KTA. FSM [Nguyen and Ho, 2008] is another efficiently computable criterion, which evaluates the goodness of a kernel function via the data distribution in the feature space. Due to their simplicity and efficiency, KTA, CKTA and FSM have been widely used. However, the connection between these measures and the generalization error of specific algorithms, such as SVM and LSSVM, has not been established. Therefore, the kernels chosen by these criteria cannot guarantee good performance for these specific algorithms.

Minimizing theoretical bounds on the generalization error is an alternative approach to kernel selection.
To this end, a number of complexity measures have been introduced, such as the VC dimension [Vapnik, 2000], Rademacher complexity [Bartlett and Mendelson, 2002], local Rademacher complexity [Bartlett et al., 2005], covering numbers [Zhang, 2002], uniform stability [Bousquet and Elisseeff, 2002], the compression coefficient [Luxburg et al., 2004], eigenvalue perturbation [Liu et al., 2013], kernel stability [Liu and Liao, 2014], the eigenvalue ratio [Liu and Liao, 2015], and the principal eigenvalue proportion [Liu et al., 2017d; 2017c]. However, these measures focus on deriving theoretical generalization bounds, which are usually difficult to use for kernel selection in practice. To address this problem, Cortes et al. [2013] used the tail eigenvalues of the kernel matrix to design new algorithms for learning kernels; however, a theoretical error bound based on the tail eigenvalues of the kernel matrix is lacking. Liu and Liao [2015] investigated this direction further and introduced the notion of eigenvalue ratio for kernel selection, which can be used to derive generalization bounds. However, the time complexity of this measure is high, and it introduces two extra parameters that have to be tuned, which makes it hard to use.

In this paper, we propose a novel measure, called the spectral measure, with a sound theoretical foundation and high computational efficiency. The spectral measure is defined through the spectral decomposition of the kernel matrix and can be computed in $O(n^2)$ time. We show that the generalization errors of SVM and LSSVM can be bounded in terms of the spectral measure. Furthermore, we propose a new kernel selection criterion that maximizes the spectral measure, which minimizes the derived upper bounds and thus guarantees good generalization performance, and we prove that the minimum graph cut and maximum mean discrepancy are two of its special cases. Theoretical analyses and experimental results show that our criterion is sound and effective. To our knowledge, no previous kernel selection criterion for kernel methods offers both a theoretical guarantee and $O(n^2)$ time complexity.

## 2 Notations and Preliminaries

We consider supervised learning, where a learning algorithm receives a sample of $n$ labeled points $S = \{(x_i, y_i)\}_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y} = \{-1, +1\}$ denotes the output space. We assume that $S$ is drawn independently and identically from a fixed but unknown probability distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$.

Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with $K$ is defined to be the completion of the linear span of the set of functions $\{K(x, \cdot) : x \in \mathcal{X}\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K(x, \cdot), f \rangle_K = f(x)$ for all $f \in \mathcal{H}_K$. We use $\|\cdot\|_K$ to denote the norm in $\mathcal{H}_K$. In this paper, we study the regularized algorithms

$$
f_S := \arg\min_{f \in \mathcal{H}_K} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \|f\|_K^2,
$$

where $\ell(\cdot, \cdot)$ is a loss function and $\lambda$ is the regularization parameter. Choosing $\ell(\cdot,\cdot)$ to be the square loss $\ell(t, y) = (t - y)^2$ gives rise to LSSVM, while choosing it to be the hinge loss $\ell(t, y) = \max\{0, 1 - yt\}$ produces SVM.

The performance of the regularized algorithms for classification is usually measured by the generalization error (risk)

$$
R(S) = \Pr_{(x,y) \sim \mathcal{D}}[y f_S(x) < 0].
$$

Unfortunately, $R(S)$ cannot be computed, since the probability distribution $\mathcal{D}$ is unknown; hence we have to estimate it from empirical data. In this paper, we introduce a novel measure to estimate it.
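For concreteness, the following minimal NumPy sketch shows the LSSVM instance of this regularized problem; the closed form $\alpha = (K + \lambda I)^{-1} y$ is the one used later in Appendix A. All function and variable names here are ours (illustrative), not from the paper.

```python
import numpy as np

def gaussian_kernel(X, tau):
    """Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 * tau))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * tau))

def lssvm_fit(K, y, lam):
    """LSSVM (square loss): the representer coefficients solve (K + lam*I) alpha = y,
    so that the training-point predictions are f_S = K @ alpha."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Toy usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(rng.normal(size=50))
K = gaussian_kernel(X, tau=1.0)
alpha = lssvm_fit(K, y, lam=1.0)
f_train = K @ alpha  # f_S evaluated at the training points
```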
Throughout, $K = [K(x_i, x_j)]_{i,j=1}^n$ is the kernel matrix and $N$ is the normalized kernel matrix $N = K / |K|_1$, where $|K|_1 = \sum_{i,j=1}^n K(x_i, x_j)$. Let $(\lambda_i, v_i)$, $i = 1, \ldots, n$, be the spectral decomposition of $N$, where $\lambda_i$ is an eigenvalue and $v_i$ the corresponding eigenvector. In the following, we write $|K|_1 =: C$ and assume $0 \le K(x, x') \le \kappa$ for all $x, x' \in \mathcal{X}$. Let $\mathbb{R}_+ = \{t \mid t \in \mathbb{R}, t \ge 0\}$.

## 3 Spectral Measure

In this section, we first introduce the definition of the spectral measure, and then use it to derive generalization bounds for LSSVM and SVM.

### 3.1 Definition of Spectral Measure

It is well known that the kernel matrix contains most of the information needed by kernel methods, and that its spectral decomposition plays an important role. Therefore, we aim at proposing a novel measure for kernel selection based on the spectral decomposition of the kernel matrix.

**Definition 1 (Spectral Measure (SM)).** Let $(\lambda_i, v_i)$, $i = 1, \ldots, n$, be the spectral decomposition of the normalized kernel matrix $N$. Assume that $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a function with $\phi(\lambda_i) \le \lambda_i$ for all $i \in \{1, \ldots, n\}$. Then the spectral measure of $K$ with respect to $\phi$ is defined as

$$
\mathrm{SM}(K, \phi) := \frac{1}{n} \sum_{i=1}^n \phi(\lambda_i) \langle y, v_i \rangle^2,
$$

where $y = (y_1, \ldots, y_n)^\mathrm{T}$.

The condition $\phi(\lambda_i) \le \lambda_i$ may look strange at first glance; the following two forms of $\phi$ show that it is an important ingredient for removing noise:

- Hinge form: for $h \ge 0$, $\phi(\lambda_i) = 0$ if $\lambda_i \le h$, and $\phi(\lambda_i) = \lambda_i$ otherwise.
- High-degree form: $\phi(\lambda_i) = \lambda_i^r$, $r \ge 1$.

It is easy to verify that the hinge form satisfies the assumption $\phi(\lambda_i) \le \lambda_i$. From the definition of $N$, one can see that $0 \le \lambda_i \le 1$, so the high-degree form also satisfies the assumption. From the definitions of these two forms, one can see that $\phi(\lambda_i) = 0$ (hinge form) or $\phi(\lambda_i) \to 0$ quickly (high-degree form) when $\lambda_i$ is small. Since small eigenvalues usually express noise [Steinwart and Christmann, 2008], these forms can be used to remove the noise.

When we use the high-degree form,

$$
\mathrm{SM}(K, t^r) = \frac{1}{n} \sum_{i=1}^n \lambda_i^r \langle y, v_i \rangle^2
= \frac{1}{n} y^\mathrm{T} \Big( \sum_{i=1}^n \lambda_i^r v_i v_i^\mathrm{T} \Big) y
= \frac{1}{n} y^\mathrm{T} N^r y.
$$

The above equation shows that $\mathrm{SM}(K, t^r)$ can be computed in $O(n^2)$ time.

**Remark 1.** The time cost of the high-degree form is $O(n^2)$, which is much lower than that of the hinge form, while the accuracy of the two forms is similar. Therefore, in this paper we only consider the high-degree form for kernel selection.

### 3.2 SM-based Generalization Error Bounds

In this subsection, we derive generalization bounds for LSSVM and SVM in terms of the spectral measure.

**Theorem 1.** Consider LSSVM, and assume that $\|f\|_K \le 1$ for all $f \in \mathcal{H}_K$. Then, with probability $1 - \delta$ over the random choice of a sample $S$ of size $n \ge 5$,

$$
R(S) \le 1 - c_0\,\mathrm{SM}(K, \phi)
+ \inf_{\theta \in (0,1]} \bigg[ \theta + \frac{7\mu}{3n}
+ \sqrt{\frac{3\mu}{n}\big(1 + \theta - c_0\,\mathrm{SM}(K, \phi)\big)} \bigg] + \frac{2}{n},
$$

where $\mu = \frac{8}{\theta^2} \ln n \ln(2n) + \ln\frac{2n}{\delta}$ and $c_0 = \frac{C}{C + \lambda}$.

*Proof.* The proof is given in Appendix A.

The assumption that $\|f\|_K \le 1$ for all $f \in \mathcal{H}_K$ is a common one, used for example in [Bartlett et al., 2005; Kloft and Blanchard, 2011; Cortes et al., 2013]. The theorem shows that the generalization error of LSSVM can be bounded in terms of the spectral measure $\mathrm{SM}(K, \phi)$. Therefore, we can choose the kernel function by maximizing $\mathrm{SM}(K, \phi)$ to guarantee good generalization performance.
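The quantity $\mathrm{SM}(K, t^r)$ that appears in this bound (and in Theorem 2 below) can be evaluated with $r$ matrix-vector products, which is the $O(n^2)$ computation claimed above; no eigendecomposition is needed. A minimal NumPy sketch, with function names that are ours rather than the paper's:

```python
import numpy as np

def spectral_measure(K, y, r=3):
    """SM(K, t^r) = (1/n) * y^T N^r y with N = K / |K|_1.
    Evaluated with r matrix-vector products, i.e. O(r * n^2) time,
    without any eigendecomposition."""
    n = len(y)
    N = K / K.sum()                 # normalized kernel matrix N
    v = np.asarray(y, dtype=float)
    for _ in range(r):              # v <- N^r y
        v = N @ v
    return float(y @ v) / n
```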
**Theorem 2.** Consider SVM, and assume that $\|f\|_K \le 1$ for all $f \in \mathcal{H}_K$. Then, with probability $1 - \delta$ over the random choice of a sample $S$ of size $n \ge 5$,

$$
R(S) \le 1 - C\,\mathrm{SM}(K, \phi) + \frac{Cb}{n}
+ \inf_{\theta \in (0,1]} \bigg[ \theta + \frac{7\mu}{3n}
+ \sqrt{\frac{3\mu}{n}\Big(1 + \theta - C\,\mathrm{SM}(K, \phi) + \frac{Cb}{n}\Big)} \bigg] + \frac{2}{n},
$$

where $\mu = \frac{8}{\theta^2} \ln n \ln(2n) + \ln\frac{2n}{\delta}$ and $b = \max\{1, \frac{1}{2\lambda} - 1\}$.

*Proof.* The proof is given in Appendix B.

The above theoretical analysis demonstrates the theoretical guarantee of the spectral measure for kernel selection.

## 4 Spectral Kernel Selection

In this section, we present a novel kernel selection criterion based on the spectral measure, and show that the minimum graph cut and maximum mean discrepancy are two of its special cases.

According to Theorem 1 and Theorem 2, to guarantee good generalization performance we can select the kernel function by maximizing the spectral measure $\mathrm{SM}(K, \phi)$. As discussed above, setting $\phi(t) = t^r$ suppresses the noise, and $\mathrm{SM}(K, t^r)$ can be computed efficiently. We therefore first consider the kernel selection criterion

$$
\arg\max_{K \in \mathcal{K}} \mathrm{SM}(K, t^r) = \frac{1}{n} y^\mathrm{T} N^r y,
$$

where $\mathcal{K}$ is a candidate set of kernel functions. To avoid problems caused by imbalance between the positive and negative classes, we can weight the two classes according to their sample sizes. Thus, we finally adopt the following weighted spectral measure criterion (SM):

$$
\arg\max_{K \in \mathcal{K}} \widetilde{\mathrm{SM}}(K, t^r) = \frac{1}{n} \tilde{y}^\mathrm{T} N^r \tilde{y}, \qquad (1)
$$

where $\tilde{y}_i = \frac{n}{n_+}$ if $y_i = +1$ and $\tilde{y}_i = -\frac{n}{n_-}$ if $y_i = -1$, and $n_+$ and $n_-$ are the sizes of the positive and negative classes, respectively. One can see that the time complexity of the SM criterion is $O(n^2)$.

The time complexities of popular existing kernel selection criteria, and whether they come with a theoretical guarantee, are reported in Table 1. Our SM is the only criterion with both a theoretical guarantee and $O(n^2)$ time complexity.

Table 1: Time complexity and theoretical guarantee of cross-validation, KTA, CKTA, FSM, ER and SM.

| Criterion | Time complexity | Theoretical guarantee |
|---|---|---|
| Cross-validation | at least $O(n^3)$ | Yes |
| KTA, CKTA, FSM | $O(n^2)$ | No |
| ER | $O(n^3)$ | Yes |
| SM (ours) | $O(n^2)$ | Yes |

**Connections to graph cut.** In graph theory, the graph cut is used to measure the degree of dissimilarity between segments [Shi and Malik, 2000]. Since $K(x_i, x_j)$ can be regarded as the similarity between $x_i$ and $x_j$, we can use the kernel matrix $K$ as the similarity matrix of a graph. In this case, the normalized graph cut (Ncut) [Shi and Malik, 2000] can be written as

$$
\mathrm{Ncut}(K) = \frac{y^\mathrm{T} L y}{y^\mathrm{T} D y},
$$

where $L = D - K$, $K$ is the kernel matrix, and $D$ is the diagonal matrix with diagonal elements $D_{ii} = \sum_{j=1}^n K_{ij}$. If the positive and negative classes are balanced, that is $n_+ = n_-$, then $y^\mathrm{T} D y = \sum_{i=1}^n D_{ii} = \sum_{i,j=1}^n K_{ij} = |K|_1$. Thus,

$$
\mathrm{Ncut}(K) = 1 - \frac{y^\mathrm{T} K y}{y^\mathrm{T} D y}
= 1 - \frac{y^\mathrm{T} K y}{|K|_1}
= 1 - n\,\mathrm{SM}(K, t).
$$

This shows that minimizing $\mathrm{Ncut}(K)$ is equivalent to maximizing $\mathrm{SM}(K, \phi)$ with $\phi(t) = t$ for kernel selection.

**Connections to mean discrepancy.** Mean discrepancy (MD) [Gretton et al., 2007] was proposed to test whether two distributions are different on the basis of samples drawn from each of them. The empirical estimate of MD in the RKHS $\mathcal{H}_K$ [Gretton et al., 2007] can be written as

$$
\mathrm{MD}(K) = \Big\| \frac{1}{n_+} \sum_{y_i = +1} \varphi(x_i) - \frac{1}{n_-} \sum_{y_j = -1} \varphi(x_j) \Big\|_K^2
= \frac{1}{n_+^2} \sum_{y_i = y_j = +1} K_{ij}
+ \frac{1}{n_-^2} \sum_{y_i = y_j = -1} K_{ij}
- \frac{2}{n_+ n_-} \sum_{y_i = +1,\, y_j = -1} K_{ij},
$$

where $\varphi$ is the feature map with $\langle \varphi(x_i), \varphi(x_j) \rangle_K = K_{ij}$. If $K$ has been normalized, that is $K = N$, it is easy to verify that

$$
\widetilde{\mathrm{SM}}(K, t) = \frac{1}{n} \tilde{y}^\mathrm{T} N \tilde{y}
= n \bigg( \frac{1}{n_+^2} \sum_{y_i = y_j = +1} K_{ij}
+ \frac{1}{n_-^2} \sum_{y_i = y_j = -1} K_{ij}
- \frac{2}{n_+ n_-} \sum_{y_i = +1,\, y_j = -1} K_{ij} \bigg)
= n\,\mathrm{MD}(K).
$$

This shows that $\mathrm{MD}(K)$ is a special case of $\mathrm{SM}(K, \phi)$ (up to the constant factor $n$), which again demonstrates the effectiveness of our proposed kernel selection criterion.

**Remark 2.** The graph cut and mean discrepancy measures are two special cases of our SM with $\phi(t) = t$; as shown in Section 3.1, however, $\phi(t) = t^r$ with $r \ge 1$ can additionally be used to suppress the noise.

**Remark 3.** Instead of choosing a single kernel, several authors consider combining multiple kernels according to some criterion, known as multiple kernel learning (MKL); see [Lanckriet et al., 2004; Kloft et al., 2011; Liu et al., 2017a; 2017b] and the references therein. Our spectral measure criterion can also be applied to MKL:

$$
\max_{\mu} \widetilde{\mathrm{SM}}(K_\mu, t^r) \quad \text{s.t.} \quad \|\mu\|_p = 1, \ \mu \ge 0,
$$

where $K_\mu = \sum_{i=1}^k \mu_i K_i$. This optimization problem can be solved efficiently with gradient-based algorithms. In this paper, however, we mainly want to verify the effectiveness of the spectral measure criterion itself; therefore, in our experiments, we focus on comparing our criterion with other popular kernel selection criteria.
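To make criterion (1) concrete before turning to the experiments, the following sketch selects the width of a Gaussian kernel by maximizing the weighted spectral measure over a candidate grid. It reuses `gaussian_kernel` from the Section 2 sketch; the function names are ours, and the grid of widths mirrors the one used in Section 5.

```python
import numpy as np

def weighted_spectral_measure(K, y, r=3):
    """Weighted SM of criterion (1): (1/n) * yt^T N^r yt, where
    yt_i = n/n_+ for positive examples and -n/n_- for negative ones."""
    n = len(y)
    yt = np.where(y > 0, n / np.sum(y > 0), -n / np.sum(y < 0)).astype(float)
    N = K / K.sum()
    v = yt.copy()
    for _ in range(r):
        v = N @ v
    return float(yt @ v) / n

def select_gaussian_width(X, y, widths, r=3):
    """Return the candidate width tau whose Gaussian kernel maximizes the criterion."""
    scores = [weighted_spectral_measure(gaussian_kernel(X, tau), y, r) for tau in widths]
    return widths[int(np.argmax(scores))]

# e.g. tau_star = select_gaussian_width(X, y, widths=[2.0 ** i for i in range(-15, 16)], r=3)
```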
## 5 Experiments

In this section, we empirically analyze the performance of our proposed spectral measure criterion $\widetilde{\mathrm{SM}}(K, t^r)$ (SM). The evaluation is made on 25 publicly available data sets from the UCI, StatLib and Weka collections, listed in Table 2. For each data set, we run all methods 50 times, each time with a randomly selected 70% of the data used for training and the remaining 30% for testing. Using multiple training/test partitions allows an estimate of the statistical significance of the differences in performance between methods. Let $A_i$ and $B_i$ be the test errors of methods $A$ and $B$ on partition $i$, and let $d_i = B_i - A_i$, $i = 1, \ldots, 50$. Let $\bar{d}$ and $S_d$ be the mean and standard deviation of the $d_i$. Under a paired t-test with confidence level 95%, we claim that $A$ is significantly better than $B$ (equivalently, $B$ is significantly worse than $A$) if the t-statistic $\bar{d} / (S_d / \sqrt{50}) > 1.676$. All statements of statistical significance in the remainder of the paper refer to this 95% level.
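As an illustration, the significance test just described amounts to the following check; this is a sketch with array names of our choosing, where `errors_A` and `errors_B` hold the 50 per-partition test errors of the two methods being compared.

```python
import numpy as np

def significantly_better(errors_A, errors_B, threshold=1.676):
    """Paired t-test over the partitions: A is declared significantly better than B
    if mean(d) / (std(d) / sqrt(len(d))) > threshold, with d_i = B_i - A_i."""
    d = np.asarray(errors_B, dtype=float) - np.asarray(errors_A, dtype=float)
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return t_stat > threshold
```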
Experiments are conducted on a Dell PC with a 3.1 GHz 4-core CPU and 4 GB of memory. We use the popular Gaussian kernels

$$
K(x, x') = \exp\Big( -\frac{\|x - x'\|_2^2}{2\tau} \Big)
$$

as our candidate kernels, with $\tau \in \{2^i : i = -15, -14, \ldots, 15\}$. Since we mainly focus on kernel selection, we set the regularization parameter $\lambda = 1$ for all methods. The learning machine we use is LSSVM.

In the first experiment, we compare the performance of our proposed SM criterion with five popular kernel selection criteria: 5-fold cross-validation (CV), efficient leave-one-out cross-validation (ELOO) [Cawley, 2006], centered kernel target alignment (CKTA) [Cortes et al., 2010], feature space-based kernel matrix evaluation (FSM) [Nguyen and Ho, 2007] and the recent eigenvalue ratio (ER) [Liu and Liao, 2015]. For each data set, we choose the kernel parameter $\tau$ by each kernel selection criterion on the training set and then evaluate the test error of the chosen parameter on the test set. In this experiment, we set $r = 3$; the influence of $r$ is explored later.

Table 2: Comparison of test errors (%) between our spectral measure criterion (SM) and five other popular criteria: 5-fold cross-validation (CV), efficient leave-one-out cross-validation (ELOO), centered kernel target alignment (CKTA), feature space-based kernel matrix evaluation (FSM) and eigenvalue ratio (ER). The best result on each data set is shown in bold.

| Data set | SM | CV | ELOO | CKTA | FSM | ER |
|---|---|---|---|---|---|---|
| a1a | **16.84 ± 1.39** | 17.02 ± 1.57 | 16.88 ± 1.41 | 18.86 ± 1.49 | 24.72 ± 1.67 | 16.97 ± 1.52 |
| a2a | **17.78 ± 1.28** | 17.96 ± 1.25 | 17.94 ± 1.27 | 18.52 ± 1.26 | 25.62 ± 1.47 | 18.99 ± 1.37 |
| anneal | **2.69 ± 3.28** | 3.81 ± 4.11 | **2.69 ± 3.28** | 4.75 ± 4.78 | 5.13 ± 4.18 | 5.50 ± 4.95 |
| australian | 13.71 ± 2.10 | 13.84 ± 2.18 | 13.82 ± 2.04 | 13.91 ± 1.89 | 44.71 ± 2.47 | **13.53 ± 2.06** |
| autos | **11.81 ± 11.67** | **11.81 ± 11.67** | 12.75 ± 11.06 | 13.71 ± 12.03 | 12.71 ± 8.06 | 12.14 ± 11.51 |
| breast-w | **3.27 ± 1.01** | 3.56 ± 1.16 | 3.59 ± 1.08 | 3.51 ± 1.05 | 3.50 ± 1.05 | 4.26 ± 1.40 |
| breast-cancer | **3.18 ± 1.15** | 3.63 ± 1.16 | 3.50 ± 1.23 | 3.63 ± 1.16 | 3.60 ± 1.14 | 4.04 ± 1.12 |
| bupa | 30.29 ± 3.48 | **29.10 ± 4.04** | 30.31 ± 4.27 | 35.81 ± 3.45 | 39.77 ± 3.68 | 29.13 ± 4.46 |
| colic | **15.62 ± 3.00** | 16.47 ± 2.78 | 15.73 ± 2.97 | 19.27 ± 2.58 | 36.42 ± 3.28 | 17.35 ± 3.09 |
| diabetes | 24.22 ± 2.41 | 24.69 ± 2.71 | **23.51 ± 2.75** | 24.85 ± 2.46 | 35.30 ± 3.00 | 23.90 ± 2.48 |
| glass | 22.09 ± 5.07 | 21.82 ± 5.68 | **20.95 ± 4.82** | 26.41 ± 7.13 | 43.00 ± 9.22 | 22.50 ± 5.08 |
| german.numer | 24.09 ± 2.15 | 25.28 ± 2.38 | **23.81 ± 2.26** | 26.02 ± 2.16 | 29.89 ± 2.41 | 25.33 ± 2.14 |
| heart | 16.53 ± 3.27 | 16.69 ± 3.36 | **15.95 ± 3.29** | 18.67 ± 3.78 | 44.37 ± 5.50 | 15.98 ± 3.47 |
| hepatitis | **15.57 ± 4.68** | 17.09 ± 5.74 | 16.63 ± 4.64 | 15.74 ± 5.00 | 21.22 ± 5.41 | 18.91 ± 6.20 |
| ionosphere | 4.88 ± 2.10 | 5.28 ± 2.11 | 6.42 ± 2.17 | 11.70 ± 3.43 | 35.77 ± 4.00 | **4.86 ± 1.99** |
| labor | **13.65 ± 8.10** | 14.47 ± 8.08 | 14.82 ± 8.34 | 15.41 ± 8.80 | 34.59 ± 8.70 | 18.82 ± 8.81 |
| pima | 23.80 ± 2.14 | 22.78 ± 2.36 | **22.51 ± 2.41** | 24.38 ± 2.28 | 34.47 ± 2.42 | 22.78 ± 2.07 |
| segment | **0.01 ± 0.00** | 0.06 ± 0.24 | 0.20 ± 0.04 | 0.32 ± 0.03 | 0.21 ± 0.01 | 0.24 ± 0.04 |
| liver-disorders | 31.94 ± 3.21 | **29.00 ± 4.11** | 30.02 ± 4.76 | 36.27 ± 3.93 | 40.90 ± 4.10 | 29.69 ± 4.97 |
| sonar | 15.06 ± 4.80 | 14.26 ± 4.93 | **13.68 ± 4.43** | 15.00 ± 5.51 | 49.32 ± 6.93 | 18.84 ± 5.75 |
| vehicle | **3.02 ± 1.79** | 3.33 ± 1.77 | **3.02 ± 1.79** | 3.77 ± 1.51 | 53.32 ± 3.38 | 5.52 ± 2.44 |
| vote | **4.31 ± 1.71** | 4.78 ± 1.74 | 4.82 ± 1.73 | 5.25 ± 1.72 | 6.37 ± 3.96 | 7.80 ± 2.33 |
| wpbc | 23.10 ± 4.58 | 22.83 ± 4.32 | 21.93 ± 4.45 | **21.87 ± 4.13** | 22.13 ± 4.19 | **21.87 ± 4.13** |
| tic-tac-toe | 10.10 ± 1.93 | 10.28 ± 1.66 | **9.78 ± 1.66** | 33.62 ± 5.31 | 34.44 ± 2.04 | 14.62 ± 2.05 |
| wdbc | **2.29 ± 1.15** | 2.43 ± 1.07 | 2.73 ± 1.11 | 2.82 ± 1.20 | 37.49 ± 3.83 | 4.75 ± 1.66 |

Table 2 reports the average test errors, which can be summarized as follows: (a) SM is significantly better than FSM and CKTA on nearly all data sets. This can be explained by the fact that FSM and CKTA have no theoretical guarantee for a specific learning algorithm, so the kernels chosen by these criteria cannot guarantee good generalization performance. (b) SM is better than CV and ER on most data sets. In particular, SM is significantly better than ER (respectively CV) on 15 (respectively 11) of the 25 data sets, and significantly worse on only 4 (respectively 5) of the remaining data sets. (c) SM gives results comparable to ELOO. In particular, SM is significantly better than ELOO on 9 data sets (autos, breast-w, breast-cancer, hepatitis, ionosphere, labor, segment, vote and wdbc) and significantly worse on 8 data sets (diabetes, glass, heart, pima, liver-disorders, sonar, wpbc and tic-tac-toe).

The run times of SM, CV, ELOO, CKTA, FSM and ER are reported in Table 3.

Table 3: Comparison of run time (seconds) between our proposed SM and the five other criteria (CV, ELOO, CKTA, FSM and ER).

| Data set | SM | CV | ELOO | CKTA | FSM | ER |
|---|---|---|---|---|---|---|
| a1a | 63.77 | 704.50 | 217.03 | 87.51 | 79.71 | 422.71 |
| a2a | 124.53 | 1995.76 | 810.87 | 172.28 | 156.68 | 1558.70 |
| anneal | 0.28 | 4.60 | 0.84 | 0.43 | 0.55 | 1.36 |
| australian | 8.30 | 111.31 | 30.33 | 11.34 | 9.51 | 52.53 |
| autos | 0.10 | 2.18 | 0.23 | 0.17 | 0.29 | 0.37 |
| breast-cancer | 8.05 | 104.55 | 27.13 | 11.10 | 9.36 | 47.68 |
| breast-w | 8.59 | 105.39 | 29.77 | 11.80 | 10.11 | 49.86 |
| colic | 1.55 | 25.72 | 6.52 | 2.15 | 2.17 | 11.60 |
| glass | 0.33 | 5.74 | 1.15 | 0.53 | 0.66 | 2.00 |
| heart | 1.02 | 14.71 | 3.66 | 1.12 | 1.20 | 6.33 |
| hepatitis | 0.38 | 6.55 | 1.42 | 0.58 | 0.70 | 2.44 |
| ionosphere | 1.59 | 24.60 | 6.08 | 2.03 | 2.06 | 10.60 |
| labor | 0.16 | 2.88 | 0.42 | 0.28 | 0.42 | 0.67 |
| pima | 10.99 | 137.19 | 36.60 | 15.11 | 12.74 | 62.45 |
| segment | 7.50 | 91.73 | 23.83 | 10.49 | 8.93 | 42.63 |
| diabetes | 10.70 | 134.71 | 36.26 | 14.90 | 12.51 | 62.46 |
| german.numer | 21.80 | 249.98 | 72.63 | 29.59 | 26.27 | 127.51 |
| liver-disorders | 1.37 | 21.04 | 5.46 | 1.91 | 1.94 | 9.27 |
| sonar | 0.62 | 10.53 | 2.34 | 0.87 | 0.98 | 4.06 |
| vehicle | 2.03 | 35.64 | 9.35 | 2.86 | 2.85 | 15.95 |
| vote | 2.03 | 34.43 | 9.33 | 2.79 | 2.81 | 15.92 |
| wpbc | 0.52 | 9.06 | 2.07 | 0.76 | 0.92 | 3.33 |
| bupa | 1.36 | 21.09 | 5.34 | 1.85 | 1.88 | 9.19 |
| tic-tac-toe | 18.93 | 224.13 | 63.10 | 26.13 | 23.04 | 107.12 |
| wdbc | 5.75 | 61.32 | 19.18 | 7.86 | 6.80 | 33.65 |
It turns out that SM is much faster than CV, ELOO and ER, and comparable to CKTA and FSM. The above results show that SM guarantees generalization performance while also being computationally efficient.

In the following experiment, we explore the influence of the parameter $r$ of SM. Figure 1 plots the average test errors for different values of $r$ (due to space limits, we randomly select two data sets). For each fixed $r$, we choose the kernel parameter $\tau$ by SM on the training set and evaluate the test error of the chosen parameter on the test set. We find that the optimal $r$ lies in $[2, 5]$ on most data sets, and that one can simply set $r = 2$, $r = 3$ or $r = 4$ in practice without sacrificing accuracy.

*Figure 1: The test errors of the SM criterion with different $r$ on two randomly selected data sets (panel (b): german.numer). For each $r$, the kernel is chosen by SM on the training set and the test error of the chosen parameter is evaluated on the test set.*

## 6 Conclusion

In this paper, we define a new spectral measure of a kernel matrix that can be computed in $O(n^2)$ time. With the spectral measure, we bound the generalization errors of LSSVM and SVM, and we propose a kernel selection criterion that minimizes the derived generalization error bounds, of which the minimum graph cut and maximum mean discrepancy are two special cases. The high computational efficiency and theoretical guarantee of the spectral measure may bring a new perspective on designing kernel selection criteria via spectral analysis.

## Appendix A: Proof of Theorem 1

To prove Theorem 1, we first prove the following lemma.

**Lemma 1.** If $f_S$ is the solution of LSSVM, then

$$
\frac{1}{n} \sum_{i=1}^n y_i f_S(x_i) \ge c_0\,\mathrm{SM}(K, \phi). \qquad (2)
$$

*Proof.* Note that $(f_S(x_1), \ldots, f_S(x_n))^\mathrm{T} = K\alpha$ with $\alpha = [K + \lambda I]^{-1} y$. Thus

$$
\sum_{i=1}^n y_i f_S(x_i) = y^\mathrm{T} K \alpha = y^\mathrm{T} K [K + \lambda I]^{-1} y. \qquad (3)
$$

Let $(\beta_i, u_i)$ be the spectral decomposition of $K$; then

$$
y^\mathrm{T} K [K + \lambda I]^{-1} y = \sum_{i=1}^n \frac{\beta_i}{\lambda + \beta_i} \langle y, u_i \rangle^2. \qquad (4)
$$

Since $\mathrm{Tr}(K) \le |K|_1 = C$, we know that $\beta_j / C \le 1$. According to Equation (4),

$$
y^\mathrm{T} K [K + \lambda I]^{-1} y
= \sum_{i=1}^n \frac{\beta_i / C}{\beta_i / C + \lambda / C} \langle y, u_i \rangle^2
\ge \sum_{i=1}^n \frac{\beta_i / C}{1 + \lambda / C} \langle y, u_i \rangle^2.
$$

Note that $\lambda_i = \beta_i / |K|_1 = \beta_i / C$ and $v_i = u_i$, where $(\lambda_i, v_i)$ is the spectral decomposition of $N$; thus

$$
y^\mathrm{T} K [K + \lambda I]^{-1} y \ge \sum_{i=1}^n \frac{\lambda_i}{1 + \lambda / C} \langle y, v_i \rangle^2. \qquad (5)
$$

From the assumption $\phi(\lambda_i) \le \lambda_i$ and Equations (3), (4) and (5), we obtain

$$
\frac{1}{n} \sum_{i=1}^n y_i f_S(x_i)
\ge \frac{1}{n} \sum_{i=1}^n \frac{\phi(\lambda_i)}{1 + \lambda / C} \langle y, v_i \rangle^2
= \frac{C}{C + \lambda}\,\mathrm{SM}(K, \phi)
= c_0\,\mathrm{SM}(K, \phi),
$$

which completes the proof.

**Theorem 3** (Theorem 8 of [Gao and Zhou, 2013]). For any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of a sample $S$ of size $n \ge 5$, every $f \in \mathcal{H}_K$ satisfies

$$
\Pr_{\mathcal{D}}[y f(x) < 0]
\le \frac{2}{n} + \inf_{\theta \in (0,1]} \bigg[ \Pr_S[y f(x) \le \theta] + \frac{7\mu}{3n}
+ \sqrt{\frac{3\mu}{n} \Pr_S[y f(x) \le \theta]} \bigg],
$$

where $\mu = \frac{8}{\theta^2} \ln n \ln(2n) + \ln \frac{2n}{\delta}$.

*Proof of Theorem 1.* Note that $\Pr_S[y f_S(x) \le \theta] = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[y_i f_S(x_i) - \theta]$, where $\mathbb{1}[t] = 1$ if $t < 0$ and $\mathbb{1}[t] = 0$ otherwise. Thus, it is easy to verify that

$$
\Pr_S[y f_S(x) \le \theta] \le \frac{1}{n} \sum_{i=1}^n \max\{0,\, 1 - y_i f_S(x_i) + \theta\}. \qquad (6)
$$

From the assumption that functions in $\mathcal{H}_K$ range in $[-1, 1]$, we have $1 - y_i f_S(x_i) + \theta > 0$. Thus, by Equation (6),

$$
\Pr_S[y f_S(x) \le \theta] \le \frac{1}{n} \sum_{i=1}^n \big(1 - y_i f_S(x_i) + \theta\big). \qquad (7)
$$

According to Equations (7) and (2), we have

$$
\Pr_S[y f_S(x) \le \theta]
\le 1 + \theta - \frac{1}{n} \sum_{i=1}^n y_i f_S(x_i)
\le 1 + \theta - c_0\,\mathrm{SM}(K, \phi). \qquad (8)
$$

Substituting Equation (8) into Theorem 3, we have

$$
\Pr_{\mathcal{D}}[y f_S(x) < 0]
\le \frac{2}{n} + \inf_{\theta \in (0,1]} \bigg[ 1 + \theta - c_0\,\mathrm{SM}(K, \phi) + \frac{7\mu}{3n}
+ \sqrt{\frac{3\mu}{n}\big(1 + \theta - c_0\,\mathrm{SM}(K, \phi)\big)} \bigg].
$$

This completes the proof of Theorem 1.
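As an illustrative aside, Lemma 1 can be checked numerically on random data. The snippet below reuses the hypothetical `gaussian_kernel` and `spectral_measure` helpers from the earlier sketches and the reconstructed constant $c_0 = C/(C+\lambda)$; it is only a sanity check of the inequality under those assumptions, not part of the paper.

```python
import numpy as np

# Numerical check of Lemma 1:
#   (1/n) * y^T K (K + lam*I)^{-1} y  >=  c0 * SM(K, t^r),   c0 = C / (C + lam).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.sign(rng.normal(size=40))
K = gaussian_kernel(X, tau=1.0)          # from the Section 2 sketch
lam, r = 1.0, 3
n, C = len(y), K.sum()
lhs = float(y @ K @ np.linalg.solve(K + lam * np.eye(n), y)) / n
rhs = (C / (C + lam)) * spectral_measure(K, y, r=r)   # from the Section 3 sketch
assert lhs >= rhs - 1e-10, (lhs, rhs)
```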
## Appendix B: Proof of Theorem 2

As in the proof of Theorem 1, we first prove the following lemma.

**Lemma 2.** If $f_S$ is the solution of SVM, then

$$
\frac{1}{n} \sum_{i=1}^n y_i f_S(x_i) \ge C\,\mathrm{SM}(K, \phi) + \frac{dC}{n},
$$

where $d = \min\{-1, 1 - \frac{1}{2\lambda}\}$.

*Proof.* Note that $f_S(x) = \sum_{i=1}^n \alpha_i y_i K(x_i, x)$, where $\alpha$ is the solution of the dual form of SVM. Thus

$$
\frac{1}{n} \sum_{i=1}^n y_i f_S(x_i) = \frac{1}{n} y^\mathrm{T} K (y \circ \alpha), \qquad (9)
$$

where $\circ$ is the entrywise matrix product (also known as the Hadamard product). Note that

$$
y^\mathrm{T} K (y \circ \alpha) - y^\mathrm{T} K y
= \Big( \sum_{y_i = y_j} \alpha_j K_{ij} - \sum_{y_i \ne y_j} \alpha_j K_{ij} \Big)
- \Big( \sum_{y_i = y_j} K_{ij} - \sum_{y_i \ne y_j} K_{ij} \Big)
= \sum_{y_i = y_j} (\alpha_j - 1) K_{ij} + \sum_{y_i \ne y_j} (1 - \alpha_j) K_{ij}
\ge \sum_{i,j} \big( \min\{\alpha_j, 1\} - \max\{\alpha_j, 1\} \big) K_{ij}
= \sum_{i,j} c_j K_{ij}, \qquad (10)
$$

where $c_j = \min\{\alpha_j, 1\} - \max\{\alpha_j, 1\}$. Since $0 \le \alpha_i \le \frac{1}{2\lambda}$, we obtain

$$
c_i = \min\Big\{\alpha_i - 1,\, 1 - \alpha_i\Big\} \ge \min\Big\{-1,\, 1 - \frac{1}{2\lambda}\Big\} = d.
$$

From Equation (10), we have

$$
y^\mathrm{T} K (y \circ \alpha) \ge y^\mathrm{T} K y + d \sum_{i,j} K_{ij} = y^\mathrm{T} K y + dC.
$$

Thus,

$$
y^\mathrm{T} K (y \circ \alpha)
\ge dC + \sum_i \beta_i \langle y, u_i \rangle^2
= dC + C \sum_i \lambda_i \langle y, v_i \rangle^2
\ge dC + C \sum_i \phi(\lambda_i) \langle y, v_i \rangle^2
= dC + nC\,\mathrm{SM}(K, \phi),
$$

where $(\beta_i, u_i)$ is the spectral decomposition of $K$. Dividing by $n$ completes the proof.

*Proof of Theorem 2.* According to Lemma 2 and Equation (7), we obtain

$$
\Pr_S[y f_S(x) \le \theta]
\le 1 + \theta - C\,\mathrm{SM}(K, \phi) - \frac{dC}{n}
= 1 + \theta - C\,\mathrm{SM}(K, \phi) + \frac{bC}{n}, \qquad (11)
$$

where $b = \max\{1, \frac{1}{2\lambda} - 1\} = -d$. Substituting Equation (11) into Theorem 3 completes the proof.

## Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No. 61673293, No. 61602467, No. 61303056), the Excellent Talent Introduction program of the Institute of Information Engineering, CAS (Y7Z0111107), and the Youth Innovation Promotion Association, CAS (No. 2016146).

## References

[Bartlett and Mendelson, 2002] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[Bartlett et al., 2005] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[Bousquet and Elisseeff, 2002] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

[Cawley, 2006] Gavin C. Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2006), pages 1661–1668, 2006.

[Chapelle et al., 2002] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, and Sayan Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.

[Cortes et al., 2010] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 239–246, 2010.

[Cortes et al., 2013] Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 2760–2768. MIT Press, 2013.

[Cristianini et al., 2001] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems 14 (NIPS 2001), pages 367–373, 2001.

[Debruyne et al., 2008] Michiel Debruyne, Mia Hubert, and Johan A. K. Suykens. Model selection in kernel based regression using the influence function. Journal of Machine Learning Research, 9:2377–2400, 2008.

[Gao and Zhou, 2013] Wei Gao and Zhi-Hua Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.
[Golub et al., 1979] Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.

[Gretton et al., 2007] Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19, pages 513–520, 2007.

[Kloft and Blanchard, 2011] Marius Kloft and Gilles Blanchard. The local Rademacher complexity of lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 24 (NIPS 2011), pages 2438–2446. MIT Press, 2011.

[Kloft et al., 2011] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.

[Lanckriet et al., 2004] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

[Liu and Liao, 2014] Yong Liu and Shizhong Liao. Preventing over-fitting of cross-validation with kernel stability. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 290–305, 2014.

[Liu and Liao, 2015] Yong Liu and Shizhong Liao. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), pages 2814–2820, 2015.

[Liu and Liao, 2017] Yong Liu and Shizhong Liao. Granularity selection for cross-validation of SVM. Information Sciences, 378:475–483, 2017.

[Liu et al., 2013] Yong Liu, Shali Jiang, and Shizhong Liao. Eigenvalues perturbation of integral operator for kernel selection. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013), pages 2189–2198, 2013.

[Liu et al., 2014] Yong Liu, Shali Jiang, and Shizhong Liao. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 324–332, 2014.

[Liu et al., 2017a] Xinwang Liu, Miaomiao Li, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. Multiple kernel k-means with incomplete kernels. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 2259–2265, 2017.

[Liu et al., 2017b] Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, and Jianping Yin. Optimal neighborhood kernel clustering with multiple kernels. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 2266–2272, 2017.

[Liu et al., 2017c] Yong Liu, Shizhong Liao, Hailun Lin, Yinliang Yue, and Weiping Wang. Generalization analysis for ranking using integral operator. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 2273–2279, 2017.

[Liu et al., 2017d] Yong Liu, Shizhong Liao, Hailun Lin, Yinliang Yue, and Weiping Wang. Infinite kernel learning: generalization bounds and algorithms. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pages 2280–2286, 2017.

[Luxburg et al., 2004] Ulrike von Luxburg, Olivier Bousquet, and Bernhard Schölkopf. A compression approach to support vector model selection. Journal of Machine Learning Research, 5:293–323, 2004.
[Nguyen and Ho, 2007] Canh Hao Nguyen and Tu Bao Ho. Kernel matrix evaluation. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 987–992, 2007.

[Nguyen and Ho, 2008] Canh Hao Nguyen and Tu Bao Ho. An efficient kernel matrix evaluation measure. Pattern Recognition, 41(11):3366–3372, 2008.

[Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[Steinwart and Christmann, 2008] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer-Verlag, New York, 2008.

[Vapnik, 2000] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 2000.

[Wahba et al., 1999] Grace Wahba, Yi Lin, and Hao Zhang. GACV for support vector machines. In Advances in Large Margin Classifiers. MIT Press, Cambridge, 1999.

[Zhang, 2002] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.