Consistency of Multiple Kernel Clustering

Weixuan Liang 1, Xinwang Liu 1, Yong Liu 2 3, Chuan Ma 4, Yunping Zhao 1, Zhe Liu 4, En Zhu 1

1 College of Computer, National University of Defense Technology, Changsha, China; 2 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 3 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China; 4 Zhejiang Laboratory, Hangzhou, China. Correspondence to: Xinwang Liu.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Consistency plays an important role in learning theory. However, in multiple kernel clustering (MKC), the consistency of kernel weights has not been sufficiently investigated. In this work, we fill this gap with a non-asymptotic analysis of the consistency of the kernel weights of a recently proposed method termed Simple MKKM. Under an assumption on the eigenvalue gap, we give an infinity-norm bound of order $\tilde{O}(k/\sqrt{n})$, where $k$ is the number of clusters and $n$ is the number of samples. On this basis, we establish an upper bound for the excess clustering risk. Moreover, we study the difference between the kernel weights learned from $n$ samples and from $r$ points sampled without replacement, and derive an upper bound of order $\tilde{O}(k\sqrt{1/r - 1/n})$. Based on the above results, we propose a novel strategy with the Nyström method that enables Simple MKKM to handle large-scale datasets with a theoretical learning guarantee. Finally, extensive experiments are conducted to verify the theoretical results and the effectiveness of the proposed large-scale strategy.

1. Introduction

Multiple kernel clustering (MKC) (Zhao et al., 2009) is proposed to achieve better performance by searching for an optimal kernel among several base kernels. In recent years, researchers have made great progress in MKC. Huang et al. (2011) propose the multiple kernel k-means algorithm (MKKM), which unifies all base kernels into a consensus one based on a linear combination. Subsequently, several works (Liu et al., 2016; Du et al., 2015; Liu et al., 2021; Wang et al., 2021) enhance MKKM from different perspectives. Among them, Du et al. (2015) improve the robustness of MKKM by using the $\ell_{2,1}$-norm. Liu et al. (2016) introduce a matrix-induced regularization to increase the diversity of the optimal kernel. Wang et al. (2021) merge the two-stage strategy into a single step, further reducing redundancy in kernel fusion. A recently proposed algorithm termed Simple MKKM (Liu, 2022) greatly improves clustering performance through a minimization-maximization framework.

Although there are various improvements in MKC, some vital statistical properties remain insufficiently studied, especially the consistency of the kernel weights. We say that a learning algorithm is consistent if the parameters learned from the training set converge to the parameters obtained from the whole sample space as the training sample number $n \to \infty$. Consistency is an important property in statistical learning, as we can assess whether the learned parameters are effective by studying the consistency of a learning algorithm. In the existing literature, the consistency of the clustering centroids of k-means has been studied in (Pollard, 1981a). Von Luxburg et al. (2008) establish several important results about the consistency of spectral clustering.
The consistency of the kernel weights in MKC is also a key research problem, as it can be used to derive other important statistical properties, such as the excess risk bound. In this paper, we attempt to address this issue. We bound the difference between the weights learned from the training set and from the sample space with a non-asymptotic analysis. Under an assumption on the gap of the eigenvalues of the kernel matrix, we establish an infinity-norm bound of order $\tilde{O}(k/\sqrt{n})$¹, where $k$ is the number of clusters and $n$ is the number of samples. Based on the consistency results, we derive the excess risk bound of Simple MKKM.

¹ $\tilde{O}(\cdot)$ hides logarithmic terms.

The difference between the kernel weights learned from the training set and from its subset is another interesting problem. Utilizing a concentration inequality for sampling without replacement (Bardenet & Maillard, 2015), this difference can be bounded by $\tilde{O}(k\sqrt{1/r - 1/n})$, where $r$ is the number of points sampled without replacement from the training set of size $n$. This shows that the kernel weights learned from the selected subset converge quickly to those of the whole training set. Based on this, we propose a new algorithm with the Nyström method that enables Simple MKKM to handle large-scale datasets. Specifically, we perform Simple MKKM on the selected subset to obtain a group of approximated kernel weights. Then we make weighted combinations of the base kernel similarity matrices between all $n$ samples and the selected subset. Finally, we use the standard Nyström method to obtain the clustering results. Our algorithm reduces the complexity of Simple MKKM from $O(n^3)$ to linear in $n$, so it can cluster large-scale datasets. In addition, we derive the excess risk bound of the proposed algorithm for a theoretical learning guarantee. Consequently, when the number of selected samples is $\Theta(\sqrt{n})$, the proposed algorithm enjoys a favorable statistical and computational trade-off. With this selection, the excess risk bound is the same as that of single kernel clustering, which is $O(k/\sqrt{n})$.

To verify the proposed theoretical results, we conduct experiments on commonly used datasets. The numerical experiments substantiate the correctness of the derived bounds. Moreover, we run our algorithm on large-scale datasets to verify its effectiveness and efficiency.

The contributions of this paper can be summarized as follows.

1. This paper theoretically analyzes the consistency of the kernel weights of an MKC algorithm, and derives its excess risk bound.

2. This paper studies the difference between the kernel weights learned from the whole training set and from its subset. Based on this, a method enabling MKC to handle large-scale datasets is proposed. In addition, the generalization ability of the proposed method is studied, and the optimal size of the selected subset is also given by theoretical analysis.

3. Extensive experiments are conducted to verify the correctness of our theoretical results, as well as the effectiveness and efficiency of the proposed large-scale algorithm.

The paper is organized as follows. Section 2 introduces the notations, general assumptions, and the problem of the consistency of multiple kernel clustering. Section 3 reviews related work on the consistency of clustering algorithms. Section 4 states the main results. Section 5 establishes the excess risk bound of Simple MKKM. Section 6 proposes the large-scale strategy of Simple MKKM and derives the corresponding excess risk bound. Section 7 reports the experimental results. Section 8 summarizes the paper and discusses future works.
2. Preliminaries

In this section, we first introduce the main notations and general assumptions. Then, we describe multiple kernel clustering and the consistency problem of the kernel weights.

2.1. Notations and General Assumptions

Mathematical notations. We introduce the mathematical notations used throughout the paper for easy reading. We use $X$ to represent the sample space, and $\rho(x)$ is the corresponding probability measure. We use $\rho_n(x)$ to denote the empirical distribution, i.e., $\rho_n(x) = 1/n$ if the point $x$ belongs to the training set and $\rho_n(x) = 0$ otherwise. The definitions of the asymptotic notations $O$, $\Theta$, and $\Omega$ can be found in Chapter 3 of (Cormen et al., 2022): $g(n) = O(f(n))$ means $g(n) \le c f(n)$ for some constant $c$, which we also write as $g(n) \lesssim f(n)$; $g(n) = \Omega(f(n))$ indicates $g(n) \ge c f(n)$ for some constant $c$; and if there exist two constants $c_1, c_2$ such that $c_1 f(n) \le g(n) \le c_2 f(n)$, we write $g(n) = \Theta(f(n))$. $\|A\|$ is the operator norm if $A$ is a matrix or an operator, and if $A$ is a vector, $\|A\|$ denotes the 2-norm.

General assumptions. The general assumptions partially follow (Von Luxburg et al., 2008). The sample space $X$ is assumed to be compact. The base kernel functions $\{K_p(\cdot,\cdot)\}_{p=1}^m$ are bounded, positive definite, and conjugate symmetric. We assume that $K_p(x,y) \le 1$ for any $x, y \in X$ and $p \in [m]$. The elements of the training set $S_n = \{x_i\}_{i=1}^n$ are drawn i.i.d. from $X$ according to the distribution $\rho$. The basic notations used in this paper are summarized in Section C.

2.2. Multiple Kernel Clustering

Multiple kernel clustering (MKC) aims to combine several base kernel matrices into a unified one for better clustering performance. Assume that we have $m$ base kernel functions $\{K_p(\cdot,\cdot)\}_{p=1}^m$, and the feature map corresponding to $K_p(\cdot,\cdot)$ is $\phi_p(\cdot)$. For any point $x$ in the sample space $X$, its feature map in the multiple kernel scenario is $[\phi_1^\top(x), \ldots, \phi_m^\top(x)]^\top$. To reflect the different importance of the base kernels, we impose kernel weights $\{\alpha_p\}_{p=1}^m$ on them as $[\alpha_1\phi_1^\top(x), \ldots, \alpha_m\phi_m^\top(x)]^\top$, where $\sum_{p=1}^m \alpha_p = 1$ and $\alpha_p \ge 0$. Suppose that the sample set is $S_n = \{x_i\}_{i=1}^n$ and the kernel matrix computed from the $p$-th base kernel is $\frac{1}{n}K_p$. Then the combination of the base kernel matrices can be represented by $K_\alpha = \sum_{p=1}^m \alpha_p^2 K_p$.

In existing MKC algorithms, researchers mainly design an objective function $f_n(\alpha)$ w.r.t. the kernel weights $\alpha$ and minimize it to obtain a group of desirable kernel weights. There are two mainstream categories:

1. Coordinate descent based multiple kernel clustering:
$$f_n(\alpha) = \min_{H^\top H = I_k} \frac{1}{n}\mathrm{Tr}\big(K_\alpha (I_n - H H^\top)\big),$$

2. Kernel alignment based multiple kernel clustering:
$$f_n(\alpha) = \max_{H^\top H = I_k} \frac{1}{n}\mathrm{Tr}\big(K_\alpha H H^\top\big). \quad (1)$$

The second category outperforms the first one in terms of clustering performance. Thus we focus on studying the consistency of kernel alignment based MKC, which is termed Simple MKKM (Liu, 2022).

Next, we introduce the objective function when the input training set is the whole sample space $X$. As $n \to \infty$, the empirical kernel matrix converges to the integral operator $L_K g(x) := \int_X K(x,y) g(y)\, d\rho(y)$, where $K(\cdot,\cdot)$ is the corresponding kernel function (Rosasco et al., 2010). Assume that the first $k$ non-zero eigenvalues of the integral operator $L_K$ are $\{\lambda_j\}_{j=1}^k$ and the corresponding eigenfunctions are $\{h_j\}_{j=1}^k$. Then we have $\lambda_j h_j(x) = \int_X K(x,y) h_j(y)\, d\rho(y)$ and
$$\{h_j\}_{j=1}^k = \operatorname*{argmax}_{\{g_j\}_{j=1}^k \in \Gamma} \sum_{j=1}^k \iint_X K(x,y)\, g_j(x) g_j(y)\, d\rho(x)\, d\rho(y),$$
where $\Gamma$ denotes the orthonormality constraint on the $L^2(X, \rho)$ space. For any weights $\alpha$, we let $K_\alpha(x,y) = \sum_{p=1}^m \alpha_p^2 K_p(x,y)$.
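To make the empirical objective in Eq. (1) concrete, the following minimal NumPy sketch builds the combined kernel matrix $K_\alpha = \sum_p \alpha_p^2 K_p$ and evaluates $f_n(\alpha)$ through the top-$k$ eigenvalues of $\frac{1}{n}K_\alpha$, since the inner maximization over $H$ with $H^\top H = I_k$ is attained by the top-$k$ eigenvectors. The function names and the toy data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def combined_kernel(base_kernels, alpha):
    """Combine m base kernel matrices with weights alpha: K_alpha = sum_p alpha_p^2 K_p."""
    return sum(a ** 2 * K for a, K in zip(alpha, base_kernels))

def alignment_objective(base_kernels, alpha, k):
    """Evaluate f_n(alpha) = max_{H^T H = I_k} (1/n) Tr(K_alpha H H^T),
    which equals the sum of the top-k eigenvalues of (1/n) K_alpha."""
    K_alpha = combined_kernel(base_kernels, alpha)
    n = K_alpha.shape[0]
    eigvals = np.linalg.eigvalsh(K_alpha / n)   # ascending order
    return eigvals[-k:].sum()

if __name__ == "__main__":
    # Toy example: m = 3 RBF kernels with different bandwidths on random 2-D data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    base_kernels = [np.exp(-sq_dists / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)]
    alpha = np.full(3, 1.0 / 3)                 # uniform weights on the simplex
    print("f_n(alpha) =", alignment_objective(base_kernels, alpha, k=4))
```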
When the training set is the whole sample space $X$, the objective function of kernel alignment based multiple kernel clustering is
$$f(\alpha) = \max_{\{h_j\}_{j=1}^k \in \Gamma} \sum_{j=1}^k \iint_X K_\alpha(x,y)\, h_j(x) h_j(y)\, d\rho(x)\, d\rho(y), \quad (2)$$
where $\{h_j\}_{j=1}^k$ are termed clustering indicator functions.

We denote the first $k$ eigenvalues of $\frac{1}{n}K$ by $\{\hat{\lambda}_j\}_{j=1}^k$ and the corresponding eigenvectors by $\{\mathbf{h}_j\}_{j=1}^k$. By the definition in (Bengio et al., 2004), the empirical eigenfunctions of the operator $\frac{1}{n}K$ are given by
$$\hat{h}_j(x) = \frac{1}{n\hat{\lambda}_j}\sum_{i=1}^n K(x, x_i)\,\hat{h}_j(x_i),$$
where $\hat{h}_j(x_i) = \sqrt{n}\, h_{ij}$ and $h_{ij}$ is the $i$-th element of $\mathbf{h}_j$. Consequently, the objective in Eq. (1) can be rewritten as
$$f_n(\alpha) = \max_{\{\hat{h}_j\}_{j=1}^k} \frac{1}{n^2}\sum_{j=1}^k\sum_{i,t=1}^n K_\alpha(x_i, x_t)\,\hat{h}_j(x_i)\hat{h}_j(x_t), \quad (3)$$
where $\hat{h}_j(x_i) = \sqrt{n}\, h_{ij}$, $h_{ij}$ is the entry in the $i$-th row and $j$-th column of $H$, and $H^\top H = I_k$. The functions $\{\hat{h}_j\}_{j=1}^k$ are termed approximated clustering indicator functions.

Key problems. In this paper, we focus on the following two key problems:

1. We denote $\hat{\alpha}_n = \operatorname*{argmin}_\alpha f_n(\alpha)$ and $\alpha^* = \operatorname*{argmin}_\alpha f(\alpha)$, in which $f_n$ and $f$ are given by Eq. (3) and Eq. (2), respectively. To study the consistency of the kernel weights, we aim to establish a non-asymptotic bound on $\|\hat{\alpha}_n - \alpha^*\|_\infty$.

2. Suppose that $\hat{\alpha}_r$ denotes the kernel weights learned from $r$ points sampled from $S_n$ without replacement. We also aim to bound $\|\hat{\alpha}_r - \hat{\alpha}_n\|_\infty$.

2.3. The Optimization of Simple MKKM

Before stating our results, we first introduce how to optimize the objective function $f_n$ in Eq. (3). In (Liu, 2022), the author first proves the differentiability of $f_n(\alpha)$ and utilizes the reduced gradient descent method to minimize it w.r.t. $\alpha$. With some fixed $u \in [m]$ and $p \neq u$, the reduced gradient is
$$[\nabla f_n(\alpha)]_p = \frac{\partial f_n(\alpha)}{\partial \alpha_p} - \frac{\partial f_n(\alpha)}{\partial \alpha_u}.$$
To satisfy the simplex constraint, the gradient of the $u$-th component is
$$[\nabla f_n(\alpha)]_u = \sum_{p \neq u}\left(\frac{\partial f_n(\alpha)}{\partial \alpha_u} - \frac{\partial f_n(\alpha)}{\partial \alpha_p}\right), \quad \text{with} \quad \frac{\partial f_n(\alpha)}{\partial \alpha_p} = \frac{2\alpha_p}{n}\,\mathrm{Tr}\big(K_p H_\alpha H_\alpha^\top\big),$$
when the kernel weights are $\alpha$ and the corresponding clustering indicator matrix is $H_\alpha$. Then, the final descent direction $\hat{d}$ is
$$\hat{d}_p = \begin{cases} 0, & \text{if } \alpha_p = 0 \text{ and } [\nabla f_n(\alpha)]_p \ge 0, \\ -[\nabla f_n(\alpha)]_p, & \text{if } \alpha_p > 0 \text{ and } p \neq u, \\ -[\nabla f_n(\alpha)]_u, & \text{if } p = u. \end{cases}$$
Denoting by $\gamma$ the learning step length, we update $\alpha$ by $\alpha \leftarrow \alpha + \gamma \hat{d}$. For ease of analysis, we let $\gamma \le c$, where $c$ is some positive constant.

Similarly, because the eigenfunctions corresponding to the first $k$ eigenvalues of the operator $L_{K_\alpha}$, defined by $L_{K_\alpha} g(x) = \int_X K_\alpha(x,y) g(y)\, d\rho(y)$, are unique, $f(\alpha)$ is also differentiable by Theorem 4.1 in (Bonnans & Shapiro, 1998). Fixing some $u \in [m]$, the reduced gradient of Eq. (2) can be computed in the same way:
$$[\nabla f(\alpha)]_p = \frac{\partial f(\alpha)}{\partial \alpha_p} - \frac{\partial f(\alpha)}{\partial \alpha_u} \quad (p \in [m],\ p \neq u), \qquad [\nabla f(\alpha)]_u = \sum_{p \neq u}\left(\frac{\partial f(\alpha)}{\partial \alpha_u} - \frac{\partial f(\alpha)}{\partial \alpha_p}\right),$$
where, for $p \in [m]$,
$$\frac{\partial f(\alpha)}{\partial \alpha_p} = 2\alpha_p \sum_{j=1}^k \iint_X K_p(x,y)\, h^\alpha_j(x) h^\alpha_j(y)\, d\rho(x)\, d\rho(y),$$
and $h^\alpha_j(\cdot)$ is the $j$-th clustering indicator function when the kernel weights are $\alpha$. The optimization of $f(\alpha)$ proceeds in the same way as that of $f_n(\alpha)$.
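The following sketch illustrates one reduced-gradient step of the kind described above. It is a simplified reconstruction rather than the reference implementation of Simple MKKM: the choice of the pivot component $u$, the zeroing rule at the boundary of the simplex, and the clip-and-renormalize projection are assumptions in the spirit of reduced-gradient methods, and the step length is fixed rather than chosen by line search.

```python
import numpy as np

def top_k_eigvecs(K, k):
    """Columns are the eigenvectors of the top-k eigenvalues of K (the matrix H_alpha)."""
    _, vecs = np.linalg.eigh(K)
    return vecs[:, -k:]

def reduced_gradient_step(base_kernels, alpha, k, step=0.05):
    """One reduced-gradient descent step on f_n(alpha) under the simplex constraint."""
    alpha = np.asarray(alpha, dtype=float)
    n = base_kernels[0].shape[0]
    K_alpha = sum(a ** 2 * K for a, K in zip(alpha, base_kernels))
    H = top_k_eigvecs(K_alpha / n, k)                  # inner maximizer for the current alpha
    # Partial derivatives: d f_n / d alpha_p = (2 alpha_p / n) Tr(K_p H H^T).
    grad = np.array([2 * a / n * np.sum(H * (K @ H)) for a, K in zip(alpha, base_kernels)])
    u = int(np.argmax(alpha))                          # assumed pivot: the largest weight
    reduced = grad - grad[u]                           # [grad f_n]_p = df/da_p - df/da_u
    reduced[u] = -(reduced.sum() - reduced[u])         # u-th component balances the others
    d = -reduced                                       # descent direction
    d[(alpha <= 0) & (reduced >= 0)] = 0.0             # do not push zero weights negative
    new_alpha = np.clip(alpha + step * d, 0.0, None)   # assumed simplex projection
    return new_alpha / new_alpha.sum()
```

Iterating this step, with the inner maximizer $H_\alpha$ recomputed at every step, mimics the minimization-maximization loop of Simple MKKM; the actual algorithm additionally uses a line search for the step length and its own pivot rule.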
3. Related Work

In the general setting of learning tasks, the training set is drawn from an underlying probability distribution. In such cases, clustering algorithms should satisfy the following fundamental consistency criterion: when the sample number goes to infinity, the parameters constructed by the clustering algorithm, such as clustering centroids and eigenvectors of graph Laplacian matrices, should converge to the parameters of the whole underlying space. In the existing literature, several studies derive the consistency of classic clustering algorithms.

3.1. Consistency of k-means

Pollard (1981b) shows the consistency of the global minimizer of the objective function of k-means clustering. Specifically, given a set of points $\{x_i\}_{i=1}^n$, the objective function of k-means is
$$W(A, \rho_n) = \int_X \min_{a \in A} \|x - a\|^2\, d\rho_n(x),$$
where $A = \{a_1, \ldots, a_k\}$ is a group of clustering centroids. For a fixed $A$, by the law of large numbers,
$$W(A, \rho_n) \to W(A, \rho) := \int_X \min_{a \in A} \|x - a\|^2\, d\rho(x).$$
Denote $A_n = \operatorname*{argmin}_A W(A, \rho_n)$ and $A^* = \operatorname*{argmin}_A W(A, \rho)$. The author of (Pollard, 1981b) shows that $A_n$ converges almost surely to $A^*$. However, the convergence rate is not studied in (Pollard, 1981b). Subsequently, variants of k-means have been proven to have consistency guarantees (Sun et al., 2012; Georgogiannis, 2016; Paul et al., 2023).

3.2. Consistency of Spectral Clustering

Spectral clustering is another important algorithm for partitioning non-linear datasets. The consistency of spectral clustering is studied in (Von Luxburg et al., 2008). The authors of (Von Luxburg et al., 2008) show that the eigenvectors of the normalized graph Laplacian matrix converge to the eigenfunctions of a Laplacian operator. In particular, assume that $K \in \mathbb{R}^{n \times n}$ is the data similarity matrix and $D$ is a diagonal matrix with elements $d_{ii} := \sum_{j=1}^n K_{ij}$. The normalized graph Laplacian is $U_n = I - D^{-1/2} K D^{-1/2}$. The corresponding Laplacian operator $U$ is defined by
$$Uf(x) = f(x) - \int_X K(x,y) f(y)\big/\sqrt{d(x) d(y)}\; d\rho(y),$$
where $d(x) = \int_X K(x,y)\, d\rho(y)$. The spectral relation between $U_n$ and $U$ is theoretically derived in (Von Luxburg et al., 2008), and the convergence rate is also given for the Gaussian kernel. The consistency properties of other categories of algorithms based on spectral clustering are studied in (Ghoshdastidar & Dukkipati, 2017; Zhou & Amini, 2019).

Although the consistency of k-means and spectral clustering is well studied in existing research, multiple kernel clustering (MKC) still lacks consistency guarantees. To fill this gap, we theoretically study the consistency of the kernel weights of MKC and derive the corresponding convergence rate.

4. Main Results

In this section, we present our main results. Besides the general assumptions, we need the following assumption.

Assumption 4.1. For any weight vector $\gamma \in \mathbb{R}^m$, let $\delta_j(\gamma)$ be the gap between the $j$-th and the $(j+1)$-th eigenvalues of $\frac{1}{n}K_\gamma$. For any $j \in [k]$, there exists some constant $c \ge 0$ such that $\delta_j(\gamma) \ge 1/c$ holds for any feasible $\gamma$.

Remark. This assumption commonly appears in matrix perturbation theory (Stewart, 1990; Chen et al., 2016). In the study of the perturbation of eigenvectors and orthogonal projectors, the gaps of eigenvalues are usually regarded as constants. For example, Mitz & Shkolnisky (2022) derive a bounded difference of eigenvectors in the Nyström approximation by assuming the eigengaps are constants. We use a technique similar to that of (Von Luxburg et al., 2008) to prove our main results. Notice that Assumption 4.1 also implies that the relevant eigenvalues are separated, as assumed in (Von Luxburg et al., 2008).

The following two theorems are our main results. The first bounds the difference between the kernel weights learned from the training set $S_n$ and from the sample space $X$.

Theorem 4.2. Under Assumption 4.1, with the same initialization and learning step length $\gamma \le c$, let $\hat{\alpha}_n$ and $\alpha^*$ be the solutions of Simple MKKM on $S_n$ and on $X$ after convergence, respectively. Then
$$\|\hat{\alpha}_n - \alpha^*\|_\infty \le \tilde{O}\!\left(\frac{k}{\sqrt{n}}\right)$$
holds with probability at least $1 - \delta$.

Remark. As far as we know, this is the first result about the consistency of the kernel weights in MKC algorithms. This result can be used to obtain some essential properties of MKC algorithms. In the next section, we use this theorem to obtain the excess risk bound, which matches that of the single kernel clustering studied in (Biau et al., 2008). The proof can be found in Section B.3 of the appendix.
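Both Theorem 4.2 and the next result hinge on the eigenvalue-gap condition in Assumption 4.1. A minimal empirical check of that condition on a concrete kernel combination might look like the following sketch; the helper name and the random test data are illustrative assumptions, not from the paper.

```python
import numpy as np

def eigengaps(base_kernels, gamma, k):
    """Return the gaps delta_j(gamma) = lambda_j - lambda_{j+1}, j = 1..k,
    of the combined matrix (1/n) * K_gamma = (1/n) * sum_p gamma_p^2 K_p."""
    n = base_kernels[0].shape[0]
    K_gamma = sum(g ** 2 * K for g, K in zip(gamma, base_kernels)) / n
    lam = np.sort(np.linalg.eigvalsh(K_gamma))[::-1]   # eigenvalues in descending order
    return lam[:k] - lam[1:k + 1]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    kernels = [np.exp(-sq / (2 * s ** 2)) for s in (0.5, 1.0, 2.0)]
    gamma = np.full(3, 1.0 / 3)
    gaps = eigengaps(kernels, gamma, k=5)
    # Assumption 4.1 asks this minimum to stay bounded away from zero.
    print("min_j delta_j(gamma) =", gaps.min())
```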
Theorem 4.3. Under Assumption 4.1, with the same initialization and learning step length $\gamma \le c$, let $\hat{\alpha}_n$ and $\hat{\alpha}_r$ be the solutions of Simple MKKM on $S_n$ and on its subset $S_r$ (sampled without replacement) after convergence, respectively. Then
$$\|\hat{\alpha}_r - \hat{\alpha}_n\|_\infty \le \tilde{O}\!\left(k\sqrt{\frac{1}{r} - \frac{1}{n}}\right)$$
holds with probability at least $1 - \delta$.

Remark. Theorem 4.3 implies that the difference between the kernel weights learned from $S_n$ and from $S_r$ converges quickly as $r \to n$. Thus, we can approximate the kernel weights learned from $S_n$ when $r$ is sufficiently large. Consequently, we propose a large-scale extension of Simple MKKM with a learning guarantee, which is described in Section 6.

5. The Analysis of Excess Clustering Risk

In (Liu, 2022), the author gives an upper bound on the generalization clustering risk. However, the objective function analyzed in (Liu, 2022) is not a standard clustering risk function. Moreover, the excess risk bound of Simple MKKM, which is a more general quantity than the generalization bound, has not been studied. We first define the loss function for any $x \in X$. When the kernel weights are $\alpha \in \mathbb{R}^m$, the feature map of $x$ is $\phi_\alpha(x) = [\alpha_1\phi_1^\top(x), \ldots, \alpha_m\phi_m^\top(x)]^\top$. We denote the Hilbert space that $\phi_\alpha(x)$ belongs to by $\mathcal{H}_\alpha$. For clustering centroids $C = \{c_j\}_{j=1}^k \in \mathcal{H}_\alpha^k$, the loss function of $x$ is
$$\ell(x, C, \alpha) = \min_{j \in [k]} \|\phi_\alpha(x) - c_j\|^2. \quad (4)$$
Accordingly, the empirical clustering risk can be expressed as
$$W_n(C, \alpha, \rho_n) = \frac{1}{n}\sum_{i=1}^n \min_{j \in [k]} \|\phi_\alpha(x_i) - c_j\|^2,$$
and the expected clustering risk is denoted by
$$W(C, \alpha, \rho) = \int_X \min_{j \in [k]} \|\phi_\alpha(x) - c_j\|^2\, d\rho(x).$$
We denote the kernel weights learned by performing Simple MKKM on the training set $S_n$ by $\hat{\alpha}_n$, and the corresponding clustering centroids by $\hat{C} = \{\hat{c}_j\}_{j=1}^k \in \mathcal{H}_{\hat{\alpha}_n}^k$. When the input is the whole sample space, the output kernel weights are denoted by $\alpha^*$, and the clustering centroids are $C^*$, which satisfies $C^* = \operatorname*{argmin}_{C \in \mathcal{H}_{\alpha^*}^k} W(C, \alpha^*, \rho)$. To verify the generalization ability of the learned kernel weights and clustering centroids, we need to upper bound
$$\mathbb{E}_{S_n}[W(\hat{C}, \hat{\alpha}_n, \rho)] - W(C^*, \alpha^*, \rho). \quad (5)$$
We must perform standard k-means to obtain the clustering centroids $\hat{C}$. It is well known that finding the optimal solution of k-means is an NP-hard problem. This paper does not discuss the relationship between the clustering centroids obtained by standard k-means and the optimal ones. Consequently, we simply assume that $\hat{C} = \operatorname*{argmin}_{C \in \mathcal{H}_{\hat{\alpha}_n}^k} W_n(C, \hat{\alpha}_n, \rho_n)$.

To bound Eq. (5), we use the following decomposition:
$$\mathbb{E}_{S_n}[W(\hat{C}, \hat{\alpha}_n, \rho)] - W(C^*, \alpha^*, \rho) = \underbrace{\mathbb{E}_{S_n}[W(\hat{C}, \hat{\alpha}_n, \rho) - W_n(\hat{C}, \hat{\alpha}_n, \rho_n)]}_{A} + \underbrace{\mathbb{E}_{S_n}[W_n(\hat{C}, \hat{\alpha}_n, \rho_n) - W_n(C^*, \alpha^*, \rho_n)]}_{B} + \underbrace{\mathbb{E}_{S_n}[W_n(C^*, \alpha^*, \rho_n)] - W(C^*, \alpha^*, \rho)}_{C}.$$
Term A and Term C are the same as the generalization risk of single kernel clustering, and their upper bound is $O(k/\sqrt{n})$ (Biau et al., 2008). Term B can be bounded by a constant multiple of $\|\hat{\alpha}_n - \alpha^*\|$. Combining this with Theorem 4.2, we obtain the excess risk bound of Simple MKKM in the following theorem.

Theorem 5.1. The excess clustering risk of Simple MKKM can be upper bounded by $\tilde{O}(k/\sqrt{n})$.

Remark. Theorem 5.1 gives an upper bound on the excess risk of Simple MKKM with the standard clustering loss function in Eq. (4). In contrast, the objective function analyzed in (Liu, 2022) is in the form of an inner product, which is uncommon among studies of clustering risk bounds. Thus, the proposed risk bound is more natural than the one in (Liu, 2022). The detailed proof of Theorem 5.1 is in Section B.5 of the appendix.
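The loss in Eq. (4) lives in the feature space $\mathcal{H}_\alpha$, but it can be evaluated from kernel values alone: for a centroid $c_j$ equal to the mean of the mapped points of a cluster $C_j$, $\|\phi_\alpha(x) - c_j\|^2 = K_\alpha(x,x) - \frac{2}{|C_j|}\sum_{i \in C_j} K_\alpha(x, x_i) + \frac{1}{|C_j|^2}\sum_{i,l \in C_j} K_\alpha(x_i, x_l)$. The sketch below computes the empirical risk $W_n(C, \alpha, \rho_n)$ in this way; it assumes cluster-mean centroids and uses illustrative names, so it is a sketch of the loss computation rather than the paper's procedure.

```python
import numpy as np

def empirical_clustering_risk(base_kernels, alpha, labels):
    """W_n(C, alpha, rho_n) with centroids taken as cluster means in feature space,
    computed purely from the combined kernel matrix K_alpha (kernel trick)."""
    K = sum(a ** 2 * Kp for a, Kp in zip(alpha, base_kernels))   # K_alpha on the training set
    n = K.shape[0]
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    dist2 = np.empty((n, len(clusters)))          # squared distance to every centroid
    for j, idx in enumerate(clusters):
        dist2[:, j] = (np.diag(K)
                       - 2.0 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
    return dist2.min(axis=1).mean()               # (1/n) sum_i min_j ||phi_alpha(x_i) - c_j||^2

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(150, 3))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    kernels = [np.exp(-sq / (2 * s ** 2)) for s in (1.0, 3.0)]
    labels = rng.integers(0, 4, size=150)         # any assignment; k-means labels in practice
    print(empirical_clustering_risk(kernels, np.array([0.5, 0.5]), labels))
```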
6. Large-Scale Extension with the Nyström Method

In this section, we use the Nyström method to develop a large-scale strategy for Simple MKKM. The Nyström method is usually used to accelerate kernel k-means so that the time complexity is linear in the sample number $n$ (Calandriello & Rosasco, 2018). However, existing works rarely apply the Nyström method to multiple kernel clustering. In a recent paper (Lu et al., 2022), the authors make a first attempt and propose a scalable multiple kernel k-means clustering. However, their method lacks a theoretical learning guarantee, and the number of sampled points becomes an additional hyperparameter; selecting an optimal hyperparameter in the unsupervised scenario is still an open problem. To address the above issues, we propose a novel algorithm, whose detailed procedure is as follows.

1. Sample $r$ points $\{y_i\}_{i=1}^r$ (without replacement) from the $n$ points $\{x_i\}_{i=1}^n$.

2. Perform Simple MKKM on $\{y_i\}_{i=1}^r$; the output kernel weights are denoted by $\hat{\alpha}_r$.

3. For the $p$-th kernel, denote the kernel similarity matrix between all $n$ samples and the $r$ selected samples by $R_p \in \mathbb{R}^{n \times r}$, and the kernel matrix of the $r$ samples by $W_p$. We then use $\hat{\alpha}_r$ to form weighted combinations of the above matrices, denoted by $R_{\hat{\alpha}_r}$ and $W_{\hat{\alpha}_r}$, respectively.

4. Perform the eigendecomposition $W_{\hat{\alpha}_r} = U_r \Lambda_r U_r^\top$.

5. Perform standard k-means on the rows of $R_{\hat{\alpha}_r} U_r \Lambda_r^{-1/2}$ to obtain the final clustering results.

The above algorithm is simple but efficient and effective (a code sketch of this pipeline is given later in this section). We now analyze its complexity and excess risk bound.

Computational and storage complexity. In Step 2, performing Simple MKKM on $m$ base kernel matrices of size $r \times r$ costs $O(Tr^3 + Tmr^2)$ time and occupies $O(mr^2)$ space, where $T$ is the iteration number of Simple MKKM. In Step 3, the computations of $R_{\hat{\alpha}_r}$ and $W_{\hat{\alpha}_r}$ cost $O((m-1)(nr + r^2))$ time, and the space complexity is $O(mnr + mr^2)$. Step 4 is the eigendecomposition of $W_{\hat{\alpha}_r}$, whose time consumption is $O(r^3)$. In Step 5, computing $R_{\hat{\alpha}_r} U_r \Lambda_r^{-1/2}$ costs $O(nr^2)$ time and the subsequent k-means consumes $O(nrkT_1)$ time, where $k$ is the cluster number and $T_1$ is the iteration number of k-means. Altogether, if $r \ll n$, the computational and storage complexity is linear in $n$. As a result, the proposed method can handle large-scale datasets.

Excess risk bound. Besides the complexity analysis, to illustrate the effectiveness of the proposed method, we also give an upper bound on its excess risk. We assume that the clustering centroids learned by our method are $\tilde{C}_{n,r}$, which belong to the space $(\mathbb{R}^r)^k$.² Then we can compute the corresponding clustering centroids $\hat{C}_{n,r}$ in the Hilbert space $\mathcal{H}_{\hat{\alpha}_r}^k$ by $\hat{C}_{n,r} = \Phi_r^{\hat{\alpha}_r} U_r \Lambda_r^{-1/2} \tilde{C}_{n,r}$, where $\Phi_r^{\hat{\alpha}_r}$ is the feature map matrix of the $r$ selected points when the kernel weights are $\hat{\alpha}_r$; specifically, the $i$-th column of $\Phi_r^{\hat{\alpha}_r}$ is $[\hat{\alpha}_r(1)\phi_1^\top(y_i), \ldots, \hat{\alpha}_r(m)\phi_m^\top(y_i)]^\top$, where $\hat{\alpha}_r(p)$ denotes the $p$-th component of $\hat{\alpha}_r$. To verify the generalization ability of the learned kernel weights and clustering centroids, we should upper bound
$$\mathbb{E}_{S_n}[W(\hat{C}_{n,r}, \hat{\alpha}_r, \rho)] - W(C^*, \alpha^*, \rho).$$
We can deduce Theorem 6.1 from our theoretical result on the consistency of the kernel weights.

Theorem 6.1. When the number of selected points is $r$, $\mathbb{E}_{S_n}[W(\hat{C}_{n,r}, \hat{\alpha}_r, \rho)] - W(C^*, \alpha^*, \rho)$ can be upper bounded by
$$\tilde{O}\!\left(\frac{k}{\sqrt{r}} + \frac{k}{r} + \frac{k}{\sqrt{n}}\right).$$

Remark. Usually, we let $r = \Theta(\sqrt{n})$. In this case, Theorem 6.1 gives an upper bound of $\tilde{O}(kn^{-1/4})$, which is loose and unsatisfactory. Fortunately, we can further improve this result.
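Before turning to the tighter bound, the five-step pipeline above can be summarized in code. The sketch below is illustrative only: it assumes the kernel weights $\hat{\alpha}_r$ have already been obtained from the landmarks (Step 2, e.g. by running Simple MKKM on the $r \times r$ base kernels), uses scikit-learn's KMeans for Step 5, and applies a small eigenvalue cutoff for numerical stability that the paper does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans

def nystrom_simple_mkkm(R_list, W_list, alpha_r, k, eig_tol=1e-10):
    """Steps 3-5 of the large-scale pipeline.

    R_list : list of m matrices R_p (n x r), kernel similarities between all samples
             and the r sampled landmarks.
    W_list : list of m matrices W_p (r x r), kernel matrices on the landmarks.
    alpha_r: kernel weights learned on the landmarks (Step 2).
    """
    # Step 3: weighted combinations R_alpha and W_alpha (squared weights, as in K_alpha).
    R = sum(a ** 2 * Rp for a, Rp in zip(alpha_r, R_list))
    W = sum(a ** 2 * Wp for a, Wp in zip(alpha_r, W_list))
    # Step 4: eigendecomposition W = U_r Lambda_r U_r^T.
    lam, U = np.linalg.eigh(W)
    keep = lam > eig_tol                  # assumed cutoff to avoid dividing by ~0
    lam, U = lam[keep], U[:, keep]
    # Step 5: k-means on the rows of R U_r Lambda_r^{-1/2}.
    Z = R @ U / np.sqrt(lam)              # broadcasting divides each column by sqrt(lam)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Z)
```

The weights $\hat{\alpha}_r$ can come from any Simple MKKM implementation run on the $r$ landmarks; with $r = \Theta(\sqrt{n})$, every step above is linear in $n$ except the $O(r^3)$ eigendecomposition.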
Tighter risk bound. In (Yin et al., 2022a;b; 2020b), Yin et al. give several randomized sketching methods for clustering and provide a risk bound in terms of the sketching dimension. Using a technique similar to that of (Yin et al., 2022a;b; 2020b), we establish the following theorem, which provides a tighter bound than Theorem 6.1.

Theorem 6.2. When the number of selected points satisfies $r \gtrsim \frac{\log(2/\delta)}{\varepsilon - \log(1+\varepsilon)}$, $\mathbb{E}_{S_n}[W(\hat{C}_{n,r}, \hat{\alpha}_r, \rho)] - W(C^*, \alpha^*, \rho)$ can be bounded by
$$\tilde{O}\!\left(\frac{k}{r} + \frac{k}{\sqrt{n}} + k\varepsilon\right).$$

Remark. A larger $r$ makes the above bound tighter but costs more time. To make a favorable statistical and computational trade-off, we should select an appropriate $r$. In Theorem 6.2, let $\varepsilon = 1/\sqrt{n}$; then $r = \Omega(\sqrt{n})$ and the excess risk bound is $\tilde{O}(k/\sqrt{n})$. The proof of Theorem 6.2 is in Section B.6 of the appendix.

² $(\mathbb{R}^r)^k$ denotes the $k$-fold Cartesian product of $\mathbb{R}^r$.

7. Experiments

The experiments consist of two parts. The first part validates the non-asymptotic bound on the difference between the kernel weights learned from the whole dataset and from its subset, as stated in Theorem 4.3. The second part is the numerical evaluation of the proposed large-scale algorithm. All experiments are conducted on a laptop with an Intel(R) Core(TM) i7-10870H CPU.

7.1. The Difference of Kernel Weights

Table 1. Benchmark datasets
Datasets  Samples  Kernels  Clusters
Flo17     1360     7        17
Flo102    8189     4        102
DIGIT     2000     3        10
PFold     694      12       27
CCV       6773     3        20
Reuters   18758    5        6

We conduct experiments on 6 benchmark datasets, including Flo17, Flo102, DIGIT, PFold, CCV, and Reuters. We report their detailed information in Table 1, and their URLs can be found in Appendix D. For each dataset, we first perform Simple MKKM on all the samples to obtain the kernel weights $\hat{\alpha}_n$. Then, we sample $r$ points without replacement, run Simple MKKM on these $r$ samples, and record the corresponding kernel weights. To reduce the influence of randomness, we repeat this process 50 times and compute the average kernel weights, denoted by $\hat{\alpha}_r$. The number of selected points $r$ varies over $\{100, 200, \ldots, 3000\}$. We let $r$ be smaller than the whole sample number for the small datasets. In addition, for the datasets with large cluster numbers $k$, we let $r$ be bigger than $k$. We compute the values of $\|\hat{\alpha}_n - \hat{\alpha}_r\|$ for different $r$ and illustrate them in Figure 1.

In Figure 1, the blue points reflect the trend of $\|\hat{\alpha}_n - \hat{\alpha}_r\|$ as $r$ becomes larger. As can be seen, $\|\hat{\alpha}_n - \hat{\alpha}_r\|$ tends to become smaller and converges to a small value when $r$ is large enough. As a reference, we plot the function $f(r) = a\sqrt{1/r - 1/n}$ as a red curve and report the different values of $a$ for all 6 datasets in the right-hand corner of Figure 1. It can be observed that the blue points are bounded by the red curve, which verifies the correctness of the bound in Theorem 4.3. Moreover, the gap between the blue points and the red curve is small when $r$ is small, which shows the tightness of the proposed bound, while this gap becomes larger as $r$ increases.

Table 2. Large-scale datasets used in the experiments
Dataset    Samples  Views  Clusters
NUSWIDE    30000    5      31
AwA        30475    6      50
CIFAR10    50000    3      10
YtVideo    101499   5      31
Winnipeg   325834   2      7
Covertype  581012   2      10

7.2. Experiments on Large-Scale Datasets

In Section 6, we proposed a Nyström-based method that enables Simple MKKM to deal with large-scale datasets, and we analyzed its complexity and excess risk bound in theory. In this subsection, we conduct experiments on 6 large-scale datasets to test the actual clustering performance and running time.
The datasets used are NUSWIDE, AwA, CIFAR10, YtVideo, Winnipeg, and Covertype. Their sample numbers, cluster numbers, and view numbers can be found in Table 2. The number of samples ranges from 30000 to 581012. Such large-scale datasets are rarely seen in research on kernel clustering, especially multiple kernel clustering; despite this, the proposed algorithm is still able to handle them. Due to limited space, the URLs of these datasets are listed in Appendix D.

For each view, we use the Gaussian RBF kernel to construct the kernel similarity matrix between the whole training set $S_n$ and the selected subset $S_r$, i.e.,
$$K(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right),$$
where $x \in S_n$, $z \in S_r$, and $\sigma$ is the square root of the average squared interpoint distance between $S_n$ and $S_r$, i.e., $\sigma^2 = \frac{1}{nr}\sum_{x \in S_n}\sum_{z \in S_r} \|x - z\|^2$. As proven in Section 6, by setting $r = \Theta(\sqrt{n})$, the proposed algorithm enjoys a favorable statistical and computational trade-off. For a sufficiently large $r$, we let $r = 3\sqrt{n}$.

Because the used datasets are extremely large, existing multiple kernel clustering methods can rarely be run on them. Thus, we only compare with single kernel clustering with Nyström on each kernel. Our experiments use three frequently used clustering metrics: accuracy (ACC), normalized mutual information (NMI), and purity. We also record the execution time of all experiments. The experimental results are reported in Table 3. Notice that the best results are in bold fonts, and we underline the second-best results for each dataset.

Figure 1. The difference between the kernel weights learned from the 6 benchmark datasets and from their respective subsets, plotted against the number of sampled points. The blue points record the measured differences, while the red curves plot $f(r) = a\sqrt{1/r - 1/n}$, where the value of $a$ differs across the 6 datasets.

Table 3. The clustering performance on large-scale datasets.

(a) NUSWIDE
Method    ACC    NMI    Purity  Time(s)
View1     13.58  8.59   20.30   7.84
View2     12.67  9.23   20.97   7.87
View3     13.65  9.48   21.79   8.10
View4     16.16  12.08  24.69   8.26
View5     14.28  10.21  23.61   10.34
Proposed  16.64  13.44  25.19   12.09

(b) AwA
Method    ACC   NMI    Purity  Time(s)
View1     7.56  7.05   8.69    13.18
View2     7.65  7.79   8.79    13.33
View3     7.23  6.16   7.85    14.01
View4     7.82  8.08   9.04    17.47
View5     8.15  9.06   9.27    15.92
View6     8.09  8.21   9.44    13.47
Proposed  9.42  10.89  10.78   22.41

(c) CIFAR10
Method    ACC    NMI    Purity  Time(s)
View1     73.34  65.87  74.22   6.35
View2     71.54  62.36  73.50   6.57
View3     63.63  54.41  65.29   6.55
Proposed  81.36  81.29  85.53   12.35

(d) YtVideo
Method    ACC    NMI    Purity  Time(s)
View1     9.79   5.37   26.87   30.85
View2     17.81  15.32  29.09   29.27
View3     13.74  11.29  27.07   33.93
View4     18.29  16.51  29.76   33.75
View5     18.97  11.67  28.59   30.22
Proposed  18.97  17.04  29.36   60.26

(e) Winnipeg
Method    ACC    NMI    Purity  Time(s)
View1     60.20  47.72  71.84   188.79
View2     56.84  44.60  65.42   121.81
Proposed  68.93  55.99  76.33   164.81

(f) Covertype
Method    ACC    NMI    Purity  Time(s)
View1     36.34  10.64  54.23   391.18
View2     48.76  11.32  48.76   402.54
Proposed  45.26  15.23  55.21   485.52

As seen from Table 3, the proposed method achieves the best clustering performance, while its time consumption is only slightly larger than that of single kernel k-means with Nyström. Specifically, from this table we have the following observations:

1. The proposed method outperforms the second-best results by 1.36%, 1.83%, 15.42%, 1.72%, 8.27%, and 3.91% in terms of NMI on the six datasets. On the other two clustering metrics, the proposed method also performs best.

2. Notice that the sample numbers of the last three datasets are enormous. Although such large-scale datasets are rarely seen in studies of kernel clustering, the proposed method can handle them within hundreds of seconds.

In summary, the proposed method demonstrates superior clustering performance and high execution efficiency on all benchmark datasets.
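For reference, the per-view kernel construction used in this subsection (Gaussian RBF between $S_n$ and the landmark set $S_r$, with $\sigma^2$ set to the average squared interpoint distance) can be sketched as follows. The factor 2 in the exponent's denominator and the helper names are assumptions of this sketch, not specifications from the paper.

```python
import numpy as np

def rbf_cross_kernel(X_a, X_b, sigma2=None):
    """Gaussian RBF similarity block between two sample sets.
    If sigma2 is None, it is set to the mean squared distance between X_a and X_b."""
    sq = ((X_a[:, None, :] - X_b[None, :, :]) ** 2).sum(-1)   # ||x - z||^2, shape (|X_a|, |X_b|)
    if sigma2 is None:
        sigma2 = sq.mean()                                    # sigma = sqrt(mean ||x - z||^2)
    return np.exp(-sq / (2.0 * sigma2)), sigma2               # the 1/2 factor is an assumption

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 8))                # one view of the data
    r = int(3 * np.sqrt(len(X)))                  # r = 3 * sqrt(n), as in the experiments
    landmarks = X[rng.choice(len(X), size=r, replace=False)]
    R, s2 = rbf_cross_kernel(X, landmarks)        # n x r block fed into the Nystrom pipeline
    W, _ = rbf_cross_kernel(landmarks, landmarks, sigma2=s2)
    print(R.shape, W.shape)
```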
8. Conclusions and Future Works

In this paper, we study the consistency of multiple kernel clustering. We obtain two important bounds, namely bounds on $\|\hat{\alpha}_n - \alpha^*\|$ and $\|\hat{\alpha}_r - \hat{\alpha}_n\|$, where $\hat{\alpha}_r$, $\hat{\alpha}_n$, and $\alpha^*$ are the kernel weights learned by Simple MKKM from $S_r$, $S_n$, and $X$, respectively. We then use the derived bounds to obtain the excess risk bound of Simple MKKM. Moreover, we develop a large-scale extension of Simple MKKM with a theoretical learning guarantee. Besides the theoretical analysis, we conduct experiments to verify the proposed results and the large-scale algorithm. In the future, we will focus on the following three aspects to improve our work.

1. The proposed bound on the kernel weights could be further improved to yield a tighter excess risk bound. In our proofs, we use the excess risk bound of single kernel clustering, which is $\tilde{O}(k/\sqrt{n})$ (Biau et al., 2008). However, Liu (2021) improves this bound to $\tilde{O}(\sqrt{k/n})$ and proves its tightness. We would like to know whether the proposed bound on the difference of the kernel weights can be improved from $\tilde{O}(k/\sqrt{n})$ to $\tilde{O}(\sqrt{k/n})$; accordingly, the excess risk bound of multiple kernel clustering could also be improved.

2. The proposed large-scale algorithm is effective and efficient, but there is still room for improvement in large-scale multiple kernel clustering. We will try to design further large-scale algorithms with better clustering performance and higher operating efficiency.

3. As is well known, kernel clustering has a significant connection with spectral clustering (Dhillon et al., 2004), and this connection also exists between multiple kernel clustering and multi-view spectral clustering (MVSC). Although the consistency of spectral clustering has been established by (Von Luxburg et al., 2008), the consistency of MVSC needs further study. We will explore whether MVSC is consistent using the techniques proposed in this paper.

9. Acknowledgments

This work was supported by the National Key R&D Program of China 2020AAA0107100, the Youth Foundation Project of Zhejiang Lab (No. K2023PD0AA01), and the National Natural Science Foundation of China (project no. 62002170, 61773392, 61872377, 61922088, 61976196, and 62006237).

References

Bardenet, R. and Maillard, O.-A. Concentration inequalities for sampling without replacement. Bernoulli, 21:1361–1385, 2015.

Bengio, Y., Delalleau, O., Roux, N. L., Paiement, J.-F., Vincent, P., and Ouimet, M. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16(10):2197–2219, 2004.

Biau, G., Devroye, L., and Lugosi, G. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory (TIT), pp. 781–790, 2008.
Bonnans, J. F. and Shapiro, A. Optimization problems with perturbations: A guided tour. SIAM Review, 40:228–264, 1998.

Calandriello, D. and Rosasco, L. Statistical and computational trade-offs in kernel k-means. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018.

Chen, Y. M., Chen, X. S., and Li, W. On perturbation bounds for orthogonal projections. Numerical Algorithms, pp. 433–444, 2016.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms. MIT Press, 2022.

Dhillon, I. S., Guan, Y., and Kulis, B. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 551–556, 2004.

Du, L., Zhou, P., Shi, L., Wang, H., Fan, M., Wang, W., and Shen, Y.-D. Robust multiple kernel k-means using ℓ2,1-norm. In Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Georgogiannis, A. Robust k-means: a theoretical revisit. In Advances in Neural Information Processing Systems (NeurIPS), volume 29, 2016.

Ghoshdastidar, D. and Dukkipati, A. Consistency of spectral hypergraph partitioning under planted partition model. The Annals of Statistics, 45:289–315, 2017.

Huang, H.-C., Chuang, Y.-Y., and Chen, C.-S. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems (TFS), pp. 120–134, 2011.

Liu, J., Liu, X., Wang, S., Zhou, S., and Yang, Y. Hierarchical multiple kernel clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 35, pp. 8671–8679, 2021.

Liu, X. SimpleMKKM: Simple multiple kernel k-means. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.

Liu, X., Dou, Y., Yin, J., Wang, L., and Zhu, E. Multiple kernel k-means clustering with matrix-induced regularization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1888–1894, 2016.

Liu, Y. Refined learning bounds for kernel and approximate k-means. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Lu, Y., Xin, H., Wang, R., Nie, F., and Li, X. Scalable multiple kernel k-means clustering. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM), pp. 4279–4283, 2022.

McDiarmid, C. On the method of bounded differences. In Surveys in Combinatorics, volume 141, pp. 148–188, 1989.

Mitz, R. and Shkolnisky, Y. A perturbation-based kernel approximation framework. Journal of Machine Learning Research (JMLR), 23:1–26, 2022.

Paul, D., Chakraborty, S., Das, S., and Xu, J. Implicit annealing in kernel spaces: A strongly consistent clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45:5862–5871, 2023.

Pollard, D. Strong consistency of k-means clustering. The Annals of Statistics, pp. 135–140, 1981a.

Pollard, D. Strong consistency of k-means clustering. The Annals of Statistics, 9:135–140, 1981b.

Rosasco, L., Belkin, M., and Vito, E. D. On learning with integral operators. Journal of Machine Learning Research (JMLR), pp. 905–934, 2010.

Stewart, G. W. Matrix Perturbation Theory. Citeseer, 1990.

Sun, W., Wang, J., and Fang, Y. Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6:148–167, 2012.

Von Luxburg, U., Belkin, M., and Bousquet, O. Consistency of spectral clustering. The Annals of Statistics, pp. 555–586, 2008.
Wang, R., Lu, J., Lu, Y., Nie, F., and Li, X. Discrete multiple kernel k-means. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pp. 3111 3117, 2021. Yin, R., Liu, Y., Wang, W., and Meng, D. Extremely sparse johnson-lindenstrauss transform: From theory to algorithm. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 1376 1381, 2020a. doi: 10.1109/ICDM50108.2020.00180. Yin, R., Liu, Y., Wang, W., and Meng, D. Extremely sparse johnson-lindenstrauss transform: From theory to algorithm. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 1376 1381. IEEE, 2020b. Yin, R., Liu, Y., Wang, W., and Meng, D. Randomized sketches for clustering: Fast and optimal kernel k-means. In Advances in Neural Information Processing Systems (Neur IPS), 2022a. Yin, R., Liu, Y., Wang, W., and Meng, D. Scalable kernel k-means with randomized sketching: From theory to algorithm. In IEEE Transactions on Knowledge and Data Engineering (TKDE). IEEE, 2022b. Yu, Y., Wang, T., and Samworth, R. J. A useful variant of the davis-kahan theorem for statisticians. In Biometrika, pp. 315 323, 2014. Zhao, B., Kwok, J. T., and Zhang, C. Multiple kernel clustering. In International Conference on Data Mining (ICDM), pp. 638 649, 2009. Zhixin Zhou;Amini, A. A. Analysis of spectral clustering algorithms for community detection: the general bipartite setting. In Journal of Machine Learning Research (JMLR), volume 20, pp. 1 47, 2019. Zwald, L. and Blanchard, G. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems (Neur IPS), volume 18, 2005. Consistency of Multiple Kernel Clustering A. Overview of the Proofs In this section, we outline the proofs of the main results proposed in Section 4. A.1. Proof Technique of Theorem 4.2 To study the difference between ˆαn and α , we should quantify the difference of alignment of the p-th base kernel and the clustering indicator functions, which are respectively defined as ˆϵ(Kp, {ˆhα j }k j=1) = 1 t=1 Kp(xi, xt)ˆhα j (xi)ˆhα j (xt), ϵ(Kp, {hα j }k j=1) = X Kp(x, y)hα j (x)hα j (y)dρ(x)dρ(y), where ˆhα j , hα j are the clustering indicator functions of the operators of 1 n Kα and LKα. In each iteration, we denote α is the kernel weights obtained from Sn, and the corresponding first k eigenfunctions are {ˆhα j ( )}k j=1. Accordingly, we denote β as the kernel weights learned from the sample space X in the same iteration, and {hβ j ( )}k j=1. We then bound ˆϵ(Kp, {ˆhα j }k j=1) ϵ(Kp, {hβ j }k j=1) . By decoupling the kernel weights and alignment, we can deduce that ˆϵ(Kp, {ˆhα j }k j=1) ϵ(Kp, {hβ j }k j=1) = ˆϵ(Kp, {ˆhα j }k j=1) ˆϵ(Kp, {ˆhβ j }k j=1) | {z } A + ˆϵ(Kp, {ˆhβ j }k j=1) ϵ(Kp, {hβ j }k j=1) | {z } B By the assumption of the eigen gap of 1 n Kα, we can use matrix perturbation theory to bound Term A as A α β . (7) The following lemma reflects how the alignment level of arbitrary kernel function K( , ) and {hj}k j=1 differs from the alignment level of K( , ) and {ˆhj}k j=1. Lemma A.1. Under Assumption 4.1, ˆϵ(K, {ˆhj}k j=1) ϵ(K, {hj}k j=1) k holds with probability at least 1 δ. By Lemma A.1, for any β, we have Above all, we have the following theorem. Theorem A.2. Under Assumption 4.1, for any α, β , ˆϵ(Kp, {ˆhα j }k j=1) ϵ(Kp, {hβ j }k j=1) holds with probability at least 1 δ. Consistency of Multiple Kernel Clustering The sketching proof of Theorem 4.2. 
In all iterations, denote that the kernel weights learned from the training set are respectively α(0), , α(T ). Meanwhile, the kernel weights obtained from the whole sample space are β(0), , β(T ). With the same initialization, we know α0 = β0. Then, it is easy to check that for any integer t 0, α(t+1) β(t+1) α(t) β(t) + ˆϵp(α(t)) ϵp(β(t)) , where ˆϵp(α(t)) is a brief notation of ˆϵ(Kp, {ˆhα(t) j }k j=1) as well as for ϵp(β(t)). With Theorem A.2, we then have α(t) β(t) α(t 1) β(t 1) + k By repeating the above process and the convergence of the reduced gradient descent algorithm, the result of Theorem 4.2 follows. The detailed proof is in Section B.3 of the appendix. A.2. Proof Technique of Theorem 4.3 The proof process is similar to Theorem 4.2. By the utilization of the similar notations as the above subsection, we define the alignment level of the training set of n points as ˆϵ(Kp, {ˆhn,j}k j=1) = 1 j=1 Kp(xi, xj)ˆhn,j(xi)ˆhn,j(xj), where ˆhn,j is the approximated clustering indicator function. Similar, the alignment level of r points (sampling from Sn without replacement) can be defined as ˆϵ(Kp, {ˆhr,j}k j=1) = 1 t=1 Kp(xi, xt)ˆhr,j(xi)ˆhr,j(xt). We first quantify the difference between the above two terms by the concentration properties for sampling without replacement, as shown in the following lemma. Lemma A.3. Under Assumption 4.1, ˆϵ(Kp, {ˆhn,j}k j=1) ˆϵ(Kp, {ˆhr,j}k j=1) k holds with probability at least 1 δ. Let α be some kernel weights obtained from the training set Sn and β be the kernel weights from Sr. Similar to Theorem A.2, we have the important undermentioned theorem. Theorem A.4. Under Assumption 4.1, ˆϵ(Kp, {ˆhα n,j}k j=1) ˆϵ(Kp, {ˆhβ r,j}k j=1) α β + k holds with probability at least 1 δ. By Theorem A.4, Theorem 4.3 can be obtained similarly as the proof of Theorem 4.2. Consistency of Multiple Kernel Clustering B. Detailed Proof B.1. The Proof of Lemma A.1 The proof is based on a concentration inequality and the bounded difference of two operators. We first introduce the two lemmas which are important to our proof. The first lemma is the famous Mc Diarmid s inequality. Lemma B.1. (Mc Diarmid, 1989) If f has c-bounded differences on the sample space X, then for all ε > 0: Pr(|f(Sn) E Sn [f(Sn)]| ε) 2 exp 2ε2 where f is some function of Sn. The second lemma is about the bounded difference of an operator defined in Hilbert space and its empirical version. Lemma B.2. (Rosasco et al., 2010) Define two operators as Tn : H H, Tn = 1 i=1 , Kxi Kxi, and TH : H H, TH = Z X , Kx Kxdρ(x). The operators TH and Tn are Hilbert-Schmidt. Assume that K(x, x) 1, then 2 log(2/δ) n holds with probability at least 1 δ. Proof. By the triangle inequality, we have ˆϵ(K, {ˆhj}k j=1) ϵ(K, {hj}k j=1) t=1 K(xi, xt)ˆhj(xi)ˆhj(xt) ZZ X K(x, y)hj(x)hj(y)dρ(x)dρ(y) For the j-th term, 1 n2 t=1 K(xi, xt)ˆhj(xi)ˆhj(xt) ZZ X K(x, y)hj(x)hj(y)dρ(x)dρ(y) t=1 K(xi, xt)ˆhj(xi)ˆhj(xt) ZZ X K(x, y)ˆhj(x)ˆhj(y)dρ(x)dρ(y) X K(x, y)ˆhj(x)ˆhj(y)dρ(x)dρ(y) ZZ X K(x, y)hj(x)hj(y)dρ(x)dρ(y) | {z } D In C, notice that the latter term is the expectation of the ahead one, so we can give an upper bound of it by Mc Diarmid s inequality. Denote that t=1 K(xi, xt)ˆhj(xi)ˆhj(xt). Consistency of Multiple Kernel Clustering We replace the l-th sample xl of Sn with x l and denote the new training set as S n. It can be checked that G(Sn) G(S n) K(xl, xt)ˆhj(xl)ˆhj(xt) K(x l, xt)ˆhj(x l)ˆhj(xt) By Mc Diarmid s inequality (Lemma B.1), we know that there exists a constant c > 0 such that holds with probability at least 1 δ. 
Then we bound Term D, we have X K(x, y)(ˆhj(x)ˆhj(y) hj(x)hj(y))dρ(x)dρ(y) sup x,y |ˆhj(x)ˆhj(y) hj(x)hj(y)|. We first define the following operator. ˆLKf(x) := 1 i=1 K(x, xi)f(xi). ˆhj is the eigenfunction of ˆLK, because ˆLKˆhj(x) = 1 i=1 K(x, xi)ˆhj(xi) = ˆλjˆhj(x). The following proof is similar to the method in (Von Luxburg et al., 2008), which discusses the consistency of spectral clustering. For completeness, we also give detailed proof. By Theorem 7 and Proposition 18 of (Von Luxburg et al., 2008), we know that there exists a sequence {aj}j {1, 1} and a constant C such that ajˆhj hj 2 hj Pj(K)hj 2C( (ˆLK LK)hj + (ˆLK LK)ˆLK ) 2C( ˆLK LK hj + ˆLK LK ˆLK ), where Pj(K) is the orthogonal projector onto the subspace spanned by j-th eigenvector of the kernel function K. sup f =1 ˆLKf = sup f =1 i=1 K(x, xi)f(xi) i=1 |K(x, xi)| |f(xi)| 1, we can obtain ˆLK 1. Consistency of Multiple Kernel Clustering By the definition of TH and Tn in Lemma B.2, we can scale ajˆhj hj as =4C sup x X f =1 i=1 K(x, xi)f(xi) Z X K(x, y)f(y)dρ(y) 4C sup x X f =1 i=1 Kxif(xi) Z X Kyf(y)dρ(y) =4C sup Kx,f HK f =0 | Kx, (Tn TH)f HK| =4C sup Kx,f HK f =0 | Kx, (Tn TH)f HK| supx X | Kx, f HK| 4C sup Kx,f HK f =0 Kx, (Tn TH)f HK 4C sup Kx,f HK f =0 Kx HK (Tn TH)f HK Combining Lemma B.2, we know that there exists a constant c > 0 such that holds with probability at least 1 δ. On the other hand, we can obtain D sup x,y |ajˆhj(x) ajˆhj(y) ajˆhj(x)hj(y) + ajˆhj(x)hj(y) hj(x)hj(y)| sup x,y |ajˆhj(x)||ajˆhj(y) hj(y)| + |hj(y)||ajˆhj(x) hj(x)| 2 ajˆhj hj . Thus, there exists c > 0 such that n holds with probability at least 1 δ. Above all, there is also a constant c 0 such that holds with probability at least 1 δ. Then, we can obtain the final bound as ˆϵ(K, {ˆhj}k j=1) ϵ(K, {hj}k j=1) ck holds with probability at least 1 δ. This because (1 δ/k)k 1 δ. The proof is complete. Consistency of Multiple Kernel Clustering B.2. The Proof of Theorem A.2 The following lemma is about the perturbation of eigenvectors of Hermitian matrices, which is useful to our proof. Lemma B.3. (Yu et al., 2014) Let A, B Rn n be Hermitian, with eigenvalues λ1 λn and ˆλ1 ˆλn respectively. Fixed 1 r s n and assume that min(λr 1 λr, λs λs+1) > 0, where λ0 := and λn+1 := . Let d := s r + 1, let H = [hr, hr+1, , hs] Rn d and ˆH = [ˆhr, ˆhr+1, , ˆhs] Rn d have orthonormal columns satisfying Ahj = λjhj and Bˆhj = ˆλj ˆhj for j = r, r + 1, , s. Then sinθ(H, ˆH) F 2 min(d1/2 A B op, A B F) min(λr 1 λr, λs λs+1) , where θ(H, ˆH) Rd d is the diagonal matrix whose j-th diagonal entry is the j-th principal angle, i.e., arccos(h j ˆhj). Proof. We make a decomposition as ˆϵ(Kp, {ˆhα j }k j=1) ϵ(Kp, {hβ j }k j=1) n Tr Kp HαHα X Kp(x, y)hβ j (x)hβ j (y)dρ(x)dρ(y) n Tr Kp HαHα 1 n Tr Kp HβHβ t=1 Kp(xi, xt)ˆhβ j (xi)ˆhβ j (xt) X Kp(x, y)hβ j (x)hβ j (y)dρ(x)dρ(y) It is sufficient to bound Term B by Lemma A.1. We will use matrix perturbation theory (Stewart, 1990) to bound Term A, and we can deduce that n Tr Kp HαHα 1 n Tr Kp HβHβ HαHα HβHβ F n2 HαHα HβHβ F n2 HαHα HβHβ F HαHα HβHβ F j=1 (1 (hα j hβ j )2) j=1 (1 cos2θ(hα j , hβ j )) 2 sinθ(Hα, Hβ) F . For any vector α Rm, denote that δ(α) is the gap of the k-th and (k + 1)-th eigenvalues of matrix 1 n Kα. By the assumption that there exists a constant c 0 such that for any α , δ(α) 1/c. 
By Theorem B.3, letting r = 1, s = k, Consistency of Multiple Kernel Clustering 2 sinθ(Hα, Hβ) F n Kβ F δ(α) t=1 (α2p β2p)2 K2(xi, xt) p=1 (α2p β2p)2 p=1 (αp βp)2(αp + βp)2 2c max p [m] |αp βp| p=1 (αp + βp)2 2c max p [m] |αp βp| p=1 (αp + βp) 2c max p [m] |αp βp| Above all, we know that 1 n Tr Kp HαHα X Kp(x, y)hβ j (x)hβ j (y)dρ(x)dρ(y) c α β + ck holds with probability at least 1 δ. The proof is complete. B.3. The Proof of Theorem 4.2 Proof. For ease of proof, we briefly denote ϵ(Kp, {hα(t) j }k j=1) and ˆϵ(Kp, {ˆhα(t) j }k j=1) as ϵp(α(t)) and ˆϵp(α(t)) respectively. When the training set is Sn, assume that the kernel weights of each iteration are α(0), , α(T ), Meanwhile, the kernel weights of each iteration learned from the sample space are denoted as β(0), , β(T ). where α0 = β0 due to the same initialization. Because Simple MKKM can obtain the globally optimal solution, after T iterations, we have α(T ) = ˆαn, β(T ) = α . Consistency of Multiple Kernel Clustering For any integers t 1 and p [m], if p = u, we have α(t) u β(t) u α(t 1) u β(t 1) u ˆαpˆϵp(α(t 1)) ˆαuˆϵu(α(t 1)) X βpϵp(β(t 1)) βuϵu(β(t 1)) ˆαpˆϵp(α(t 1)) βpϵp(β(t 1)) (m 1) ˆαuˆϵu(α(t 1)) βuϵu(β(t 1)) ˆαpˆϵp(α(t 1)) βpϵp(β(t 1)) + ˆαuˆϵu(α(t 1)) βuϵu(β(t 1)) ˆαpˆϵp(α(t 1)) βpϵp(β(t 1)) = max p [m] ˆαpˆϵp(α(t 1)) βpˆϵp(α(t 1)) + βpˆϵp(α(t 1)) βpϵp(β(t 1)) (ˆαp βp)ˆϵp(α(t 1)) + βp ˆϵp(α(t 1)) ϵp(β(t 1)) α(t 1) β(t 1) + ˆϵp(α(t 1)) ϵp(β(t 1)) . By Theorem A.2, and α(t 1) u β(t 1) u α(t 1) β(t 1) , we have α(t) u β(t) u α(t 1) β(t 1) + k Similarly, for p = u, we have α(t) β(t) α(t 1) β(t 1) + k Finally, due to the convergence of the reduced gradient descent algorithm, α(T ) β(T ) α(0) β(0) + k It means that holds with probability at least 1 δ. B.4. The Proof of Lemma A.3 Theorem B.4. (Zwald & Blanchard, 2005) Let A be a symmetric positive Hilbert-Schmidt operator of the Hilbert space with simple positive eigenvalues λ1 > λ2 > . For an integer j such that λj > 0, let σj = σj σj+1 where σj = 1 2(λj λj+1). Let B HS(H) be another symmetric operator such that B < σj/2 and A + B is still a positive operator with simple nonzero eigenvalues. Let Pj(A) be the orthogonal projector onto the subspace spanned by j-th eigenvector of A. Then, these projectors satisfy the following: Pj(A) Pj(A + B) 2 B Consistency of Multiple Kernel Clustering Theorem B.5. (Bardenet & Maillard, 2015) Let Sn = {xi}n i=1 be a finite sequence of real numbers, and Sr = {xi}r i=1 are r points uniformly selected from it without replacement. Then, for any t > 0, the following probability inequality holds (1 r/n)(1 + 1/r)(b a)2 where a = mini [n] xi and b = maxi [n] xi. Proof. Without loss of generality, we assume that the selected points are {xi}r i=1. Then for any kernel function K( , ), the following inequality holds ˆϵ(K, {ˆhn,j}k j=1) ˆϵ(K, {ˆhr,j}k j=1) t=1 K(xi, xt)ˆhn,j(xi)ˆhn,j(xt) t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) The difference of j-th term in the above formula can be bounded by 1 n2 t=1 K(xi, xt)ˆhn,j(xi)ˆhn,j(xt) 1 t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) t=1 K(xi, xt)ˆhn,j(xi)ˆhn,j(xt) 1 t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) 1 t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) We first bound Term C as t=1 K(xi, xt)(ˆhn,j(xi)ˆhn,j(xt) ˆhr,j(xi)ˆhr,j(xt)) K(xi, xt)(ˆhn,j(xi)ˆhn,j(xt) ˆhr,j(xi)ˆhr,j(xt)) j=1 sup x,y K(x, y)(ˆhn,j(x)ˆhn,j(y) ˆhr,j(x)ˆhr,j(y)) ˆhn,j(x)ˆhn,j(y) ˆhr,j(x)ˆhr,j(y) ˆhn,j(x)ˆhn,j(y) ˆhn,j(x)ˆhr,j(y) + ˆhn,j(x)ˆhr,j(y) ˆhr,j(x)ˆhr,j(y) sup x,y |ˆhn,j(x)| ˆhn,j(y) ˆhr,j(y) + sup x,y |ˆhr,j(y)| ˆhn,j(x) ˆhr,j(x) ˆhn,j(x) ˆhr,j(x) . 
Similar to the definition of Lemma B.2, we define the following two operators Tn, Tr : H H: i=1 , Kxi Kxi, Tr = 1 i=1 , Kxi Kxi. Consistency of Multiple Kernel Clustering According to (Rosasco et al., 2010), the above two operators are positive definite, and it can be checked that Tnˆhn,j(x) = 1 i=1 K(x, xi)ˆhn,j(xi) = ˆλn,j. Thus ˆhn,j H is the eigenfunction of Tn, while ˆhr,j is the eigenfunction of Tr, and their corresponding non-zero eigenvalues are the same as the eigenvalues of 1 r Kr, respectively. We assume that the minimal gap of the first k+1 eigenvalues of 1 By the reproducing property of the kernel function and Theorem B.4, we have ˆhn,k(x) ˆhr,k(x) D Kx, ˆhn,k ˆhr,k E 2 sup x Kx ˆhn,k ˆhr,k 2 ˆhn,k ˆhr,k We will give the conditions of the last inequality later. Before this, we aim to bound Tr Tn . = sup f H, f =1 Trf Tnf sup f,v H, f = v =1 Trf Tnf, v H = sup f,v H, f = v =1 i=1 f(xi) Kxi, v H 1 i=1 f(xi) Kxi, v H By Theorem B.5, we have i=1 f(xi) Kxi, v H 1 i=1 f(xi) Kxi, v H 2(1 r/n)(1 + 1/r) Then the following inequality holds with probability 1 δ: Now we find the establishment condition of the last inequality in Eq. (10). By the conditions of Theorem B.4, we know that r should satisfy r = Ω n 1 + n σ2 We let n = Ω(1/ σ2), and r will be Ω(1/ σ2). Above all, if σ 1/c and r, n are Ω(1/ σ2), then we have Consistency of Multiple Kernel Clustering In the next, we proceed to bound Term D. Notice that t=1 K(xi, xt)ˆhr,j(xi)ˆhr,j(xt) = i=1 ˆhr,j(xi)ϕ(xi) Thus, we have i=1 ˆhr,j(xi)ϕ(xi) 1 i=1 ˆhr,j(xi)ϕ(xi) i=1 ˆhr,j(xi)ϕ(xi) + 1 i=1 ˆhr,j(xi)ϕ(xi) i=1 ˆhr,j(xi)ϕ(xi) 1 i=1 ˆhr,j(xi)ϕ(xi) i=1 ˆhr,j(xi)ϕ(xi) + 1 i=1 ˆhr,j(xi)ϕ(xi) i=1 ˆhr,j(xi)ϕ(xi) 1 i=1 ˆhr,j(xi)ϕ(xi) 2 sup v =1,v H i=1 ˆhr,j(xi) v, ϕ(xi) H 1 i=1 ˆhr,j(xi) v, ϕ(xi) H For any v0 H, v0 = 1, by Theorem B.5, we have i=1 ˆhr,j(xi) v0, ϕ(xi) H 1 i=1 ˆhr,j(xi) v0, ϕ(xi) H 4(1 r/n)(1 + 1/r) Then, we know that holds with probability at least 1 δ. Combining all things together, the following inequality holds with probability 1 δ: ˆϵ(Kp, {ˆhn,j}k j=1) ˆϵ(Kp, {ˆhr,j}k j=1) k B.5. The Proof of Theorem 5.1 Proof. We make a decomposition as follows: ESn[W( ˆC, ˆα, ρ)] W(C , α , ρ) = ESn[W( ˆC, ˆα, ρ) Wn( ˆC, ˆα, ρn)] | {z } A + ESn[Wn( ˆC, ˆα, ρn) Wn(C , α , ρn)] | {z } B + ESn[Wn(C , α , ρn)] W(C , α , ρ) | {z } C A and C can be bounded by the generalization risk of single kernel clustering (Biau et al., 2008), and their upper bounds are all O(k/ n). Then we bound Term B. Assume that the clustering indicator matrix corresponding to clustering centroids ˆC is ˆHˆα Rn k, whose element is hij = 1/ p |Cj|. When xi belongs to the j-th cluster hij = 1/ p |Cj|, otherwise hˆα ij = 0. Similarly, we denote the clustering indicator matrix corresponding to C is ˆHα Rn k. Consistency of Multiple Kernel Clustering Because ˆC = argmin C Hk ˆ α Wn(C, ˆα, ρn), we have B =ESn[Wn( ˆC, ˆα, ρn) Wn(C , α , ρn)] n Tr Kˆα(In ˆHˆα ˆH ˆα) 1 n Tr Kα (In ˆHα ˆH α ) n Tr Kˆα(In ˆHˆα ˆH ˆα) 1 n Tr Kˆα(In ˆHα ˆH α ) n Tr Kˆα(In ˆHα ˆH α ) 1 n Tr Kα (In ˆHα ˆH α ) n Tr (Kˆα Kα )(In ˆHα ˆH α ) ˆα2 p (α p)2 1 n Tr Kp(In ˆHα ˆH α ) # p=1 (ˆαp + α p)|ˆαp α p| p=1 (ˆαp + α p) ˆα α The proof is complete. B.6. The Proof of Theorem 6.1 Proof. When the kernel weights are ˆαr, assume that the corresponding clustering centroids are ˆCn Hk ˆαr. 
Then, we have ESn[W( ˆCn,r, ˆαr, ρ)] W(C , α , ρ) = ESn[W( ˆCn,r, ˆαr, ρ) Wn( ˆCn,r, ˆαr, ρn)] | {z } A + ESn[Wn( ˆCn,r, ˆαr, ρn) Wn( ˆCn, ˆαr, ρn)] | {z } B + ESn[Wn( ˆCn, ˆαr, ρn) Wn( ˆC, ˆα, ρn)] | {z } C + ESn[Wn( ˆC, ˆα, ρn)] W(C , α , ρ) | {z } D According to (Biau et al., 2008), A can be bounded by e O(k/ n). By Theorem 5.1, D has an upper bound as e O(k/ n). Moreover, by Theorem 1 in (Calandriello & Rosasco, 2018), when r = Ω(n/γ), B can be bounded by e O(kγ/n), where γ is a positive parameter. Similar to the proof of Theorem 5.1, combining Theorem 4.3, we can bound C as C = ESn[Wn( ˆCˆαr, ˆαr, ρn) Wn( ˆC, ˆα, ρn)] Above all, we can conclude that ESn[W( ˆCn,r, ˆαr, ρ)] W(C , α , ρ) can be bounded by n + K r + K n Consistency of Multiple Kernel Clustering Letting γ = Θ(n/r), the desirable result follows. B.7. The Proof of Theorem 6.2 We first bound the difference of Tn and Tr by the following theorem. Theorem B.6. When r log(2/δ) ε log(1+ε), the following holds with probability 1 δ: We introduce the following lemma and its corollary to prove Theorem B.6. Lemma B.7. (Yin et al., 2020a) Suppose that G N(0, n), and a random matrix T = [t1, , tm] Rn m. The j-th column tj of T has only a non-zero element with probability Pr (tij = δ(i)n) = 1 n, Pr(tij = 0) = 1 1 where δ(i) = 1 or 1 with equal probability. Then for any vector b Rn and non-negative integer Sn, the following inequality holds E (t j b)2s E G2s . Corollary B.8. Let T N(0, 1/r), and a random matrix S = [s1, , sr] Rn r. Each column of S has only a non-zero element with probability Pr sij = δ(i) rn n, Pr(sij = 0) = 1 1 where δ(i) = 1 or 1 with equal probability. Then for any vector b Rn and non-negative integer Sn, the following inequality holds E (s j b)2s E T 2s . The proof of Corollary B.8 is as follows. Proof. By construction, we know tij = sij mn. According to Lemma B.7, we have E (s j b)2s E It is obvious that G mn N(0, 1/m), thus the corollary holds. Now, we prove Theorem B.6. The technique is similar to (Yin et al., 2022a), but we give the detailed process of proof for completeness. Proof. For ease of proof, we rewrite Tn and Tr as matrix forms. Let Φn = [ϕ1, , ϕn] Rd n, where d denotes the dimension of the Hilbert space corresponding to the kernel function. Then, Tn = 1 nΦnΦ n . Similarly, we can rewrite Tr as Tr = 1 rΦrΦ r . By the notations of Corollary B.8, we have Tn Tr = 1 nΦnΦ n 1 = 1 nΦnΦ n 1 1 nx ΦnΦ n x 1 nx Φn SS Φ n x 2 1 n S Φ n x Consistency of Multiple Kernel Clustering To bound the above formula, for any x = 1, we first give the bound of 1 n S Φ n x 2 1 nΦ n x 2 1 nΦ n x 2 ε 1 n S Φ n x 2 1 nΦ n x 2 1 + ε Setting b = 1 nΦ n x/ 1 nΦ n x , for any λ < m/2, the following inequality holds according to Markov s inequality. Pr( S b 1 + ε) = Pr(exp(λ S b 2) exp(λ(1 + ε))) E exp(λ S b 2) exp( λ(1 + ε)) = E exp(λ S b 2) exp( λ(1 + ε)) j=1 (s j b)2) exp( λ(1 + ε)) = exp( λ(1 + ε)) j=1 E exp(λ(s j b)2) . Furthermore, by Taylor s formula and Corollary B.8, we have E exp(λT 2) = s! E[(s j b)2s] = E exp(λ(s j b)2) . In addition, we have E exp(λT 2) = Z + 2π exp( mx2/2) exp(λx2)dx 2π exp( t2/2) exp λt2/m dt Thus, E exp(λ(s j b)2) 1 1 2λ/m. Combining Eq. (11), we have Pr( S b 1 + ε) exp( λ(1 + ε)). 
Setting λ = ε 1+ε m 2 , we can obtain Pr( S b 1 + ε) (1 + ε)m/2 exp mε On the other hand, for any x = 1, we turn to bound 1 n S Φ n x 2 1 nΦ n x 2 1 nΦ n x 2 ε 1 n S Φ n x 2 1 nΦ n x 2 1 ε Consistency of Multiple Kernel Clustering Similar to the above procedures, for any λ < m/2, we have Pr( S b 1 ε) = Pr(exp(λ(1 ε)) exp(λ S b 2)) exp(λ(1 ε)) E exp( λ S b 2) = exp(λ(1 ε)) E exp( λ S b 2) = exp(λ(1 ε)) E j=1 (s j b)2) = exp(λ(1 ε)) j=1 E exp( λ(s j b)2) exp(λ(1 ε)) 1 λ(s j b)2 + λ2(s j b)4 Moreover, it can be checked that E (s j b)2 = t=1 E [(sijbi)(stjbt)] i=1 E s2 ijb2 i E (s j b)4 E T 4 = 3 m2 . (15) Substituting Eq. (14) and Eq. (15) into Eq. (13), we have Pr( S b 1 ε) exp(λ(1 ε)) 1 λ Setting λ = ε 1+ε m 2 and w = ε 1+ε, we have Pr( S b 1 ε) exp ε(1 ε) 1 ε 2(1 + ε) + 3ε2 (1 + ε)m/2 . (17) Combining Eq. (12) and Eq. (17), we can obtain 1 n S Φ n x 2 1 nΦ n x 2 2 exp m log(1 + ϵ) mε Above all, it can be deduced that if m log(2/δ) ε log(1+ε), 1 n S Φ n x 2 ε 1 nΦ n x Consistency of Multiple Kernel Clustering holds with probability at least 1 δ. Because of the arbitrariness of x, we have Tn Tr ε. Finally, we complete the proof of Theorem 6.2. Proof. We can use Theorem B.6 to improve the result of Lemma A.3. In the proof of Lemma A.3, Term C can be bounded by O( Tn Tr ), and Term D in Lemma A.3 also has the same upper bound. This is because i=1 ˆhr,j(xi)ϕ(xi) 1 i=1 ˆhr,j(xi)ϕ(xi) =2 (Tn Tr)ˆhr,j 2 ˆhr,j Tn Tr By Theorem B.6, when r log(2/δ) ε log(1+ε), ˆϵ(Kp, {ˆhn,j}k j=1) ˆϵ(Kp, {ˆhr,j}k j=1) k Tn Tr kε. By the same decomposition of the proof of Theorem 6.1, we can improve the bound of C in Theorem 6.1 as C ˆαr ˆα k Tn Tr kε. Combining all things together, we can conclude the conclusion. Consistency of Multiple Kernel Clustering C. Notation Table Table 4. Basic notations. Notations Meaning k Clustering number n Sample number X Sample space Kp( , ) The p-th base kernel function Kp The p-th base kernel matrix ρ Real distribution ρn Empirical distribution H Clustering indicator matrix {hj( )}k j=1 Clustering indicator functions {ˆhj( )}k j=1 Approximated clustering indicator functions LK The integral operator associated with kernel function K. D. URLs of the Used Datasets In this section, we give a detailed introduction to the datasets used in our experiments. The URLs of the datasets in Table 1 are as follows: 1. Flo17: http://www.robots.ox.ac.uk/œvgg/data/flowers/17/, 2. Flo102: http://www.robots.ox.ac.uk/œvgg/data/flowers/102/, 3. DIGIT: http://ss.sysu.edu.cn/py/, 4. PFold: http://mkl.ucsd.edu/dataset/protein-fold-prediction, 5. CCV: http://www.ee.columbia.edu/ln/dvmm/CCV/, 6. Reuters: http://kdd.ics.uci.edu/databases/reuters21578/. All the kernel matrices are pre-computed and available on websites. The large-scale datasets in Table 2 can be downloaded from 1. NUSWIDE: http://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/ NUS-WIDE.html, 2. Aw A: http://cvml.ist.ac.at/Aw A/, 3. CIFAR10: http://www.cs.toronto.edu/ kriz/cifar.html, 4. Yt Video: http://archive.ics.uci.edu/ml/datasets/You Tube+Multiview+Video+Games+ Dataset, 5. Winnipeg: https://archive.ics.uci.edu/ml/datasets/Crop+mapping+using+fused+ optical-radar+data+set, 6. Covertype: http://archive.ics.uci.edu/ml/datasets/Covertype.