# Hierarchical Multiple Kernel Clustering

Jiyuan Liu¹, Xinwang Liu¹*, Siwei Wang¹, Sihang Zhou², Yuexiang Yang¹
¹ College of Computer, National University of Defense Technology, Changsha, Hunan, China
² College of Intelligence Science and Technology, National University of Defense Technology, Changsha, Hunan, China
{liujiyuan13, xinwangliu, wangsiwei13}@nudt.edu.cn, sihangjoe@gmail.com, yyx@nudt.edu.cn
*Corresponding author

## Abstract

Current multiple kernel clustering algorithms compute a partition with the consensus kernel or graph learned from the pre-specified ones, while the emerging late-fusion methods first construct multiple partitions from each kernel separately and then obtain a consensus one from them. However, both directly distill the clustering information from kernels or graphs into partition matrices, where the sudden drop of dimension results in a loss of details that are advantageous for clustering. In this paper, we provide a brief insight into this issue and propose a hierarchical approach that performs clustering while preserving such advantageous details as much as possible. Specifically, we gradually group samples into fewer and fewer clusters, generating a sequence of intermediary matrices of descending sizes. The consensus partition is simultaneously learned from these intermediary matrices and conversely guides their construction. This cyclic process is modeled as a unified objective, and an alternating algorithm is designed to solve it. In addition, the proposed method is validated and compared with other representative multiple kernel clustering algorithms on benchmark datasets, demonstrating state-of-the-art performance by a large margin.

## Introduction

Although kernel clustering algorithms, such as kernel k-means and kernel spectral clustering, have already achieved satisfactory performance in various applications, how to choose the right kernel mapping and the corresponding parameters remains a tricky problem. Worse still, they are incapable of dealing with the rapidly increasing amount of multi-view data. Multiple kernel clustering (MKC) methods address these problems by generating multiple kernels with a set of parameterized mapping functions corresponding to each data view, and then exploring the complementary information to categorize samples into clusters. According to the time when the clustering details are utilized, they can be roughly separated into two categories, i.e. early-fusion and late-fusion methods.

Early-fusion approaches directly learn a consensus kernel or graph from multiple ones and afterwards generate the final partition [Ren et al. 2020; Wen et al. 2020]. Mostly, the two processes are jointly optimized, as shown in Fig. 1(a) and 1(b). In this group of methods, low-rank optimization techniques are widely adopted. For instance, Zhou et al. find that both the rows and the columns of a kernel are contaminated when a sample is corrupted by noise, and employ L2,1-norm regularization to capture such row-wise and column-wise noises [Zhou et al. 2015]. Following the multiple kernel learning (MKL) framework [Kloft, Rückert, and Bartlett 2010; Kloft et al. 2011], a large number of algorithms assume that the consensus kernel is a weighted combination of the pre-specified kernels [Huang, Chuang, and Chen 2012; Kloft et al. 2011; Liu et al. 2020a; Peng et al. 2019; Liu et al. 2020c].
Among them, various regularization terms, such as local kernel alignment [Gönen and Margolin 2014], matrix-induced regularization [Liu et al. 2016] and the optimal neighborhood kernel [Liu et al. 2017], are proposed to refine the kernel weights. In addition, some multi-view spectral clustering methods first construct graphs from the kernels, and then adopt graph-based techniques to learn the cluster assignments with the by-product unified graph [Kang et al. 2018; Zhan et al. 2019; Kang et al. 2019; Zhou et al. 2020].

Late-fusion algorithms first obtain multiple partitions from each kernel independently with basic kernel clustering algorithms [Liu et al. 2020b]. Then, the final clustering assignments are constructed based on them, as shown in Fig. 1(c). Long et al. introduce the concept of mapping functions to make the spaces of each view comparable, so that an optimal clustering assignment can be made from multiple results [Long, Yu, and Zhang 2008]. Wang et al. build the base partitions by performing kernel k-means on the pre-specified kernels, then maximize the alignment between the consensus partition and the weighted combination of base partitions [Wang et al. 2019].

Figure 1: (a) and (b) visualize early-fusion methods with kernels and graphs, while (c) and (d) are the frameworks of late-fusion approaches and the proposed algorithm, respectively. The gray squares or graphs and the orange ones indicate the original data and the clustering results, while squares of other colors represent the intermediary data. Early-fusion methods obtain the final partition $\mathbf{H}$ by learning a consensus kernel or graph, $\mathbf{K}_C$ or $\mathbf{G}_C$, from the pre-specified ones, $\{\mathbf{K}_p\}_{p=1}^{m}$ or $\{\mathbf{G}_p\}_{p=1}^{m}$. On the contrary, late-fusion methods construct base partitions $\{\mathbf{B}_p\}_{p=1}^{m}$ in a one-way manner from each kernel without considering the others, and then integrate them into a final one, $\mathbf{H}$. Different from these two approaches, the proposed algorithm extracts the partition information into smaller matrices $\{\mathbf{H}_p^{(t)}\}_{p,t=1}^{m,s}$ iteratively, and learns the final partition $\mathbf{H}$ after one or more steps. In the latter case, the two-way arrows indicate that updating the current matrices also affects the previous ones.

Although early-fusion and late-fusion MKC methods achieve satisfactory performance, both of them try to encode the clustering information from kernels or graphs of size $\mathbb{R}^{n \times n}$ into partition matrices of size $\mathbb{R}^{n \times k}$, where $n$ and $k$ are the numbers of samples and clusters, respectively. To see this point in depth, we conduct a brief illustration on the CCV benchmark in Fig. 2. It can be clearly observed that a large volume of details beneficial to clustering is lost in this sudden drop of dimension. A brief insight is also provided at the beginning of Section 3. In order to balance the distilling and the preserving of partition information, we propose a Hierarchical Multiple Kernel Clustering (HMKC) approach.

At the beginning, data samples are categorized into $c_1$ clusters by constructing an intermediary partition matrix $\mathbf{H}^{(1)} \in \mathbb{R}^{n \times c_1}$ corresponding to each kernel, where $n > c_1 > k$. Based on the obtained $\mathbf{H}^{(1)}$, a smaller intermediary partition matrix $\mathbf{H}^{(2)} \in \mathbb{R}^{n \times c_2}$, in which $n > c_1 > c_2 > k$, is computed to obtain $c_2$ clusters. By repeating the aforementioned process $s$ times, beneficial clustering details can be distilled into $\mathbf{H}^{(s)} \in \mathbb{R}^{n \times c_s}$ step by step.
With $\{\mathbf{H}_p^{(s)}\}_{p=1}^{m}$ corresponding to the $m$ kernels, the consensus partition $\mathbf{H}$ is simultaneously learned and conversely guides the construction of the intermediary matrices. We also jointly optimize the intermediary matrices and the consensus partition by modeling them in a unified objective. The above idea is visualized in Fig. 1(d). Furthermore, an alternating algorithm is designed to solve the resultant optimization problem and is proven to be convergent. Extensive experiments are conducted on benchmark datasets, and it can be observed that the proposed algorithm outperforms other representative MKC algorithms.

## Related Work

### Kernel k-means

Given a collection of data observations $\{\mathbf{x}_i\}_{i=1}^{n}$ and a kernel mapping $\phi(\cdot)$, kernel k-means aims to group the samples into $k$ clusters by minimizing the sum-of-squares loss, which can be formulated as

$$\min_{\mathbf{A}\in\{0,1\}^{n\times k}} \ \sum_{i=1,j=1}^{n,k} \mathbf{A}_{ij}\,\big\|\phi(\mathbf{x}_i)-\mathbf{c}_j\big\|_2^2 \quad \text{s.t.} \ \sum_{j=1}^{k}\mathbf{A}_{ij}=1, \tag{1}$$

where $\mathbf{A}\in\{0,1\}^{n\times k}$ is the cluster assignment matrix of the samples and $\mathbf{c}_j$ is the $j$-th cluster centroid. In most cases, $\phi(\mathbf{x}_i)\in\mathbb{R}^{d}$ with $d \gg n$ or even infinite, so Eq. (1) cannot be optimized directly. Therefore, we equivalently rewrite it in matrix-vector form as

$$\min_{\mathbf{A}} \ \operatorname{Tr}(\mathbf{K}) - \operatorname{Tr}\!\left(\mathbf{L}^{\frac{1}{2}}\mathbf{A}^{\top}\mathbf{K}\mathbf{A}\mathbf{L}^{\frac{1}{2}}\right), \tag{2}$$

in which $\mathbf{K}_{ij}=\phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$ and $\mathbf{L}=\operatorname{diag}([n_1^{-1},n_2^{-1},\cdots,n_k^{-1}])$ with $n_j=\sum_{i=1}^{n}\mathbf{A}_{ij}$. The discreteness of $\mathbf{A}$ makes the above problem hard to solve, and a common trick is to relax it to take arbitrary real values. By defining $\mathbf{H}=\mathbf{A}\mathbf{L}^{\frac{1}{2}}$, the problem can be transformed into

$$\min_{\mathbf{H}} \ \operatorname{Tr}\!\left(\mathbf{K}\big(\mathbf{I}_n-\mathbf{H}\mathbf{H}^{\top}\big)\right) \quad \text{s.t.} \ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}, \tag{3}$$

which can be simplified as

$$\max_{\mathbf{H}} \ \operatorname{Tr}\!\left(\mathbf{K}\mathbf{H}\mathbf{H}^{\top}\right) \quad \text{s.t.} \ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}. \tag{4}$$

With the obtained partition matrix $\mathbf{H}$, k-means clustering is applied to compute the final clustering assignments.

Figure 2: Eigenvalue distributions of the first kernel $\mathbf{K}_1$ and the consensus kernel $\mathbf{K}_{\beta}$ obtained by MKKM on CCV. The eigenvalues are sorted from large to small, and only the first 1000 out of 6773 are plotted. The bars show each eigenvalue relative to the first (largest) one on a logarithmic scale, while the curve presents the cumulative sum of the sorted eigenvalues.

### Multiple Kernel k-means

Following the MKL framework, multiple kernel k-means (MKKM) assumes the consensus kernel $\mathbf{K}_{\beta}$ to be a weighted combination of the pre-specified ones [Huang, Chuang, and Chen 2012] and solves

$$\min_{\mathbf{H},\boldsymbol{\beta}} \ \operatorname{Tr}\!\left(\mathbf{K}_{\beta}\big(\mathbf{I}_n-\mathbf{H}\mathbf{H}^{\top}\big)\right) \quad \text{s.t.} \ \boldsymbol{\beta}^{\top}\mathbf{1}=1, \ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}, \tag{5}$$

in which $\mathbf{K}_{\beta}=\sum_{p=1}^{m}\beta_p^2\mathbf{K}_p$.
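To make the relaxed solution of Eq. (4) and the MKKM combination of Eq. (5) concrete, the following is a minimal sketch (not the authors' released code; function names such as `kernel_kmeans` and `mkkm_consensus` are ours): the continuous $\mathbf{H}$ is given by the top-$k$ eigenvectors of the kernel, and the discrete assignments are recovered by running k-means on the rows of $\mathbf{H}$.

```python
# Minimal sketch of relaxed kernel k-means (Eq. 4) and the MKKM consensus
# kernel (Eq. 5); kernel matrices are assumed symmetric positive semi-definite.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans


def kernel_kmeans(K, k, random_state=0):
    """max_H Tr(K H H^T) s.t. H^T H = I_k: take the top-k eigenvectors of K,
    then run k-means on the rows of H to recover discrete assignments."""
    n = K.shape[0]
    _, H = eigh(K, subset_by_index=[n - k, n - 1])  # eigenvalues in ascending order
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(H)
    return H, labels


def mkkm_consensus(kernels, beta):
    """Weighted combination K_beta = sum_p beta_p^2 K_p used by MKKM."""
    return sum(b ** 2 * K for b, K in zip(beta, kernels))
```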
## The Proposed Algorithm

In the following, we first provide a brief insight into the loss of clustering details in current MKC algorithms when constructing the consensus partition. The results are obtained on the CCV dataset, which contains 20 clusters and is described in Table 1.

Early-fusion MKC algorithms try to construct the consensus kernel or graph by exploring the complementary information among the pre-specified ones. For the learned consensus kernel, most of them simultaneously perform kernel k-means to obtain a partition matrix of size $\mathbb{R}^{n \times k}$. Meanwhile, the graph-based methods produce a partition matrix of the same size by employing spectral clustering (normalized cut), which has been proven to be a special case of weighted kernel k-means [Dhillon, Guan, and Kulis 2004]. Due to the space limit, we therefore only analyze the properties of kernel k-means here. Specifically, we take MKKM as an instance and present the eigenvalue distribution of the consensus kernel $\mathbf{K}_{\beta}$ on the left of Fig. 2.

As claimed in Section 4.2 of [Schölkopf, Smola, and Müller 1998], an eigenvector corresponding to a larger eigenvalue carries more discriminative information. By selecting the eigenvectors corresponding to the top-$k$ eigenvalues, kernel k-means preserves most of the discriminative details in a kernel. In an ideal kernel for clustering, the eigenvalues other than the top-$k$ ones are supposed to be 0 [Kalman 1996; Zhang 2015]. However, it can be observed that there are around 100 more eigenvalues larger than 1, apart from the largest 20. If we take the eigenvalue to roughly measure the volume of discriminative information in the corresponding eigenvector, the resulting partition matrix only keeps about 55% of the kernel details.

Most late-fusion MKC approaches adopt kernel k-means to compute the multiple base partitions separately [Liu et al. 2019; Wang et al. 2019]. Therefore, we visualize the eigenvalue distribution of the first kernel matrix on the right of Fig. 2. A similar result is observed: there are almost 500 eigenvalues larger than 1, while the top-20 eigenvectors keep only about 60% of the kernel's discriminative details. As a result, the multiple independently computed partition matrices are of low quality, acting as the bottleneck of performance improvement. We summarize two observations as follows:

1. The discriminative details advantageous for clustering in a kernel of size $\mathbb{R}^{n \times n}$, whether a consensus or an original one, can be encoded in a proportion of eigenvectors by classical kernel methods such as kernel k-means and spectral clustering.
2. Directly adopting a partition matrix of size $\mathbb{R}^{n \times k}$ loses a large volume of this kernel discriminative information.

To balance the distilling and the preserving of partition information, we first group the data samples into $c_1$ clusters by employing intermediary matrices of size $\mathbb{R}^{n \times c_1}$, where $n > c_1 > k$, to extract the clustering details from the kernels. Specifically, the $p$-th intermediary matrix $\mathbf{H}_p^{(1)}$ can be constructed by performing kernel k-means on $\mathbf{K}_p$ as

$$\max_{\mathbf{H}_p^{(1)}} \ \operatorname{Tr}\!\left(\mathbf{K}_p\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}\right) \quad \text{s.t.} \ \mathbf{H}_p^{(1)\top}\mathbf{H}_p^{(1)}=\mathbf{I}_{c_1}, \ \mathbf{H}_p^{(1)}\in\mathbb{R}^{n\times c_1}. \tag{6}$$

Furthermore, we find that $c_1$ may take values greatly larger than $k$, such as $k^2$ or even $k^3$, especially when the data samples are generated from a small number of distributions. Compressing $\mathbf{H}_p^{(1)}$ directly into a $k$-dimensional partition would again be an abrupt drop of dimension and cause the loss of important partition information. Therefore, we construct a new kernel $\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}$, and afterwards group the samples into $c_2$ clusters via kernel k-means, computing the intermediary matrix $\mathbf{H}_p^{(2)}\in\mathbb{R}^{n\times c_2}$. By repeating this process $s$ times, the data samples are gradually grouped into $\{c_t\}_{t=1}^{s}$ clusters, together with generating a sequence of intermediary matrices $\mathbf{H}_p^{(t)}\in\mathbb{R}^{n\times c_t}$, where $n > c_1 > \cdots > c_s > k$. Without loss of generality, we formulate the relation between $\mathbf{H}_p^{(t-1)}$ and $\mathbf{H}_p^{(t)}$ as

$$\max_{\mathbf{H}_p^{(t)}} \ \operatorname{Tr}\!\left(\mathbf{H}_p^{(t-1)}\mathbf{H}_p^{(t-1)\top}\mathbf{H}_p^{(t)}\mathbf{H}_p^{(t)\top}\right) \quad \text{s.t.} \ \mathbf{H}_p^{(t)\top}\mathbf{H}_p^{(t)}=\mathbf{I}_{c_t}, \ \mathbf{H}_p^{(t)}\in\mathbb{R}^{n\times c_t}. \tag{7}$$
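As a concrete reading of Eqs. (6) and (7), the sketch below (our own naming, not the released implementation) runs a single forward pass that builds $\mathbf{H}^{(1)},\ldots,\mathbf{H}^{(s)}$ for one kernel by repeatedly taking top-$c_t$ eigenvectors, and also reports the fraction of spectral mass retained by the top-$c$ components, i.e. the rough measure of preserved detail used in the Fig. 2 discussion. Note that Eq. (7) taken alone only determines $\mathbf{H}^{(t)}$ up to an orthonormal basis of a subspace of $\operatorname{span}(\mathbf{H}^{(t-1)})$; in the full model, the feedback terms of the joint objective introduced below resolve this ambiguity.

```python
# A one-shot forward pass over Eqs. (6)-(7) for a single kernel, plus the
# spectral-mass measure behind the Fig. 2 analysis. Illustrative sketch only.
import numpy as np
from scipy.linalg import eigh


def top_eigenvectors(B, c):
    """Top-c eigenvectors of a symmetric PSD matrix B (eigh sorts ascending)."""
    n = B.shape[0]
    _, vecs = eigh(B, subset_by_index=[n - c, n - 1])
    return vecs  # n x c with orthonormal columns


def retained_mass(K, c):
    """Fraction of the (non-negative) eigenvalue mass kept by the top-c
    eigenvectors, a rough proxy for how much discriminative detail survives."""
    vals = np.sort(np.clip(np.linalg.eigvalsh(K), 0.0, None))[::-1]
    return vals[:c].sum() / vals.sum()


def forward_distill(K_p, layer_sizes):
    """Build H^(1), ..., H^(s) for one kernel: H^(1) from K_p, then H^(t) from
    the similarity H^(t-1) H^(t-1)^T, with n > c_1 > ... > c_s."""
    H_list, B = [], K_p
    for c in layer_sizes:
        H = top_eigenvectors(B, c)
        H_list.append(H)
        B = H @ H.T  # the next "kernel" is the current intermediary similarity
    return H_list
```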
With the learned matrices $\{\mathbf{H}_p^{(s)}\}_{p=1}^{m}$ for the $m$ pre-specified kernels, we also perform kernel k-means on each of them to obtain a consensus partition, and employ a vector $\boldsymbol{\beta}\in\mathbb{R}^{m}$ to adjust their weights, which can be formulated as

$$\max_{\mathbf{H},\boldsymbol{\beta}} \ \sum_{p=1}^{m}\beta_p\operatorname{Tr}\!\left(\mathbf{H}_p^{(s)}\mathbf{H}_p^{(s)\top}\mathbf{H}\mathbf{H}^{\top}\right) \quad \text{s.t.} \ \boldsymbol{\beta}^{\top}\boldsymbol{\beta}=1, \ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}. \tag{8}$$

Combining Eq. (6), (7) and (8) into a unified objective, we obtain the proposed final objective as

$$\begin{aligned}
\max_{\mathbf{H},\{\mathbf{H}_p^{(t)}\},\boldsymbol{\beta},\boldsymbol{\gamma}} \ & \sum_{p=1}^{m}\gamma_p^{(1)}\operatorname{Tr}\!\left(\mathbf{K}_p\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}\right) + \sum_{t=2}^{s}\sum_{p=1}^{m}\gamma_p^{(t)}\operatorname{Tr}\!\left(\mathbf{H}_p^{(t-1)}\mathbf{H}_p^{(t-1)\top}\mathbf{H}_p^{(t)}\mathbf{H}_p^{(t)\top}\right) \\
& + \sum_{p=1}^{m}\beta_p\operatorname{Tr}\!\left(\mathbf{H}_p^{(s)}\mathbf{H}_p^{(s)\top}\mathbf{H}\mathbf{H}^{\top}\right) \\
\text{s.t.} \ & \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}, \quad \mathbf{H}_p^{(t)\top}\mathbf{H}_p^{(t)}=\mathbf{I}_{c_t}, \ \mathbf{H}_p^{(t)}\in\mathbb{R}^{n\times c_t}, \quad n > c_1 > \cdots > c_s > k, \\
& \boldsymbol{\beta}^{\top}\boldsymbol{\beta}=1, \ \beta_p\geq 0, \ \boldsymbol{\beta}\in\mathbb{R}^{m}, \quad \boldsymbol{\gamma}^{(t)\top}\boldsymbol{\gamma}^{(t)}=1, \ \gamma_p^{(t)}\geq 0, \ \boldsymbol{\gamma}^{(t)}\in\mathbb{R}^{m},
\end{aligned} \tag{9}$$

in which $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}^{(t)}$ are the weights corresponding to each kernel. It is worth noting that no hyper-parameters are employed, since we treat all partition-information distilling processes equally. Overall, the partition information is first distilled from the kernel matrices $\mathbf{K}_p$ to $\mathbf{H}_p^{(1)}$, then to $\mathbf{H}_p^{(2)}$, ..., to $\mathbf{H}_p^{(s)}$, and finally fused in $\mathbf{H}$. With the dimensions reducing in steps, the relatively important similarity information advantageous for clustering is gradually gathered and used to update the consensus partition matrix $\mathbf{H}$. Since all steps are unified in one objective, the update of $\mathbf{H}$ in turn guides the information-distilling processes, leading to an ideal performance at last.

## Optimization

### Optimization Strategy

The alternating optimization strategy is adopted to solve the resultant objective. Specifically, each unknown variable is solved with the others fixed, and its solution in each step is guaranteed to be globally optimal, as shown in Eq. (10)-(18).

i) Optimizing $\{\mathbf{H}_p^{(1)}\}_{p=1}^{m}$. Given $\mathbf{K}_p$, $\mathbf{H}_p^{(2)}$, $\gamma_p^{(1)}$ and $\gamma_p^{(2)}$, the optimization reduces to

$$\max_{\mathbf{H}_p^{(1)}} \ \operatorname{Tr}\!\left[\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}\left(\gamma_p^{(1)}\mathbf{K}_p+\gamma_p^{(2)}\mathbf{H}_p^{(2)}\mathbf{H}_p^{(2)\top}\right)\right] \quad \text{s.t.} \ \mathbf{H}_p^{(1)\top}\mathbf{H}_p^{(1)}=\mathbf{I}_{c_1}, \ \mathbf{H}_p^{(1)}\in\mathbb{R}^{n\times c_1}. \tag{10}$$

ii) Optimizing $\{\mathbf{H}_p^{(t)}\}_{t=2}^{s-1}$. Given $\mathbf{H}_p^{(t-1)}$, $\mathbf{H}_p^{(t+1)}$, $\gamma_p^{(t)}$ and $\gamma_p^{(t+1)}$, Eq. (9) w.r.t. $\mathbf{H}_p^{(t)}$ can be solved as

$$\max_{\mathbf{H}_p^{(t)}} \ \operatorname{Tr}\!\left[\mathbf{H}_p^{(t)}\mathbf{H}_p^{(t)\top}\left(\gamma_p^{(t)}\mathbf{H}_p^{(t-1)}\mathbf{H}_p^{(t-1)\top}+\gamma_p^{(t+1)}\mathbf{H}_p^{(t+1)}\mathbf{H}_p^{(t+1)\top}\right)\right] \quad \text{s.t.} \ \mathbf{H}_p^{(t)\top}\mathbf{H}_p^{(t)}=\mathbf{I}_{c_t}, \ \mathbf{H}_p^{(t)}\in\mathbb{R}^{n\times c_t}. \tag{11}$$

iii) Optimizing $\mathbf{H}_p^{(s)}$. With $\mathbf{H}_p^{(s-1)}$, $\mathbf{H}$, $\gamma_p^{(s)}$ and $\beta_p$ fixed, the subproblem can be written as

$$\max_{\mathbf{H}_p^{(s)}} \ \operatorname{Tr}\!\left[\mathbf{H}_p^{(s)}\mathbf{H}_p^{(s)\top}\left(\gamma_p^{(s)}\mathbf{H}_p^{(s-1)}\mathbf{H}_p^{(s-1)\top}+\beta_p\mathbf{H}\mathbf{H}^{\top}\right)\right] \quad \text{s.t.} \ \mathbf{H}_p^{(s)\top}\mathbf{H}_p^{(s)}=\mathbf{I}_{c_s}, \ \mathbf{H}_p^{(s)}\in\mathbb{R}^{n\times c_s}. \tag{12}$$

iv) Optimizing $\mathbf{H}$. Given $\{\mathbf{H}_p^{(s)}\}_{p=1}^{m}$ and $\boldsymbol{\beta}$, $\mathbf{H}$ can be optimized via

$$\max_{\mathbf{H}} \ \operatorname{Tr}\!\left[\mathbf{H}\mathbf{H}^{\top}\sum_{p=1}^{m}\beta_p\mathbf{H}_p^{(s)}\mathbf{H}_p^{(s)\top}\right] \quad \text{s.t.} \ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k, \ \mathbf{H}\in\mathbb{R}^{n\times k}. \tag{13}$$

v) Optimizing $\boldsymbol{\gamma}^{(t)}$. $\boldsymbol{\gamma}^{(t)}$ contains the weights across the $m$ views at the $t$-th distilling stage. In the first stage, $\mathbf{K}_p$ and $\mathbf{H}_p^{(1)}$ are known; in stages $t=2,\ldots,s$, $\mathbf{H}_p^{(t-1)}$ and $\mathbf{H}_p^{(t)}$ are known. The optimization problem w.r.t. $\boldsymbol{\gamma}^{(t)}$ can be solved as

$$\max_{\boldsymbol{\gamma}^{(t)}} \ \boldsymbol{\nu}^{(t)\top}\boldsymbol{\gamma}^{(t)} \quad \text{s.t.} \ \boldsymbol{\gamma}^{(t)\top}\boldsymbol{\gamma}^{(t)}=1, \ \gamma_p^{(t)}\geq 0, \ \boldsymbol{\gamma}^{(t)}\in\mathbb{R}^{m}, \tag{14}$$

where $\nu_p^{(1)}=\operatorname{Tr}\big(\mathbf{K}_p\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}\big)$ and $\nu_p^{(t)}=\operatorname{Tr}\big(\mathbf{H}_p^{(t-1)}\mathbf{H}_p^{(t-1)\top}\mathbf{H}_p^{(t)}\mathbf{H}_p^{(t)\top}\big)$ for $t\geq 2$.

vi) Optimizing $\boldsymbol{\beta}$. Given $\mathbf{H}$ and $\{\mathbf{H}_p^{(s)}\}_{p=1}^{m}$, $\boldsymbol{\beta}$ can be optimized as

$$\max_{\boldsymbol{\beta}} \ \boldsymbol{\nu}^{\top}\boldsymbol{\beta} \quad \text{s.t.} \ \boldsymbol{\beta}^{\top}\boldsymbol{\beta}=1, \ \beta_p\geq 0, \ \boldsymbol{\beta}\in\mathbb{R}^{m}, \tag{15}$$

where $\nu_p=\operatorname{Tr}\big(\mathbf{H}\mathbf{H}^{\top}\mathbf{H}_p^{(s)}\mathbf{H}_p^{(s)\top}\big)$.

The four optimization problems i)-iv) belong to one type and can be generalized as

$$\max_{\mathbf{U}} \ \operatorname{Tr}\!\left(\mathbf{U}\mathbf{U}^{\top}\mathbf{B}\right) \quad \text{s.t.} \ \mathbf{U}^{\top}\mathbf{U}=\mathbf{I}_c, \ \mathbf{U}\in\mathbb{R}^{n\times c}, \tag{16}$$

which can be easily solved via singular value decomposition (SVD) on $\mathbf{B}$ [Wang et al. 2019]. With the rank-$c$ SVD $\mathbf{B}\approx\mathbf{S}_c\boldsymbol{\Sigma}_c\mathbf{V}_c^{\top}$, where $\mathbf{S}_c\in\mathbb{R}^{n\times c}$, $\boldsymbol{\Sigma}_c\in\mathbb{R}^{c\times c}$ and $\mathbf{V}_c\in\mathbb{R}^{n\times c}$, and noting that each $\mathbf{B}$ above is symmetric positive semi-definite (so $\mathbf{V}_c=\mathbf{S}_c$), the solution can be taken as $\mathbf{U}=\mathbf{S}_c$, i.e. the eigenvectors of $\mathbf{B}$ associated with its $c$ largest eigenvalues. Meanwhile, the last two problems are equivalent to

$$\max_{\boldsymbol{\alpha}} \ \boldsymbol{\nu}^{\top}\boldsymbol{\alpha} \quad \text{s.t.} \ \boldsymbol{\alpha}^{\top}\boldsymbol{\alpha}=1, \ \alpha_p\geq 0, \ \boldsymbol{\alpha}\in\mathbb{R}^{m}, \tag{17}$$

whose closed-form solution is

$$\boldsymbol{\alpha}=\boldsymbol{\nu}/\|\boldsymbol{\nu}\|_2, \tag{18}$$

noting that the entries of $\boldsymbol{\nu}$ are traces of products of positive semi-definite matrices and hence non-negative, so the constraint $\alpha_p\geq 0$ is automatically satisfied.

A trick can be adopted in the above optimization process by defining two procedures, i.e. forwarding and back propagation. Forwarding indicates updating the variables in the order $\mathbf{H}_p^{(1)}\rightarrow\mathbf{H}_p^{(2)}\rightarrow\cdots\rightarrow\mathbf{H}_p^{(s)}\rightarrow\mathbf{H}$, i.e. the blue arrows in Fig. 1(d), while back propagation refers to updating them in the inverse order, i.e. the orange arrows in Fig. 1(d). We perform forwarding and back propagation alternately, which updates the variables more efficiently than applying forwarding alone throughout the optimization. The overall procedure is outlined in Algorithm 1.

Algorithm 1: Hierarchical multiple kernel clustering
Input: kernel matrices $\{\mathbf{K}_p\}_{p=1}^{m}$, cluster number $k$ and layer sizes $\{c_t\}_{t=1}^{s}$
Initialize: $\{\mathbf{H}_p^{(t)}\}_{p,t=1}^{m,s}$, $\mathbf{H}$, $\{\boldsymbol{\gamma}^{(t)}\}_{t=1}^{s}$ and $\boldsymbol{\beta}$
while not converged do
    update $\{\mathbf{H}_p^{(t)}\}_{p,t=1}^{m,s}$ by solving Eq. (10)-(12)
    update $\mathbf{H}$ by solving Eq. (13)
    update $\{\boldsymbol{\gamma}^{(t)}\}_{t=1}^{s}$ by solving Eq. (14)
    update $\boldsymbol{\beta}$ by solving Eq. (15)
end
Output: partition matrix $\mathbf{H}$.
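For concreteness, here is a minimal sketch of Algorithm 1 for the special case $s=2$, written directly from the update rules above: the subproblems of Eq. (10)-(13) are solved by taking top-$c$ eigenvectors (the generic form of Eq. (16)), and the weight updates of Eq. (14)-(15) use the closed form of Eq. (18). It performs a plain forward sweep per iteration rather than the alternating forwarding/back-propagation schedule, and stops once the objective of Eq. (9), which is monotonically non-decreasing and bounded, changes by less than a tolerance. All names are ours; this is not the released implementation.

```python
# Sketch of Algorithm 1 with s = 2 layers (illustrative, simplified schedule).
import numpy as np
from scipy.linalg import eigh


def top_eig(B, c):
    """Solver for Eq. (16): top-c eigenvectors of the symmetric PSD matrix B."""
    n = B.shape[0]
    _, U = eigh(B, subset_by_index=[n - c, n - 1])
    return U


def unit_nonneg(nu):
    """Closed form of Eq. (18): alpha = nu / ||nu||_2, with nu >= 0."""
    nu = np.maximum(np.asarray(nu, dtype=float), 0.0)
    return nu / (np.linalg.norm(nu) + 1e-12)


def hmkc_two_layer(kernels, k, c1, c2, max_iter=50, tol=1e-6):
    m = len(kernels)
    # Initialization: one forward pass (Eqs. 6-7) and uniform weights on the unit sphere.
    H1 = [top_eig(K, c1) for K in kernels]
    H2 = [top_eig(A @ A.T, c2) for A in H1]
    H = top_eig(sum(B @ B.T for B in H2), k)
    gamma1 = np.full(m, 1.0 / np.sqrt(m))
    gamma2 = np.full(m, 1.0 / np.sqrt(m))
    beta = np.full(m, 1.0 / np.sqrt(m))

    def objective():  # value of Eq. (9)
        return (sum(g * np.trace(K @ A @ A.T) for g, K, A in zip(gamma1, kernels, H1))
                + sum(g * np.trace(A @ A.T @ B @ B.T) for g, A, B in zip(gamma2, H1, H2))
                + sum(b * np.trace(B @ B.T @ H @ H.T) for b, B in zip(beta, H2)))

    prev = objective()
    for _ in range(max_iter):
        # Eq. (10): update H^(1)_p;  Eq. (12): update H^(2)_p;  Eq. (13): update H.
        H1 = [top_eig(g1 * K + g2 * (B @ B.T), c1)
              for g1, g2, K, B in zip(gamma1, gamma2, kernels, H2)]
        H2 = [top_eig(g2 * (A @ A.T) + b * (H @ H.T), c2)
              for g2, b, A in zip(gamma2, beta, H1)]
        H = top_eig(sum(b * (B @ B.T) for b, B in zip(beta, H2)), k)
        # Eqs. (14)-(15): update the kernel weights from the current traces.
        gamma1 = unit_nonneg([np.trace(K @ A @ A.T) for K, A in zip(kernels, H1)])
        gamma2 = unit_nonneg([np.trace(A @ A.T @ B @ B.T) for A, B in zip(H1, H2)])
        beta = unit_nonneg([np.trace(B @ B.T @ H @ H.T) for B in H2])
        cur = objective()
        if abs(cur - prev) / max(abs(prev), 1e-12) < tol:
            break
        prev = cur
    return H
```

In practice, discrete cluster labels would then be obtained by running k-means on the rows of the returned $\mathbf{H}$, as in the kernel k-means sketch above.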
### Convergence and Complexity Analysis

For ease of expression, we write the objective in Eq. (9) as

$$\max_{\{\mathbf{H}_p^{(t)}\}_{t=1}^{s},\,\mathbf{H},\,\{\boldsymbol{\gamma}^{(t)}\}_{t=1}^{s},\,\boldsymbol{\beta}} \ \mathcal{J}\!\left(\{\mathbf{H}_p^{(t)}\}_{t=1}^{s},\mathbf{H},\{\boldsymbol{\gamma}^{(t)}\}_{t=1}^{s},\boldsymbol{\beta}\right). \tag{19}$$

It is obvious that

$$\mathcal{J}\!\left(\{\mathbf{H}_p^{(t)(r)}\},\mathbf{H}^{(r)},\{\boldsymbol{\gamma}^{(t)(r)}\},\boldsymbol{\beta}^{(r)}\right) \leq \mathcal{J}\!\left(\{\mathbf{H}_p^{(t)(r+1)}\},\mathbf{H}^{(r)},\{\boldsymbol{\gamma}^{(t)(r)}\},\boldsymbol{\beta}^{(r)}\right), \tag{20}$$

where the superscript $(r)$ indicates the value at optimization round $r$. Eq. (20) holds because a globally optimal solution is obtained for each variable when the others are fixed. A similar inequality holds at each step, leading to

$$\mathcal{J}\!\left(\{\mathbf{H}_p^{(t)(r)}\},\mathbf{H}^{(r)},\{\boldsymbol{\gamma}^{(t)(r)}\},\boldsymbol{\beta}^{(r)}\right) \leq \mathcal{J}\!\left(\{\mathbf{H}_p^{(t)(r+1)}\},\mathbf{H}^{(r+1)},\{\boldsymbol{\gamma}^{(t)(r+1)}\},\boldsymbol{\beta}^{(r+1)}\right). \tag{21}$$

Therefore, the objective monotonically increases at each round. Meanwhile, the whole objective is upper-bounded: each trace term is bounded (e.g., $\operatorname{Tr}(\mathbf{K}_p\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top})\leq\operatorname{Tr}(\mathbf{K}_p)$ since $\mathbf{H}_p^{(1)}\mathbf{H}_p^{(1)\top}$ is an orthogonal projection) and the weight vectors lie on the unit sphere. As a result, the alternating algorithm is guaranteed to converge to a local maximum.

For Eq. (10)-(13), the SVD technique is applied to matrices of size $\mathbb{R}^{n\times n}$, requiring $\mathcal{O}(n^3)$ operations. Eq. (14) and Eq. (15) are computed with matrix multiplications and require $\mathcal{O}(n^2)$ and $\mathcal{O}(n)$ operations, respectively. Overall, the computational complexity is $\mathcal{O}(n^3)$. Note that all comparative methods in the experiments are also of $\mathcal{O}(n^3)$ complexity, guaranteeing the fairness of the comparison.

## Experiment

### Experiment Settings

Datasets. We evaluate the proposed algorithm on eight widely used MKC benchmark datasets, including AR10P [1], BBCSport [2], CCV [3], Flower17 [4], Flower102 [5], Ionosphere [6], Heart [7] and Plant [8]. Their details are shown in Table 1.

| Dataset | Number of Samples | Kernels | Clusters |
|---|---|---|---|
| AR10P | 130 | 6 | 10 |
| BBCSport | 544 | 6 | 5 |
| CCV | 6773 | 3 | 20 |
| Flower17 | 1360 | 7 | 17 |
| Flower102 | 8189 | 4 | 102 |
| Heart | 270 | 13 | 2 |
| Ionosphere | 351 | 33 | 2 |
| Plant | 940 | 69 | 4 |

Table 1: Specifications of the used datasets.

Dataset and code sources: [1] featureselection.asu.edu/old/datasets.php; [2] mlg.ucd.ie/datasets/segment.html; [3] www.ee.columbia.edu/ln/dvmm/CCV/; [4] www.robots.ox.ac.uk/~vgg/data/flowers/17/; [5] www.robots.ox.ac.uk/~vgg/data/flowers/102/; [6] archive.ics.uci.edu/ml/datasets/Ionosphere; [7] archive.ics.uci.edu/ml/datasets/Statlog+(Heart); [8] www.imageclef.org/2013/plant; [9] https://github.com/liujiyuan13/HMKC-code release.

Comparative Methods. Thirteen algorithms are chosen from the MKC literature for comparison. Specifically, A-MKKM (baseline) obtains the consensus kernel by linearly combining all base kernels with equal weights and then performs kernel k-means on it, while SB-KKM (baseline) performs kernel k-means on every single kernel and selects the best result. We also select seven classical algorithms, including CSRC [Kumar, Rai, and III 2011], MKKM [Huang, Chuang, and Chen 2012], RMSC [Xia et al. 2014], RMKC [Zhou et al. 2015], RMKKM [Du et al. 2015], MKKM-MR [Liu et al. 2016] and ONKC [Liu et al. 2017]. Additionally, we choose four recently proposed methods from high-quality conferences and journals, i.e. LFAM [Wang et al. 2019], SPC [Kang et al. 2019], MVCC [Zhan et al. 2019] and SPMKC [Ren and Sun 2020]. Their source codes are publicly available and are used without modification in the comparison. Meanwhile, we release the source code of HMKC on GitHub [9].

Parameter Settings. Only the sizes of the intermediary matrices $\{\mathbf{H}_p^{(t)}\}_{t=1}^{s}$ need to be specified manually. In the following, we consider two experimental settings. The first, called HMKC-1, employs one layer of intermediary matrices, $\{\mathbf{H}_p^{(1)}\}_{p=1}^{m}\in\mathbb{R}^{n\times c}$, where $c$ is searched in $\{2k, 3k, \ldots, 20k\}$. The second, named HMKC-2, employs two layers of intermediary matrices, $\{\mathbf{H}_p^{(1)}\}_{p=1}^{m}\in\mathbb{R}^{n\times c_1}$ and $\{\mathbf{H}_p^{(2)}\}_{p=1}^{m}\in\mathbb{R}^{n\times c_2}$, in which $c_1 > c_2$ and $c_1$, $c_2$ are grid-searched in $\{2k, 3k, \ldots, 20k\}$. Specifically, the search region is restricted to $\{2k, 3k, \ldots, 10k\}$ for AR10P since it has a relatively small number of samples in each cluster. Note that the aforementioned $k$ refers to the number of clusters. For the other comparative methods, we perform grid search on the parameters recommended in the corresponding papers and report the best results.
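A minimal sketch of this evaluation protocol follows, under our own naming (the hypothetical `run_model` callback stands in for training HMKC-1 at a given intermediary size `c`; this is not the released evaluation code): clustering accuracy is computed with the usual Hungarian matching of predicted to ground-truth labels, and NMI with scikit-learn.

```python
# Illustrative grid search over the intermediary size c in {2k, ..., 20k}.
# Assumes integer labels in {0, ..., k-1}; run_model is a user-supplied stand-in.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """ACC via the best label permutation (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(-cost)  # maximize the number of matches
    return cost[row, col].sum() / y_true.size


def grid_search_c(run_model, kernels, y_true, k, multiples=range(2, 21)):
    """run_model(kernels, k, c) -> consensus partition H (n x k)."""
    best = {"acc": 0.0, "nmi": 0.0, "c": None}
    for mult in multiples:
        c = mult * k
        H = run_model(kernels, k, c)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(H)
        acc = clustering_accuracy(y_true, labels)
        if acc > best["acc"]:
            best.update(acc=acc, nmi=normalized_mutual_info_score(y_true, labels), c=c)
    return best
```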
ACC (%):

| Algorithm | AR10P | BBCSport | CCV | Flower17 | Flower102 | Heart | Ionosphere | Plant |
|---|---|---|---|---|---|---|---|---|
| A-MKKM | 38.46 | 65.99 | 19.74 | 51.03 | 27.29 | 82.22 | 61.25 | 61.70 |
| SB-KKM | 49.23 | 76.65 | 20.08 | 42.06 | 33.13 | 76.30 | 70.09 | 51.60 |
| CSRC | 36.92 | 67.65 | 23.40 | 51.69 | 35.19 | 80.37 | 75.78 | 54.57 |
| MKKM | 37.69 | 66.36 | 18.01 | 45.37 | 21.96 | 53.33 | 63.82 | 56.38 |
| RMSC | 34.62 | 86.03 | 16.29 | 52.57 | 32.97 | 83.33 | 84.62 | 55.53 |
| RMKC | 43.85 | 66.36 | 19.74 | 52.35 | 33.55 | 82.22 | 66.10 | 61.70 |
| RMKKM | 32.31 | 53.13 | 17.11 | 53.16 | 29.61 | 76.30 | 65.81 | 55.32 |
| MKKM-MR | 43.08 | 66.36 | 22.47 | 60.00 | 40.27 | 83.33 | 61.25 | 62.87 |
| ONKC | 48.46 | 67.65 | 23.12 | 59.85 | 41.26 | 83.70 | 63.82 | 64.89 |
| LFAM | 46.92 | 72.61 | 26.66 | 59.63 | 44.61 | 82.22 | 68.09 | 62.77 |
| SPC | 38.46 | 83.09 | - | 57.50 | - | 75.93 | 71.23 | 60.43 |
| MVCC | 43.85 | 74.26 | 22.10 | 51.47 | 37.23 | 82.96 | 55.56 | 55.64 |
| SPMKC | 54.62 | 40.81 | 13.54 | 35.81 | - | 57.04 | 72.93 | 53.94 |
| HMKC-1 | 56.15 | 89.52 | 36.57 | 66.91 | 46.84 | 86.67 | 86.32 | 64.68 |
| HMKC-2 | 60.00 | 90.99 | 37.37 | 71.18 | 50.32 | 86.30 | 86.89 | 67.02 |

NMI (%):

| Algorithm | AR10P | BBCSport | CCV | Flower17 | Flower102 | Heart | Ionosphere | Plant |
|---|---|---|---|---|---|---|---|---|
| A-MKKM | 34.57 | 53.92 | 17.16 | 50.19 | 46.32 | 32.40 | 3.29 | 26.82 |
| SB-KKM | 51.44 | 59.39 | 17.73 | 45.14 | 48.99 | 20.49 | 10.35 | 17.18 |
| CSRC | 36.90 | 55.41 | 20.96 | 52.63 | 53.73 | 28.53 | 15.99 | 21.82 |
| MKKM | 38.92 | 54.67 | 15.52 | 45.35 | 42.30 | 0.02 | 5.12 | 20.02 |
| RMSC | 29.70 | 73.89 | 13.79 | 56.35 | 53.36 | 34.51 | 36.52 | 23.83 |
| RMKC | 44.38 | 54.30 | 17.16 | 50.42 | 49.74 | 32.40 | 7.02 | 26.82 |
| RMKKM | 27.96 | 28.48 | 12.54 | 53.31 | 48.55 | 20.49 | 9.33 | 19.71 |
| MKKM-MR | 42.62 | 54.67 | 18.62 | 57.11 | 57.38 | 35.22 | 3.29 | 28.29 |
| ONKC | 51.86 | 54.74 | 19.19 | 56.85 | 57.39 | 36.22 | 31.13 | 31.13 |
| LFAM | 47.11 | 54.99 | 19.85 | 57.83 | 57.58 | 32.40 | 1.15 | 27.64 |
| SPC | 39.02 | 65.49 | - | 61.32 | - | 19.67 | 0.35 | 29.65 |
| MVCC | 38.56 | 51.52 | 15.97 | 56.26 | 52.35 | 34.54 | 0.63 | 20.85 |
| SPMKC | 58.39 | 7.31 | 8.11 | 40.78 | - | 10.95 | 1.64 | 19.28 |
| HMKC-1 | 54.46 | 72.73 | 31.64 | 63.37 | 60.40 | 42.98 | 39.11 | 34.11 |
| HMKC-2 | 60.32 | 76.72 | 32.71 | 65.45 | 62.47 | 43.40 | 40.55 | 36.43 |

Table 2: The performance comparison of multiple kernel clustering algorithms in the recent literature (upper block: ACC in %, lower block: NMI in %).

### Experiment Results

In order to show the effectiveness and the superiority over algorithms in the MKC literature, we compare the proposed method with the other thirteen algorithms on the eight widely used benchmark datasets. Two common metrics, accuracy (ACC) and normalized mutual information (NMI), are adopted to measure the performance. The results are shown in Table 2, where "-" indicates that the results of the corresponding algorithm are unavailable due to the long execution time. We have the following observations:

Effectiveness. Some MKC algorithms perform worse than the baselines A-MKKM and SB-KKM. For instance, MKKM achieves 45.37% accuracy on Flower17, which is 5.66% lower than A-MKKM. In contrast, both the proposed HMKC-1 and HMKC-2 outperform the baselines consistently across all datasets. This illustrates that the proposed algorithm is effective in exploring the complementary information of multiple kernels to boost clustering performance.
Superiority. Although HMKC-1 falls slightly behind the best competing result in a few cases, e.g. by 0.21% in accuracy on Plant, 3.93% in NMI on AR10P and 1.16% in NMI on BBCSport, it shows obvious improvements on the other datasets. Moreover, HMKC-2 consistently and largely outperforms the other algorithms across all datasets. In particular, the proposed algorithm exhibits excellent performance on CCV, Flower17 and Flower102, where increases of around 5% to 10% are obtained. In addition, SPC is of high complexity, and its results on big datasets such as CCV and Flower102 are unavailable within the limited execution time, which again indicates the superiority of our methods. Overall, the proposed method outperforms the recent algorithms by a large margin, achieving state-of-the-art performance.

Validation of the Intermediary Matrix. CSRC [Kumar, Rai, and III 2011] also uses base partitions of each view to construct the final partition while updating them concurrently, but it does not relax the base partitions into $\mathbb{R}^{n\times c}$ with $c > k$. HMKC-1 exceeds it in accuracy by 19.23% on AR10P, 21.87% on BBCSport, 13.17% on CCV, 15.22% on Flower17, 11.65% on Flower102, 6.30% on Heart, 10.54% on Ionosphere and 10.11% on Plant. Meanwhile, the first kernel matrix $\mathbf{K}_1$ of CCV, the corresponding intermediary matrices $\mathbf{H}_1^{(1)}$ and $\mathbf{H}_1^{(2)}$, and the final partition $\mathbf{H}$ of the proposed HMKC-2 are visualized in Fig. 3. It can be seen that a clearer and clearer clustering structure emerges along the clustering process. These two observations validate that adopting intermediary matrices of decreasing sizes keeps more details that are advantageous for clustering.

Figure 3: Visualization of the kernel matrix $\mathbf{K}_1$, the proposed intermediary matrices $\mathbf{H}_1^{(1)}\mathbf{H}_1^{(1)\top}$ and $\mathbf{H}_1^{(2)}\mathbf{H}_1^{(2)\top}$, and the final partition $\mathbf{H}\mathbf{H}^{\top}$ in the HMKC-2 model.

Multiple Layers. Comparing the results in Table 2, it can be observed that HMKC-2 achieves consistent performance improvements over HMKC-1 on all datasets and metrics. Furthermore, starting from a HMKC-1 model with $c = 2k$, we add one more layer of intermediary matrices, each $k$ larger, at every step, so that 19 models are built in total. These models are tested on CCV, and the results are presented in Fig. 4(c). The performance shows a relatively large increase, around 2% in NMI, at first. Although a sudden drop is observed at 18 layers, the performance rises steadily as the number of layers increases. Overall, the two above observations illustrate that adopting multiple layers of intermediary matrices in the proposed algorithm increases the performance.

Figure 4: (a) ACC variation of the proposed HMKC-1 model for $2k \leq c \leq 20k$; (b) ACC variation of the proposed 2-layer model for $2k \leq c_1, c_2 \leq 20k$; (c) ACC at different numbers of layers; (d) the objective value by epoch at $c_1 = 20k$ and $c_2 = 2k$. Note that $k$ represents the number of clusters, and the legend "Second best" refers to the best result of the second best algorithm in the recent literature.

### Parameter Study and Convergence

In order to investigate the parameter stability of the proposed algorithm, we perform a grid search on CCV. Specifically, we vary the parameter $c$ of the HMKC-1 model from $2k$ to $20k$; the corresponding accuracies are shown in Fig. 4(a).
It can be observed that the proposed algorithm consistently outperforms the baselines, A-MKKM and SB-KKM, as well as the best result of the second best algorithm, over a wide parameter range. Moreover, the performance on CCV steadily increases while enlarging the size of the intermediary matrices, again verifying the effectiveness of the proposed intermediary matrix. Fig. 4(b) presents the performance variation of the HMKC-2 model when the parameters $c_1$ and $c_2$ vary from $3k$ to $20k$ and from $2k$ to $20k$, respectively. We find that the proposed algorithm exhibits large performance improvements over the baselines and the second best algorithm across the whole parameter range. Similar to the HMKC-1 model, the accuracy also rises when employing larger $c_1$ and $c_2$. This verifies that better performance can be obtained by designing larger intermediary matrices to keep more advantageous partition details for clustering.

Fig. 4(c) shows the performance when increasing the number of layers of intermediary matrices. Although some improvements are observed at larger layer numbers, we recommend the HMKC-2 model for clustering tasks because of its aforementioned stability and its relatively low complexity compared with models of 3-19 layers. Fig. 4(d) presents the variation of the objective value on CCV at $c_1 = 20k$ and $c_2 = 2k$. It can be seen that the objective monotonically increases and quickly converges within 14 iterations. In fact, the algorithm converges in fewer than 15 iterations on all datasets in most cases, making it practical in real-world applications.

## Conclusion

Both early-fusion and late-fusion MKC methods suffer from information loss when encoding the clustering details from kernels or graphs of size $\mathbb{R}^{n\times n}$ into a partition matrix of size $\mathbb{R}^{n\times k}$. To address this issue, we propose a hierarchical algorithm. The proposed method is tested on benchmark datasets and outperforms the comparative algorithms, showing state-of-the-art performance by a large margin. In future work, we will explore the relationship between the eigenvalues of the kernel matrices and the sizes of the intermediary matrices.

## Acknowledgments

We would like to thank the support from the National Key R&D Program of China (project no. 2020AAA0107100), the Education Ministry-China Mobile Research Funding (project no. MCM20170404) and the Natural Science Foundation of China (project no. 62006237).

## References

Dhillon, I. S.; Guan, Y.; and Kulis, B. 2004. Kernel k-means: spectral clustering and normalized cuts. In Kim, W.; Kohavi, R.; Gehrke, J.; and DuMouchel, W., eds., Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, 551-556. ACM.

Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; and Shen, Y. 2015. Robust Multiple Kernel K-means Using L21-Norm. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 3476-3482.

Gönen, M.; and Margolin, A. A. 2014. Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 1305-1313.

Huang, H.; Chuang, Y.; and Chen, C. 2012. Multiple Kernel Fuzzy Clustering. IEEE Trans. Fuzzy Systems 20(1): 120-134.

Kalman, D. 1996. A Singularly Valuable Decomposition: The SVD of a Matrix. College Mathematics Journal 27(1): 2-23.
Kang, Z.; Peng, C.; Cheng, Q.; and Xu, Z. 2018. Unified Spectral Clustering With Optimal Graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 3366-3373.

Kang, Z.; Xu, H.; Wang, B.; Zhu, H.; and Xu, Z. 2019. Clustering with Similarity Preserving. Neurocomputing 365: 211-218.

Kloft, M.; Brefeld, U.; Sonnenburg, S.; and Zien, A. 2011. lp-Norm Multiple Kernel Learning. J. Mach. Learn. Res. 12: 953-997.

Kloft, M.; Rückert, U.; and Bartlett, P. L. 2010. A Unifying View of Multiple Kernel Learning. In Balcázar, J. L.; Bonchi, F.; Gionis, A.; and Sebag, M., eds., Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, volume 6322, 66-81.

Kumar, A.; Rai, P.; and III, H. D. 2011. Co-regularized Multi-view Spectral Clustering. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Proceedings of a meeting held 12-14 December 2011, Granada, Spain, 1413-1421.

Liu, J.; Liu, X.; Xiong, J.; Liao, Q.; Zhou, S.; Wang, S.; and Yang, Y. 2020a. Optimal Neighborhood Multiple Kernel Clustering with Adaptive Local Kernels. IEEE Transactions on Knowledge and Data Engineering (TKDE).

Liu, X.; Dou, Y.; Yin, J.; Wang, L.; and Zhu, E. 2016. Multiple Kernel k-Means Clustering with Matrix-Induced Regularization. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 1888-1894.

Liu, X.; Li, M.; Tang, C.; Xia, J.; Xiong, J.; Liu, L.; Kloft, M.; and Zhu, E. 2020b. Efficient and Effective Regularized Incomplete Multi-view Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 1-1. doi:10.1109/TPAMI.2020.2974828.

Liu, X.; Zhou, S.; Wang, Y.; Li, M.; Dou, Y.; Zhu, E.; and Yin, J. 2017. Optimal Neighborhood Kernel Clustering with Multiple Kernels. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2266-2272.

Liu, X.; Zhu, E.; Liu, J.; Hospedales, T. M.; Wang, Y.; and Wang, M. 2020c. SimpleMKKM: Simple Multiple Kernel K-means. arXiv abs/2005.04975. URL https://arxiv.org/abs/2005.04975.

Liu, X.; Zhu, X.; Li, M.; Wang, L.; Tang, C.; Yin, J.; Shen, D.; Wang, H.; and Gao, W. 2019. Late Fusion Incomplete Multi-View Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 41(10): 2410-2423.

Long, B.; Yu, P. S.; and Zhang, Z. M. 2008. A General Model for Multiple View Unsupervised Learning. In Proceedings of the SIAM International Conference on Data Mining, SDM 2008, April 24-26, 2008, Atlanta, Georgia, USA, 822-833.

Peng, X.; Huang, Z.; Lv, J.; Zhu, H.; and Zhou, J. T. 2019. COMIC: Multi-view Clustering Without Parameter Selection. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, 9-15 June 2019, Long Beach, California, USA, volume 97, 5092-5101.

Ren, Z.; and Sun, Q. 2020. Simultaneous Global and Local Graph Structure Preserving for Multiple Kernel Clustering. IEEE Transactions on Neural Networks and Learning Systems 1-13.

Ren, Z.; Yang, S. X.; Sun, Q.; and Wang, T. 2020. Consensus Affinity Graph Learning for Multiple Kernel Clustering. IEEE Transactions on Systems, Man, and Cybernetics 1-12.

Schölkopf, B.; Smola, A. J.; and Müller, K. 1998. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10(5): 1299-1319.
Wang, S.; Liu, X.; Zhu, E.; Tang, C.; Liu, J.; Hu, J.; Xia, J.; and Yin, J. 2019. Multi-view Clustering via Late Fusion Alignment Maximization. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 3778-3784.

Wen, J.; Zhang, Z.; Zhang, Z.; Fei, L.; and Wang, M. 2020. Generalized Incomplete Multiview Clustering With Flexible Locality Structure Diffusion. IEEE Transactions on Systems, Man, and Cybernetics 1-14.

Xia, R.; Pan, Y.; Du, L.; and Yin, J. 2014. Robust Multi-View Spectral Clustering via Low-Rank and Sparse Decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, 2149-2155.

Zhan, K.; Nie, F.; Wang, J.; and Yang, Y. 2019. Multiview Consensus Graph Clustering. IEEE Trans. Image Processing 28(3): 1261-1270.

Zhang, Z. 2015. The Singular Value Decomposition, Applications and Beyond. CoRR abs/1510.08532.

Zhou, P.; Du, L.; Shi, L.; Wang, H.; and Shen, Y. 2015. Recovery of Corrupted Multiple Kernels for Clustering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 4105-4111.

Zhou, S.; Liu, X.; Liu, J.; Guo, X.; Zhao, Y.; Zhu, E.; Zhai, Y.; Yin, J.; and Gao, W. 2020. Multi-View Spectral Clustering with Optimal Neighborhood Laplacian Matrix. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, 6965-6972. AAAI Press.