# Discrete Multiple Kernel k-means

Rong Wang¹, Jitao Lu², Yihang Lu², Feiping Nie² and Xuelong Li²

¹School of Cybersecurity and School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, P. R. China
²School of Computer Science and School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, Shaanxi, P. R. China

wangrong07@tsinghua.org.cn, dianlujitao@gmail.com, sebastian.yihanglu@gmail.com, feipingnie@gmail.com, li@nwpu.edu.cn

Abstract

The multiple kernel k-means (MKKM) and its variants utilize complementary information from different kernels, achieving better performance than kernel k-means (KKM). However, the optimization procedures of previous works all comprise two stages: learning the continuous relaxed label matrix and then obtaining the discrete one by extra discretization procedures. Such a two-stage strategy gives rise to a mismatch problem and severe information loss. To address this problem, we elaborate a novel Discrete Multiple Kernel k-means (DMKKM) model solved by an optimization algorithm that directly obtains the cluster indicator matrix without subsequent discretization procedures. Moreover, DMKKM can strictly measure the correlations among kernels, which is capable of enhancing kernel fusion by reducing redundancy and improving diversity. What's more, DMKKM is parameter-free, avoiding intractable hyperparameter tuning, which makes it feasible in practical applications. Extensive experiments illustrate the effectiveness and superiority of the proposed model.

1 Introduction

Clustering is one of the most fundamental topics in machine learning and data mining. Among various clustering algorithms, k-means [Hartigan and Wong, 1979] enjoys huge popularity because of its efficiency and simplicity, but it fails to cope with non-globular clusters, which are very common in practice. Thus, researchers have put forward a series of models to solve this problem, e.g., kernel k-means clustering (KKM) [Schölkopf et al., 1998], which uses a kernel function to embed the original data into a high-dimensional Reproducing Kernel Hilbert Space (RKHS) where standard k-means clustering is performed on the linearly separable mapped data [Van Laarhoven and Marchiori, 2016; He and Zhang, 2018; Calandriello and Rosasco, 2018; Wang et al., 2019a; Vankadara and Ghoshdastidar, 2020].

Although KKM intends to improve the clustering performance by introducing kernel functions, it is unable to identify in advance whether a specific kernel function is suitable for a particular task. To alleviate this problem, it is a good idea to allow the algorithm to adaptively choose appropriate kernels, exploiting complementary information from different kernels to enhance learning, which is known as multiple kernel learning [Zhao et al., 2009; Xu et al., 2017; Kang et al., 2018]. Huang et al. proposed multiple kernel k-means clustering (MKKM), which applies the multiple kernel learning setting to kernel k-means clustering [Huang et al., 2011] and unifies the kernel fusion process and clustering into a single optimization framework. A concurrent work, OKKC, optimizes the kernel coefficients and cluster membership based on the same Rayleigh quotient objective and claims lower complexity [Yu et al., 2011]. In the past decades, many studies have been devoted to improving MKKM. Gönen et al.
proposed localized multiple kernel k-means (LMKKM) to adaptively change the kernel coefficients with a localized data fusion approach that captures sample-specific characteristics of the data [Gönen and Margolin, 2014]. Besides, Du et al. replaced the squared error term of k-means with an ℓ2,1-norm based one and proposed a robust multiple kernel k-means clustering (RMKKM) algorithm to improve the robustness with respect to noise and outliers [Du et al., 2015]. Liu et al. argued that previous works have not sufficiently considered the correlation among different kernels and proposed the MKKM-MR model, which employs a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels [Liu et al., 2016].

Although previous works made some progress in MKKM clustering performance, they suffer from various problems. To deal with the NP-hard cluster assignment problem, they all adopt a two-stage process: learning the continuous relaxation matrix and then obtaining the discrete cluster indicator matrix by an extra discretization process, which results in a mismatch problem and severe information loss. Moreover, existing MKKM models overlook the correlation among different kernels, leading to the fusion of mutually redundant kernels and harming the diversity of information sources. What's worse, most existing models achieve desirable clustering performance only by tuning the hyperparameters of regularization terms, which is intractable in practice owing to the tedious setting and searching of hyperparameters.

In this paper, we elaborate a novel Discrete Multiple Kernel k-means (DMKKM) clustering model, which aims at overcoming the limitations and weaknesses caused by the above problems. The major contributions of our model can be summarized as follows. Firstly, DMKKM directly obtains the cluster indicator matrix without subsequent steps, making it the first model to directly solve the cluster assignment problem while avoiding information loss and over-reliance on extra discretization procedures. Secondly, our model is capable of measuring the correlation among kernels by penalizing the selection of highly correlated kernels, which successfully enhances kernel fusion by reducing redundancy and improving diversity. Thirdly, the proposed model is completely parameter-free, avoiding intractable hyperparameter tuning, which makes it more feasible in practice. Lastly, extensive experiments conducted on several real-world benchmark datasets demonstrate the effectiveness and efficiency of the proposed model.

2 Related Work

2.1 Kernel k-means (KKM)

Given a data matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, where $n$ is the number of data points and $d$ is the feature dimension, let $\phi(\cdot): \mathbb{R}^d \to \mathcal{H}$ be a kernel mapping that maps $X$ into a reproducing kernel Hilbert space $\mathcal{H}$. Let $F \in \mathbb{B}^{n \times c}$ denote the cluster indicator matrix, written $F \in \mathrm{Ind}$, where $c$ is the number of clusters: if $\phi(x_i)$ is assigned to the $j$-th cluster then $F_{ij} = 1$, otherwise $F_{ij} = 0$. The objective of kernel k-means (KKM) is to minimize the sum of squared errors defined by

$$\min_{m_j,\, F \in \mathrm{Ind}} \ \sum_{j=1}^{c} \sum_{i=1}^{n} \|\phi(x_i) - m_j\|_2^2\, F_{ij}, \tag{1}$$

where $m_j$ denotes the center of the $j$-th cluster. Problem (1) can be reformulated in matrix form as

$$\min_{M,\, F \in \mathrm{Ind}} \ \|\phi(X) - M F^T\|_F^2, \tag{2}$$

where $\phi(X) = [\phi(x_1), \dots, \phi(x_n)]$ and $M = [m_1, \dots, m_c]$ denotes the clustering centroid matrix. Taking the derivative of Eq. (2) w.r.t. $M$ and setting it to zero, we have

$$M = \phi(X) F (F^T F)^{-1}. \tag{3}$$
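As a quick illustration of Eqs. (1)-(3), the following minimal NumPy sketch (ours, not the authors' code) takes the special case of a linear kernel, $\phi(x) = x$, for which the centroids of Eq. (3) can be formed explicitly and coincide with the per-cluster means.

```python
# Minimal sketch (ours) of Eqs. (1)-(3) for the linear-kernel case phi(x) = x,
# where the centroid matrix of Eq. (3) is explicit and equals the per-cluster means.
import numpy as np

n, d, c = 12, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))           # data matrix, one column per sample
labels = np.arange(n) % c             # an arbitrary assignment with no empty cluster

F = np.zeros((n, c))                  # F in Ind: exactly one 1 per row
F[np.arange(n), labels] = 1.0

M = X @ F @ np.linalg.inv(F.T @ F)    # Eq. (3): closed-form centroids

# Column j of M is the mean of the samples assigned to cluster j.
for j in range(c):
    assert np.allclose(M[:, j], X[:, labels == j].mean(axis=1))

print(np.linalg.norm(X - M @ F.T) ** 2)   # the k-means objective of Eq. (2)
```

In a general RKHS the centroids cannot be formed explicitly; instead they are eliminated through the kernel trick, which is exactly the substitution carried out next.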
Substituting Eq. (3) into Eq. (2) yields

$$\min_{F \in \mathrm{Ind}} \ \mathrm{Tr}(K) - \mathrm{Tr}\big((F^T F)^{-\frac{1}{2}} F^T K F (F^T F)^{-\frac{1}{2}}\big), \tag{4}$$

where $K = \phi(X)^T \phi(X) \in \mathbb{R}^{n \times n}$ is the kernel matrix with $(i, j)$-th element $K(i, j) = \phi(x_i)^T \phi(x_j)$. Problem (4) is NP-hard since the elements of $F$ are constrained to be discrete values. A widely used way is to relax the discrete constraint on $F$ and allow $\tilde{F} = F (F^T F)^{-\frac{1}{2}}$ to take arbitrary real values, so that problem (4) becomes

$$\max_{\tilde{F}^T \tilde{F} = I} \ \mathrm{Tr}(\tilde{F}^T K \tilde{F}), \tag{5}$$

where $I$ denotes the identity matrix. The optimal solution to problem (5) is formed by the eigenvectors of $K$ corresponding to its largest $c$ eigenvalues. Since $\tilde{F}$ is now in relaxed continuous form and has mixed signs, we have to rely on other discretization procedures, such as k-means, to obtain the discrete cluster indicator matrix $F$.

2.2 Multiple Kernel k-means (MKKM)

In the multiple kernel setting, each data point has multiple feature representations induced by a group of kernel mappings $\{\phi_p(\cdot)\}_{p=1}^{v}$, and is represented as $\phi_\gamma(x_i) = [\gamma_1 \phi_1(x_i); \dots; \gamma_v \phi_v(x_i)]$, where $\gamma = [\gamma_1, \dots, \gamma_v]^T$ denotes the coefficients of the base kernels and needs to be optimized through learning. $K_\gamma = \sum_{p=1}^{v} \gamma_p^2 K_p$ denotes the fused kernel matrix with $(i, j)$-th element $K_\gamma(i, j) = \phi_\gamma(x_i)^T \phi_\gamma(x_j) = \sum_{p=1}^{v} \gamma_p^2 \phi_p(x_i)^T \phi_p(x_j)$. By replacing the kernel matrix $K$ in Eq. (4) with $K_\gamma$, the objective of multiple kernel k-means (MKKM) can be formulated as

$$\min_{F \in \mathrm{Ind},\ \gamma^T \mathbf{1} = 1,\ \gamma_p \geq 0\ \forall p} \ \mathrm{Tr}\big(K_\gamma (I - F (F^T F)^{-1} F^T)\big), \tag{6}$$

where $\mathbf{1}$ is a column vector with all elements being 1. This problem can be solved by alternately updating $F$ and $\gamma$:

1) Update $F$ with $\gamma$ fixed. Taking the same two-stage strategy as KKM, problem (6) becomes

$$\max_{\tilde{F}^T \tilde{F} = I} \ \mathrm{Tr}(\tilde{F}^T K_\gamma \tilde{F}), \tag{7}$$

where $\tilde{F}$ denotes the continuous relaxation of $F (F^T F)^{-\frac{1}{2}}$. The optimal $\tilde{F}$ is given by the $c$ eigenvectors of $K_\gamma$ corresponding to the largest $c$ eigenvalues. Then, $F$ is obtained from $\tilde{F}$ through extra discretization procedures.

2) Update $\gamma$ with $F$ fixed. Problem (6) becomes

$$\min_{\gamma^T \mathbf{1} = 1,\ \gamma_p \geq 0\ \forall p} \ \sum_{p=1}^{v} \gamma_p^2\, \mathrm{Tr}\big(K_p (I - F (F^T F)^{-1} F^T)\big), \tag{8}$$

where $\gamma$ can be optimized by solving the above quadratic programming (QP) problem with linear constraints.

2.3 MKKM with Matrix-induced Regularization (MKKM-MR)

Observing that MKKM does not sufficiently consider the correlation among base kernels, Liu et al. proposed to reduce the redundancy and enhance the diversity of the selected kernels by incorporating a matrix-induced regularization [Liu et al., 2016], as formulated in

$$\min_{F \in \mathrm{Ind},\ \gamma^T \mathbf{1} = 1,\ \gamma_p \geq 0\ \forall p} \ \mathrm{Tr}\big(K_\gamma (I - F (F^T F)^{-1} F^T)\big) + \frac{\lambda}{2} \gamma^T M \gamma, \tag{9}$$

where $M$ denotes a matrix with $M(p, q) = \mathrm{Tr}(K_p^T K_q)$ and $\lambda$ denotes the regularization parameter. This problem can be solved by alternately updating $F$ and $\gamma$:

1) Update $F$ with $\gamma$ fixed. The solution of $F$ is the same as in MKKM.

2) Update $\gamma$ with $F$ fixed. Problem (9) becomes

$$\min_{\gamma^T \mathbf{1} = 1,\ \gamma_p \geq 0\ \forall p} \ \gamma^T \Big(D + \frac{\lambda}{2} M\Big) \gamma, \tag{10}$$

where $D$ is a diagonal matrix with the $p$-th diagonal element $D_{pp} = \mathrm{Tr}(K_p (I - F (F^T F)^{-1} F^T))$, and $\gamma$ can be optimized by solving the above quadratic programming problem with linear constraints. A minimal sketch of the relax-and-discretize pipeline shared by these models is given below.
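The following sketch (ours, assuming NumPy/scikit-learn and a precomputed fused kernel; it is not released code from any of the cited papers) makes the two-stage strategy concrete: the relaxed solution of problems (5)/(7) is read off from the top-$c$ eigenvectors of the kernel matrix, and a separate k-means run on its rows performs the discretization.

```python
# Sketch (ours) of the relax-and-discretize pipeline of Sections 2.1-2.3,
# assuming NumPy/scikit-learn and a precomputed (fused) kernel matrix K.
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kernel_kmeans(K, c, seed=0):
    """Relax-and-discretize solver for problems (4)-(5): returns hard labels."""
    # Stage 1: relaxed solution = eigenvectors of K for the c largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh((K + K.T) / 2)   # eigenvalues in ascending order
    F_tilde = eigvecs[:, -c:]                          # n x c, satisfies F_tilde^T F_tilde = I
    # Stage 2: extra discretization step, here plain k-means on the rows of F_tilde.
    return KMeans(n_clusters=c, n_init=10, random_state=seed).fit_predict(F_tilde)

# Toy usage with a fused kernel K_gamma = sum_p gamma_p^2 K_p and uniform gamma.
rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 5)) for _ in range(3)]   # three hypothetical views
Ks = [V @ V.T for V in views]                           # linear base kernels
gamma = np.full(len(Ks), 1.0 / len(Ks))
K_gamma = sum(g ** 2 * K for g, K in zip(gamma, Ks))
print(two_stage_kernel_kmeans(K_gamma, c=4))
```

The information loss criticized in the next section happens precisely at the hand-off between these two stages: the signs and scales of the relaxed eigenvectors carry no direct cluster semantics, and the final partition depends on the extra k-means run.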
3 The Proposed Discrete Multiple Kernel k-means (DMKKM)

In this section, we elaborate the Discrete Multiple Kernel k-means (DMKKM) model. We first present the formulation and then develop an efficient algorithm for its optimization.

3.1 The Proposed DMKKM Model

As mentioned above, the solutions of $F$ in KKM, MKKM and MKKM-MR comprise two independent stages: first learning the continuous relaxation $\tilde{F}$ and then recovering $F$ from $\tilde{F}$ by discretization procedures. Such two-stage solutions are likely to result in severe information loss and unsatisfactory clustering performance. To avoid this situation, we devise a novel multiple kernel k-means (DMKKM) model that directly generates the discrete cluster indicator matrix $F$.

In the multiple kernel setting, each data point $\phi_\alpha(x_i)$ is represented as $\phi_\alpha(x_i) = [\sqrt{\alpha_1}\, \phi_1(x_i); \dots; \sqrt{\alpha_v}\, \phi_v(x_i)]$, where $\alpha = [\alpha_1, \dots, \alpha_v]^T$ denotes the coefficients of the base kernels and is optimized during learning. $K_\alpha = \sum_{p=1}^{v} \alpha_p K_p$ is the fused kernel matrix with $(i, j)$-th element $K_\alpha(i, j) = \sum_{p=1}^{v} \alpha_p \phi_p(x_i)^T \phi_p(x_j)$. Our discrete multiple kernel k-means (DMKKM) model can be formulated as

$$\min_{F, \alpha} \ \big\|K_\alpha - F (F^T F)^{-1} F^T\big\|_F^2, \quad \text{s.t.}\ F \in \mathrm{Ind},\ \alpha^T \mathbf{1} = 1,\ \alpha_p \geq 0\ \forall p, \tag{11}$$

where $K_\alpha = K_\alpha^T$ is a symmetric matrix, $F \in \mathrm{Ind}$ denotes the cluster indicator matrix and $\|\cdot\|_F$ denotes the Frobenius norm. Since $F (F^T F)^{-1} F^T$ is a rank-$c$ orthogonal projection whose squared Frobenius norm equals the constant $c$, this term can be dropped and problem (11) reduces to

$$\min_{F, \alpha} \ \mathrm{Tr}(K_\alpha K_\alpha) - 2\, \mathrm{Tr}\big(F^T K_\alpha F (F^T F)^{-1}\big), \quad \text{s.t.}\ F \in \mathrm{Ind},\ \alpha^T \mathbf{1} = 1,\ \alpha_p \geq 0\ \forall p. \tag{12}$$

In problem (12), minimizing $\mathrm{Tr}(K_\alpha K_\alpha)$ is equivalent to minimizing $\sum_{p=1}^{v} \sum_{q=1}^{v} \alpha_p \alpha_q\, \mathrm{Tr}(K_p K_q)$, where $\mathrm{Tr}(K_p K_q)$ measures the correlation between $K_p$ and $K_q$: the larger the value of $\mathrm{Tr}(K_p K_q)$, the higher the correlation between $K_p$ and $K_q$, and vice versa. On the one hand, if $K_p$ and $K_q$ are highly correlated, minimizing $\alpha_p \alpha_q\, \mathrm{Tr}(K_p K_q)$ greatly reduces the risk of simultaneously assigning large weights to both $\alpha_p$ and $\alpha_q$. On the other hand, if $K_p$ and $K_q$ are less correlated, the term $\alpha_p \alpha_q\, \mathrm{Tr}(K_p K_q)$ is small, which greatly reduces the risk of simultaneously assigning small weights to both $\alpha_p$ and $\alpha_q$. Therefore, minimizing $\mathrm{Tr}(K_\alpha K_\alpha)$ significantly contributes to reducing the redundancy and enforcing the diversity of the selected kernels.

3.2 Optimization Algorithm

Problem (12) can be solved with an alternating optimization approach. Concretely, the following procedures alternately update $F$ and $\alpha$.

Step 1: Update $F$ with $\alpha$ fixed. Problem (12) becomes the objective

$$\max_{F \in \mathrm{Ind}} \ \mathrm{Tr}\big(F^T K_\alpha F (F^T F)^{-1}\big), \tag{13}$$

which is equivalent to the following vector form:

$$\max_{F \in \mathrm{Ind}} \ \sum_{l=1}^{c} \frac{f_l^T K_\alpha f_l}{f_l^T f_l}, \tag{14}$$

where $f_l$ denotes the $l$-th column of $F$, the cluster indicator matrix obtained from the latest iteration.
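Before deriving the row-update rule, the two reductions above can be checked numerically. The following NumPy sketch (ours) verifies that the objective of problem (11) equals that of problem (12) plus the constant $c$, and that the trace form in Eq. (13) equals the per-cluster ratio form in Eq. (14) for a one-hot $F$.

```python
# Numerical check (ours) of the reductions (11) -> (12) and (13) -> (14).
import numpy as np

rng = np.random.default_rng(0)
n, c, v = 20, 4, 3

Ks = []
for _ in range(v):
    A = rng.normal(size=(n, n))
    Ks.append(A @ A.T)                           # random PSD base kernels
alpha = rng.dirichlet(np.ones(v))                # alpha^T 1 = 1, alpha_p >= 0
K_alpha = sum(a * K for a, K in zip(alpha, Ks))

labels = np.arange(n) % c                        # arbitrary assignment, no empty cluster
F = np.zeros((n, c))
F[np.arange(n), labels] = 1.0
FtF_inv = np.linalg.inv(F.T @ F)
P = F @ FtF_inv @ F.T                            # projection onto the indicator subspace

obj11 = np.linalg.norm(K_alpha - P, 'fro') ** 2
obj12 = np.trace(K_alpha @ K_alpha) - 2 * np.trace(F.T @ K_alpha @ F @ FtF_inv)
assert np.isclose(obj11, obj12 + c)              # ||P||_F^2 = Tr(P) = c is the dropped constant

ratios = sum((F[:, l] @ K_alpha @ F[:, l]) / (F[:, l] @ F[:, l]) for l in range(c))
assert np.isclose(np.trace(F.T @ K_alpha @ F @ FtF_inv), ratios)   # Eq. (13) = Eq. (14)
```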
Now we obtain the discrete cluster indicator matrix $F$ directly by means of the coordinate descent technique [Wright, 2015], in which all variables are fixed except the $i$-th row of $F$, which is updated to its optimal value. When updating the $i$-th row of $F$, there are clearly $c$ possible assignments, $\{[1, 0, \dots, 0], \dots, [0, 0, \dots, 1]\}$, differing in the position of the element 1. To be specific, we denote by $F^{(s)}$, $s \in \{1, \dots, c\}$, the matrix whose $i$-th row has its $s$-th element equal to 1 and all other elements equal to 0; note that $F^{(s)}$ and $F$ are identical except for the $i$-th row. Hence, the objective for updating the $i$-th row can be expressed as

$$\max_{s \in \{1, \dots, c\}} \ \sum_{l=1}^{c} \frac{f_l^{(s)T} K_\alpha f_l^{(s)}}{f_l^{(s)T} f_l^{(s)}}, \tag{15}$$

where $f_l^{(s)}$ is the $l$-th column of $F^{(s)}$. It is viable to solve problem (15) by directly iterating through all $s$ to find the optimal one, but the computational cost of such a brute-force search is high. Next, we introduce how the computational burden can be skillfully reduced. Let us introduce a constant matrix $F^{(0)}$ whose $i$-th row is all zeros and whose remaining rows are identical to those of $F$. Then problem (15) is equivalent to

$$\max_{s \in \{1, \dots, c\}} \ \sum_{l=1}^{c} \left( \frac{f_l^{(s)T} K_\alpha f_l^{(s)}}{f_l^{(s)T} f_l^{(s)}} - \frac{f_l^{(0)T} K_\alpha f_l^{(0)}}{f_l^{(0)T} f_l^{(0)}} \right), \tag{16}$$

where $f_l^{(0)}$ is the $l$-th column of $F^{(0)}$. Since $F^{(s)}$ and $F^{(0)}$ are identical except for the $s$-th column, problem (16) can be simplified to

$$\max_{s \in \{1, \dots, c\}} \ L(s) = \frac{f_s^{(s)T} K_\alpha f_s^{(s)}}{f_s^{(s)T} f_s^{(s)}} - \frac{f_s^{(0)T} K_\alpha f_s^{(0)}}{f_s^{(0)T} f_s^{(0)}}. \tag{17}$$

In order to find the optimal $s$ of problem (17), we need to calculate its numerators and denominators for every $s \in \{1, 2, \dots, c\}$, but computing them directly is still time-consuming. Inspired by the fact that $F^{(s)}$ is closely related to $F$, we now demonstrate how to efficiently calculate the numerators and denominators in Eq. (17) by reusing previously calculated intermediate variables. For convenience, we denote the position of the element 1 in the $i$-th row of $F$ as $m$, i.e., $F^{(s)} = F$ when $s = m$. We discuss the two cases separately:

When $s = m$, we have $f_s^{(s)} = f_s$ and $f_s^{(0)} = f_s^{(s)} - \delta$, where $\delta \in \mathbb{R}^n$ is a column vector whose $i$-th element is 1 and all other elements are 0, so the numerators and denominators of Eq. (17) can be obtained by

$$f_s^{(s)T} K_\alpha f_s^{(s)} = f_s^T K_\alpha f_s, \tag{18}$$
$$f_s^{(s)T} f_s^{(s)} = f_s^T f_s, \tag{19}$$
$$f_s^{(0)T} K_\alpha f_s^{(0)} = f_s^T K_\alpha f_s - 2 f_s^T K_\alpha(:, i) + K_\alpha(i, i), \tag{20}$$
$$f_s^{(0)T} f_s^{(0)} = f_s^T f_s - 1, \tag{21}$$

where $K_\alpha(:, i)$ is the $i$-th column of $K_\alpha$ and $K_\alpha(i, i)$ is its $i$-th element, and thus we have

$$L(s) = \frac{f_s^T K_\alpha f_s}{f_s^T f_s} - \frac{f_s^T K_\alpha f_s - 2 f_s^T K_\alpha(:, i) + K_\alpha(i, i)}{f_s^T f_s - 1}. \tag{22}$$

When $s \neq m$, we have $f_s^{(0)} = f_s$ and $f_s^{(s)} = f_s + \delta$, so the numerators and denominators can be obtained by

$$f_s^{(s)T} K_\alpha f_s^{(s)} = f_s^T K_\alpha f_s + 2 f_s^T K_\alpha(:, i) + K_\alpha(i, i), \tag{23}$$
$$f_s^{(s)T} f_s^{(s)} = f_s^T f_s + 1, \tag{24}$$
$$f_s^{(0)T} K_\alpha f_s^{(0)} = f_s^T K_\alpha f_s, \tag{25}$$
$$f_s^{(0)T} f_s^{(0)} = f_s^T f_s, \tag{26}$$
$$L(s) = \frac{f_s^T K_\alpha f_s + 2 f_s^T K_\alpha(:, i) + K_\alpha(i, i)}{f_s^T f_s + 1} - \frac{f_s^T K_\alpha f_s}{f_s^T f_s}. \tag{27}$$

According to Eq. (22) or Eq. (27), we obtain the value of $s$ that maximizes $L(s)$ and denote it as $s^\star$, which gives the position of the element 1 in the $i$-th row. By repeatedly iterating over the rows, we finish updating $F$, as summarized in Algorithm 1. In practice, we stop the iteration once the increase rate of Eq. (13) falls below a threshold of $10^{-3}$.

Algorithm 1: Coordinate descent to solve problem (13)
Input: $K_\alpha \in \mathbb{R}^{n \times n}$, initial cluster label $F \in \mathrm{Ind}$
Output: Final cluster label $F \in \mathrm{Ind}$
1: Precompute $f_l^T K_\alpha f_l$ and $f_l^T f_l$ for $l \in \{1, \dots, c\}$.
2: while not converged do
3:   for $i = 1, \dots, n$ do
4:     Let $m$ be the location of the 1 in the $i$-th row of $F$.
5:     for $s = 1, \dots, c$ do
6:       Calculate $L(s)$ by Eq. (22) or Eq. (27).
7:     end for
8:     $s^\star \leftarrow \arg\max_s L(s)$.
9:     $F(i, s^\star) \leftarrow 1$, $F(i, m) \leftarrow 0$.
10:   end for
11: end while
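For concreteness, here is a Python sketch (ours, under our reading of Algorithm 1; it is not the authors' released code) of the row-wise coordinate descent, reusing the cached quantities $f_l^T K_\alpha f_l$ and $f_l^T f_l$ and scoring candidate moves with Eqs. (22) and (27). The handling of singleton clusters is an assumption of ours, since the paper does not discuss that corner case.

```python
# Sketch (ours) of Algorithm 1: coordinate descent over the rows of F for problem (13).
import numpy as np

def update_F(K_alpha, F, max_iter=50, tol=1e-3):
    """Coordinate descent for problem (13); returns the updated one-hot F (n x c)."""
    n, c = F.shape
    quad = np.array([F[:, l] @ K_alpha @ F[:, l] for l in range(c)])  # f_l^T K_alpha f_l
    size = F.sum(axis=0)                                              # f_l^T f_l
    prev_obj = np.sum(quad / size)                                    # objective of Eq. (13)/(14)
    for _ in range(max_iter):
        for i in range(n):
            m = int(np.argmax(F[i]))           # current cluster of sample i
            col = F.T @ K_alpha[:, i]          # f_l^T K_alpha(:, i) for every cluster l
            kii = K_alpha[i, i]
            # Eq. (27) for every s != m (the s == m entry is overwritten below).
            L = (quad + 2.0 * col + kii) / (size + 1.0) - quad / size
            if size[m] > 1:                    # Eq. (22) for s == m
                L[m] = quad[m] / size[m] - (quad[m] - 2.0 * col[m] + kii) / (size[m] - 1.0)
            else:
                L[m] = np.inf                  # our guard: never empty a singleton cluster
            s = int(np.argmax(L))
            if s != m:                         # move sample i from cluster m to cluster s
                quad[m] += kii - 2.0 * col[m]  # Eq. (20)
                quad[s] += kii + 2.0 * col[s]  # Eq. (23)
                size[m] -= 1.0
                size[s] += 1.0
                F[i, m], F[i, s] = 0.0, 1.0
        obj = np.sum(quad / size)
        if obj - prev_obj <= tol * abs(prev_obj):   # mirrors the 1e-3 stopping rule above
            break
        prev_obj = obj
    return F
```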
Step 2: Update $\alpha$ with $F$ fixed. Problem (12) becomes

$$\min_{\alpha} \ \sum_{p=1}^{v} \sum_{q=1}^{v} \alpha_p \alpha_q\, \mathrm{Tr}(K_p K_q) - \sum_{p=1}^{v} 2 \alpha_p\, \mathrm{Tr}\big(F^T K_p F (F^T F)^{-1}\big), \quad \text{s.t.}\ \alpha^T \mathbf{1} = 1,\ \alpha_p \geq 0\ \forall p. \tag{28}$$

For simplicity, let us introduce a matrix $M \in \mathbb{R}^{v \times v}$ whose $(p, q)$-th element is defined as $M(p, q) = \mathrm{Tr}(K_p K_q)$ and a vector $d = [\mathrm{Tr}(F^T K_1 F (F^T F)^{-1}), \dots, \mathrm{Tr}(F^T K_v F (F^T F)^{-1})]^T \in \mathbb{R}^{v}$. Problem (28) can then be transformed into the following quadratic programming (QP) problem with linear constraints:

$$\min_{\alpha} \ \alpha^T M \alpha - 2 d^T \alpha, \quad \text{s.t.}\ \alpha^T \mathbf{1} = 1,\ \alpha_p \geq 0\ \forall p, \tag{29}$$

which can be solved by any standard QP solver. The overall optimization procedure for our proposed DMKKM model is summarized in Algorithm 2.

Algorithm 2: The procedure to solve problem (11)
Input: Kernels $\{K_p \in \mathbb{R}^{n \times n}\}_{p=1}^{v}$, number of clusters $c$
Output: Final cluster label $F \in \mathrm{Ind}$
1: Let $\alpha = \mathbf{1}_v / v$, randomly initialize the cluster labels $F$.
2: while not converged do
3:   Update $F$ by Algorithm 1.
4:   Update $\alpha$ by Eq. (29).
5: end while
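A sketch (ours) of the $\alpha$-update of Eq. (29) is given below; the definitions of $M$, $d$ and the simplex constraint follow the text above, while the solver choice (SciPy's SLSQP standing in for the unspecified "standard QP solver") and the starting point are our assumptions.

```python
# Sketch (ours) of the alpha-update in Eq. (29) using a general-purpose constrained solver.
import numpy as np
from scipy.optimize import minimize

def update_alpha(Ks, F):
    """Solve min_a a^T M a - 2 d^T a  s.t.  a^T 1 = 1, a >= 0 (Eq. (29))."""
    v = len(Ks)
    FtF_inv = np.linalg.inv(F.T @ F)
    M = np.array([[np.trace(Kp @ Kq) for Kq in Ks] for Kp in Ks])       # M(p, q) = Tr(K_p K_q)
    d = np.array([np.trace(F.T @ Kp @ F @ FtF_inv) for Kp in Ks])       # d_p = Tr(F^T K_p F (F^T F)^-1)

    fun = lambda a: a @ M @ a - 2.0 * d @ a
    jac = lambda a: 2.0 * M @ a - 2.0 * d
    res = minimize(fun, x0=np.full(v, 1.0 / v), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * v,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}])
    return res.x
```

In our reading of Algorithm 2, alternating `update_F` from the previous sketch with `update_alpha` above, starting from $\alpha = \mathbf{1}_v / v$ and a random labeling, reproduces the overall procedure.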
3.3 Complexity Analysis

The time complexities of evaluating Eqs. (22) and (27) are both $O(n)$, so finding the optimal $s^\star$ for one row costs $O(nc)$; considering that $F$ contains only $n$ ones while all its remaining entries are zeros, this cost can be regarded as $O(n)$. Suppose Algorithm 1 converges after $t_1$ iterations; its time complexity is then $O(n^2 t_1)$. The time complexity of the rest of Algorithm 2 is dominated by calculating the vector $d$, which takes $O(n^2 v)$. Since the number of kernels $v$ is usually very small, the QP solver converges very fast in practice and its cost can be ignored. In summary, the time complexity of Algorithm 2 is $O((n^2 t_1 + n^2 v) t)$, where $t$ denotes the number of outer iterations. Our algorithm is highly efficient compared with other MKKM models that use the two-stage optimization, in which the time complexity of the eigenvalue decomposition alone is cubic in $n$.

4 Experiments

In this section, we evaluate the clustering performance of the proposed DMKKM model on a number of real-world datasets. Besides, we meticulously analyze the properties of the learned coefficients and show convergence curves to verify the efficiency of the algorithm.

| Name | # Samples | # Kernels | # Classes |
|---|---|---|---|
| Handwritten | 2000 | 6 | 10 |
| Pima | 768 | 8 | 2 |
| Protein Fold | 694 | 12 | 27 |
| SensIT Vehicle | 1500 | 2 | 3 |
| UCI DIGIT | 2000 | 3 | 10 |
| Washington | 230 | 2 | 5 |
| Wisconsin | 265 | 2 | 5 |

Table 1: Dataset descriptions

| Dataset | Metric | MKKM | OKKC | LMKKM | RMKKM | MKKM-MR | ONKC | MVC-LFA | DMKKM |
|---|---|---|---|---|---|---|---|---|---|
| Handwritten | ACC | 0.6825 | 0.6870 | 0.6550 | 0.6410 | 0.8910 | 0.9235 | **0.9355** | 0.9160 |
| | NMI | 0.6598 | 0.6703 | 0.6475 | 0.6742 | 0.8265 | 0.8454 | **0.8670** | 0.8472 |
| | ARI | 0.5411 | 0.5510 | 0.5033 | 0.5384 | 0.7811 | 0.8398 | **0.8631** | 0.8267 |
| Pima | ACC | 0.5716 | 0.5104 | 0.5208 | 0.5065 | 0.6536 | 0.6549 | 0.6471 | **0.6563** |
| | NMI | 0.0010 | 0.0005 | 0.0048 | 0.0000 | 0.0646 | 0.0753 | 0.0819 | **0.0824** |
| | ARI | 0.0073 | -0.0008 | -0.0006 | -0.0012 | 0.0928 | 0.0949 | 0.0852 | **0.0966** |
| Protein Fold | ACC | 0.2853 | 0.2997 | 0.2262 | 0.2882 | 0.3732 | 0.3732 | 0.3646 | **0.3862** |
| | NMI | 0.3507 | 0.3988 | 0.3537 | 0.3940 | 0.4454 | 0.4471 | 0.4280 | **0.4717** |
| | ARI | 0.1456 | 0.1385 | 0.0844 | 0.1425 | 0.1911 | 0.1927 | 0.1916 | **0.1932** |
| SensIT Vehicle | ACC | 0.5787 | 0.5567 | 0.4240 | 0.4120 | 0.6560 | 0.5433 | **0.6927** | 0.6813 |
| | NMI | 0.1389 | 0.1388 | 0.0674 | 0.0537 | 0.2113 | 0.1143 | **0.2681** | 0.2405 |
| | ARI | 0.1454 | 0.1584 | 0.0605 | 0.0473 | 0.2400 | 0.1149 | **0.2967** | 0.2739 |
| UCI DIGIT | ACC | 0.4225 | 0.4725 | 0.4780 | 0.4150 | 0.9090 | 0.9125 | 0.9020 | **0.9330** |
| | NMI | 0.4548 | 0.4813 | 0.4846 | 0.4687 | 0.8368 | 0.8425 | 0.8295 | **0.8715** |
| | ARI | 0.3272 | 0.3066 | 0.3111 | 0.3173 | 0.8131 | 0.8206 | 0.8008 | **0.8589** |
| Washington | ACC | 0.4478 | 0.4783 | 0.4957 | 0.4696 | 0.5609 | 0.6087 | 0.5870 | **0.6130** |
| | NMI | 0.0491 | 0.0571 | 0.1433 | 0.0433 | 0.3172 | 0.3494 | 0.3335 | **0.3799** |
| | ARI | 0.0662 | 0.0390 | 0.1203 | 0.0594 | 0.3086 | 0.3904 | 0.3561 | **0.4033** |
| Wisconsin | ACC | 0.5547 | 0.5623 | 0.4566 | 0.5245 | 0.5585 | **0.5962** | 0.5623 | 0.5925 |
| | NMI | 0.2288 | 0.2532 | 0.1629 | 0.2474 | 0.3323 | 0.3360 | 0.3505 | **0.3565** |
| | ARI | 0.2188 | 0.2028 | 0.0730 | 0.2339 | 0.2660 | 0.3073 | 0.2878 | **0.3166** |

Table 2: Clustering results on real-world benchmark datasets; the best ones are in bold.

4.1 Dataset Descriptions

Seven real-world benchmark datasets are employed to evaluate the clustering performance: Handwritten, Pima, Protein Fold, SensIT Vehicle, UCI DIGIT, Washington and Wisconsin. All of them are downloaded from Xinwang Liu's page (https://xinwangliu.github.io), and more details can be found in the corresponding published papers. Their statistical properties are summarized in Table 1.

4.2 Comparison Models

We compare our proposed DMKKM model with several multiple kernel k-means clustering models, including:

- Multiple Kernel k-means (MKKM) [Huang et al., 2011]: performs kernel k-means and updates the kernel coefficients alternately, as introduced in Section 2.2.
- Optimized Kernel k-means Clustering (OKKC) [Yu et al., 2011]: a concurrent work of MKKM that optimizes the kernel coefficients and cluster membership based on the same Rayleigh quotient objective.
- Localized Multiple Kernel k-means (LMKKM) [Gönen and Margolin, 2014]: learns the kernel coefficients with a localized approach.
- Robust Multiple Kernel k-means (RMKKM) [Du et al., 2015]: replaces the squared error term of k-means with an ℓ2,1-norm based one to improve robustness.
- Multiple Kernel k-means with Matrix-induced Regularization (MKKM-MR) [Liu et al., 2016]: introduces a matrix-induced regularization to reduce the redundancy and enhance the diversity of the selected kernels.
- Optimal Neighborhood Kernel Clustering (ONKC) [Liu et al., 2017]: allows the optimal kernel to reside in the neighborhood of the base kernels, enlarging the region from which an optimal kernel can be chosen.
- Multi-view Clustering via Late Fusion Alignment Maximization (MVC-LFA) [Wang et al., 2019b]: maximally aligns the consensus partition with the weighted base partitions, which is theoretically proven to be equivalent to minimizing the loss function of k-means clustering.

Among these models, MKKM, OKKC, LMKKM and DMKKM are parameter-free, while RMKKM and MKKM-MR have one hyperparameter, and ONKC and MVC-LFA have two hyperparameters.

4.3 Experimental Settings

For the sake of fairness, we perform grid search to determine the hyperparameters of all comparison models and report their best results, following the guidance of the corresponding papers, even though our DMKKM is a completely parameter-free model. The source codes are downloaded from the authors' pages or requested from the authors.

[Figure 1: Learned coefficients α_p and KKM accuracy of each kernel on four datasets, including (b) Protein Fold, (c) UCI DIGIT and (d) Wisconsin; the horizontal axes index the kernels.]

4.4 Result Analyses

Clustering performance. Table 2 presents the results of three widely adopted clustering performance metrics: clustering accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI). From the presented results we observe the following: a) Our proposed DMKKM model achieves the best clustering performance on 5 out of 7 datasets, indicating its superb performance. b) Compared with the other parameter-free models, our model not only consistently outperforms them on all datasets but does so by a large margin, showing excellent superiority. c) Compared with the parameterized models, our model surpasses them in most cases. MVC-LFA achieves better performance on the Handwritten and SensIT Vehicle datasets with a small lead; however, it has two hyperparameters and relies heavily on a rigorous grid search, which makes it impractical for production use. In contrast, our proposed model is totally parameter-free, which is much more feasible for production use.
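For reference, the three metrics can be computed as in the following sketch (ours, assuming SciPy and scikit-learn; it is not part of the paper): NMI and ARI come directly from scikit-learn, while ACC uses the optimal one-to-one matching between predicted clusters and ground-truth classes via the Hungarian algorithm.

```python
# Sketch (ours) of the three reported clustering metrics: ACC, NMI and ARI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Clustering accuracy under the best cluster-to-class matching."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    counts = np.zeros((clusters.size, classes.size))
    for ci, cl in enumerate(clusters):
        for ki, kl in enumerate(classes):
            counts[ci, ki] = np.sum((y_pred == cl) & (y_true == kl))
    row, col = linear_sum_assignment(-counts)        # maximize matched counts
    return counts[row, col].sum() / y_true.size

def evaluate(y_true, y_pred):
    return (clustering_acc(y_true, y_pred),
            normalized_mutual_info_score(y_true, y_pred),
            adjusted_rand_score(y_true, y_pred))
```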
Properties of learned coefficients. In this part, we analyze the properties of the learned coefficient $\alpha_p$ of each kernel. Kernels usually serve as different feature representations of the data samples, but some kernels may be more discriminative than others for a particular task and may thus contribute more to clustering performance. Such kernels deserve more significance, in other words more weight, so that a multiple kernel clustering model can achieve better performance. Arguably, a single-kernel clustering model such as KKM obtains better performance with a more distinctive kernel, so we use the clustering accuracy (ACC) of KKM as a metric to gauge the significance of each kernel: the higher the clustering accuracy KKM achieves, the more discriminative the kernel. Ideally, our proposed DMKKM model should assign larger $\alpha_p$ to the kernels on which KKM achieves higher ACC. We plot the learned $\alpha_p$ and the corresponding KKM accuracy of each kernel on four datasets in Figure 1. It can be seen that our model succeeds in assigning larger coefficients to the more discriminative kernels, so our proposed model is able to identify the more discriminative kernels among multiple kernels, which helps to improve the clustering performance.

[Figure 2: Objective value of Eq. (11) after each iteration on four datasets, including (b) Protein Fold, (c) UCI DIGIT and (d) Wisconsin; the horizontal axes show the number of iterations and the vertical axes show the objective values.]

Convergence analysis. To demonstrate the effectiveness of our optimization algorithm, we plot the convergence curves of Algorithm 2 on four datasets in Figure 2, where the x-axis denotes the number of iterations and the y-axis denotes the objective value of Eq. (11). The objective values decrease monotonically until Algorithm 2 converges, and all the curves plateau within 10 iterations, which shows that our optimization algorithm is highly efficient.

5 Conclusion

In this paper, we propose a novel multiple kernel k-means clustering model, namely DMKKM, together with an efficient iterative algorithm to solve it. Our model directly obtains the cluster indicator matrix without subsequent discretization steps, avoiding severe information loss. Moreover, our model is capable of measuring the correlation among kernels, which greatly enhances kernel fusion by reducing redundancy and improving diversity. What's more, our model is totally parameter-free, which makes it more feasible in practice. Extensive experiments on several real-world datasets demonstrate its superb performance and its potential in practical applications.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0101902, in part by the Natural Science Basic Research Program of Shaanxi (Program No. 2021JM-071), in part by the National Natural Science Foundation of China under Grant 61936014 and Grant 61772427, and in part by the Fundamental Research Funds for the Central Universities under Grant G2019KY0501.

References

[Calandriello and Rosasco, 2018] Daniele Calandriello and Lorenzo Rosasco. Statistical and computational trade-offs in kernel k-means. In Proc. NeurIPS, pages 9357–9367, 2018.
[Du et al., 2015] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen. Robust multiple kernel k-means using ℓ2,1-norm. In Proc. IJCAI, pages 3476–3482, 2015.

[Gönen and Margolin, 2014] Mehmet Gönen and Adam A. Margolin. Localized data fusion for kernel k-means clustering with application to cancer biology. In Proc. NIPS, pages 1305–1313, 2014.

[Hartigan and Wong, 1979] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, 28(1):100–108, 1979.

[He and Zhang, 2018] Li He and Hong Zhang. Kernel k-means sampling for Nyström approximation. IEEE Trans. Image Process., 27(5):2108–2120, 2018.

[Huang et al., 2011] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst., 20(1):120–134, 2011.

[Kang et al., 2018] Zhao Kang, Xiao Lu, Jinfeng Yi, and Zenglin Xu. Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification. In Proc. IJCAI, pages 2312–2318, 2018.

[Liu et al., 2016] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu. Multiple kernel k-means clustering with matrix-induced regularization. In Proc. AAAI, pages 1888–1894, 2016.

[Liu et al., 2017] Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, and Jianping Yin. Optimal neighborhood kernel clustering with multiple kernels. In Proc. AAAI, pages 2266–2272, 2017.

[Schölkopf et al., 1998] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[Van Laarhoven and Marchiori, 2016] Twan Van Laarhoven and Elena Marchiori. Local network community detection with continuous optimization of conductance and weighted kernel k-means. The Journal of Machine Learning Research, 17(1):5148–5175, 2016.

[Vankadara and Ghoshdastidar, 2020] Leena C. Vankadara and Debarghya Ghoshdastidar. On the optimality of kernels for high-dimensional clustering. In Proc. AISTATS, pages 2185–2195, 2020.

[Wang et al., 2019a] Shusen Wang, Alex Gittens, and Michael W. Mahoney. Scalable kernel k-means clustering with Nyström approximation: relative-error bounds. The Journal of Machine Learning Research, 20(1):431–479, 2019.

[Wang et al., 2019b] Siwei Wang, Xinwang Liu, En Zhu, Chang Tang, Jiyuan Liu, Jingtao Hu, Jingyuan Xia, and Jianping Yin. Multi-view clustering via late fusion alignment maximization. In Proc. IJCAI, pages 3778–3784, 2019.

[Wright, 2015] Stephen J. Wright. Coordinate descent algorithms. Math. Program., 151(1):3–34, 2015.

[Xu et al., 2017] Jinglin Xu, Junwei Han, Feiping Nie, and Xuelong Li. Re-weighted discriminatively embedded k-means for multi-view clustering. IEEE Trans. Image Process., 26(6):3016–3027, 2017.

[Yu et al., 2011] Shi Yu, Léon Tranchevent, Xinhai Liu, Wolfgang Glänzel, Johan A. K. Suykens, Bart De Moor, and Yves Moreau. Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(5):1031–1039, 2011.

[Zhao et al., 2009] Bin Zhao, James T. Kwok, and Changshui Zhang. Multiple kernel clustering. In Proc. SDM, pages 638–649, 2009.