Localized Incomplete Multiple Kernel k-means

Xinzhong Zhu1,7, Xinwang Liu2, Miaomiao Li2, En Zhu2, Li Liu3,4, Zhiping Cai2, Jianping Yin5 and Wen Gao6
1 College of Mathematics, Physics and Information Engineering, Zhejiang Normal University, China
2 School of Computer, National University of Defense Technology, Changsha, China
3 College of System Engineering, National University of Defense Technology, Changsha, China
4 University of Oulu, Finland
5 Dongguan University of Technology, Guangdong, China
6 School of Electronics Engineering and Computer Science, Peking University, Beijing, China
7 School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China
zxz@zjnu.edu.cn, xinwangliu@nudt.edu.cn, jpyin@dgut.edu.cn, wgao@pku.edu.cn

Abstract

The recently proposed multiple kernel k-means with incomplete kernels (MKKM-IK) optimally integrates a group of pre-specified incomplete kernel matrices to improve clustering performance. Although it demonstrates promising performance in various applications, we observe that it does not sufficiently consider the local structure among data and indiscriminately forces all pairwise sample similarities to align equally with their ideal similarity values. This can make the incomplete kernels less effectively imputed and, in turn, adversely affect the clustering performance. In this paper, we propose a novel localized incomplete multiple kernel k-means (LI-MKKM) algorithm to address this issue. Different from the existing MKKM-IK, LI-MKKM only requires the similarity of a sample to its k-nearest neighbors to align with the ideal similarity values. This helps the clustering algorithm focus on closer sample pairs that shall stay together and avoids involving unreliable similarity evaluations for farther sample pairs. We carefully design a three-step iterative algorithm to solve the resultant optimization problem and theoretically prove its convergence. Comprehensive experiments demonstrate that our algorithm significantly outperforms state-of-the-art comparable algorithms from the recent literature, verifying the advantage of considering the local structure.

1 Introduction

Multiple kernel clustering (MKC) aims to optimally combine a group of pre-specified base kernels to perform clustering, and it has been intensively studied during the past several years [Yu et al., 2012; Li et al., 2014; Gönen and Margolin, 2014; Liu et al., 2016; Li et al., 2016; Liu et al., 2017b; Zhang et al., 2015; Cao et al., 2015; Gao et al., 2015; Nie et al., 2014; Cai et al., 2013; Xu et al., 2015a]. A common assumption adopted by these MKC algorithms is that all the pre-specified base kernel matrices are complete. However, it is not uncommon that some views of a sample are absent in practical applications [Xiang et al., 2013; Kumar et al., 2013], which leaves the corresponding rows and columns of the related base kernel matrices unfilled. The presence of incomplete base kernel matrices makes it more challenging to utilize the information of all views for clustering. Many efforts have been devoted to addressing this issue [Ghahramani and Jordan, 1993; Trivedi et al., 2010; Yin et al., 2015; Xu et al., 2015b; Shao et al., 2015; Bhadra et al., 2016; Liu et al., 2017b]. They can roughly be grouped into two categories. The first category first fills the incomplete kernels with an imputation algorithm and then applies a standard MKC algorithm to these imputed kernels.
The widely used imputation algorithms include zero-filling, mean-value filling, k-nearest-neighbor filling and expectation-maximization (EM) filling [Ghahramani and Jordan, 1993]. In contrast, the other category integrates the imputation and clustering into a single optimization procedure, leading to a clustering-oriented imputation [Liu et al., 2017a]. In this way, the two procedures are seamlessly connected to achieve better clustering performance.

Although the aforementioned algorithms demonstrate promising clustering performance in various applications, we observe that, whether they optimize imputation and clustering separately or jointly, they do not sufficiently consider the local structure among data, which is crucial for unsupervised learning tasks such as clustering [Li et al., 2016]. As a consequence, the incomplete kernels may be less effectively imputed, which in turn adversely affects the clustering performance. To address this issue, we propose to integrate the local structure among data into incomplete multiple kernel clustering, with the aim of further improving the clustering performance. As an instantiation, we design a localized incomplete multiple kernel k-means (LI-MKKM) algorithm and build it upon the latest incomplete multiple kernel clustering framework developed in the literature [Liu et al., 2017a]. Our algorithm inherits the advantage of [Liu et al., 2017a] that unifies the imputation and clustering into a single optimization procedure: the clustering result at the last iteration guides the imputation of the absent kernel elements, and the imputed kernels are used in turn to conduct the subsequent clustering. These two learning processes negotiate with each other to achieve the optimal clustering. More importantly, different from [Liu et al., 2017a], which rigidly forces closer and farther sample pairs to be equally aligned to the same ideal similarity, our algorithm only requires that the similarity of a sample to its k-nearest neighbours be aligned with the ideal similarity matrix. Such an alignment helps the clustering algorithm better focus on closer sample pairs that shall stay together and avoids involving unreliable similarity evaluations for farther sample pairs. The optimization objective of our algorithm is carefully designed, and an efficient algorithm with proven convergence is developed to solve the resultant optimization problem. An extensive experimental study is carried out on eight multiple kernel learning (MKL) benchmark data sets to evaluate the clustering performance of the proposed algorithm. As demonstrated, our algorithm significantly outperforms existing two-stage imputation methods and the recently proposed algorithm of [Liu et al., 2017a], validating the advantage of incorporating the local structure of data.

The main contributions of this paper are briefly summarized as follows. i) We, for the first time, identify the global kernel alignment issue in incomplete multiple kernel clustering and propose an effective solution; ii) We develop a general parametrization model to impute incomplete kernel matrices with theoretically proven feasibility; and iii) We conduct extensive experiments to validate our identification of this issue and the effectiveness of our solution.
2 Related Work

Multiple advanced algorithms have recently been proposed to address incomplete multiple kernel clustering [Trivedi et al., 2010; Xu et al., 2015b; Shao et al., 2015; Yin et al., 2015; Bhadra et al., 2016]. With the help of one complete view, the work in [Trivedi et al., 2010] constructs a complete kernel matrix for the other, incomplete view. The work in [Xu et al., 2015b] proposes an algorithm for multi-view learning with incomplete views that exploits the connections between multiple views, where the different views are assumed to be generated from a shared subspace. A multi-incomplete-view clustering algorithm is proposed in [Shao et al., 2015]; it learns latent feature matrices for all the views and generates a consensus matrix by minimizing the difference between each view and the consensus. In addition, the approach in [Bhadra et al., 2016] predicts the missing rows and columns of a base kernel by modelling both within-view and between-view relationships among kernel values. One drawback shared by the above-mentioned two-stage algorithms is that the processes of imputation and clustering are disconnected, which prevents the two learning processes from negotiating with each other to achieve the optimal clustering. To overcome this drawback, pioneering work has been proposed to integrate imputation and clustering into a single optimization procedure [Liu et al., 2017a], in which the two procedures interact to achieve better clustering performance. In the following, we give an introduction to the newly proposed MKKM with incomplete kernels [Liu et al., 2017a], upon which we develop our novel algorithm.

Let $s_p$ denote the indices of the available samples of the $p$-th ($1 \le p \le m$) view and $\mathbf{K}_p^{(cc)} \in \mathbb{R}^{n_p \times n_p}$ be the kernel sub-matrix computed with these observed samples, where $n_p$ is the length of $s_p$. The recently proposed MKKM-IK [Liu et al., 2017a] optimally integrates these incomplete kernel matrices $\{\mathbf{K}_p^{(cc)}\}_{p=1}^{m}$ to improve clustering performance. It unifies the imputation and clustering procedures into a single optimization objective and alternately optimizes each of them, which is mathematically formulated as

$$\min_{\mathbf{H},\,\boldsymbol{\beta},\,\{\mathbf{K}_p\}_{p=1}^{m}} \ \mathrm{Tr}\big(\mathbf{K}_{\beta}(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\big) \quad \mathrm{s.t.}\ \ \mathbf{H}\in\mathbb{R}^{n\times k},\ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k,\ \boldsymbol{\beta}^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0,\ \mathbf{K}_p(s_p, s_p) = \mathbf{K}_p^{(cc)},\ \mathbf{K}_p \succeq 0,\ \forall p, \tag{1}$$

where $\mathbf{K}_{\beta} = \sum_{p=1}^{m}\beta_p^2\mathbf{K}_p$, $\mathbf{H}$ is the clustering matrix, and $n$ and $k$ are the numbers of samples to be clustered and of clusters, respectively. The constraint $\mathbf{K}_p(s_p, s_p) = \mathbf{K}_p^{(cc)}$ is imposed to ensure that $\mathbf{K}_p$ maintains the known entries during the optimization. MKKM-IK develops a three-step alternate optimization algorithm to solve Eq.(1), which simultaneously imputes the missing entries of the base kernels and performs clustering. Interested readers are referred to [Liu et al., 2017a].

Although the unification of clustering and imputation into a single procedure is elegant, it is implemented by globally maximizing the alignment between the combined kernel matrix $\mathbf{K}_{\beta}$ and the ideal kernel matrix $\mathbf{H}\mathbf{H}^{\top}$, as shown in Eq.(1). This criterion does not appropriately exploit the local distribution of data and indiscriminately forces closer and farther sample pairs to be equally aligned to the same ideal similarity. As a result, the base kernels may be less effectively utilized, which in turn adversely affects the clustering performance. In the following, we design a novel algorithm, termed localized incomplete multiple kernel k-means (LI-MKKM), to address these issues.
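For concreteness, the following NumPy sketch (our illustration, not the authors' code) evaluates the combined kernel $\mathbf{K}_\beta$ and the global alignment loss of Eq.(1) for a given set of imputed base kernels, kernel weights and clustering matrix.

```python
# A minimal NumPy sketch (ours, not the authors' implementation) of the
# MKKM-IK objective in Eq.(1): Tr(K_beta (I_n - H H^T)) for given imputed
# base kernels, kernel weights beta and an orthonormal clustering matrix H.
import numpy as np

def combined_kernel(kernels, beta):
    """K_beta = sum_p beta_p^2 * K_p."""
    return sum((b ** 2) * K for b, K in zip(beta, kernels))

def mkkm_ik_objective(kernels, beta, H):
    """Alignment loss of Eq.(1); smaller means K_beta aligns better with H H^T."""
    n = H.shape[0]
    K_beta = combined_kernel(kernels, beta)
    return np.trace(K_beta @ (np.eye(n) - H @ H.T))
```

For fixed kernels and weights, minimizing this loss over orthonormal $\mathbf{H}$ is exactly kernel k-means on $\mathbf{K}_\beta$, which is why the criterion treats all sample pairs, near or far, in the same global manner.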
3 The Proposed LI-MKKM

Although it is well recognized that preserving the local structure of data is crucial in unsupervised learning tasks such as clustering analysis [Liu et al., 2014], this is not sufficiently considered in MKKM with incomplete kernel matrices. In light of this, we propose to incorporate the local structure of data by locally aligning the similarity of each sample to its $\tau$-nearest neighbours with the corresponding ideal kernel matrix, which is flexible and able to well handle intra-cluster variations.

Let $\mathbf{N}^{(i)} \in \{0,1\}^{n\times\tau}$ ($1 \le i \le n$) denote the neighborhood indication matrix of the $i$-th sample, where $\mathbf{N}^{(i)}_{jv} = 1$ denotes that $\mathbf{x}_j$ is the $v$-th nearest neighbor of $\mathbf{x}_i$, with $1 \le v \le \tau$ and $\tau$ the number of nearest neighbors. The local kernel alignment for the $i$-th sample is calculated as $\big\langle (\mathbf{N}^{(i)})^{\top}\mathbf{K}_{\beta}\mathbf{N}^{(i)},\ (\mathbf{N}^{(i)})^{\top}(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\mathbf{N}^{(i)}\big\rangle$. As seen, this local kernel alignment takes the sub-matrix of $\mathbf{K}_{\beta}$ corresponding to the neighbors of the $i$-th sample and aligns it with the corresponding ideal sub-matrix. By summing the local kernel alignments over all samples and defining $\mathbf{B}^{(i)} = \mathbf{N}^{(i)}(\mathbf{N}^{(i)})^{\top}$, we obtain the objective of localized incomplete MKKM (LI-MKKM) as follows,

$$\min_{\boldsymbol{\beta},\,\{\mathbf{K}_p\}_{p=1}^{m},\,\mathbf{H}} \ \sum_{i=1}^{n}\mathrm{Tr}\big(\mathbf{K}_{\beta}(\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)})\big) \quad \mathrm{s.t.}\ \ \mathbf{H}\in\mathbb{R}^{n\times k},\ \mathbf{H}^{\top}\mathbf{H}=\mathbf{I}_k,\ \boldsymbol{\beta}^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0,\ \mathbf{K}_p(s_p, s_p) = \mathbf{K}_p^{(cc)},\ \mathbf{K}_p \succeq 0,\ \forall p. \tag{2}$$

In the following, we carefully design a three-step algorithm to solve the optimization problem in Eq.(2).

Optimizing H with fixed β and {K_p}. Given $\boldsymbol{\beta}$ and $\{\mathbf{K}_p\}_{p=1}^{m}$, the optimization w.r.t. $\mathbf{H}$ in Eq.(2) reduces to

$$\max_{\mathbf{H}} \ \mathrm{Tr}\Big(\mathbf{H}^{\top}\sum_{i=1}^{n}\big(\mathbf{B}^{(i)}\mathbf{K}_{\beta}\mathbf{B}^{(i)}\big)\mathbf{H}\Big) \quad \mathrm{s.t.}\ \ \mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k, \tag{3}$$

which is a conventional kernel k-means optimization problem and is readily solved by existing off-the-shelf packages.

Optimizing {K_p} with fixed β and H. Given $\boldsymbol{\beta}$ and $\mathbf{H}$, the optimization w.r.t. $\{\mathbf{K}_p\}_{p=1}^{m}$ in Eq.(2) is

$$\min_{\{\mathbf{K}_p\}_{p=1}^{m}} \ \sum_{p=1}^{m}\beta_p^2\,\mathrm{Tr}\Big(\mathbf{K}_p\sum_{i=1}^{n}\big(\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)}\big)\Big) \quad \mathrm{s.t.}\ \ \mathbf{K}_p(s_p, s_p) = \mathbf{K}_p^{(cc)},\ \mathbf{K}_p \succeq 0,\ \forall p. \tag{4}$$

Directly solving the optimization problem in Eq.(4) appears to be computationally intractable because it involves multiple kernel matrices. Looking into this problem, we find that the constraints are defined separately on each $\mathbf{K}_p$ and that the objective function is a sum over the $\mathbf{K}_p$. Therefore, the problem in Eq.(4) can be equivalently rewritten as $m$ independent sub-problems, as stated in Eq.(5),

$$\min_{\mathbf{K}_p} \ \mathrm{Tr}(\mathbf{K}_p\mathbf{T}) \quad \mathrm{s.t.}\ \ \mathbf{K}_p(s_p, s_p) = \mathbf{K}_p^{(cc)},\ \mathbf{K}_p \succeq 0, \tag{5}$$

where $\mathbf{T} = \sum_{i=1}^{n}(\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)})$. At first glance, it seems that the equality and PSD constraints imposed on $\mathbf{K}_p$ make Eq.(5) difficult to solve. To solve this problem efficiently, we propose to parameterize each $\mathbf{K}_p$ as

$$\mathbf{K}_p = \begin{bmatrix} \mathbf{K}_p^{(cc)} & \mathbf{K}_p^{(cc)}\mathbf{W}_p \\ \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)} & \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)}\mathbf{W}_p \end{bmatrix}, \tag{6}$$

where $\mathbf{W}_p \in \mathbb{R}^{c\times m}$, with $c$ and $m$ here denoting the numbers of observed and missing samples of the $p$-th view. That is, the missing kernel entries are assumed to be representable by the observed ones. The following Theorem 3.1 shows that $\mathbf{K}_p$ in Eq.(6) satisfies both constraints under this parametrization.

Theorem 3.1 Any $\mathbf{K}_p$ of the form in Eq.(6) is feasible for the optimization problem in Eq.(5).

Proof 1 Firstly, it is not difficult to check that the equality constraint is satisfied. Further, for any vector $\mathbf{x}\in\mathbb{R}^n$ partitioned as $\mathbf{x} = [\mathbf{x}_c^{\top}, \mathbf{x}_m^{\top}]^{\top}$, we have

$$\mathbf{x}^{\top}\mathbf{K}_p\mathbf{x} = [\mathbf{x}_c^{\top}, \mathbf{x}_m^{\top}]\begin{bmatrix} \mathbf{K}_p^{(cc)} & \mathbf{K}_p^{(cc)}\mathbf{W}_p \\ \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)} & \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)}\mathbf{W}_p \end{bmatrix}[\mathbf{x}_c^{\top}, \mathbf{x}_m^{\top}]^{\top} = (\mathbf{x}_c + \mathbf{W}_p\mathbf{x}_m)^{\top}\mathbf{K}_p^{(cc)}(\mathbf{x}_c + \mathbf{W}_p\mathbf{x}_m) \ge 0. \tag{7}$$

This verifies that the parametrization of $\mathbf{K}_p$ is PSD, which completes the proof.
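Before turning to the closed-form solution of Eq.(5), we illustrate the H-update of Eq.(3) with a small NumPy sketch (ours, under stated assumptions, not the authors' code). Since each column of $\mathbf{N}^{(i)}$ is a standard basis vector, $\mathbf{B}^{(i)}$ is a diagonal 0/1 projection onto the neighborhood of sample $i$, so $\sum_i \mathbf{B}^{(i)}\mathbf{K}_\beta\mathbf{B}^{(i)}$ simply accumulates the neighborhood sub-matrices of $\mathbf{K}_\beta$; how exactly the neighbours are picked (largest kernel values, sample included) is our assumption.

```python
# A sketch of the H-update of Eq.(3): H is the top-k eigenvector matrix of
# sum_i B^(i) K_beta B^(i), where B^(i) projects onto the tau nearest
# neighbours of sample i.
import numpy as np

def neighborhoods(K, tau):
    """Indices of the tau most similar samples for each row of K
    (assumption: neighbours = tau largest kernel values, sample itself included)."""
    return [np.argsort(K[i])[::-1][:tau] for i in range(K.shape[0])]

def update_H(K_beta, nbrs, k):
    """Solve Eq.(3): accumulate neighbourhood sub-matrices, then take the
    top-k eigenvectors, which satisfy H^T H = I_k."""
    n = K_beta.shape[0]
    M = np.zeros((n, n))
    for idx in nbrs:
        M[np.ix_(idx, idx)] += K_beta[np.ix_(idx, idx)]
    _, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return vecs[:, -k:]
```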
Based on Theorem 3.1, the optimization problem in Eq.(5) is equivalent to the following unconstrained one,

$$\min_{\mathbf{W}_p} \ \mathrm{Tr}\left(\begin{bmatrix} \mathbf{K}_p^{(cc)} & \mathbf{K}_p^{(cc)}\mathbf{W}_p \\ \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)} & \mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)}\mathbf{W}_p \end{bmatrix}\begin{bmatrix} \mathbf{T}^{(cc)} & \mathbf{T}^{(cm)} \\ \mathbf{T}^{(cm)\top} & \mathbf{T}^{(mm)} \end{bmatrix}\right), \tag{8}$$

where the matrix $\mathbf{T}$ is written in the blocked form $\begin{bmatrix} \mathbf{T}^{(cc)} & \mathbf{T}^{(cm)} \\ \mathbf{T}^{(cm)\top} & \mathbf{T}^{(mm)} \end{bmatrix}$. Expanding the trace, Eq.(8) equals $\mathrm{Tr}(\mathbf{K}_p^{(cc)}\mathbf{T}^{(cc)}) + 2\,\mathrm{Tr}(\mathbf{K}_p^{(cc)}\mathbf{W}_p\mathbf{T}^{(cm)\top}) + \mathrm{Tr}(\mathbf{W}_p^{\top}\mathbf{K}_p^{(cc)}\mathbf{W}_p\mathbf{T}^{(mm)})$; taking the derivative with respect to $\mathbf{W}_p$ and setting it to zero gives

$$\mathbf{W}_p = -\mathbf{T}^{(cm)}(\mathbf{T}^{(mm)})^{-1}. \tag{9}$$

By substituting $\mathbf{W}_p$ from Eq.(9) into Eq.(6), we obtain a closed-form expression for the optimal $\mathbf{K}_p$. It is worth pointing out that Theorem 3.1 provides a general parametrization model for imputing incomplete kernel matrices. In fact, the imputation in MKKM-IK [Liu et al., 2017a] can be treated as a special case of Eq.(6). Regularization such as a low-rank constraint on $\mathbf{W}_p$ in Eq.(6) could be incorporated to further improve the clustering performance in future work.

As observed, Eq.(5) exploits the local structure of data via $\mathbf{T}$ to guide the imputation of each base kernel matrix. It locally aligns the similarity of each sample to its $\tau$-nearest neighbors with the corresponding ideal kernel matrix, which is flexible and able to well handle intra-cluster variations. In this way, the incomplete kernels can be more effectively imputed, leading to improved clustering performance.

Optimizing β with fixed {K_p} and H. Given $\{\mathbf{K}_p\}_{p=1}^{m}$ and $\mathbf{H}$, it is not difficult to show that Eq.(2) w.r.t. $\boldsymbol{\beta}$ reduces to

$$\min_{\boldsymbol{\beta}} \ \sum_{p=1}^{m} z_p\beta_p^2 \quad \mathrm{s.t.}\ \ \boldsymbol{\beta}^{\top}\mathbf{1}_m = 1,\ \beta_p \ge 0, \tag{10}$$

where $z_p = \mathrm{Tr}(\mathbf{K}_p\mathbf{V})$ and $\mathbf{V} = \sum_{i=1}^{n}(\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)})$. The optimization in Eq.(10) has a closed-form solution if $z_p \ge 0$ ($1 \le p \le m$). The following Theorem 3.2 shows that this optimization problem can be solved analytically.

Theorem 3.2 The optimization in Eq.(10) has the closed-form solution

$$\beta_p = w_p\Big/\sum\nolimits_{q=1}^{m} w_q, \ \ \forall p, \tag{11}$$

where $w_p = 1/z_p$.

Algorithm 1 Proposed LI-MKKM
1: Input: $\{\mathbf{K}_p^{(cc)}, s_p\}_{p=1}^{m}$, $k$, $\tau$ and $\epsilon_0$.
2: Output: $\mathbf{H}$, $\boldsymbol{\beta}$ and $\{\mathbf{K}_p\}_{p=1}^{m}$.
3: Initialize $\boldsymbol{\beta}^{(0)} = \mathbf{1}_m/m$, $\{\mathbf{K}_p^{(0)}\}_{p=1}^{m}$ and $t = 1$.
4: Generate $\mathbf{B}^{(i)}$ for each sample ($1 \le i \le n$) from $\mathbf{K}_{\beta^{(0)}}$.
5: repeat
6: $\quad \mathbf{K}_{\beta^{(t)}} = \sum_{p=1}^{m}\big(\beta_p^{(t-1)}\big)^2\mathbf{K}_p^{(t-1)}$.
7: $\quad$ Update $\mathbf{H}^{(t)}$ by solving Eq.(3) with $\mathbf{K}_{\beta^{(t)}}$.
8: $\quad$ Update $\{\mathbf{K}_p^{(t)}\}_{p=1}^{m}$ with $\mathbf{H}^{(t)}$ by Eq.(5).
9: $\quad$ Update $\boldsymbol{\beta}^{(t)}$ by solving Eq.(10) with $\mathbf{H}^{(t)}$ and $\{\mathbf{K}_p^{(t)}\}_{p=1}^{m}$.
10: $\quad t = t + 1$.
11: until $\big(\mathrm{obj}^{(t-1)} - \mathrm{obj}^{(t)}\big)/\mathrm{obj}^{(t)} \le \epsilon_0$

Proof 2 By denoting $\mathbf{H} = [\mathbf{h}_1, \cdots, \mathbf{h}_k]$, we see that $\mathbf{H}\mathbf{H}^{\top}\mathbf{h}_c = \mathbf{h}_c$ ($1 \le c \le k$) since $\mathbf{H}^{\top}\mathbf{H} = \mathbf{I}_k$. This means that $\mathbf{H}\mathbf{H}^{\top}$ has $k$ eigenvalues equal to 1. Meanwhile, its rank is at most $k$, which implies that it has $n-k$ eigenvalues equal to 0. Correspondingly, $\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top}$ has $n-k$ eigenvalues equal to 1 and $k$ equal to 0. As a result, $\mathbf{B}^{(i)}(\mathbf{I}_n - \mathbf{H}\mathbf{H}^{\top})\mathbf{B}^{(i)}$ is PSD; noting that $\mathbf{B}^{(i)}$ is idempotent, this matrix equals $\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)}$, which guarantees that $\mathbf{V} = \sum_{i=1}^{n}(\mathbf{B}^{(i)} - \mathbf{B}^{(i)}\mathbf{H}\mathbf{H}^{\top}\mathbf{B}^{(i)})$ is PSD. Therefore, $z_p = \mathrm{Tr}(\mathbf{K}_p\mathbf{V}) \ge 0$, $\forall p$, since both $\mathbf{K}_p$ and $\mathbf{V}$ are PSD. The proof is completed by taking the derivative of the Lagrangian function of Eq.(10) with respect to $\beta_p$ and setting it to zero.

In sum, our algorithm for solving Eq.(2) is outlined in Algorithm 1, where the absent elements of $\{\mathbf{K}_p^{(0)}\}_{p=1}^{m}$ are initially imputed with zeros and $\mathrm{obj}^{(t)}$ denotes the objective value at the $t$-th iteration. It is worth pointing out that the neighborhood of each sample is kept unchanged during the optimization; the $\tau$-nearest neighbors of each sample are determined by $\mathbf{K}_{\beta^{(0)}}$. By doing so, the objective of Algorithm 1 is guaranteed to decrease monotonically when optimizing one variable with the others fixed at each iteration. At the same time, the objective is lower-bounded by zero. As a result, our algorithm is guaranteed to converge to a local minimum. Also, as shown in the experimental study, it usually converges in fewer than 10 iterations.
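To make the two closed-form updates concrete, here is a small NumPy sketch (our illustration under the stated assumptions, not the authors' code) of the $\mathbf{K}_p$ imputation of Eqs.(6) and (9) and the $\boldsymbol{\beta}$ update of Eq.(11). The observed indices $s_p$ are assumed to be given as an integer array, ordered consistently with the rows/columns of $\mathbf{K}_p^{(cc)}$, and a pseudo-inverse is used to guard against a singular $\mathbf{T}^{(mm)}$.

```python
import numpy as np

def impute_kernel(K_cc, obs, T):
    """Closed-form K_p of Eqs.(6) and (9). Eq.(6) is stated with observed
    samples ordered first; this sketch scatters the blocks back into the
    original sample ordering given the observed indices `obs` (= s_p)."""
    n = T.shape[0]
    mis = np.setdiff1d(np.arange(n), obs)
    T_cm = T[np.ix_(obs, mis)]
    T_mm = T[np.ix_(mis, mis)]
    W = -T_cm @ np.linalg.pinv(T_mm)        # Eq.(9); pinv guards singularity
    K = np.zeros((n, n))
    K[np.ix_(obs, obs)] = K_cc
    K[np.ix_(obs, mis)] = K_cc @ W
    K[np.ix_(mis, obs)] = (K_cc @ W).T
    K[np.ix_(mis, mis)] = W.T @ K_cc @ W    # PSD by Theorem 3.1
    return K

def update_beta(kernels, V):
    """Closed-form beta of Eq.(11): beta_p proportional to 1/z_p, z_p = Tr(K_p V).
    z_p >= 0 by Proof 2; a tiny floor avoids division by zero in degenerate cases."""
    z = np.array([np.trace(K @ V) for K in kernels])
    w = 1.0 / np.maximum(z, 1e-12)
    return w / w.sum()
```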
We conclude this section by discussing the computational complexity of the proposed algorithm. Specifically, the complexity of our algorithm is $O(n^3 + \sum_{p=1}^{m} n_p^3)$ per iteration, where $n_p$ ($n_p \le n$) is the number of observed samples of $\mathbf{K}_p$. This is comparable to that of the existing MKKM-IK [Liu et al., 2017a]. Furthermore, it is worth pointing out that the $\mathbf{K}_p$ can be trivially computed in parallel, because each of them is independent; in this way, our algorithm scales well with respect to the number of base kernels. Meanwhile, although the kernel alignment is optimized in a localized way, the resultant computational complexity is not altered significantly and incurs little extra computation compared with the existing MKKM-IK [Liu et al., 2017a], as validated by the running time comparison in Table 3.

Dataset      #Samples  #Kernels  #Classes
Flower17     1360      7         17
Flower102    8189      4         102
Calt102-5    510       48        102
Calt102-10   1020      48        102
Calt102-15   1530      48        102
Calt102-20   2040      48        102
Calt102-25   2550      48        102
Calt102-30   3060      48        102
Table 2: Datasets used in our experiments.

4 Experiments

4.1 Experimental Settings

The proposed algorithm is experimentally evaluated on the eight widely used MKL benchmark data sets shown in Table 2: Oxford Flower17 and Flower102 (http://www.robots.ox.ac.uk/~vgg/data/flowers/) and Caltech102 (http://files.is.tue.mpg.de/pgehler/projects/iccv09/). For these datasets, all kernel matrices are pre-computed and can be publicly downloaded from the above websites. Caltech102-5 means that the number of samples belonging to each cluster is 5, and so on. We compare the proposed algorithm with several commonly used imputation methods, including zero filling (ZF), mean filling (MF), k-nearest-neighbor filling (KNN) and the alignment-maximization filling (AF) proposed in [Trivedi et al., 2010]. The widely used MKKM [Gönen and Margolin, 2014] is then applied with these imputed base kernels. These two-stage methods are termed MKKM+ZF, MKKM+MF, MKKM+KNN and MKKM+AF, respectively. In addition, we compare with the newly proposed MKKM-IK [Liu et al., 2017a], which jointly optimizes the imputation and clustering. We do not include the algorithms in [Xu et al., 2015b; Shao et al., 2015; Zhao et al., 2016] in our experimental comparison since they only consider the absence of input features, not of rows/columns of base kernels.

For all data sets, it is assumed that the true number of clusters is known, and it is set as the true number of classes. We follow the approach in [Liu et al., 2017a] to generate the missing index vectors $\{s_p\}_{p=1}^{m}$. The parameter $\varepsilon$, termed the missing ratio in this experiment, controls the percentage of samples that have absent views, and it affects the performance of the algorithms in comparison. To show this point in depth, we compare these algorithms with respect to $\varepsilon$, which is set as $[0.1 : 0.1 : 0.9]$ on all data sets. The widely used clustering accuracy (ACC) and normalized mutual information (NMI) are applied to evaluate the clustering performance. For all algorithms, we repeat each experiment 50 times with random initialization to reduce the effect of the randomness caused by k-means, and report the best result. Meanwhile, we randomly generate the incomplete patterns 10 times in the above-mentioned way and report the statistical results. The aggregated ACC and NMI are used to evaluate the goodness of the algorithms in comparison.
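The two metrics named above are standard; the following sketch (ours, a hedged illustration rather than the authors' evaluation script) shows how they are commonly computed, with ACC obtained via the usual Hungarian matching of predicted clusters to ground-truth classes.

```python
# Clustering accuracy (ACC) with Hungarian matching, and NMI from scikit-learn.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly assigned under the best one-to-one
    mapping between predicted clusters and ground-truth classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # contingency table: rows = predicted clusters, columns = true classes
    cont = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            cont[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[row, col].sum() / y_true.size

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```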
Taking the aggregated ACC as an example, it is obtained by averaging, over the different values of $\varepsilon$, the mean ACC achieved by an algorithm at each $\varepsilon$.

[Figure 1: ACC and NMI comparison with the variation of missing ratios on Flower17, Flower102, Caltech102-25 and Caltech102-30. The curves on other datasets are similar and omitted due to space limit. Each panel compares MKKM+ZF, MKKM+MF, MKKM+KNN, MKKM+AF, MKKM-IK and LI-MKKM as the missing ratio varies from 0.1 to 0.9.]

ACC
Datasets     MKKM+ZF     MKKM+MF     MKKM+KNN    MKKM+AF [Trivedi et al., 2010]    MKKM-IK [Liu et al., 2017a]    LI-MKKM (Proposed)
Flower17     36.9 ± 0.8  36.8 ± 0.6  37.8 ± 0.6  40.5 ± 0.7  44.6 ± 0.6  47.9 ± 0.3
Flower102    18.0 ± 0.2  18.0 ± 0.2  18.2 ± 0.1  19.2 ± 0.1  21.1 ± 0.2  23.1 ± 0.2
Calt102-5    26.1 ± 0.3  25.7 ± 0.3  27.3 ± 0.3  29.0 ± 0.3  28.9 ± 0.3  31.3 ± 0.3
Calt102-10   19.7 ± 0.2  19.7 ± 0.2  21.5 ± 0.2  22.6 ± 0.2  22.7 ± 0.2  27.1 ± 0.3
Calt102-15   17.1 ± 0.2  17.1 ± 0.2  18.9 ± 0.1  20.3 ± 0.2  20.8 ± 0.2  25.0 ± 0.2
Calt102-20   15.7 ± 0.1  15.7 ± 0.2  17.3 ± 0.2  18.9 ± 0.2  19.5 ± 0.2  24.0 ± 0.2
Calt102-25   14.7 ± 0.2  14.6 ± 0.1  16.2 ± 0.1  17.7 ± 0.2  18.3 ± 0.2  23.2 ± 0.2
Calt102-30   14.2 ± 0.1  14.1 ± 0.1  15.5 ± 0.2  17.1 ± 0.2  17.8 ± 0.2  22.1 ± 0.2
NMI
Flower17     37.3 ± 0.4  37.3 ± 0.5  38.2 ± 0.5  40.1 ± 0.4  43.7 ± 0.3  46.3 ± 0.2
Flower102    37.4 ± 0.1  37.4 ± 0.1  37.8 ± 0.1  38.4 ± 0.1  39.6 ± 0.1  41.7 ± 0.1
Calt102-5    64.3 ± 0.2  63.9 ± 0.1  65.9 ± 0.2  66.6 ± 0.1  66.5 ± 0.2  67.0 ± 0.1
Calt102-10   53.6 ± 0.1  53.7 ± 0.1  55.2 ± 0.1  55.7 ± 0.2  55.8 ± 0.1  58.6 ± 0.1
Calt102-15   47.4 ± 0.1  47.4 ± 0.1  48.8 ± 0.1  49.7 ± 0.1  50.1 ± 0.1  53.5 ± 0.1
Calt102-20   43.1 ± 0.1  43.1 ± 0.2  44.5 ± 0.1  45.6 ± 0.2  46.0 ± 0.1  50.3 ± 0.1
Calt102-25   40.0 ± 0.1  39.9 ± 0.1  41.5 ± 0.1  42.5 ± 0.2  43.0 ± 0.2  47.7 ± 0.1
Calt102-30   37.8 ± 0.1  37.7 ± 0.1  39.2 ± 0.1  40.3 ± 0.1  40.9 ± 0.1  45.6 ± 0.1
Table 1: Aggregated ACC and NMI comparison (mean ± std) of different clustering algorithms on eight benchmark datasets.

Dataset      MKKM+ZF       MKKM+MF       MKKM+KNN      MKKM+AF [Trivedi et al., 2010]   MKKM-IK [Liu et al., 2017a]   LI-MKKM (Proposed)
Flower17     2.5 ± 0.0     2.4 ± 0.1     3.4 ± 0.1     3.0 ± 0.1     7.3 ± 0.5     6.3 ± 0.1
Flower102    193.5 ± 4.5   210.2 ± 12.4  329.5 ± 27.6  240.0 ± 9.4   415.9 ± 7.7   418.1 ± 10.9
Calt102-5    3.8 ± 0.1     3.8 ± 0.1     6.4 ± 0.1     4.3 ± 0.2     31.9 ± 3.3    8.1 ± 0.1
Calt102-10   14.7 ± 0.3    14.8 ± 0.4    26.7 ± 0.5    16.4 ± 0.1    71.2 ± 16.3   36.8 ± 0.2
Calt102-15   58.6 ± 0.3    59.1 ± 0.2    86.5 ± 2.2    61.0 ± 2.7    202.3 ± 28.9  143.7 ± 0.7
Calt102-20   119.5 ± 7.3   118.3 ± 5.8   204.8 ± 18.2  130.8 ± 16.3  335.1 ± 42.0  290.6 ± 5.4
Calt102-25   235.5 ± 11.3  220.4 ± 7.9   395.5 ± 25.5  215.6 ± 6.3   599.4 ± 87.3  537.3 ± 1.6
Calt102-30   370.9 ± 17.2  367.8 ± 28.2  648.9 ± 36.6  360.9 ± 16.9  874.6 ± 28.6  837.7 ± 15.0
Table 3: Running time comparison of the aforementioned algorithms on all datasets (in seconds).
4.2 Experimental Results

Figure 1 presents the ACC and NMI comparison of the above algorithms with different missing ratios on the Flower17, Flower102, Caltech102-25 and Caltech102-30 datasets. We have the following observations: i) The recently proposed MKKM-IK [Liu et al., 2017a] (in blue) achieves comparable or better clustering performance than the existing two-stage imputation methods, which verifies the effectiveness of jointly optimizing imputation and clustering. ii) The proposed LI-MKKM (in red) consistently and significantly further improves the clustering performance of MKKM-IK, as shown in all subfigures from Fig.1(a) to 1(h). This clearly demonstrates the advantage of well utilizing the local structure of data. iii) The improvement of our algorithm is more significant as the missing ratio decreases. For example, it improves the second best algorithm (MKKM-IK) by 5 percentage points on Caltech102-30 in terms of clustering accuracy when the missing ratio is 0.1 (see Figure 1(d)).

We attribute the superiority of our algorithm to two aspects: i) the local structure of data is well utilized: our local kernel alignment criterion is flexible and allows the pre-specified kernels to be aligned for better clustering, ensuring that they are well utilized; and ii) imputation and clustering are jointly optimized: on one hand, the imputation is guided by the clustering results, which makes the imputation more directly targeted at the ultimate goal; on the other hand, this meaningful imputation helps refine the clustering results. These two learning processes negotiate with each other, leading to improved clustering performance. In contrast, the MKKM+ZF, MKKM+MF, MKKM+KNN and MKKM+AF algorithms do not take advantage of the connection between the imputation and clustering procedures. This can produce an imputation that does not serve the subsequent clustering as well as originally expected, affecting the clustering performance. Both factors bring the significant improvements in clustering performance.

We also report the aggregated ACC and NMI, together with the standard deviations, in Table 1, where the best and second best results are marked in red and blue, respectively. Again, we observe that the proposed algorithm significantly outperforms MKKM+ZF, MKKM+MF, MKKM+KNN, MKKM+AF and MKKM-IK. For example, our algorithm exceeds the second best one (MKKM-IK) by nearly 2.3%, 4.4%, 4.2%, 4.5%, 4.9% and 4.3% in terms of clustering accuracy on Caltech102-5 to Caltech102-30, respectively. These results are consistent with our observations in Figure 1. Meanwhile, we observe that LI-MKKM is inferior to MKKM-IK under large missing ratios. We conjecture that the neighbors of each sample, determined from $\mathbf{K}_{\beta^{(0)}}$, may no longer be reliable when the missing ratio is relatively large, which adversely affects the clustering performance. We plan to explore this point further in future work. From the above experiments, we conclude that the proposed algorithm well exploits the local structure of data, bringing forth significant improvements in clustering performance.

As aforementioned, the computational complexity of LI-MKKM is comparable to that of the existing MKKM-IK [Liu et al., 2017a]. In Table 3, we report their running time (in seconds) on the eight datasets. As observed, LI-MKKM slightly improves upon the running time of MKKM-IK.
This is because LI-MKKM requires fewer iterations than MKKM-IK to achieve the same convergence criterion. For example, LI-MKKM needs five iterations to satisfy the criterion, while this number is 14 for MKKM-IK. This motivates us to study the impact of incorporating the local structure of data on the convergence of the algorithm in future work.

[Figure 2: The sensitivity of LI-MKKM with the variation of τ and its convergence on Flower17 (missing ratio = 0.1): (a) performance against the variation of τ, with MKKM-IK as a baseline; (b) objective value against the number of iterations.]

4.3 Parameter Sensitivity and Convergence

As can be seen in Eq.(2), LI-MKKM introduces the number of neighbors τ as an extra parameter. In all the above experiments, we empirically set τ = 0.1n for all datasets and observe that our algorithm demonstrates very promising performance. In the following, we conduct experiments to show the effect of this parameter on the performance of LI-MKKM on the Flower17 dataset. Figure 2(a) plots the ACC of LI-MKKM when τ is varied over the large range [0.1, 0.2, ..., 0.9]·n on Flower17, with the ACC of MKKM-IK provided as a baseline. From the figure, we observe that: i) the ACC monotonically decreases as τ increases, clearly demonstrating the effectiveness of preserving the local structure of data; and ii) LI-MKKM significantly outperforms the recently proposed MKKM-IK and shows stable performance across a wide range of τ. The curves on other datasets are similar and are omitted due to the space limit.

LI-MKKM is theoretically guaranteed to converge to a local minimum. In the above experiments, we observe that the objective value of our algorithm does monotonically decrease at each iteration and that it usually converges in fewer than 10 iterations. An example of the evolution of the objective value on Flower17 is shown in Figure 2(b).

5 Conclusion

While the recently proposed MKKM-IK is able to handle multiple kernel clustering with incomplete kernels, it does not sufficiently utilize the local structure of data, which is crucial for clustering analysis. This paper proposes LI-MKKM, which computes the kernel alignment in a localized manner to address this issue. LI-MKKM effectively solves the resultant optimization problem and demonstrates clearly improved clustering performance in extensive experiments on benchmark datasets. In the future, we plan to further explore the parametrization of incomplete kernel matrices in order to improve its computational complexity and clustering performance.

Acknowledgements

The authors wish to gratefully acknowledge Prof. Huiying Xu from Zhejiang Normal University for her help in the proofreading of this paper. This work was supported by the Opening Fund of Zhejiang Provincial Top Key Discipline of Computer Science and Technology at Zhejiang Normal University, and the Natural Science Foundation of China (project no. 61672528). Additionally, the authors thank Cixing intelligent manufacturing research institute, Cixing textile automation research institute, Ningbo Cixing corporation limited and Ningbo Cixing robotics company limited for their financial support and application scenarios. Xinzhong Zhu and Xinwang Liu contributed equally to the paper. Xinzhong Zhu, Xinwang Liu and Jianping Yin are the corresponding authors of this paper.
References

[Bhadra et al., 2016] Sahely Bhadra, Samuel Kaski, and Juho Rousu. Multi-view kernel completion. arXiv:1602.02518, 2016.
[Cai et al., 2013] Xiao Cai, Feiping Nie, Weidong Cai, and Heng Huang. Heterogeneous image features integration via multi-modal semi-supervised learning model. In ICCV, pages 1737–1744, 2013.
[Cao et al., 2015] Xiaochun Cao, Changqing Zhang, Huazhu Fu, Si Liu, and Hua Zhang. Diversity-induced multi-view subspace clustering. In CVPR, pages 586–594, 2015.
[Gao et al., 2015] Hongchang Gao, Feiping Nie, Xuelong Li, and Heng Huang. Multi-view subspace clustering. In ICCV, pages 4238–4246, 2015.
[Ghahramani and Jordan, 1993] Zoubin Ghahramani and Michael I. Jordan. Supervised learning from incomplete data via an EM approach. In NIPS, pages 120–127, 1993.
[Gönen and Margolin, 2014] Mehmet Gönen and Adam A. Margolin. Localized data fusion for kernel k-means clustering with application to cancer biology. In NIPS, pages 1305–1313, 2014.
[Kumar et al., 2013] Ritwik Kumar, Ting Chen, Moritz Hardt, David Beymer, Karen Brannon, and Tanveer Fathima Syeda-Mahmood. Multiple kernel completion and its application to cardiac disease discrimination. In ISBI, pages 764–767, 2013.
[Li et al., 2014] Shao-Yuan Li, Yuan Jiang, and Zhi-Hua Zhou. Partial multi-view clustering. In AAAI, pages 1968–1974, 2014.
[Li et al., 2016] Miaomiao Li, Xinwang Liu, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. Multiple kernel clustering with local kernel alignment maximization. In IJCAI, pages 1704–1710, 2016.
[Liu et al., 2014] Xinwang Liu, Lei Wang, Jian Zhang, Jianping Yin, and Huan Liu. Global and local structure preservation for feature selection. IEEE Trans. Neural Netw. Learning Syst., 25(6):1083–1095, 2014.
[Liu et al., 2016] Xinwang Liu, Yong Dou, Jianping Yin, Lei Wang, and En Zhu. Multiple kernel k-means clustering with matrix-induced regularization. In AAAI, pages 1888–1894, 2016.
[Liu et al., 2017a] Xinwang Liu, Miaomiao Li, Lei Wang, Yong Dou, Jianping Yin, and En Zhu. Multiple kernel k-means with incomplete kernels. In AAAI, pages 2259–2265, 2017.
[Liu et al., 2017b] Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, and Jianping Yin. Optimal neighborhood kernel clustering with multiple kernels. In AAAI, pages 2266–2272, 2017.
[Nie et al., 2014] Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In KDD, pages 977–986, 2014.
[Shao et al., 2015] Weixiang Shao, Lifang He, and Philip S. Yu. Multiple incomplete views clustering via weighted nonnegative matrix factorization with ℓ2,1 regularization. In ECML PKDD, pages 318–334, 2015.
[Trivedi et al., 2010] Anusua Trivedi, Piyush Rai, Hal Daumé III, and Scott L. DuVall. Multiview clustering with incomplete views. In NIPS 2010 Workshop on Machine Learning for Social Computing, Whistler, Canada, 2010.
[Xiang et al., 2013] Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, Paul M. Thompson, and Jieping Ye. Multi-source learning with block-wise missing data for Alzheimer's disease prediction. In ACM SIGKDD, pages 185–193, 2013.
[Xu et al., 2015a] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view intact space learning. IEEE Trans. Pattern Anal. Mach. Intell., 37(12):2531–2544, 2015.
[Xu et al., 2015b] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view learning with incomplete views. IEEE Trans. Image Processing, 24(12):5812–5825, 2015.
[Yin et al., 2015] Qiyue Yin, Shu Wu, and Liang Wang. Incomplete multi-view clustering via subspace learning. In CIKM, pages 383–392, 2015.
[Yu et al., 2012] Shi Yu, Léon-Charles Tranchevent, Xinhai Liu, Wolfgang Glänzel, Johan A. K. Suykens, Bart De Moor, and Yves Moreau. Optimized data fusion for kernel k-means clustering. IEEE TPAMI, 34(5):1031–1039, 2012.
[Zhang et al., 2015] Changqing Zhang, Huazhu Fu, Si Liu, Guangcan Liu, and Xiaochun Cao. Low-rank tensor constrained multiview subspace clustering. In ICCV, pages 1582–1590, 2015.
[Zhao et al., 2016] Handong Zhao, Hongfu Liu, and Yun Fu. Incomplete multimodal visual data grouping. In IJCAI, pages 2392–2398, 2016.