# Multi-View Multiple Clustering

Shixin Yao¹, Guoxian Yu¹,³, Jun Wang¹, Carlotta Domeniconi², Xiangliang Zhang³
¹College of Computer and Information Sciences, Southwest University, Chongqing, China
²Department of Computer Science, George Mason University, VA, USA
³CEMSE, King Abdullah University of Science and Technology, Thuwal, SA
{ysx051836, gxyu, kingjun}@swu.edu.cn, carlotta@cs.gmu.edu, xiangliang.zhang@kaust.edu.sa

Abstract

Multiple clustering aims at exploring alternative clusterings to organize the data into meaningful groups from different perspectives. Existing multiple clustering algorithms are designed for single-view data. We assume that the individuality and commonality of multi-view data can be leveraged to generate high-quality and diverse clusterings. To this end, we propose a novel multi-view multiple clustering (MVMC) algorithm. MVMC first adapts multi-view self-representation learning to explore the individuality encoding matrices and the shared commonality matrix of multi-view data. It additionally reduces the redundancy (i.e., enhances the individuality) among the individuality matrices using the Hilbert-Schmidt Independence Criterion (HSIC), and collects shared information by forcing the shared matrix to be smooth across all views. It then uses matrix factorization on the individual matrices, along with the shared matrix, to generate diverse clusterings of high quality. We further extend multiple co-clustering to multi-view data and propose a solution called multi-view multiple co-clustering (MVMCC). Our empirical study shows that MVMC (MVMCC) can exploit multi-view data to generate multiple high-quality and diverse clusterings (co-clusterings), with superior performance to the state-of-the-art methods.

1 Introduction

The goal of clustering is to partition samples into disjoint groups to facilitate the discovery of hidden patterns in the data. Traditional clustering algorithms are designed for single-view data. With the diffusion of the internet of things and of big data, samples can be easily collected from different sources, or observed from different views. For example, a video can be characterized using image signals and audio signals, and a given news story can be reported in different languages. Objects with diverse feature views are typically called multi-view data. It is recognized that the integration of information contained in multiple views can achieve consolidated data clustering [Chao et al., 2017]. Many multi-view clustering methods have been developed to extract comprehensive information from multiple feature views; examples are co-training based [Kumar and Daumé, 2011], multiple kernel learning based [Gönen and Alpaydın, 2011], and subspace learning based [Cao et al., 2015; Luo et al., 2018] approaches. However, the aforementioned clustering methods typically provide a single clustering, which may fail to reveal the high-quality and diverse alternative clusterings of the same data.

Figure 1: An example of multi-view multiple clustering. Two alternative clusterings (texture+shape and color+shape) can be generated using the commonality (shape) and the individuality (texture and color) information of the same multi-view objects.

Figure 1 shows a collection of objects represented by a texture view and a color view. We can group the objects based on their shared shapes.
By leveraging the commonality and the individuality of these multi-view objects, we can obtain two alternative clusterings (texture+shape and color+shape), as shown at the bottom of the figure. From this example, we can see that multi-view data include not only the commonality information for generating a high-quality clustering (as multi-view clustering does), but also the individual (or specific) information for generating diverse clusterings (as multiple clustering aims to achieve).

To explore different clusterings of the given data, multiple clustering has emerged as a new branch of clustering in recent years. Some approaches seek clusterings that are alternatives to those already explored, by enforcing the new ones to be different [Bae and Bailey, 2006; Davidson and Qi, 2008; Yang and Zhang, 2017]; other solutions simultaneously seek multiple clusterings by reducing their correlation [Caruana et al., 2006; Jain et al., 2008; Dang and Bailey, 2010; Wang et al., 2018], or by seeking orthogonal (or independent) subspaces and clusterings therein [Niu et al., 2010; Ye et al., 2016; Mautz et al., 2018; Wang et al., 2019]. However, these multiple clustering methods are designed for single-view data.

Motivated by the example in Figure 1, we leverage the individuality and the commonality of multi-view data to generate high-quality and diverse clusterings, and we propose an approach called multi-view multiple clustering (MVMC) to achieve this goal. To the best of our knowledge, MVMC is the first attempt to encompass both multiple clustering and multi-view clustering, where the former focuses on generating diverse clusterings from a single data view, and the latter focuses on a single consensus clustering that summarizes the information from different views. MVMC first extends multi-view self-representation learning [Luo et al., 2018] to explore the individuality information encoding matrices and the commonality information matrix shared across views. To obtain more credible commonality information from multiple views, we force the commonality information matrix to be smooth across all views. In addition, we use the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005] to enhance the individuality between matrices, and consequently increase the diversity between clusterings. We then use matrix factorization to jointly factorize each individuality matrix (for diversity) and the commonality matrix (for quality) into a clustering indicator matrix and a basis matrix. To simultaneously seek the individual and common data matrices, and diverse clusterings therein, we use an alternating optimization technique to solve the unified objective. In addition, we extend multiple co-clustering [Tokuda et al., 2017; Wang et al., 2018] to the multi-view scenario, and term the extended approach multi-view multiple co-clustering (MVMCC).

The main contributions of our work are as follows:

- We study how to generate multiple clusterings from multi-view data, which is an interesting and challenging problem, but largely overlooked. The problem we address is different from existing multi-view clustering, which generates a single clustering from multiple views, and also different from multiple clustering, which produces alternative clusterings from single-view data.
- We introduce a unified objective function to simultaneously seek the multiple individuality information encoding matrices and the commonality information encoding matrix. This unified function leverages the individuality to generate diverse clusterings, and the commonality to boost the quality of the generated clusterings. We further adopt an alternating procedure to optimize the unified objective.

- Extensive experimental results show that MVMC (MVMCC) performs considerably better than existing multiple clustering (co-clustering) algorithms [Cui et al., 2007; Niu et al., 2010; Ye et al., 2016; Yang and Zhang, 2017; Tokuda et al., 2017; Wang et al., 2018] in exploring multiple clusterings and co-clusterings.

2 The Proposed Methods

2.1 Multi-View Multiple Clustering

Suppose $X^v \in \mathbb{R}^{d_v \times n}$ denotes the feature data matrix of the $v$-th view, for $n$ objects in a $d_v$-dimensional space, with $v \in \{1, 2, \cdots, m\}$, where $m$ is the number of views. We aim to generate $h$ (provided by the user) different clusterings from $\{X^v\}_{v=1}^m$ using the shared and individual information embedded in the data matrices. As an application example, one can group the same customers (represented by different feature views) from the perspectives of purchase capability, loyalty (leave or stay), and fraud (yes or no).

Most multi-view clustering approaches in essence focus on the shared and complementary information of multiple data views to generate a consolidated clustering [Chao et al., 2017]. By viewing each subspace as a feature view, an intuitive solution to explore multiple clusterings on multi-view data is to first concatenate the different feature views, and then apply subspace-based multiple clustering methods on the concatenated feature vectors [Niu et al., 2010; Ye et al., 2016; Wang et al., 2019]. To find high-quality and diverse multiple clusterings, however, we should make concrete use of the individuality and commonality of multi-view data. The individuality helps to explore diverse clusterings, while the commonality coordinates the diverse clusterings to capture the common knowledge of multi-view data. To explore the individuality and commonality of multi-view data, we extend multi-view self-representation learning [Cao et al., 2015; Luo et al., 2018] as follows:

$$
\mathcal{J}_D(\{D_k\}_{k=1}^h, U) = \sum_{v=1}^{m}\sum_{k=1}^{h} \|X^v - X^v(U + D_k)\|_F^2 + \lambda_1\Phi_1(\{D_k\}_{k=1}^h) + \lambda_2\Phi_2(U)
\tag{1}
$$

where $U \in \mathbb{R}^{n \times n}$ is specified to encode the commonality of the data matrices $\{X^v\}_{v=1}^m$, and $D_k \in \mathbb{R}^{n \times n}$ is used to encode the individuality of the $k$-th ($k \in \{1, 2, \cdots, h\}$) group of views. $\Phi_1(\{D_k\}_{k=1}^h)$ and $\Phi_2(U)$ (defined later) are two terms used to enhance the individuality and the commonality, respectively. The multi-view self-representation learning in [Cao et al., 2015; Luo et al., 2018] requires $h = m$. In contrast, Eq. (1) does not have this requirement. As a result, the group-wise individuality and diversity are jointly considered, and the number of alternative clusterings can be adjusted by the user.

The assumption behind the linear representation in Eq. (1) is that a data sample can be expressed as a linear combination of the other samples in the subspace. This assumption is widely used in sparse representation [Wright et al., 2010] and low-rank representation learning [Liu et al., 2013]. [Luo et al., 2018] recently combined $U$ and $\{D_k\}_{k=1}^m$ into an integrated co-association matrix of samples, and then applied spectral clustering to seek a consistent clustering. Their empirical study shows that the individual information encoded by $D_k$ helps to produce a robust clustering.
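As a quick shape check on Eq. (1), the following numpy sketch (ours; the function and variable names are illustrative, not from the released MVMC code) evaluates the data-fitting term for given estimates of $U$ and $\{D_k\}_{k=1}^h$; each view matrix $X^v$ is $d_v \times n$, while $U$ and each $D_k$ are $n \times n$. The regularizers $\Phi_1$ and $\Phi_2$, defined next, are sketched separately below.

```python
import numpy as np

def self_representation_residual(X_views, U, D_list):
    """Data-fitting term of Eq. (1): squared Frobenius reconstruction error of
    every view under the shared matrix U plus each individuality matrix D_k."""
    return sum(
        np.linalg.norm(Xv - Xv @ (U + Dk), 'fro') ** 2
        for Xv in X_views          # X^v of shape (d_v, n)
        for Dk in D_list           # each D_k of shape (n, n); U is (n, n)
    )
```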
However, since $X^v$ and $X^{v'}$ ($v \neq v'$) describe the same objects using different types of features, the matrix $D_k$ resulting from Eq. (1) may still have a large information overlap with $D_{k'}$. As a result, the expected individuality of $D_k$ and $D_{k'}$ cannot be guaranteed. This information overlap is not necessarily an issue for multi-view clustering, which aims at finding a single clustering, but it is for our problem, where multiple clusterings of both high quality and high diversity are expected.

To enhance the diversity between the individuality encoding (representation) matrices $\{D_k\}_{k=1}^h$, we approximately quantify the diversity through the dependency between these matrices: the smaller the dependency between the matrices, the larger their diversity, since the matrices are less correlated. Various measures can be used to evaluate the dependence between variables. Here, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005], for its simplicity, solid theoretical foundation, and capability of measuring both linear and nonlinear dependence between variables. HSIC computes the squared norm of the cross-covariance operator over $D_k$ and $D_{k'}$ in Hilbert space to estimate the dependency. The empirical HSIC does not have to explicitly compute the joint distribution of $D_k$ and $D_{k'}$; it is given by:

$$
\mathrm{HSIC}(D_k, D_{k'}) = (n-1)^{-2}\,\mathrm{tr}(K_k H K_{k'} H)
\tag{2}
$$

where $K_k, K_{k'}, H \in \mathbb{R}^{n \times n}$; $K_k$ and $K_{k'}$ are kernel-induced similarity matrices computed from $D_k$ and $D_{k'}$, respectively, and $H_{ij} = \delta_{ij} - 1/n$ (with $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise) is the centering matrix. In this paper, we adopt the inner-product kernel to specify $K_k = D_k^\top D_k$, $k \in \{1, 2, \cdots, h\}$. Then, we minimize the overall HSIC over the $h$ individuality matrices to reduce their redundancy, and specify $\Phi_1$ as follows:

$$
\Phi_1(\{D_k\}_{k=1}^h) = \sum_{k=1}^{h}\sum_{k'=1, k'\neq k}^{h} \mathrm{HSIC}(D_k, D_{k'})
= \sum_{k=1}^{h}\sum_{k'=1, k'\neq k}^{h} (n-1)^{-2}\,\mathrm{tr}\big(D_k H K_{k'} H D_k^\top\big)
= \sum_{k=1}^{h} \mathrm{tr}\big(D_k \widetilde{K}_k D_k^\top\big)
\tag{3}
$$

where $\widetilde{K}_k = (n-1)^{-2}\sum_{k'=1, k'\neq k}^{h} H K_{k'} H$.

Inspired by subspace-based multi-view learning [Gao et al., 2015; Chao et al., 2017] and manifold regularization [Belkin et al., 2006], we specify $\Phi_2(U)$ in Eq. (1) to collect more shared information from the multiple data views as follows:

$$
\Phi_2(U) = \sum_{v=1}^{m}\sum_{i,j=1}^{n} \|u_i - u_j\|_2^2\, W^v_{ij} = \mathrm{tr}\big(U \widetilde{L} U^\top\big)
\tag{4}
$$

where $u_i$ is the $i$-th column of $U$ and $W^v_{ij}$ is the feature similarity between $x^v_i$ and $x^v_j$. To compute $W^v$, we simply adopt a nearest-neighborhood graph (with 5 neighbors), and use the Gaussian heat kernel (with kernel width set to the standard deviation of the pairwise distances between samples) to quantify the similarity between neighboring samples. $\widetilde{L} = \sum_{v=1}^{m}(\Lambda^v - W^v)$, and $\Lambda^v$ is a diagonal matrix with $\Lambda^v_{ii} = \sum_{j=1}^{n} W^v_{ij}$. Minimizing $\Phi_2(U)$ guides $U$ to encode consistent and complementary information shared across views. In this way, the quality of the diverse clusterings can be boosted using the enhanced shared information.

Given the equivalence between matrix-factorization-based clustering and spectral clustering (or k-means clustering), we adopt the widely used semi-nonnegative matrix factorization [Ding et al., 2010] to explore the $k$-th clustering on $U + D_k$ as follows:

$$
U + D_k = B_k R_k^\top
\tag{5}
$$

where $R_k \in \mathbb{R}^{n \times r_k}$ and $B_k \in \mathbb{R}^{n \times r_k}$ ($r_k$ is the number of sample clusters) are the clustering indicator matrix and the basis matrix, respectively. Here, the $k$-th clustering is generated not only with respect to $D_k$, but also with respect to $U$, which encodes the shared information of the multi-view data.
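To make the two regularizers concrete, here is a small numpy sketch (ours, not the authors' released code) that evaluates the empirical HSIC of Eq. (2) with the inner-product kernel used in the paper, and the smoothness penalty $\Phi_2(U)$ of Eq. (4); the per-view similarity graphs $W^v$ are assumed to be precomputed.

```python
import numpy as np

def hsic_inner_product(Dk, Dl):
    """Empirical HSIC of Eq. (2) with inner-product kernels K = D^T D."""
    n = Dk.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix H_ij = delta_ij - 1/n
    Kk, Kl = Dk.T @ Dk, Dl.T @ Dl              # Gram matrices of the two encoding matrices
    return np.trace(Kk @ H @ Kl @ H) / (n - 1) ** 2

def smoothness_penalty(U, W_views):
    """Phi_2(U) of Eq. (4): tr(U L U^T) with L = sum_v (Lambda^v - W^v)."""
    n = U.shape[0]
    L = np.zeros((n, n))
    for W in W_views:                          # each W^v is an (n, n) similarity graph
        L += np.diag(W.sum(axis=1)) - W        # per-view graph Laplacian
    return np.trace(U @ L @ U.T)
```

Summing the first quantity over all pairs $(D_k, D_{k'})$ corresponds to $\Phi_1$ in Eq. (3); minimizing the second encourages $U$ to be smooth over every view's similarity graph.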
As such, the explored $k$-th clustering (encoded by $R_k$) not only reflects the individuality of the views in the $k$-th group, but also captures the commonality of all data views. As a consequence, a high-quality and yet diverse clustering can be generated.

The above process first explores the individual information matrices and the shared information matrix, and then generates diverse clusterings on these matrices. A sub-optimal solution may be obtained as a result, because the two steps are performed separately. To avoid this, we advocate simultaneously optimizing $\{D_k\}_{k=1}^h$ and the diverse clusterings $\{R_k\}_{k=1}^h$ therein, and formulate a unified objective function for MVMC as follows:

$$
\mathcal{J}_{MC}(\{D_k\}_{k=1}^h, \{R_k\}_{k=1}^h, U) = \sum_{k=1}^{h}\Big[\|(U + D_k) - B_k R_k^\top\|_F^2 + \lambda_1\,\mathrm{tr}\big(D_k \widetilde{K}_k D_k^\top\big)\Big] + \lambda_2\,\mathrm{tr}\big(U \widetilde{L} U^\top\big) \\
\text{s.t. } X^v = X^v(U + D_k),\ \forall v \in \{1, 2, \cdots, m\}
\tag{6}
$$

By solving Eq. (6), we can simultaneously obtain multiple diverse clusterings of good quality by leveraging the commonality and individuality information of the multiple views. The binary matrices $\{R_k\}_{k=1}^h$ are hard to optimize directly, so we relax their entries to nonnegative numeric values. Since Eq. (6) is not jointly convex in $\{D_k\}_{k=1}^h$, $U$ and $\{R_k\}_{k=1}^h$, it is unrealistic to find the globally optimal values of all the variables. Hence, we solve Eq. (6) via alternating optimization, which optimizes one variable at a time while fixing the others. The detailed optimization process is provided in the supplementary file due to space limitations.

2.2 Multi-View Multiple Co-Clustering

Recently, multiple co-clustering algorithms have also been suggested to explore alternative co-clusterings of the same data [Tokuda et al., 2017; Wang et al., 2018]. Multiple co-clustering methods aim at exploring multiple two-way clusterings, where both samples and features are clustered. In contrast, multiple clustering techniques only explore diverse one-way clusterings, where only samples (or only features) are clustered. Based on the merits of matrix tri-factorization in exploring co-clusters [Wang et al., 2011; 2018], we can seek multiple co-clusterings on multiple views by optimizing the following objective:

$$
\mathcal{J}_{MCC}(\{D^v\}_{v=1}^m, \{R^v\}_{v=1}^m, \{C^v\}_{v=1}^m, U) = \sum_{v=1}^{m}\Big[\|X^v(U + D^v) - C^v S^v (R^v)^\top\|_F^2 + \lambda_1\,\mathrm{tr}\big(D^v \widetilde{K}_v (D^v)^\top\big)\Big] + \lambda_2\,\mathrm{tr}\big(U \widetilde{L} U^\top\big) \\
\text{s.t. } X^v = X^v(U + D^v),\ C^v \ge 0,\ R^v \ge 0
\tag{7}
$$

where $C^v \in \mathbb{R}^{d_v \times c_v}$ and $R^v \in \mathbb{R}^{n \times r_v}$ are the row-cluster (feature grouping) and column-cluster (sample grouping) indicator matrices of the $v$-th co-clustering, and $S^v \in \mathbb{R}^{c_v \times r_v}$ absorbs the different scaling factors of $R^v$ and $C^v$ to minimize the squared error. Here we fix $h = m$ for MVMCC, since different feature views have different numbers of features. Eq. (7) can be optimized following a procedure similar to the one used for Eq. (6), which is provided in the supplementary file due to the page limit.
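To make the factorization step concrete, the sketch below (our own illustration, not the authors' released code) applies the standard semi-NMF updates of [Ding et al., 2010] to a fixed matrix $M = U + D_k$, mirroring the term $\|(U + D_k) - B_k R_k^\top\|_F^2$ in Eq. (6) and the relaxation of $R_k$ to nonnegative values; in the full MVMC solver this is only one subproblem of the alternating scheme, which also updates $U$ and the $D_k$.

```python
import numpy as np

def semi_nmf(M, r, n_iter=200, eps=1e-9):
    """Factorize M ~ B @ R.T with R >= 0 (semi-NMF, Ding et al., 2010).

    M : (n, n) matrix, e.g. U + D_k once those have been estimated.
    r : number of sample clusters; labels are read from the argmax of each row of R.
    """
    n = M.shape[0]
    rng = np.random.default_rng(0)
    R = np.abs(rng.standard_normal((n, r))) + eps        # nonnegative relaxation of the indicator matrix
    pos = lambda A: (np.abs(A) + A) / 2                  # elementwise positive part
    neg = lambda A: (np.abs(A) - A) / 2                  # elementwise negative part
    for _ in range(n_iter):
        B = M @ R @ np.linalg.pinv(R.T @ R)              # closed-form update of the basis matrix
        MtB, BtB = M.T @ B, B.T @ B
        R *= np.sqrt((pos(MtB) + R @ neg(BtB)) / (neg(MtB) + R @ pos(BtB) + eps))
    return B, R, R.argmax(axis=1)                        # basis, relaxed indicators, hard labels
```

For example, `_, _, labels_k = semi_nmf(U + D_k, r_k)` would read off the $k$-th clustering once $U$ and $D_k$ are available.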
3 Experimental Results and Analysis

3.1 Experimental Setup

In this section, we evaluate the proposed MVMC and MVMCC on six widely-used multi-view datasets [Li et al., 2015; Tao et al., 2018], described in Table 1. The datasets have different numbers of views and come from different domains.

Caltech-7 and Caltech-20 [Li et al., 2015] (available at https://github.com/yeqinglee/mvdata) are two subsets of Caltech-101, containing only 7 and 20 classes, respectively. These subsets were created because of the unbalanced number of samples per class in Caltech-101. Each sample is described by 6 views of the same image.

Mul-fea digits (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) is comprised of 2,000 data points from the digit classes 0 to 9, with 200 data points per class. Six public feature sets are available: 76 Fourier coefficients of the character shapes, 216 profile correlations, 64 Karhunen-Loève coefficients, 240 pixel averages in 2×3 windows, 47 Zernike moments, and 6 morphological features.

Wiki article (http://www.svcl.ucsd.edu/projects/crossmodal/) contains selected sections from Wikipedia's featured articles collection. We consider only the 10 most populated categories. It contains two views: text and image.

Corel [Tao et al., 2018] (http://www.cais.ntu.edu.sg/~chhoi/SVMBMAL/) consists of 5,000 images from 50 different categories, with 100 images per category. The features are a color histogram (9), an edge direction histogram (18), and WT (9).

Mirflickr (http://press.liacs.nl/mirflickr/mirdownload.html) contains 25,000 instances collected from Flickr. Each instance consists of an image and its associated textual tags. To avoid noise, we remove textual tags that appear fewer than 20 times in the dataset, and then delete instances without textual tags or semantic labels. This process leaves 16,738 instances.

| Datasets | n | d_v | classes | m |
|---|---|---|---|---|
| Caltech-7 | 1474 | [40, 48, 254, 1984, 512, 928] | 7 | 6 |
| Caltech-20 | 2386 | [40, 48, 254, 1984, 512, 928] | 20 | 6 |
| Mul-fea digits | 2000 | [76, 216, 64, 240, 47, 6] | 10 | 6 |
| Wiki article | 2866 | [128, 10] | 10 | 2 |
| Corel | 5000 | [9, 18, 9] | 50 | 3 |
| Mirflickr | 16738 | [150, 500] | 24 | 2 |

Table 1: Characteristics of the multi-view datasets. n is the number of samples, d_v is the per-view dimensionality, classes is the number of ground-truth clusters, and m is the number of views.

Evaluating multiple clusterings requires quantifying both the quality and the diversity of the alternative clusterings. To measure quality, we use the widely-adopted Silhouette Coefficient (SC) and Dunn Index (DI) as internal indices; larger values of SC and DI indicate a higher-quality clustering. To quantify the redundancy between alternative clusterings, we use Normalized Mutual Information (NMI) and the Jaccard Coefficient (JC) as external indices; smaller values of NMI and JC indicate less redundancy between alternative clusterings. All these metrics have been used in the multiple clustering literature [Bailey, 2013]. Their formal definitions, omitted here to save space, can be found in [Bailey, 2013; Yang and Zhang, 2017].

3.2 Discovering Multiple One-way Clusterings and Multiple Co-Clusterings

We compare the one-way multiple clusterings found by MVMC against Dec-kmeans [Jain et al., 2008], MNMF [Yang and Zhang, 2017], OSC [Cui et al., 2007], mSC [Niu et al., 2010], ISAAC [Ye et al., 2016], and MISC [Wang et al., 2019]. We also compare the multiple co-clusterings found by MVMCC against MultiCC [Wang et al., 2018] and MCC-NBMM [Tokuda et al., 2017]. The input parameters of the comparing methods are set as suggested by the authors in their papers or shared code. The parameter values of MVMC and MVMCC are $\lambda_1 = 10$ and $\lambda_2 = 100$, with $h = 2$ for multiple one-way clusterings and $h = m$ for multiple co-clusterings. Since none of the existing multiple clustering algorithms can work on multi-view data, we concatenate the feature vectors of the views and run these comparing methods on the concatenated vectors to seek alternative clusterings. Our MVMC and MVMCC directly run on the multi-view data, without such feature concatenation.
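As a concrete reference for the evaluation protocol used below (average quality of the two alternative clusterings and diversity between them), here is a small sketch of our own; it uses scikit-learn for SC and NMI, omits DI for brevity, and computes JC with the usual pair-counting definition. `X` stands for whatever feature representation the quality index is computed on (e.g., the concatenated views).

```python
import numpy as np
from sklearn.metrics import silhouette_score, normalized_mutual_info_score

def jaccard_coefficient(y1, y2):
    """Pair-counting Jaccard coefficient between two clusterings:
    sample pairs co-clustered in both, over pairs co-clustered in at least one."""
    y1, y2 = np.asarray(y1), np.asarray(y2)
    iu = np.triu_indices(len(y1), k=1)                   # each unordered pair counted once
    same1 = (y1[:, None] == y1[None, :])[iu]
    same2 = (y2[:, None] == y2[None, :])[iu]
    return (same1 & same2).sum() / (same1 | same2).sum()

def quality_and_diversity(X, y1, y2):
    """Average quality (SC) of clusterings C1, C2 and their diversity (NMI, JC);
    lower NMI/JC means the two alternatives are more diverse."""
    sc = (silhouette_score(X, y1) + silhouette_score(X, y2)) / 2
    return sc, normalized_mutual_info_score(y1, y2), jaccard_coefficient(y1, y2)
```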
We use k-means to generate the reference clustering for MNMF, and then use each method's respective solution to generate two alternative clusterings (C1, C2). We downloaded the source code of MNMF, ISAAC, MultiCC, MISC, and MCC-NBMM, and implemented the other methods (Dec-kmeans, mSC, and OSC) following their original papers. Following the experimental protocol adopted by these methods, we quantify the average clustering quality of C1 and C2, and measure the diversity between C1 and C2. We fix the number of sample clusters $r_k$ of each clustering to the number of classes of the corresponding dataset, as listed in Table 1. For co-clustering, we adopt a widely used technique [Monti et al., 2003] to determine the number of feature clusters $c_k$. Detailed parameter values are given in the supplementary file.

Table 2 reports the average results (over ten independent runs) and standard deviations of the comparing methods on exploring two alternative one-way clusterings with $h = 2$.

| Dataset | Metric | Dec-kmeans | ISAAC | MISC | MNMF | mSC | OSC | MVMC |
|---|---|---|---|---|---|---|---|---|
| | SC | 0.049±0.002 | 0.235±0.011 | 0.201±0.002 | 0.234±0.001 | 0.163±0.008 | 0.261±0.004 | 0.140±0.002 |
| | DI | 0.044±0.000 | 0.034±0.001 | 0.048±0.000 | 0.034±0.000 | 0.056±0.000 | 0.066±0.000 | 0.062±0.000 |
| | NMI | 0.024±0.000 | 0.485±0.023 | 0.513±0.016 | 0.022±0.000 | 0.152±0.002 | 0.693±0.015 | 0.006±0.000 |
| | JC | 0.126±0.001 | 0.363±0.008 | 0.349±0.001 | 0.094±0.000 | 0.136±0.001 | 0.522±0.046 | 0.076±0.000 |
| | SC | -0.124±0.001 | 0.085±0.000 | 0.036±0.000 | -0.169±0.000 | -0.172±0.006 | 0.196±0.001 | 0.004±0.000 |
| | DI | 0.026±0.000 | 0.035±0.000 | 0.033±0.000 | 0.009±0.000 | 0.028±0.000 | 0.056±0.000 | 0.183±0.000 |
| | NMI | 0.056±0.000 | 0.475±0.011 | 0.489±0.013 | 0.052±0.002 | 0.240±0.003 | 0.715±0.025 | 0.027±0.000 |
| | JC | 0.050±0.001 | 0.222±0.002 | 0.198±0.002 | 0.033±0.001 | 0.074±0.001 | 0.444±0.004 | 0.023±0.000 |
| | SC | 0.112±0.014 | -0.052±0.000 | -0.070±0.000 | -0.277±0.000 | -0.128±0.000 | 0.238±0.002 | -0.016±0.000 |
| | DI | 0.031±0.001 | 0.032±0.000 | 0.020±0.000 | 0.015±0.000 | 0.019±0.000 | 0.032±0.000 | 0.354±0.000 |
| | NMI | 0.643±0.035 | 0.204±0.002 | 0.209±0.002 | 0.092±0.001 | 0.394±0.006 | 0.762±0.018 | 0.070±0.000 |
| | JC | 0.219±0.004 | 0.031±0.001 | 0.029±0.001 | 0.013±0.000 | 0.072±0.001 | 0.410±0.013 | 0.010±0.000 |
| Mul-fea digits | SC | -0.133±0.022 | -0.001±0.000 | 0.061±0.000 | -0.076±0.000 | -0.117±0.000 | 0.471±0.013 | 0.064±0.000 |
| | DI | 0.016±0.000 | 0.016±0.000 | 0.018±0.001 | 0.008±0.000 | 0.016±0.001 | 0.062±0.000 | 0.087±0.000 |
| | NMI | 0.078±0.000 | 0.364±0.012 | 0.399±0.008 | 0.011±0.000 | 0.515±0.022 | 0.822±0.028 | 0.008±0.000 |
| | JC | 0.076±0.000 | 0.279±0.003 | 0.298±0.000 | 0.053±0.000 | 0.279±0.004 | 0.656±0.015 | 0.052±0.000 |
| Wiki article | SC | 0.447±0.016 | -0.024±0.000 | -0.031±0.000 | -0.029±0.000 | 0.108±0.002 | 0.418±0.012 | 0.066±0.000 |
| | DI | 0.124±0.000 | 0.085±0.000 | 0.085±0.001 | 0.083±0.000 | 0.095±0.001 | 0.135±0.001 | 0.122±0.000 |
| | NMI | 0.803±0.019 | 0.042±0.000 | 0.041±0.000 | 0.006±0.000 | 0.212±0.001 | 0.783±0.052 | 0.006±0.000 |
| | JC | 0.593±0.006 | 0.078±0.001 | 0.078±0.000 | 0.056±0.001 | 0.113±0.002 | 0.535±0.014 | 0.052±0.000 |
| | SC | -0.004±0.000 | -0.092±0.000 | -0.028±0.000 | -0.058±0.000 | -0.093±0.000 | 0.017±0.000 | -0.038±0.000 |
| | DI | 0.061±0.002 | 0.062±0.005 | 0.071±0.000 | 0.053±0.001 | 0.064±0.001 | 0.059±0.002 | 0.173±0.005 |
| | NMI | 0.427±0.012 | 0.016±0.000 | 0.021±0.000 | 0.014±0.000 | 0.216±0.006 | 0.575±0.011 | 0.005±0.000 |
| | JC | 0.878±0.022 | 0.047±0.000 | 0.037±0.000 | 0.023±0.000 | 0.073±0.000 | 0.368±0.011 | 0.022±0.000 |

Table 2: Quality (SC, DI; higher is better) and diversity (NMI, JC; lower is better) of the competing methods on finding multiple clusterings. Each block of four rows corresponds to one dataset. Differences between MVMC and each competitor were assessed with a pairwise t-test at the 95% significance level.
From Table 2, we can see that MVMC often outperforms the comparing methods across the different multi-view datasets, which demonstrates the effectiveness of MVMC in exploring alternative clusterings on multi-view data. MVMC always gives the best result on the diversity metrics (NMI and JC), which suggests that it can find two alternative clusterings with high diversity. MVMC occasionally has a lower quality value (SC and DI) than some of the comparing methods. This is explainable, since obtaining alternative clusterings with both high diversity and high quality is a well-known dilemma, and MVMC achieves a much larger diversity than the comparing methods. Although the comparing methods employ different techniques to explore alternative clusterings in subspaces, they almost always lose to MVMC. The cause is that the long concatenated feature vectors override the intrinsic structures of the individual views. This also explains why the comparing methods achieve worse diversity (higher NMI and JC).

In practice, because of the long concatenated feature vectors, the comparing methods generally suffer from long runtimes and cannot be applied to multi-view datasets with high-dimensional feature views. For example, we tried experiments on a large text dataset with five views, 18,758 samples, and more than 10,000 features per view. Since the concatenated data matrix has more than 100,000 features, almost all the comparing methods could not complete after a long time, or could not run at all on a moderate server. In contrast, our MVMC is rather efficient: it does not need to concatenate features and is directly applicable to each view.

| Dataset | Metric | MCC-NBMM | MultiCC | MVMCC |
|---|---|---|---|---|
| | SC | -0.100±0.002 | -0.103±0.006 | 0.198±0.004 |
| | DI | 0.034±0.000 | 0.011±0.000 | 0.047±0.000 |
| | NMI | 0.376±0.014 | 0.005±0.000 | 0.005±0.000 |
| | JC | 0.185±0.003 | 0.087±0.000 | 0.083±0.000 |
| | SC | -0.134±0.000 | -0.229±0.012 | 0.080±0.000 |
| | DI | 0.026±0.000 | 0.011±0.000 | 0.156±0.008 |
| | NMI | 0.325±0.010 | 0.021±0.000 | 0.026±0.000 |
| | JC | 0.150±0.002 | 0.056±0.000 | 0.029±0.000 |
| | SC | -0.087±0.002 | -0.172±0.012 | -0.017±0.000 |
| | DI | 0.024±0.000 | 0.015±0.000 | 0.152±0.002 |
| | NMI | 0.377±0.013 | 0.164±0.002 | 0.070±0.000 |
| | JC | 0.176±0.004 | 0.044±0.000 | 0.010±0.000 |
| | SC | -0.243±0.024 | -0.214±0.013 | 0.144±0.002 |
| | DI | 0.014±0.000 | 0.003±0.000 | 0.018±0.000 |
| | NMI | 0.286±0.006 | 0.010±0.000 | 0.207±0.003 |
| | JC | 0.166±0.001 | 0.060±0.000 | 0.115±0.000 |
| Wiki article | SC | -0.0694±0.000 | -0.058±0.000 | 0.064±0.000 |
| | DI | 0.079±0.000 | 0.041±0.001 | 0.078±0.000 |
| | NMI | 0.287±0.005 | 0.007±0.000 | 0.006±0.000 |
| | JC | 0.127±0.002 | 0.054±0.000 | 0.052±0.000 |
| | SC | -0.095±0.000 | -0.194±0.002 | 0.064±0.000 |
| | DI | 0.052±0.000 | 0.065±0.000 | 0.151±0.003 |
| | NMI | 0.017±0.000 | 0.081±0.000 | 0.005±0.000 |
| | JC | 0.052±0.000 | 0.052±0.000 | 0.022±0.000 |

Table 3: Quality (SC, DI; higher is better) and diversity (NMI, JC; lower is better) of the competing methods on finding multiple co-clusterings. Each block of four rows corresponds to one dataset. Differences between MVMCC and each competitor were assessed with a pairwise t-test at the 95% significance level.

Table 3 reports the results of MVMCC, MultiCC, and MCC-NBMM on exploring multiple co-clusterings, whose number equals the number of views ($h = m$, unlike $h = 2$ in Table 2) because of the heterogeneity of the feature views. For this evaluation, we report the average quality and diversity over all pairs of the $h$ alternative co-clusterings. We can see that MVMCC significantly outperforms the two state-of-the-art multiple co-clustering methods across the different evaluation metrics and datasets. MultiCC sometimes obtains a better diversity than MVMCC.
This is because MultiCC directly optimizes the diversity on the sample-cluster and feature-cluster matrices, while our MVMCC optimizes the diversity indirectly, mainly through the sample-cluster matrices. These results prove the effectiveness of our solution in exploring multiple co-clusterings on multi-view data.

Following the experimental setup of Table 2, we conduct additional experiments to investigate the contribution of the shared matrix $U$ to the quality of the multiple clusterings. For this investigation, we introduce a variant MVMC(nU), which only uses $\{D_k\}_{k=1}^h$ to generate multiple clusterings and disregards the shared information matrix $U$.

| Dataset | Method | SC | DI | NMI | JC |
|---|---|---|---|---|---|
| Digits | MVMC(nU) | 0.061 | 0.076 | 0.008 | 0.050 |
| Digits | MVMC | 0.064 | 0.087 | 0.008 | 0.052 |
| Wiki article | MVMC(nU) | 0.059 | 0.088 | 0.005 | 0.051 |
| Wiki article | MVMC | 0.066 | 0.122 | 0.006 | 0.052 |

Table 4: Comparison results with/without the shared information matrix U for discovering multiple clusterings.

From the reported results on the Digits and Wiki article datasets in Table 4, we can see that the diversity (NMI and JC) of the two alternative clusterings remains almost the same when the shared information matrix $U$ is excluded, whereas the quality (SC and DI) is clearly reduced. This contrast shows that $U$ indeed improves the quality of the multiple clusterings, and justifies our motivation to seek $U$ across views and combine it with the individuality information matrices $D_k$ for multiple clusterings. In summary, the individuality information matrices help to generate diverse clusterings, and the commonality information matrix of multi-view data improves the quality of the clusterings. These experimental results also confirm our assumption that the individuality and commonality of multi-view data can be leveraged to generate diverse clusterings of high quality.

3.3 Parameter Analysis

$\lambda_1$ and $\lambda_2$ are two important input parameters of MVMC (MVMCC) for seeking the individuality and commonality information of multi-view data, and they consequently affect the quality and diversity of the multiple clusterings. We investigate the sensitivity of MVMC to these parameters by varying $\lambda_1$ (which controls diversity) and $\lambda_2$ (which controls quality) in the range $\{10^{-3}, 10^{-2}, \cdots, 10^{3}\}$. Figure 2 reports the quality (DI) and diversity (1-NMI, the larger the better) of MVMC on the Caltech-7 dataset. We have several interesting observations: (i) diversity (1-NMI) increases as $\lambda_1$ increases, but is much less sensitive to increases of $\lambda_2$ (see Figure 2(b)); likewise, quality (DI) increases as $\lambda_2$ increases, but is much less sensitive to increases of $\lambda_1$. (ii) Increasing $\lambda_1$ and $\lambda_2$ synchronously does not necessarily give the highest quality and diversity. (iii) When both $\lambda_1$ and $\lambda_2$ are fixed to a small value, both quality and diversity are reduced. This fact suggests that both the individuality and the commonality information of multi-view data should be used for the exploration of alternative clusterings. These observations again confirm the known dilemma between the diversity and quality of multiple clusterings. Values of $\lambda_1$ and $\lambda_2$ in $[10, 100]$ often provide the best balance between quality and diversity.

We vary $h$ from 2 to $2m$ on the Caltech-7 dataset to explore the variation of the average quality and diversity of the multiple clusterings generated by MVMC.
In Figure 3, as $h$ increases, the average quality of the multiple clusterings decreases gradually with small fluctuations, and the average diversity fluctuates within a small range. These patterns are due to the quality-diversity dilemma of multiple clusterings: increasing the number of alternative and diverse clusterings sacrifices some quality. Overall, we find that MVMC can explore $h \ge 2$ alternative clusterings with reasonable quality and diversity.

Figure 2: Quality (DI) and diversity (1-NMI) of MVMC vs. $\lambda_1$ and $\lambda_2$ on the Caltech-7 dataset.

Figure 3: Quality (DI) and diversity (NMI, the lower the better) of MVMC vs. $h$ (from 2 to $2m$) on the Caltech-7 dataset.

The time complexity of MVMC is $O(tn^2d(h^2v + 2hv + 2h))$, where $t$ is the number of iterations of the optimization. The experiments were conducted on a server running Ubuntu 16.04, with an Intel Xeon 8163 CPU and 1TB RAM; all methods are implemented in Matlab 2014a. A detailed runtime analysis can be found in the supplementary file due to space limitations.

4 Conclusion

In this paper, we proposed an approach to generate multiple clusterings (co-clusterings) from multi-view data, an interesting but largely overlooked clustering topic that encompasses both multi-view clustering and multiple clusterings. Our approach leverages the individuality and commonality of multi-view data to generate multiple clusterings, and outperforms state-of-the-art multiple clustering solutions. Our study confirms the existence of individuality and commonality in multi-view data, and their contribution to generating diverse clusterings of good quality. In the future, we plan to find a principled way to automatically determine the number of alternative clusterings, and to explore weighting schemes for the data views. The code of MVMC (MVMCC) is available at http://mlda.swu.edu.cn/codes.php?name=MVMC.

Acknowledgments

This work is supported by NSFC (61872300 and 61873214), the Fundamental Research Funds for the Central Universities (XDJK2019B024), the NSF of CQ CSTC (cstc2018jcyjAX0228 and cstc2016jcyjA0351), and the King Abdullah University of Science and Technology (KAUST), Saudi Arabia.

References

[Bae and Bailey, 2006] Eric Bae and James Bailey. COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In ICDM, pages 53–62, 2006.

[Bailey, 2013] James Bailey. Alternative clustering analysis: A review. In Charu Aggarwal and Chandan Reddy, editors, Data Clustering: Algorithms and Applications, pages 535–550. CRC Press, 2013.

[Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7(11):2399–2434, 2006.

[Cao et al., 2015] Xiaochun Cao, Changqing Zhang, Huazhu Fu, Si Liu, and Hua Zhang. Diversity-induced multi-view subspace clustering. In CVPR, pages 586–594, 2015.

[Caruana et al., 2006] Rich Caruana, Mohamed Elhawary, Nam Nguyen, and Casey Smith. Meta clustering. In ICDM, pages 107–118, 2006.

[Chao et al., 2017] Guoqing Chao, Shiliang Sun, and Jinbo Bi. A survey on multi-view clustering. arXiv preprint arXiv:1712.06246, 2017.

[Cui et al., 2007] Ying Cui, Xiaoli Z. Fern, and Jennifer G. Dy. Non-redundant multi-view clustering via orthogonalization. In ICDM, pages 133–142, 2007.
[Dang and Bailey, 2010] Xuan Hong Dang and James Bailey. Generation of alternative clusterings using the CAMI approach. In SDM, pages 118–129, 2010.

[Davidson and Qi, 2008] Ian Davidson and Zijie Qi. Finding alternative clusterings using constraints. In ICDM, pages 773–778, 2008.

[Ding et al., 2010] Chris H.Q. Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. TPAMI, 32(1):45–55, 2010.

[Gao et al., 2015] Hongchang Gao, Feiping Nie, Xuelong Li, and Heng Huang. Multi-view subspace clustering. In ICCV, pages 4238–4246, 2015.

[Gönen and Alpaydın, 2011] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. JMLR, 12(7):2211–2268, 2011.

[Gretton et al., 2005] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pages 63–77, 2005.

[Jain et al., 2008] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. Statistical Analysis and Data Mining, 1(3):195–210, 2008.

[Kumar and Daumé, 2011] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In ICML, pages 393–400, 2011.

[Li et al., 2015] Yeqing Li, Feiping Nie, Heng Huang, and Junzhou Huang. Large-scale multi-view spectral clustering via bipartite graph. In AAAI, 2015.

[Liu et al., 2013] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, 35(1):171–184, 2013.

[Luo et al., 2018] Shirui Luo, Changqing Zhang, Wei Zhang, and Xiaochun Cao. Consistent and specific multi-view subspace clustering. In AAAI, pages 3730–3737, 2018.

[Mautz et al., 2018] Dominik Mautz, Wei Ye, Claudia Plant, and Christian Böhm. Discovering non-redundant k-means clusterings in optimal subspaces. In KDD, pages 1973–1982, 2018.

[Monti et al., 2003] Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52(1-2):91–118, 2003.

[Niu et al., 2010] Donglin Niu, Jennifer G. Dy, and Michael I. Jordan. Multiple non-redundant spectral clustering views. In ICML, pages 831–838, 2010.

[Tao et al., 2018] Hong Tao, Chenping Hou, Xinwang Liu, Tongliang Liu, Dongyun Yi, and Jubo Zhu. Reliable multi-view clustering. In AAAI, pages 4123–4130, 2018.

[Tokuda et al., 2017] Tomoki Tokuda, Junichiro Yoshimoto, Yu Shimizu, et al. Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions. PLoS ONE, 12(10):e0186566, 2017.

[Wang et al., 2011] Hua Wang, Feiping Nie, Heng Huang, and Fillia Makedon. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In IJCAI, pages 1553–1558, 2011.

[Wang et al., 2018] Xing Wang, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Zhiwen Yu, and Zili Zhang. Multiple co-clusterings. In ICDM, pages 1308–1313, 2018.

[Wang et al., 2019] Xing Wang, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Guoqiang Xiao, and Maozu Guo. Multiple independent subspace clusterings. In AAAI, pages 1–8, 2019.

[Wright et al., 2010] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas S. Huang, and Shuicheng Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, 2010.

[Yang and Zhang, 2017] Sen Yang and Lijun Zhang. Non-redundant multiple clustering by nonnegative matrix factorization. Machine Learning, 106(5):695–712, 2017.
[Ye et al., 2016] Wei Ye, Samuel Maurus, Nina Hubig, and Claudia Plant. Generalized independent subspace clustering. In ICDM, pages 569–578, 2016.