# Robust Self-Weighted Multi-View Projection Clustering

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Beilei Wang,1 Yun Xiao,1 Zhihui Li,2,3 Xuanhong Wang,4 Xiaojiang Chen,1 Dingyi Fang1
1School of Information Science and Technology, Northwest University, Xi'an 710127, P.R. China
2School of Information Science and Engineering, Shandong Normal University, Jinan 250358, P.R. China
3School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
4School of Communications and Information Engineering, Xi'an University of Posts & Telecommunications, Xi'an 710121, P.R. China
wangbeilei@stumail.nwu.edu.cn, {yxiao, xjchen, dyf}@nwu.edu.cn, zhihuilics@gmail.com, wxh@xupt.edu.cn

## Abstract

Many real-world applications involve data collected from different views and with high dimensionality. Furthermore, multi-view data always contains unavoidable noise. Clustering such high-dimensional and noisy multi-view data remains a challenge because of the curse of dimensionality and the difficulty of effectively de-noising and integrating multiple views. To address this problem, we propose a Robust Self-weighted Multi-view Projection Clustering (RSwMPC) method based on the ℓ2,1-norm, which simultaneously reduces dimensionality, suppresses noise, and learns a local structure graph. The resulting optimal graph can be used directly for clustering, with no further processing required. In addition, a new mechanism is introduced to learn the optimal weight of each view automatically, without generating additional parameters to adjust the weights. Extensive experimental results on synthetic and real-world datasets demonstrate that the proposed algorithm outperforms other state-of-the-art methods in clustering performance and robustness.

## Introduction

In many real-world applications, instances can be described in many different ways or from many different angles, which produces abundant multi-view data, e.g., in web page classification and image processing. Clustering multi-view data is a fundamental and important topic in data mining, machine learning, pattern recognition, and related fields. In this era of information explosion, the dimensionality of data keeps increasing, and multi-view data always contains unavoidable noise. These factors directly increase data storage cost, raise the complexity of learning algorithms, and reduce their generalization ability. Clustering such high-dimensional and noisy multi-view data remains a challenge because of the curse of dimensionality and the difficulty of effectively de-noising and integrating multiple views.

In the past decades, a number of multi-view clustering approaches have been proposed. To maintain clustering consistency across all graphs, (Kumar, Rai, and Daumé 2011; Cai et al. 2011) proposed co-regularized and multi-modal spectral clustering methods, which cannot distinguish the importance of different views and are vulnerable to poor-quality views. To distinguish the importance of different views for the clustering result, (Nie et al. 2016; Nie, Cai, and Li 2017) proposed to assign an ideal weight to each view automatically without introducing redundant hyperparameters.
(Nie, Tian, and Li 2018) considered the difference in clustering ability among views and proposed multi-view clustering via adaptively weighted Procrustes. The above methods successfully solve the clustering problem on low-dimensional data, but significant challenges remain for high-dimensional data with noise. To process high-dimensional data, (Chen et al. 2018; Xiao et al. 2019) proposed to project the original matrix from the high-dimensional space to a low-dimensional subspace by learning a projection matrix, and obtained better clustering results in the single-view setting. For multi-view high-dimensional data, (Gao et al. 2015) proposed multi-view subspace clustering (MVSC), which performs clustering on the subspace representation of each view and uses a common indicator matrix to ensure consistency between views. (Wang et al. 2019) proposed two parameter-free weighted projected clustering methods, which simultaneously perform structured graph learning and dimensionality reduction. However, for noisy multi-view high-dimensional data, these existing algorithms project the data from the high-dimensional space to a low-dimensional subspace without considering the effect of noise in the datasets.

Since the ℓ2,1-norm has advantages in noise handling and feature selection (Yang et al. 2011), in this paper we combine it with dimensionality reduction to solve the clustering problem on high-dimensional and noisy multi-view datasets. We propose a Robust Self-weighted Multi-view Projection Clustering (RSwMPC) method based on the ℓ2,1-norm, which projects the original dataset into a low-dimensional subspace through a projection matrix and suppresses noise points and outliers through an ℓ2,1-norm penalty on the projection matrix. At the same time, in RSwMPC the weight of each view depends on the projection matrix and the similarity matrix, and no redundant parameters are introduced. Different from conventional post-processing methods (Tang, Lu, and Dhillon 2009; Kumar and Daumé 2011; Huang, Nie, and Huang 2013; Nie et al. 2016), which need a k-means-like step to obtain cluster labels, a rank constraint on the Laplacian matrix is also introduced, so that the learned affinity matrix has an explicit clustering structure and the clustering result can be obtained directly. Unlike SwMPC (Wang et al. 2019), we suppress noise by adding the ℓ2,1-norm penalty term on the projection matrix, and we select the dimensionality corresponding to the best clustering result in each view as the projection dimensionality. Extensive experimental results on synthetic and real-world datasets demonstrate that the proposed algorithm outperforms other state-of-the-art methods in clustering performance and robustness.

The contributions of this paper are as follows:

- In RSwMPC, subspace learning is performed by introducing a projection matrix, while feature selection and noise suppression are achieved through an ℓ2,1-norm penalty on the projection matrix. Therefore, RSwMPC is robust and effective.
- A parameter-free self-weighting strategy that combines the effective information of different views is proposed and used in RSwMPC. In addition, a direct search is adopted for dimensionality reduction, so the optimal reduced dimensionality of each view is obtained.
- The clustering results are obtained directly in the process of constructing the affinity matrix S, without further processing.
- We verify the effectiveness and robustness of our method through extensive experiments on synthetic and real-world datasets.

## The Proposed RSwMPC Methodology

### Notation

Throughout the paper, matrices are written in capital letters. For a matrix M ∈ R^{d×n}, m_i denotes the i-th column vector of M and m_{ij} its j-th element. The rank, trace, transpose, and inverse of M are denoted rank(M), Tr(M), M^T, and M^{-1}, respectively. The Frobenius norm and the ℓ2,1-norm of M are written ||M||_F and ||M||_{2,1}, and ||v||_2 denotes the ℓ2-norm of a vector v. 1 and I denote the all-ones column vector and the identity matrix, respectively.

### Multi-view Initial Affinity Matrix Learning

In multi-view clustering, let X_1, X_2, ..., X_v denote the data matrices of the views, where X_v ∈ R^{d_v×n} is the v-th view, n is the number of data points, and d_v is the feature dimensionality. Most existing graph-based multi-view clustering algorithms require a pre-constructed graph, and their clustering performance depends on the quality of that graph. One simple strategy is to concatenate the features of all views and then perform single-view clustering on the concatenated features. In this case, however, a view containing a larger amount of information is treated the same as views containing less information, so the final solution is not optimal. Therefore, we learn the similarity matrix with view weights. The problem of learning a unified similarity matrix can be written as follows (Nie, Cai, and Li 2017):

$$
\begin{aligned}
\min_{S,\,\alpha}\ & \sum_{v}\Big(\alpha_v \sum_{i,j=1}^{n} \|x_i^v - x_j^v\|_2^2\, s_{ij} + \lambda \alpha_v^2\Big) + \beta \|S\|_F^2,\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ \mathrm{rank}(L_S)=n-c,\ \alpha^T\mathbf{1}=1,\ 0\le \alpha_v\le 1,
\end{aligned} \quad (1)
$$

where s_i ∈ R^{n×1} is the i-th vector of the similarity matrix S ∈ R^{n×n} and s_{ij} is its j-th element. In spectral analysis, L_S = D_S − (S^T + S)/2 is the Laplacian matrix, where the degree matrix D_S is the diagonal matrix whose i-th diagonal element is Σ_j (s_{ij} + s_{ji})/2, and c is the number of zero eigenvalues of L_S. β is a tuning parameter; the term β||S||_F^2 avoids the trivial solution that assigns a similarity of 1 only to the nearest neighbor of each point x_i and 0 to all other points. λ is a non-negative value used to smooth the weight distribution. Because manually set weights are subjective, and assigning the same weight to every graph ignores the differences between features, such schemes are easily disturbed by poor-quality views, which leads to poor clustering accuracy.
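For a fixed weight vector, the row-wise subproblem behind objectives of this form has the closed-form adaptive-neighbors solution of (Nie, Wang, and Huang 2014), in which only the K nearest neighbors of each point receive nonzero similarity. The sketch below illustrates that construction; the function name and interface are ours, and D is assumed to hold the pairwise squared distances summed over views with initial weights α_v = 1, as in the initialization of Algorithm 1 later in the paper.

```python
import numpy as np

def init_similarity(D, K):
    """Adaptive-neighbors initialization of S from a pairwise distance matrix D (n x n):
    row i gets nonzero weights only on the K nearest neighbors of point i."""
    n = D.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                          # a point is not its own neighbor
        order = np.argsort(d)
        d_k1 = d[order[K]]                     # distance to the (K+1)-th nearest neighbor
        top = d[order[:K]]
        denom = K * d_k1 - top.sum()
        # closed form: s_ij = (d_{i,K+1} - d_ij) / (K d_{i,K+1} - sum_{h<=K} d_ih)
        S[i, order[:K]] = (d_k1 - top) / denom if denom > 0 else 1.0 / K
    return S
```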
### Robust Self-weighted Multi-view Projection Clustering

Motivated by the advantages of subspace learning for high-dimensional data, we introduce a projection matrix into our method. Define the projection matrix W_v ∈ R^{d_v×d'_v} with d'_v ≤ d_v, which projects the original data X_v of the v-th view into a low-dimensional subspace, where d'_v is the feature dimensionality of that subspace. The low-dimensional subspace can be represented as W_v^T X_v, which preserves the effective information of the data while alleviating the curse of dimensionality. The optimization objective based on subspace learning is:

$$
\begin{aligned}
\min_{S, W_v, \alpha_v}\ & \sum_{v}\Big(\alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \lambda \alpha_v^2 + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2,\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ \mathrm{rank}(L_S)=n-c,\\
& W_v^T X_v X_v^T W_v = I,\ \alpha^T\mathbf{1}=1,\ 0\le \alpha_v\le 1,
\end{aligned} \quad (2)
$$

where γ is a tuning parameter. ||W_v||_{2,1} is the ℓ2,1-norm of the projection matrix W_v, which is used to suppress noise and remove redundant features. The orthogonality constraint on the scatter matrix, W_v^T X_v X_v^T W_v = I, is used for intrinsic subspace learning: the d_v-dimensional feature space of the original data X_v is converted into a statistically uncorrelated d'_v-dimensional intrinsic subspace.

To solve the weight-distribution problem, we propose the following Robust Self-weighted Multi-view Projection Clustering model:

$$
\begin{aligned}
\min_{S, W_v}\ & \sum_{v}\Big(\big(\sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij}\big)^{1/2} + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2,\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ \mathrm{rank}(L_S)=n-c,\ W_v^T X_v X_v^T W_v = I.
\end{aligned} \quad (3)
$$

The Lagrange function of problem (3) can be written as

$$
\mathcal{L} = \sum_{v}\Big(\big(\sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij}\big)^{1/2} + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2 + G(\Lambda_S, S) + G(\Lambda_W, W_v), \quad (4)
$$

where Λ_S and Λ_W are Lagrange multipliers and G(Λ_S, S), G(Λ_W, W_v) are the formalized terms derived from the constraints. Taking the derivative of problem (4) with respect to S and setting it to zero, we have

$$
\sum_{v} \alpha_v\, \frac{\partial \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij}}{\partial S} + \frac{\partial\, \beta\|S\|_F^2}{\partial S} + \frac{\partial\, G(\Lambda_S, S)}{\partial S} = 0, \quad (5)
$$

where

$$
\alpha_v = \frac{1}{2\big(\sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij}\big)^{1/2}}. \quad (6)
$$

Since α_v depends on the target variables S and W_v, problem (5) cannot be solved directly. However, if α_v is regarded as stationary, problem (5) can be seen as the optimality condition of the following problem:

$$
\begin{aligned}
\min_{S, W_v}\ & \sum_{v}\Big(\alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2,\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ \mathrm{rank}(L_S)=n-c,\ W_v^T X_v X_v^T W_v = I.
\end{aligned} \quad (7)
$$

We update each W_v and S through problem (7). The weight α_v of each view depends on W_v and S, so α_v is updated at the same time. This solves the problem of linearly combining different view weights without introducing extra parameters; that is, solving problem (7) is equivalent to solving problem (3). By solving problem (7), the similarity matrix S can be learned directly for clustering.

## Optimization Algorithm

To solve problem (7), we first reformulate it using (Fan 1949) and then update the variables iteratively until convergence.

### Optimization Objective Function

In problem (7), let σ_i(L_S) denote the i-th smallest eigenvalue of L_S. Because the Laplacian matrix L_S is positive semidefinite, σ_i(L_S) is non-negative, i.e., Σ_{i=1}^{c} σ_i(L_S) ≥ 0. Hence, given a sufficiently large η that drives Σ_{i=1}^{c} σ_i(L_S) to zero and thereby enforces the rank constraint rank(L_S) = n − c, problem (7) can be written as

$$
\begin{aligned}
\min_{S, W_v}\ & \sum_{v}\Big(\alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2 + 2\eta \sum_{i=1}^{c}\sigma_i(L_S),\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ W_v^T X_v X_v^T W_v = I.
\end{aligned} \quad (8)
$$

According to Ky Fan's theorem, the following equation holds:

$$
\sum_{i=1}^{c}\sigma_i(L_S) = \min_{F \in \mathbb{R}^{n\times c},\, F^T F = I} \mathrm{Tr}(F^T L_S F). \quad (9)
$$

Based on (9), problem (8) is further equivalent to

$$
\begin{aligned}
\min_{S, W_v, F}\ & \sum_{v}\Big(\alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \gamma \|W_v\|_{2,1}\Big) + \beta \|S\|_F^2 + 2\eta\, \mathrm{Tr}(F^T L_S F),\\
\text{s.t.}\ & s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1,\ F \in \mathbb{R}^{n\times c},\ F^T F = I,\ W_v^T X_v X_v^T W_v = I.
\end{aligned} \quad (10)
$$

Problem (10) can be optimized with an efficient iterative algorithm.

**i. Update F.** When S, W_v, and α_v are fixed, problem (10) is equivalent to

$$
\min_{F \in \mathbb{R}^{n\times c},\, F^T F = I} \mathrm{Tr}(F^T L_S F). \quad (11)
$$

The optimal F consists of the c eigenvectors of L_S corresponding to its c smallest eigenvalues.
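To make this eigen-step concrete, the following is a minimal numpy sketch (the function name and interface are ours): it builds L_S from S exactly as defined above and returns the c eigenvectors associated with the smallest eigenvalues.

```python
import numpy as np

def update_F(S, c):
    """F-step sketch: take the c eigenvectors of L_S with smallest eigenvalues.

    S is an n x n (possibly asymmetric) similarity matrix; c is the number of
    clusters. The Laplacian follows the definition L_S = D_S - (S + S^T)/2."""
    A = (S + S.T) / 2.0                      # symmetrized affinity
    D = np.diag(A.sum(axis=1))               # degree matrix D_S
    L = D - A                                # Laplacian L_S
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    F = eigvecs[:, :c]                       # c smallest -> F with F^T F = I
    return F, eigvals[:c]
```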
**ii. Update W.** When S, F, and α_v are fixed, problem (10) is equivalent to

$$
\begin{aligned}
\min_{W_v}\ & \sum_{v}\Big(\alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \gamma \|W_v\|_{2,1}\Big),\\
\text{s.t.}\ & W_v \in \mathbb{R}^{d_v\times d'_v},\ W_v^T X_v X_v^T W_v = I.
\end{aligned} \quad (12)
$$

Since problem (12) is independent for different v, it reduces to solving the following problem for each view:

$$
\min_{W}\ \alpha \sum_{i,j} \|W^T x_i - W^T x_j\|_2^2\, s_{ij} + \gamma \|W\|_{2,1},\quad
\text{s.t.}\ W \in \mathbb{R}^{d\times d'},\ W^T X X^T W = I. \quad (13)
$$

If W^T x_i ∈ R^{d'×1} is regarded as the value of node i, the following equation holds:

$$
\alpha \sum_{i,j} \|W^T x_i - W^T x_j\|_2^2\, s_{ij} = 2\,\mathrm{Tr}\big(W^T (\alpha X L_S X^T) W\big). \quad (14)
$$

So problem (13) can be written as

$$
\min_{W}\ \mathrm{Tr}\big(W^T (\alpha X L_S X^T) W\big) + \gamma \|W\|_{2,1},\quad
\text{s.t.}\ W \in \mathbb{R}^{d\times d'},\ W^T X X^T W = I. \quad (15)
$$

Because of the ℓ2,1-norm, problem (15) is difficult to solve directly, so we use the Lagrange multiplier method. The Lagrangian of problem (15) is

$$
\mathcal{L}(W) = \mathrm{Tr}(W^T \alpha X L_S X^T W) + \gamma \|W\|_{2,1} - \mathrm{Tr}\big(\Lambda (W^T X X^T W - I)\big), \quad (16)
$$

where Λ is a diagonal matrix that enforces the constraint in problem (16). Setting the derivative of L(W) to zero gives

$$
\frac{\partial \mathcal{L}(W)}{\partial W} = (A + A^T) W + 2\gamma D W - 2 B W \Lambda = 0, \quad (17)
$$

where D = diag(1/(4||W̃_(1,:)||_2), ..., 1/(4||W̃_(d,:)||_2)), W̃ denotes the current solution, A = αX L_S X^T, and B = X X^T. Since A = A^T, problem (17) can be simplified to

$$
B^{-1}(A + \gamma D)\, W = W \Lambda. \quad (18)
$$

Let Q denote B^{-1}(A + γD); then problem (18) can be abbreviated as QW = WΛ. Since Λ is a diagonal matrix, the characteristic equation is Q ω_i = λ_i ω_i (i = 1, 2, ..., d'), and W is composed of the eigenvectors corresponding to the d' smallest eigenvalues, excluding the zero eigenvalues. Here λ_i is the i-th smallest non-zero eigenvalue of the eigen-equation, and ω_i is the corresponding eigenvector.

**iii. Update S.** When W_v, F, and α_v are fixed, problem (10) is equivalent to

$$
\min_{S}\ \sum_{v} \alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \beta \|S\|_F^2 + 2\eta\, \mathrm{Tr}(F^T L_S F),\quad
\text{s.t.}\ s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1. \quad (19)
$$

Let Z_v = X_v^T W_v; then problem (19) can be rewritten as

$$
\min_{S}\ \sum_{i,j}\Big(\sum_{v} \alpha_v \|z_i^v - z_j^v\|_2^2\, s_{ij} + \beta s_{ij}^2 + \eta \|f_i - f_j\|_2^2\, s_{ij}\Big),\quad
\text{s.t.}\ s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1, \quad (20)
$$

where z_i^v is the i-th row vector of Z_v and f_i is the i-th row vector of F. Denote

$$
d^z_{ij} = \sum_{v} \alpha_v \|z_i^v - z_j^v\|_2^2, \qquad d^f_{ij} = \|f_i - f_j\|_2^2. \quad (21)
$$

Since problem (20) is independent with respect to each i, it is equivalent to solving, for each i,

$$
\min_{s_i}\ \sum_{j=1}^{n}\big(d^z_{ij} s_{ij} + \beta s_{ij}^2 + \eta d^f_{ij} s_{ij}\big),\quad
\text{s.t.}\ s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1. \quad (22)
$$

Denote d_i ∈ R^{n×1} as the vector whose j-th element is d_{ij} = d^z_{ij} + η d^f_{ij}; then the above problem can be written as

$$
\min_{s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1}\ \Big\|s_i + \frac{1}{2\beta} d_i\Big\|_2^2. \quad (23)
$$

The solution of problem (23) has been given by (Nie, Wang, and Huang 2014); by updating each s_i we obtain a matrix S with k strongly connected subgraphs.
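Problem (23) is the Euclidean projection of the vector −d_i/(2β) onto the probability simplex (on which the constraint s_{ij} ≤ 1 holds automatically). The sketch below uses a standard sorting-based projection to illustrate this step; it is not necessarily the exact routine used in (Nie, Wang, and Huang 2014), and the function names are ours.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {s : s >= 0, sum(s) = 1} (sorting-based)."""
    n = v.size
    u = np.sort(v)[::-1]                  # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def update_s_row(dz_i, df_i, beta, eta):
    """S-step sketch for one row: solve problem (23) with d_i = dz_i + eta * df_i."""
    d_i = dz_i + eta * df_i
    return project_to_simplex(-d_i / (2.0 * beta))
```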
**iv. Update α_v, β, η and γ.** The update of the weight coefficient α_v has been given in problem (6); it depends on W_v and S. The value of the regularization parameter β can range from 0 to infinity. In this paper, choosing β is equivalent to solving the following problem:

$$
\min_{S}\ \sum_{v} \alpha_v \sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij} + \beta \|S\|_F^2,\quad
\text{s.t.}\ s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1. \quad (24)
$$

So we can determine the value of the regularization parameter β following (Nie, Wang, and Huang 2014):

$$
\beta = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{K}{2}\, d_{i,K+1} - \frac{1}{2}\sum_{j=1}^{K} d_{ij}\Big), \quad (25)
$$

where K is the nearest neighbor number and the distances d_{ij} of each point are sorted in ascending order. The initial value of η can be set to β. Algorithm 1 summarizes the detailed steps for solving problem (7). Parameter η is updated by Algorithm 1 until the number of connected components of the Laplacian matrix L_S equals the clustering number k. The value of parameter γ is given later in the experiments.

**Algorithm 1** Optimization in problem (7)
- **Input:** X = {X_1, X_2, ..., X_v}, X_v ∈ R^{d_v×n}; clustering number k; parameters β, η and γ.
- **Output:** affinity matrix S ∈ R^{n×n} with exactly c connected components, where c = k; projection matrices W = {W_1, W_2, ..., W_v}, W_v ∈ R^{d_v×d'_v}.
- Initialize each vector s_i of S by the optimal solution of the following problem, with the initial value of α_v set to 1:
  $$\min_{s_i^T\mathbf{1}=1,\ 0\le s_{ij}\le 1}\ \sum_{j}\Big(\sum_{v}\|x_i^v - x_j^v\|_2^2\, s_{ij} + \beta s_{ij}^2\Big)$$
- Initialize W_v by the optimal solution of the following problem:
  $$\min_{W_v}\ \mathrm{Tr}(W_v^T X_v L_S X_v^T W_v),\quad \text{s.t.}\ W_v^T X_v X_v^T W_v = I$$
- **repeat**
  1. Update α_v by solving problem (6);
  2. Update F by solving problem (11);
  3. Update the projection matrix W_v of each view by solving problem (18);
  4. Update each row vector of S by solving problem (23).
- **until** convergence

**v. Time-complexity Analysis.** In each iteration, the time complexities of updating the view weight coefficient α_v, the matrix F, the affinity matrix S and the projection matrix W are O(d'dn), O(n²), O(c²) and O(d²), respectively. Usually c ≪ n and d' ≪ d, and the number of iterations is less than 10, so the time complexity of our algorithm is O(d'dn + n² + d² + c²) ≈ O(n² + d² + dn).

### Convergence Analysis

To prove the convergence of Algorithm 1, we use the following lemma from (Nie et al. 2010).

**Lemma 1.** For any positive numbers u and v, the following inequality holds:

$$
\sqrt{u} - \frac{u}{2\sqrt{v}} \le \sqrt{v} - \frac{v}{2\sqrt{v}}. \quad (26)
$$

**Theorem 1.** In each iteration of Algorithm 1, the updated S decreases the objective value of problem (3), which makes problem (3) convergent.

*Proof.* Let S̃ denote the S updated in an iteration and let y^v_{ij} denote ||W_v^T x_i^v − W_v^T x_j^v||_2^2. Since S̃ minimizes problem (7) with α_v fixed, it is easy to see that

$$
\sum_{v}\frac{\sum_{i,j} y^v_{ij}\,\tilde{s}_{ij}}{2\big(\sum_{i,j} y^v_{ij}\, s_{ij}\big)^{1/2}} + \beta\|\tilde{S}\|_F^2
\ \le\
\sum_{v}\frac{\sum_{i,j} y^v_{ij}\, s_{ij}}{2\big(\sum_{i,j} y^v_{ij}\, s_{ij}\big)^{1/2}} + \beta\|S\|_F^2. \quad (27)
$$

According to Lemma 1, we have

$$
\Big(\sum_{i,j} y^v_{ij}\,\tilde{s}_{ij}\Big)^{1/2} - \frac{\sum_{i,j} y^v_{ij}\,\tilde{s}_{ij}}{2\big(\sum_{i,j} y^v_{ij}\, s_{ij}\big)^{1/2}}
\ \le\
\Big(\sum_{i,j} y^v_{ij}\, s_{ij}\Big)^{1/2} - \frac{\sum_{i,j} y^v_{ij}\, s_{ij}}{2\big(\sum_{i,j} y^v_{ij}\, s_{ij}\big)^{1/2}}. \quad (28)
$$

Summing (28) over all views and adding it to (27) gives

$$
\sum_{v}\Big(\sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\,\tilde{s}_{ij}\Big)^{1/2} + \beta\|\tilde{S}\|_F^2
\ \le\
\sum_{v}\Big(\sum_{i,j} \|W_v^T x_i^v - W_v^T x_j^v\|_2^2\, s_{ij}\Big)^{1/2} + \beta\|S\|_F^2. \quad (29)
$$

Thus the convergence of Algorithm 1 is proved. ∎

## Experiments

In this section, we demonstrate the effectiveness and robustness of the proposed method on synthetic and real-world datasets and compare it with other multi-view clustering algorithms. We use three evaluation indicators to evaluate clustering performance: Clustering Accuracy (ACC), Normalized Mutual Information (NMI), and Purity.

### Synthetic Datasets

In this part, we design two datasets, toy1 and toy2, to test the effectiveness of the proposed algorithm. Both are synthetic datasets with two views, and each view is a 90×90 matrix containing three 30×30 diagonal block matrices. In these two datasets, the entries inside a block represent the correlation between the two corresponding points of a cluster and are randomly set to numbers between 0 and 1, while the entries outside the blocks represent noise and are randomly set to numbers between 0 and e. In the toy1 dataset, the noise levels of the first and second views are e = 0.6 and e = 1.0, respectively. In the toy2 dataset, for the first view the initial noise is e = 0.6 and e = 0.8 is used around the first and second block matrices; for the second view the initial noise is e = 0.7 and e = 1.0 is used around the second and third block matrices.
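The following sketch shows one way such a toy view could be generated under the description above; any details beyond those stated (uniform sampling, random seeds) are our assumptions, and the per-block noise levels of toy2 would be set analogously.

```python
import numpy as np

def make_toy_view(block_size=30, n_blocks=3, e=0.6, seed=0):
    """Build one 90 x 90 toy view: in-block entries in [0, 1], off-block noise in [0, e]."""
    rng = np.random.default_rng(seed)
    n = block_size * n_blocks
    X = rng.uniform(0.0, e, size=(n, n))          # background noise outside the blocks
    for b in range(n_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        X[lo:hi, lo:hi] = rng.uniform(0.0, 1.0, size=(block_size, block_size))
    return X

# toy1: two views with different noise levels, as described in the text
view1 = make_toy_view(e=0.6, seed=1)
view2 = make_toy_view(e=1.0, seed=2)
```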
In these two datasets, the data in the block represents the correlation between the two corresponding points in a cluster and is randomly set to the number between 0 and 1, and the data outside the block represents noise and is randomly set to the number between 0 and e. In toy1 dataset, the noise value of the first and the second view are set as e = 0.6 and e = 1.0 respectively. In toy2 dataset, for the first view, the initial noise is e = 0.6, the e = 0.8 is used for the first and second block matrices; for the second view, the initial noise is e = 0.7, and e = 1.0 is used for the second and third block matrices. Figure 1 and Figure 2 show the two original views in the toy1 and toy2 datasets and the grayscale images obtained by clustering with MLAN, Sw MC, RAMC, Sw MPC and our RSw MPC. The experimental results show that our RSw MPC can achieve best clustering performance which ACC, NMI and Purity are all 1. On the toy1 dataset, these algorithms are able to get a clean block diagonal matrix, while RSw MPC get the clearest block diagonal matrix. On the toy2 dataset, after adding noise, all algorithms except RAMC and RSw MPC are invalid. In comparison, the matrix obtained by our RSw MPC is cleaner than others, which also verifies the robustness of RSw MPC. It indicates that our RSw MPC arrives at the optimal solution with different noise. Real-world Datasets In Table 1, we briefly summarize the datasets used in this paper, namely MSRC-v1 (Winn and Jojic 2005), Caltech101 (Fei-Fei, Fergus, and Perona 2007) (here, we use two regular subsets of Caltech101-7 and Caltech101-20), Handwritten numerals (HW) (Asuncion and Newman 2007), NUSWIDE (Chua et al. 2009) (We select 12 categories of animal images, and the first 200 images are selected for each category), ORL Face (Samaria and Harter 1994) datasets. On the six real-world datasets, we compare our proposed algorithm with the following algorithms: Co-regularized Multi-view Spectral Clustering (Kumar, Rai, and Daume 2011) (Co-reg), Robust Multi-view Spectral Clustering (Xia et al. 2014) (RMSC), Parameter-Free Auto-Weighted Multiple Graph Learning (Nie et al. 2016) (AMGL), Multi View Clustering with Adaptive Neighbours(Nie, Cai, and Li 2017) (MLAN), Self-weighted Multi-view Clustering (Nie et al. 2017) (Sw MC), Robust Auto-Weighted Multi-Feature Clustering (Ren et al. 2018) (RAMC) and Parameter-Free Weighted Multi-View Projected Clustering (Wang et al. 2019) (Sw MPC). For our proposed algorithm, except that the neighbor numbers K of the MSRC-v1 and ORL Face datasets are set to 30 and 5, respectively, the rest is 15. The parameter γ is searched between 1e 6 to 1e6, and the step size is 0.5. In addition, we search for the dimensionality d corresponding to the best clustering result by searching all the dimensionalities of each view in the original dataset. For the sake of simplicity, we set the step size of the parameter γ to 3. In order to distinguish the proportion of different views in the overall clustering performance, for each feature, we initialize αv to a random weight between 0 and 1. For other algorithms, we set the parameters to be optimal. All the experiments are repeated for 20 times to obtain the mean and standard deviation of clustering results. Table 2 shows the clustering results (including standard deviation) for all methods, and the best results have been marked in bold. It is easy to see that our algorithm has the best (at least the same) ACC, NMI, Purity on all six datasets. 
Table 2 shows the clustering results (mean and standard deviation) for all methods. It is easy to see that our algorithm achieves the best (or at least equal) ACC, NMI and Purity on all six datasets. Most notably, the ACC of our algorithm on the Caltech101-7 and Caltech101-20 datasets increases by about 9.5% and 9.9%, and on the ORL Face dataset the ACC, NMI and Purity of our algorithm all reach the optimal value of 1. The clustering results on the other datasets are also improved. In addition, it can be seen from Table 2 that the standard deviations of our algorithm on all datasets are below 0.01, with a maximum of 0.0071, which verifies the efficiency and stability of the proposed algorithm. This is because our algorithm applies dimensionality reduction to the high-dimensional data, handles the noise in the datasets through the ℓ2,1-norm, and adopts a parameter-free self-weighting method, thereby successfully solving the clustering problem on high-dimensional and noisy multi-view datasets.

Table 2: Clustering performance comparison on the MSRC-v1, Caltech101-7(20), HW, NUS-WIDE and ORL Face datasets (mean ± standard deviation). The overall performance of our proposed algorithm is better than that of the other methods.

| Method | MSRC-v1 ACC | MSRC-v1 NMI | MSRC-v1 Purity | Caltech101-7 ACC | Caltech101-7 NMI | Caltech101-7 Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Co-reg | 0.6271±0.0123 | 0.5399±0.0099 | 0.6512±0.0105 | 0.5011±0.0067 | 0.3867±0.0042 | 0.8104±0.0022 |
| RMSC | 0.8020±0.0212 | 0.6941±0.0114 | 0.8110±0.0145 | 0.4409±0.0101 | 0.3814±0.0035 | 0.8033±0.0022 |
| AMGL | 0.7095±0.0535 | 0.6686±0.0330 | 0.7390±0.0384 | 0.6538±0.0558 | 0.5393±0.0482 | 0.8463±0.0182 |
| MLAN | 0.6810±0.0000 | 0.6299±0.0000 | 0.7333±0.0000 | 0.7802±0.0000 | 0.6304±0.0000 | 0.8894±0.0000 |
| SwMC | 0.8714±0.0000 | 0.7861±0.0000 | 0.8714±0.0000 | 0.6472±0.0000 | 0.5335±0.0000 | 0.8365±0.0000 |
| RAMC | 0.8871±0.0100 | 0.8150±0.0169 | 0.8871±0.0100 | 0.7497±0.0408 | 0.6365±0.0243 | 0.8808±0.0051 |
| SwMPC | 0.5571±0.0000 | 0.5900±0.0000 | 0.6476±0.0000 | 0.7436±0.0000 | 0.5399±0.0000 | 0.8562±0.0000 |
| RSwMPC | 0.9071±0.0024 | 0.8293±0.0071 | 0.9071±0.0024 | 0.8750±0.0008 | 0.7135±0.0015 | 0.9193±0.0000 |

| Method | Caltech101-20 ACC | Caltech101-20 NMI | Caltech101-20 Purity | HW ACC | HW NMI | HW Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Co-reg | 0.4470±0.0065 | 0.5228±0.0040 | 0.7356±0.0030 | 0.7199±0.0107 | 0.6873±0.0057 | 0.7465±0.0086 |
| RMSC | 0.3662±0.0070 | 0.5037±0.0032 | 0.7198±0.0042 | 0.8640±0.0118 | 0.8112±0.0056 | 0.8730±0.0087 |
| AMGL | 0.5067±0.0496 | 0.5332±0.0263 | 0.6711±0.0189 | 0.8007±0.0562 | 0.8472±0.0307 | 0.8320±0.0427 |
| MLAN | 0.5258±0.0070 | 0.4744±0.0025 | 0.6660±0.0000 | 0.9727±0.0006 | 0.9384±0.0009 | 0.9727±0.0006 |
| SwMC | 0.5432±0.0000 | 0.4514±0.0000 | 0.6676±0.0000 | 0.7523±0.0531 | 0.8412±0.0332 | 0.7865±0.0429 |
| RAMC | 0.5928±0.0376 | 0.6105±0.0222 | 0.7228±0.0110 | 0.8574±0.0013 | 0.8923±0.0016 | 0.8808±0.0008 |
| SwMPC | 0.5281±0.0000 | 0.4787±0.0000 | 0.6668±0.0000 | 0.9605±0.0000 | 0.9222±0.0000 | 0.9605±0.0000 |
| RSwMPC | 0.6918±0.0015 | 0.6451±0.0032 | 0.7875±0.0026 | 0.9822±0.0003 | 0.9576±0.0008 | 0.9822±0.0003 |

| Method | NUS-WIDE ACC | NUS-WIDE NMI | NUS-WIDE Purity | ORL Face ACC | ORL Face NMI | ORL Face Purity |
| --- | --- | --- | --- | --- | --- | --- |
| Co-reg | 0.2772±0.0019 | 0.1743±0.0012 | 0.2974±0.0017 | 0.7732±0.0095 | 0.9181±0.0038 | 0.8261±0.0074 |
| RMSC | 0.2546±0.0052 | 0.1632±0.0016 | 0.2870±0.0023 | 0.7251±0.0132 | 0.8566±0.0062 | 0.7590±0.0117 |
| AMGL | 0.1605±0.0250 | 0.0799±0.0247 | 0.1615±0.0250 | 0.9605±0.0203 | 0.9874±0.0070 | 0.9698±0.0149 |
| MLAN | 0.2108±0.0100 | 0.1475±0.0063 | 0.2356±0.0096 | 0.9475±0.0000 | 0.9761±0.0000 | 0.9575±0.0000 |
| SwMC | 0.1488±0.0000 | 0.0818±0.0000 | 0.1629±0.0000 | 0.9625±0.0000 | 0.9906±0.0000 | 0.9750±0.0000 |
| RAMC | 0.2015±0.0062 | 0.1217±0.0056 | 0.2155±0.0075 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000 |
| SwMPC | 0.1679±0.0000 | 0.0899±0.0000 | 0.1845±0.0000 | 0.9475±0.0000 | 0.9802±0.0000 | 0.9600±0.0000 |
| RSwMPC | 0.2778±0.0060 | 0.1810±0.0061 | 0.2974±0.0049 | 1.0000±0.0000 | 1.0000±0.0000 | 1.0000±0.0000 |

### Robustness Assessment

In this part, we verify the robustness of the algorithm by adding different proportions of noise to the real-world datasets. First, we add different proportions of noise to the original dataset to construct a group of noisy datasets. Let r be the noise ratio (r ranges from 0 to 0.5 with a step size of 0.05) and n the number of data points in the original dataset. We randomly select n·r data points and add normally distributed noise with a mean of 300 and a standard deviation of 30 to them, forming a set of noisy datasets.
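A sketch of this corruption step under our reading of the protocol (the paper does not specify which features of a selected point are perturbed, so here noise is added to all of them):

```python
import numpy as np

def add_noise(X, r, mean=300.0, std=30.0, seed=0):
    """Corrupt a ratio r of the n data points (columns of the d x n matrix X)
    with additive Gaussian noise N(mean, std^2), as in the robustness test."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    idx = rng.choice(n, size=int(round(n * r)), replace=False)   # points to corrupt
    X_noisy = X.copy()
    X_noisy[:, idx] += rng.normal(mean, std, size=(d, idx.size))
    return X_noisy
```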
Due to space constraints, we only show the robustness comparison on the HW dataset, and for simplicity Figure 3 only shows the top four algorithms in clustering performance. From Figure 3 we can see that as the noise ratio increases, the performance of all algorithms decreases, but our algorithm always stays in the lead, and its performance gain becomes more significant. When the random noise ratio increases from 0 to 0.5, the performance gains of our algorithm in ACC, NMI and Purity increase from 12.57%, 6.6% and 11.05% to 29.95%, 28.29% and 35.77% compared with RAMC. Compared with MLAN, the gains in ACC, NMI and Purity increase from 0.32%, 0.96% and 0.32% to 102.34%, 139.27% and 93.65%, respectively. Compared with AMGL, the gains in ACC, NMI and Purity increase from 19.04%, 10.45% and 15.86% to 48.58%, 44.02% and 51.27%, respectively. These results show that our algorithm can suppress noise and maintain good clustering performance when the dataset contains noise.

Figure 3: Robustness comparison with the top four algorithms on the HW dataset. Compared with the other three algorithms, our algorithm still maintains good performance as the noise increases.

## Conclusion

In this paper, a new robust self-weighted multi-view projection clustering algorithm is proposed. It simultaneously learns the projection matrices, the similarity matrix and the weight coefficients to obtain low-dimensional subspaces with a clustering structure. The ℓ2,1-norm penalty on the projection matrix not only suppresses noise but also makes its rows sparse and easy to solve for. At the same time, the obtained optimal graph can be used directly for clustering without further processing. Experiments on synthetic and real-world datasets demonstrate the superiority and robustness of our method for processing high-dimensional data with noise. In future work, the framework can be extended to semi-supervised clustering.

## Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 61602379, 61972315 and 61906109) and the Shaanxi Science and Technology Innovation Team Support Project under grant agreement No. 2018TD-O26. The authors are grateful to the anonymous reviewers for their constructive advice on the revision of the manuscript.

## References

Asuncion, A., and Newman, D. 2007. UCI machine learning repository.

Cai, X.; Nie, F.; Huang, H.; and Kamangar, F. 2011. Heterogeneous image feature integration via multi-modal spectral clustering. In CVPR 2011, 1977-1984. IEEE.

Chen, X.; Yuan, G.; Wang, W.; Nie, F.; Chang, X.; and Huang, J. Z. 2018. Local adaptive projection framework for feature selection of labeled and unlabeled data. IEEE Transactions on Neural Networks and Learning Systems 29(12):6362-6373.

Chua, T. S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from National University of Singapore. In ACM International Conference on Image & Video Retrieval.

Fan, K. 1949. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences of the United States of America 35(11):652.
Fei-Fei, L.; Fergus, R.; and Perona, P. 2007. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1):59-70.

Gao, H.; Nie, F.; Li, X.; and Huang, H. 2015. Multi-view subspace clustering. In Proceedings of the IEEE International Conference on Computer Vision, 4238-4246.

Huang, J.; Nie, F.; and Huang, H. 2013. Spectral rotation versus k-means in spectral clustering. In Twenty-Seventh AAAI Conference on Artificial Intelligence.

Kumar, A., and Daumé, H. 2011. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 393-400.

Kumar, A.; Rai, P.; and Daumé, H. 2011. Co-regularized multi-view spectral clustering. In Advances in Neural Information Processing Systems, 1413-1421.

Nie, F.; Huang, H.; Cai, X.; and Ding, C. H. 2010. Efficient and robust feature selection via joint ℓ2,1-norms minimization. In Advances in Neural Information Processing Systems, 1813-1821.

Nie, F.; Li, J.; Li, X.; et al. 2016. Parameter-free auto-weighted multiple graph learning: A framework for multiview clustering and semi-supervised classification. In IJCAI, 1881-1887.

Nie, F.; Li, J.; Li, X.; et al. 2017. Self-weighted multiview clustering with multiple graphs. In IJCAI, 2564-2570.

Nie, F.; Cai, G.; and Li, X. 2017. Multi-view clustering and semi-supervised classification with adaptive neighbours. In Thirty-First AAAI Conference on Artificial Intelligence.

Nie, F.; Tian, L.; and Li, X. 2018. Multiview clustering via adaptively weighted Procrustes. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2022-2030. ACM.

Nie, F.; Wang, X.; and Huang, H. 2014. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 977-986. ACM.

Ren, P.; Xiao, Y.; Xu, P.; Guo, J.; Chen, X.; Wang, X.; and Fang, D. 2018. Robust auto-weighted multi-view clustering. In IJCAI, 2644-2650.

Samaria, F. S., and Harter, A. C. 1994. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, 138-142. IEEE.

Tang, W.; Lu, Z.; and Dhillon, I. S. 2009. Clustering with multiple graphs. In 2009 Ninth IEEE International Conference on Data Mining, 1016-1021. IEEE.

Wang, R.; Nie, F.; Wang, Z.; Hu, H.; and Li, X. 2019. Parameter-free weighted multi-view projected clustering with structured graph learning. IEEE Transactions on Knowledge and Data Engineering.

Winn, J. M., and Jojic, N. 2005. LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of the International Conference on Computer Vision.

Xia, R.; Pan, Y.; Du, L.; and Yin, J. 2014. Robust multi-view spectral clustering via low-rank and sparse decomposition. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Xiao, Y.; Ren, P.; Li, Z.; Chen, X.; Wang, X.; and Fang, D. 2019. RS3CIS: Robust single-step spectral clustering with intrinsic subspace. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5482-5489.

Yang, Y.; Shen, H. T.; Ma, Z.; Huang, Z.; and Zhou, X. 2011. ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In Twenty-Second International Joint Conference on Artificial Intelligence.