# Embedded Feature Selection on Graph-Based Multi-View Clustering

Wenhui Zhao¹, Guangfei Li¹, Haizhou Yang¹, Quanxue Gao¹*, Qianqian Wang¹

¹ School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China
whzhao@stu.xidian.edu.cn, liguangfei dream@hotmail.com, leoyhz@qq.com, qxgao@xidian.edu.cn, qqwang@xidian.edu.cn

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, anchor graph-based multi-view clustering has proven highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods require post-processing to obtain clustering labels and cannot effectively utilize the information within anchor graphs. To solve these problems, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) approach to improve clustering performance. Our method decomposes anchor graphs, taking advantage of their memory efficiency, to obtain clustering labels in a single step without post-processing. Furthermore, we introduce the $\ell_{2,p}$-norm for graph-based feature selection, which selects the most relevant data for efficient graph factorization. Lastly, we employ the tensor Schatten p-norm as a tensor-rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches.

## Introduction

Over the past few decades, there has been immense interest in developing numerous exceptional clustering algorithms, including subspace-based clustering (Luo et al. 2018; Xie et al. 2020), non-negative matrix factorization clustering (Gao et al. 2013; Salah, Ailem, and Nadif 2018), and graph-based clustering (Hu et al. 2020; Nie, Li, and Li 2017). Notably, graph-based clustering methods have been widely developed due to their excellent performance in capturing the spatial structure of nonlinear data.

The key step in graph-based clustering is to construct an $N \times N$ affinity graph matrix that represents the similarity between the $N$ data points. However, this operation is time-consuming and memory-intensive. To address this issue, anchor graph-based methods (Li et al. 2020) construct an $N \times M$ ($M \ll N$) anchor graph that measures the relationship between the $N$ data points and $M$ anchors. However, most anchor graph-based methods still require post-processing (e.g., K-means) to obtain the final clustering labels, which not only increases the computational time but also leaves the clustering performance limited by K-means. To this end, SFMC (Li et al. 2020) manipulates the joint graph with a connectivity constraint so that the connected components indicate clusters directly, and MSC-BG (Yang et al. 2022) imposes constraints on the rank of the Laplacian matrix to obtain an affinity graph matrix with $K$ connected components. Nevertheless, constraining the connected components may yield fewer than $K$ components, leading to a significant decrease in clustering performance. Moreover, most anchor graph-based clustering algorithms use all data points, even though the anchor points corresponding to noisy and redundant data are useless.
Therefore, the LAPIN method (Nie et al. 2023) obtains a better coefficient matrix by applying sparse constraints to the data matrix, thereby alleviating the impact of noise to some extent. However, the distribution of noise is difficult to estimate, and the sparse representation of the noise term is hard to guarantee. Furthermore, quality differences between data views can also significantly affect clustering performance. Accordingly, AMGL (Nie, Li, and Li 2016) automatically learns optimal weights for each view by minimizing the squared-root trace. Although these methods achieve good results, they cannot fully utilize the complementary information in the adjacency matrices of different views.

To address these issues, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) method, which obtains the final cluster labels in one step. Specifically, we apply non-negative matrix factorization directly to the anchor graph to get the final cluster indicator matrix in one step, thus avoiding post-processing. Besides, we draw inspiration from feature selection on raw data points and apply feature selection to the anchor graph: we minimize the $\ell_{2,p}$-norm to make the learned anchor graph representation sparser, filtering out the anchor points that correspond to noise and redundant data and thereby significantly reducing the effect of noise. In addition, following the weighted tensor Schatten p-norm minimization (WTSNM) of (Xia et al. 2022), we employ tensor Schatten p-norm minimization to explore the low-rank structure embedded in inter-view graphs. The main contributions of our method are as follows:

- Our method performs non-negative matrix factorization of the learned anchor graphs to obtain a discrete label matrix, allowing us to obtain clustering results directly in one step without the need for post-processing.
- We propose minimizing the $\ell_{2,p}$-norm to ensure the sparsity of the learned anchor graph, thereby selecting representative anchor points while eliminating redundant ones, and we present a novel and efficient algorithm with a closed-form solution.
- We employ LPP (Lu et al. 2016) manifold learning to ensure label consistency among adjacent sample points, and we explore the low-rank structure of inter-view graphs using the Schatten p-norm, which fully leverages the complementary information embedded in the graphs.
- We propose an efficient algorithm to solve the model via ALM, and we carry out experiments on real multi-view datasets to demonstrate the effectiveness of our proposed method.

## Methodology

**Notations and Definitions:** In this paper, we use bold calligraphy letters for 3rd-order tensors, e.g., $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, and bold upper-case letters for matrices, e.g., $\mathbf{A}$. $\mathbf{A}_{i:}$ and $\mathbf{A}_{:j}$ are the $i$-th row and $j$-th column of matrix $\mathbf{A}$, respectively. The $v$-th frontal slice of $\mathcal{A}$ is $\mathbf{A}^v$. $\overline{\mathcal{A}}$ is the discrete Fast Fourier Transform (FFT) of $\mathcal{A}$ along the third dimension, i.e., $\overline{\mathcal{A}} = \mathrm{fft}(\mathcal{A}, [\,], 3)$. The trace of matrix $\mathbf{A}$ is denoted by $\mathrm{tr}(\mathbf{A})$. $\mathbf{I}$ is an identity matrix.

**Definition 1** (Gao et al. 2021) Given $\mathcal{G} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, the tensor Schatten p-norm of $\mathcal{G}$ is defined as

$$\|\mathcal{G}\|_{Sp} = \Bigg(\sum_{i=1}^{n_3} \big\|\overline{\mathcal{G}}^{(i)}\big\|_{Sp}^p\Bigg)^{\frac{1}{p}} = \Bigg(\sum_{i=1}^{n_3} \sum_{j=1}^{h} \sigma_j\big(\overline{\mathcal{G}}^{(i)}\big)^p\Bigg)^{\frac{1}{p}} \tag{1}$$

where $h = \min(n_1, n_2)$, $p \in (0, 1]$, and $\sigma_j(\overline{\mathcal{G}}^{(i)})$ is the $j$-th singular value of $\overline{\mathcal{G}}^{(i)}$. The Schatten p-norm can approximate the rank function more tightly when $p$ is chosen appropriately.

**Definition 2** (Wang et al. 2018; Liao et al. 2018) Given $\mathbf{H} \in \mathbb{R}^{n_1 \times n_2}$, the $\ell_{2,p}$-norm is defined as

$$\|\mathbf{H}\|_{2,p} = \Bigg(\sum_{i=1}^{n_1} \|\mathbf{H}_{i:}\|_2^p\Bigg)^{\frac{1}{p}} \tag{2}$$

where $p \in (0, 1]$. In particular, when $p = 1$, the $\ell_{2,p}$-norm becomes the $\ell_{2,1}$-norm, i.e., $\|\mathbf{H}\|_{2,1} = \sum_{i=1}^{n_1} \sqrt{\sum_{j=1}^{n_2} H_{ij}^2}$.
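To make Definitions 1 and 2 concrete, the following is a minimal numpy sketch of both quantities. It is an illustration rather than the authors' code: the function names are ours, and the tensor routine simply follows the stated recipe (FFT along the third mode, then the singular values of each frontal slice).

```python
import numpy as np

def l2p_norm(H, p=0.5):
    """l2,p-norm of Definition 2: (sum_i ||H_i:||_2^p)^(1/p), p in (0, 1]."""
    row_norms = np.linalg.norm(H, axis=1)        # ||H_i:||_2 for each row
    return np.sum(row_norms ** p) ** (1.0 / p)

def tensor_schatten_p_norm(G, p=0.5):
    """Tensor Schatten p-norm of Definition 1 for G in R^{n1 x n2 x n3}."""
    G_bar = np.fft.fft(G, axis=2)                # bar{G} = fft(G, [], 3)
    total = 0.0
    for i in range(G.shape[2]):                  # loop over frontal slices
        sigma = np.linalg.svd(G_bar[:, :, i], compute_uv=False)
        total += np.sum(sigma ** p)              # sum_j sigma_j(bar{G}^(i))^p
    return total ** (1.0 / p)
```

For $p = 1$ the first function reduces to the $\ell_{2,1}$-norm, and smaller $p$ drives more rows of $\mathbf{H}$ toward zero, which is exactly the row-sparsity behavior exploited below.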
**Definition 3** (Dong et al. 2016) Given $\mathbf{Z} \in \mathbb{R}^{N \times M}$ and a weight matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, the Sparse Gradient Pursuit is defined as

$$\|\nabla \mathbf{Z}\|_1 = \sum_{i,j} W_{ij} \big\|\mathbf{Z}_{:i} - \mathbf{Z}_{:j}\big\|_1 = \|\mathbf{K}\mathbf{Z}\|_1 \tag{3}$$

where $\nabla \mathbf{Z}$ represents the gradient of $\mathbf{Z}$ and $\mathbf{K}$ denotes the gradient matrix of the adjacency KNN graph (Yang et al. 2014).

### Problem Formulation and Objective

Anchor graph-based methods typically require learning a shared graph from predefined graphs $\mathbf{S}^v \in \mathbb{R}^{N \times M}$, which capture the relationships between $N$ data points and $M$ anchor points. However, these methods have several drawbacks: (1) the cluster labels must be obtained via post-processing, which can limit clustering performance; (2) all data points in the anchor graphs are used, which can introduce redundant data and lead to inefficiencies; (3) each view is processed separately, which prevents these methods from fully leveraging the complementary information in the adjacency matrices of different views.

In response to these disadvantages, we use non-negative matrix factorization (Ding et al. 2006) to obtain the final global cluster assignment matrix by factorizing the anchor graph in one step, thus avoiding post-processing. To keep the results before and after matrix factorization close, the $\ell_{2,1}$-norm is used for the non-negative matrix factorization to avoid the amplified error caused by the square of the F-norm. Thus, we have

$$\min_{\alpha^v, \mathbf{G}^v, \mathbf{H}^v} \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,1} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{4}$$

where $\alpha^v$ is the non-negative normalized weight factor, $\mathbf{S}^v \in \mathbb{R}^{N \times M}$ is the pre-defined anchor graph (Li et al. 2020), $\mathbf{G}^v \in \mathbb{R}^{N \times C}$ is the cluster assignment matrix, $\mathbf{H}^v \in \mathbb{R}^{M \times C}$ is the latent feature matrix, and $C$ is the number of clusters.

To better construct anchor points, we consider selecting the most representative data points from the anchor graph. When reconstructing the anchor graph $\mathbf{S}^v$, we can right-multiply $\mathbf{S}^v$ by a diagonal matrix $\mathrm{diag}(\mathbf{f})$ with $f_i \in \{0, 1\}$ for feature selection. The reconstruction matrix corresponding to anchor graph $\mathbf{S}^v$ is then $\hat{\mathbf{S}}^v = \mathbf{S}^v \mathrm{diag}(\mathbf{f})$. Since $\mathbf{S}^v$ can be reconstructed from $\mathbf{G}^v$ and $\mathbf{H}^v$, the $i$-th column vector of the reconstruction matrix $\hat{\mathbf{S}}^v$ can be represented as $\hat{\mathbf{S}}^v_{:i} = \mathbf{G}^v \mathbf{H}_{i:}^{vT}$. Considering $\|\hat{\mathbf{S}}^v_{:i}\|_2 = \|\mathbf{G}^v \mathbf{H}_{i:}^{vT}\|_2 = \|\mathbf{H}^v_{i:}\|_2$, we see that the reconstruction of $\mathbf{S}^v$ depends heavily on the matrix $\mathbf{H}^v$: when $\|\hat{\mathbf{S}}^v_{:i}\|_2$ is close to 0, the corresponding anchor point is not representative and should be excluded. Therefore, ensuring the row sparsity of $\mathbf{H}^v$ easily achieves feature selection on the graph. The corresponding model in this case is:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v (\mathrm{diag}(\mathbf{f}) \mathbf{H}^v)^T\big\|_{2,1} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0, \; \mathbf{f} \in \{0, 1\}^M \tag{5}$$

where $\mathrm{diag}(\mathbf{f}) \in \mathbb{R}^{M \times M}$ with $f_i \in \{0, 1\}$. When $f_i = 0$, the corresponding $i$-th row of $\mathrm{diag}(\mathbf{f})\mathbf{H}^v$ is $\mathbf{0}^T$, and the $i$-th column of the reconstructed anchor graph $\hat{\mathbf{S}}^v$ also tends toward $\mathbf{0}$. Otherwise, when $f_i = 1$, the feature associated with graph $\mathbf{S}^v$ is useful and should be retained. This allows for efficient feature selection on the anchor graph $\mathbf{S}^v$. However, directly constraining a specific row of $\mathbf{H}^v$ to be $\mathbf{0}^T$ is too strict and difficult to solve. Therefore, we instead impose a row-sparsity norm, the $\ell_{2,p}$-norm (see Definition 2). Using this norm, the resulting $\mathbf{H}^v$ matrix can be made even sparser, further enhancing the performance of the algorithm.
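The key identity above, $\|\hat{\mathbf{S}}^v_{:i}\|_2 = \|\mathbf{H}^v_{i:}\|_2$ under the orthogonality constraint $\mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}$, can be checked numerically. The helper below is hypothetical (the names, shapes, and tolerance are ours) and only illustrates how the row norms of $\mathbf{H}^v$ score anchors:

```python
import numpy as np

def anchor_scores(H, tol=1e-6):
    """Score each of the M anchors by its reconstruction column norm.

    With G^T G = I, column i of the reconstruction S' = G H^T satisfies
    ||S'_{:i}||_2 = ||H_{i:}||_2, so the score of anchor i is just the
    2-norm of the i-th row of H; anchors scoring ~0 are redundant.
    """
    scores = np.linalg.norm(H, axis=1)   # ||H_i:||_2, one score per anchor
    keep = scores > tol                  # mask of representative anchors
    return scores, keep

# Sanity check of the identity with a random column-orthogonal G:
rng = np.random.default_rng(0)
G, _ = np.linalg.qr(rng.standard_normal((100, 5)))   # G^T G = I
H = rng.standard_normal((20, 5))                     # 20 anchors, 5 clusters
S_rec = G @ H.T
assert np.allclose(np.linalg.norm(S_rec, axis=0),    # column norms of S'
                   np.linalg.norm(H, axis=1))        # row norms of H
```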
Besides, using the $\ell_{2,p}$-norm constraint for the non-negative matrix factorization can reduce the reconstruction error. Therefore, we obtain:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{6}$$

In equation (6), the row sparsity of $\mathbf{H}^v$ controls the column sparsity of the anchor graph $\mathbf{S}^v$, thereby enabling $\mathbf{H}^v$ to realize feature selection on the anchor graph, while the matrix $\mathbf{G}^v$ serves as the corresponding label embedding matrix. To effectively learn the feature selection matrix $\mathbf{H}^v$, it is necessary to ensure that samples adjacent on the high-dimensional manifold remain adjacent after dimension reduction. Inspired by the locality preserving projection (LPP) (Lu et al. 2016) algorithm, we add a regularization term on $\mathbf{G}^v$ that preserves label consistency between adjacent sample points, in the spirit of LPP manifold learning. This leads to our new model formulation:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \gamma \, \mathrm{tr}\big(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v\big) + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{7}$$

where the normalized Laplacian matrix can be calculated as $\tilde{\mathbf{L}}^v = \mathbf{I} - \mathbf{S}^v (\mathbf{\Delta}^v)^{-1} \mathbf{S}^{vT}$ and the diagonal elements of the diagonal matrix $\mathbf{\Delta}^v$ are $\Delta^v_{jj} = \sum_{i=1}^{N} S^v_{ij}$. During the optimization of the model, $\mathrm{tr}(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v)$ needs to be transformed into the square of the F-norm for computation, with the corresponding expression being:

$$\mathrm{tr}\big(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v\big) = \frac{1}{2} \sum_{i,j=1}^{N} W^v_{ij} \big\|\mathbf{G}^v_{i:} - \mathbf{G}^v_{j:}\big\|_F^2 \tag{8}$$

where $\mathbf{W}^v$ is the adjacency matrix and $\mathbf{G}^v$ is the cluster assignment matrix. Minimizing an $\ell_1$-norm optimization problem tends to set some elements to 0, i.e., only the part of the data that fits well is selected for estimating the matrix $\mathbf{G}^v$, which ensures sparsity. Therefore, we propose to use the $\ell_1$-norm instead of the square of the F-norm. Inspired by Definition 3, we convert the row operation on $\mathbf{G}^v$ into a column operation on $\mathbf{G}^{vT}$ to obtain the corresponding expression:

$$\frac{1}{2} \sum_{i,j=1}^{N} W^v_{ij} \big\|\mathbf{G}^v_{i:} - \mathbf{G}^v_{j:}\big\|_1 = \frac{1}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 \tag{9}$$

where $\mathbf{T}^v \in \mathbb{R}^{O \times N}$ is the corresponding gradient matrix of the adjacent K-nearest-neighbor (KNN) graph (the $k$-th row satisfies $T^v_{ki} = -T^v_{kj} = W^v_{ij}$) and $O = KN$ is the number of edges in the KNN graph. Combining (9) with (7), we have:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \frac{\gamma}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{10}$$

However, equation (10) does not fully exploit the complementary information in different views. Hence, we use the tensor Schatten p-norm (defined in Definition 1) to measure the similarity between the different $\mathbf{G}^v$ and obtain the final global cluster assignment matrix $\mathbf{C} = \sum_{v=1}^{V} \mathbf{G}^v \alpha^v$, which incorporates the weight information. Specifically, we construct a 3rd-order tensor $\mathcal{G}$ from the $\mathbf{G}^v$ (as illustrated in Figure 1) and consider the corresponding Schatten p-norm after rotation. Notably, each frontal slice of the rotated tensor ensures that the relationship between the $N$ data points and the $c$-th cluster is consistent across views. Therefore, $\|\mathcal{G}\|_{Sp}$ allows for a comprehensive exploration of the information hidden between different views.

[Figure 1: Construction of the rotated tensor $\mathcal{G} \in \mathbb{R}^{N \times V \times C}$ by stacking the view-wise cluster assignment matrices $\mathbf{G}^1, \dots, \mathbf{G}^V$.]

Combining (10) with the Schatten p-norm, our final model can be expressed as:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \frac{\gamma}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} + \beta \|\mathcal{G}\|_{Sp}^p \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{11}$$

### Optimization

We propose an efficient optimization method based on the Augmented Lagrange Multiplier (ALM) method. Introducing auxiliary variables $\mathcal{J}$, $\mathbf{P}^v$, and $\mathbf{Q}^v$ allows us to rewrite (11) as:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \|\mathbf{P}^v\|_{2,p} + \frac{\rho_0}{2} \Big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT} - \mathbf{P}^v + \frac{\mathbf{K}^v}{\rho_0}\Big\|_F^2 + \frac{\gamma}{2} \|\mathbf{Q}^v\|_1 + \frac{\rho_1}{2} \Big\|\mathbf{G}^{vT} \mathbf{T}^{vT} - \mathbf{Q}^v + \frac{\mathbf{M}^v}{\rho_1}\Big\|_F^2 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} + \beta \|\mathcal{J}\|_{Sp}^p + \frac{\rho_2}{2} \Big\|\mathcal{G} - \mathcal{J} + \frac{\mathcal{W}}{\rho_2}\Big\|_F^2 \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{12}$$

where $\mathcal{W}$, $\mathbf{K}^v$, and $\mathbf{M}^v$ are the Lagrange multipliers and $\rho_0$, $\rho_1$, and $\rho_2$ are the penalty parameters.
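For intuition, the sketch below performs one generic round of ALM dual updates for the three couplings introduced in (12). This is only an illustration: the geometric penalty inflation with factor $\eta$ is a standard ALM schedule we assume here, not a detail taken from the paper, and all names are ours.

```python
import numpy as np

def alm_dual_updates(S, G, H, P, Q, T, G_tensor, J,
                     K, M, W, rho0, rho1, rho2, eta=2.0, rho_max=1e10):
    """One generic ALM round for the couplings in (12):
    S = G H^T + P,  G^T T^T = Q,  and (tensor) G = J."""
    K = K + rho0 * (S - G @ H.T - P)    # multiplier for the 2,p residual
    M = M + rho1 * (G.T @ T.T - Q)      # multiplier for the l1 residual
    W = W + rho2 * (G_tensor - J)       # tensor multiplier for G = J
    # Inflate the penalties geometrically, capped at rho_max.
    rho0, rho1, rho2 = (min(eta * r, rho_max) for r in (rho0, rho1, rho2))
    return K, M, W, rho0, rho1, rho2
```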
The optimization process can be separated into the following steps:

**$\mathbf{Q}^v$ sub-problem:**

$$\min_{\mathbf{Q}^v} \frac{\gamma}{2} \|\mathbf{Q}^v\|_1 + \frac{\rho_1}{2} \Big\|\mathbf{G}^{vT} \mathbf{T}^{vT} - \mathbf{Q}^v + \frac{\mathbf{M}^v}{\rho_1}\Big\|_F^2 \tag{13}$$

Considering each view individually, it follows that

$$\arg\min_{\mathbf{Q}^v} \frac{\gamma}{2\rho_1} \|\mathbf{Q}^v\|_1 + \frac{1}{2} \big\|\mathbf{Q}^v - \mathbf{C}^v\big\|_F^2 \tag{14}$$

where $\mathbf{C}^v = \mathbf{G}^{vT} \mathbf{T}^{vT} + \frac{\mathbf{M}^v}{\rho_1}$. Inspired by (Hale, Yin, and Zhang 2008), we have

$$\mathbf{Q}^v = \Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v) \tag{15}$$

where the $(i, j)$-th element of $\Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v)$ is defined as

$$\Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v)_{ij} = \mathrm{sgn}\big(C^v_{ij}\big) \max\Big(\big|C^v_{ij}\big| - \frac{\gamma}{2\rho_1}, \; 0\Big) \tag{16}$$

**$\mathbf{H}^v$ sub-problem:**

$$\min_{\mathbf{H}^v} \frac{\rho_0}{2} \big\|\mathbf{A}^v - \mathbf{H}^v\big\|_F^2 + \lambda \|\mathbf{H}^v\|_{2,p} \tag{17}$$

where $\mathbf{A}^v = \big(\mathbf{S}^v - \mathbf{P}^v + \frac{\mathbf{K}^v}{\rho_0}\big)^T \mathbf{G}^v$. In order to solve (17), we need the following Lemmas 1 and 2 and Theorem 1.

**Lemma 1** (Gao et al. 2021) Consider $\min_{\delta \geq 0} f(\delta) = \frac{1}{2}(\delta - \omega)^2 + \lambda \delta^p$ s.t. $0$
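Returning to the $\mathbf{Q}^v$ update: (15) and (16) together form the classical element-wise soft-thresholding (shrinkage) operator, which takes only a few lines of numpy. This is a sketch of the operator under our reading of the shapes in (14), not the authors' implementation:

```python
import numpy as np

def soft_threshold(C, tau):
    """Shrinkage operator of (16): sgn(C_ij) * max(|C_ij| - tau, 0)."""
    return np.sign(C) * np.maximum(np.abs(C) - tau, 0.0)

def update_Q(G, T, M, gamma, rho1):
    """Q^v update of (15): Q = Theta_{gamma/(2 rho1)}(C^v),
    with C^v = G^vT T^vT + M^v / rho1 as in (14)."""
    C = G.T @ T.T + M / rho1
    return soft_threshold(C, gamma / (2.0 * rho1))
```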