# Embedded Feature Selection on Graph-Based Multi-View Clustering

Wenhui Zhao¹, Guangfei Li¹, Haizhou Yang¹, Quanxue Gao¹*, Qianqian Wang¹

¹ School of Telecommunication Engineering, Xidian University, Shaanxi 710071, China
whzhao@stu.xidian.edu.cn, liguangfei dream@hotmail.com, leoyhz@qq.com, qxgao@xidian.edu.cn, qqwang@xidian.edu.cn

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, anchor graph-based multi-view clustering has proven highly efficient for large-scale data processing. However, most existing anchor graph-based clustering methods require post-processing to obtain clustering labels and cannot effectively utilize the information within anchor graphs. To solve these problems, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) approach to improve clustering performance. Our method decomposes anchor graphs, taking advantage of their memory efficiency, to obtain clustering labels in a single step without post-processing. Furthermore, we introduce the $\ell_{2,p}$-norm for graph-based feature selection, which selects the most relevant data for efficient graph factorization. Lastly, we employ the tensor Schatten p-norm as a tensor-rank approximation function to capture the complementary information between different views, ensuring similarity between cluster assignment matrices. Experimental results on five real-world datasets demonstrate that our proposed method outperforms state-of-the-art approaches.

## Introduction

Over the past few decades, there has been immense interest in developing numerous exceptional clustering algorithms, including subspace-based clustering (Luo et al. 2018; Xie et al. 2020), non-negative matrix factorization clustering (Gao et al. 2013; Salah, Ailem, and Nadif 2018), and graph-based clustering (Hu et al. 2020; Nie, Li, and Li 2017). Notably, graph-based clustering methods have been widely developed due to their excellent performance in capturing the spatial structure of nonlinear data.

The key step in graph-based clustering is to construct an $N \times N$ affinity graph matrix that represents the similarity between the $N$ data points. However, this operation is time-consuming and memory-intensive. To address this issue, anchor graph-based methods (Li et al. 2020) construct an $N \times M$ ($M \ll N$) anchor graph that measures the relationship between the $N$ data points and $M$ anchors. However, most anchor graph-based methods still require post-processing (e.g., K-means) to obtain the final clustering labels, which not only increases the computational time but also leaves the clustering performance limited by K-means. To this end, SFMC (Li et al. 2020) manipulates the joint graph with a connectivity constraint so that the connected components indicate clusters directly, and MSC-BG (Yang et al. 2022) imposes constraints on the rank of the Laplacian matrix to obtain an affinity graph matrix with $K$ connected components. Nevertheless, constraining the connected components may yield fewer than $K$ components, leading to a significant decrease in clustering performance. Moreover, most anchor graph-based clustering algorithms use all data points, even though the anchor points corresponding to noisy and redundant data are useless.
Therefore, the LAPIN method (Nie et al. 2023) obtains a better coefficient matrix by applying sparse constraints to the data matrix, thereby alleviating the impact of noise to some extent. However, the distribution of noise is difficult to estimate, and the sparse representation of the noise term is hard to guarantee. Furthermore, quality differences between data views can also significantly affect clustering performance. Accordingly, AMGL (Nie, Li, and Li 2016) automatically learns optimal weights for each view by minimizing the squared-root trace. Although these methods achieve good results, they cannot fully utilize the complementary information in the adjacency matrices of different views.

To address these issues, we propose an Embedded Feature Selection on Graph-Based Multi-View Clustering (EFSGMC) method, which obtains the final cluster labels in one step. Specifically, we apply non-negative matrix factorization directly to the anchor graph to get the final cluster indicator matrix in one step, thus avoiding post-processing. Besides, we draw inspiration from feature selection on raw data points and apply feature selection to the anchor graph: we minimize the $\ell_{2,p}$-norm to make the learned anchor graph representation sparser, filtering out the anchor points that correspond to noise and redundant data and thereby significantly reducing the effect of noise. In addition, following the weighted tensor Schatten p-norm minimization (WTSNM) of (Xia et al. 2022), we employ tensor Schatten p-norm minimization to explore the low-rank structure embedded in inter-view graphs. The main contributions of our method are as follows:

- Our method performs non-negative matrix factorization of the learned anchor graphs to obtain a discrete label matrix, allowing us to obtain clustering results directly in one step without the need for post-processing.
- We propose minimizing the $\ell_{2,p}$-norm to ensure the sparsity of the learned anchor graph, thereby selecting representative anchor points while eliminating redundant ones, and we present a novel and efficient algorithm with a closed-form solution.
- We employ LPP (Lu et al. 2016) manifold learning to ensure label consistency among adjacent sample points, and we explore the low-rank structure of inter-view graphs using the Schatten p-norm, which fully leverages the complementary information embedded in the graphs.
- We propose an efficient algorithm to solve the model via ALM, and we carry out experiments on real multi-view datasets to demonstrate the effectiveness of our proposed method.

## Methodology

**Notations and Definitions:** In this paper, we use bold calligraphy letters for 3rd-order tensors, e.g., $\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, and bold upper-case letters for matrices, e.g., $\mathbf{A}$. $\mathbf{A}_{i:}$ and $\mathbf{A}_{:j}$ are the $i$-th row and $j$-th column of matrix $\mathbf{A}$, respectively. The $v$-th frontal slice of $\mathcal{A}$ is $\mathbf{A}^v$. $\overline{\mathcal{A}}$ is the discrete Fast Fourier Transform (FFT) of $\mathcal{A}$ along the third dimension, i.e., $\overline{\mathcal{A}} = \mathrm{fft}(\mathcal{A}, [\,], 3)$. The trace of matrix $\mathbf{A}$ is denoted by $\mathrm{tr}(\mathbf{A})$. $\mathbf{I}$ is an identity matrix.

**Definition 1** (Gao et al. 2021) Given $\mathcal{G} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, the tensor Schatten p-norm of $\mathcal{G}$ is defined as

$$\|\mathcal{G}\|_{Sp} = \Bigg(\sum_{i=1}^{n_3} \big\|\overline{\mathcal{G}}^{(i)}\big\|_{Sp}^p\Bigg)^{\frac{1}{p}} = \Bigg(\sum_{i=1}^{n_3} \sum_{j=1}^{h} \sigma_j\big(\overline{\mathcal{G}}^{(i)}\big)^p\Bigg)^{\frac{1}{p}} \tag{1}$$

where $h = \min(n_1, n_2)$, $p \in (0, 1]$, and $\sigma_j(\overline{\mathcal{G}}^{(i)})$ is the $j$-th singular value of $\overline{\mathcal{G}}^{(i)}$. The Schatten p-norm can approximate the rank function more tightly when $p$ is chosen appropriately.

**Definition 2** (Wang et al. 2018; Liao et al. 2018) Given $\mathbf{H} \in \mathbb{R}^{n_1 \times n_2}$, the $\ell_{2,p}$-norm is defined as

$$\|\mathbf{H}\|_{2,p} = \Bigg(\sum_{i=1}^{n_1} \|\mathbf{H}_{i:}\|_2^p\Bigg)^{\frac{1}{p}} \tag{2}$$

where $p \in (0, 1]$. In particular, when $p = 1$, the $\ell_{2,p}$-norm becomes the $\ell_{2,1}$-norm, i.e., $\|\mathbf{H}\|_{2,1} = \sum_{i=1}^{n_1} \sqrt{\sum_{j=1}^{n_2} H_{ij}^2}$.
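To make Definitions 1 and 2 concrete, the following is a minimal numpy sketch of both quantities. It is an illustration rather than the authors' code: the function names are ours, and the tensor routine simply follows the stated recipe (FFT along the third mode, then the singular values of each frontal slice).

```python
import numpy as np

def l2p_norm(H, p=0.5):
    """l2,p-norm of Definition 2: (sum_i ||H_i:||_2^p)^(1/p), p in (0, 1]."""
    row_norms = np.linalg.norm(H, axis=1)        # ||H_i:||_2 for each row
    return np.sum(row_norms ** p) ** (1.0 / p)

def tensor_schatten_p_norm(G, p=0.5):
    """Tensor Schatten p-norm of Definition 1 for G in R^{n1 x n2 x n3}."""
    G_bar = np.fft.fft(G, axis=2)                # bar{G} = fft(G, [], 3)
    total = 0.0
    for i in range(G.shape[2]):                  # loop over frontal slices
        sigma = np.linalg.svd(G_bar[:, :, i], compute_uv=False)
        total += np.sum(sigma ** p)              # sum_j sigma_j(bar{G}^(i))^p
    return total ** (1.0 / p)
```

For $p = 1$ the first function reduces to the $\ell_{2,1}$-norm, and smaller $p$ drives more rows of $\mathbf{H}$ toward zero, which is exactly the row-sparsity behavior exploited below.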
**Definition 3** (Dong et al. 2016) Given $\mathbf{Z} \in \mathbb{R}^{N \times M}$ and a weight matrix $\mathbf{W} \in \mathbb{R}^{N \times N}$, the Sparse Gradient Pursuit is defined as

$$\|\nabla \mathbf{Z}\|_1 = \sum_{i,j} W_{ij} \big\|\mathbf{Z}_{:i} - \mathbf{Z}_{:j}\big\|_1 = \|\mathbf{K}\mathbf{Z}\|_1 \tag{3}$$

where $\nabla \mathbf{Z}$ represents the gradient of $\mathbf{Z}$ and $\mathbf{K}$ denotes the gradient matrix of the adjacency KNN graph (Yang et al. 2014).

### Problem Formulation and Objective

Anchor graph-based methods typically require learning a shared graph from predefined graphs $\mathbf{S}^v \in \mathbb{R}^{N \times M}$, which capture the relationships between $N$ data points and $M$ anchor points. However, these methods have several drawbacks: (1) the cluster labels must be obtained via post-processing, which can limit clustering performance; (2) all data points in the anchor graphs are used, which can introduce redundant data and lead to inefficiencies; (3) each view is processed separately, which prevents these methods from fully leveraging the complementary information in the adjacency matrices of different views.

In response to these disadvantages, we use non-negative matrix factorization (Ding et al. 2006) to obtain the final global cluster assignment matrix by factorizing the anchor graph in one step, thus avoiding post-processing. To keep the results before and after matrix factorization close, the $\ell_{2,1}$-norm is used for the non-negative matrix factorization to avoid the amplified error caused by the square of the F-norm. Thus, we have

$$\min_{\alpha^v, \mathbf{G}^v, \mathbf{H}^v} \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,1} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{4}$$

where $\alpha^v$ is the non-negative normalized weight factor, $\mathbf{S}^v \in \mathbb{R}^{N \times M}$ is the pre-defined anchor graph (Li et al. 2020), $\mathbf{G}^v \in \mathbb{R}^{N \times C}$ is the cluster assignment matrix, $\mathbf{H}^v \in \mathbb{R}^{M \times C}$ is the latent feature matrix, and $C$ is the number of clusters.

To better construct anchor points, we consider selecting the most representative data points from the anchor graph. When reconstructing the anchor graph $\mathbf{S}^v$, we can right-multiply $\mathbf{S}^v$ by a diagonal matrix $\mathrm{diag}(\mathbf{f})$ with $f_i \in \{0, 1\}$ for feature selection. The reconstruction matrix corresponding to anchor graph $\mathbf{S}^v$ is then $\hat{\mathbf{S}}^v = \mathbf{S}^v \mathrm{diag}(\mathbf{f})$. Since $\mathbf{S}^v$ can be reconstructed from $\mathbf{G}^v$ and $\mathbf{H}^v$, the $i$-th column vector of the reconstruction matrix $\hat{\mathbf{S}}^v$ can be represented as $\hat{\mathbf{S}}^v_{:i} = \mathbf{G}^v \mathbf{H}_{i:}^{vT}$. Considering $\|\hat{\mathbf{S}}^v_{:i}\|_2 = \|\mathbf{G}^v \mathbf{H}_{i:}^{vT}\|_2 = \|\mathbf{H}^v_{i:}\|_2$, we see that the reconstruction of $\mathbf{S}^v$ depends heavily on the matrix $\mathbf{H}^v$: when $\|\hat{\mathbf{S}}^v_{:i}\|_2$ is close to 0, the corresponding anchor point is not representative and should be excluded. Therefore, ensuring the row sparsity of $\mathbf{H}^v$ easily achieves feature selection on the graph. The corresponding model in this case is:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v (\mathrm{diag}(\mathbf{f}) \mathbf{H}^v)^T\big\|_{2,1} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0, \; \mathbf{f} \in \{0, 1\}^M \tag{5}$$

where $\mathrm{diag}(\mathbf{f}) \in \mathbb{R}^{M \times M}$ with $f_i \in \{0, 1\}$. When $f_i = 0$, the corresponding $i$-th row of $\mathrm{diag}(\mathbf{f})\mathbf{H}^v$ is $\mathbf{0}^T$, and the $i$-th column of the reconstructed anchor graph $\hat{\mathbf{S}}^v$ also tends toward $\mathbf{0}$. Otherwise, when $f_i = 1$, the feature associated with graph $\mathbf{S}^v$ is useful and should be retained. This allows for efficient feature selection on the anchor graph $\mathbf{S}^v$. However, directly constraining a specific row of $\mathbf{H}^v$ to be $\mathbf{0}^T$ is too strict and difficult to solve. Therefore, we instead impose a row-sparsity norm, the $\ell_{2,p}$-norm (see Definition 2). Using this norm, the resulting $\mathbf{H}^v$ matrix can be made even sparser, further enhancing the performance of the algorithm.
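The key identity above, $\|\hat{\mathbf{S}}^v_{:i}\|_2 = \|\mathbf{H}^v_{i:}\|_2$ under the orthogonality constraint $\mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}$, can be checked numerically. The helper below is hypothetical (the names, shapes, and tolerance are ours) and only illustrates how the row norms of $\mathbf{H}^v$ score anchors:

```python
import numpy as np

def anchor_scores(H, tol=1e-6):
    """Score each of the M anchors by its reconstruction column norm.

    With G^T G = I, column i of the reconstruction S' = G H^T satisfies
    ||S'_{:i}||_2 = ||H_{i:}||_2, so the score of anchor i is just the
    2-norm of the i-th row of H; anchors scoring ~0 are redundant.
    """
    scores = np.linalg.norm(H, axis=1)   # ||H_i:||_2, one score per anchor
    keep = scores > tol                  # mask of representative anchors
    return scores, keep

# Sanity check of the identity with a random column-orthogonal G:
rng = np.random.default_rng(0)
G, _ = np.linalg.qr(rng.standard_normal((100, 5)))   # G^T G = I
H = rng.standard_normal((20, 5))                     # 20 anchors, 5 clusters
S_rec = G @ H.T
assert np.allclose(np.linalg.norm(S_rec, axis=0),    # column norms of S'
                   np.linalg.norm(H, axis=1))        # row norms of H
```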
Besides, using the $\ell_{2,p}$-norm constraint for the non-negative matrix factorization can reduce the reconstruction error. Therefore, we obtain:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{6}$$

In equation (6), the row sparsity of $\mathbf{H}^v$ controls the column sparsity of the anchor graph $\mathbf{S}^v$, thereby enabling $\mathbf{H}^v$ to realize feature selection on the anchor graph, while the matrix $\mathbf{G}^v$ serves as the corresponding label embedding matrix. To effectively learn the feature selection matrix $\mathbf{H}^v$, it is necessary to ensure that samples adjacent on the high-dimensional manifold remain adjacent after dimension reduction. Inspired by the locality preserving projection (LPP) (Lu et al. 2016) algorithm, we add a regularization term on $\mathbf{G}^v$ that preserves label consistency between adjacent sample points, in the spirit of LPP manifold learning. This leads to our new model formulation:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \gamma \, \mathrm{tr}\big(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v\big) + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{7}$$

where the normalized Laplacian matrix can be calculated as $\tilde{\mathbf{L}}^v = \mathbf{I} - \mathbf{S}^v (\mathbf{\Delta}^v)^{-1} \mathbf{S}^{vT}$ and the diagonal elements of the diagonal matrix $\mathbf{\Delta}^v$ are $\Delta^v_{jj} = \sum_{i=1}^{N} S^v_{ij}$. During the optimization of the model, $\mathrm{tr}(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v)$ needs to be transformed into the square of the F-norm for computation, with the corresponding expression being:

$$\mathrm{tr}\big(\mathbf{G}^{vT} \tilde{\mathbf{L}}^v \mathbf{G}^v\big) = \frac{1}{2} \sum_{i,j=1}^{N} W^v_{ij} \big\|\mathbf{G}^v_{i:} - \mathbf{G}^v_{j:}\big\|_F^2 \tag{8}$$

where $\mathbf{W}^v$ is the adjacency matrix and $\mathbf{G}^v$ is the cluster assignment matrix. Minimizing an $\ell_1$-norm optimization problem tends to set some elements to 0, i.e., only the part of the data that fits well is selected for estimating the matrix $\mathbf{G}^v$, which ensures sparsity. Therefore, we propose to use the $\ell_1$-norm instead of the square of the F-norm. Inspired by Definition 3, we convert the row operation on $\mathbf{G}^v$ into a column operation on $\mathbf{G}^{vT}$ to obtain the corresponding expression:

$$\frac{1}{2} \sum_{i,j=1}^{N} W^v_{ij} \big\|\mathbf{G}^v_{i:} - \mathbf{G}^v_{j:}\big\|_1 = \frac{1}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 \tag{9}$$

where $\mathbf{T}^v \in \mathbb{R}^{O \times N}$ is the corresponding gradient matrix of the adjacent K-nearest-neighbor (KNN) graph (the $k$-th row satisfies $T^v_{ki} = -T^v_{kj} = W^v_{ij}$) and $O = KN$ is the number of edges in the KNN graph. Combining (9) with (7), we have:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \frac{\gamma}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{10}$$

However, equation (10) does not fully exploit the complementary information in different views. Hence, we use the tensor Schatten p-norm (defined in Definition 1) to measure the similarity between the different $\mathbf{G}^v$ and obtain the final global cluster assignment matrix $\mathbf{C} = \sum_{v=1}^{V} \mathbf{G}^v \alpha^v$, which incorporates the weight information. Specifically, we construct a 3rd-order tensor $\mathcal{G}$ from the $\mathbf{G}^v$ (as illustrated in Figure 1) and consider the corresponding Schatten p-norm after rotation. Notably, each frontal slice of the rotated tensor ensures that the relationship between the $N$ data points and the $c$-th cluster is consistent across views. Therefore, $\|\mathcal{G}\|_{Sp}$ allows for a comprehensive exploration of the information hidden between different views.

[Figure 1: Construction of the rotated tensor $\mathcal{G} \in \mathbb{R}^{N \times V \times C}$ by stacking the view-wise cluster assignment matrices $\mathbf{G}^1, \dots, \mathbf{G}^V$.]

Combining (10) with the Schatten p-norm, our final model can be expressed as:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT}\big\|_{2,p} + \frac{\gamma}{2} \big\|\mathbf{G}^{vT} \mathbf{T}^{vT}\big\|_1 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} + \beta \|\mathcal{G}\|_{Sp}^p \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{11}$$

### Optimization

We propose an efficient optimization method based on the Augmented Lagrange Multiplier (ALM) method. Introducing auxiliary variables $\mathcal{J}$, $\mathbf{P}^v$, and $\mathbf{Q}^v$ allows us to rewrite (11) as:

$$\min \sum_{v=1}^{V} \alpha^v \Big\{ \|\mathbf{P}^v\|_{2,p} + \frac{\rho_0}{2} \Big\|\mathbf{S}^v - \mathbf{G}^v \mathbf{H}^{vT} - \mathbf{P}^v + \frac{\mathbf{K}^v}{\rho_0}\Big\|_F^2 + \frac{\gamma}{2} \|\mathbf{Q}^v\|_1 + \frac{\rho_1}{2} \Big\|\mathbf{G}^{vT} \mathbf{T}^{vT} - \mathbf{Q}^v + \frac{\mathbf{M}^v}{\rho_1}\Big\|_F^2 + \lambda \|\mathbf{H}^v\|_{2,p} \Big\} + \beta \|\mathcal{J}\|_{Sp}^p + \frac{\rho_2}{2} \Big\|\mathcal{G} - \mathcal{J} + \frac{\mathcal{W}}{\rho_2}\Big\|_F^2 \quad \text{s.t.} \; \mathbf{G}^{vT}\mathbf{G}^v = \mathbf{I}, \; \mathbf{G}^v \geq 0, \; \sum_{v=1}^{V} \alpha^v = 1, \; \alpha^v \geq 0 \tag{12}$$

where $\mathcal{W}$, $\mathbf{K}^v$, and $\mathbf{M}^v$ are the Lagrange multipliers and $\rho_0$, $\rho_1$, and $\rho_2$ are the penalty parameters.
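For intuition, the sketch below performs one generic round of ALM dual updates for the three couplings introduced in (12). This is only an illustration: the geometric penalty inflation with factor $\eta$ is a standard ALM schedule we assume here, not a detail taken from the paper, and all names are ours.

```python
import numpy as np

def alm_dual_updates(S, G, H, P, Q, T, G_tensor, J,
                     K, M, W, rho0, rho1, rho2, eta=2.0, rho_max=1e10):
    """One generic ALM round for the couplings in (12):
    S = G H^T + P,  G^T T^T = Q,  and (tensor) G = J."""
    K = K + rho0 * (S - G @ H.T - P)    # multiplier for the 2,p residual
    M = M + rho1 * (G.T @ T.T - Q)      # multiplier for the l1 residual
    W = W + rho2 * (G_tensor - J)       # tensor multiplier for G = J
    # Inflate the penalties geometrically, capped at rho_max.
    rho0, rho1, rho2 = (min(eta * r, rho_max) for r in (rho0, rho1, rho2))
    return K, M, W, rho0, rho1, rho2
```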
The optimization process can be separated into the following steps:

**$\mathbf{Q}^v$ sub-problem:**

$$\min_{\mathbf{Q}^v} \frac{\gamma}{2} \|\mathbf{Q}^v\|_1 + \frac{\rho_1}{2} \Big\|\mathbf{G}^{vT} \mathbf{T}^{vT} - \mathbf{Q}^v + \frac{\mathbf{M}^v}{\rho_1}\Big\|_F^2 \tag{13}$$

Considering each view individually, it follows that

$$\arg\min_{\mathbf{Q}^v} \frac{\gamma}{2\rho_1} \|\mathbf{Q}^v\|_1 + \frac{1}{2} \big\|\mathbf{Q}^v - \mathbf{C}^v\big\|_F^2 \tag{14}$$

where $\mathbf{C}^v = \mathbf{G}^{vT} \mathbf{T}^{vT} + \frac{\mathbf{M}^v}{\rho_1}$. Inspired by (Hale, Yin, and Zhang 2008), we have

$$\mathbf{Q}^v = \Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v) \tag{15}$$

where the $(i, j)$-th element of $\Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v)$ is defined as

$$\Theta_{\frac{\gamma}{2\rho_1}}(\mathbf{C}^v)_{ij} = \mathrm{sgn}\big(C^v_{ij}\big) \max\Big(\big|C^v_{ij}\big| - \frac{\gamma}{2\rho_1}, \; 0\Big) \tag{16}$$

**$\mathbf{H}^v$ sub-problem:**

$$\min_{\mathbf{H}^v} \frac{\rho_0}{2} \big\|\mathbf{A}^v - \mathbf{H}^v\big\|_F^2 + \lambda \|\mathbf{H}^v\|_{2,p} \tag{17}$$

where $\mathbf{A}^v = \big(\mathbf{S}^v - \mathbf{P}^v + \frac{\mathbf{K}^v}{\rho_0}\big)^T \mathbf{G}^v$. In order to solve (17), we need the following Lemmas 1 and 2 and Theorem 1.

**Lemma 1** (Gao et al. 2021) Consider $\min_{\delta \geq 0} f(\delta) = \frac{1}{2}(\delta - \omega)^2 + \lambda \delta^p$ s.t. $0$
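Returning to the $\mathbf{Q}^v$ update: (15) and (16) together form the classical element-wise soft-thresholding (shrinkage) operator, which takes only a few lines of numpy. This is a sketch of the operator under our reading of the shapes in (14), not the authors' implementation:

```python
import numpy as np

def soft_threshold(C, tau):
    """Shrinkage operator of (16): sgn(C_ij) * max(|C_ij| - tau, 0)."""
    return np.sign(C) * np.maximum(np.abs(C) - tau, 0.0)

def update_Q(G, T, M, gamma, rho1):
    """Q^v update of (15): Q = Theta_{gamma/(2 rho1)}(C^v),
    with C^v = G^vT T^vT + M^v / rho1 as in (14)."""
    C = G.T @ T.T + M / rho1
    return soft_threshold(C, gamma / (2.0 * rho1))
```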