Graph Structure Refinement with Energy-based Contrastive Learning

Xianlin Zeng1,2, Yufeng Wang1, Yuqi Sun1, Guodong Guo3, Wenrui Ding1, Baochang Zhang1
1Beihang University, Beijing, P.R. China
2Postdoctoral Research Station at China Rong Tong Academy of Sciences Group Corporation Limited, Beijing, P.R. China
3Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, P.R. China
zengxianlin@buaa.edu.cn, wyfeng@buaa.edu.cn, sunyuqi@buaa.edu.cn, guodong.guo@mail.wvu.edu, ding@buaa.edu.cn, bczhang@buaa.edu.cn

Abstract

Graph Neural Networks (GNNs) have recently gained widespread attention as a successful tool for analyzing graph-structured data. However, an imperfect graph structure with noisy links lacks robustness and may damage graph representations, limiting the performance of GNNs in practical tasks. Moreover, existing generative architectures fail to fit discriminative graph-related tasks. To tackle these issues, we introduce an unsupervised method that combines generative and discriminative training to jointly learn graph structure and representation, aiming to improve the discriminative performance of generative models. We propose an Energy-based Contrastive Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as ECL-GSR. To our knowledge, this is the first work to combine energy-based models with contrastive learning for GSR. Specifically, we leverage ECL to approximate the joint distribution of sample pairs, which increases the similarity between representations of positive pairs while reducing the similarity between negative ones. The refined structure is produced by adding and removing edges according to similarity metrics among node representations. Extensive experiments demonstrate that ECL-GSR outperforms the state-of-the-art on eight benchmark datasets in node classification.
ECL-GSR achieves faster training with fewer samples and less memory than the leading baseline, highlighting its simplicity and efficiency in downstream tasks.

Introduction

With the explosive growth of graph-structured data, Graph Neural Networks (GNNs) (Zhu et al. 2019; Zhang and Zitnik 2020; Zhang et al. 2019a; Xu et al. 2018) have emerged as a potent deep learning tool, experiencing notable advancements across diverse applications, such as node classification (Veličković et al. 2017; Kipf and Welling 2016), node clustering (Wang et al. 2019a; Zhang et al. 2019b), graph classification (Duvenaud et al. 2015; Lee, Lee, and Kang 2019), link prediction (Peng et al. 2020; Srinivasan and Ribeiro 2019), recommendation systems (Wang et al. 2019b; Yu et al. 2021b), drug discovery (Wang et al. 2021b), and anomaly detection (Ding et al. 2019). GNNs usually adopt a message-passing scheme (Gilmer et al. 2017), aggregating information from neighboring nodes within the observed topology to compute graph representations.

Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The strong representational capacity of most GNNs hinges on the assumption that the graph structure is sufficiently reliable and perfectly noise-free (Szegedy et al. 2013), treated as ground-truth information for model training. However, this assumption rarely holds in real-world applications, for two reasons: i) The raw structure is typically derived from complex interactive systems, leading to inherent uncertainties and incomplete connections. Even worse, the iterative GNN mechanism, with its cascading effects, repeatedly aggregates neighborhood features: minor noise in the graph can propagate to adjacent nodes, influencing other node embeddings and potentially introducing further inaccuracies (Dai et al. 2018). ii) A graph representation built only on the explicit structure is not informative enough to improve task performance.
The raw topology only incorporates necessary physical connections, such as chemical bonds in molecules, and fails to capture abstract or implicit links among nodes. Furthermore, in various graph-related tasks, such as text graphs in natural language processing (Yao, Mao, and Luo 2019) or scene graphs for images in computer vision (Suhail et al. 2021), the explicit structure may be absent or unavailable.

To tackle the challenges mentioned above, Graph Structure Refinement (GSR) learns invariant underlying relationships by extracting general knowledge from graph data, rather than relying solely on task-specific information. The primary concern of GSR therefore lies in graph representation learning, which can be broadly classified into two categories (Wang et al. 2018a). On one hand, graph generative models (Grover and Leskovec 2016; Dong, Chawla, and Swami 2017; Zheng et al. 2020) assume that each node follows an inherent connectivity distribution. Edges are viewed as samples from these distributions, and the models enhance node representations by optimizing the likelihood of the observed edges. However, most downstream tasks are inherently discriminative, such as node classification and graph prediction, and state-of-the-art generative models have significantly deviated from discriminative architectures (Grathwohl et al. 2020). On the other hand, discriminative models (Veličković et al. 2019; Wang et al. 2018b; Hassani and Khasahmadi 2020) focus on learning a classifier to predict the presence of edges directly. They output a single scalar representing the connection probability of a node pair, thereby differentiating the connectivity of edges.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Nonetheless, these models may overfit the training data, capturing noise instead of extracting latent useful features, and lack the ability to generalize across different datasets and diverse graph structures.
Recently, (Wang et al. 2022) and (Kim and Ye 2022) established a crucial connection between discriminative paradigms and Energy-based Models (EBMs) (LeCun et al. 2006), creating a unified framework that generates higher-quality samples and better visual representations. Motivated by these findings, we advocate incorporating EBMs and Contrastive Learning (CL) to unlock the potential of generative models in addressing discriminative graph-related tasks. In this paper, we explore a novel Energy-based Contrastive Learning (ECL) approach to guide the GSR framework, termed ECL-GSR, which integrates EBMs with CL for unsupervised graph representation learning. Specifically, ECL complements the discriminative training loss with a generative loss, supplying higher-quality and more robust representations for downstream tasks. Theoretically, we demonstrate that the existing discriminative loss is merely a specific instance of the ECL loss with the generative term disabled. Empirically, ECL can be interpreted as maximizing the joint log-likelihood of the similarity between positive sample pairs with EBMs and minimizing the similarity between negative ones with CL, corresponding to augmentations of identical and different samples, respectively. In GSR, we perform edge prediction by adding or removing links based on the similarity probabilities among node representations, further refining the raw structure. Finally, we evaluate ECL-GSR on the node classification task using the refined graph.

The major contributions are threefold: i) We present a novel ECL-GSR framework for joint graph structure and representation learning. It is the first work to combine EBMs with CL as generative and discriminative paradigms for GSR. ii) Contrary to most GSR methods, ECL-GSR is a straightforward implementation, demanding fewer training iterations, memory costs, and data samples to obtain equivalent or better performance.
iii) Extensive experiments on eight benchmark datasets demonstrate the superiority of ECL-GSR over current state-of-the-art methods. Ablation studies further confirm its effectiveness, efficiency, and robustness.

Graph Structure Refinement

Given a raw graph $G = (\mathcal{V}, \mathcal{E}, X) = (A, X)$ with noisy topology, where $\mathcal{V}$ is the set of $V = |\mathcal{V}|$ nodes, $\mathcal{E}$ is the set of $M = |\mathcal{E}|$ edges, $X \in \mathbb{R}^{V \times D}$ is the node feature matrix (the $i$-th row $x_i \in \mathbb{R}^{D}$ represents the attribute of node $v_i$), and $A \in \mathbb{R}^{V \times V}$ is the adjacency matrix ($A_{i,j} > 0$ indicates an edge $e_{i,j} = (v_i, v_j) \in \mathcal{E}$). The target of graph structure refinement (Zhu et al. 2021) is to acquire a refined graph $\tilde{G}$ with a clean adjacency $\tilde{A}$, along with a corresponding representation $\tilde{Z} \in \mathbb{R}^{V \times \tilde{F}}$, $\tilde{F} \leq D$, for downstream tasks.

Energy-based Models

Given a point $\chi$ sampled from the data distribution $p_d(\chi)$, EBMs assign a scalar-valued energy $E_\theta(\chi) \in \mathbb{R}$ via a DNN with parameters $\theta$. The energy function defines a probability distribution through the Boltzmann distribution $p_\theta(\chi) = \frac{\exp(-E_\theta(\chi))}{Z(\theta)}$, where $Z(\theta)$ is a normalizing constant (the partition function) ensuring that $p_\theta$ integrates to 1. EBMs leverage the defined distribution $p_\theta$ to model the data distribution $p_d$ by minimizing the negative log-likelihood of $p_\theta$ under $p_d$:

$$\min_\theta \mathbb{E}_{\chi \sim p_d}[-\log p_\theta(\chi)] . \quad (1)$$

The derivative of the negative log-likelihood $\mathcal{L}(\theta)$ is:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\chi^+ \sim p_d}[\nabla_\theta E_\theta(\chi^+)] - \mathbb{E}_{\chi^- \sim p_\theta}[\nabla_\theta E_\theta(\chi^-)] . \quad (2)$$

Eq. 2 decreases the energy values of positive samples $\chi^+$ while increasing those of negative ones $\chi^-$ (Hinton 2002). However, computing $Z(\theta)$ for most parameterizations of $E_\theta(\chi)$ is intractable. We employ Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh 2011), derived from Markov Chain Monte Carlo (MCMC) methods, to reduce the mixing time of the sampling procedure. Specifically, it generates $p_\theta$ as an approximation of $p_d$ by iteratively updating $\chi$:

$$\chi_{k+1} = \chi_k - \frac{\lambda}{2} \nabla_\chi E_\theta(\chi_k) + \omega_k , \quad (3)$$

where $\omega_k \sim \mathcal{N}(0, \lambda)$. As $k \to \infty$ and $\lambda \to 0$, $p_\theta$ converges to $p_d$.
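As a toy illustration of the sampler (not the paper's implementation; the quadratic energy, step size, and iteration count below are assumptions chosen for demonstration), the SGLD update of Eq. 3 can be sketched in NumPy:

```python
import numpy as np

def sgld_sample(grad_energy, chi0, lam=0.05, n_steps=500, seed=None):
    """SGLD update of Eq. 3: chi_{k+1} = chi_k - (lam/2) * grad_E(chi_k) + omega_k,
    where omega_k ~ N(0, lam), i.e. Gaussian noise with variance lam."""
    rng = np.random.default_rng(seed)
    chi = np.asarray(chi0, dtype=float).copy()
    for _ in range(n_steps):
        noise = rng.normal(0.0, np.sqrt(lam), size=chi.shape)  # std sqrt(lam) -> Var lam
        chi = chi - 0.5 * lam * grad_energy(chi) + noise
    return chi

# Toy energy E(chi) = 0.5 * ||chi||^2, so grad_E(chi) = chi; the chain drifts
# from a far-away start toward the high-density region around the origin.
start = np.array([5.0, 5.0])
sample = sgld_sample(lambda c: c, start, seed=0)
```

With a learned network energy, `grad_energy` would instead be obtained by backpropagating through $E_\theta$; the injected noise is what keeps the chain sampling around, rather than collapsing onto, the energy minima.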
This process generates data samples implicitly through the energy function rather than explicitly.

Contrastive Learning

Given a set of random variables $\{\chi_n\}_{n=1}^{N}$, we define a data augmentation family $\mathcal{T}$ to generate two distinct views $\nu_n = t(\chi_n)$ and $\nu'_n = t'(\chi_n)$, where $t, t' \sim \mathcal{T}$. CL constitutes an unsupervised framework for representation learning, aiming to maximize the mutual information $I$ between the representations of two views $\nu_n$ and $\nu'_m$ w.r.t. the joint distribution $p(\nu_n, \nu'_m)$:

$$\max_{D_\theta} I(z_n, z'_m) , \quad (4)$$

where $z_n = D_\theta(\nu_n)$ is the representation and $D_\theta(\cdot)$ is a parametric DNN. When $n = m$, the views $(\nu_n, \nu'_m)$ are referred to as a positive pair with the same marginal distribution; otherwise, they are called a negative pair. In practice, each pair provides supervisory information to the other, playing a role similar to that of labels in supervised learning. CL trains $D_\theta$ to encourage $z_n$ and its positive pair $z'_n$ to be close in the projection space while pushing away the representations of all negative pairs $z'_m$. This principle has proven key to boosting performance (Chen et al. 2020).

Methodology

In this section, we delineate the proposed ECL-GSR framework. As shown in Fig. 1, our pipeline consists of four steps: preprocessing, energy-based contrastive learning, edge prediction, and node classification. Initially, we construct a dual-attribute graph by extracting contextual and structural information, and acquire subgraphs as input through edge sampling. Then, the ECL approach is introduced with its theorem and implementation. Next, we fine-tune the graph structure and evaluate the node classification task with raw features. Lastly, we present the final training objective.

Figure 1: Illustration of the procedure of ECL-GSR. The preprocessed dual-attribute graph undergoes data augmentation, energy-based contrastive learning, and edge prediction, achieving structure refinement. The refined graph is applied in node classification.
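The pull-together/push-apart principle of CL can be sketched with a small NumPy loss using the squared-distance kernel $\exp(-\|z_n - z'_m\|^2/\tau)$ that ECL adopts. This is a simplified stand-in, not the paper's objective: it draws negatives from the second view only rather than the full $2(N-1)$ negatives, and all names are illustrative.

```python
import numpy as np

def contrastive_loss(z, z_prime, tau=0.1):
    """InfoNCE-style loss over N positive pairs (z[n], z_prime[n]) with
    similarity kernel exp(-||z_n - z'_m||^2 / tau); off-diagonal pairs
    act as negatives."""
    d2 = ((z[:, None, :] - z_prime[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
    logits = -d2 / tau
    m = logits.max(axis=1, keepdims=True)                      # stabilized log-softmax
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -np.diag(log_prob).mean()                           # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = contrastive_loss(z, z + 0.01 * rng.normal(size=(8, 16)))  # views nearly agree
misaligned = contrastive_loss(z, np.roll(z, 1, axis=0))             # positives shifted off
```

The loss is near zero when each representation's closest counterpart in the other view is its own positive, and grows when positives are misaligned.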
Preprocessing

Dual-attribute graph. Our framework makes full use of all trustworthy observations to maximize informativeness by constructing a dual-attribute graph. We concatenate the contextual information $X_c$ and the structural embedding $X_s$ into a new attribute, where $X_c$ is derived directly from the raw node features and $X_s$ is extracted using DeepWalk (Perozzi, Al-Rfou, and Skiena 2014). The dual-attribute graph is represented as $G' = (X', A)$, where $X' = [X_c, X_s]$ is the new node feature matrix.

Edge sampling. To address memory limitations, neighbor sampling techniques enable stochastic training on a large graph $G'$ by decomposing it into smaller subgraphs $g'$. Each subgraph sequentially contributes to GNN optimization, executing multiple gradient descent steps within a training epoch. We independently select several edges (node pairs) from the edge set $\mathcal{E}$ to produce a subgraph. This process yields a mini-batch of subgraphs $\{g'_n\}_{n=1}^{N}$, each with a fixed number of edges.

Energy-based Contrastive Learning

Initially, we define $p_d$ as a distribution of graph data and $\mathcal{T}$ as a set of predetermined data augmentation operators. Given a dual-attribute subgraph $g'$ and two augmentation operators $t, t' \in \mathcal{T}$ selected uniformly at random, we propose an ECL approach to build a joint distribution $p_\theta(\nu, \nu')$ over the two views $\nu, \nu' = t(g'), t'(g')$, aiming to approximate the data distribution $p_d(\nu, \nu')$.

Definition 1. The joint distribution $p_\theta(\nu, \nu')$ can be defined as:

$$p_\theta(\nu, \nu') = \frac{\exp(-f_\theta(\nu, \nu'))}{Z(\theta)} , \quad (5)$$

where $Z(\theta) = \int\!\!\int \exp(-f_\theta(\nu, \nu')) \, d\nu \, d\nu'$.

Building upon the assumption that semantically similar pairs $(\nu, \nu')$ have nearby projections with high $p_d$, while dissimilar ones correspond to distant projections with low $p_d$, we solve for the distance between $\nu$ and $\nu'$. Let $f_\theta(\cdot) = \varphi_\theta(\phi_\theta(\cdot))$, where $\phi_\theta(\cdot)$ is a GNN encoder and $\varphi_\theta(\cdot)$ is a linear projection; $z = f_\theta(\nu)$ is the corresponding representation. The term $\|z - z'\|$ indicates the inverse of the semantic similarity of $\nu$ and $\nu'$. To approximate $p_\theta(\nu, \nu')$ to $p_d(\nu, \nu')$, Eq.
1 can be rephrased as:

$$\min_\theta \mathbb{E}_{p_d}[-\log p_\theta(\nu, \nu')] . \quad (6)$$

Proposition 1. The joint distribution $p_\theta(\nu, \nu')$ can be formulated as an EBM:

$$p_\theta(\nu, \nu') = \frac{\exp(-E_\theta(\nu, \nu'))}{Z(\theta)} , \quad (7)$$

where $E_\theta(\nu, \nu') = \|z - z'\|^2 / \tau$ and $\tau$ is a temperature parameter. The gradient of the objective of Eq. 6 is expressed as:

$$\nabla_\theta \mathbb{E}_{p_d}[-\log p_\theta(\nu, \nu')] = \mathbb{E}_{p_d}[\nabla_\theta E_\theta(\nu, \nu')] - \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\nu, \nu')] . \quad (8)$$

To avoid directly calculating $Z(\theta)$, we employ the Bayes rule (Bayes 1763) to reformulate $\mathbb{E}_{p_d}[-\log p_\theta(\nu, \nu')]$ as:

$$\mathbb{E}_{p_d}[-\log p_\theta(\nu, \nu')] = \mathbb{E}_{p_d}[-\log p_\theta(\nu' | \nu)] + \mathbb{E}_{p_d}[-\log p_\theta(\nu)] , \quad (9)$$

where $p_\theta(\nu)$ is the marginal distribution of $p_\theta(\nu, \nu')$ over $\nu'$.

Theorem 1. The marginal distribution $p_\theta(\nu)$ is an EBM:

$$p_\theta(\nu) = \frac{\exp(-E_\theta(\nu))}{Z(\theta)} , \quad (10)$$

where $E_\theta(\nu) = -\log \int \exp(-\|z - z'\|^2 / \tau) \, d\nu'$. Its proof is detailed in Appendix A.1. The gradient of the objective of Eq. 10 is defined as:

$$\nabla_\theta \mathbb{E}_{p_d}[-\log p_\theta(\nu)] = \mathbb{E}_{p_d}[\nabla_\theta E_\theta(\nu)] - \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\nu)] . \quad (11)$$

According to Eq. 9, the objective of ECL decomposes into a discriminative term and a generative term:

$$\mathcal{L}_b(\theta) = \mathbb{E}_{p_d}[-\log p_\theta(\nu' | \nu)] + \alpha \, \mathbb{E}_{p_d}[-\log p_\theta(\nu)] , \quad (12)$$

where $\alpha$ is a hyperparameter trading off the strength of the two terms. According to Eq. 11, the gradient of Eq. 12 is written as:

$$\nabla_\theta \mathcal{L}_b(\theta) = \mathbb{E}_{p_d}[-\nabla_\theta \log p_\theta(\nu' | \nu)] + \alpha \, \mathbb{E}_{p_d}[\nabla_\theta E_\theta(\nu)] - \alpha \, \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(\nu)] . \quad (13)$$

In this way, $Z(\theta)$ ingeniously cancels itself out in the discriminative term without additional calculations. For the generative term, we merely need to sample $\nu$ from $p_d(\nu)$ with added Gaussian noise $\mathcal{N}(0, \lambda)$ and iteratively optimize $\nu$ through SGLD, as indicated in Eq. 3.

Implementation. To implement the training of ECL, we approximate the discriminative and generative terms of Eq. 12, respectively, using the empirical mean of $p_\theta(\nu)$. Given a mini-batch of samples $\{(\nu_n, \nu'_n)\}_{n=1}^{N}$, along with its representations $\{(z_n, z'_n)\}_{n=1}^{N}$, we have $N$ positive and $2(N-1)$ negative samples. Therefore, the empirical mean $\hat{p}_\theta(\nu_n)$ (Kim and Ye 2022) is defined as:

$$\hat{p}_\theta(\nu_n) = \frac{1}{2N} \sum_{\nu'_m : \nu'_m \neq \nu_n} p_\theta(\nu_n, \nu'_m) .$$
(14)

For the discriminative term, we use $\frac{p_\theta(\nu_n, \nu'_n)}{\hat{p}_\theta(\nu_n)}$ to approximate the conditional probability density $p_\theta(\nu' | \nu)$. Following the SimCLR framework (Chen et al. 2020), $\min_\theta \mathbb{E}_{p_d}[-\log \hat{p}_\theta(\nu'_n | \nu_n)]$ can be represented as:

$$\min_{z \sim f_\theta(\nu)} -\log \frac{\exp(-\|z_n - z'_n\|^2 / \tau)}{\frac{1}{2N} \sum_{\nu'_m : \nu'_m \neq \nu_n}^{2(N-1)} \exp(-\|z_n - z'_m\|^2 / \tau)} . \quad (15)$$

Considering only the $N$ positive samples in the generative term, we simplify Eq. 14 to $\hat{p}_\theta(\nu_n) = \frac{1}{N} \sum_{n=1}^{N} p_\theta(\nu_n, \nu'_n)$. The approximation of $\min_\theta \mathbb{E}_{p_d}[-\log \hat{p}_\theta(\nu_n)]$ is denoted as:

$$\min_{z \sim f_\theta(\nu)} -\log \frac{1}{N} \sum_{n=1}^{N} \exp(-\|z_n - z'_n\|^2 / \tau) . \quad (16)$$

Algorithm 1: The entire process of ECL-GSR
Input: Dual-attribute graph $G'$ with node classification labels $Y$, EBM $E_\theta$, augmentation operators $t, t'$, batch size $N$, number of batches $B$, SGLD iterations $K$, and training epochs $P$
Output: The predicted labels $\hat{Y}$
Construct $G'$ from $G$; randomly initialize $E_\theta(\cdot)$ (including $\phi_\theta(\cdot)$, $\varphi_\theta(\cdot)$) and $C_\theta(\cdot)$ with $\alpha, \beta, \tau, \mu$
for $p = 1, 2, \ldots, P$ do
  for $b = 1, 2, \ldots, B$ do
    Sample a batch $\{g'_n\}_{n=1}^{N}$ from $G'$
    Build $p_d(\nu, \nu')$ over $\nu, \nu' = t(g'), t'(g')$
    Sample $\{\nu_n, \nu'_n\}_{n=1}^{N}$ from $p_d(\nu, \nu')$
    Calculate the discriminative term of $\mathcal{L}_b$ with Eq. 15
    Sample $\{\nu_n\}_{n=1}^{N}$ from $p_d(\nu)$ with $\mathcal{N}(0, \lambda)$
    for $k = 1, 2, \ldots, K$ do
      Sample $\omega_k \sim \mathcal{N}(0, \lambda)$
      Update $\{\nu_{n,k+1}\}_{n=1}^{N}$ from $\{\nu_{n,k}\}_{n=1}^{N}$ with Eq. 3
    Calculate the generative term of $\mathcal{L}_b$ with Eq. 16
    Calculate $\nabla_\theta \mathcal{L}_b$ with Eq. 13 and $\mathcal{L}_E$ with Eq. 17
  Calculate $A$ with Eq. 18 and binarize $A$ to yield $\tilde{A}$
  Predict $\hat{Y}$ with $C_\theta$ and calculate $\mathcal{L}_C$ with $Y$
  Update $\theta_E$ and $\theta_C$ to minimize $\mathcal{L}$ with Eq. 19

In summary, the final objective of ECL is:

$$\mathcal{L}_E(\theta) = \mathcal{L}_b(\theta) + \beta \mathcal{L}_r(\theta) , \quad (17)$$

where $\mathcal{L}_r(\theta) = \frac{1}{2N} \sum_{n \neq m} E_\theta(\nu_n, \nu'_m)^2$ is an L2 regularization loss that prevents gradient overflow due to excessive energy values, and $\beta$ is another trade-off hyperparameter.

Edge Prediction

Upon completion of ECL training, we fine-tune the graph structure through edge prediction. The edge predictor receives the graph representation and outputs an edge probability matrix, denoted as $A$.
Each element $A_{i,j}$ denotes the predicted probability of an edge existing between the node pair $(v_i, v_j)$:

$$A_{i,j} = \mathrm{Norm}(\cos(\tilde{z}_i, \tilde{z}_j)) , \quad (18)$$

where $\tilde{z}_i, \tilde{z}_j \in \tilde{Z}$, $\cos(\cdot)$ is the cosine similarity function, $\tilde{Z}$ denotes the representation output by the encoder $\phi_\theta(\cdot)$, and $\mathrm{Norm}(\cdot)$ is a normalization function. For end-to-end training, we binarize $A$ with relaxed Bernoulli sampling (Zhao et al. 2021) on each edge to produce the final matrix $\tilde{A}$.

Node Classification

Using $\tilde{A}$ and $X$ as inputs, we utilize a simple three-layer GNN as the node classifier $C_\theta$, which can be instantiated with a GCN or GAT architecture. Node representations are defined as $H = C_\theta(\tilde{A}, X)$, and the predicted label $\hat{Y}$ is aligned with the ground truth $Y$. For each node representation $h_i \in H$, $\hat{y}_i \in \hat{Y}$ is given by $\mathrm{Softmax}(h_i)$. The node classification loss $\mathcal{L}_C(\theta)$ is the cross-entropy between $\hat{Y}$ and $Y$.

Method | Cora | Citeseer | Cornell | Texas | Wisconsin | Actor | Pubmed | OGB-Arxiv
GCN | 81.46±0.58 | 71.36±0.31 | 47.84±5.55 | 57.83±2.76 | 57.45±4.30 | 30.01±0.77 | 79.18±0.29 | 70.77±0.19
GAT | 81.41±0.77 | 70.69±0.58 | 46.22±6.33 | 54.05±7.35 | 57.65±7.75 | 28.91±0.83 | 77.85±0.42 | 69.90±0.25
LDS | 83.01±0.41 | 73.55±0.54 | 47.87±7.14 | 58.92±4.32 | 61.70±3.58 | 31.05±1.31 | OOM | OOM
GEN | 80.21±1.72 | 71.15±1.81 | 57.02±7.19 | 65.94±4.13 | 66.07±3.72 | 27.21±2.05 | 78.91±0.69 | OOM
SGSR | 83.48±0.43 | 72.96±0.25 | 44.32±2.16 | 60.81±4.87 | 56.86±1.24 | 30.23±0.38 | 78.09±0.53 | OOM
GRCN | 83.87±0.49 | 72.43±0.61 | 54.32±8.24 | 62.16±7.05 | 56.08±7.19 | 29.97±0.71 | 78.92±0.39 | OOM
IDGL | 83.88±0.42 | 72.20±1.18 | 50.00±8.98 | 62.43±6.09 | 59.41±4.11 | 28.16±1.41 | OOM | OOM
GAug-O | 82.20±0.80 | 71.60±1.10 | 57.60±3.80 | 56.90±3.60 | 54.80±5.70 | 25.80±1.00 | 79.30±0.40 | OOM
SUBLIME | 83.40±0.42 | 72.30±1.09 | 70.29±3.51 | 70.21±2.32 | 66.73±2.44 | 30.79±0.68 | 73.80±0.60 | 55.50±0.10
ProGNN | 80.30±0.57 | 68.51±0.52 | 54.05±6.16 | 48.37±8.75 | 62.54±7.56 | 22.35±0.88 | 71.60±0.46 | OOM
CoGSL | 81.76±0.24 | 73.09±0.42 | 52.16±3.21 | 59.46±4.36 | 58.82±1.52 | 32.95±1.20 | OOM | OOM
STABLE | 80.20±0.68 | 68.91±1.01 | 44.03±4.05 | 55.24±6.04 | 53.00±5.27 | 30.18±1.00 | OOM | OOM
NodeFormer | 80.28±0.82 | 71.31±0.98 | 42.70±5.51 | 58.92±4.32 | 48.43±7.02 | 25.51±1.17 | 78.21±1.43 | 55.40±0.23
ECL-GSR | 84.06±0.84 | 73.70±0.75 | 71.27±2.06 | 72.97±3.39 | 67.79±1.03 | 33.71±0.96 | 80.91±1.12 | 71.09±0.31

Table 1: Node classification accuracy (mean(%)±std) with the standard splits on various benchmark datasets. The top three results are highlighted as first, second, and third best, respectively. OOM indicates out of memory.

Training Objective

During training, we efficiently compute the joint classification loss $\mathcal{L}_C(\theta)$ and the ECL loss $\mathcal{L}_E(\theta)$ using gradient-based backpropagation. The overall loss is:

$$\min_{\theta_E, \theta_C} \mathcal{L}(\theta) = \mathcal{L}_E(\theta) + \mu \mathcal{L}_C(\theta) , \quad (19)$$

where $\theta_E$ and $\theta_C$ are the parameters of $E_\theta(\cdot)$ and $C_\theta(\cdot)$, respectively. The pseudocode of ECL-GSR is illustrated in Algorithm 1. Training stability is discussed in Appendix A.6.

Experiments

We conduct comprehensive experiments to evaluate the proposed framework's effectiveness, complexity, and robustness, addressing five research questions:
RQ1: How effective is ECL-GSR on the node classification task?
RQ2: How efficient is ECL-GSR in terms of training time and space?
RQ3: How do the ECL architecture and its hyperparameters impact the performance of node-level representation learning?
RQ4: How robust is ECL-GSR in the face of structural attacks or noise?
RQ5: What kind of refined structure does ECL-GSR learn?

Experimental Setups

Datasets. For extensive comparison, we run experiments on eight benchmark datasets: four citation networks (Cora, Citeseer (Sen et al. 2008), Pubmed (Namata et al. 2012), and OGB-Arxiv (Hu et al. 2020)), three webpage graphs (Cornell, Texas, and Wisconsin (Pei et al. 2020)), and one actor co-occurrence network (Actor (Tang et al. 2009)).

Baselines. To corroborate the promising performance of ECL-GSR, we compare it against 13 GSR baseline methods: two GNN baselines (GCN (Kipf and Welling 2016) and GAT (Veličković et al.
2017)), three adjacency-matrix direct-optimization methods (NodeFormer (Wu et al. 2022), STABLE (Li et al. 2022), and ProGNN (Jin et al. 2020)), four probability estimation techniques (GEN (Wang et al. 2021a), GAug-O (Zhao et al. 2021), SGSR (Zhao et al. 2023), and LDS (Franceschi et al. 2019)), and four metric learning approaches (SUBLIME (Liu et al. 2022b), GRCN (Yu et al. 2021a), CoGSL (Liu et al. 2022a), and IDGL (Chen, Wu, and Zaki 2020)).

Implementation details. Our framework runs on an Ubuntu system with an NVIDIA GeForce 3090 GPU, using PyTorch 1.12.1, DGL 1.1.0, and Python 3.9.16. All experiments are conducted using the reimplementation of GSLB (Li et al. 2023). We keep the dimensions of the contextual features $X_c$ and structural features $X_s$ equal to that of the raw attributes. The subgraph sampling batch size $N$ is fixed at 64 for efficiency. In ECL, the backbone $f_\theta(\cdot)$ is divided into $\phi_\theta(\cdot)$ for encoding, using three GCN layers with hidden and output dimension $\tilde{F}$ of 128, and $\varphi_\theta(\cdot)$ for projection, comprising two fully-connected layers with an output dimension $F$ of 128. The learned representation $\tilde{Z}$ is produced by $\phi_\theta(\cdot)$. Batch normalization is discarded when using SGLD. The data augmentation operator $\mathcal{T}$ is a random Gaussian blur. For node classification, the classifier $C_\theta(\cdot)$ mirrors the architecture of $\phi_\theta(\cdot)$. The final hyperparameters of our model are: $\alpha=0.1$, $\beta=0.01$, $\mu=0.01$, and $\tau=0.1$. We adopt the Adam optimizer with an initial learning rate of 0.001, halved every 20 epochs. The number of epochs $P$ is 40 for Cora, Citeseer, Cornell, Texas, and Wisconsin, and 80 for Actor, Pubmed, and OGB-Arxiv. The number of SGLD iterations $K$ is only 3.

Node Classification Performance (RQ1)

Evaluation on standard splits. As stated in Table 1, three key observations can be made: i) ECL-GSR shows robust performance across all benchmark datasets, demonstrating its superior generalizability to diverse data.
Notably, across the eight datasets, ECL-GSR achieves the state-of-the-art with margins ranging from 0.15% to 1.61% over the second-highest approach. ii) Compared to other baselines, ECL-GSR exhibits enhanced performance stability and reduced standard deviation, particularly evident on the Cornell, Texas, Wisconsin, and Actor datasets. iii) Whereas certain competing algorithms such as CoGSL, GEN, and IDGL encounter OOM errors on Pubmed and OGB-Arxiv, our approach achieves the state-of-the-art on large benchmarks.

Method | Cora 1% | Cora 3% | Cora 5% | Cora 10% | Citeseer 1% | Citeseer 3% | Citeseer 5% | Citeseer 10%
GCN | 59.31±0.29 | 77.14±0.21 | 80.73±0.63 | 83.53±0.42 | 60.64±1.07 | 67.60±0.47 | 70.05±0.54 | 74.38±0.27
GAT | 65.36±0.99 | 76.36±0.61 | 81.73±0.21 | 83.92±0.42 | 58.48±2.35 | 68.41±0.76 | 70.73±0.22 | 74.54±0.14
LDS | 68.47±1.11 | 78.06±0.98 | 81.42±0.66 | 83.87±0.48 | 61.35±1.57 | 67.29±1.34 | 70.82±0.79 | 74.54±0.49
IDGL | 70.83±1.21 | 78.60±0.28 | 83.82±0.28 | 85.51±0.08 | 60.61±1.32 | 64.34±1.61 | 69.39±1.24 | 74.19±0.58
SGSR | 55.11±0.43 | 77.32±0.17 | 83.51±0.22 | 85.56±0.25 | 54.28±0.47 | 71.61±0.17 | 72.88±0.20 | 74.31±0.24
GRCN | 68.38±2.10 | 75.24±1.06 | 79.16±0.82 | 84.82±0.41 | 59.06±1.80 | 66.17±0.75 | 72.11±0.56 | 74.49±0.73
CoGSL | 64.43±3.35 | 73.21±1.10 | 79.02±3.22 | 81.05±0.53 | 56.41±0.91 | 66.60±0.79 | 69.96±0.56 | 74.17±0.53
ProGNN | 70.32±1.16 | 75.93±0.78 | 81.35±0.68 | 82.01±0.67 | 56.77±0.88 | 70.34±0.66 | 70.67±0.79 | 74.23±0.36
SUBLIME | 65.94±4.90 | 73.37±0.78 | 79.14±0.26 | 82.37±0.20 | 57.85±1.64 | 67.67±0.84 | 70.53±0.16 | 71.47±0.08
NodeFormer | 67.11±1.07 | 75.87±0.79 | 82.05±0.67 | 83.92±0.45 | 67.03±0.89 | 67.84±0.60 | 70.65±1.05 | 73.03±0.37
ECL-GSR | 72.33±0.39 | 79.99±0.21 | 84.30±0.18 | 85.71±0.20 | 68.06±0.54 | 72.18±0.15 | 73.38±0.22 | 74.90±0.23

Table 2: Node classification accuracy (mean(%)±std) with different train ratios on the Cora and Citeseer datasets. The top two results are highlighted as first and second best, respectively.

Figure 2: Training time and space analysis on the Cora, Citeseer, Actor, and Pubmed datasets.
Evaluation on different train ratios. In Table 2, we conduct experiments on the Cora and Citeseer datasets with varying amounts of supervised information, specifically at training ratios of 1%, 3%, 5%, and 10%. The results indicate that our framework substantially outperforms existing baselines in accuracy at low training ratios. Among GSR approaches, ECL-GSR achieves equivalent or better performance with fewer training samples and maintains competitive performance at high training ratios.

Efficiency and Scalability Analysis (RQ2)

In this section, we analyze the efficiency and scalability of ECL-GSR on the Cora, Citeseer, Actor, and Pubmed datasets. As illustrated in Fig. 2, a position nearer to the figure's upper-left corner signifies superior overall performance. For efficiency, the time complexity of ECL-GSR is $O(P \cdot K \cdot B)$, where $B$ represents the number of batches. The higher time efficiency of our approach stems from its fast convergence, requiring only a limited number of training epochs $P$ and iterations $K$. Regarding scalability, we can flexibly adapt the stochastic training by adjusting the mini-batch size $N$, achieving an acceptable space complexity of $O(N^2)$. Conventional GSR algorithms are typically hindered by their considerable time and space demands, constraining their applicability to large-scale graphs. Some alternative solutions, such as NodeFormer and SGSR, are recognized for their speed, albeit at the cost of diminished classification accuracy. Methods like CoGSL and LDS are notable for their effectiveness, yet they demand considerable computation and storage. Conversely, ECL-GSR achieves advantages in terms of accuracy, speed, and memory usage, especially on the Citeseer and Pubmed datasets.

Figure 3: Performance study of ECL-GSR variants and other baselines over multiple training epochs on four datasets.
Ablation Study (RQ3)

Component analysis. As illustrated in Fig. 3, we investigate the impact of various configurations on the Cora, Citeseer, Actor, and Pubmed datasets, evaluating the performance of GEN, GRCN, SGSR, ProGNN, ECL with raw attributes, ECL without the generative term, ECL without the discriminative term, and full ECL over a range of training epochs. 1st Mean and 2nd Mean are the averages of the baselines and the variants, respectively. Our findings indicate that: i) Utilizing raw graph attributes without structural embeddings marginally reduces the accuracy of ECL-GSR. ii) When either the generative or the discriminative term is absent, a notable decrease in performance suggests a vital role for their combination. iii) All variants reach their peak performance within fewer training epochs, highlighting our framework's swift convergence compared to other GSR methods.

Figure 4: Hyperparameter $\mu$ and dimensionality $\tilde{F}$ analysis of ECL-GSR on four datasets. Mean denotes the averages.

Figure 5: Hyperparameter $\alpha$ and dimensionality $\tilde{F}$ analysis of ECL-GSR on four datasets.

Parameter analysis. The impacts of varying the hyperparameters $\alpha$ and $\mu$ and the output dimension $\tilde{F}$ on the Cora, Citeseer, Actor, and Pubmed datasets are shown in Fig. 4 and Fig. 5, respectively. With respect to $\mu$, a low value weakens the constraint of the classification loss, whereas a high value degrades our framework to the baseline, diminishing the role of ECL. Regarding $\alpha$, we explore the importance of the generative term relative to the discriminative term in ECL. Setting $\alpha$ to 1.0 or higher yields suboptimal results; performance peaks at 0.1 and then declines marginally as $\alpha$ is decreased further. The choice of $\tilde{F}$ is crucial for balancing representational adequacy and preventing overfitting: lower dimensions compromise performance due to insufficient representation, while higher dimensions maintain performance but add model complexity.
Robustness Analysis (RQ4)

To evaluate the robustness of ECL-GSR, we randomly add or remove edges from the raw graph on the Cora and Citeseer datasets and then evaluate the performance of various algorithms on the corrupted graph. We vary the ratio of modified edges from 0 to 0.8 to simulate different noise intensities and compare our framework with GCN, GRCN, LDS, ProGNN, and IDGL. As revealed in Fig. 6, the performance of all models generally trends downward with increased attack intensity. Nonetheless, GSR approaches commonly exhibit better stability than the GCN baseline. As edge addition and deletion rates increase, ECL-GSR consistently achieves better or comparable results in both scenarios, indicating its robustness under severe structural attacks.

Figure 6: Robustness analysis by randomly adding and removing edges on the Cora and Citeseer datasets.

Structure Visualization (RQ5)

To aid comprehension of the refined topology, we present visualizations of the edge weights for both the raw and refined graphs, as depicted in Fig. 7. We adhere to the previous strategy (Li et al. 2023) of selecting several subgraphs from the Cora and Citeseer datasets. Randomly sampling 20 labeled (L) and 20 unlabeled (U) nodes, we extract four subgraphs and separate them with red lines. Each subgraph contains two classes, each with 10 nodes. Intra- and inter-class connections are separated by green lines, and the diagonal elements represent self-loops. Compared with the sparse intra- and inter-class connections of the raw graph, the refined graph shows a significantly denser structure. However, a denser graph does not necessarily equate to improved performance. We find that ECL-GSR maintains a lower frequency of inter-class connections than intra-class ones. This observation aligns with the basic principle of ECL, which pulls similar semantic information close and pushes dissimilar information away.
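For reference, the random structural attack used in the robustness analysis can be emulated with a simple routine. The sketch below is an illustrative stand-in, not the benchmark's exact protocol: it perturbs a given ratio of undirected edges in a 0/1 adjacency matrix, with the ratio taken relative to the original edge count.

```python
import numpy as np

def perturb_edges(adj, ratio, mode="remove", seed=None):
    """Randomly flip round(ratio * num_edges) undirected edges.
    adj: symmetric 0/1 adjacency matrix without self-loops.
    mode: "remove" deletes existing edges, "add" inserts new ones."""
    rng = np.random.default_rng(seed)
    adj = adj.copy()
    iu = np.triu_indices_from(adj, k=1)         # each undirected edge counted once
    upper = adj[iu]
    target = 1 if mode == "remove" else 0       # entries eligible for flipping
    pool = np.flatnonzero(upper == target)
    k = min(int(round(ratio * upper.sum())), pool.size)
    upper[rng.choice(pool, size=k, replace=False)] ^= 1
    adj[iu] = upper
    return np.triu(adj, 1) + np.triu(adj, 1).T  # re-symmetrize

# 6-node ring graph (6 undirected edges); ratio 0.5 flips 3 of them.
A = np.zeros((6, 6), dtype=int)
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1
A_removed = perturb_edges(A, 0.5, mode="remove", seed=0)
A_added = perturb_edges(A, 0.5, mode="add", seed=0)
```

Operating on the upper triangle keeps each undirected edge a single coin flip, and re-symmetrizing afterwards guarantees the corrupted graph remains a valid undirected adjacency matrix.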
Conclusion

In this paper, we advance an Energy-based Contrastive Learning approach to guide GSR and introduce a novel ECL-GSR framework, which jointly optimizes graph structure and representation. ECL approximates the joint distribution to learn good representations by combining generative and discriminative paradigms. We evaluate the proposed method on the graph node classification task, and experimental results verify its superior effectiveness, efficiency, and robustness.

Figure 7: Visualization of the refined adjacency matrices by various GSR algorithms on the Cora and Citeseer datasets.

Acknowledgments

The work was supported by the National Key Research and Development Program of China (Grant No. 2023YFC3306401). This work was also supported by the National Natural Science Foundation of China (Grant Nos. U20B2042 and 62076019).

References

Bayes, T. 1763. LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S. Philosophical Transactions of the Royal Society of London, (53): 370–418.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning. PMLR.
Chen, Y.; Wu, L.; and Zaki, M. 2020. Iterative Deep Graph Learning for Graph Neural Networks: Better and Robust Node Embeddings. Advances in Neural Information Processing Systems, 33: 19314–19326.
Dai, H.; Li, H.; Tian, T.; Huang, X.; Wang, L.; Zhu, J.; and Song, L. 2018. Adversarial Attack on Graph Structured Data. In International Conference on Machine Learning. PMLR.
Ding, K.; Li, J.; Bhanushali, R.; and Liu, H. 2019. Deep Anomaly Detection on Attributed Networks. In Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM.
Dong, Y.; Chawla, N. V.; and Swami, A. 2017. metapath2vec: Scalable Representation Learning for Heterogeneous Networks.
In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 135–144.

Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. Advances in Neural Information Processing Systems, 28.

Franceschi, L.; Niepert, M.; Pontil, M.; and He, X. 2019. Learning Discrete Structures for Graph Neural Networks. In International Conference on Machine Learning. PMLR.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural Message Passing for Quantum Chemistry. In International Conference on Machine Learning. PMLR.

Grathwohl, W.; Wang, K.-C.; Jacobsen, J.-H.; Duvenaud, D.; Norouzi, M.; and Swersky, K. 2020. Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One. In International Conference on Learning Representations.

Grover, A.; and Leskovec, J. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 855–864.

Hassani, K.; and Khasahmadi, A. H. 2020. Contrastive Multi-view Representation Learning on Graphs. In International Conference on Machine Learning, 4116–4126. PMLR.

Hinton, G. E. 2002. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14(8): 1771–1800.

Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020. Open Graph Benchmark: Datasets for Machine Learning on Graphs. Advances in Neural Information Processing Systems, 33: 22118–22133.

Jin, W.; Ma, Y.; Liu, X.; Tang, X.; Wang, S.; and Tang, J. 2020. Graph Structure Learning for Robust Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Kim, B.; and Ye, J. C. 2022. Energy-Based Contrastive Learning of Visual Representations.
Advances in Neural Information Processing Systems, 35: 4358–4369.

Kipf, T. N.; and Welling, M. 2016. Semi-supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907.

LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; and Huang, F. 2006. A Tutorial on Energy-based Learning. Predicting Structured Data, 1(0).

Lee, J.; Lee, I.; and Kang, J. 2019. Self-attention Graph Pooling. In International Conference on Machine Learning. PMLR.

Li, K.; Liu, Y.; Ao, X.; Chi, J.; Feng, J.; Yang, H.; and He, Q. 2022. Reliable Representations Make a Stronger Defender: Unsupervised Structure Refinement for Robust GNN. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Li, Z.; Wang, L.; Sun, X.; Luo, Y.; Zhu, Y.; Chen, D.; Luo, Y.; Zhou, X.; Liu, Q.; Wu, S.; et al. 2023. GSLB: The Graph Structure Learning Benchmark. arXiv preprint arXiv:2310.05174.

Liu, N.; Wang, X.; Wu, L.; Chen, Y.; Guo, X.; and Shi, C. 2022a. Compact Graph Structure Learning via Mutual Information Compression. In Proceedings of the ACM Web Conference 2022.

Liu, Y.; Zheng, Y.; Zhang, D.; Chen, H.; Peng, H.; and Pan, S. 2022b. Towards Unsupervised Deep Graph Structure Learning. In Proceedings of the ACM Web Conference 2022.

Namata, G.; London, B.; Getoor, L.; Huang, B.; and Edu, U. 2012. Query-driven Active Surveying for Collective Classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8.

Pei, H.; Wei, B.; Chang, K. C.-C.; Lei, Y.; and Yang, B. 2020. Geom-GCN: Geometric Graph Convolutional Networks. arXiv preprint arXiv:2002.05287.

Peng, Z.; Huang, W.; Luo, M.; Zheng, Q.; Rong, Y.; Xu, T.; and Huang, J. 2020. Graph Representation Learning via Graphical Mutual Information Maximization. In Proceedings of the Web Conference 2020.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online Learning of Social Representations.
In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710.

Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; and Eliassi-Rad, T. 2008. Collective Classification in Network Data. AI Magazine, 29(3): 93–93.

Srinivasan, B.; and Ribeiro, B. 2019. On the Equivalence between Node Embeddings and Structural Graph Representations. arXiv preprint arXiv:1910.00452.

Suhail, M.; Mittal, A.; Siddiquie, B.; Broaddus, C.; Eledath, J.; Medioni, G.; and Sigal, L. 2021. Energy-based Learning for Scene Graph Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing Properties of Neural Networks. arXiv preprint arXiv:1312.6199.

Tang, J.; Sun, J.; Wang, C.; and Yang, Z. 2009. Social Influence Analysis in Large-scale Networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903.

Veličković, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep Graph Infomax. In International Conference on Learning Representations.

Wang, C.; Pan, S.; Hu, R.; Long, G.; Jiang, J.; and Zhang, C. 2019a. Attributed Graph Clustering: A Deep Attentional Embedding Approach. arXiv preprint arXiv:1906.06532.

Wang, H.; Wang, J.; Wang, J.; Zhao, M.; Zhang, W.; Zhang, F.; Xie, X.; and Guo, M. 2018a. GraphGAN: Graph Representation Learning with Generative Adversarial Nets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Wang, H.; Zhang, F.; Hou, M.; Xie, X.; Guo, M.; and Liu, Q. 2018b. SHINE: Signed Heterogeneous Information Network Embedding for Sentiment Link Prediction.
In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 592–600.

Wang, R.; Mou, S.; Wang, X.; Xiao, W.; Ju, Q.; Shi, C.; and Xie, X. 2021a. Graph Structure Estimation Neural Networks. In Proceedings of the Web Conference 2021.

Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q. Z.; and Orgun, M. 2019b. Sequential Recommender Systems: Challenges, Progress and Prospects. arXiv preprint arXiv:2001.04830.

Wang, Y.; Min, Y.; Chen, X.; and Wu, J. 2021b. Multi-view Graph Contrastive Representation Learning for Drug-Drug Interaction Prediction. In Proceedings of the Web Conference 2021.

Wang, Y.; Wang, Y.; Yang, J.; and Lin, Z. 2022. A Unified Contrastive Energy-based Model for Understanding the Generative Ability of Adversarial Training. arXiv preprint arXiv:2203.13455.

Welling, M.; and Teh, Y. W. 2011. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Wu, Q.; Zhao, W.; Li, Z.; Wipf, D. P.; and Yan, J. 2022. NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification. Advances in Neural Information Processing Systems, 35: 27387–27401.

Xu, K.; Hu, W.; Leskovec, J.; and Jegelka, S. 2018. How Powerful are Graph Neural Networks? arXiv preprint arXiv:1810.00826.

Yao, L.; Mao, C.; and Luo, Y. 2019. Graph Convolutional Networks for Text Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33.

Yu, D.; Zhang, R.; Jiang, Z.; Wu, Y.; and Yang, Y. 2021a. Graph-revised Convolutional Network. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part III. Springer.

Yu, J.; Yin, H.; Li, J.; Wang, Q.; Hung, N. Q. V.; and Zhang, X. 2021b. Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation. In Proceedings of The Web Conference 2021.
Zhang, C.; Song, D.; Huang, C.; Swami, A.; and Chawla, N. V. 2019a. Heterogeneous Graph Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhang, X.; Liu, H.; Li, Q.; and Wu, X.-M. 2019b. Attributed Graph Clustering via Adaptive Graph Convolution. arXiv preprint arXiv:1906.01210.

Zhang, X.; and Zitnik, M. 2020. GNNGuard: Defending Graph Neural Networks against Adversarial Attacks. Advances in Neural Information Processing Systems, 33: 9263–9275.

Zhao, J.; Wen, Q.; Ju, M.; Zhang, C.; and Ye, Y. 2023. Self-Supervised Graph Structure Refinement for Graph Neural Networks. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining.

Zhao, T.; Liu, Y.; Neves, L.; Woodford, O.; Jiang, M.; and Shah, N. 2021. Data Augmentation for Graph Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35.

Zheng, S.; Zhu, Z.; Zhang, X.; Liu, Z.; Cheng, J.; and Zhao, Y. 2020. Distribution-induced Bidirectional Generative Adversarial Network for Graph Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7224–7233.

Zhu, D.; Zhang, Z.; Cui, P.; and Zhu, W. 2019. Robust Graph Convolutional Networks against Adversarial Attacks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhu, Y.; Xu, W.; Zhang, J.; Du, Y.; Zhang, J.; Liu, Q.; Yang, C.; and Wu, S. 2021. A Survey on Graph Structure Learning: Progress and Opportunities. arXiv preprint arXiv:2103.03036.