# calibrating_and_improving_graph_contrastive_learning__bd74a30b.pdf Published in Transactions on Machine Learning Research (07/2023) Calibrating and Improving Graph Contrastive Learning Kaili Ma* klma@cse.cuhk.edu.hk Department of Computer Science and Engineering, The Chinese University of Hong Kong Garry Yang* hcyang@cse.cuhk.edu.hk Department of Computer Science and Engineering, The Chinese University of Hong Kong Han Yang hyang@cse.cuhk.edu.hk Department of Computer Science and Engineering, The Chinese University of Hong Kong Yongqiang Chen yqchen@cse.cuhk.edu.hk Department of Computer Science and Engineering, The Chinese University of Hong Kong James Cheng jcheng@cse.cuhk.edu.hk Department of Computer Science and Engineering, The Chinese University of Hong Kong *These authors contributed equally to this work. Reviewed on Open Review: https: // openreview. net/ forum? id= Ld SP6cv TS4 Graph contrastive learning algorithms have demonstrated remarkable success in various applications such as node classification, link prediction, and graph clustering. However, in unsupervised graph contrastive learning, some contrastive pairs may contradict the truths in downstream tasks and thus the decrease of losses on these pairs undesirably harms the performance in the downstream tasks. To assess the discrepancy between the prediction and the ground-truth in the downstream tasks for these contrastive pairs, we adapt expected calibration error (ECE) to graph contrastive learning. The analysis of ECE motivates us to propose a novel regularization method, Contrast-Reg, to ensure that decreasing the contrastive loss leads to better performance in the downstream tasks. As a plug-in regularizer, Contrast-Reg effectively improves the performance of existing graph contrastive learning algorithms. We provide both theoretical and empirical results to demonstrate the effectiveness of Contrast-Reg in enhancing the generalizability of the Graph Neural Network (GNN) model and improving the performance of graph contrastive algorithms with different similarity definitions and encoder backbones across various downstream tasks. 1 Introduction Graph structures are widely used to capture abundant information, such as hierarchical configurations and community structures, present in data from various domains like social networks, e-commerce networks, knowledge graphs, the World Wide Web, and semantic webs. By incorporating graph topology along with node and edge attributes into machine learning frameworks, graph representation learning has demonstrated remarkable success in numerous essential applications, such as node classification, link prediction, and graph clustering. Graph contrastive learning has emerged as an effective method for learning graph representations, including algorithms such as DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), GCC (Qiu et al., 2020) Published in Transactions on Machine Learning Research (07/2023) and Py GCL (Zhu et al., 2021a). These algorithms use the noise contrastive estimation loss (NCEloss) to contrast the similarities of similar (or positive) node pairs against those of negative pairs. The algorithms differ in their definition of node similarity, hence the design of contrastive pairs. The design of the encoder backbone can vary from Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019) to a skip-gram based model (Tang et al., 2015; Perozzi et al., 2014; Grover & Leskovec, 2016). In this paper, we focus on employing graph contrastive learning algorithms in an unsupervised learning manner and utilize GNNs as the encoder. The learned node embeddings are directly delivered to the downstream tasks. Although graph contrastive algorithms have demonstrated strong performance in some downstream tasks, we have discovered that directly applying contrastive pairs and GNN models to new tasks or data is not always effective (as shown in Section 6). Nodes with different ground-truth labels can be selected as positive pairs during unsupervised training. Contrastive models that solely minimize the NCE loss can be misled by these pairs and learn certain spurious features that decrease the loss but are detrimental to downstream tasks. To address this issue, we adapt expected calibration error (ECE) to graph contrastive learning to assess the discrepancy between the predicted probability and the ground-truth in downstream tasks for the selected contrastive pairs. High ECE values indicate mis-calibration of the model. We identify two factors that contribute to the model s mis-calibration: (a) the expectation of the prediction probability for randomly sampled pairs, E(v, v )[σ(hv hv )], and (b) the probability q+ of v + sharing the same label as v in positive sampling. To address the mis-calibration in existing graph contrastive learning algorithms, we introduce a novel regularization method, denoted as Contrast-Reg. Contrast-Reg employs a regularization vector r, which consists of a random vector with each entry within the range (0, 1]. By ensuring node representations maintain similarity with r and that pseudo-negative representations, generated through shuffled node features, remain dissimilar to r, Contrast-Reg guarantees that minimizing the contrastive loss results in highquality representations that improve accuracy in downstream tasks, rather than overfitting specific spurious features. We provide both theoretical evidence and empirical experiments to support the effectiveness of Contrast-Reg. First, we derive a generalization bound for our contrastive GNN framework, building upon the theoretical framework depending on Rademacher complexity (Saunshi et al., 2019), and demonstrate that Contrast-Reg contributes to a decrease in the upper bound. This result indicates that this term promotes better alignment with the performance of downstream tasks while simultaneously minimizing the training loss, thereby improving the generalizability of the GNN model. Furthermore, we design experiments to examine the empirical performance of Contrast-Reg by formulating the graph contrastive learning framework into four components: a similarity definition, a GNN encoder, a contrastive loss function, and a downstream task. We apply Contrast-Reg to different compositions of these components and achieve superior results across various compositions. The main contributions of this paper can be summarized as follows: (Section 4.1) We adapt expected calibration error (ECE) to assess the discrepancy between the prediction and the ground-truth in the downstream tasks on contrastive pairs, and identify two key factors that influence the discrepancy. (Section 4.2) We propose a novel regularization method, Contrast-Reg, to ensure that minimizing the contrastive loss results in high-quality representations and improved accuracy in downstream tasks. Contrast Reg is a plugin regularizer generally effective with respect to various graph contrastive learnings. (Section 4.2 & Section 5 & Section 6) We provide both theoretical results and empirical results to demonstrate the effectiveness of Contrast-Reg in improving the generalizability of the GNN model and achieving superior performance across different existing graph contrastive learning algorithms. 2 Related Work Graph representation learning. Many graph representation learning models have been proposed. Factorization-based models (Ahmed et al., 2013; Cao et al., 2015; Qiu et al., 2018) factorize an adjacency matrix to obtain node representations. Random walk-based models such as Deep Walk (Perozzi et al., 2014) Published in Transactions on Machine Learning Research (07/2023) sample node sequences as the input to skip-gram models to compute the representation for each node. Node2vec (Grover & Leskovec, 2016) balances depth-first and breadth-first random walk when it samples node sequences. HARP (Chen et al., 2018) compresses nodes into super-nodes to obtain a hierarchical graph to provide hierarchical information to random walk. GNN models (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019; Qu et al., 2019; Thomas et al., 2023) have shown great capability in capturing both graph topology and node/edge feature information. Most GNN models follow a neighborhood aggregation schema, in which each node receives and aggregates the information from its neighbors in each GNN layer, i.e., for the k-th layer, hk i = aggregate(hk 1 j , j neighborhood(i)), and hk i = combine( hk i ,hk 1 i ). This work employs GNN models as the backbone and tests the representation across various downstream tasks, such as node classification, link prediction, and graph clustering. Graph contrastive learning. Contrastive learning is a self-supervised learning method that learns representations by contrasting positive pairs against negative pairs. Contrastive pairs can be constructed in various ways for different types of data and tasks, such as multi-view (Tian et al., 2020; 2019), target-tonoise (van den Oord et al., 2018; Hénaff et al., 2019), mutual information (Belghazi et al., 2018; Hjelm et al., 2019), instance discrimination (Wu et al., 2018), context co-occurrence (Mikolov et al., 2013), clustering (Asano et al., 2020; Caron et al., 2018), multiple data augmentation (Chen et al., 2020), known and novel pairs (Sun & Li, 2023), and contextually relevant (Neelakantan et al., 2022). Contrastive learning has been successfully applied to numerous graph representation learning models (Velickovic et al., 2019; Peng et al., 2020; You et al., 2020; Qiu et al., 2020; Zeng et al., 2021) to pseudo subgraph instance discrimination as a contrastive learning training objective and to leverage contrastive learning to empower graph neural networks in learning node representations. We characterize different types of node-level similarity as follows: Structural similarity: Structural similarity can be captured from various perspectives. From a graph theory viewpoint, Graph Wave (Donnat et al., 2018) leverages the diffusion of spectral graph wavelets to capture structural similarity, while struc2vec (Ribeiro et al., 2017) uses a hierarchy to measure node similarity at different scales. From an induced subgraph perspective, GCC (Qiu et al., 2020) treats the induced subgraphs of the same ego network as similar pairs and those from different ego networks as dissimilar pairs. To capture community structure, v Graph (Sun et al., 2019) utilizes the high correlation between community detection and node representations to incorporate more community structure information into node representations. To capture global-local structure, DGI (Velickovic et al., 2019) maximizes the mutual information between node representations and graph representations to allow node representations to contain more global information. GDCL (Zhao et al., 2021) incorporates the results of graph clustering to decrease the false-negative samples. Di GCL (Tong et al., 2021) generates contrastive pairs by Laplacian perturbations to retain more structural features of directed graphs. GCA (Zhu et al., 2021b) constructs contrastive pairs by designing adaptive augmentation schemes based on node centrality measures. AD-GCL (Suresh et al., 2021) adopts trainable edge-dropping graph augmentation and optimizes both adversarial graph augmentation strategies and contrastive objectives to avoid the model learning redundant graph features. Attribute similarity: Nodes with similar attributes are likely to have similar representations. GMI (Peng et al., 2020) maximizes the mutual information between node attributes and high-level representations, and Pretrain GNN (Hu et al., 2020b) applies attribute masking to help capture domain-specific knowledge. GCA (Zhu et al., 2021b) corrupts node features by adding more noise to unimportant node features, to enforce the model to recognize underlying semantic information. Given the above subgraph instance discrimination objective with GNN backbones, NCEloss (Gutmann & Hyvärinen, 2012; Dyer, 2014; Mnih & Teh, 2012; Yang et al., 2020) is applied to optimize the model s parameters. In addition, Graph Autoencoder, which is another popular graph self-supervised approach by reconstructing useful graph information, has also been extensively studied. For example, Graph MAE (Hou et al., 2022) focuses on feature reconstruction with a masking strategy and scaled cosine error. Graph MAE2 (Hou et al., 2023) designs strategies of multi-view random re-mask decoding and latent representation prediction to regularize the reconstruction process. S2GAE (Tan et al., 2023) randomly masks a portion of edges and Published in Transactions on Machine Learning Research (07/2023) learns to reconstruct these missing edges with a novel masking strategy. See Gera (Li et al., 2023) adopts a novel hierarchical variational framework. Numerous studies have demonstrated the success of enhancing a model s generalizability by devising advanced similarity functions. However, some designs may require extensive tuning, preventing these advanced techniques from adapting to settings beyond the scope of their experiments. This problem is especially prevalent in the unsupervised learning setting, where the ideal protocol is that the learned node embeddings, which are the output of the unsupervised contrastive model, should be applicable to downstream tasks upon convergence. The key difference between Contrast-Reg and other algorithms, such as GCA, GDCL, and Di GCL, lies in the fact that the latter methods are intended to incorporate structures that are critical to the performance of the downstream tasks, while Contrast-Reg ensures that minimizing the contrastive loss results in high-quality representations that enhance the accuracy in downstream tasks. Moreover, Contrast Reg can naturally function as a plugin to these advanced graph contrastive learning algorithms to improve their performance. Expected Calibration Error. Expected Calibration Error (ECE) (Guo et al., 2017) is a metric employed to quantify the calibration between confidence (largest predicted probability) and accuracy in a model. Calibration refers to the consistency between a model s predicted probabilities and the actual outcomes. When a model is mis-calibrated, its generalization to unseen data in supervised learning tasks is likely to be poor (Müller et al., 2019; Pereyra et al., 2017; Guo et al., 2017; Zhang et al., 2018). We propose using ECE to evaluate the quality of node embeddings generated by unsupervised graph contrastive learning models. This calibration offers insights into addressing the challenge that applying graph contrastive learning algorithms to downstream tasks does not always yield optimal results. Regularization for graph representation learning. Graph AT (Feng et al., 2019) and BVAT (Deng et al., 2019) introduce adversarial perturbations f x to the input data x as regularizers to obtain more robust models. Graph SGAN (Ding et al., 2018) generates fake input data in low-density regions by incorporating a generative adversarial network as a regularizer. P-reg (Yang et al., 2021) leverages the smoothness property in real-world graphs to enhance GNN models. Graphnorm (Cai et al., 2021) proposes a novel feature normalization method. ALS (Zhou et al., 2021) employs label propagation to adaptively integrate label smoothing into GNN training. G-Mixup (Han et al., 2022) uses graphons as a surrogate to apply mixup techniques to graph data. The above regularizers are designed for general representation generalizability, while Contrast-Reg is specifically intended to address the mis-calibration problem in the unsupervised graph contrastive learning optimization process and its performance on downstream tasks. It is worth noting that Contrast-Reg could be used in conjunction with the aforementioned regularization techniques. 3 Preliminaries We begin by introducing the concepts and foundations of graph contrastive learning (GCL). Let a graph G = (V, E) be denoted, where V = v1, v2, , vn and E represent the vertex set and edge set of G, and the node feature vector of node vi be xi. Our objective is to learn node embeddings through unsupervised graph contrastive learning and subsequently apply simple classifiers leveraging these embeddings for various downstream tasks, such as node classification, link prediction, and graph clustering. Graph contrastive learning Given a graph G with node features X = (x1, x2, , xn), the aim of graph contrastive learning is to train an encoder f : (G, X) Rd for all input data points vi V with node feature vector xi by constructing positive pairs (vi, v+ i ) and negative pairs (vi, v i1, , v i K). The encoder f is typically implemented using Graph Neural Networks (GNNs). Specifically, let hi represent the output embedding of the encoder f for node vi, h(k) i as the embedding at the k-th layer, and h0 i = xi. Published in Transactions on Machine Learning Research (07/2023) The output of the encoder f is then iteratively defined as: m(k) i = Aggregatek n h(k 1) j : vj N(vi) o h(k) i = RELU W k Update h(k 1) i , m(k) i , (1) where N(vi) the set of nodes adjacent to vi, and Aggregate and Update are the aggregation and update function of GNNs (Gilmer et al., 2017). hi is the last layer of h(k) i for node vi. Leveraging various types of similarity as pseudo subgraph instance discrimination labels, graph contrastive learning constructs positive and negative pairs to train the embedding hi by optimizing the loss on these pairs. The most commonly employed loss is the NCEloss: ˆLnce = 1 M log σ(h T i h+ i ) + k=1 log σ(h T i h ij) (2) with M samples vi,v+ i ,v i1, ,v i K M i=1 in empirical setting where σ( ) is the sigmoid function. Calibrating graph contrastive learning by the expected calibration error (ECE) Expected Calibration error measures the degree to which the model output probabilities match ground-truth accuracy in supervised tasks (Naeini et al., 2015; Guo et al., 2017). It s defined as the expectation of absolute difference between the largest predicted probability (confidence) and its corresponding accuracy, ECE = E(v,v ) S [|p(v, v ) acc(v, v )|] (3) In this paper, we extend the use of ECE to evaluate the quality of graph contrastive learning. Specifically, we can compare the predicted probability of a positive pair (i.e., a pair of nodes with the same label) with the true probability that the pair belongs to the same class. We can also compare the predicted probability of a negative pair (i.e., a pair of nodes with different labels) with the true probability that the pair belongs to different classes. By calculating the differences between the predicted and true probabilities for both positive and negative pairs, we can quantify the overall mis-calibration of the model and identify areas for improvement. The formal definition is presented as follows. The largest predicted probability, p(v, v ), is determined as ( σ(hv hv), (v, v ) as positive pair 1 σ(hv hv), (v, v ) as negative pair (4) where hv and hv are the embeddings for the target node v and the selected sample v , respectively, and σ( ) is the sigmoid function. The corresponding accuracy, acc(v, v ), is defined as follows: acc(v, v ) = ( I(v, v ), (v, v ) as positive pair 1 I(v, v ), (v, v ) as negative pair (5) where I is an indicator function denoting whether v and v has the same label, the positive/negative pairs are pseudo-labels for training the contrastive learning algorithm which may not equal to the true label. In this case, when two nodes are selected as positive pairs, acc(v, v ) is equal to 1 if v and v belong to the same class and equal to 0 otherwise; when the two nodes are selected as negative pairs, acc(v, v ) = 1 I(v, v ). Using the ECE metric in this way allows us to evaluate the quality of the embeddings outputted by the model (σ(hv hv)) and their accuracy in downstream tasks (acc(v, v )), such as node classification. We can improve the general performance and utility of graph contrastive learning algorithms by detecting regions of mis-calibration and developing novel techniques for enhancing calibration performance. Published in Transactions on Machine Learning Research (07/2023) 0 50 100 150 200 250 𝔼(v,v )[σ(hv hv )] 𝔼(v,v )[p(v, v )] (a) NCEloss and E(v,v )[σ(hv hv )] over epochs 0 50 100 150 200 250 Positive sampling Accuracy q+ (b) ECE and q+ over epochs Figure 1: Calibrating Graph Contrastive Learning (LC (Alg. 3) w/o Contrast-Reg on the Pubmed dataset) 4 Methodology To calibrate the model, we first adapt the expected calibration error (ECE) formula for the graph contrastive learning setting (Section 4.1). In order to ensure that minimizing the contrastive loss leads to high-quality representations that improve accuracy in downstream tasks, we propose a regularization term designed to lower the ECE value and enhance the model s generalizability (Section 4.2). We also provide theoretical evidence to support the effectiveness of the proposed regularization term. 4.1 Empirical Calibration Reveals Limitations of Existing Graph Contrastive Learning Algorithms To calibrate the existing graph contrastive learning algorithms, we measure the degree of calibration using the expected calibration error (ECE) metric, which is defined by Equation 3, 4, 5: 1 Eacc(v,v +)=1[p(v, v +)] q+ (true positive) + Eacc(v,v +)=0[p(v, v +)] 1 q+ ! (false positive) Eacc(v,v )=1[p(v, v )] q (false negative) + 1 Eacc(v,v )=0[p(v, v )] (1 q ) (true negative), where q+ and q are the probabilities that the node v has the same label as node v in positive and negative sampling, respectively; r+ and r are the ratios of sampling positive and negative samples, with r++r = 1. In this study, we set r+ = r = 0.5. Building upon this formulation, we propose the following claim that takes into account the positive and negative pair construction assumptions to analyze the potential issues that lead to the mis-calibration between the decrease in graph contrastive learning training loss and the degradation of downstream task performance. Claim 4.1. Under the assumption that negative sampling is uniformly sampled, and positive sampling is sampled based on the calculated distance between pairwise embeddings, ECE is positively correlated with the expectation of the prediction value for randomly sampled pairs Ev,v [σ(hv hv )], and negatively correlated with the probability q+ of v + having the same label as v in positive sampling. We provide a thorough analysis of this claim in Appendix A.1. To investigate the changes in these two factors and the ECE value over the process of minimizing the NCEloss (Equation 2), we conduct an illustrative Published in Transactions on Machine Learning Research (07/2023) experiment with an existing graph contrastive learning algorithm and present the relevant values in Figure 1. In Figure 1a, we show that as the NCEloss (red solid line) is minimized over the epochs, the expectation of the prediction value for randomly sampled pairs E(v, v )[σ(hv hv )] (blue dashed line) increases. Moreover, Figure 1b demonstrates the model s challenge in identifying genuine positive pairs, with the probability q+ initially increasing before gradually declining (red solid line), and the ECE value first decreasing before rising (blue dashed line). These findings suggest that, as the epochs progress, merely reducing the NCEloss is insufficient to enhance the accuracy of downstream tasks. Our empirical calibration shows that the model learns certain spurious features that reduce the loss but are ultimately harmful to downstream tasks when solely minimizing the NCEloss. The above analysis emphasizes the need for graph contrastive learning algorithms to effectively quantify the relationships between learned embeddings and the accuracy of downstream tasks. In the next section, we present our approach to mitigate the possible risks associated with spurious feature learning in graph contrastive learning algorithms and show its effectiveness through a number of experiments. 4.2 Proposed Regularization Term: Contrast-Reg To ensure that minimizing the NCEloss aligns with downstream task accuracy, we propose a contrastive regularization term, denoted as Contrast-Reg, given by: Lreg = E h, h log σ(h T i Wr) + log σ( h T i Wr) , (7) where r is a random vector uniformly sampled from (0, 1], W is a trainable parameter, and h is the noisy feature generated by different data augmentation techniques, such as those used in the previous literature (Chen et al., 2020; Velickovic et al., 2019). In Section 5, we will discuss how we calculate the noisy features in the GNN setting. In the following, we will introduce the impact of Contrast-Reg on the ECE value and the generalizability of the GNN model. ECE decreases by incorporating Contrast-Reg In Claim 4.1 and Appendix A.1, it is proven that the ECE value is positively correlated with E(v, v )[σ(hv, hv )] and inversely correlated with q+. In addition, Theorem 2 in Appendix A.3 shows that Contrast-Reg reduces the variance of the learned node embeddings norm h , thereby preventing the loss from decreasing due to exploding embedding norms. As a result, the norm values are lower than those obtained with the vanilla contrastive loss, leading to a reduction in E(v, v )[σ(hv, hv )] and an increase in representation quality. With the improvement in the representation quality, E(v, v )[σ(hv, hv )] decreases and q+ increases, resulting in a decreased ECE value. Furthermore, we conducted an empirical study to examine the impact of Contrast-Reg on the ECE value and investigate the two factors that cause miscalibration. Figure 2a shows that the expectation of the prediction value for randomly sampled pairs E(v,v )[σ(hv hv )] with Contrast-Reg increased much more slowly compared to the vanilla NCEloss (green solid line compared to the red solid line). Figure 2b presents the changes in positive sampling accuracy q+ over the epochs of the vanilla NCEloss and NCEloss with Contrast-Reg, where Contrast-Reg helps q+ increase, while q+ trained by vanilla NCEloss decreased after the initial increases. These two factors contribute to the different ECE changes in the vanilla NCEloss and NCEloss with Contrast-Reg. The red dashed line indicates that the ECE value increases after the initial decreases, while the green dashed line shows that the ECE value decreases slightly. The comparison demonstrates that applying Contrast-Reg alleviates the miscalibration in representation learning, ensuring that minimizing the contrastive loss results in high-quality representations with increased accuracy in downstream tasks, rather than overfitting certain spurious features. We provide a detailed comparison of the impact on the ECE value between models with and without Contrast-Reg in Appendix A.2 across different datasets. Generalizability improved by incorporating Contrast-Reg In Section 6.3, we present empirical evidence supporting the effectiveness of Contrast-Reg s ability to enhance the generalizability of the GNN model through an ablation study. Furthermore, we analyze the impact of Contrast-Reg on the generalizability of graph contrastive learning algorithms using the Rademacher complexity (Saunshi et al., 2019). In Theorem 1, we offer performance guarantees for the learned graph embeddings outputted by the GNN function class F = f with the unsupervised loss function Lnce on the downstream average classification task Lµ sup( ˆf). Detailed settings can be found in Appendix A.4. Assume that f is bounded, i.e., hi R Published in Transactions on Machine Learning Research (07/2023) 0 50 100 150 200 250 ECE w/o reg ECE with reg w/o reg 𝔼(v,v )[σ(hv hv )] with reg 𝔼(v,v )[σ(hv hv )] 𝔼(v,v )[p(v, v )] (a) Influence of Contrast-Reg on ECE and E(v,v )[σ(hv hv )] 0 50 100 150 200 250 ECE w/o reg ECE with reg with reg q+ (b) Influence of Contrast-Reg on ECE and q+ 0 50 100 150 200 250 w/o reg σmax(M(f, c)) with reg σmax(M(f, c)) w/o reg 𝔼vi Dc hi 2 with reg 𝔼vi Dc hi 2 σmax(M(f, c)) 𝔼vi Dc hi 2 (c) Influence of Contrast-Reg on σmax(M(f, c)) and Evi Dc hi 2 Figure 2: Effects of Contrast-Reg (LC (Alg. 3) w/ & w/o Contrast-Reg on Pubmed) with R > 0. Let c, c be two classes sampled independently from latent classes C with distribution ρ. Let τ = Ec,c ρ2I (c = c ) be the probability that c and c come from the same class. Let L = nce(f) be the NCEloss when negative samples come from different classes. We have the following theorem. Theorem 1. f F, with probability at least 1 δ, Lµ sup( ˆf) L = nce(f) + βs(f) + ηGen M, (8) where Gen M = 8RRS(F) M 8 log(σ( R2)) q , β = τ 1 τ , η = 1 1 τ , and s(f) = E(vi,vj) Dsim(vi,vj) [(h T i hi)2], In Equation (8), Gen M represents the generalization error in terms of Rademacher complexity and will converge to 0 when the encoder function f is bounded and the number of samples M is sufficiently large. The theorem above indicates that only when βs(f)+ηGen M converges to 0 as M increases, will the encoder ˆf = arg minf F ˆLnce selected perform well in downstream tasks. Next, we will analyze the impact of Contrast-Reg on the condition in the aforementioned theorem, βs(f) + ηGen M, by reformulating s(f) as follows: E(vi,vj) Dsim(vi,vj) h T i hjh T j hi Ec ρ Evi Dc h T i Exj Dc hjh T j hi Ec ρ h M(f, c) 2 Evi Dc hi 2i , where M(f, c) := Evi Dc hih T i . The advantage of incorporating Contrast-Reg into graph contrastive learning lies in its ability to reduce both the largest singular value of matrix M(f, c) (σmax(M(f, c))) and the expectation of the embedding norm within the same class, Evi Dc[ hi 2] (as illustrated in Figure 2c). This reduction leads to a decrease in the upper bound of s(f), ultimately yielding a lower value for the term βs(f) + ηGen M compared to vanilla graph contrastive learning. The reduction in this term promotes better alignment with the performance of downstream tasks while simultaneously minimizing the training loss. Further experiments on the generalizability of the Contrast-Reg approach can be found in Section 6.3, and a more detailed explanation on lowering the term s(f) is provided in Appendix A.3. 5 A Contrastive GNN Framework Algorithm 1 presents the graph contrastive learning framework and how Contrast-Reg is plugged into the framework. The framework employs a GNN model, such as GCN (Kipf & Welling, 2017), GAT (Velickovic Published in Transactions on Machine Learning Research (07/2023) et al., 2018), and GIN (Xu et al., 2019), to obtain node embeddings. To train the parameters of the GNN backbone, we generate contrastive pairs using the functions Seed Select and Contrast. In addition, we incorporate the Contrast-Reg term into the optimization process, along with the contrastive loss on these pairs. After the model convergence, the embeddings are directly delivered to the downstream tasks. Algorithm 1: Graph Contrastive Learning Framework Input: Graph G = (V, E), node attributes X, a GNN model f : V RH, the number of epochs e; Output: A trained GNN model f; 1 Initialize training parameters; 2 for epoch 1 to e do 3 C=Seed Select(G, X, f, epoch); 4 P=Contrast(C, G, X, f); 5 loss = NCEloss(P) + Contrast-Reg(G, X, f); 6 Back-propagation and update f; The functions Seed Select and Contrast are responsible for selecting the seed nodes and sampling the positive/negative pairs for these seeds while incorporating various priors for both structural and attribute aspects of the graph. To illustrate the graph contrastive learning framework, we utilize two simple graph contrastive learning algorithms (ML and LC). LC: Structural similarity We adopt clustering methods to detect communities in a graph and generate positive/negative pairs in the node representation space (Newman, 2006). The Seed Select function is implemented using curriculum learning, following the design of AND (Huang et al., 2019). Initially, Seed Select utilizes node embeddings to select a node set C with low entropy, reducing randomness and uncertainty during early training stages. As training progresses, Seed Select gradually expands C by adding more nodes. For each node xi in C, Contrast generates a positive node x+ i that is highly similar, and a randomly sampled negative node x i , returning a tuple of their representations: (f(xi), f(x+ i ), f(x i )) ML: Attribute similarity In ML, we assume that nodes with similar attributes should have similar representations to preserve the attribute information. Following the design of DIM (Hjelm et al., 2019) and GMI (Peng et al., 2020), we adopt a multi-level representation approach. Initially, Seed Select includes all nodes in the node set G. For each seed node xi, Contrast uses the node itself as the positive node and randomly samples a negative node. The positive representation of the seed node xi is generated by adding an additional GNN layer upon f, denoted as g. The tuple containing the representations is denoted as (f(xi), g(xi), f(x i )) Advanced graph contrastive learning algorithms can also be easily implemented using this framework. As an example, Graph Contrastive Analysis (GCA) (Zhu et al., 2021b) algorithm can be implemented by making a simple modification to the Contrast function found in Appendix B. Overall, Contrast-Reg serves as a highly effective plugin regularizer for graph contrastive learning with various backbones, and with different Contrast and Seed Select implementations. 6 Experimental Results We begin by introducing the experimental settings in Section 6.1. Section 6.2 presents the main results across various downstream tasks. Moreover, we assess the benefits of Contrast-Reg through ablation studies. 6.1 Experiment Settings Downstream tasks We conduct our experiments on three distinct downstream tasks, namely node classification, graph clustering, and link prediction. The experimental procedure consists of two stages: first, we utilize positive and negative contrastive pairs to train the GNN models in an unsupervised manner, obtaining the node embeddings. Subsequently, we apply these embeddings to the downstream tasks by integrating Published in Transactions on Machine Learning Research (07/2023) Table 1: Downstream task: node classification Algorithm Contrast-Reg Cora Citeseer Pubmed Wiki Computers Photo ogbn-arxiv ogbn-products Reddit GCN 81.54 0.68 71.25 0.67 79.26 0.38 72.40 0.95 79.82 2.04 88.75 1.99 71.74 0.29 75.64 0.21 94.02 0.05 node2vec 71.07 0.91 47.37 0.95 66.34 1.40 58.76 1.48 75.37 1.52 83.63 1.53 70.07 0.13 72.49 0.10 93.26 0.04 DGI 81.90 0.84 71.85 0.37 76.89 0.53 63.70 1.43 64.92 1.93 77.19 2.60 69.66 0.18 77.00 0.21 94.14 0.03 GMI 80.95 0.65 71.11 0.15 77.97 1.04 63.35 1.03 79.27 1.64 87.08 1.23 68.36 0.19 75.55 0.39 94.19 0.04 GCA 82.51 0.75 70.83 0.89 85.69. 1.03 76.38 1.25 87.47 0.49 92.07 1.0 OOM OOM OOM Graph MAE 84.14 0.50 73.06 0.34 80.98 0.41 77.40 0.18 80.77 1.43 88.93 1.58 71.47 0.27 OOM OOM GCA 83.90 1.22 72.39 0.88 87.06 0.26 77.71 0.06 88.21 0.37 93.13 0.32 OOM OOM OOM Graph MAE 84.18 0.49 73.52 0.32 81.43 0.41 78.22 0.13 81.16 1.10 90.71 1.29 71.56 0.19 OOM OOM LC 82.33 0.41 72.88 0.39 79.33 0.59 69.19 1.13 81.98 1.52 87.59 1.50 69.94 0.11 76.96 0.34 94.43 0.03 ML 82.65 0.57 72.98 0.41 80.10 1.04 67.20 0.96 82.11 1.47 86.78 1.70 70.05 0.09 76.27 0.20 94.38 0.04 Table 2: Downstream task: graph clustering Algorithm Contrast-Reg Cora Citeseer Wiki Acc NMI F1 Acc NMI F1 Acc NMI F1 node2vec 61.78 0.30 44.47 0.21 62.65 0.26 39.58 0.37 24.23 0.27 37.54 0.39 43.29 0.58 37.39 0.52 36.35 0.51 DGI 71.81 1.01 54.90 0.66 69.88 0.90 68.60 0.47 43.75 0.50 64.64 0.41 44.37 0.92 42.20 0.90 40.16 0.72 AGC 68.93 0.02 53.72 0.04 65.62 0.01 68.37 0.02 42.44 0.03 63.73 0.02 49.54 0.07 47.02 0.09 42.16 0.11 GMI 63.44 3.18 50.33 1.48 62.21 3.46 63.75 1.05 38.14 0.84 60.23 0.79 42.81 0.40 41.53 0.20 38.52 0.22 LC 70.04 2.04 55.08 0.75 67.36 2.17 67.90 0.74 43.63 0.57 64.21 0.60 50.12 0.96 49.70 0.49 43.74 0.97 ML 71.59 1.07 56.01 0.64 68.11 1.32 69.17 0.43 44.47 0.46 64.74 0.41 53.13 1.01 51.81 0.57 46.11 0.93 additional straightforward models. For example, we employ multiclass logistic regression for the node classification, k-means for graph clustering, and a single MLP layer for link prediction. Moreover, we conduct experiments in the pretrain-finetune paradigm, as this approach constitutes a significant component of graph contrastive learning (Qiu et al., 2020). We first employ a large graph to train the GNN models, followed by finetuning such models and train a simple downstream classifier on a separate graph. Training details The datasets we employ encompass citation networks, web graphs, co-purchase networks, and social networks. Comprehensive statistics for these datasets can be found in Appendix D. For Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-products, and Reddit, we adhere to the standard dataset splits and conduct 10 different runs with fixed random seeds ranging from 0 to 9. For Computers, Photo, and Wiki, we randomly divide the train/validation/test sets, allocating 20/30/all remaining nodes per class, in accordance with the recommendations in the previous literature (Shchur et al., 2018). We measure performance across 25 (5 5) different runs, comprising 5 random splits and 5 fixed-seed runs (from 0 to 4) for each random split. The hyperparameter configurations can be found in Appendix D. 6.2 Main Results Our primary results involve comparing our proposed Contrast-Reg, along with the chosen similarity definitions, against state-of-the-art algorithms across various downstream tasks. 6.2.1 Node Classification We evaluate node classification performance on all datasets, utilizing both full-batch training and stochastic mini-batch training. Our methods are compared with DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), node2vec (Grover & Leskovec, 2016), GCA (Zhu et al., 2021b), Graph MAE (Hou et al., 2022) and supervised GCN (Kipf & Welling, 2017). GCA, DGI, and GMI represent state-of-the-art algorithms in unsupervised graph contrastive learning. Node2vec is an exemplary algorithm for random walk-based graph representation algorithms (Grover & Leskovec, 2016; Tang et al., 2015; Perozzi et al., 2014), while GCN is a classic semisupervised GNN model. We incorporate the Contrast-Reg into ML, LC, GCA and Graph MAE models and compare them against the baseline models. The results, presented in Table 1, demonstrate that GCA and Graph MAE with Contrast-Reg achieve excellent performance across all the datasets that are trained in the full-batch mode and even surpass the performance of the supervised GCN. Published in Transactions on Machine Learning Research (07/2023) Table 3: Downstream task: link prediction Algorithm Contrast-Reg Cora Citeseer Pubmed Wiki GCN neg 92.40 0.51 92.27 0.90 97.24 0.19 93.27 0.31 node2vec 86.33 0.87 79.60 1.58 81.74 0.57 92.41 0.35 DGI 93.62 0.98 95.03 1.73 97.24 0.13 95.55 0.35 GMI 91.31 0.88 92.23 0.80 95.14 0.25 95.30 0.29 ML-GCN 94.81 0.55 96.38 0.63 92.86 0.26 89.84 0.69 GRACE 92.82 0.65 93.14 0.73 96.05 0.91 87.72 1.16 LC 94.61 0.64 95.63 0.88 97.26 0.15 96.28 0.21 ML-GCN 95.18 0.31 96.83 0.55 95.10 0.22 92.70 0.56 GRACE 93.60 0.57 93.24 0.62 96.60 1.56 90.82 0.99 Table 4: Pretraining Algorithm Contrast-Reg Reddit ogbn-products No pretraining 90.44 1.62 84.69 0.79 DGI 92.09 1.05 86.37 0.19 GMI 92.13 1.16 86.14 0.16 ML 92.18 0.97 86.28 0.20 LC 92.52 0.55 86.45 0.13 6.2.2 Graph Clustering We assess clustering performance using three metrics: accuracy (Acc), normalized mutual information (NMI), and F1-macro (F1), following the previous literature (Xia et al., 2014). Higher values indicate better clustering performance. We compare our methods with DGI, node2vec, GMI, and AGC (Zhang et al., 2019) on the Cora, Citeseer, and Wiki datasets. AGC is a state-of-the-art graph clustering algorithm that leverages high-order graph convolution for attribute graph clustering. For all models and datasets, we employ k-means to cluster both the labels and representations of nodes. The clustering results of labels are considered as the ground truth. To reduce dimensionality, we apply PCA to the representations before using k-means, since high dimensionality can negatively impact clustering (Chen, 2018). The random seed setting for model training is consistent with that in the node classification task. To minimize randomness, we set the clustering random seed from 0 to 4 and compute the average result for each learned representation. Table 2 presents improved results with and without PCA for each cell. Our algorithms, particularly ours (ML), exhibit superior performance in all cases, demonstrating the effectiveness of Contrast-Reg. It is worth noting that the superior results of ours (ML) compared to ours (LC) suggest that attributes play a crucial role in clustering, as graph clustering is applied to attribute graphs. 6.2.3 Link Prediction In order to circumvent the data linkage issue in link prediction, we employ an inductive setting for graph representation learning. We randomly extract induced subgraphs (comprising 85% of the edges) from each original graph for training both the representation learning model and the link predictor, while reserving the remaining edges for validation and testing (10% for the test edge set and 5% for the validation edge set). We assess performance across 25 (5x5) different runs, utilizing a fixed-seed random split scheme with five distinct induced subgraphs and five fixed-seed runs (ranging from 0 to 4). We compare our results with the baseline algorithms, including DGI, GMI, node2vec, ML-GCN (Shiao et al., 2022), GRACE (Zhu et al., 2020), and unsupervised GCN (GCN-neg in Table 3). In addition, we incorporate Contrast-Reg into LC, ML-GCN, and GRACE against the aforementioned baselines. Our results, presented in Table 3, demonstrate that Contrast-Reg consistently improves the performance of graph contrastive learning models, as evidenced by comparing the results of ML-GCN and GRACE with and without Contrast-Reg. Furthermore, we achieve the state-of-the-art results with the algorithms incorporating Contrast-Reg. Published in Transactions on Machine Learning Research (07/2023) Table 5: Graph contrastive learning with regularization Algorithm Regularization Cora Wiki Computers Reddit ML 73.22 0.77 58.70 1.51 77.08 2.48 94.33 0.07 ML ℓ2-normalization 80.14 0.92 61.83 1.36 80.80 1.45 94.07 0.19 ML Weight-decay 81.65 0.42 63.67 1.47 79.09 0.32 94.34 0.06 ML Contrast-Reg 82.65 0.57 67.20 0.96 82.11 1.47 94.38 0.04 LC 79.73 0.75 65.14 1.48 79.80 1.49 94.42 0.03 LC ℓ2-normalization 81.09 0.59 63.96 0.64 80.73 1.36 94.20 0.06 LC Weight-decay 81.94 0.44 65.17 1.34 81.89 1.58 94.44 0.03 LC Contrast-Reg 82.33 0.41 69.19 1.13 81.98 1.52 94.43 0.03 6.2.4 Pretraining We further assess the performance of Contrast-Reg in the context of the pretrain-finetune paradigm. For the Reddit dataset, we naturally partition the data by time, pretraining the models using the first 20 days. We generate an induced subgraph based on the pretraining nodes and divide the remaining data into three parts: the first part produce a new subgraph for fine-tuning the pre-trained model and training the classifier, while the second and third parts are designated for validation and testing. For the ogbn-products dataset, we split the data according to node ID, pretraining the models using a subgraph generated by the initial 70% of the nodes. The data splitting scheme for the remaining data mirrors that of the Reddit dataset. We conduct baseline experiments on DGI and GMI, employing the same Graph SAGE with GCN-aggregation encoder as in our model. Table 4 reveals that pretraining the model facilitates convergence to a more robust representation model with reduced variance, and Contrast-Reg can enhance the transferability of the pre-trained model. 6.3 Discussion of Contrast-Reg To evaluate the benefits of Contrast-Reg, we conduct experiments comparing it to two approaches, specifically, ℓ2-normalization and weight decay, on four networks from different domains, with a focus on node classification task performance. It is important to note that some regularization techniques, such as those mentioned in Section 2, are not explicitly designed to address the miscalibration problem. ℓ2-normalization (Chen et al., 2020) mitigates the potential risk of the expectation of the prediction value for randomly sampled pairs E(v,v )[σ(v v )] exploding by explicitly eliminating the embedding norm for each node s embeddings to tackle the miscalibration problem, while weight decay strives to achieve the same result by implicitly restricting the gradient descent step length. Table 5 presents a comparison of the accuracy achieved by different contrastive learning algorithms, ML and LC, when incorporating ℓ2-normalization, weight decay, and Contrast-Reg, as opposed to using the vanilla algorithms. The results indicate that integrating ℓ2-normalization, weight decay, and Contrast-Reg into graph contrastive learning algorithms improves the accuracy of downstream tasks, suggesting that addressing miscalibration enhances the generalization of learned embeddings for downstream tasks. However, the performance gains provided by Contrast-Reg exceed those of ℓ2-normalization and weight decay. This implies that while alternative algorithms exist to address miscalibration, Contrast-Reg emerges as the most effective method for improving the generalization of learned embeddings for downstream tasks. It is crucial to acknowledge the existence of other algorithms that may also contribute to mitigating the miscalibration problem, and further research is needed to explore and compare their effectiveness. 6.4 Generality of Contrast-Reg The ablation study of Contrast-Reg on the GAT and GIN backbone is presented in Table 6. The results demonstrate that Contrast-Reg consistently enhances graph contrastive learning performance across various backbones and graph contrastive learning algorithms. This conclusion is further supported by the comparison of various models, including GCA with GCA+Contrast-Reg in Table 1, GRACE with GRACE+Contrast Reg, as well as ML-GCN with ML-GCN+Contrast-Reg in Table 3. These experiments support that Contrast- Published in Transactions on Machine Learning Research (07/2023) Table 6: GAT/GIN as an encode backbone w/ and w/o Contrast-Reg Algorithm Backbone Contrast-Reg Cora Citeseer Pubmed Wiki Computers ML GAT 76.90 1.20 70.43 0.31 75.05 0.51 69.02 1.89 78.47 2.60 ML GAT 81.56 0.94 72.73 0.44 78.50 0.63 69.41 0.85 80.41 1.96 LC GAT 74.30 1.10 69.30 0.42 72.50 0.49 61.55 1.89 73.27 5.98 LC GAT 80.87 0.96 72.69 0.52 78.50 0.63 69.92 0.80 80.28 2.09 ML GIN 77.78 0.98 70.78 0.68 80.12 0.50 63.50 0.93 70.69 1.82 ML GIN 81.04 0.94 71.17 0.45 80.23 0.69 66.77 0.53 77.85 1.00 LC GIN 80.23 0.61 71.46 0.44 78.46 0.51 61.95 0.81 70.61 1.96 LC GIN 81.38 0.58 71.59 0.36 78.51 0.46 67.14 0.44 76.79 1.91 Reg serves as an effective regularization term for graph contrastive learning algorithms that utilize different similarity definitions and GNN encoder backbones. Furthermore, we have also conducted experiments on a graph autoencoder algorithm, Graph MAE (Hou et al., 2022), which is also graph self-supervised learning. The empirical results presented in Table 1 reveal that when combined with Contrast-Reg, Graph MAE achieves superior empirical performance, thus highlighting the potential benefits of Contrast-Reg for graph autoencoder methods. A comprehensive analysis of the impact of Contrast-Reg on graph autoencoder algorithms is an interesting direction and will be left as future work, which includes the theoretical investigations and empirical evaluations involving a broader range of graph autoencoder algorithms (Hou et al., 2022; 2023; Tan et al., 2023; Li et al., 2023). 7 Conclusions We adapted expected calibration error (ECE) to the graph contrastive learning framework and identified the key factors that influence the discrepancy between the prediction in unsupervised training and the accuracy in downstream tasks. Motivated by ECE, we proposed a novel regularization regularizer, Contrast Reg, to ensure that decreasing the contrastive loss leads to better performance in the downstream tasks. Our theoretical and empirical results both demonstrate the effectiveness of Contrast-Reg in improving the generalizability of the base graph contrastive learning model and achieving superior performance across different existing graph contrastive learning algorithms. Amr Ahmed, Nino Shervashidze, Shravan M. Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. In WWW 13, pp. 37 48, 2013. Yuki M. Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. In ICLR 20, 2020. Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. Mutual information neural estimation. In ICML 18, pp. 530 539, 2018. Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2014. ISBN 978-0-521-83378-3. doi: 10.1017/CBO9780511804441. URL https://web.stanford.edu/%7Eboyd/ cvxbook/. Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-Yan Liu, and Liwei Wang. Graphnorm: A principled approach to accelerating graph neural network training. In ICML 21, volume 139 of Proceedings of Machine Learning Research, pp. 1204 1215. PMLR, 2021. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In CIKM 15, pp. 891 900, 2015. Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV 18, pp. 139 156, 2018. Published in Transactions on Machine Learning Research (07/2023) Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. Harp: Hierarchical representation learning for networks. In (AAAI) 18, (IAAI) 18, and (EAAI) 18, pp. 2127 2134, 2018. Lei Chen. Curse of dimensionality. In Encyclopedia of Database Systems, Second Edition. Springer, 2018. doi: 10.1007/978-1-4614-8265-9\_133. URL https://doi.org/10.1007/978-1-4614-8265-9_133. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. Co RR, abs/2002.05709, 2020. Zhijie Deng, Yinpeng Dong, and Jun Zhu. Batch virtual adversarial training for graph convolutional networks. Co RR, abs/1902.09192, 2019. Ming Ding, Jie Tang, and Jie Zhang. Semi-supervised learning on graphs with generative adversarial nets. In CIKM 18, pp. 913 922, 2018. Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embeddings via diffusion wavelets. In KDD 18, pp. 1320 1329, 2018. Chris Dyer. Notes on noise contrastive estimation and negative sampling. Co RR, abs/1410.8251, 2014. URL http://arxiv.org/abs/1410.8251. Fuli Feng, Xiangnan He, Jie Tang, and Tat-Seng Chua. Graph adversarial training: Dynamically regularizing based on graph structure. Co RR, abs/1902.08226, 2019. Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. Co RR, abs/1903.02428, 2019. Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In ICML 17, 2017. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD 16, pp. 855 864, 2016. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh (eds.), ICML 17, volume 70 of Proceedings of Machine Learning Research, pp. 1321 1330. PMLR, 2017. Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13:307 361, 2012. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Neur IPS 17, pp. 1024 1034, 2017. Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. G-mixup: Graph data augmentation for graph classification. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), ICML 22, volume 162 of Proceedings of Machine Learning Research, pp. 8230 8248. PMLR, 2022. Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. Data-efficient image recognition with contrastive predictive coding. Co RR, abs/1905.09272, 2019. R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR 19, 2019. Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. Graphmae: Self-supervised masked graph autoencoders. In KDD 22, 2022. Zhenyu Hou, Yufei He, Yukuo Cen, Xiao Liu, Yuxiao Dong, Evgeny Kharlamov, and Jie Tang. Graphmae2: A decoding-enhanced masked self-supervised graph learner. In WWW 23, 2023. Published in Transactions on Machine Learning Research (07/2023) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Neur IPS 20, 2020a. Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S. Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In ICLR 20, 2020b. Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In ICML 19, pp. 2849 2858, 2019. Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR 17, 2017. Xiang Li, Tiandi Ye, Caihua Shan, Dongsheng Li, and Ming Gao. Seegera: Self-supervised semi-implicit graph variational auto-encoders with masking. In WWW 23, 2023. Andreas Maurer. A vector-contraction inequality for rademacher complexities. In ALT 16, pp. 3 17, 2016. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Neur IPS 13, pp. 3111 3119, 2013. Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML 12, 2012. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2nd edition, 2018. ISBN 0262039400. Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. When does label smoothing help? In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Neur IPS 19, pp. 4696 4705, 2019. Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI 15, pp. 2901 2907. AAAI Press, 2015. Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, and Lilian Weng. Text and code embeddings by contrastive pre-training. Co RR, abs/2201.10005, 2022. URL https://arxiv.org/abs/2201.10005. M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577 8582, 2006. Zhen Peng, Wenbing Huang, Minnan Luo, Qinghua Zheng, Yu Rong, Tingyang Xu, and Junzhou Huang. Graph representation learning via graphical mutual information maximization. In WWW 20, pp. 259 270, 2020. Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. In ICLR 17, 2017. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In KDD 14, pp. 701 710, 2014. Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM 2018, pp. 459 467. ACM, 2018. Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. GCC: graph contrastive coding for graph neural network pre-training. In KDD 20, pp. 1150 1160, 2020. Published in Transactions on Machine Learning Research (07/2023) Meng Qu, Yoshua Bengio, and Jian Tang. GMNN: graph markov neural networks. In ICML 19, volume 97, pp. 5241 5250, 2019. Leonardo Filipe Rodrigues Ribeiro, Pedro H. P. Saverese, and Daniel R. Figueiredo. struc2vec: Learning node representations from structural identity. In KDD 17, pp. 385 394, 2017. Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML 19, pp. 5628 5637, 2019. Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. Co RR, abs/1811.05868, 2018. William Shiao, Zhichun Guo, Tong Zhao, Evangelos E. Papalexakis, Yozen Liu, and Neil Shah. Link prediction with non-contrastive learning. Co RR, abs/2211.14394, 2022. doi: 10.48550/ar Xiv.2211.14394. URL https://doi.org/10.48550/ar Xiv.2211.14394. Fan-Yun Sun, Meng Qu, Jordan Hoffmann, Chin-Wei Huang, and Jian Tang. vgraph: A generative model for joint community detection and node representation learning. In Neur IPS 19, pp. 512 522, 2019. Yiyou Sun and Yixuan Li. Opencon: Open-world contrastive learning. TMLR, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=2w WJxtp Fer. Susheel Suresh, Pan Li, Cong Hao, and Jennifer Neville. Adversarial graph augmentation to improve graph contrastive learning. In Neur IPS 21, 2021. URL https://openreview.net/forum?id=ioyq7Ns R1KJ. Qiaoyu Tan, Ninghao Liu, Xiao Huang, Soo-Hyun Choi, Li Li, Rui Chen, and Xia Hu. S2GAE: self-supervised graph autoencoders are generalizable learners with graph masking. In WSDM 23, 2023. Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: large-scale information network embedding. In WWW 15, pp. 1067 1077, 2015. Josephine Thomas, Alice Moallemy-Oureh, Silvia Beddar-Wiesing, and Clara Holzhüter. Graph neural networks designed for different graph types: A survey. TMLR, 2023. ISSN 2835-8856. URL https: //openreview.net/forum?id=h4BYt Z79uy. Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. Co RR, abs/1906.05849, 2019. Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. Co RR, abs/2005.10243, 2020. Zekun Tong, Yuxuan Liang, Henghui Ding, Yongxing Dai, Xinke Li, and Changhu Wang. Directed graph contrastive learning. In Neur IPS 21, 2021. URL https://openreview.net/forum?id=s6JD_x BS31. Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Co RR, abs/1807.03748, 2018. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR 18, 2018. Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In ICLR 19, 2019. Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. ar Xiv preprint ar Xiv:1909.01315, 2019. Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance-level discrimination. Co RR, abs/1805.01978, 2018. Published in Transactions on Machine Learning Research (07/2023) Rongkai Xia, Yan Pan, Lei Du, and Jian Yin. Robust multi-view spectral clustering via low-rank and sparse decomposition. In AAAI 14, pp. 2149 2155, 2014. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR 19, 2019. Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y. Chang. Network representation learning with rich text information. In IJCAI 15, pp. 2111 2117, 2015. Han Yang, Kaili Ma, and James Cheng. Rethinking graph regularization for graph neural networks. In AAAI 21, pp. 4573 4581. AAAI Press, 2021. Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Understanding negative sampling in graph representation learning. In KDD 20, pp. 1666 1676, 2020. Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML 16, volume 48 of JMLR Workshop and Conference Proceedings, pp. 40 48, 2016. Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Neur IPS 20, 2020. Liang Zeng, Jin Xu, Zijun Yao, Yanqiao Zhu, and Jian Li. Graph symbiosis learning. Co RR, abs/2106.05455, 2021. URL https://arxiv.org/abs/2106.05455. Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR 18, 2018. Xiaotong Zhang, Han Liu, Qimai Li, and Xiao-Ming Wu. Attributed graph clustering via adaptive graph convolution. In IJCAI 19, pp. 4327 4333, 2019. Han Zhao, Xu Yang, Zhenru Wang, Erkun Yang, and Cheng Deng. Graph debiased contrastive learning with joint representation clustering. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 3434 3440. IJCAI 21, 8 2021. doi: 10.24963/ijcai.2021/473. URL https://doi.org/10.24963/ijcai.2021/473. Main Track. Kaixiong Zhou, Ninghao Liu, Fan Yang, Zirui Liu, Rui Chen, Li Li, Soo-Hyun Choi, and Xia Hu. Adaptive label smoothing to regularize large-scale graph training. Co RR, abs/2108.13555, 2021. Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep Graph Contrastive Representation Learning. In ICML Workshop on Graph Representation Learning and Beyond, 2020. Yanqiao Zhu, Yichen Xu, Qiang Liu, and Shu Wu. An empirical study of graph contrastive learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021a. URL https://openreview.net/forum?id=Uu Ub IYn HKO. Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In WWW 21, 2021b. doi: 10.1145/3442381.3449802. Published in Transactions on Machine Learning Research (07/2023) Table 7: Impact of Contrast-Reg on ECE Value Contrast-Reg Cora Citeseer Pubmed ML 0.477 0.540 0.399 ML 0.413 0.525 0.273 LC 0.477 0.537 0.416 LC 0.437 0.524 0.274 A Details of the analysis A.1 A through analysis of Claim 4.1 Considering that positive sampling is based on the calculated distance between pairwise embeddings, we can express the following relationship: Eacc(v,v +)=1[p(v, v +)] = Eacc(v,v +)=0[p(v, v +)] = Etopk σ(hv,hv )[σ(hv hv )] (9) Here, Etopk σ(hv hv )[σ(hv hv )] represents the expectation of the top k pairs (v, v ). When the loss converges, Etopk σ(hv hv )[σ(hv hv )] 1. Furthermore, by considering the definition in Equation 4 and the fact that negative samples are uniformly sampled, we can write: Eacc(v,v )=0[p(v, v )] = Eacc(v,v )=1[p(v, v )] = 1 E(v,v )[σ(v, v )]. (10) Thus, Equation 6 can be reformulated as: ECE = r+(1 2Etopk σ(hv hv )[σ(hv hv )])q+ + r (1 2q )E[σ(hv hv )] + r q + r+Etopk σ(hv hv )[σ(hv h v)]. (11) When the loss converges, Etopk σ(hv hv )[σ(hv h v)] 1, indicating that the ECE value is negatively correlated with the probability q+ of v +. In addition, negative samples are uniformly sampled, so that q = 1/K, where K represents the number of classes when K > 2. As a result, ECE is positively correlated with the expectation of the confidence value for randomly sampled pairs E(v,v )[σ(hv, hv )]. A.2 Contrast-Reg Leads to a Decreased ECE Value We empirically investigate the impact of loss convergence on the ECE value, with and without Contrast-Reg, across various datasets and contrastive strategies, in Table 7. The results demonstrate that Contrast-Reg leads to a decrease in ECE values for all tested datasets and contrastive losses. This reduction in ECE indicates that Contrast-Reg promotes better alignment with the performance of downstream tasks while minimizing the training loss, ensuring that minimizing the contrastive loss with Contrast-Reg results in high-quality representations. A.3 Explanation of Contrast-Reg s Impact on the Term s(f) In this section, we provide an explanation for why Contrast-Reg leads to a decrease in the upper bound of s(f). Firstly, we will prove that minimizing Eq. (7) results in a decrease in Var( hi ) when h T i Wr > c, as stated in Theorem 2. Then, based on the assumption that models with lower Var( hi ) inherently favor lower values of E[ hi ], both Evi Dc[ hi 2] and σmax(M(f, c)) will decrease. Consequently, the upper bound of s(f) decreases with the implementation of Contrast-Reg. In the subsequent proof, hi = f(x), and the notations hi and f(x) may be used interchangeably for the sake of presentation clarity. Theorem 2. Minimizing Eq. (7) induces the decrease in Var( hi ) when h T i Wr > c. Proof. We minimize Lreg by gradient descent with learning rate β. f(x)Lreg = σ( f(x)T Wr)Wr (12) Published in Transactions on Machine Learning Research (07/2023) The embedding of f(x) is updated as the following after adding Lreg: f(x) f(x) + β σ( f(x)T Wr)Wr (13) Eq. (13) shows that in every optimization step, f(x) extends by βσ( f(x)T Wr) Wr along r0 := W r W r . If we do orthogonal decomposition for f(x) along r0 and its unit orthogonal hyperplain Π(r0), f(x) = f(x)T r0 r0 + f(x)T Π(r0) Π(r0). Thus we have (f(x)T r0)2 + (f(x)T Π(r0))2. (14) The projection of f(x) along r0 is f(x)T r0 = f(x)T W r W r , while the projection of f(x) plus the Contrast-Reg update along r0 is f(x)T r0 reg = f(x)T Wr Wr + β 1 + ef(x)T W r Wr . Note that f(x)T Π(r0) reg = f(x)T Π(r0). Based on Lemma 1 and Eq. (14), when β Wr 2 1 and f(x)T Wr > 1.5, we have Var (f(x))reg < Var ( f(x) ) . (15) Lemma 1. For a random variable X [1.5, + ), a constant τ (0, 1] and a constant c2, we have (X + τ 1 + e X )2 + c2 < Var p X2 + c2 . (16) Proof. First, we consider (x + τ 1 + ex )2 + c2 p where h(x) is strictly decreasing in [x0, + ) and strictly increasing in ( , x0], and x0 is the solution of h (x) = dh(x) dx = 0. Thus, we can approximate the range of x0 (0, 1.5) by the fact that h (0)h (1.5) < 0 for all τ and c2. Thus, for x > y 1.5, (x + τ 1 + ex )2 + c2 p x2 + c2 < r (y + τ 1 + ey )2 + c2 p and since (x + τ 1+ex ) is monotonically increasing, we get (x + τ 1 + ex )2 + c2 r (y + τ 1 + ey )2 + c2 < p When y > x 1.5, y2 + c2 < r (x + τ 1 + ex )2 + c2 r (y + τ 1 + ey )2 + c2 < 0. Published in Transactions on Machine Learning Research (07/2023) Further, we assume that X and Y are i.i.d. random variables sampled from [1.5, + ), (X + τ 1 + e X )2 + c2 (X + τ 1 + e X )2 + c2 r (Y + τ 1 + e Y )2 + c2 2# (x + τ 1 + ex )2 + c2 r (y + τ 1 + ey )2 + c2 2 p(x)p(y)dxdy y2 + c2 2 p(x)p(y)dxdy Remark 1. β Wr 2 1, which is the condition of Eq. (15) , is not difficult to satisfy, since the magnitude of r could be tuned. In practice, r (0, 1] can fit in all our experiments. Remark 2. The range of f(x)T wr in Theorem 2 is not a tight bound for x0 in Lemma 1. Since when Eq. (7) converges, f(x)T Wr is much larger than 1.5 for almost all the samples empirically, we prove the case for f(x)T wr [1.5, + ). A.4 Comprehensive Explanation of Theorem 1: Notations and Proof To formally analyze the behavior of contrastive learning, we introduce the following concepts as the previous literature (Saunshi et al., 2019) does. Latent classes: Data are considered as drawn from latent classes C with distribution ρ. Further, distribution Dc is defined over feature space X that is associated with a class c C to measure the relevance between x and c. Semantic similarity: Positive samples are drawn from the same latent classes, with distribution Dsim(x, x+) = Ec ρ Dc(x)Dc(x+) , (17) while negative samples are drawn randomly from all possible data points, i.e., the marginal of Dsim, as Dneg(x ) = Ec ρ Dc(x ) (18) Supervised tasks: Denote K as the number of negative samples. The object of the supervised task, i.e., feature-label pair (x, c), is sampled from DT (x, c) = Dc(x)DT (c), where DT (c) = ρ(c|c T ), and T C with |T | = K + 1. Mean classifier W µ is naturally imposed to bridge the gap between the representation learning performance and linear separability of learn representations, as W µ c := µc = Ex Dc[f(x)]. Empirical Rademacher complexity: Suppose F : X [1, 0]. Given a sample S, RS(F) = E e sup f F e T f(S) where e = (e1, , em)T , with ei are independent random variables taking values uniformly from { 1, +1}. Published in Transactions on Machine Learning Research (07/2023) In addition, the theoretical framework (Saunshi et al., 2019) makes an assumption: encoder f is bounded, i.e., maxx X f(x) R2, R R. To prove Theorem 1, we first list some key lemmas. Lemma 2. For all f F, Lµ sup(f) 1 1 τ (Lnce(f) 2τ log 2). (19) This bound connects contrastive representation learning algorithms and its supervised counterpart. This lemma is achieved by Jensen s inequality. The details are given in Appendix A.6. Lemma 3. With probability at least 1 δ over the set S, for all f F, Lnce( ˆf) Lnce(f) + Gen M. (20) This bound guarantees that the chosen ˆf = arg minf F Lµ nce cannot be too much worse than f = arg minf F Lnce. The proof applies Rademacher complexity of the function class (Mohri et al., 2018) and vector-contraction inequality (Maurer, 2016). More details are given in Appendix A.7. Lemma 4. L= nce(f) 4s(f) + 2 log 2. This bound is derived by the loss caused by both positive and negative pairs that come from the same class, i.e., class collision. The proof uses Bernoulli s inequality (details in Appendix A.8). Proof to Theorem 1. Combining Lemma 2 and Lemma 3, we obtain with probability at least 1 δ over the set S, for all f F, Lµ sup( ˆf) 1 1 τ (Lnce(f) + Gen M) (21) Then, we decompose Lnce = τL= nce(f) + (1 τ)L = nce(f), apply Lemma 4 to Eq. (21), and obtain the result of Theorem 1 A.5 Contrastive Learning with NCEloss The contrastive loss is Lun := E (x,x+) Dsim, (x 1 , ,x K) Dneg ℓ({f(x)T (f(x+) f(x i ))}K i=1) , where ℓcan be the hinge loss as ℓ(v) = max {0, 1 + maxi { vi}} or the logistic loss as ℓ(v) = log2 (1 + P i exp( vi)). And its supervised counterpart is defined as Lµ sup := E (x,c) DT (x,c) h ℓ f(x)T µc f(x)T µc A more powerful loss function, NCEloss, used in the previous literature (Velickovic et al., 2019; Yang et al., 2020; Mnih & Teh, 2012; Dyer, 2014), can be framed as E (x,x+) Dsim, (x 1 , ,x K) Dneg logσ(f(x)T f(x+))+ k=1 logσ( f(x)T f(x k )) and its empirical counterpart with M samples xi,x+ i ,x i1, ,x i K M i=1 is given as log σ(f(xi)T f(x+ i ))+ k=1 log σ( f(xi)T f(x ij)) where σ( ) is the sigmoid function. For its supervised counterpart, it is exactly the cross entropy loss for the (K+1)-way multi-class classification task: Lµ sup := E (x,c) DT (x,c) log σ(f(x)T µc)+log σ( f(x)T µc )|c =c . (24) Published in Transactions on Machine Learning Research (07/2023) A.6 Proof of Lemma 2 First, we prove that ℓ(f(x+), {f(x i )}) = (log σ(f(x)T f(x+)) + PK i=1 log(σ(f(x)T f(x ))) is convex w.r.t. f(x+), f(x i ), , f(x K). Consider that ℓ1(z) = log σ(z) and ℓ2(z) = log σ( z) are both convex functions since ℓ 1 > 0 and ℓ 2 > 0 for z R. Given f(x) R, z+ = f(x)T f(x+) and z = f(x)T f(x+) are affine transformation w.r.t. f(x+) and f(x ). Thus, when f(x) is fixed, ℓ1(f(x+)) = log σ(f(x)T f(x+)) and ℓ2(f(x )) = log σ( f(x)T f(x )) are convex functions. As ℓ1 > 0 and ℓ2 > 0, we obtain ℓ(f(x+), {f(x i )}) = (log σ(f(x)T f(x+))+PK i=1 log(σ(f(x)T f(x )))) is convex since non-negative weighted sums preserve convexity (Boyd & Vandenberghe, 2014). By the definition of convexity, Lnce(f) = Ec+,c ρ2; x Dc+ Ex+ Dc+ ; ℓ(f(x+), {f(x i )}) Ec+,c ρ2Ex Dc+ ℓ(f(x)T {µc+, µc )} = (1 τ)Lµ sup(f) + τEc+ ρEx Dc+ log σ(f(x)T µc+) log σ( f(x)T µc+) (1 τ)Lµ sup(f) + 2τ log 2 A.7 Generalization bound Denote F = f xi, x+ i , x i1, , x i K = f(xi), f(x+ i ), f(x i1), , f(x i K) |f F . Let q f = h f, and its function class, Q = q = h f| f F . Denote zi = xi, x+ i , x i1, , x i K , suppose ℓis bounded by B, then we can decompose h = 1 B ℓ ϕ. Then we have q f(zi) = 1 B ℓ(ϕ( f(zi))), where ϕ( f(zi)) = t=1 f(xi)tf(x+ i0)t, t=1 f(xi)tf(x i1)t, , t=1 f(xi)tf(x i K)t log σ(x0) + i=1 log σ( xi)) From Eq. (25), we know that ϕ : R(K+2)d RK+1. Then we will prove that h is L-Lipschitz by proving that ϕ and ℓare both Lipschitz continuity. First, f(xi)t = f(xik)t = f(x+ i0)t, k = 0 f(x ik)t, k = 1, , K f(x+ i0)t = f(xi)t, ϕ( f(zi)) f(x ik)t = f(xi)t. If we assume Pd t=1 f(xik)2 t R2 and Pd t=1 f(xi)2 t R2, t=1 f(x+ i0)2 t + t=1 f(x ik)2 t + (K + 1) t=1 f(xi)2 t 2(K + 1)R2 = p Published in Transactions on Machine Learning Research (07/2023) Algorithm 2: ML Parameter: Parameters of an (additional) GNN layer g. 1 Function Contrast(C, G, X, f): 2 Let g(xi) be the representation of xi by stacking g upon f; 3 Randomly pick a negative node x i from V for each xi C; 4 return (g(xi), f(xi), f(x i )) 6 Function Seed Select(G, X, f, epoch): 7 return V; Combining J 2 J F , we obtain that ϕ is p 2(K + 1)R-Lipschitz. Similarly, ℓis K + 1-Lipschitz. Since we assume that the inner product of embedding is no more than R2. Thus, l is bounded by B = (K + 1) log(σ( R2)). Above all, h is L-Lipschitz with L = B . Applying vector-contraction inequality (Maurer, 2016), we have Eσ { 1}M [sup f F σ, (h f)|S ] 2LEσ { 1}(K+1)d M [sup f F σ, f|S ]. If we write it in Rademacher complexity manner, we have RS(Q) 2(K + 1)R Applying generalization bounds based on Rademacher complexity (Mohri et al., 2018) to q Q. For any δ > 0, with the probability of at least 1 δ i=1 q(zi) + 2RS(Q) i=1 q(zi) + 4(K + 1)RRS(F) Thus for any f, Lnce(f) Lnce(f) + 4(K + 1)RRS(F) δ 2M . (26) Let ˆf = arg minf F Lnce(f) and f = arg minf F Lnce(f). By Hoeffding s inequality, with probability of 1 δ Lnce(f ) Lnce(f ) + B Substituting ˆf into Eq. (26), combining Lnce( ˆf) Lnce(f ) and applying union bound, with probability of at most δ Lnce( ˆf) Lnce( ˆf) + 4(K + 1)RRS(F) Lnce(f ) + 4(K + 1)RRS(F) Lnce(f) + 4(K + 1)RRS(F) M 4(K + 1) log(σ( R2)) fails. Thus, with probability of at least 1 δ, Eq. (28) holds. Published in Transactions on Machine Learning Research (07/2023) A.8 Class collision loss Let pi = |f(x)T f(xi)| and p = maxi {0,1, ,K} pi. Considering L= nce(f) = E log σ(f(x)T f(x+ 0 )) + i=1 log σ( f(x)T f(x i ))) log(1 + e f(x)T f(x+ 0 )) + i=1 log(1 + ef(x)T f(x i )) (K + 1)E [log(1 + ep)] (K + 1) log 2 + (K + 1)E [p] Since x, x+ 0 , x 1 , , x K are sampled i.i.d. from the same class, E[p] = Z P[p x]dx = Z (1 (1 P[p0 x])K+1dx. (30) Applying Bernoulli s inequality, we have E[p] Z (1 (1 (K + 1)P[p0 x]))dx = Z (K + 1)P[p0 x]dx = (K + 1)E[p0] = (K + 1)E[|f(x)T f(x+ 0 )|] E[(f(x)T f(x+ 0 ))2]. Therefore, L= nce(f) (K + 1) log 2 + (K + 1)2s(f) (32) Published in Transactions on Machine Learning Research (07/2023) Algorithm 3: LC Hyperparameter: R: curriculum update epochs; k: the number of candidate positive samples for seed node; 1 Function Contrast(C, G, X, f): 2 For xi C, let N + i be the set of k nodes in {xj Neighbor(xi)} with largest f(xi)T f(xj); 3 Randomly pick one positive node x+ i from N + i for each xi C; 4 Randomly pick one negative node x i from V for each xi C; 5 return (f(xi), f(x+ i ), f(x i )) 7 Function Seed Select(G, X, f, epoch): 8 if epoch % R = 1 then 9 return the same set of seed nodes C as in the last epoch ; 11 pi,j f(i)T f(j) P k V f(i)T f(k) for i, j V; j V(pi,j log (pi,j)) for i V; 13 return ( epoch e |G| nodes with smallest H; Algorithm 4: GCA Hyperparameter: two stochastic augmentation functions set T and T ; 1 Function Contrast(C, G, X, f): 2 Sample two stochastic augmentation functions t T and t T ; 3 Generate two graph views G1 = t(G) and G2 = t (G) by performing corruption on G; 4 return (f( G1(xi)), f( G2(xi)), f( G2(xj)) xi,xi G,xj =xi; 6 Function Seed Select(G, X, f, epoch): 7 return V; B Details of Graph Contrastive Algorithms Algorithm 1 presents the graph contrastive learning framework and how Contrast-Reg is plugged into it. The functions Seed Select and Contrast in the graph contrastive learning framework are responsible for selecting the seed nodes and sampling the positive/negative pairs for these seeds while incorporating various priors for both structural and attribute aspects of the graph. he detailed Seed Select and Contrast implementations for ML, LC, GCA are referred to Algorithm 2, 3 and 4. C Supplementary Experiments C.1 Plug Contrast-Reg into Di GCL and GDCL We applied Contrast-Reg to two graph contrastive learning algorithms, namely Di GCL (Tong et al., 2021) and GDCL (Zhao et al., 2021). GDCL employs graph clustering to reduce the number of false-negative samples, whereas Di GCL generates contrastive pairs using Laplacian perturbations to better preserve the structural attributes of directed graphs. In Table 8, we present a comparative analysis of the performance of Contrast-Reg, applied to both algorithms, as well as their performance without Contrast-Reg. We used the official implementations and the parameters of both algorithms, and conducted experiments with random seeds ranging from 0 to 9. Our results demonstrate that Contrast-Reg serves as an effective plugin to these advanced graph contrastive learning algorithms, even when applied to directed graphs (Di GCL). Published in Transactions on Machine Learning Research (07/2023) Table 8: Di GCL & GDCL with and without Contrast-Reg Algorithm Contrast-Reg Cora-ml Citeseer Cora Di GCL 76.62 1.23 65.54 1.39 Di GCL 79.74 0.62 68.74 1.13 GDCL 72.98 0.73 85.41 0.53 GDCL 76.39 1.16 85.80 0.46 Table 9: Datasets Dataset Node # Edge # Feature # Class # Cora (Yang et al., 2016) 2,708 5,429 1,433 7 Citeseer (Yang et al., 2016) 3,327 4,732 3,703 6 Pubmed (Yang et al., 2016) 19,717 44,338 500 3 ogbn-arxiv (Hu et al., 2020a) 169,343 1,166,243 128 40 Wiki (Yang et al., 2015) 2,405 17,981 4,973 3 Computers (Shchur et al., 2018) 13,381 245,778 767 10 Photo (Shchur et al., 2018) 7,487 119,043 745 8 ogbn-products (Hu et al., 2020a) 2,449,029 61,859,140 100 47 Reddit (Hamilton et al., 2017) 232,965 114,615,892 602 41 D Experiment details Dataset statistics The datasets we employed encompass citation networks, web graphs, co-purchase networks, and social networks. The detailed dataset statistics are shown in Table 9. Hardware Configuration: The experiments are conducted on Linux servers installed with an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 256GB RAM and 8 NVIDIA 2080Ti GPUs. Software Configuration: Our models, as well as the DGI, GMI and GCN baselines, were implemented in Py Torch Geometric (Fey & Lenssen, 2019) version 1.4.3, DGL (Wang et al., 2019) version 0.5.1 with CUDA version 10.2, scikit-learn version 0.23.1 and Python 3.6. Our codes and datasets will be made available. Hyper-parameters: For full batch training, we used 1-layer GCN as the encoder with prelu activation, for mini-batch training, we used a 3-layer GCN with prelu activation. We conducted grid search of different learning rate (from 1e-2, 5e-3, 3e-3, 1e-3, 5e-4, 3e-4, 1e-4) and curriculum settings (including learning rate decay and curriculum rounds) on the fullbatch version. We used 1e-3 or 5e-4 as the learning rate; 10,10,15 or 10,10,25 as the fanouts and 1024 or 512 as the batch size for mini-batch training.