# Graph Contrastive Learning with Reinforcement Augmentation

Ziyang Liu, Chaokun Wang, Cheng Wu
School of Software, BNRist, Tsinghua University, Beijing, China
{liu-zy21, wuc22}@mails.tsinghua.edu.cn, chaokun@tsinghua.edu.cn

Graph contrastive learning (GCL), which designs contrastive objectives to learn embeddings from augmented graphs, has become a prevailing method for extracting embeddings from graphs in an unsupervised manner. As an important procedure in GCL, graph data augmentation (GDA) directly affects model performance on downstream tasks. Current GCL methods typically treat the GDA in each epoch as an independent event, neglecting its continuity. In this paper, we regard the GDA in GCL as a Markov decision process and propose a novel graph reinforcement augmentation framework for GCL. Based on this framework, we design a Graph Advantage Actor-Critic (GA2C) model. We conduct extensive experiments to evaluate GA2C on unsupervised learning, transfer learning, and semi-supervised learning. The experimental results demonstrate the performance superiority of GA2C over the state-of-the-art GCL models. Furthermore, we verify that GA2C is more efficient than the other GCL methods with learnable GDA, and we provide two examples of chemical molecular graphs from ZINC-2M to demonstrate that GA2C generates meaningful augmented views, where the edge weights reflect the importance of chemical bonds in the molecule.

## 1 Introduction

Graph representation learning aims to extract low-dimensional representation vectors from graph data by supervised or unsupervised learning [Cui et al., 2019; Yang et al., 2023; Liu et al., 2022]. These representation vectors encode the abundant structural and semantic information of the graph. Graph representation learning has become an effective technique for graph mining and is applied in multiple fields, including bioinformatics [Ang et al., 2021; You et al., 2020], social networks [Wang et al., 2018; Wu et al., 2023], and recommender systems [Yu et al., 2022; Xu et al., 2023]. Due to the sparsity of labeling information in the real world, learning representations from graphs in a self-supervised manner has become particularly crucial. Furthermore, as an important member of self-supervised representation learning on graphs, graph contrastive learning (GCL) has achieved superior performance on a series of downstream tasks such as graph classification [Suresh et al., 2021; Yin et al., 2022; Liu et al., 2023].

As a necessary procedure in GCL, graph data augmentation (GDA) generally defines augmented views based on the original graph and directly affects model performance on downstream tasks. Common GDA strategies include edge removing, edge perturbation, attribute masking, and so on (a minimal sketch of two of these strategies follows this paragraph). Prior work designs three types of GDA in total: trial-and-error methods, precomputed methods, and adversarial methods. In the trial-and-error methods [Zhu et al., 2020; Thakoor et al., 2021] and precomputed methods [Zhu et al., 2021], GDA is frozen throughout training. In the adversarial methods [Suresh et al., 2021; You et al., 2022], GDA is learnable during training and its parameters are updated by a pre-designed view learner. In addition, some GCL models leverage augmentation-free methods, such as encoder perturbation [Xia et al., 2022], to construct self-supervised signals instead of GDA.
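To make these strategies concrete, here is a minimal, illustrative sketch of edge removing and attribute masking; it is not taken from the paper. It assumes the common $2 \times |E|$ edge-index representation (as used, e.g., in PyTorch Geometric), and the function names are ours.

```python
import torch

def remove_edges(edge_index: torch.Tensor, drop_ratio: float = 0.2) -> torch.Tensor:
    """Edge removing: drop each edge independently with probability drop_ratio.

    edge_index is a [2, |E|] integer tensor; an illustrative sketch, not the
    paper's implementation.
    """
    keep_mask = torch.rand(edge_index.size(1)) >= drop_ratio
    return edge_index[:, keep_mask]

def mask_attributes(x: torch.Tensor, mask_ratio: float = 0.2) -> torch.Tensor:
    """Attribute masking: zero out each feature dimension with probability mask_ratio.

    x is an [N, M] node-attribute matrix; the sampled mask broadcasts over nodes.
    """
    mask = (torch.rand(x.size(1)) >= mask_ratio).float()
    return x * mask
```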
**Motivation 1: Evolvability of augmented views.** Currently, GCL with learnable augmentation has unleashed the potential of contrastive learning and achieved state-of-the-art performance on downstream tasks [Suresh et al., 2021; Yin et al., 2022]. A new question then arises: How should a good augmented view evolve to promote the performance of GCL? We argue that a good augmented view should maintain the characteristics of progressive evolution, akin to the step-by-step learning progression in human cognition, where new information is easier to comprehend and accept when it connects to what was learned the day before. This corresponds to a Markov decision process, i.e., the augmented view of the current epoch is affected only by that of the last epoch.

**Motivation 2: Preservation of original graph structure information.** In specific graph structures, GDA strategies like edge removing or node deletion can disrupt the fundamental structure information of the original graph. For instance, in molecular property prediction, removing atoms or chemical bonds destroys the basic structure of the chemical molecule, or even alters the types of functional groups [Xia et al., 2022; Wang et al., 2022a]. Some existing GCL models use augmentation-free methods like encoder perturbation to address this issue and have achieved commendable results. Considering the difficulty of determining an appropriate encoder perturbation ratio (e.g., inappropriate perturbations can lead to performance degradation [Xia et al., 2022]), we cannot help but wonder: Is there a scheme that preserves the fundamental structure information of the original graph without perturbing the encoder?

In this study, we propose reinforced GDA, where the view learner actively adjusts GDA parameters based on the previous state (Motivation 1). The view learner in GDA is regarded as an agent that interacts with the encoder network. Moreover, the view learner learns the weights of edges in the graph without deleting any node, edge, or attribute. As a result, the augmented view completely preserves the original graph structure information (Motivation 2).

Applying reinforcement learning to GDA is non-trivial because of two primary challenges: firstly, the existing reinforcement learning methods on graphs are not adequately suited for GDA [Ju et al., 2023; Munikoti et al., 2023]; secondly, how to design an effective reward function for graph-based environments remains unclear. To solve these two challenges, we propose a novel graph reinforcement augmentation (GRA) framework for GCL and design a Graph Advantage Actor-Critic (GA2C) model based on this framework. In GA2C, the joint use of the Actor submodel and Critic submodel contributes to the progressive evolution of the augmented view. Also, the reward function in GA2C is designed as the negative mutual information between two augmented graph embeddings, reducing the redundant information between the two views (a sketch of this reward follows the contribution list below).

The main contributions of this work are as follows:

- We propose the GRA framework, a novel type of GDA for GCL, based on reinforcement learning. This framework formulates a Markov decision process for GDA and preserves the original graph structure information (Section 4.1).
- We design the GA2C model, as an instantiation of the GRA framework, to achieve continuous and learnable graph data augmentation (Section 4.2).
- On 13 public datasets, we demonstrate that GA2C outperforms the state-of-the-art GCL models in unsupervised, transfer, and semi-supervised learning (Section 5).
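Since the GA2C reward is defined only verbally here (negative mutual information between the two augmented graph embeddings), the sketch below approximates MI with the standard InfoNCE lower bound $I \ge \log n - \mathcal{L}_{\mathrm{NCE}}$; the estimator choice is our assumption, not necessarily the paper's.

```python
import math
import torch
import torch.nn.functional as F

def reward(z: torch.Tensor, z_aug: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Reward = negative mutual information between two batches of view embeddings.

    ASSUMPTION: MI is estimated via the InfoNCE lower bound log(n) - L_NCE;
    the paper only states that the reward is negative MI. z, z_aug: [n, D].
    """
    z, z_aug = F.normalize(z, dim=-1), F.normalize(z_aug, dim=-1)
    logits = z @ z_aug.t() / tau                     # pairwise similarities
    nce = F.cross_entropy(logits, torch.arange(z.size(0)))
    mi_lower_bound = math.log(z.size(0)) - nce       # InfoNCE bound on MI
    return -mi_lower_bound                           # less shared info -> higher reward
```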
## 2 Related Work

**Graph contrastive learning with frozen GDA parameters.** Traditional GCL models like GraphCL [You et al., 2020] and InfoGraph [Sun et al., 2020] adopt frozen GDA parameters to generate augmented views. Generally, these models set GDA parameters empirically by trial and error. Different from the trial-and-error methods, GCA [Zhu et al., 2021] adopts precomputed methods, i.e., it calculates GDA parameters based on centrality indicators of the graph (e.g., degree centrality) and uses these precomputed parameters to augment the graph. However, frozen GDA parameters carry the risk that redundant information is captured under the InfoMax principle [Tschannen et al., 2020]. This redundant information hinders the optimal performance of GCL on downstream tasks.

*Figure 1: Framework of GCL with frozen GDA parameters (the original figure distinguishes forward propagation, backward propagation, frozen modules, and learnable modules).*

**Graph contrastive learning with learnable augmentation.** To reduce the negative impact of redundant information, learnable augmentation is introduced into GCL, i.e., the parameters of GDA are learned automatically during training. For example, AD-GCL [Suresh et al., 2021] adopts adversarial training: the contrastive optimization aims to i) maximize the correspondence between the embeddings of different views when fixing GDA and updating the network encoding, and ii) minimize that correspondence when fixing the network encoding and updating GDA. Also, inspired by image manifolds, the study in [You et al., 2022] extends the frozen GDA parameters to learnable ones and leverages both the information-minimization and information-bottleneck principles to regularize the learned GDA parameters. JOAO [You et al., 2021] is an automated augmentation-selection framework for GraphCL. As an enhanced version of JOAO, JOAOv2 [You et al., 2021] utilizes an augmentation-aware projection head to counter training distribution distortions caused by aggressive augmentations. The key difference between the above methods and ours is that our method captures the evolutionary aspect of augmented views and preserves the original graph structure information.

**Augmentation-free graph contrastive learning.** In the literature, augmentation-free GCL models learn embeddings from graphs without using GDA. These models typically leverage encoder perturbation or embedding perturbation instead of GDA to construct contrast pairs. For example, SimGRACE [Xia et al., 2022] takes the raw graph as input and uses a graph neural network (GNN) encoder [Yang et al., 2022; Yang et al., 2023] together with its perturbed version to generate two augmented views for contrast. AF-GCL [Wang et al., 2022a] optimizes high-dimensional embedding distances using aggregated node features for positive and negative pair construction. SimGCL [Yu et al., 2022] creates contrastive views by adding noise to the embeddings obtained by GNNs. Both the aforementioned methods and our method preserve the original graph structure information for contrastive training. The key difference is that our method does not require designing perturbed versions of encoders or embeddings, for which manually tuning the perturbation ratio significantly affects model performance.

## 3 Preliminaries

**Pre-training using GCL.** Conventional GCL models use predefined GDA with frozen GDA parameters [Thakoor et al., 2021; Zhu et al., 2021]. As shown in Figure 1, pre-training typically includes three main procedures: graph data augmentation, network encoding, and contrastive-loss-based optimization. Assume the input graph set $\mathcal{D} = \{\mathcal{G}_i \mid i = 1, \ldots, n\}$, where $\mathcal{G}_i$ denotes the $i$-th graph and $n$ denotes the graph count. The augmented view $\widetilde{\mathcal{G}}_i$ is obtained from $\mathcal{G}_i$ by a specific GDA operation $\mathcal{A}_{\omega}$, where $\omega$ represents the GDA parameter matrix; for example, $\mathcal{A}_{\omega}$ can be edge removing with a removing ratio of 0.2 for each node. Next, $\mathcal{G}_i$ and $\widetilde{\mathcal{G}}_i$ are passed into the encoder network $f_\phi(\cdot)$ as well as the projection network $g_\varphi(\cdot)$ and are extracted into two $D$-dimensional graph embeddings $Z_i, \widetilde{Z}_i \in \mathbb{R}^D$. Finally, the parameters $\phi$ and $\varphi$ are updated by optimizing a predefined contrastive loss such as InfoNCE [Zhu et al., 2020; You et al., 2020] between $Z_i$ and $\widetilde{Z}_i$.
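The following sketch turns the three procedures above into one pre-training step. It is a generic frozen-GDA pipeline under stated assumptions: `augment` stands for any fixed $\mathcal{A}_{\omega}$ (e.g., the edge-removing sketch from Section 1), `encoder` for $f_\phi$, `projector` for $g_\varphi$; the InfoNCE form treats $(\mathcal{G}_i, \widetilde{\mathcal{G}}_i)$ as the positive pair and other graphs in the batch as negatives. All names are placeholders.

```python
import torch
import torch.nn.functional as F

def infonce(z: torch.Tensor, z_aug: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """InfoNCE over a batch: (Z_i, Z~_i) are positives; other graphs are negatives."""
    z, z_aug = F.normalize(z, dim=-1), F.normalize(z_aug, dim=-1)
    logits = z @ z_aug.t() / tau
    return F.cross_entropy(logits, torch.arange(z.size(0)))

def pretrain_step(batch, augment, encoder, projector, optimizer) -> float:
    """One frozen-GDA step: augment -> encode -> project -> update phi, varphi."""
    z = projector(encoder(batch))                 # Z_i from the raw graphs
    z_aug = projector(encoder(augment(batch)))    # Z~_i from A_omega(G_i)
    loss = infonce(z, z_aug)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```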
**Fine-tuning on downstream tasks.** GCL is generally followed by a certain downstream task such as graph classification [Sun et al., 2020; Suresh et al., 2021; Yin et al., 2022]. When the training of GCL is completed, the graph embedding $Z_i$ is generated from graph $\mathcal{G}_i$ by the trained encoder network and projection network. Then $Z_i$ serves as the input of the downstream task's classifier, such as a linear logistic regression model or an SVM model. Frequently used notations are listed and described in Table 1.

| Notation | Description |
| --- | --- |
| $\mathcal{G}_i$ | The $i$-th graph in input graph set $\mathcal{D}$. |
| $\mathcal{G}_i^{(t)}$ | Augmented view of $\mathcal{G}_i$ at time $t$. |
| $\theta_A, \theta_C$ | Parameter sets of Actor and Critic. |
| $M_i \in \mathbb{R}^+$ | Number of node attributes in $\mathcal{G}_i$. |
| $N_i \in \mathbb{R}^+$ | Number of nodes in $\mathcal{G}_i$. |
| $D \in \mathbb{R}^+$ | Hidden size of embeddings. |
| $T \in \mathbb{R}^+$ | Duration time of Actor or Critic. |
| $Z_i, Z_i^{(t)} \in \mathbb{R}^D$ | Graph embeddings of $\mathcal{G}_i$ and $\mathcal{G}_i^{(t)}$. |
| $[X_{\mathcal{G}_i}]_v \in \mathbb{R}^{M_i}$ | Attribute vector of node $v$ in $\mathcal{G}_i$. |
| $\omega_i^{(t)} \in \mathbb{R}^{N_i \times N_i}$ | GDA parameter matrix of $\mathcal{G}_i^{(t)}$. |
| $Q_i^{(t)}, V_i^{(t)} \in \mathbb{R}^{N_i \times N_i}$ | Q-value matrix and V-value matrix. |
| $A_i^{(t)} \in \mathbb{R}^{N_i \times N_i}$ | Advantage-value matrix. |
| $f_\phi(\cdot), g_\varphi(\cdot)$ | Encoder network, projection network. |

*Table 1: Frequently used notations.*

## 4 Methods

### 4.1 GRA Framework

Reinforcement learning involves no supervisor; only a reward signal tells an agent whether it is doing well or not [Wang et al., 2022b]. From this perspective, reinforcement learning is suitable for characterizing an evolving view learner in GDA. We illustrate the proposed GRA framework in Figure 2. Specifically, there are three key components in this framework.

**(C1) Min-max optimization of contrastive loss and cumulative reward.** We design a mechanism of double-layer model learning: the outer layer minimizes the InfoNCE loss $\mathcal{L}_{\mathrm{NCE}}(\cdot)$ between two views by updating only $\phi$ and $\varphi$; the inner layer maximizes the expected cumulative reward $R(\cdot)$ by updating only $\omega_i$ ($i = 1, \ldots, n$). The whole min-max optimization is formalized as follows:

$$
\begin{aligned}
\min_{\phi, \varphi} \;\; & \mathcal{L}_{\mathrm{NCE}}\Big(g_\varphi\big(f_\phi(\mathcal{G}_i)\big),\; g_\varphi\big(f_\phi(\mathcal{A}_{\omega_i^{*}}(\mathcal{G}_i))\big)\Big) \\
\text{s.t.}\;\; & \omega_i^{*} = \arg\max_{\omega_i} R(\omega_i).
\end{aligned}
\tag{1}
$$

*Figure 2: Forward propagation process of the GRA framework (the original figure shows the augmentation operations $\mathcal{A}_{\omega_i^{(t)}}$ from time $1$ to time $T$, the frozen and learnable modules, and the cumulative-reward-based and contrastive-loss-based optimization paths).*

For backpropagation, it is performed twice in one epoch: once at the end of the inner-layer optimization and once at the end of the outer-layer optimization.
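A minimal sketch of this double-layer schedule is given below, reusing the `reward` and `infonce` helpers sketched earlier. It deliberately flattens GA2C's Actor-Critic machinery into a single differentiable `view_learner` (holding $\omega$) so that only the inner-maximization/outer-minimization order of Eq. (1) is shown; all names are placeholders, not the paper's implementation.

```python
import torch

def train_epoch(batch, encoder, projector, view_learner,
                enc_opt, view_opt, T: int = 4) -> float:
    """One epoch of the GRA min-max scheme (illustrative only)."""
    # Inner layer: maximize the cumulative reward R(omega) over T time steps,
    # updating only the view learner's parameters omega.
    cumulative_reward = 0.0
    for t in range(T):
        view = view_learner(batch)            # A_omega(G_i) at time step t
        z = projector(encoder(batch))
        z_aug = projector(encoder(view))
        cumulative_reward = cumulative_reward + reward(z, z_aug)
    view_opt.zero_grad()
    (-cumulative_reward).backward()           # gradient ascent on the reward
    view_opt.step()

    # Outer layer: with omega* fixed, minimize InfoNCE over phi and varphi.
    with torch.no_grad():
        view = view_learner(batch)
    z = projector(encoder(batch))
    z_aug = projector(encoder(view))
    loss = infonce(z, z_aug)
    enc_opt.zero_grad()
    loss.backward()
    enc_opt.step()
    return loss.item()
```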
**(C2) Markov decision process for GDA.** The GDA procedure in the GRA framework is continuous and learnable. The continuity ensures that the view learner adjusts the current GDA parameters according to the state at the previous time. Therefore, the augmented view progressively evolves into a learning-friendly view during model training. To be specific, the determination of GDA parameters is regarded as a Markov decision process: given the present augmented view $\mathcal{G}_i^{(t)}$ at time $t$ (note that one epoch includes $T$ time steps), the next augmented view $\mathcal{G}_i^{(t+1)}$ at time $t+1$ is independent of any past view $\mathcal{G}_i^{(t')}$ with $t' < t$.
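Restated in symbols (our formalization, consistent with the notation of Table 1), the Markov property underlying this design reads:

```latex
% Markov property of the GDA process: the next augmented view depends
% only on the current view, not on the full history of past views.
P\!\left(\mathcal{G}_i^{(t+1)} \,\middle|\, \mathcal{G}_i^{(t)}, \mathcal{G}_i^{(t-1)}, \ldots, \mathcal{G}_i^{(1)}\right)
  \;=\; P\!\left(\mathcal{G}_i^{(t+1)} \,\middle|\, \mathcal{G}_i^{(t)}\right).
```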