# Graph Contrastive Learning Automated

Yuning You¹, Tianlong Chen², Yang Shen¹, Zhangyang Wang²

¹Texas A&M University, ²The University of Texas at Austin. Correspondence to: Yang Shen, Zhangyang Wang.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** Self-supervised learning on graph-structured data has drawn recent interest for learning generalizable, transferable and robust representations from unlabeled graphs. Among many, graph contrastive learning (GraphCL) has emerged with promising representation learning performance. Unfortunately, unlike its counterpart on image data, the effectiveness of GraphCL hinges on ad-hoc data augmentations, which have to be manually picked per dataset, by either rules of thumb or trial and error, owing to the diverse nature of graph data. That significantly limits the more general applicability of GraphCL. Aiming to fill in this crucial gap, this paper proposes a unified bi-level optimization framework to automatically, adaptively and dynamically select data augmentations when performing GraphCL on specific graph data. The general framework, dubbed JOint Augmentation Optimization (JOAO), is instantiated as min-max optimization. The selections of augmentations made by JOAO are shown to be in general aligned with previous best practices observed from handcrafted tuning, yet now automated, more flexible and versatile. Moreover, we propose a new augmentation-aware projection head mechanism, which routes output features through different projection heads corresponding to the different augmentations chosen at each training step. Extensive experiments demonstrate that JOAO performs on par with, and sometimes better than, state-of-the-art competitors including GraphCL, on multiple graph datasets of various scales and types, yet without resorting to any laborious dataset-specific tuning on augmentation selection. We release the code at https://github.com/Shen-Lab/GraphCL_Automated.

## 1. Introduction

Self-supervised learning on graph-structured data has raised significant interest recently (Hu et al., 2019; You et al., 2020b; Jin et al., 2020; Hu et al., 2020b; Hwang et al., 2020; Manessi & Rozza, 2020; Zhu et al., 2020a; Peng et al., 2020a; Rong et al., 2020; Jin et al., 2021b; Wu et al., 2021; Roy et al., 2021; Huang et al., 2021; Li et al., 2021). Among many others, graph contrastive learning methods extend the contrastive learning idea (He et al., 2020; Chen et al., 2020c), originally developed in the computer vision domain, to learn generalizable, transferable and robust representations from unlabeled graph data (Veličković et al., 2018; Sun et al., 2019; You et al., 2020a; Qiu et al., 2020; Hassani & Khasahmadi, 2020; Zhu et al., 2020b;c; Chen et al., 2020b;a; Ren et al., 2019; Park et al., 2020; Peng et al., 2020b; Jin et al., 2021a; Wang & Liu, 2021).

Nevertheless, unlike images, graph datasets are abstractions of diverse nature (e.g. pandemics, citation networks, biochemical molecules, or social networks). Such a unique diversity challenge was not fully addressed by prior graph self-supervised learning approaches (Hu et al., 2019; You et al., 2020a;b). For example, the state-of-the-art graph contrastive learning framework, GraphCL (You et al., 2020a), constructs specific contrastive views of graph data via hand-picking ad-hoc augmentations for every dataset (You et al., 2020a; Zhao et al., 2020; Kong et al., 2020).
The choice of augmentation follows empirical rules of thumb, typically summarized from many trial-and-error experiments per dataset. That seriously prohibits GraphCL and its variants from broader applicability, considering the tremendous heterogeneity of practical graph data. Moreover, even such trial-and-error selection of augmentations relies on a labeled validation set for downstream evaluation, which is not always available (Dwivedi et al., 2020; Hu et al., 2020a).

**Contributions.** Given a new and unseen graph dataset, can our graph contrastive learning methods automatically select their data augmentations, avoiding ad-hoc choices or tedious tuning? This paper targets this crucial, unique, and inherent hurdle. We propose joint augmentation optimization (JOAO), a principled bi-level optimization framework that, for the first time, learns to select data augmentations. To highlight, the selection framework of JOAO is: (i) automatic, completely free of human trial and error on augmentation choices; (ii) adaptive, generalizing smoothly to diverse graph data; and (iii) dynamic, allowing augmentation types to vary at different training stages. Compared to previous ad-hoc, per-dataset and pre-fixed augmentation selection, JOAO achieves an unprecedented degree of flexibility and ease of use. We summarize our contributions:

- Leveraging GraphCL (You et al., 2020a) as the baseline model, we introduce joint augmentation optimization (JOAO) as a plug-and-play framework. JOAO is the first to automate the augmentation selection when performing contrastive learning on specific graph data. It frees GraphCL from expensive trial and error, empirical ad-hoc rules, and any validation based on labeled data.
- JOAO can be formulated as a unified bi-level optimization framework and instantiated as min-max optimization. It takes inspiration from adversarial perturbations as data augmentations (Xie et al., 2020), and can be solved by an alternating gradient descent algorithm.
- In accordance with the diverse and dynamic augmentations enabled by JOAO, we design a new augmentation-aware projection head for graph contrastive learning. The rationale is to avoid too many complicated augmentations distorting the original data distribution. The idea is to keep one nonlinear projection head per augmentation, and each time use the single head corresponding to the augmentation currently selected by JOAO.
- Extensive experiments demonstrate that GraphCL with JOAO performs on par with, and sometimes better than, state-of-the-art (SOTA) competitors across multiple graph datasets of various types and scales, without resorting to tedious dataset-specific manual tuning or domain knowledge. We also show that the augmentation selections made by JOAO are in general informed and often aligned with previous best practices.

We leave two additional remarks: (1) JOAO is designed to be flexible and versatile. Although this paper mainly demonstrates JOAO on GraphCL, they are not tied to each other; the general optimization formulation of JOAO allows it to be easily integrated with other graph contrastive learning frameworks too. (2) JOAO is designed to automate the tedious and ad-hoc augmentation selection. It intends to match the state-of-the-art results achieved by exhaustive manual tuning, but not necessarily to surpass them all.
To re-iterate, our aim is to scale up graph contrastive learning to the numerous types and scales of graph data in the real world, via a hassle-free framework rather than tuning dataset by dataset.

## 2. Preliminaries and Notations

**Graph neural networks (GNNs)** have grown into powerful tools to model non-Euclidean graph-structured data arising from various fields (Xu et al., 2018; You et al., 2020c; You & Shen, 2020; Liu et al., 2020; Zhang et al., 2020). Let $G = \{\mathcal{V}, \mathcal{E}\}$ denote an undirected graph in the space $\mathcal{G}$, with $\mathcal{V}$ and $\mathcal{E}$ being the sets of nodes and edges, respectively, and $X_v \in \mathbb{R}^D$ for $v \in \mathcal{V}$ being node features. A GNN is defined as a mapping $f: \mathcal{G} \to \mathbb{R}^{D'}$ that encodes a sample graph $G$ into a $D'$-dimensional vector.

**Self-supervised learning on graphs** is shown to learn more generalizable, transferable and robust graph representations by exploiting vast unlabelled data (Jin et al., 2020; Hu et al., 2020b; Hwang et al., 2020; Manessi & Rozza, 2020; Zhu et al., 2020a; Peng et al., 2020a; Rong et al., 2020; Jin et al., 2021b; Wu et al., 2021; Roy et al., 2021; Huang et al., 2021; Li et al., 2021). However, earlier self-supervised tasks often need to be carefully designed with domain knowledge (You et al., 2020b; Hu et al., 2019) due to the intrinsic complexity of graph datasets.

**Graph contrastive learning** recently emerges as a promising direction (Veličković et al., 2018; Sun et al., 2019; Qiu et al., 2020; Hassani & Khasahmadi, 2020; Zhu et al., 2020b;c; Chen et al., 2020b;a; Ren et al., 2019; Park et al., 2020; Peng et al., 2020b; Jin et al., 2021a; Wang & Liu, 2021). For example, the SOTA GraphCL framework (You et al., 2020a) enforces perturbation invariance in GNNs by maximizing the agreement between two augmented views of graphs; an overview is illustrated in Figure 1.

Figure 1: Overview of the GraphCL pipeline in (You et al., 2020a).

Specifically, we denote the input graph-structured sample $\mathsf{G}$ from a certain empirical distribution $P_{\mathsf{G}}$.¹ GraphCL samples two random augmentation operators $\mathsf{A}_1, \mathsf{A}_2$ from a given pool of augmentation types $\mathcal{A} = \{\text{NodeDrop}, \text{Subgraph}, \text{EdgePert}, \text{AttrMask}, \text{Identical}\}$ (You et al., 2020a), where each $A \in \mathcal{A}$ maps $\mathcal{G} \to \mathcal{G}$. GraphCL (You et al., 2020a) optimizes the following loss:

$$
\min_\theta \mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta) = \min_\theta \Big\{ -\mathbb{E}_{P_{\mathsf{G}}} \mathbb{E}_{P_{(\mathsf{A}_1,\mathsf{A}_2)}} \underbrace{\mathrm{sim}\big(T_{\theta,1}(\mathsf{G}), T_{\theta,2}(\mathsf{G})\big)}_{\text{positive pairs}} + \mathbb{E}_{P_{\mathsf{G}}} \mathbb{E}_{P_{\mathsf{A}_1}} \log\Big( \mathbb{E}_{P_{\mathsf{G}'}} \mathbb{E}_{P_{\mathsf{A}_2}} \underbrace{\exp\big(\mathrm{sim}(T_{\theta,1}(\mathsf{G}), T_{\theta,2}(\mathsf{G}'))\big)}_{\text{negative pairs}} \Big) \Big\}, \tag{1}
$$

where $T_{\theta,i}$ ($i = 1, 2$) composes the augmentation, the GNN and the projection head, i.e. $T_{\theta,i}(\mathsf{G}) = g_{\theta_2}(f_{\theta_1}(A_i(\mathsf{G})))$, parameterized by $\theta = \{\theta_1, \theta_2\}$; $f_{\theta_1}: \mathcal{G} \to \mathbb{R}^{D'}$ and $g_{\theta_2}: \mathbb{R}^{D'} \to \mathbb{R}^{D''}$ are the shared-weight GNN and projection head, respectively; $\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \lVert v \rVert}$ is the cosine similarity function; $P_{\mathsf{G}'} = P_{\mathsf{G}}$ acts as the negative sampling distribution; and $P_{\mathsf{A}_1}$ and $P_{\mathsf{A}_2}$ are the marginal distributions. After the contrastive pre-training, the pre-trained $f_{\theta_1}$ can be further leveraged for fine-tuning on various downstream tasks.

¹We use the sans-serif typeface to denote a random variable (e.g. $\mathsf{G}$). The same letter in the italic font (e.g. $G$) denotes a sample, and the calligraphic font (e.g. $\mathcal{G}$) denotes the sample space.

In the current GraphCL framework, $(\mathsf{A}_1, \mathsf{A}_2)$ are selected by hand and pre-fixed for each dataset. In other words, $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ is a Dirac distribution with its only spike at the selected augmentation pair. Yet given new graph data, how to select $(\mathsf{A}_1, \mathsf{A}_2)$ relies on no more than loose heuristics.
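For concreteness, below is a minimal PyTorch sketch of a batched form of objective (1), with in-batch samples standing in for the expectation over $P_{\mathsf{G}'}$. The temperature `tau` and the toy embeddings are illustrative assumptions, not the released GraphCL implementation.

```python
# A minimal PyTorch sketch of the GraphCL-style loss in (1). z1 and z2 are the
# projected embeddings of two augmented views of the same mini-batch of graphs,
# i.e., z1 = T_{theta,1}(G) and z2 = T_{theta,2}(G). The batch supplies the
# negatives; the temperature `tau` is an illustrative assumption.
import torch
import torch.nn.functional as F

def graphcl_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    z1 = F.normalize(z1, dim=1)             # so that dot products are cosine sims
    z2 = F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                 # [batch, batch] similarity matrix
    pos = sim.diag()                        # sim(T1(G), T2(G)): positive pairs
    # -E[sim(positive)] + E[log E[exp(sim(negative))]], batch-approximated
    return (-pos + torch.logsumexp(sim, dim=1)).mean()

# toy usage with random 8-graph, 64-dimensional embeddings
loss = graphcl_loss(torch.randn(8, 64), torch.randn(8, 64))
print(float(loss))
```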
## 3. Methodology

### 3.1. JOAO: The Unified Framework

One clear limitation in (1) is that one needs to pre-define the sampling distribution $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ based on prior rules, and only a Dirac distribution (i.e., only one pair for each dataset) was explored. Rather, we propose to dynamically and automatically learn to optimize $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ when performing GraphCL (1), via the following bi-level optimization framework:

$$
\min_\theta \mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta), \quad \text{s.t.} \quad P_{(\mathsf{A}_1,\mathsf{A}_2)} \in \arg\min_{P_{(\mathsf{A}'_1,\mathsf{A}'_2)}} \mathcal{D}(\mathsf{G}, \mathsf{A}'_1, \mathsf{A}'_2, \theta). \tag{2}
$$

We refer to (2) as joint augmentation optimization (JOAO), where the upper-level objective $\mathcal{L}$ is the same as the GraphCL objective (or the objective of any other graph contrastive learning approach), and the lower-level objective $\mathcal{D}$ optimizes the sampling distribution $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ jointly for augmentation-pair selection. Notice that JOAO (2) only exploits the signals from the self-supervised training itself, without accessing downstream labeled data for evaluation.

### 3.2. Instantiation of JOAO as Min-Max Optimization

Motivated by adversarial training (Wang et al., 2019; Xie et al., 2020), we follow the same philosophy of always exploiting the most challenging data augmentation under the current loss, hence instantiating the general JOAO framework in a concrete min-max optimization form:

$$
\min_\theta \mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta), \quad \text{s.t.} \quad P_{(\mathsf{A}_1,\mathsf{A}_2)} \in \arg\max_{P_{(\mathsf{A}'_1,\mathsf{A}'_2)}} \Big\{ \mathcal{L}(\mathsf{G}, \mathsf{A}'_1, \mathsf{A}'_2, \theta) - \frac{\gamma}{2}\, \mathrm{dist}\big(P_{(\mathsf{A}'_1,\mathsf{A}'_2)}, P_{\text{prior}}\big) \Big\}, \tag{3}
$$

where $\gamma \in \mathbb{R}_{\geq 0}$, $P_{\text{prior}}$ is the prior distribution on all possible augmentations, and $\mathrm{dist}: \mathbb{P} \times \mathbb{P} \to \mathbb{R}_{\geq 0}$ is a distance function between the sampling and the prior distributions ($\mathbb{P}$ is the probability simplex). Thereby, JOAO's formulation aligns with the idea of model-based adversarial training (Robey et al., 2020), where adversarial training is known to boost generalization, robustness and transferability (Robey et al., 2020; Wang et al., 2019).

In this work, we choose $P_{\text{prior}}$ as the uniform distribution to promote diversity in the selections, following the common principle of maximum entropy (Guiasu & Shenitzer, 1985) in Bayesian learning: no additional information is assumed about the dataset or the augmentation pool. In practice, it encourages more diverse augmentation selections rather than collapsing to a few. A comparison between the formulations with and without the prior is shown in Table S5 of Appendix E. We use a squared Euclidean distance for $\mathrm{dist}(\cdot, \cdot)$. Accordingly, we have $\mathrm{dist}(P_{(\mathsf{A}_1,\mathsf{A}_2)}, P_{\text{prior}}) = \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} \big(p_{ij} - \frac{1}{|\mathcal{A}|^2}\big)^2$, where the probability $p_{ij} = \mathrm{Prob}(\mathsf{A}_1 = A_i, \mathsf{A}_2 = A_j)$.

We next present how to optimize (3). Following (Wang et al., 2019), we adopt the alternating gradient descent (AGD) algorithm, alternating between the upper-level minimization and the lower-level maximization, as outlined in Algorithm 1.

**Algorithm 1** AGD for optimization (3)
> **Input:** initial parameter $\theta^{(0)}$, sampling distribution $P^{(0)}_{(\mathsf{A}_1,\mathsf{A}_2)}$, number of steps $N$.
> **for** $n = 1$ **to** $N$ **do**
> 1. Upper-level minimization: fix $P_{(\mathsf{A}_1,\mathsf{A}_2)} = P^{(n-1)}_{(\mathsf{A}_1,\mathsf{A}_2)}$ and call equation (4) to update $\theta^{(n)}$.
> 2. Lower-level maximization: fix $\theta = \theta^{(n)}$ and call equation (9) to update $P^{(n)}_{(\mathsf{A}_1,\mathsf{A}_2)}$.
> **end for**
> **Return:** optimized parameter $\theta^{(N)}$.

**Upper-level minimization.** The upper-level minimization w.r.t. $\theta$ follows the conventional gradient descent procedure as in the GraphCL optimization (1), given the sampling distribution $P_{(\mathsf{A}_1,\mathsf{A}_2)}$:

$$
\theta^{(n)} = \theta^{(n-1)} - \alpha\, \nabla_\theta \mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta), \tag{4}
$$

where $\alpha \in \mathbb{R}_{>0}$ is the learning rate.
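As a sketch of what one such step looks like in code: sample a pair $(A_i, A_j)$ from the current $P_{(\mathsf{A}_1,\mathsf{A}_2)}$, encode the two augmented views, and take one gradient step. Here `augmentations`, `model`, and `loss_fn` are hypothetical stand-ins for the pool $\mathcal{A}$, the composition $g(f(\cdot))$, and the loss in (1); this is not the released training loop.

```python
# A sketch of one upper-level minimization step (4), given the current sampling
# distribution p over the |A| x |A| augmentation pairs. `augmentations`,
# `model`, and `loss_fn` are hypothetical stand-ins for the pool A, the
# composition g(f(.)), and the contrastive loss (1).
import numpy as np

def upper_level_step(model, optimizer, batch, p, augmentations, loss_fn):
    n_aug = len(augmentations)
    k = np.random.choice(n_aug * n_aug, p=p)   # draw a pair index ~ P(A1, A2)
    i, j = divmod(k, n_aug)                    # decode the pair (Ai, Aj)
    z1 = model(augmentations[i](batch))        # view 1: T_{theta,1}(G)
    z2 = model(augmentations[j](batch))        # view 2: T_{theta,2}(G)
    loss = loss_fn(z1, z2)                     # equation (1) on this mini-batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # theta <- theta - alpha * grad, (4)
    return float(loss)
```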
**Lower-level maximization.** Since it is not intuitive to directly calculate the gradient of the lower-level objective w.r.t. $P_{(\mathsf{A}_1,\mathsf{A}_2)}$, we first rewrite the contrastive loss in (1) as:

$$
\mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta) = \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} \underbrace{p_{ij}}_{\text{targeted}} \Big\{ -\mathbb{E}_{P_{\mathsf{G}}}\, \mathrm{sim}\big(T^i_\theta(\mathsf{G}), T^j_\theta(\mathsf{G})\big) + \mathbb{E}_{P_{\mathsf{G}}} \log\Big( \sum_{j'=1}^{|\mathcal{A}|} \underbrace{p_{j'}}_{\text{undesired}} \mathbb{E}_{P_{\mathsf{G}'}} \exp\big(\mathrm{sim}(T^i_\theta(\mathsf{G}), T^{j'}_\theta(\mathsf{G}'))\big) \Big) \Big\}, \tag{5}
$$

where $T^i_\theta(\mathsf{G}) = g_{\theta_2}(f_{\theta_1}(A_i(\mathsf{G})))$ $(i = 1, \ldots, 5)$, and the marginal probabilities are $p_{j'} = \sum_{i'} p_{i'j'} = \mathrm{Prob}(\mathsf{A}_2 = A_{j'})$. In equation (5), we expand the expectation over the augmentations $\mathsf{A}_1, \mathsf{A}_2$ into a weighted summation over $p_{ij}$ in order to calculate the gradient. However, within the expectation over $\mathsf{G}$ in the negative-pair term, the marginal probabilities $p_{j'}$ remain entangled; we therefore make the following numerical approximation of a lower bound of the negative-pair term to disentangle $p_{ij}$ in (5):

$$
\mathbb{E}_{P_{\mathsf{G}}} \mathbb{E}_{P_{\mathsf{A}_1}} \log\Big( \mathbb{E}_{P_{\mathsf{G}'}} \mathbb{E}_{P_{\mathsf{A}_2}} \exp\big(\mathrm{sim}(T_{\theta,1}(\mathsf{G}), T_{\theta,2}(\mathsf{G}'))\big) \Big) \;\geq\; \mathbb{E}_{P_{\mathsf{G}}} \mathbb{E}_{P_{\mathsf{A}_1}} \mathbb{E}_{P_{\mathsf{A}_2}} \log\Big( \mathbb{E}_{P_{\mathsf{G}'}} \exp\big(\mathrm{sim}(T_{\theta,1}(\mathsf{G}), T_{\theta,2}(\mathsf{G}'))\big) \Big) \;\approx\; \mathbb{E}_{P_{\mathsf{G}}} \mathbb{E}_{P_{(\mathsf{A}_1,\mathsf{A}_2)}} \log\Big( \mathbb{E}_{P_{\mathsf{G}'}} \exp\big(\mathrm{sim}(T_{\theta,1}(\mathsf{G}), T_{\theta,2}(\mathsf{G}'))\big) \Big), \tag{6}
$$

where the first inequality comes from Jensen's inequality, and the second step is a numerical approximation. This results in the approximated contrastive loss:

$$
\mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta) \approx \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} \underbrace{p_{ij}}_{\text{targeted}} \ell(\mathsf{G}, A_i, A_j, \theta), \quad \ell(\mathsf{G}, A_i, A_j, \theta) = -\mathbb{E}_{P_{\mathsf{G}}}\, \mathrm{sim}\big(T^i_\theta(\mathsf{G}), T^j_\theta(\mathsf{G})\big) + \mathbb{E}_{P_{\mathsf{G}}} \log\Big( \mathbb{E}_{P_{\mathsf{G}'}} \exp\big(\mathrm{sim}(T^i_\theta(\mathsf{G}), T^j_\theta(\mathsf{G}'))\big) \Big). \tag{7}
$$

With the approximated contrastive loss, the lower-level maximization in (3) is rewritten as:

$$
P_{(\mathsf{A}_1,\mathsf{A}_2)} \in \arg\max_{p \in \mathbb{P},\; p = [p_{ij}],\; i,j = 1, \ldots, |\mathcal{A}|} \psi(p), \quad \psi(p) = \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} p_{ij}\, \ell(\mathsf{G}, A_i, A_j, \theta) - \gamma \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} \Big(p_{ij} - \frac{1}{|\mathcal{A}|^2}\Big)^2, \tag{8}
$$

where $\psi(p)$ is a strongly-concave function w.r.t. $p$ in the probability simplex $\mathbb{P}$. Thus, projected gradient descent (Wang et al., 2019; Boyd et al., 2004) is performed to update the sampling distribution $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ for selecting augmentation pairs:

$$
b = p^{(n-1)} + \alpha'\, \nabla_p \psi(p^{(n-1)}), \quad p^{(n)} = (b - \mu \mathbf{1})_+, \tag{9}
$$

where $\alpha' \in \mathbb{R}_{>0}$ is the learning rate, $\mu$ is the root of the equation $\mathbf{1}^\top (b - \mu \mathbf{1})_+ = 1$, and $(\cdot)_+$ is the element-wise non-negative operator. $\mu$ can be efficiently found via the bisection method (Wang et al., 2019; Boyd et al., 2004).

Even though an optimizer with a theoretical guarantee of convergence for non-convex non-concave min-max problems remains an open challenge, we acknowledge that AGD is an approximation of solving the bi-level optimization (3) precisely, which typically requires Bayesian optimization (Srinivas et al., 2010; Snoek et al., 2012), automatic differentiation (Luketina et al., 2016; Franceschi et al., 2017; Baydin et al., 2017; Shaban et al., 2019), or first-order techniques based on an approximated inner-loop solution (Maclaurin et al., 2015; Pedregosa, 2016; Gould et al., 2016). As most of these suffer from high time or space complexity, AGD is adopted as an approximate heuristic, mainly to save computational overhead. It shows some level of empirical convergence, as seen in Figure 2.

Figure 2: Empirical training curves of AGD in JOAO on datasets NCI1 and PROTEINS with different γ values.
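To ground the update (9), here is a self-contained numpy sketch of one lower-level step; the pairwise losses `ell` below are synthetic stand-ins for $\ell(\mathsf{G}, A_i, A_j, \theta)$, and the bisection solves $\mathbf{1}^\top(b - \mu\mathbf{1})_+ = 1$ as described above.

```python
# A self-contained numpy sketch of the lower-level update (9): one projected
# gradient-ascent step on psi(p) from (8), with mu found by bisection so that
# the updated p stays on the probability simplex. The pairwise losses `ell`
# are synthetic stand-ins for l(G, Ai, Aj, theta).
import numpy as np

def project_to_simplex(b: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Solve 1^T (b - mu*1)_+ = 1 for mu by bisection, then clip (eq. 9)."""
    lo, hi = b.min() - 1.0, b.max()
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(b - mu, 0.0).sum() > 1.0:
            lo = mu          # too much mass remains: mu must increase
        else:
            hi = mu
    return np.maximum(b - 0.5 * (lo + hi), 0.0)

def lower_level_step(p, ell, gamma=0.1, lr=0.05, n_aug=5):
    grad = ell - 2.0 * gamma * (p - 1.0 / n_aug**2)   # gradient of psi(p) in (8)
    return project_to_simplex(p + lr * grad)          # ascent step, then project

p = np.full(25, 1.0 / 25)                             # uniform P(A1,A2), |A| = 5
ell = np.random.default_rng(0).random(25)             # synthetic pairwise losses
p = lower_level_step(p, ell)
assert abs(p.sum() - 1.0) < 1e-6 and (p >= 0.0).all()
```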
### 3.2.1. Sanity Check: JOAO Recovers Augmentation Pairs Aligned with Previous Best Practices

How reasonable are the JOAO-selected augmentation pairs per dataset? This section passes JOAO through a sanity check, comparing its selections with previous trial-and-error findings obtained by manually and exhaustively combining different augmentations (using downstream labels for validation) (You et al., 2020a).

Table 1: Dataset statistics.

| Datasets | Category | Graph Num. | Avg. Node | Avg. Degree |
|---|---|---|---|---|
| NCI1 | Biochemical Molecules | 4110 | 29.87 | 1.08 |
| PROTEINS | Biochemical Molecules | 1113 | 39.06 | 1.86 |
| COLLAB | Social Networks | 5000 | 74.49 | 32.99 |
| RDT-B | Social Networks | 2000 | 429.63 | 1.15 |

To examine such alignment, we visualize in the top row of Figure 3 the JOAO-optimized sampling distributions $P_{(\mathsf{A}_1,\mathsf{A}_2)}$, and in the bottom row GraphCL's manual trial-and-error results over various augmentation pairs, for four different datasets (statistics in Table 1). Please refer to the caption of Figure 3 for how to interpret the percentage numbers in the top and bottom rows. Overall, we observe a decent extent of alignment between the two rows' trends, and especially between their high-value locations, indicating that if an augmentation pair was manually verified to yield better GraphCL results, it is also more likely to be selected by JOAO. More specifically: (1) augmentation pairs containing EdgePert and AttrMask are more likely to be selected for biochemical molecules and denser graphs, respectively; (2) NodeDrop and Subgraph are generally adopted on all four datasets; and (3) augmentation pairs of two identity transformations are completely abandoned by JOAO, while pairs with more diverse transformation types are more desired. All these observations are well aligned with the rules of thumb summarized in (You et al., 2020a). More discussions are provided in Appendix D.

Figure 3: Top row: sampling distributions (%, defined as the percentage of a specific augmentation pair being selected during the entire training process) for augmentation pairs selected by JOAO on four different datasets (NCI1, PROTEINS, COLLAB, and RDT-B). Bottom row: GraphCL performance gains (classification accuracy %, see (You et al., 2020a) for the detailed setting) when exhaustively trying every possible augmentation pair. Note that the percentage numbers in the first and second rows have different meanings and are not apples-to-apples comparable; however, the overall alignment between the two rows' trends and high-value locations indicates that, if an augmentation pair was manually verified to yield better GraphCL results, it is also more likely to be selected by JOAO. Warmer (colder) colors indicate higher (lower) values, and white marks 0.

Therefore, the selections of augmentations made by JOAO are shown to be generally consistent with previous best practices observed from manual tuning, yet now fully automated, flexible, and versatile. This is also achieved without using any downstream task label, whereas (You et al., 2020a) would hinge on a labeled set to compare two augmentations by their downstream performance.

### 3.3. Augmentation-Aware Multi-Projection Heads: Addressing a New Challenge from JOAO

JOAO conveys the blessing of diverse and dynamic augmentations that are selected automatically during each GraphCL training run, which may yield more robust and invariant features. However, that blessing could also bring a new challenge: compared to one fixed augmentation pair throughout training, those varying and more aggressive augmentations can distort the training distribution more (Lee et al., 2020; Jun et al., 2020). Even mild augmentations, such as adding/dropping nodes or edges, could result in graphs very unlikely under the original distribution. Models trained with these augmentations may fit the original distribution poorly.
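To make this distribution-shift concern concrete, consider a toy NodeDrop-style operator; dropping nodes from a small, sparse graph (here a 6-node path) can easily disconnect it, producing a sample that is unlikely under the original data distribution. This toy is illustrative only, not GraphCL's released augmentation code.

```python
# A toy NodeDrop-style augmentation on a dense adjacency matrix, for intuition
# only: removing nodes (with their incident edges) can disconnect a sparse
# graph, i.e., push the augmented view off the original data distribution.
import numpy as np

def node_drop(adj: np.ndarray, x: np.ndarray, ratio: float = 0.2, seed: int = 0):
    """Remove a random `ratio` of nodes, together with their incident edges."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    keep = np.sort(rng.choice(n, size=max(1, int(n * (1 - ratio))), replace=False))
    return adj[np.ix_(keep, keep)], x[keep]

adj = np.eye(6, k=1, dtype=int) + np.eye(6, k=-1, dtype=int)  # 6-node path graph
x = np.arange(6, dtype=float)[:, None]                        # scalar node features
adj_aug, x_aug = node_drop(adj, x)
print(adj_aug.shape, x_aug.shape)  # (4, 4) (4, 1): smaller, possibly disconnected
```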
To address this challenge arising from JOAO, we introduce multiple projection heads and an augmentation-aware selection scheme into GraphCL, as glimpsed in Figure 4 (see Figure S2 in Appendix D for a schematic diagram). Specifically, we construct $|\mathcal{A}|$ projection heads, each of which corresponds to one augmentation type ($|\mathcal{A}|$ denotes the cardinality of the augmentation pool). During training, once an augmentation is sampled, it only goes through and updates its corresponding projection head. The main idea is to explicitly disentangle the distorted feature distributions caused by the various augmentation pairs; each time, we only use the head corresponding to the augmentation currently selected by JOAO.

Figure 4: An overview of GraphCL with multiple augmentation-aware projection heads, where $P_{(g_{\Theta'_1}, g_{\Theta'_2})} = P_{(\mathsf{A}_1,\mathsf{A}_2)}$.

In mathematical form, we route the output features from $f$ through the projection heads sampled from $P_{(g_{\Theta'_1}, g_{\Theta'_2})}$ at each training step, where $P_{(g_{\Theta'_1}, g_{\Theta'_2})} = P_{(\mathsf{A}_1,\mathsf{A}_2)}$ and $\Theta'_1, \Theta'_2$ denote the head parameters, resulting in $T_{\theta,i}(\mathsf{G}) = g_{\Theta'_i}(f_{\theta_1}(A_i(\mathsf{G})))$ $(i = 1, 2)$. Denoting $\mathcal{L}_{v2}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta_1, \Theta'_1, \Theta'_2) = \mathbb{E}_{P_{(g_{\Theta'_1}, g_{\Theta'_2})}} \mathcal{L}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \{\theta_1, (\Theta'_1, \Theta'_2)\})$, we can then integrate the augmentation-aware projection head mechanism into the JOAO framework, referred to as JOAOv2:

$$
\min_{\theta_1, \Theta'_1, \Theta'_2} \mathcal{L}_{v2}(\mathsf{G}, \mathsf{A}_1, \mathsf{A}_2, \theta_1, \Theta'_1, \Theta'_2), \quad \text{s.t.} \quad P_{(\mathsf{A}_1,\mathsf{A}_2)} \in \arg\max_{P_{(\mathsf{A}'_1,\mathsf{A}'_2)}} \Big\{ \mathcal{L}_{v2}(\mathsf{G}, \mathsf{A}'_1, \mathsf{A}'_2, \theta_1, \Theta'_1, \Theta'_2) - \frac{\gamma}{2}\, \mathrm{dist}\big(P_{(\mathsf{A}'_1,\mathsf{A}'_2)}, P_{\text{prior}}\big) \Big\}, \quad P_{(g_{\Theta'_1}, g_{\Theta'_2})} = P_{(\mathsf{A}_1,\mathsf{A}_2)}. \tag{10}
$$

Algorithm 1 can easily be adapted to solve (10) (see Algorithm S1 of Appendix A). Our preliminary experiments in Table 2 show that, without bells and whistles, augmentation-aware projection heads improve the performance upon JOAO under different augmentation strengths. This aligns with the observations in (Lee et al., 2020; Jun et al., 2020) that disentangling the augmented and original feature distributions lets the model benefit more from stronger augmentations.

Table 2: Experiments with JOAO and JOAOv2, without explicit hyper-parameter tuning, under different augmentation strengths on NCI1 and PROTEINS. A.S. is short for augmentation strength.

| Datasets | A.S. | JOAO | JOAOv2 |
|---|---|---|---|
| NCI1 | 0.2 | 61.77±1.61 | 62.52±1.16 |
| NCI1 | 0.25 | 60.95±0.55 | 61.67±0.72 |
| PROTEINS | 0.2 | 71.45±0.89 | 71.66±1.10 |
| PROTEINS | 0.25 | 71.61±1.65 | 73.01±1.02 |

We also plot the learned $P_{(\mathsf{A}_1,\mathsf{A}_2)}$ in Figure S1 of Appendix D, where we can observe an even stronger alignment than presented in Figure 3.
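A minimal PyTorch sketch of the augmentation-aware routing described in this section follows; the two-layer MLP heads and the layer sizes are illustrative assumptions, not the released architecture.

```python
# A minimal PyTorch sketch of the augmentation-aware projection heads of
# Section 3.3: one head per augmentation type, selected by the index of the
# augmentation drawn from P(A1,A2). Head depth and dimensions are assumptions.
import torch
import torch.nn as nn

class AugmentationAwareHeads(nn.Module):
    def __init__(self, dim: int = 64, proj_dim: int = 64, n_aug: int = 5):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))
            for _ in range(n_aug)
        )

    def forward(self, h: torch.Tensor, aug_idx: int) -> torch.Tensor:
        # Route the GNN output h = f(A_i(G)) through the head matching A_i, so
        # the feature distributions distorted by different augmentations stay
        # disentangled: only this head is used and updated for this view.
        return self.heads[aug_idx](h)

heads = AugmentationAwareHeads()
h = torch.randn(8, 64)       # batch of GNN embeddings for views built with A_2
z = heads(h, aug_idx=2)      # z feeds the contrastive loss as T_{theta,2}(G)
```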
## 4. Experiments

In this section, we evaluate our proposed methods, JOAO and JOAOv2, against state-of-the-art (SOTA) competitors, including self-supervised approaches heuristically designed with domain knowledge and graph contrastive learning (GraphCL) with pre-defined rules for augmentation selection, in two scenarios: datasets originating from diverse sources, and datasets from specific bioinformatics domains. A summary of the main results can be found in Table 3.

Table 3: Summary of JOAO performance.

| | vs. GraphCL | vs. Heuristic methods |
|---|---|---|
| Across diverse fields | Comparable | Better |
| On specific domains | Better | Worse |

### 4.1. Datasets and Experiment Settings

**Datasets.** We use datasets of diverse nature from the benchmark TUDataset (Morris et al., 2020), including graph data for small molecules & proteins (Riesen & Bunke, 2008; Dobson & Doig, 2003), computer vision (Nene et al., 1996) and various relation networks (Yanardag & Vishwanathan, 2015; Rozemberczki et al., 2020) of diverse statistics (see Table S1 of Appendix B), under semi-supervised and unsupervised learning. Additionally, we gather domain-specific bioinformatics datasets from the benchmark of (Hu et al., 2019) with relatively similar statistics (see Table S2 of Appendix B), under transfer-learning tasks for predicting molecules' chemical properties or proteins' biological functions. Lastly, we take two large-scale benchmark datasets, ogbg-ppa & ogbg-code from the Open Graph Benchmark (OGB) (Hu et al., 2020a) (see Table S3 of Appendix B for statistics), to evaluate scalability under semi-supervised learning.

**Learning protocols.** Learning experiments are performed in three settings, following the same protocols as in the SOTA works. (1) In semi-supervised learning (You et al., 2020a), on datasets without an explicit training/validation/test split, we perform pre-training with all data and fine-tuning & evaluation with K folds, where K = 1/(label rate); on datasets with a train/validation/test split, we perform pre-training only with the training data, fine-tuning on the partial training data, and evaluation on the validation/test sets. (2) In unsupervised representation learning (Sun et al., 2019), we pre-train on the whole dataset to learn graph embeddings and feed them into a downstream SVM classifier with 10-fold cross-validation. (3) In transfer learning (Hu et al., 2019), we pre-train on a larger dataset, then fine-tune and evaluate on smaller datasets of the same category using the given training/validation/test split.

**GNN architectures & augmentations.** We adopt the same GNN architectures with default hyper-parameters as in the SOTA methods under the individual experiment settings. Specifically, (1) in semi-supervised learning, ResGCN (Chen et al., 2019) is used with 5 layers and 128 hidden dimensions; (2) in unsupervised representation learning, GIN (Xu et al., 2018) is used with 3 layers and 32 hidden dimensions; and (3) in transfer learning and on the large-scale OGB datasets, GIN is used with 5 layers and 300 hidden dimensions. We adopt the same graph data augmentations as in GraphCL (You et al., 2020a) with the default augmentation strength 0.2. We tune the hyper-parameter γ, which controls the trade-off in optimization (3), in the range {0.01, 0.1, 1}.

### 4.2. Compared Algorithms

**Training from scratch (with augmentations) and graph kernels.** The naïve baseline of training from random initialization (with the same augmentations as in GraphCL (You et al., 2020a)) is compared, as well as SOTA graph kernel methods including GL (Shervashidze et al., 2009), WL (Shervashidze et al., 2011) and DGK (Yanardag & Vishwanathan, 2015).
**Heuristic self-supervised methods.** Heuristic self-supervised methods are designed based on certain domain knowledge, and work well when such knowledge is available and benefits downstream tasks. The compared ones include: (1) edge-based reconstruction, including GAE (Kipf & Welling, 2016), node2vec (Grover & Leskovec, 2016) and EdgePred (Hu et al., 2019); (2) vertex feature masking & recovery, namely AttrMasking (Hu et al., 2019); (3) sub-structure information preservation, such as sub2vec (Adhikari et al., 2018), graph2vec (Narayanan et al., 2017) and ContextPred (Hu et al., 2019); and (4) global-local representation consistency, such as Infomax (Veličković et al., 2018) & InfoGraph (Sun et al., 2019). We adopt the default hyper-parameters published for these methods.

**GraphCL with pre-fixed augmentation sampling rules.** For constructing the sampling pool of augmentations, we follow the same rule as in (You et al., 2020a): (1) NodeDrop and Subgraph for biochemical molecules, (2) all augmentations for dense relation networks, and (3) all except AttrMask for sparse relation networks. The exact augmentations for each dataset are shown in Table S4 of Appendix C.
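As a plain-data illustration, the three rules above can be written as a mapping from dataset category to the augmentation subset JOAO may select from; the dictionary form here is our own sketch, and the exact per-dataset pools remain those of Table S4.

```python
# A small sketch of the pre-fixed sampling-pool rules from (You et al., 2020a)
# listed above, as a mapping from dataset category to the allowed augmentations.
POOL = ["NodeDrop", "Subgraph", "EdgePert", "AttrMask", "Identical"]

SAMPLING_POOLS = {
    "biochemical molecules": ["NodeDrop", "Subgraph"],                 # rule (1)
    "dense relation networks": list(POOL),                             # rule (2)
    "sparse relation networks": [a for a in POOL if a != "AttrMask"],  # rule (3)
}

print(SAMPLING_POOLS["sparse relation networks"])
```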
### 4.3. Results

#### 4.3.1. On Diverse Datasets from TUDataset

The results of semi-supervised learning and unsupervised representation learning on TUDataset are in Tables 4 & 5, respectively. Through comparisons between (1) JOAO and GraphCL, (2) JOAOv2 and JOAO, and (3) JOAOv2 and the heuristic self-supervised methods, we make the following observations.

Table 4: Semi-supervised learning on TUDataset. Shown in red [in the original] are the best accuracy (%) and those within the standard deviation of the best accuracy, or the best average ranks. "-" indicates that the label rate is too low for the given dataset size. L.R. and A.R. are short for label rate and average rank, respectively. The compared results, except those for ContextPred, are as published under the same experiment setting.

| L.R. | Methods | NCI1 | PROTEINS | DD | COLLAB | RDT-B | RDT-M5K | GITHUB | A.R. |
|---|---|---|---|---|---|---|---|---|---|
| 1% | No pre-train. | 60.72±0.45 | - | - | 57.46±0.25 | - | - | 54.25±0.22 | 7.6 |
| 1% | Augmentations | 60.49±0.46 | - | - | 58.40±0.97 | - | - | 56.36±0.42 | 6.6 |
| 1% | GAE | 61.63±0.84 | - | - | 63.20±0.67 | - | - | 59.44±0.44 | 4.0 |
| 1% | Infomax | 62.72±0.65 | - | - | 61.70±0.77 | - | - | 58.99±0.50 | 3.3 |
| 1% | ContextPred | 61.21±0.77 | - | - | 57.60±2.07 | - | - | 56.20±0.49 | 6.6 |
| 1% | GraphCL | 62.55±0.86 | - | - | 64.57±1.15 | - | - | 58.56±0.59 | 2.6 |
| 1% | JOAO | 61.97±0.72 | - | - | 63.71±0.84 | - | - | 60.35±0.24 | 3.0 |
| 1% | JOAOv2 | 62.52±1.16 | - | - | 64.51±2.21 | - | - | 61.05±0.31 | 2.0 |
| 10% | No pre-train. | 73.72±0.24 | 70.40±1.54 | 73.56±0.41 | 73.71±0.27 | 86.63±0.27 | 51.33±0.44 | 60.87±0.17 | 7.0 |
| 10% | Augmentations | 73.59±0.32 | 70.29±0.64 | 74.30±0.81 | 74.19±0.13 | 87.74±0.39 | 52.01±0.20 | 60.91±0.32 | 6.2 |
| 10% | GAE | 74.36±0.24 | 70.51±0.17 | 74.54±0.68 | 75.09±0.19 | 87.69±0.40 | 53.58±0.13 | 63.89±0.52 | 4.5 |
| 10% | Infomax | 74.86±0.26 | 72.27±0.40 | 75.78±0.34 | 73.76±0.29 | 88.66±0.95 | 53.61±0.31 | 65.21±0.88 | 3.0 |
| 10% | ContextPred | 73.00±0.30 | 70.23±0.63 | 74.66±0.51 | 73.69±0.37 | 84.76±0.52 | 51.23±0.84 | 62.35±0.73 | 7.2 |
| 10% | GraphCL | 74.63±0.25 | 74.17±0.34 | 76.17±1.37 | 74.23±0.21 | 89.11±0.19 | 52.55±0.45 | 65.81±0.79 | 2.4 |
| 10% | JOAO | 74.48±0.27 | 72.13±0.92 | 75.69±0.67 | 75.30±0.32 | 88.14±0.25 | 52.83±0.54 | 65.00±0.30 | 3.5 |
| 10% | JOAOv2 | 74.86±0.39 | 73.31±0.48 | 75.81±0.73 | 75.53±0.18 | 88.79±0.65 | 52.71±0.28 | 66.66±0.60 | 1.8 |

Table 5: Unsupervised representation learning on TUDataset. Red numbers [in the original] indicate the top-3 accuracy (%) or the top-2 average ranks. The compared results are from the published papers, and "-" indicates that results were not available in published papers. For MVGRL we report the numbers with the NT-Xent loss to be comparable with GraphCL.

| Methods | NCI1 | PROTEINS | DD | MUTAG | COLLAB | RDT-B | RDT-M5K | IMDB-B | A.R. |
|---|---|---|---|---|---|---|---|---|---|
| GL | - | - | - | 81.66±2.11 | - | 77.34±0.18 | 41.01±0.17 | 65.87±0.98 | 7.4 |
| WL | 80.01±0.50 | 72.92±0.56 | - | 80.72±3.00 | - | 68.82±0.41 | 46.06±0.21 | 72.30±3.44 | 5.7 |
| DGK | 80.31±0.46 | 73.30±0.82 | - | 87.44±2.72 | - | 78.04±0.39 | 41.27±0.18 | 66.96±0.56 | 4.9 |
| node2vec | 54.89±1.61 | 57.49±3.57 | - | 72.63±10.20 | - | - | - | - | 8.6 |
| sub2vec | 52.84±1.47 | 53.03±5.55 | - | 61.05±15.80 | - | 71.48±0.41 | 36.68±0.42 | 55.26±1.54 | 9.5 |
| graph2vec | 73.22±1.81 | 73.30±2.05 | - | 83.15±9.25 | - | 75.78±1.03 | 47.86±0.26 | 71.10±0.54 | 5.7 |
| MVGRL | - | - | - | 75.40±7.80 | - | 82.00±1.10 | - | 63.60±4.20 | 7.2 |
| InfoGraph | 76.20±1.06 | 74.44±0.31 | 72.85±1.78 | 89.01±1.13 | 70.65±1.13 | 82.50±1.42 | 53.46±1.03 | 73.03±0.87 | 3.0 |
| GraphCL | 77.87±0.41 | 74.39±0.45 | 78.62±0.40 | 86.80±1.34 | 71.36±1.15 | 89.53±0.84 | 55.99±0.28 | 71.14±0.44 | 2.6 |
| JOAO | 78.07±0.47 | 74.55±0.41 | 77.32±0.54 | 87.35±1.02 | 69.50±0.36 | 85.29±1.35 | 55.74±0.63 | 70.21±3.08 | 3.3 |
| JOAOv2 | 78.36±0.53 | 74.07±1.10 | 77.40±1.15 | 87.67±0.79 | 69.33±0.34 | 86.42±1.45 | 56.03±0.27 | 70.83±0.25 | 2.8 |

(i) **With automated selection, JOAO is comparable to GraphCL with ad-hoc rules from exhaustive manual tuning.** With its automatic, adaptive, and dynamic augmentation selection procedure, JOAO performs comparably to GraphCL, whose augmentations follow empirical ad-hoc rules gained from expensive trial and error on the same TUDataset. Specifically, in semi-supervised learning (Table 4), JOAO matches or beats GraphCL in 7 out of 10 experiments, albeit with a slightly worse average rank. Similar observations hold in unsupervised learning (Table 5). These results echo our earlier finding in Section 3.2.1 that the automated selections of augmentation pairs made by JOAO are generally consistent with GraphCL's best practices observed from manual tuning.

(ii) **Augmentation-aware projection heads provide further improvement upon JOAO.** With augmentation-aware projection heads introduced into JOAO, JOAOv2 further improves the performance and sometimes even outperforms GraphCL with its pre-defined rules of thumb for augmentation selection. In semi-supervised learning (Table 4), JOAOv2 achieves the best average ranks of 2.0 and 1.8 under the 1% and 10% label rates, respectively, and in unsupervised representation learning (Table 5) its average rank (2.8) is only edged out by GraphCL (2.6). The performance of JOAOv2 echoes our conjecture in Section 3.3 that explicitly disentangling the distorted feature distributions caused by various augmentation pairs reels in the benefits of stronger augmentations.

(iii) **Across diverse datasets, JOAOv2 generally outperforms heuristic self-supervised methods.** On datasets originating from diverse sources, JOAOv2 generally outperforms the heuristic self-supervised methods.
Specifically, in Table 4, JOAOv2 leads every heuristic self-supervised method by an average-rank gap of no less than 0.3 under the 1% label rate, and by no less than 1.0 under the 10% label rate for all but Infomax, which outperforms JOAO but still underperforms JOAOv2. In Table 5, only InfoGraph outperforms JOAO while still underperforming JOAOv2, and the average-rank gap between the other heuristic methods and JOAOv2 is no less than 1.5. This generally meets our expectation that heuristic self-supervised methods can work well when guided by useful domain knowledge, which is hard to guarantee across diverse datasets. In contrast, JOAOv2 dynamically and automatically adapts augmentation selections during self-supervised training, exploiting the signals (knowledge) from the data itself.

#### 4.3.2. On Specific Bioinformatics Datasets

The results of transfer learning on bioinformatics datasets are in Table 6. Similarly, through comparisons among JOAO(v2), GraphCL and the heuristic self-supervised methods, we make the following findings.

Table 6: Transfer learning on bioinformatics datasets. Red numbers [in the original] indicate the top-3 performances (ROC-AUC in %). Results for SOTA methods are as published.

| Methods | BBBP | Tox21 | ToxCast | SIDER | ClinTox | MUV | HIV | BACE | PPI | A.R. |
|---|---|---|---|---|---|---|---|---|---|---|
| No pre-train. | 65.8±4.5 | 74.0±0.8 | 63.4±0.6 | 57.3±1.6 | 58.0±4.4 | 71.8±2.5 | 75.3±1.9 | 70.1±5.4 | 64.8±1.0 | 6.6 |
| Infomax | 68.8±0.8 | 75.3±0.5 | 62.7±0.4 | 58.4±0.8 | 69.9±3.0 | 75.3±2.5 | 76.0±0.7 | 75.9±1.6 | 64.1±1.5 | 5.3 |
| EdgePred | 67.3±2.4 | 76.0±0.6 | 64.1±0.6 | 60.4±0.7 | 64.1±3.7 | 74.1±2.1 | 76.3±1.0 | 79.9±0.9 | 65.7±1.3 | 3.8 |
| AttrMasking | 64.3±2.8 | 76.7±0.4 | 64.2±0.5 | 61.0±0.7 | 71.8±4.1 | 74.7±1.4 | 77.2±1.1 | 79.3±1.6 | 65.2±1.6 | 3.1 |
| ContextPred | 68.0±2.0 | 75.7±0.7 | 63.9±0.6 | 60.9±0.6 | 65.9±3.8 | 75.8±1.7 | 77.3±1.0 | 79.6±1.2 | 64.4±1.3 | 3.4 |
| GraphCL | 69.68±0.67 | 73.87±0.66 | 62.40±0.57 | 60.53±0.88 | 75.99±2.65 | 69.80±2.66 | 78.47±1.22 | 75.38±1.44 | 67.88±0.85 | 4.6 |
| JOAO | 70.22±0.98 | 74.98±0.29 | 62.94±0.48 | 59.97±0.79 | 81.32±2.49 | 71.66±1.43 | 76.73±1.23 | 77.34±0.48 | 64.43±1.38 | 4.5 |
| JOAOv2 | 71.39±0.92 | 74.27±0.62 | 63.16±0.45 | 60.49±0.74 | 80.97±1.64 | 73.67±1.00 | 77.51±1.17 | 75.49±1.27 | 63.94±1.59 | 4.3 |

(iv) **Without domain expertise incorporated, JOAOv2 underperforms some heuristic self-supervised methods in specific domains.** Conversely to the results on diverse datasets, on the specific bioinformatics datasets JOAOv2 underperforms some heuristic self-supervised methods designed with dataset-specific domain knowledge (Hu et al., 2019), even though it improves the average rank against GraphCL and JOAO. As stated in Section 4.2, the domain knowledge encoded in the compared heuristically designed methods correlates with the downstream datasets, as shown in (Hu et al., 2019). Such knowledge is not made available to JOAOv2, which works well for general datasets (Section 4.3.1) but may not suffice to capture sophisticated domain expertise. Therefore, to make JOAOv2 even more competitive, domain knowledge could be introduced into the framework, for instance through dataset-specific augmentation types and/or priors. We leave this to future work.

(v) **With better generalizability, JOAOv2 outperforms GraphCL on unseen datasets.** Unlike the results on the diverse TUDataset, both JOAO and JOAOv2 outperform GraphCL with its empirically pre-defined rules for augmentation selection on the previously unseen bioinformatics datasets. Specifically, in Table 6, JOAO improves the average rank by 0.1 and JOAOv2 by 0.3 compared to GraphCL. Note that the sampling rules of GraphCL were empirically derived from TUDataset, and hence are not necessarily effective for the previously unseen bioinformatics datasets. In contrast, JOAOv2 dynamically and automatically learns the sampling distributions during self-supervised training, possessing better generalizability.

#### 4.3.3. On Large-Scale OGB Datasets

(vi) **JOAOv2 scales up well to large datasets.** Both JOAO and JOAOv2 scale up to large datasets at least as well as GraphCL does. In Table 7, on the ogbg-ppa dataset, JOAO improves even more significantly over GraphCL (relative to the earlier, smaller datasets), with >3.49% and >1.55% accuracy gains at the 1% and 10% label rates, respectively.
Table 7: Semi-supervised learning on large-scale OGB datasets. Red numbers [in the original] indicate the top-2 performances (accuracy in % on ogbg-ppa, F1 score in % on ogbg-code).

| L.R. | Methods | ogbg-ppa | ogbg-code |
|---|---|---|---|
| 1% | No pre-train. | 16.04±0.74 | 6.06±0.01 |
| 1% | GraphCL | 40.81±1.33 | 7.66±0.25 |
| 1% | JOAO | 47.19±1.30 | 6.84±0.31 |
| 1% | JOAOv2 | 44.30±1.67 | 7.74±0.24 |
| 10% | No pre-train. | 56.01±1.05 | 17.85±0.60 |
| 10% | GraphCL | 57.77±1.25 | 22.45±0.17 |
| 10% | JOAO | 60.91±0.83 | 22.06±0.30 |
| 10% | JOAOv2 | 59.32±1.11 | 22.65±0.22 |

### 4.4. Summary of Main Findings

We briefly summarize the aforementioned results as follows:

- Across datasets originating from diverse sources, JOAO with adaptive augmentation selection performs comparably to GraphCL, a strong baseline with exhaustively hand-tuned augmentation rules. With augmentation-aware projection heads, JOAOv2 further boosts the performance and sometimes even outperforms GraphCL.
- On datasets from specific bioinformatics domains, JOAOv2 achieves better performance than GraphCL, whose empirical rules were not derived from such data, indicating better generalizability to unseen datasets.
- Both JOAO and JOAOv2 outperform heuristic self-supervised methods with few exceptions. They might be further enhanced by encoding domain knowledge.
- JOAOv2 scales up to large datasets as well as GraphCL does, sometimes with even more significant improvements than on smaller datasets.

## 5. Conclusions & Discussions

In this paper, we propose a unified bi-level optimization framework to dynamically and automatically select augmentations in GraphCL, named JOint Augmentation Optimization (JOAO). The general framework is instantiated as min-max optimization, with empirical analysis showing that JOAO makes augmentation selections in general accordance with previous best practices from exhaustive hand tuning on every dataset. Furthermore, a new augmentation-aware projection head mechanism is proposed to overcome the potential training-distribution distortion resulting from the more aggressive and varying augmentations enabled by JOAO. Experiments demonstrate that JOAO and its variant perform on par with, and sometimes better than, state-of-the-art competitors including GraphCL, on multiple graph datasets of various scales and types, without resorting to tedious dataset-specific manual tuning.

Although JOAO automates GraphCL in selecting augmentation pairs, it still relies on human prior knowledge in constructing and configuring the augmentation pool to select from. In this sense, full automation is still desired and will be pursued in future work. Meanwhile, in parallel to the principled bi-level optimization formulation, a meta-learning formulation could also be pursued.

## References

Adhikari, B., Zhang, Y., Ramakrishnan, N., and Prakash, B. A. Sub2vec: Feature learning for subgraphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 170-182. Springer, 2018.

Baydin, A. G., Cornish, R., Rubio, D. M., Schmidt, M., and Wood, F. Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782, 2017.

Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Chen, B., Zhang, J., Zhang, X., Tang, X., Cai, L., Chen, H., Li, C., Zhang, P., and Tang, J. COAD: Contrastive pre-training with adversarial fine-tuning for zero-shot expert linking. arXiv preprint arXiv:2012.11336, 2020a.

Chen, D., Lin, Y., Li, L., Li, X. R., Zhou, J., Sun, X., et al. Distance-wise graph contrastive learning. arXiv preprint arXiv:2012.07437, 2020b.

Chen, T., Bian, S., and Sun, Y. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv preprint arXiv:1905.04579, 2019.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597-1607. PMLR, 2020c.
Dobson, P. D. and Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771-783, 2003.

Dwivedi, V. P., Joshi, C. K., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.

Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pp. 1165-1173. PMLR, 2017.

Gould, S., Fernando, B., Cherian, A., Anderson, P., Cruz, R. S., and Guo, E. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864, 2016.

Guiasu, S. and Shenitzer, A. The principle of maximum entropy. The Mathematical Intelligencer, 7(1):42-48, 1985.

Hassani, K. and Khasahmadi, A. H. Contrastive multi-view representation learning on graphs. arXiv preprint arXiv:2006.05582, 2020.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020a.

Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857-1867, 2020b.

Huang, T., Pei, Y., Menkovski, V., and Pechenizkiy, M. Hop-count based self-supervised anomaly detection on attributed networks. arXiv preprint arXiv:2104.07917, 2021.

Hwang, D., Park, J., Kwon, S., Kim, K., Ha, J.-W., and Kim, H. J. Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. Advances in Neural Information Processing Systems, 33, 2020.

Jin, M., Zheng, Y., Li, Y.-F., Gong, C., Zhou, C., and Pan, S. Multi-scale contrastive siamese networks for self-supervised graph representation learning. arXiv preprint arXiv:2105.05682, 2021a.

Jin, W., Derr, T., Liu, H., Wang, Y., Wang, S., Liu, Z., and Tang, J. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141, 2020.

Jin, W., Liu, X., Zhao, X., Ma, Y., Shah, N., and Tang, J. Automated self-supervised learning for graphs, 2021b.

Jun, H., Child, R., Chen, M., Schulman, J., Ramesh, A., Radford, A., and Sutskever, I. Distribution augmentation for generative modeling. In International Conference on Machine Learning, pp. 5006-5019. PMLR, 2020.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.

Kong, K., Li, G., Ding, M., Wu, Z., Zhu, C., Ghanem, B., Taylor, G., and Goldstein, T. FLAG: Adversarial data augmentation for graph neural networks. arXiv preprint arXiv:2010.09891, 2020.

Lee, H., Hwang, S. J., and Shin, J. Self-supervised label augmentation via input transformations. In International Conference on Machine Learning, pp. 5714-5724. PMLR, 2020.
Li, M. M., Huang, K., and Zitnik, M. Representation learning for networks in biology and medicine: Advancements, challenges, and opportunities. arXiv preprint arXiv:2104.04883, 2021.

Liu, M., Gao, H., and Ji, S. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 338-348, 2020.

Luketina, J., Berglund, M., Greff, K., and Raiko, T. Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning, pp. 2952-2960. PMLR, 2016.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113-2122. PMLR, 2015.

Manessi, F. and Rozza, A. Graph-based neural network models with multiple self-supervised auxiliary tasks. arXiv preprint arXiv:2011.07267, 2020.

Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. TUDataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663, 2020.

Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005, 2017.

Nene, S. A., Nayar, S. K., Murase, H., et al. Columbia object image library (COIL-100). 1996.

Park, C., Kim, D., Han, J., and Yu, H. Unsupervised attributed multiplex network embedding. In AAAI, pp. 5371-5378, 2020.

Pedregosa, F. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pp. 737-746. PMLR, 2016.

Peng, Z., Dong, Y., Luo, M., Wu, X.-M., and Zheng, Q. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604, 2020a.

Peng, Z., Huang, W., Luo, M., Zheng, Q., Rong, Y., Xu, T., and Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp. 259-270, 2020b.

Qiu, J., Chen, Q., Dong, Y., Zhang, J., Yang, H., Ding, M., Wang, K., and Tang, J. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.

Ren, Y., Liu, B., Huang, C., Dai, P., Bo, L., and Zhang, J. Heterogeneous deep graph infomax. arXiv preprint arXiv:1911.08538, 2019.

Riesen, K. and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 287-297. Springer, 2008.

Robey, A., Hassani, H., and Pappas, G. J. Model-based robust deep learning. arXiv preprint arXiv:2005.10247, 2020.

Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., and Huang, J. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33, 2020.

Roy, K. K., Roy, A., Rahman, A., Amin, M. A., and Ali, A. A. Node embedding using mutual information and self-supervision based bi-level aggregation. arXiv preprint arXiv:2104.13014, 2021.

Rozemberczki, B., Kiss, O., and Sarkar, R. An API oriented open-source python framework for unsupervised learning on graphs. arXiv preprint arXiv:2003.04819, 2020.
Shaban, A., Cheng, C.-A., Hatch, N., and Boots, B. Truncated back-propagation for bilevel optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1723-1732. PMLR, 2019.

Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pp. 488-495. PMLR, 2009.

Shervashidze, N., Schweitzer, P., Van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(9), 2011.

Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2012.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning, 2010.

Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. InfoGraph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.

Veličković, P., Fedus, W., Hamilton, W. L., Liò, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. arXiv preprint arXiv:1809.10341, 2018.

Wang, C. and Liu, Z. Learning graph representation by aggregating subgraphs via mutual information maximization. arXiv preprint arXiv:2103.13125, 2021.

Wang, J., Zhang, T., Liu, S., Chen, P.-Y., Xu, J., Fardad, M., and Li, B. Towards a unified min-max framework for adversarial exploration and robustness. arXiv preprint arXiv:1906.03563, 2019.

Wu, L., Lin, H., Gao, Z., Tan, C., Li, S., et al. Self-supervised on graphs: Contrastive, generative, or predictive. arXiv preprint arXiv:2105.07342, 2021.

Xie, C., Tan, M., Gong, B., Wang, J., Yuille, A. L., and Le, Q. V. Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365-1374, 2015.

You, Y. and Shen, Y. Cross-modality protein embedding for compound-protein affinity and contact prediction. arXiv preprint arXiv:2012.00651, 2020.

You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33, 2020a.

You, Y., Chen, T., Wang, Z., and Shen, Y. When does self-supervision help graph convolutional networks? In International Conference on Machine Learning, pp. 10871-10880. PMLR, 2020b.

You, Y., Chen, T., Wang, Z., and Shen, Y. L2-GCN: Layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2127-2135, 2020c.

Zhang, H., Lin, S., Liu, W., Zhou, P., Tang, J., Liang, X., and Xing, E. P. Iterative graph self-distillation. arXiv preprint arXiv:2010.12609, 2020.

Zhao, T., Liu, Y., Neves, L., Woodford, O., Jiang, M., and Shah, N. Data augmentation for graph neural networks. arXiv preprint arXiv:2006.06830, 2020.

Zhu, Q., Du, B., and Yan, P. Self-supervised training of graph convolutional networks. arXiv preprint arXiv:2006.02380, 2020a.
Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020b.

Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Graph contrastive learning with adaptive augmentation. arXiv preprint arXiv:2010.14945, 2020c.