# Directed Graph Contrastive Learning

Zekun Tong¹, Yuxuan Liang¹, Henghui Ding²,³, Yongxing Dai⁴, Xinke Li¹, Changhu Wang²
¹National University of Singapore  ²ByteDance  ³ETH Zürich  ⁴Peking University
{zekuntong,liangyuxuan,xinke.li}@u.nus.edu, henghui.ding@vision.ee.ethz.ch, yongxingdai@pku.edu.cn, changhu.wang@bytedance.com

## Abstract

Graph Contrastive Learning (GCL) has emerged to learn generalizable representations from contrastive views. However, it is still in its infancy, with two concerns: 1) changing the graph structure through data augmentation to generate contrastive views may mislead the message passing scheme, as such structural changes deprive the graph of intrinsic structural information, especially the directional structure in directed graphs; 2) since GCL usually uses predefined contrastive views with hand-picked parameters, it does not take full advantage of the contrastive information provided by data augmentation, resulting in incomplete structural information for model learning. In this paper, we design a directed graph data augmentation method called Laplacian perturbation and theoretically analyze how it provides contrastive information without changing the directed graph structure. Moreover, we present a directed graph contrastive learning framework, which dynamically learns from all possible contrastive views generated by Laplacian perturbation. We then train it using multi-task curriculum learning to progressively learn from multiple easy-to-difficult contrastive views. We empirically show that our model can retain more structural features of directed graphs than other GCL models because of its ability to provide complete contrastive information. Experiments on various benchmarks show that our method outperforms state-of-the-art approaches.

## 1 Introduction

There is a growing interest in learning from directed graphs using Graph Neural Networks (GNNs) [17, 12, 18, 50] to tackle practical problems, such as time-series problems [1, 4, 49], fine-grained traffic prediction [29, 21] and recommendation systems [35, 7, 40]. Current GNNs designed for learning from directed graphs [23, 43, 44] are trained end-to-end under supervision. This training scheme shows excellent performance by virtue of sufficient labeled data. Correspondingly, to make use of rich unlabeled data, several Graph Contrastive Learning (GCL) works [39, 63, 69, 34, 13, 67] have been proposed based on GNNs and Contrastive Learning (CL) [5, 45, 14]. They utilize data augmentation methods to generate contrastive views from the original graphs, then pull views generated from the same instance (node or graph) closer while pushing views from different instances apart using an InfoNCE-like [28] objective function. However, current GCL methods encounter issues in processing directed graphs, mainly in the data augmentation method and the contrastive learning framework.

Firstly, most data augmentation methods used in GCL [63, 69, 68] do not take the directed graph structure into account and may discard distinctive direction information. For example, the idea of dropping nodes/edges [68, 63, 52] is borrowed from random erasing used in images [66], which overlooks the discrepancy of nodes and edges in different graph structures. Moreover, by obtaining contrastive information through changing the graph structure, part of the distinctive direction information, e.g., irreversible time-series relationships, will inevitably be lost.
The message passing scheme will also be misled, making it difficult for GNN-based encoders to learn from directed graphs. There is a lack of data augmentation methods specifically designed for directed graphs that retain the original directed graph structure while providing enough contrastive information. Besides, common contrastive learning frameworks are not optimized for directed graphs; they can only learn from a limited number of contrastive views [63, 69, 13]. However, due to the complex structure of directed graphs, it is insufficient to use a small number of contrastive views to fully understand their structural characteristics [43]. Furthermore, since the contrastive views have to be determined before training, it becomes another problem to select the ideal views for the downstream task during pre-processing, about which we actually have no knowledge in advance [42, 59]. Hand-picking contrastive views also causes a decrease in generalization. Several works [41, 61] try to obtain more information by increasing the number of contrastive views. However, they still have to pre-define the views and trade off the number of views against the training difficulty.

To address these two issues, we first propose a directed graph data augmentation method called Laplacian perturbation. Since the contrastive views are passed to the encoder in the form of a Laplacian matrix, a desirable data augmentation method is to perturb the Laplacian matrix without changing the directed graph structure. To achieve this, we adopt the approximate directed graph Laplacian [43], in which a teleport probability is introduced to control the degree of approximation. Thus, by adding a perturbation term to this probability, we can simply augment the Laplacian matrix without altering the directed graph structure. As it is time-consuming to calculate the Laplacian matrix, we then speed it up using the power method [24]. In addition, from the theoretical analysis, we find that the Laplacian perturbation is essentially a perturbation of the directed graph entropy. The perturbation term determines the magnitude of the entropy perturbation error, which in turn affects the amount of contrastive information.

To learn the complete structural characteristics of directed graphs, we design a directed graph contrastive learning framework that learns from contrastive views generated by Laplacian perturbation. First, we introduce a generalized dynamic-view contrastive objective which maximizes the sum of the mutual information between the representations of all contrastive views. This objective allows the contrastive views to change dynamically during training and motivates the encoder to learn a generalized representation from all views provided by data augmentation. However, this dynamic-view objective function is hard to optimize. Thus, we leverage a multi-task curriculum learning strategy [33, 36, 10, 26] to divide the multiple contrastive views into sub-tasks of various difficulties and progressively learn from easy-to-difficult sub-tasks. In this way, we can learn all views step-by-step to reach the final objective. Moreover, learning from all contrastive views eliminates the need to hand-tune the data augmentation parameters. We empirically show that our Directed Graph Contrastive Learning (DiGCL) outperforms competitive baselines in both unsupervised and supervised settings.
We also carry out a systematic analysis of the performance of various augmentations on mainstream benchmarks and of the impact of different pacing functions on directed graph contrastive learning.

## 2 Directed Graph Data Augmentation

In this section, we first design a directed graph data augmentation scheme named Laplacian perturbation. As it is time-consuming to calculate the Laplacian matrix, we then speed it up using the power method [24]. Finally, we analyze the proposed data augmentation scheme theoretically.

### 2.1 Directed Graph Laplacian and its Approximation

Formally, let $G=(V,E)$ be a directed graph; its adjacency matrix can be denoted as $A \in \{0,1\}^{n\times n}$, where $n=|V|$. The nodes are described by the feature matrix $X \in \mathbb{R}^{n\times c}$, with $c$ features per node. Intuitively, data augmentation can be performed at the topological and feature level, operating on $A$ and $X$, respectively [63, 69]. The distinctive attribute of directed graphs is the directionality of the edges, which leads us to focus on topological, node-level data augmentation. The augmented directed graph is fed to the GNN-based encoder in the form of a Laplacian matrix to learn the representation. Since $G$ may contain isolated nodes or could form a bipartite graph, it is not appropriate to trivially use the directed graph Laplacian [6]. To relax this constraint, we use its approximate form, the approximate directed graph Laplacian matrix [43], which is defined as
$$L_{appr}(G,\alpha) = I - \frac{1}{2}\Big(\Pi_{appr}^{\frac{1}{2}}\,\tilde{P}\,\Pi_{appr}^{-\frac{1}{2}} + \Pi_{appr}^{-\frac{1}{2}}\,\tilde{P}^{\top}\,\Pi_{appr}^{\frac{1}{2}}\Big), \qquad (1)$$
where $\Pi_{appr} = \frac{1}{\lVert \pi_{appr} \rVert_1}\,\mathrm{Diag}(\pi_{appr})$. The approximate eigenvector $\pi_{appr}$ is defined as
$$(1-\alpha)\,\pi_{appr}\tilde{P} + \frac{1}{n}\frac{\alpha}{1+\alpha}\,\mathbf{1}_{n}^{\top} = \pi_{appr}, \qquad (2)$$
where $\tilde{P} = \tilde{D}^{-1}\tilde{A}$ with $\tilde{A} = A + I_{n\times n}$ denotes the transition matrix with added self-loops, and the diagonal degree matrix is $\tilde{D}(u,u) = \sum_{v\in V}\tilde{A}(u,v)$. Tong et al. [43] add self-loops to the original directed graph to ensure that $G$ is aperiodic, and redefine the transition matrix based on personalized PageRank with the teleport probability $\alpha \in (0,1)$ to guarantee that the redefined matrix is irreducible. Note that we follow [43] and set $\alpha = 0.1$ in this paper.

According to Eq. (1), existing data augmentation methods [63, 69], e.g., node/edge dropping and subgraph sampling, need to obtain contrastive information by changing $\tilde{P}$. Their use of sampling-based methods inevitably corrupts the directed graph structure and thus misleads the message passing in the GNN-based encoder. A desirable data augmentation method on $L_{appr}$ is to perturb $\pi_{appr}$ reasonably without altering the directed graph structure $\tilde{P}$.¹

### 2.2 Directed Graph Data Augmentation with Laplacian Perturbation

From Eq. (2), it is easy to see that the eigenvector $\pi_{appr}$ depends on $\tilde{P}$ and $\alpha$. In other words, we can shift the teleport probability $\alpha$ without changing the directed graph structure $\tilde{P}$, thereby altering the eigenvector $\pi_{appr}$ and finally perturbing the Laplacian matrix $L_{appr}$.

DEFINITION 1 (Laplacian perturbation). Given a directed graph $G=(V,E)$ and the teleport probability $\alpha$, we define the Laplacian perturbation operation $\Phi(\cdot)$ on $L_{appr}(G,\alpha)$ as
$$\Phi_{\Delta\alpha}(G,\alpha) = L_{appr}(G,\alpha+\Delta\alpha), \qquad (3)$$
where $\Delta\alpha$ is a perturbation term that satisfies $\Delta\alpha \geq 0$ and $\alpha+\Delta\alpha \in (0,1)$.

Through this operation, the directed graph structure and the sparsity of the Laplacian matrix are maintained, which means there is no extra training burden on the subsequent GNN-based encoder. However, using this operation for data augmentation is very time-consuming, since the time complexity of computing Eq. (2) is $O(n^2)$.
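To make Definition 1 concrete, the following minimal sketch (our illustration under stated assumptions, not the authors' released code) builds $L_{appr}(G,\alpha)$ from Eqs. (1)–(2) using a naive dense solve of Eq. (2), and then applies the perturbation of Eq. (3); all function names are ours, and the fast computation of $\pi_{appr}$ is deferred to the next subsection.

```python
import numpy as np

def approx_eigenvector(A: np.ndarray, alpha: float):
    """Solve Eq. (2) directly: pi (I - (1-alpha) P_tilde) = alpha / ((1+alpha) n) * 1^T."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                                   # add self-loops
    P_tilde = A_tilde / A_tilde.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    b = np.full(n, alpha / ((1.0 + alpha) * n))
    pi = np.linalg.solve((np.eye(n) - (1.0 - alpha) * P_tilde).T, b)
    return pi, P_tilde

def laplacian_appr(A: np.ndarray, alpha: float) -> np.ndarray:
    """Approximate directed graph Laplacian of Eq. (1)."""
    pi, P_tilde = approx_eigenvector(A, alpha)
    pi = pi / pi.sum()                                        # Pi_appr = Diag(pi) / ||pi||_1
    s, s_inv = np.diag(np.sqrt(pi)), np.diag(1.0 / np.sqrt(pi))
    return np.eye(A.shape[0]) - 0.5 * (s @ P_tilde @ s_inv + s_inv @ P_tilde.T @ s)

def laplacian_perturbation(A: np.ndarray, alpha: float, delta_alpha: float) -> np.ndarray:
    """Eq. (3): perturb the teleport probability, not the graph structure."""
    assert delta_alpha >= 0 and 0 < alpha + delta_alpha < 1
    return laplacian_appr(A, alpha + delta_alpha)

if __name__ == "__main__":
    A = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]], dtype=float)                 # a 4-node directed cycle
    L_view1 = laplacian_perturbation(A, alpha=0.1, delta_alpha=0.0)
    L_view2 = laplacian_perturbation(A, alpha=0.1, delta_alpha=0.3)
    print(np.round(L_view1 - L_view2, 4))                     # same sparsity pattern, shifted entries
```

As the toy example suggests, the two views differ only in the values of the Laplacian entries; the set of non-zero positions, and hence the directed structure, is untouched.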
To deal with this, we design an accelerated algorithm for computing the approximate eigenvector based on the power method [24]. We take advantage of the sparsity of the transition matrix $\tilde{P}$ to compute the approximate eigenvector $\pi_{appr}$ and rewrite Eq. (2) in the form of a discrete-time Markov chain at time $t$ as
$$\pi_{appr}^{t+1} = (1-\alpha)\,\pi_{appr}^{t}\tilde{P} + \frac{1}{n}\frac{\alpha}{1+\alpha}\,\mathbf{1}_{n}^{\top}. \qquad (4)$$
As stated in Tong et al. [43], $\pi_{appr}^{t}$ has the special property that $\pi_{appr}^{t}\mathbf{1}_{n\times 1} = \frac{1}{1+\alpha}$. We then multiply the last term of Eq. (4) by $(1+\alpha)\,\pi_{appr}^{t}\mathbf{1}_{n\times 1}$, which equals 1, to obtain
$$\pi_{appr}^{t+1} = (1-\alpha)\,\pi_{appr}^{t}\tilde{P} + \frac{1}{n}\frac{\alpha}{1+\alpha}\big[(1+\alpha)\,\pi_{appr}^{t}\mathbf{1}_{n\times 1}\big]\mathbf{1}_{n}^{\top} = (1-\alpha)\,\pi_{appr}^{t}\tilde{P} + \frac{\alpha}{n}\,\pi_{appr}^{t}\mathbf{1}_{n\times n}. \qquad (5)$$
We can solve Eq. (5) iteratively using the power method [24]. Therefore, the complexity decreases to $O(nk)$, where $k$ is the number of iterations. Due to the page limit, we do not discuss the number of iterations and the convergence rate here; please refer to [22] for more details. We take $k=100$ and a tolerance of 1e-6 throughout. For large-scale graphs, our method significantly improves the speed and makes it possible to perform Laplacian perturbation during training, which is the basis for the framework proposed in Section 3. In addition, we compare the running time of our fast Laplacian perturbation with different data augmentation methods in Section 5.2.

¹ Note that adding self-loops to transform $P$ into $\tilde{P}$ is a common trick [17, 56], and we ignore the effect of this operation on the directed graph structure.

Figure 1: Illustration of Laplacian perturbation. $\xi$ is the auxiliary node defined in [43] and the red dotted lines represent the added self-loops.

### 2.3 Justification of Laplacian Perturbation

Noticing the inherent connections between the graph Laplacian and von Neumann entropy [62], we can use the properties of von Neumann entropy to analyze the Laplacian perturbation operation quantitatively. We start by defining the von Neumann entropy of directed graphs.

DEFINITION 2 (Directed graph von Neumann entropy). Given a directed graph $G=(V,E)$, its von Neumann entropy [62] based on $L_{appr}$ is defined as
$$H_{VN}(G,\alpha) = 1 - \frac{1}{n} - \frac{1}{2n^{2}}\Bigg[\sum_{(u,v)\in\tilde{E}}\Big(\frac{1}{d_{u}^{out}d_{v}^{out}} + \frac{\pi_{appr}(u)}{\pi_{appr}(v)\,d_{u}^{out^{2}}}\Big) + \sum_{(u,v)\in E\setminus\tilde{E}}\frac{1}{d_{u}^{out}d_{v}^{out}}\Bigg], \qquad (6)$$
where $\tilde{E} = \{(u,v)\mid (u,v)\in E \text{ and } (v,u)\notin E\}$, and $d_{u}^{in} = \sum_{v\in V}A(v,u)$ and $d_{u}^{out} = \sum_{v\in V}A(u,v)$ are the in- and out-degree of node $u$.

Since the Laplacian perturbation does not involve the nodes/edges and only changes the teleport probability connecting to the auxiliary node $\xi$, as shown in Figure 1, the number of nodes $n$ and the node degrees $d_u$ (the edge distribution) do not change. Conversely, the approximate eigenvector $\pi_{appr}$ is changed. According to Eq. (6), the essence of this operation is a perturbation of the directed graph entropy. We further introduce the perturbation error to quantify this impact.

DEFINITION 3 (Perturbation error). Given the perturbation term $\Delta\alpha$, we define the perturbation error of the directed graph von Neumann entropy caused by Laplacian perturbation as $\Delta H_{VN}(\alpha,\alpha+\Delta\alpha) = H_{VN}(G,\alpha) - H_{VN}(G,\alpha+\Delta\alpha)$.

THEOREM 1 (Monotonicity of the perturbation error). The perturbation error $\Delta H_{VN}$ increases monotonically with the Laplacian perturbation term $\Delta\alpha$.

THEOREM 2 (Bounds on the perturbation error). Given a directed graph $G=(V,E)$ and the teleport probability $\alpha$, the inequality
$$0 < \Delta H_{VN}(\alpha,\alpha+\Delta\alpha) < \frac{1}{2n^{2}}\sum_{(u,v)\in\tilde{E}}\Big(\frac{1}{d_{u}^{out^{2}}} - \frac{\pi_{appr}^{\alpha}(u)}{\pi_{appr}^{\alpha}(v)\,d_{u}^{out^{2}}}\Big) \qquad (7)$$
holds.

When the perturbation term $\Delta\alpha = 0$, $\Delta H_{VN} = 0$, and when $\Delta\alpha \to 1-\alpha$, the perturbation error $\Delta H_{VN}$ approaches the upper bound. Theorem 1 shows that Laplacian perturbation can provide contrastive information of various magnitudes for the encoder to learn the representations by changing $\Delta\alpha$. Meanwhile, since it does not need to alter the structure and the number of nodes $n$ is generally large, the perturbation error stays within a very small range, as shown in Theorem 2. This helps the encoder focus more on the directed graph structure rather than just learning from the errors. Proofs of Theorems 1 and 2 are attached in the Supplementary Material.
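For reference, the accelerated computation of $\pi_{appr}$ described in Section 2.2 can be sketched as follows. This is a minimal illustration of the power iteration in Eq. (5), assuming a sparse row-stochastic $\tilde{P}$ and using the $k = 100$ / 1e-6 settings stated above; the function name is ours and this is not the released implementation.

```python
import numpy as np
import scipy.sparse as sp

def approx_eigenvector_power(A, alpha: float, k: int = 100, tol: float = 1e-6):
    """Power iteration for Eq. (5), exploiting the sparsity of P_tilde (O(nk) overall)."""
    n = A.shape[0]
    A_tilde = A + sp.eye(n, format="csr")                       # add self-loops
    P_tilde = sp.diags(1.0 / np.asarray(A_tilde.sum(axis=1)).ravel()) @ A_tilde
    Pt_T = P_tilde.T.tocsr()                                    # Pt_T @ pi == pi @ P_tilde
    pi = np.full(n, 1.0 / ((1.0 + alpha) * n))                  # start with sum(pi) = 1/(1+alpha)
    for _ in range(k):
        # Eq. (5): pi <- (1-alpha) * pi @ P_tilde + (alpha/n) * pi @ 1_{n x n}
        pi_next = (1.0 - alpha) * (Pt_T @ pi) + (alpha / n) * pi.sum()
        if np.abs(pi_next - pi).sum() < tol:                    # L1 tolerance, 1e-6 as in the text
            return pi_next
        pi = pi_next
    return pi

if __name__ == "__main__":
    A = sp.random(1000, 1000, density=0.005, format="csr", random_state=0)
    A.data[:] = 1.0                                             # toy unweighted random digraph
    pi = approx_eigenvector_power(A, alpha=0.1)
    print(pi.sum())                                             # stays ~ 1/(1+alpha) = 0.909...
```

Each iteration costs one sparse matrix–vector product, and the sum of $\pi$ is preserved by construction, so the special property used to derive Eq. (5) holds throughout the iteration.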
## 3 Directed Graph Contrastive Learning

In this section, we introduce a new directed graph contrastive learning framework (DiGCL) with a more generalized objective, and then present a multi-task curriculum learning scheme that helps DiGCL progressively learn from multiple easy-to-difficult contrastive views. The model is illustrated in Figure 2 and the pseudo-code is given in the Supplementary Material.

Figure 2: Illustration of our DiGCL model using Laplacian perturbation. For a directed graph, we first generate M different pairs of contrastive views by Laplacian perturbation. The contrastive view pairs are then scored by a scoring function and mapped to different training paces by a pacing function. Finally, the arranged contrastive view pairs are input into a shared encoder to progressively learn the unsupervised graph representation with a contrastive loss.

### 3.1 Learning with Dynamic-view Contrastive Objective

Having defined directed graph data augmentation in Eq. (3), we follow the common graph contrastive learning (GCL) paradigm using our Laplacian perturbation.

Fixed-view Objective. GCL seeks to maximize the mutual information (MI) between the representations of augmented graphs under different views [63, 69]. Based on this, we design a framework for directed graph contrastive learning, which contains three steps. (1) First, given a directed graph $G=(V,E)$ and the teleport probability $\alpha$, we generate two correlated views of $G$, $U_{\Delta\alpha_1} = \Phi_{\Delta\alpha_1}(G,\alpha)$ and $V_{\Delta\alpha_2} = \Phi_{\Delta\alpha_2}(G,\alpha)$, as a contrastive pair. This is done by executing the Laplacian perturbation $\Phi$ on the input $G$ with the parameters $\Delta\alpha_1$ and $\Delta\alpha_2$ separately. (2) Second, we take a GNN-based encoder $f(\cdot)$ to extract the representations $H_{\Delta\alpha_1} = f(U_{\Delta\alpha_1})$ and $H_{\Delta\alpha_2} = f(V_{\Delta\alpha_2})$ of the two contrastive views. (3) Finally, the encoder is trained with the contrastive objective
$$\max_{f}\; I(H_{\Delta\alpha_1};H_{\Delta\alpha_2}) = \max_{f}\; I\big(f(\Phi_{\Delta\alpha_1}(G,\alpha));\, f(\Phi_{\Delta\alpha_2}(G,\alpha))\big), \qquad (8)$$
where $I(\cdot)$ is the mutual information. This framework lets the encoder $f(\cdot)$ learn the representations of the two different views $U_{\Delta\alpha_1}$ and $V_{\Delta\alpha_2}$ while preserving as much MI as possible.

As shown in Eq. (8), the view generation is controlled by the two perturbation terms $\Delta\alpha_1$ and $\Delta\alpha_2$, which may affect the objective function. Current GCL methods manually select these parameters in advance, e.g., by grid search, and fix the contrastive views during training. This parameter selection strategy could make the encoder obtained by Eq. (8) learn only the contrastive information in some particular views and lead to reduced generalization of the model [59].
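As an illustration of steps (1)–(3), the sketch below builds a contrastive pair via the Laplacian perturbation of Eq. (3) and encodes both views with a single shared encoder. The two-layer encoder propagating with $I - L_{appr}$ is our simplified stand-in for the GNN-based encoder, and the dense Laplacian helper is a toy re-implementation of Eqs. (1)–(2) included for self-containment; in practice the MI objective of Eq. (8) is maximized through the InfoNCE bound of Eq. (12) below.

```python
import torch
import torch.nn as nn

def laplacian_appr(A: torch.Tensor, alpha: float) -> torch.Tensor:
    """Dense toy version of Eqs. (1)-(2); see Section 2 for the fast sparse variant."""
    n = A.size(0)
    A_t = A + torch.eye(n)
    P_t = A_t / A_t.sum(dim=1, keepdim=True)
    b = torch.full((n,), alpha / ((1.0 + alpha) * n))
    pi = torch.linalg.solve((torch.eye(n) - (1.0 - alpha) * P_t).T, b)
    pi = pi / pi.sum()
    s, s_inv = torch.diag(pi.sqrt()), torch.diag(1.0 / pi.sqrt())
    return torch.eye(n) - 0.5 * (s @ P_t @ s_inv + s_inv @ P_t.T @ s)

class SharedGCNEncoder(nn.Module):
    """Two-layer GCN-style encoder f(.) shared by both views; propagates with I - L_appr."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.lin1, self.lin2 = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, out_dim)

    def forward(self, L: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        prop = torch.eye(L.size(0)) - L
        return prop @ self.lin2(torch.relu(prop @ self.lin1(X)))

if __name__ == "__main__":
    A = torch.tensor([[0., 1., 0., 0.], [0., 0., 1., 0.],
                      [0., 0., 0., 1.], [1., 0., 0., 0.]])      # 4-node directed cycle
    X = torch.randn(4, 8)
    alpha = 0.1
    U = laplacian_appr(A, alpha + 0.0)                          # step (1): view Phi_{0.0}(G, alpha)
    V = laplacian_appr(A, alpha + 0.3)                          # step (1): view Phi_{0.3}(G, alpha)
    f = SharedGCNEncoder(in_dim=8, hid_dim=16, out_dim=8)
    H1, H2 = f(U, X), f(V, X)                                   # step (2): shared encoder
    # step (3): maximize I(H1; H2), in practice via the InfoNCE bound of Eq. (12).
    print(H1.shape, H2.shape)
```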
Dynamic-view Objective. To deal with this, we propose a more generalized objective that learns from dynamically changing contrastive views. We rewrite Eq. (8) as
$$\max_{f}\; I(H_{\Delta\alpha_1};H_{\Delta\alpha_2}) = \max_{f}\; \mathbb{E}_{\Delta\alpha_1,\Delta\alpha_2}\, I\big(f(\Phi_{\Delta\alpha_1}(G,\alpha));\, f(\Phi_{\Delta\alpha_2}(G,\alpha))\big), \qquad (9)$$
which requires the obtained optimal encoder $f$ to work well for all $\Delta\alpha_1,\Delta\alpha_2 \in [0,1-\alpha)$ so as to learn a balanced and generalized representation. Note that InfoMin [42] designs a similar strategy that iteratively maximizes the MI in the worst views. In contrast, our objective maximizes the average MI over all views, because finding the worst views is related to the downstream task, about which we have zero prior knowledge. Besides, unlike other contrastive learning methods that also use multiple contrastive views [41, 13, 61], there is no fixed number of views in our approach, and we do not preset the parameters to fix the contrastive views but let them change dynamically during training.

However, maximizing Eq. (9) is not easy. Varying the perturbation term changes the MI between the views, which may result in large variance during the training process and a failure to converge [42]. To increase stability, we empirically leave one of the views unperturbed in this paper, i.e., $\Delta\alpha_1 = 0$. We mark this view as $U$ and design a multi-task training scheme to optimize the proposed objective.²

² The objective with one view fixed is not exactly the same as the original one, and we ignore the impact of this trick.

### 3.2 Training using Multi-task Curriculum Learning

To tackle this problem, we propose a training scheme using curriculum learning, which utilizes prior knowledge about the difficulty of the learning tasks to learn from easy-to-difficult contrastive views.

Multi-task Curriculum Learning. Curriculum learning facilitates the optimization of such a complex objective by scheduling the sub-objectives in a certain order [2, 33]. Inspired by this, we regard the process of training DiGCL as a multi-task problem [33, 36, 10, 11, 26, 20] and decompose it into $M$ sub-tasks $\eta_1,\dots,\eta_M$. Thus, the objective of Eq. (9) can be rewritten as
$$\max_{f}\; \sum_{m=1}^{M} \mathbb{E}_{\Delta\alpha_m}\, I\big(f(L_{appr}(G,\alpha));\, f(\Phi_{\Delta\alpha_m}(G,\alpha))\big), \qquad (10)$$
where $\Delta\alpha_m \in \Theta_m$, $\{\Theta_1,\dots,\Theta_M\}$ is a uniform partition of $\Theta = [0,1-\alpha)$, and the sub-task $\eta_m$ is associated with $\Delta\alpha_m$. We then consider sub-tasks that are easier to learn in the optimization process as easy sub-tasks and vice versa as difficult sub-tasks. We quantify this difficulty into a difficulty score by means of a scoring function $D(\cdot)$. Based on it, we learn the sub-tasks sequentially in an order decided by the pacing function $P(\cdot)$ to solve the main multi-task problem. Via the information transmitted by a shared encoder, the previously solved easy sub-tasks assist in solving the next difficult sub-tasks, while the latter fine-tune the encoder learned by the former.

Scoring Function. The scoring function $D(\cdot)$ measures how difficult the sub-task $\eta_m$ is, and can be any function $D(\cdot): \{\eta_1,\dots,\eta_M\} \to \{d_1,\dots,d_M\}$, where $d$ denotes the difficulty score. In each sub-task, there exists a pair of contrastive views as defined in Eq. (10). We empirically assume that the difficulty $d_m$ of the sub-task $\eta_m$ depends on how difficult it is for the encoder $f(\cdot)$ to learn the contrastive information from this sub-task. Referring to Definition 3, Laplacian perturbation provides contrastive information to the encoder, and the perturbation term $\Delta\alpha$ controls how much of this information there is.
Therefore, we can define the scoring function $D(\cdot)$ to calculate the difficulty $d_m$ of a sub-task $\eta_m$ as
$$d_m = D(\eta_m) = \frac{\alpha}{\alpha + \Delta\alpha_m}. \qquad (11)$$
The higher this score, the smaller the perturbation error, the less contrastive information is provided, and the less the encoder is able to distinguish the difference between the two views, resulting in a more difficult sub-task. This idea coincides with the design of the discriminator in GANs [65, 9].

Pacing Function. The pacing function decides the learning sequence of the sub-tasks and can be any function $P(\cdot): \{1,\dots,L\} \to \{d_1,\dots,d_M\}$, where $L$ denotes the total number of learning iterations. We consider three function families [58]: logarithmic, exponential, and linear. The table in Figure 3 (left) lists the pacing functions used in our experiments, which are parameterized by $(d_a, d_b)$. Here $d_a$ is the initial difficulty and $d_b$ is the difficulty at the end of training; thus any pacing function with $d_a = d_b$ is equivalent to fixed-view contrastive learning. We mark their corresponding perturbation terms as $(\Delta\alpha_a, \Delta\alpha_b)$. Considering that the score in Eq. (11) is defined on a continuous domain, we set the step size of the pacing function to 1, the finest-grained unit, i.e., $L = M$. This helps the model switch between different training views in a more delicate way.

| Name   | Expression $P_{(d_a,d_b)}(l)$ |
|--------|-------------------------------|
| log    | $d_a + (d_b - d_a)\big(1 + \tfrac{1}{3}\log(\tfrac{l}{L} + e^{-3})\big)$ |
| exp    | $d_a + \tfrac{d_b - d_a}{e^{3} - 1}\big(\exp(\tfrac{3l}{L}) - 1\big)$ |
| linear | $d_a + (d_b - d_a)\tfrac{l}{L}$ |

Figure 3: (left) definitions of the three families of pacing functions used throughout; (right) the plot of pacing function curves from each family. $l$ is the training epoch.

Since it is desirable to find the optimal solution over the entire domain of $\Delta\alpha \in [0,1-\alpha)$, we set the start point $\Delta\alpha_a = 0.9 - \alpha$ (0.9 is used because the boundary value cannot be obtained) and the end point $\Delta\alpha_b = 0$ in this paper, i.e., $d_a = 1/9$ and $d_b = 1$. More analysis of the pacing functions can be found in the Supplementary Material.

Contrastive loss. Having discussed how to handle the expectations in the objective, our attention shifts to maximizing the MI. Several works have investigated lower bounds of MI in contrastive learning; here we adopt InfoNCE [28]. For the $i$-th node $u_i$ in view $U$, the node-wise objective is defined as
$$\ell(u_i, v_i) = -\log \frac{\exp\big(S(z_i^{U}, z_i^{V_{\Delta\alpha_m}})/\tau\big)}{\sum_{j=1}^{n} \exp\big(S(z_i^{U}, z_j^{V_{\Delta\alpha_m}})/\tau\big)}, \qquad (12)$$
where $v_i$ is its corresponding positive node in view $V_{\Delta\alpha_m}$ and the other nodes in $V_{\Delta\alpha_m}$ are negative samples for $u_i$. $z_i^{U}$ is the projection of the node representation, $z_i^{U} \in Z_U = g(H_U)$. A non-linear transformation $g(\cdot)$, named the projection head, maps the augmented representations $H$ to another, lower dimension where the contrastive loss is calculated. $S(\cdot)$ denotes the cosine similarity function and $\tau$ denotes the temperature parameter. The final loss is computed across all positive node pairs in the views of one pace. Note that GRACE [68] proposes a similar InfoNCE-like loss function. Different from it, we do not treat intra-view nodes (nodes in $U$ except $u_i$) as negative samples, since the GNN-based encoder is already able to learn the intra-graph structure well and we want to focus on inter-graph contrastive information.
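To make this training recipe concrete, here is a small Python sketch (ours, illustrative only) of the pacing functions from Figure 3 and the node-wise InfoNCE loss of Eq. (12) with inter-view negatives only; the projection head $g(\cdot)$ is assumed to be a small MLP, and mapping the scheduled difficulty back to a perturbation term would go through the scoring function of Eq. (11).

```python
import math
import torch
import torch.nn.functional as F

def pacing(name: str, l: int, L: int, d_a: float, d_b: float) -> float:
    """Pacing functions of Figure 3 (left): map training step l in [0, L] to a difficulty score."""
    if name == "log":
        return d_a + (d_b - d_a) * (1.0 + math.log(l / L + math.exp(-3)) / 3.0)
    if name == "exp":
        return d_a + (d_b - d_a) * (math.exp(3.0 * l / L) - 1.0) / (math.exp(3) - 1.0)
    if name == "linear":
        return d_a + (d_b - d_a) * l / L
    raise ValueError(f"unknown pacing function: {name}")

def digcl_infonce_loss(h_u: torch.Tensor, h_v: torch.Tensor,
                       proj_head: torch.nn.Module, tau: float = 0.5) -> torch.Tensor:
    """Node-wise InfoNCE loss of Eq. (12); only nodes of the other view act as negatives."""
    z_u = F.normalize(proj_head(h_u), dim=1)                 # z^U = g(H_U), unit-normalized rows
    z_v = F.normalize(proj_head(h_v), dim=1)                 # z^{V_{delta_alpha_m}}
    sim = z_u @ z_v.t() / tau                                # cosine similarities S(., .) / tau
    targets = torch.arange(sim.size(0), device=sim.device)  # positive of node i is node i in V
    return F.cross_entropy(sim, targets)                     # mean of -log softmax over view V

# usage sketch with an assumed two-layer MLP projection head g(.):
# g = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ELU(), torch.nn.Linear(32, 16))
# d = pacing("log", l=epoch, L=num_epochs, d_a=1/9, d_b=1.0)   # difficulty for this pace
# loss = digcl_infonce_loss(H_U, H_V, g); loss.backward()
```

Using cross-entropy over the similarity matrix is exactly the average of Eq. (12) over all positive node pairs of one pace; leaving out intra-view similarities implements the design choice described above.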
## 4 Related Work

Directed Graph Neural Networks. Several works have tried to learn the complex directed graph structure by defining motifs [25], using the magnetic Laplacian [64], exploiting inheritance relationships [16], redefining the propagation scheme from a Markov-process view [43, 23], or using second-order proximity [44]. However, these methods are supervised and cannot learn structural features without labels. Our model is unsupervised, which is the most significant difference from them.

Directed Graph Data Augmentation. Good data augmentations are a prerequisite for contrastive learning [63, 48, 54]. Most current data augmentation methods, such as dropping nodes or edges [63, 68, 69, 51] and attribute masking [15, 63, 55], are migrated from methods for visual representation learning [57, 66] with graph-structural improvements. Other methods learn graph structure features by combining with traditional graph algorithms in a self-supervised manner, for example subgraph sampling [63, 34, 52], graph diffusion based on PageRank/heat kernels [13], and multi-hop neighbor prediction [31]. However, most of these methods are not optimized for directed graphs and thus cannot efficiently obtain structure information that can be used for self-supervised learning. The performance of existing methods on directed graphs can be seen in Table 2.

Dynamic-view GCL and Multi-task Curriculum Learning. Previous works [41, 13, 61] have shown the power of using multiple contrastive views simultaneously in contrastive learning. However, their methods require the contrastive views to be set in advance, which means the views are fixed during training [13, 41]. Unlike them, we do not preset the parameters to fix the contrastive views but let them change dynamically during training. However, our dynamic-view contrastive objective is hard to optimize, so we adopt the idea of multi-task curriculum learning [33, 36, 10, 11, 26, 20] to progressively learn from multiple easy-to-difficult contrastive views. Besides, using curriculum learning [8, 53] in graph learning tasks is not a new idea, but we are the first to use curriculum learning as a solution to the self-supervised graph contrastive learning problem.

## 5 Experiments

We conduct extensive experiments to evaluate the effectiveness of our model. Our implementation can be obtained at https://github.com/flyingtango/DiGCL. The Supplementary Material reports further details on the experiments and reproducibility.

### 5.1 Experimental Settings

Experimental Task. As explained in Section 2.1, our approach is more applicable to node-level contrastive learning; thus, we adopt the task of node classification in directed graphs [43] to verify the learning ability of models. Compared with the common experiments for undirected graphs [17, 56, 18], the challenge is that the given adjacency matrix A is asymmetric, which means message passing has a direction. We also use datasets of undirected graphs for controlled trials. Task definition, model structures and configurations are included in the Supplementary Material.

Datasets and Splitting. We use several widely-used datasets, including the directed graph datasets CORA-ML [3], CITESEER [37] and AM-PHOTO [38], and the undirected graph datasets PUBMED [27] and DBLP [30]. The split of the datasets can have a significant effect on model results [38]. Thus, we divide the datasets randomly and conduct multiple tests to achieve consistent outcomes. For the train/validation/test split, following the rules in GCN [17], we choose 20 labels per class for the training set, 500 labels for the validation set, and the rest for the test set.

Baselines.
We compare our model with 11 state-of-the-art models divided into four main categories: 1) supervised models for undirected graphs: GCN [17], GAT [46] and APPNP [18]; 2) supervised models for directed graphs: MagNet [64] and DiGCN [43]; 3) self-supervised models without augmentations: DGI [47] and GMI [32]; 4) contrastive learning models with augmentations: MVGRL [13], GraphCL [63], GRACE [68], and GCA [69].

### 5.2 Experimental Results

Overall accuracy. The performance comparisons between our model and the baselines on five datasets are reported in Table 1. We use a two-layer GCN as our encoder and first train our model in an unsupervised manner to obtain the embeddings. Then we take a simple $\ell_2$-regularized logistic regression as the classifier [47]. For the curriculum learning scheme, we select the log pacing function, and the start and end difficulties are set as in Section 3.2. We train all models according to their default settings, then report the mean test accuracy with standard deviation in percent (%), averaged over 20 random dataset splits with random weight initialization. Our method achieves state-of-the-art results on all datasets. In general, the unsupervised methods MVGRL, GRACE, and GCA do not perform well on directed graphs compared to their good performance on undirected graphs. This is mainly because their data augmentation methods are not applicable to digraphs, resulting in an inability to learn contrastive information from complex directed structures. Notice that GraphCL performs relatively mediocrely among the unsupervised methods, most notably because it focuses mainly on graph-level tasks and does not transfer well to the node-level contrastive learning task. Since DGI and GMI do not require data augmentation to provide contrastive information, they perform very well on both undirected and directed graphs, which shows good suitability for different graph structures. Moreover, supervised methods such as GCN, GAT and APPNP are inferior in performance to DiGCN and MagNet, which are specifically designed for directed graphs. Since our Laplacian perturbation uses the approximate Laplacian matrix proposed by DiGCN, we compare their performance: our model outperforms DiGCN on all datasets, which shows that contrastive learning can learn good encoders by performing suitable data augmentation in an unsupervised manner.

Table 1: Overall accuracy (%) with STD. "No Curr" means curriculum learning is not used and two fixed views are set with $\Delta\alpha_a = \Delta\alpha_b = 1 - \alpha$. "Random" means random order, "AntiCurr" means using anti-curriculum order, and "Curr" indicates using curriculum order. The best results are highlighted in bold and the second best are underlined. OOM means out of memory on a 12 GB GPU.
| Method | CORA-ML | CITESEER | AM-PHOTO | PUBMED | DBLP |
|---|---|---|---|---|---|
| GCN [17] | 70.92 ± 0.39 | 63.00 ± 0.45 | 88.52 ± 0.47 | 78.78 ± 0.30 | 73.54 ± 0.77 |
| GAT [46] | 72.22 ± 0.57 | 63.73 ± 0.57 | 88.36 ± 1.25 | 77.49 ± 0.47 | 76.08 ± 0.54 |
| APPNP [18] | 70.31 ± 0.67 | 61.63 ± 0.63 | 87.43 ± 0.98 | 79.35 ± 0.48 | 77.92 ± 0.75 |
| MagNet [64] | 76.32 ± 0.10 | 65.04 ± 0.47 | 86.80 ± 0.65 | 74.23 ± 0.46 | 69.73 ± 0.98 |
| DiGCN [43] | 77.03 ± 0.70 | 64.60 ± 0.60 | 88.66 ± 0.51 | 76.79 ± 0.49 | 73.37 ± 0.72 |
| *Unsupervised:* | | | | | |
| DGI [47] | 75.21 ± 1.29 | 64.58 ± 1.78 | 85.25 ± 0.59 | 74.11 ± 0.62 | 76.53 ± 1.24 |
| GMI [32] | 76.59 ± 0.35 | 63.29 ± 0.70 | 81.12 ± 0.01 | 80.27 ± 0.16 | 76.66 ± 0.48 |
| MVGRL [13] | 76.67 ± 0.12 | 62.22 ± 0.02 | 86.15 ± 0.21 | 79.98 ± 0.04 | OOM |
| GraphCL [63] | 67.34 ± 0.12 | 57.84 ± 0.11 | 67.66 ± 0.05 | 75.29 ± 0.08 | 77.85 ± 0.22 |
| GRACE [68] | 73.88 ± 0.25 | 61.20 ± 0.20 | 87.95 ± 0.32 | 79.54 ± 0.05 | 78.03 ± 0.09 |
| GCA [69] | 76.32 ± 0.33 | 63.25 ± 0.10 | 87.35 ± 0.27 | 79.81 ± 0.61 | 77.83 ± 0.35 |
| Ours + No Curr | 75.86 ± 0.09 | 66.99 ± 0.54 | 87.32 ± 0.14 | 79.57 ± 0.12 | 78.28 ± 0.05 |
| Ours + Random | 76.52 ± 1.66 | 67.15 ± 0.82 | 89.03 ± 0.46 | **80.75 ± 0.10** | 79.58 ± 0.14 |
| Ours + AntiCurr | 76.12 ± 1.04 | 66.83 ± 1.13 | 88.83 ± 0.73 | 80.22 ± 0.37 | 79.42 ± 0.15 |
| Ours + Curr | **77.53 ± 0.14** | **67.42 ± 0.14** | **89.41 ± 0.11** | 80.69 ± 0.08 | **79.70 ± 0.13** |

CORA-ML, CITESEER and AM-PHOTO are directed datasets; PUBMED and DBLP are undirected.

Ablation study on curriculum learning. We validate the effectiveness of the curriculum learning strategy; the results are shown in Table 1. We find that even with the contrastive views containing the maximum information (the largest perturbation term), the model without curriculum learning performs significantly worse than the model with curriculum learning. This means that features learned from fixed views are always incomplete. Besides, we also find that the order of curriculum learning is vital: the accuracy of learning from easy-to-difficult > random order > difficult-to-easy. This substantially validates that the scoring function does help in learning from multiple tasks.

Impact of different pacing functions. We study the three pacing functions defined in Figure 3 (left); the results in Figure 4(a) show that using different pacing functions affects performance, but not as much as changing the order of curriculum learning. Among them, the log pacing function performs best, as it speeds up learning on easy tasks and stays on harder tasks for more epochs, thus helping the model to grasp the more subtle differences between contrastive views.

Effect of Laplacian perturbation and other data augmentation methods. Table 2 shows the effect of different data augmentation methods on five contrastive learning models. We find that our Laplacian perturbation works best on both directed and undirected graphs, although the difference from the other methods is not significant on the undirected graph dataset PUBMED. In addition, we compare with the random edge removal used in GRACE under our DiGCL framework and find that the data augmentation approach that does not damage the graph structure achieves better performance.

Table 2: Overall accuracy (%) with STD for various data augmentation methods. Our model uses curriculum learning with the log pacing function. The best results are highlighted in bold.
| Method | Data augmentation method | CITESEER | AM-PHOTO | PUBMED |
|---|---|---|---|---|
| GraphCL [63] | Edge perturbation | 57.84 ± 0.11 | 67.66 ± 0.05 | 75.29 ± 0.09 |
| | Node dropping | 57.45 ± 0.12 | 66.69 ± 0.07 | 75.25 ± 0.08 |
| | Subgraph sampling | 57.59 ± 0.10 | 66.75 ± 0.07 | 74.75 ± 0.11 |
| GRACE [68] | Random removing edges | 61.20 ± 0.20 | 87.95 ± 0.32 | 79.54 ± 0.05 |
| MVGRL [13] | Graph diffusion with heat kernel | 61.22 ± 0.07 | 79.63 ± 0.31 | 78.54 ± 0.33 |
| | Graph diffusion with PageRank kernel | 62.22 ± 0.02 | 86.15 ± 0.21 | 79.98 ± 0.04 |
| GCA [69] | Removing edges by degree score | 63.25 ± 0.10 | 87.35 ± 0.27 | 79.81 ± 0.61 |
| | Removing edges by PageRank score | 62.21 ± 0.16 | 86.88 ± 0.37 | 78.92 ± 0.37 |
| | Removing edges by eigenvalue score | 63.12 ± 0.08 | 87.02 ± 0.56 | 79.70 ± 0.09 |
| Ours | Random removing edges | 64.97 ± 0.08 | 88.45 ± 0.01 | 79.66 ± 0.13 |
| | Laplacian perturbation | **67.42 ± 0.14** | **89.41 ± 0.11** | **80.69 ± 0.08** |

Generalization of Laplacian perturbation. Here, we conduct generalizability experiments that migrate Laplacian perturbation to MVGRL [13] and GCA [69]. We replace the topological data augmentation method in the original model with our Laplacian perturbation and follow all the configurations of the original model, including the parameters of the data augmentation, the number of contrastive views, the model structure, and the training parameters. We test the generalization performance of our Laplacian perturbation on the node classification task. The results are shown in Figure 4(b). We can clearly see that the performance of each model on directed graphs improves after using Laplacian perturbation, which indicates that our method can help existing contrastive learning models to learn structure information from complex networks.

Monotonicity of the perturbation error in real-world datasets. To visualize the variation of the perturbation error with the perturbation term, we empirically show the variation of the directed graph entropy on three real-world directed graph datasets: CORA-ML, CITESEER and AM-PHOTO. We set $\alpha = 0.1$ and take $\Delta\alpha \in \{0, 0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.85, 0.89\}$. The results are shown in Figure 4(c): as the perturbation term increases, the degree of perturbation is elevated and the perturbation error increases.

Figure 4: (a) node classification results with various pacing functions; (b) node classification results for the generalizability of Laplacian perturbation on two datasets, where Lp means with Laplacian perturbation; (c) perturbation error with various perturbation terms $\Delta\alpha$.

Figure 5: (a) running time of different data augmentation methods on random graphs of different sizes; (b) running time per epoch of different models on graphs of varying size, where OOM means out of memory; (c) generalization experiment on the graph classification task based on MVGRL.

Augmentation time at different graph scales. We construct 20 random directed graphs of different scales, with the number of edges ranging from 1K to 2M, and report the mean running time in milliseconds (ms) over 5 runs for four different data augmentation methods. Figure 5(a) summarizes the results and shows that our Laplacian perturbation is very competitive in terms of runtime among these methods. In contrast to graph diffusion, which also requires the computation of eigenvectors, we reduce the time by a factor of one hundred with the fast algorithm proposed in Section 2.2. This makes it possible to dynamically perform Laplacian perturbation to generate contrastive views during training.

Running time with different graph sizes. To generate graphs of arbitrary size, we construct a simple random graph with N nodes and assign 10N directed edges uniformly at random.
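The paper does not specify the generator beyond this description, so the following is a hypothetical sketch of such a construction (duplicate edges and self-loops are simply dropped for simplicity):

```python
import numpy as np
import scipy.sparse as sp

def random_digraph(num_nodes: int, seed: int = 0) -> sp.csr_matrix:
    """Random directed graph with N nodes and ~10N directed edges sampled uniformly."""
    rng = np.random.default_rng(seed)
    num_edges = 10 * num_nodes
    src = rng.integers(0, num_nodes, size=num_edges)
    dst = rng.integers(0, num_nodes, size=num_edges)
    keep = src != dst                                   # drop self-loops
    A = sp.csr_matrix((np.ones(keep.sum()), (src[keep], dst[keep])),
                      shape=(num_nodes, num_nodes))
    A.data[:] = 1.0                                     # collapse duplicate edges to weight 1
    return A

A = random_digraph(20_000)                              # roughly 200K directed edges
print(A.nnz)
```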
We compare with GRACE [68] and GCA [69], since these two models have similar accuracy to ours. We record the running time of the models for each epoch at different graph sizes; the results are shown in Figure 5(b). We find that processing 20,000 nodes and 200,000 edges with 12 GB of memory is the limit for these three models. Since our model requires a Laplacian perturbation operation, it takes more time than the data augmentation performed by GCA and GRACE.

Generalization to the graph classification task. To validate the generalization ability of our DiGCL model, we migrate it to MVGRL [13] and test on three graph classification datasets: MUTAG [19], PTC [19], and IMDB-BIN [60]. The generalized MVGRL model is denoted by MVGRL+DiGCL and the results are shown in Figure 5(c). Using our model improves the accuracy on all datasets; however, the improvement is not significant. We consider three main potential reasons for this: 1) the datasets are undirected graphs, and the effect of using Laplacian perturbation on them is not as obvious as on directed graphs; 2) the graph classification problem requires more graph-level contrastive information, while our model focuses on node-level information; 3) our scoring function designed for node-level curriculum learning is not suitable as a graph-level difficulty measure.

## 6 Conclusion, Limitation and Future Work

In this paper, we design a directed graph data augmentation scheme called Laplacian perturbation and theoretically investigate how it provides contrastive information without changing the directed graph structure. Moreover, we present DiGCL, which utilizes Laplacian perturbation and curriculum learning to progressively learn from dynamic, easy-to-difficult contrastive views. Finally, we use several tasks on various datasets to demonstrate the effectiveness and generalization ability of our proposed DiGCL. We empirically show that DiGCL can retain more structural features of directed graphs than other GCL models while providing adequate contrastive information. Extensive experiments show that our DiGCL outperforms state-of-the-art approaches.

As elaborated in Sections 2.1 and 5.2, our approach is more suitable for obtaining node-level contrastive information. This limits the use of the model for some graph-level problems, such as pre-training on protein molecules or drug property prediction. We will extend our approach to graph-level tasks in subsequent work. Meanwhile, we manually design the pacing function to pre-plan the training speed, using, for example, linear or exponential functions. However, in many cases the pacing during training is not perfectly planned by an ideal curve; we need to dynamically adjust the pacing function according to the training situation, for example using self-paced learning [20].

## Acknowledgments and Disclosure of Funding

The authors would like to thank Changsheng Sun, Xu Cai and Jinyang Wang for the great help in discussion and proofreading. This research is supported by NUS Research Scholarship.

## References

[1] M. Barigozzi and C. Brownlees, Nets: Network estimation for time series, Journal of Applied Econometrics, vol. 34, no. 3, pp. 347-364, 2019.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, Curriculum learning, in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 41-48.
[3] A. Bojchevski and S. Günnemann, Deep gaussian embedding of attributed graphs: Unsupervised inductive learning via ranking, arXiv preprint arXiv:1707.03815, 2017.
[4] Y. Cai, L. Ge, J. Liu, J. Cai, T.-J. Cham, J. Yuan, and N. M. Thalmann, Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2272-2281.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, A simple framework for contrastive learning of visual representations, arXiv preprint arXiv:2002.05709, 2020.
[6] F. Chung, Laplacians and the cheeger inequality for directed graphs, Annals of Combinatorics, vol. 9, no. 1, pp. 1-19, 2005.
[7] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, Convolutional 2d knowledge graph embeddings, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] C. Gong, J. Yang, and D. Tao, Multi-modal curriculum learning over graphs, ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 4, pp. 1-25, 2019.
[9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial networks, arXiv preprint arXiv:1406.2661, 2014.
[10] H. Guo, R. Pasunuru, and M. Bansal, Autosem: Automatic task selection and mixing in multi-task learning, arXiv preprint arXiv:1904.04153, 2019.
[11] M. Guo, A. Haque, D.-A. Huang, S. Yeung, and L. Fei-Fei, Dynamic task prioritization for multitask learning, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 270-287.
[12] W. Hamilton, Z. Ying, and J. Leskovec, Inductive representation learning on large graphs, in NIPS, 2017, pp. 1024-1034.
[13] K. Hassani and A. H. Khasahmadi, Contrastive multi-view representation learning on graphs, arXiv preprint arXiv:2006.05582, 2020.
[14] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, Momentum contrast for unsupervised visual representation learning, arXiv preprint arXiv:1911.05722, 2019.
[15] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec, Strategies for pre-training graph neural networks, arXiv preprint arXiv:1905.12265, 2019.
[16] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing, Rethinking knowledge graph propagation for zero-shot learning, in CVPR, 2019, pp. 11487-11496.
[17] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, 2016.
[18] J. Klicpera, A. Bojchevski, and S. Günnemann, Predict then propagate: Graph neural networks meet personalized pagerank, arXiv preprint arXiv:1810.05997, 2018.
[19] N. Kriege and P. Mutzel, Subgraph matching kernels for attributed graphs, arXiv preprint arXiv:1206.6483, 2012.
[20] C. Li, J. Yan, F. Wei, W. Dong, Q. Liu, and H. Zha, Self-paced multi-task learning, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[21] Y. Liang, S. Ke, J. Zhang, X. Yi, and Y. Zheng, Geoman: Multi-level attention networks for geo-sensory time series prediction, in IJCAI, 2018, pp. 3428-3434.
[22] C. Liu and Y. Li, A parallel pagerank algorithm with power iteration acceleration, International Journal of Grid and Distributed Computing, vol. 8, no. 2, pp. 273-284, 2015.
[23] Y. Ma, J. Hao, Y. Yang, H. Li, J. Jin, and G. Chen, Spectral-based graph convolutional network for directed graphs, arXiv preprint arXiv:1907.08990, 2019.
[24] C. B. Moler, Experiments with MATLAB. Society for Industrial and Applied Mathematics, 2011.
[25] F. Monti, K. Otness, and M. M. Bronstein, Motifnet: a motif-based graph convolutional network for directed graphs, in IEEE DSW. IEEE, 2018, pp. 225-228.
[26] K. Murugesan and J. Carbonell, Self-paced multitask learning with shared knowledge, arXiv preprint arXiv:1703.00977, 2017.
[27] G. Namata, B. London, L. Getoor, B. Huang, and U. EDU, Query-driven active surveying for collective classification, in 10th International Workshop on Mining and Learning with Graphs, vol. 8, 2012.
[28] A. v. d. Oord, Y. Li, and O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748, 2018.
[29] K. Ouyang, Y. Liang, Y. Liu, Z. Tong, S. Ruan, Y. Zheng, and D. S. Rosenblum, Fine-grained urban flow inference, arXiv preprint arXiv:2002.02318, 2020.
[30] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, Tri-party deep network representation, Network, vol. 11, no. 9, p. 12, 2016.
[31] Z. Peng, Y. Dong, M. Luo, X.-M. Wu, and Q. Zheng, Self-supervised graph representation learning via global context prediction, arXiv preprint arXiv:2003.01604, 2020.
[32] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, Graph representation learning via graphical mutual information maximization, in Proceedings of The Web Conference 2020, 2020, pp. 259-270.
[33] A. Pentina, V. Sharmanska, and C. H. Lampert, Curriculum learning of multiple tasks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5492-5500.
[34] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, Gcc: Graph contrastive coding for graph neural network pre-training, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1150-1160.
[35] R. Qiu, H. Yin, Z. Huang, and T. Chen, Gag: Global attributed graph neural network for streaming session-based recommendation, in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 669-678.
[36] N. Sarafianos, T. Giannakopoulos, C. Nikou, and I. A. Kakadiaris, Curriculum learning of visual attribute clusters for multi-task classification, Pattern Recognition, vol. 80, pp. 94-108, 2018.
[37] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, Collective classification in network data, AI Magazine, vol. 29, no. 3, pp. 93-93, 2008.
[38] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, Pitfalls of graph neural network evaluation, arXiv preprint arXiv:1811.05868, 2018.
[39] Q. Sun, J. Li, H. Peng, J. Wu, Y. Ning, P. S. Yu, and L. He, Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism, in Proceedings of the Web Conference 2021, 2021, pp. 2081-2091.
[40] Q. Sun, H. Peng, J. Li, S. Wang, X. Dong, L. Zhao, S. Y. Philip, and L. He, Pairwise learning for name disambiguation in large-scale heterogeneous academic networks, in 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 2020, pp. 511-520.
[41] Y. Tian, D. Krishnan, and P. Isola, Contrastive multiview coding, arXiv preprint arXiv:1906.05849, 2019.
[42] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, What makes for good views for contrastive learning, arXiv preprint arXiv:2005.10243, 2020.
[43] Z. Tong, Y. Liang, C. Sun, X. Li, D. Rosenblum, and A. Lim, Digraph inception convolutional networks, Advances in Neural Information Processing Systems, vol. 33, 2020.
[44] Z. Tong, Y. Liang, C. Sun, D. S. Rosenblum, and A. Lim, Directed graph convolutional network, arXiv preprint arXiv:2004.13970, 2020.
[45] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, On mutual information maximization for representation learning, in International Conference on Learning Representations, 2019.
[46] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, Graph attention networks, arXiv preprint arXiv:1710.10903, 2017.
[47] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, Deep graph infomax, arXiv preprint arXiv:1809.10341, 2018.
[48] V. Verma, M. Qu, A. Lamb, Y. Bengio, J. Kannala, and J. Tang, Graphmix: Regularized training of graph neural networks for semi-supervised learning, arXiv preprint arXiv:1909.11715, 2019.
[49] L. Wang, C. Gao, C. Huang, R. Liu, W. Ma, and S. Vosoughi, Embedding heterogeneous networks into hyperbolic space without meta-path, in Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
[50] L. Wang, C. Huang, W. Ma, X. Cao, and S. Vosoughi, Graph embedding via diffusion-wavelets-based node feature distribution characterization, 2021.
[51] Y. Wang, S. Liu, M. Yoon, H. Lamba, W. Wang, C. Faloutsos, and B. Hooi, Provably robust node classification via low-pass message passing, in 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 2020, pp. 621-630.
[52] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, Graphcrop: Subgraph cropping for graph classification, arXiv preprint arXiv:2009.10564, 2020.
[53] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, Curgraph: Curriculum learning for graph classification, in Proceedings of the Web Conference 2021, 2021, pp. 1238-1248.
[54] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi, Mixup for node and graph classification, in Proceedings of the Web Conference 2021, 2021, pp. 3663-3674.
[55] Y. Wang, W. Wang, Y. Liang, Y. Cai, J. Liu, and B. Hooi, Nodeaug: Semi-supervised node classification with data augmentation, in KDD, 2020, pp. 207-217.
[56] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger, Simplifying graph convolutional networks, arXiv preprint arXiv:1902.07153, 2019.
[57] M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman, On mutual information in contrastive learning for visual representations, arXiv preprint arXiv:2005.13149, 2020.
[58] X. Wu, E. Dyer, and B. Neyshabur, When do curricula work? arXiv preprint arXiv:2012.03107, 2020.
[59] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, What should not be contrastive in contrastive learning, arXiv preprint arXiv:2008.05659, 2020.
[60] P. Yanardag and S. Vishwanathan, Deep graph kernels, in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1365-1374.
[61] C. Yang, Z. An, and Y. Xu, Multi-view contrastive learning for online knowledge distillation, in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3750-3754.
[62] C. Ye, R. C. Wilson, C. H. Comin, L. d. F. Costa, and E. R. Hancock, Approximate von neumann entropy for directed graphs, Physical Review E, vol. 89, no. 5, p. 052804, 2014.
[63] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, Graph contrastive learning with augmentations, Advances in Neural Information Processing Systems, vol. 33, 2020.
[64] X. Zhang, N. Brugnone, M. Perlmutter, and M. Hirn, Magnet: A magnetic neural network for directed graphs, arXiv preprint arXiv:2102.11391, 2021.
[65] J. Zhao, M. Mathieu, and Y. LeCun, Energy-based generative adversarial network, arXiv preprint arXiv:1609.03126, 2016.
[66] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, Random erasing data augmentation, in AAAI, 2020, pp. 13001-13008.
[67] Y. Zhu, Y. Xu, Q. Liu, and S. Wu, An empirical study of graph contrastive learning, 2021.
[68] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, Deep graph contrastive representation learning, arXiv preprint arXiv:2006.04131, 2020.
[69] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, Graph contrastive learning with adaptive augmentation, arXiv preprint arXiv:2010.14945, 2020.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] The main content of this paper is theoretical.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Sections 2 and 3.
   (b) Did you include complete proofs of all theoretical results? [Yes] See the Supplementary Material.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Our implementation can be obtained at https://github.com/flyingtango/DiGCL.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the Supplementary Material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 5.2.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the Supplementary Material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5.2.
   (b) Did you mention the license of the assets? [Yes] See the Supplementary Material.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]