Label-invariant Augmentation for Semi-Supervised Graph Classification

Han Yue, Chunhui Zhang, Chuxu Zhang, Hongfu Liu
Michtom School of Computer Science, Brandeis University, Waltham, MA
{hanyue,chunhuizhang,chuxuzhang,hongfuliu}@brandeis.edu

Abstract

Recently, contrastiveness-based augmentation has reached a new peak in the computer vision domain, where operations such as rotation, crop, and flip, combined with dedicated algorithms, dramatically increase model generalization and robustness. Following this trend, some pioneering attempts apply a similar idea to graph data. Nevertheless, unlike images, it is much more difficult to design reasonable augmentations without changing the nature of graphs. Although exciting, current graph contrastive learning does not achieve performance as promising as visual contrastive learning. We conjecture that the current performance of graph contrastive learning might be limited by violations of the label-invariant augmentation assumption. In light of this, we propose a label-invariant augmentation for graph-structured data to address this challenge. Different from node/edge modification and subgraph extraction, we conduct the augmentation in the representation space and generate the augmented samples in the most difficult direction while keeping the label of the augmented data the same as that of the original samples. In the semi-supervised scenario, we demonstrate that our proposed method outperforms classical graph neural network based methods and recent graph contrastive learning methods on eight benchmark graph-structured datasets, followed by several in-depth experiments that further explore label-invariant augmentation from several aspects.

1 Introduction

Contrastive augmentation aims to expand training data in both volume and diversity in a self-supervised fashion to increase model robustness and generalization. Common sense and domain knowledge are employed to design contrastive augmentation operations. The denoising autoencoder [2, 3] is one of the pioneering studies to apply perturbations to generate contrastive samples for tabular data: it takes a corrupted input and recovers the original undistorted input. For visual data, operations such as rotation, crop, and flip, combined with dedicated algorithms, significantly improve the learning performance on diverse tasks [28, 8, 26, 19, 6]. Treating the augmented and original samples as positive pairs, and the augmented samples from different source samples as negative pairs, contrastive learning aims to learn augment-invariant representations by increasing the similarity of positive pairs and the dissimilarity of negative pairs [5]. The positive pairs increase model robustness under the assumption that the augmentation operations preserve the nature of the images, so that the augmented samples have labels consistent with those of the original ones. The negative pairs act as instance-level discrimination, which is expected to enhance model generalization, but they might deteriorate the downstream task since the negative pairs contain augmented samples from different source samples that nevertheless belong to the same category. The recent BYOL [16] and SimSiam [7] demonstrate this negative effect of negative pairs and conclude that the performance of contrastive learning can be further boosted even without them.
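To make this mechanism concrete, the sketch below shows a simplified, one-directional NT-Xent-style contrastive loss of the kind used in [5]; the function name, batch layout, and temperature value are our illustrative assumptions, not taken from this paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_orig, z_aug, temperature=0.5):
    """Simplified one-directional NT-Xent: (z_orig[i], z_aug[i]) is the
    positive pair; augmented views of the other samples act as negatives."""
    z1 = F.normalize(z_orig, dim=1)                 # (B, d) unit-norm embeddings
    z2 = F.normalize(z_aug, dim=1)
    sim = z1 @ z2.t() / temperature                 # (B, B) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(sim, targets)
```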
Following this trend, some pioneering attempts apply contrastive augmentation to graph data [11]. GraphCL [40] is the first work to address the graph contrastive learning problem with four types of augmentations: node dropping, edge perturbation, attribute masking, and subgraph extraction. Later, JOAO [39] extends GraphCL by automatically selecting one type of graph augmentation from the above four types plus non-augmentation. GRACE [42] treats the original graph data and the node-level augmented data as two views and learns the graph representation by maximizing the agreement between the two views. Similarly, MVGRL [18] conducts contrastive multi-view representation learning on both the node and graph levels. Beyond the above studies that augment graphs, SimGRACE [36] perturbs the model parameters for contrastive learning, which can be regarded as an ensemble over model perturbations or a robust regularization. Although exciting, the above studies point out that the effectiveness of graph contrastive learning heavily hinges on ad-hoc data augmentations, which need to be carefully designed or selected per dataset and require considerable domain knowledge.

Contributions. We conjecture that these hand-crafted graph augmentations might change the nature of the original graph and violate the label-invariant assumption in the downstream tasks. Different from treating graph contrastive learning from a pre-training perspective, we aim to incorporate the downstream classification task into the representation learning, where the label information is fully used for both decision boundary learning and graph augmentation. Specifically, we propose Graph Label-invariant Augmentation (GLA), which conducts augmentation in the representation space and augments the most difficult sample while keeping the label of the augmented sample the same as that of the original sample. Our major contributions are summarized as follows:

- We propose a label-invariant augmentation strategy for graph contrastive learning, which involves the labels of the downstream task to guide the contrastive augmentation. It is worth noting that we do not generate any graph data. Instead, we directly generate the label-consistent representations as the augmented graphs during the training phase.
- In the rich representation space, we aim to generate the most difficult sample for the model and increase the model generalization. We choose a lightweight technique that randomly generates a set of qualified candidates and selects the most difficult one, i.e., minimizing the maximum (worst-case) loss over the augmented candidates.
- We conduct a series of semi-supervised experiments on eight graph benchmark datasets in a fair setting and compare our label-invariant augmentation with classical graph neural network based methods and recent graph contrastive methods by running the codes provided by the original authors. Extensive results demonstrate that our label-invariant augmentation achieves better performance in general without generating real augmented graphs or requiring any specific domain knowledge. Besides algorithmic performance, we also provide rich, in-depth experiments that explore label-invariant augmentation from several aspects.

2 Related Work

Here we introduce the related work on graph neural networks and graph contrastive learning for graph classification. Node classification, although related, is not covered here due to its different setting.

Graph Neural Network.
Graph Neural Networks (GNNs) have been employed on various graph learning tasks and achieved promising performance [23]. To extract the representation of each node, GNNs pass node embeddings from its connected neighbor nodes and apply feedforward neural networks to transform the aggregated features. As a pioneering study in GNNs, the graph convolutional network (GCN) first aims to generalize the convolution mechanism from images to graphs [23, 35, 14]. Based on GCN, instead of simply summing or averaging the connected neighboring nodes' embeddings, graph attention networks [31, 34, 33, 41, 13] adopt a self-attention mechanism that scores each connected neighboring node's embedding to identify the more important nodes and enhance the effectiveness of message passing. Then, to break prior GNNs' limitations on message passing over long distances on large graphs, graph recurrent neural networks [17, 9] apply the gating mechanism from RNNs to propagation over graph topology. Simultaneously, to deal with the noise introduced by more than three layers of graph convolution, DeepGCN [25, 24] uses skip connections and enables GCNs to achieve better results with deeper layers. Recently, GAE [22] and Infomax [30] achieve state-of-the-art performance on several benchmark datasets. GAE extends the variational auto-encoder to graph neural networks for unsupervised learning, while Infomax learns unsupervised representations on graphs by enlarging the mutual information between local (node-level) and global (graph-level) representations within one graph.

Graph Contrastive Learning. Recently, many studies have been devoted to graph contrastive learning from diverse angles, including graph augmentation, negative sample selection, and view fusion. GraphCL [40] summarizes four types of graph augmentations to learn representations invariant across different augmented views. Built on GraphCL, JOAO [39] proposes a learnable module to automatically select augmentations for different datasets to alleviate the human labor in combining these augmentations. Differently, MVGRL [18] contrasts node and graph encodings across views, which enriches the negative samples for contrastive learning. Later, InfoGCL [37] diminishes the mutual information between contrastive parts among views while preserving the task-relevant representation. Beyond augmenting graphs, SimGRACE [36] disturbs the model weights and then learns invariant high-level representations at the output end to avoid the design of graph augmentations. Different from the above methods that separate the pre-training and fine-tuning phases, we aim to employ the label information of downstream tasks to guide the augmentation process. Specifically, in this study, we propose a label-invariant augmentation strategy for graph-level representation learning.

3 Methodology

A graph can be represented by $G = (V, X, A)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices, $X \in \mathbb{R}^{n \times d}$ denotes the features of the vertices, and $A \in \{0, 1\}^{n \times n}$ represents the adjacency matrix.
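To make this notation concrete, here is a minimal PyTorch sketch, under our own illustrative choices (a toy 3-vertex graph and a mean aggregator), of a graph $G = (V, X, A)$ together with one round of the neighbor aggregation described in Section 2; it is not the paper's specific encoder.

```python
import torch

# Toy graph G = (V, X, A) with n = 3 vertices and d = 2 features per vertex.
n, d = 3, 2
X = torch.randn(n, d)                 # X in R^{n x d}: vertex feature matrix
A = torch.tensor([[0, 1, 1],          # A in {0, 1}^{n x n}: adjacency matrix
                  [1, 0, 0],
                  [1, 0, 0]])

# One message-passing round: each vertex averages its neighbors' features
# before a feedforward transform would be applied.
deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees, avoiding /0
H = (A.float() @ X) / deg                       # mean-aggregated neighbor features
```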
Given a set of labeled graphs $\mathcal{S} = \{(G_1, y_1), (G_2, y_2), \ldots, (G_M, y_M)\}$, where $M$ is the number of labeled graphs and $y_i \in \mathcal{Y}$ is the categorical label of graph $G_i \in \mathcal{G}$ ($1 \leq i \leq M$), and another set of unlabeled graphs $\mathcal{T} = \{G_{M+1}, \ldots, G_N\}$, where $N$ is the total number of graphs and $M < N$.

Figure 3: Performance gain and label-invariant rates. (a) demonstrates the average performance gains on eight datasets with more labeled samples, produced by GraphCL, JOAOv2, and GLA. (b) shows the label-invariant rate distributions of different augmentation methods over eight datasets. (c) shows the label-invariant rates of our GLA over different semi-supervised settings.

In our proposed Graph Label-invariant Augmentation (GLA) method, we perform contrastive learning and graph classifier learning synchronously. The implementation details of GLA are as follows. We implement the networks based on GraphCL [40] in PyTorch, set the magnitude of perturbation η to 1.0 and the weight of the classification loss α to 1.0, and otherwise use the same settings as GraphCL. We adopt the Adam optimizer [20] to minimize the objective function in Eq. (9). Our code is available at https://github.com/brandeis-machine-learning/GLA.

Evaluation Protocol. We evaluate the models with 10-fold cross-validation. We randomly shuffle each dataset and then evenly split it into 10 parts. In each fold, one part serves as the test set and another part as the validation set used to select the best epoch, while the remaining parts are used for training. We select 30%, 50%, or 70% of the graphs in the training set as labeled graphs for each fold, and then conduct semi-supervised learning (see the split sketch below). For a fair comparison, we use the same training/validation/test splits for all compared methods on each dataset, and report the average accuracy across the 10 folds.

4.2 Algorithmic Performance

Table 2 shows the prediction results of two self-supervised and five graph contrastive learning methods under the semi-supervised graph classification setting with 30%, 50%, and 70% label ratios on eight benchmark datasets, where the best and second-best results are highlighted in red and blue, respectively, and the last column is the average rank score across all datasets. Although different algorithms achieve their best performance on different datasets, the contrastiveness-based methods generally perform better than the non-contrastiveness-based methods, which indicates the effectiveness of graph augmentation. Our proposed GLA achieves the best ranking scores under all of the 30%, 50%, and 70% label ratios, the second-best average performance under the 30% label ratio, and the best average performance under the 50% and 70% label ratios. In our algorithmic design, we employ the decision boundary learned from the labeled samples to verify the label-invariant augmentation. It is worth noting that the quality of the decision boundary depends on the number of labeled samples. We conjecture that a 30% label ratio is not sufficient to learn a high-quality decision boundary, which results in GLA performing slightly worse than JOAOv2 on average. With more labeled samples, our GLA delivers the best average performance over the other competitive methods.
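Returning to the Evaluation Protocol above, the following sketch shows one way to produce the described splits, assuming scikit-learn's KFold for fold generation; pairing each fold's test part with the next fold's part as the validation set is our assumption, since the text only states that one part serves each role.

```python
import numpy as np
from sklearn.model_selection import KFold

def semi_supervised_splits(num_graphs, label_ratio=0.5, seed=0):
    """Yield (labeled, unlabeled, valid, test) index sets following the
    10-fold protocol described above: one part for testing, one for
    validation, the rest for training, with only `label_ratio` of the
    training graphs keeping their labels."""
    rng = np.random.RandomState(seed)
    folds = list(KFold(n_splits=10, shuffle=True, random_state=seed)
                 .split(np.arange(num_graphs)))
    for k in range(10):
        test = folds[k][1]
        valid = folds[(k + 1) % 10][1]          # another part as validation
        train = np.setdiff1d(np.arange(num_graphs), np.union1d(test, valid))
        rng.shuffle(train)
        n_lab = int(label_ratio * len(train))   # 30%, 50%, or 70% labeled
        yield train[:n_lab], train[n_lab:], valid, test
```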
Different from other graph contrastive learning methods, our augmentation method aims to generate label-invariant augmentations, which decreases the possibility of producing bad augmentations and thus results in better performance. Besides the general comparison in Table 2, we dive into the details and discover several interesting findings. Figure 3(a) demonstrates the average performance gains on eight datasets with more labeled samples, produced by GraphCL, JOAOv2, and our GLA, the top three methods in our experiments. In addition to observing that the performance of all three methods increases with more labeled samples, our GLA obtains larger performance gains than GraphCL and JOAOv2. Through such comparisons, we can roughly eliminate the effect of more labeled samples and attribute the extra gains to the label-invariant augmentation. It also verifies our aforementioned conjecture that a high-quality decision boundary benefits the label-invariant augmentation, which further brings a performance boost.

Moreover, we further verify our motivation by checking the label-invariant property of different contrastive methods. Since we do not have a ground-truth classifier, we use fine-tuned classifiers in the representation spaces learned by these contrastive methods with a 100% label ratio as surrogates for the ground-truth classifier. We then use these classifiers to assess how many of the augmented representations belong to the same class as their corresponding original representations. Figure 3(b) presents the distributions of label-invariant rates across the eight benchmark datasets for all graph contrastive methods. As our GLA generates different augmentations when trained under different label ratios, we plot the results of GLA's label-invariant rates under the 30%, 50%, and 70% label ratios together. We can see that GLA has the highest label-invariant rates on average compared with the other methods. We also notice that the label-invariant rates of the different contrastive methods keep the same ranking as their performance in Table 2 (the last column), which verifies our motivation for designing a label-invariant augmentation strategy. Moreover, we further demonstrate GLA's label-invariant rates along different label ratios in Figure 3(c), which accords with our expectation that more labeled samples lead to a higher-quality decision boundary and further promote the label-invariant rate in GLA.

4.3 In-depth Exploration

We further explore GLA in terms of negative pairs, augmentation space, and augmentation strategy.

Negative Pairs. The existing graph contrastive learning methods treat augmented graphs from different source samples as negative pairs and employ instance-level discrimination on these negative pairs. Since these methods separate the pre-training and fine-tuning phases, the negative pairs contain augmented samples from different source samples that nevertheless share the same category in the downstream tasks. Here we explore the effect of negative pairs on our GLA. Figure 4(a) shows the performance of our GLA with and without negative pairs on four datasets. We can see that the performance with negative pairs drops significantly compared with our default setting without negative pairs, consistently across all four datasets. Different from the existing graph contrastive methods, our GLA integrates the pre-training and fine-tuning phases, where negative pairs designed in a self-supervised fashion are not beneficial to the downstream tasks. This finding also accords with recent studies [7, 16] in the visual contrastive learning area.
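For reference, the "without negative pairs" setting can be read as replacing an NT-Xent-style objective (as sketched in Section 1) with an alignment-only term over positive pairs, in the spirit of [7, 16]; the snippet below is our hedged simplification, not the paper's exact loss.

```python
import torch.nn.functional as F

def positive_only_loss(z_orig, z_aug):
    """Alignment-only objective when negative pairs are dropped: pull each
    augmented representation toward its own source representation only."""
    z1 = F.normalize(z_orig, dim=1)
    z2 = F.normalize(z_aug, dim=1)
    return -(z1 * z2).sum(dim=1).mean()   # maximize cosine similarity of positives
```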
Augmentation Space. Different from most graph contrastive learning methods, which directly augment raw graphs, our GLA conducts the augmentation in the representation space: raw graphs can be mapped into this space, and it is much easier to augment than the original graph space. In Eq. (5), we design our representation augmentation as a random unit vector scaled by the magnitude of the perturbation η. Figures 4(b,c,e,f) show the performance of our GLA with different values of η on four datasets, where we provide GraphCL and GraphCL+Label-Invariant as references. GraphCL+Label-Invariant takes the augmented graphs from GraphCL and filters out the augmented samples that violate the label-invariant property according to the downstream classifier. Comparing the two references, we can see that the label-invariant property benefits not only our GLA but also other contrastive methods in most cases. For our GLA, although the η values corresponding to the best performance vary across datasets, the default setting of η = 1 delivers satisfying performance in general, outperforming GraphCL+Label-Invariant and indicating the superiority of representation augmentation over raw graph augmentation.

Figure 4: In-depth exploration of GLA. (a) contrastive loss with/without negative pairs; (d) performance of different label-invariant augmentation strategies; (b,c,e,f) performance under different magnitudes of perturbation η on different datasets under a 50% label ratio.

Augmentation Strategy. In the representation space, there may exist multiple qualified candidates that obey the label-invariant property. Our GLA chooses the most difficult augmentation for the model. Here we compare the performance of different augmentation strategies among the qualified candidates, namely the most difficult augmentation, random augmentation, and the easiest augmentation, in Figure 4(d), where the random augmentation can be regarded as GraphCL+Label-Invariant. We can see that the most difficult augmentation increases model generalization and indeed brings significant improvements over the other two strategies. This also supports our representation augmentation: we can find the most difficult augmentation in the representation space, whereas it is difficult to directly generate raw graphs that are challenging to the downstream classifier.
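Putting the above together, here is a hedged sketch of the candidate-based, label-invariant augmentation in the representation space: sample random unit directions scaled by η, keep only the perturbations that the current classifier still assigns to the original label, and return the qualified candidate with the largest classification loss. The function name and candidate count are illustrative assumptions, and this is a sketch of the described procedure rather than the authors' exact Eq. (5).

```python
import torch
import torch.nn.functional as F

def gla_augment(z, y, classifier, eta=1.0, num_candidates=10):
    """For each graph representation z[i] with label y[i], sample random unit
    directions, keep perturbations that remain label-invariant under the
    current classifier, and return the one with the largest loss."""
    best = z.clone()                      # fall back to z if nothing qualifies
    best_loss = torch.full((z.size(0),), -float("inf"), device=z.device)
    for _ in range(num_candidates):
        u = F.normalize(torch.randn_like(z), dim=1)   # random unit direction
        z_aug = z + eta * u                           # perturbed representation
        logits = classifier(z_aug)
        keep = logits.argmax(dim=1) == y              # label-invariant check
        loss = F.cross_entropy(logits, y, reduction="none")
        better = keep & (loss > best_loss)
        best[better] = z_aug[better]
        best_loss[better] = loss[better]
    return best
```

Selecting the worst-case qualified candidate mirrors the min-max idea stated in the contributions: the model trains on the hardest augmentation that, according to the current classifier, still preserves the label.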
5 Conclusion

In this paper, we consider the graph contrastive learning problem. Different from the existing methods, which take a pre-training perspective, we propose a novel Graph Label-invariant Augmentation (GLA) algorithm that integrates the pre-training and fine-tuning phases and conducts label-invariant augmentation in the representation space via perturbations. Specifically, GLA first checks whether an augmented representation obeys the label-invariant property and then chooses the most difficult sample among the qualified candidates. By this means, GLA achieves contrastive augmentation without generating any raw graphs and also increases model generalization. Extensive experiments in the semi-supervised setting on eight benchmark graph datasets demonstrate the effectiveness of our GLA. Moreover, we provide additional experiments to verify our motivation and to explore GLA in depth regarding the effect of negative pairs, the augmentation space, and the augmentation strategy.

Limitations. The performance of our method relies on the quality of the decision boundary indicated by the downstream classifier. Therefore, our method requires graph label information from downstream tasks to help with model training.

Potential Negative Societal Impacts. The problem addressed in this paper is well-defined and the experiments are based on public datasets. As far as we can see, it does not involve societal issues.

References

[1] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
[2] M. Chen, K. Weinberger, F. Sha, and Y. Bengio. Marginalized denoising auto-encoders for nonlinear representations. In International Conference on Machine Learning, 2014.
[3] M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In International Conference on Machine Learning, 2012.
[4] T. Chen, S. Bian, and Y. Sun. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv preprint arXiv:1905.04579, 2019.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
[6] X. Chen, H. Fan, R. Girshick, and K. He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[7] X. Chen and K. He. Exploring simple siamese representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[8] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka. Debiased contrastive learning. Advances in Neural Information Processing Systems, 33:8765–8775, 2020.
[9] Z. Cui, K. Henrickson, R. Ke, and Y. Wang. Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems, 21(11):4883–4894, 2019.
[10] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.
[11] K. Ding, Z. Xu, H. Tong, and H. Liu. Data augmentation for deep graph learning: A survey. arXiv preprint arXiv:2202.08235, 2022.
[12] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
[13] Y. Fan, M. Ju, C. Zhang, and Y. Ye. Heterogeneous temporal graph neural network. In SIAM International Conference on Data Mining, 2022.
[14] H. Gao, Z. Wang, and S. Ji. Large-scale learnable graph convolutional networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[15] C. Gong, T. Ren, M. Ye, and Q. Liu. Maxup: Lightweight adversarial training with data augmentation improves neural network training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2474–2483, 2021.
[16] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
[17] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield, M. Zhou, and X. Qian. Variational graph recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[18] K. Hassani and A. H. Khasahmadi. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, 2020.
[19] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[22] T. N. Kipf and M. Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[23] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[24] G. Li, M. Müller, B. Ghanem, and V. Koltun. Training graph neural networks with 1000 layers. In International Conference on Machine Learning, 2021.
[25] G. Li, M. Müller, A. Thabet, and B. Ghanem. DeepGCNs: Can GCNs go as deep as CNNs? In IEEE International Conference on Computer Vision, 2019.
[26] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng. Contrastive clustering. In AAAI Conference on Artificial Intelligence, 2021.
[27] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond, 2020.
[28] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, 2020.
[29] B. Rozemberczki, O. Kiss, and R. Sarkar. Karate Club: An API oriented open-source Python framework for unsupervised learning on graphs. In ACM International Conference on Information and Knowledge Management, 2020.
[30] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019.
[31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[32] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3):347–375, 2008.
[33] X. Wang, X. He, Y. Cao, M. Liu, and T.-S. Chua. KGAT: Knowledge graph attention network for recommendation. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[34] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu. Heterogeneous graph attention network. In The World Wide Web Conference, 2019.
[35] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, 2019.
[36] J. Xia, L. Wu, J. Chen, B. Hu, and S. Z. Li. SimGRACE: A simple framework for graph contrastive learning without data augmentation. In ACM Web Conference, 2022.
[37] D. Xu, W. Cheng, D. Luo, H. Chen, and X. Zhang. InfoGCL: Information-aware graph contrastive learning. Advances in Neural Information Processing Systems, 34:30414–30425, 2021.
[38] P. Yanardag and S. Vishwanathan. Deep graph kernels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[39] Y. You, T. Chen, Y. Shen, and Z. Wang. Graph contrastive learning automated. In International Conference on Machine Learning, 2021.
[40] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33:5812–5823, 2020.
[41] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla. Heterogeneous graph neural network. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
[42] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang. Deep graph contrastive representation learning. In ICML Workshop on Graph Representation Learning and Beyond, 2020.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the code URL in Section 4.1.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4.1.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We reported standard deviations in Table 2.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] We did not include memory or time consumption comparisons.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] See Section 4.1.
(b) Did you mention the license of the assets? [No] All datasets are collected and made public by their creators. We could not find license information for these datasets. All datasets are cited in Section 4.1.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No] We used public datasets, and detailed information regarding the content can be found in the corresponding citations in Section 4.1.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] We used public datasets, and detailed information regarding the content can be found in the corresponding citations in Section 4.1.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]