# Deep Bidirectional Language-Knowledge Graph Pretraining

Michihiro Yasunaga,1 Antoine Bosselut,2 Hongyu Ren,1 Xikun Zhang,1 Christopher D. Manning,1 Percy Liang,1 Jure Leskovec1
1Stanford University  2EPFL  (Equal senior authorship)
{myasu,antoineb,hyren,xikunz2,manning,pliang,jure}@cs.stanford.edu
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Pretraining a language model (LM) on text has been shown to help various downstream NLP tasks. Recent works show that a knowledge graph (KG) can complement text data, offering structured background knowledge that provides a useful scaffold for reasoning. However, these works are not pretrained to learn a deep fusion of the two modalities at scale, limiting the potential to acquire fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised method to pretrain a deeply joint language-knowledge foundation model from text and KG at scale. Specifically, our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities. We pretrain this model by unifying two self-supervised reasoning tasks, masked language modeling and KG link prediction. DRAGON outperforms existing LM and LM+KG models on diverse downstream tasks including question answering across general and biomedical domains, with a +5% absolute gain on average. In particular, DRAGON achieves strong performance on complex reasoning about language and knowledge (+10% on questions involving long contexts or multi-step reasoning) and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art results on various BioNLP tasks. Our code and trained models are available at https://github.com/michiyasunaga/dragon.

1 Introduction

Pretraining learns self-supervised representations from massive raw data to help various downstream tasks [1]. Language models (LMs) pretrained on large amounts of text data, such as BERT [2] and GPTs [3], have shown strong performance on many natural language processing (NLP) tasks. The success of these models comes from deeply interactive (contextualized) representations of input tokens learned at scale via self-supervision [2, 4]. Meanwhile, large knowledge graphs (KGs), such as Freebase [5], Wikidata [6] and ConceptNet [7], can provide complementary information to text data. KGs offer structured background knowledge by representing entities as nodes and relations between them as edges, and also offer scaffolds for structured, multi-step reasoning about entities [8, 9, 10, 11] (§3.4.1). The dual strengths of text data and KGs motivate research in pretraining deeply interactive representations of the two modalities at scale.

How to effectively combine text and KGs for pretraining is an open problem and presents challenges. Given text and a KG, we need both (i) a deeply bidirectional model for the two modalities to interact, and (ii) a self-supervised objective to learn joint reasoning over text and KG at scale. Several existing works [12, 13, 14, 15, 16] propose methods for self-supervised pretraining, but they fuse text and KG in a shallow or unidirectional manner. Another line of work [8, 9] proposes bidirectional models for text and KG, but these models focus on finetuning on labeled downstream tasks and do not perform self-supervised pretraining.
RoBERTa         61.7  70.9  68.6  67.6  71.0  71.1  73.1  74.5
QAGNN           65.1  74.5  74.2  72.1  71.6  75.6  71.3  78.6
GreaseLM        65.1  74.9  76.6  75.6  73.8  74.7  73.6  79.4
DRAGON (Ours)   75.2  79.6  77.5  79.1  78.2  77.8  80.9  83.5

Table 2: Accuracy of DRAGON on CSQA + OBQA dev sets for questions involving complex reasoning, such as negation terms, conjunction terms, hedge terms, prepositional phrases, and more entity mentions. DRAGON consistently outperforms the existing LM (RoBERTa) and the KG-augmented QA models (QAGNN, GreaseLM) in these complex reasoning settings.

These KG-augmented QA models fuse the LM with a KG for downstream tasks, but do not pretrain with a KG. GreaseLM is the existing top-performing model in this paradigm. As we use the same encoder architecture as GreaseLM for DRAGON, the only difference from GreaseLM is that DRAGON performs self-supervised pretraining while GreaseLM does not.

3.4 Results

Table 1 shows performance on the 9 downstream commonsense reasoning tasks. Across all tasks, DRAGON consistently outperforms the existing LM (RoBERTa) and KG-augmented QA models (QAGNN, GreaseLM), e.g., a +7% absolute accuracy boost over RoBERTa and +5% over GreaseLM on OBQA. These accuracy boosts indicate the advantage of DRAGON over RoBERTa (KG reasoning) and over GreaseLM (pretraining). The gain is especially significant on datasets with small training data, such as ARC, Riddle and OBQA, and on datasets that require complex reasoning, such as CosmosQA and HellaSwag, which we analyze in more detail in the following sections.

3.4.1 Analysis: Effect of knowledge graph

The first key contribution of DRAGON (w.r.t. existing LM pretraining methods) is that we incorporate KGs. We find that this significantly improves the model's performance on robust and complex reasoning, such as resolving multi-step reasoning and negation, as we discuss below.

Quantitative analysis. In Table 2, we study downstream task performance of DRAGON on questions involving complex reasoning. Building on [8, 9], we consider several proxies to categorize complex questions: (i) presence of negation (e.g., no, never), (ii) presence of conjunction (e.g., and, but), (iii) presence of hedge terms (e.g., sometimes, maybe), (iv) number of prepositional phrases, and (v) number of entity mentions. Negation or conjunction indicates logical multi-step reasoning, more prepositional phrases or entity mentions indicate more reasoning steps or constraints, and hedge terms indicate complex textual nuance. DRAGON significantly outperforms the baseline LM (RoBERTa) across all these categories (e.g., +14% accuracy for negation), which confirms that our joint language-knowledge pretraining boosts reasoning performance. DRAGON also consistently outperforms the existing KG-augmented QA models (QAGNN, GreaseLM). We find that QAGNN and GreaseLM only improve moderately on RoBERTa for some categories, such as conjunction or many prepositional phrases (= 2, 3), but DRAGON provides substantial boosts. This suggests that through self-supervised pretraining with larger and more diverse data, DRAGON has learned more general-purpose reasoning abilities than finetuning-only models like GreaseLM.
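To make the categorization above concrete, these proxies can be approximated with simple lexical and syntactic checks. The sketch below uses spaCy for parsing; the term lists, the dependency-based prepositional-phrase count, and the noun-chunk stand-in for entity mentions are illustrative assumptions rather than the paper's released preprocessing (in the paper's setup, entity mentions come from linking the text to KG entities).

```python
# Illustrative sketch of the complexity proxies used in Table 2.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Assumed (not official) term lists for the lexical categories.
NEGATION = {"no", "not", "never", "n't"}
CONJUNCTION = {"and", "but", "or"}
HEDGE = {"sometimes", "maybe", "probably", "likely"}

def complexity_profile(question: str) -> dict:
    """Flag negation/conjunction/hedge terms and count rough proxies for
    prepositional phrases and entity mentions in a question."""
    doc = nlp(question)
    tokens = {tok.lower_ for tok in doc}
    return {
        "negation": bool(tokens & NEGATION),
        "conjunction": bool(tokens & CONJUNCTION),
        "hedge": bool(tokens & HEDGE),
        # Each token the parser marks as a prepositional modifier heads one phrase.
        "num_prep_phrases": sum(tok.dep_ == "prep" for tok in doc),
        # Noun chunks as a rough stand-in; DRAGON instead links spans to KG entities.
        "num_entity_mentions": len(list(doc.noun_chunks)),
    }

print(complexity_profile("Where would you use a folding chair but not store one?"))
```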
Qualitative analysis. Using the CSQA dataset, we further conducted case studies of the behavior of DRAGON's KG reasoning component, visualizing how graph attention weights change given different question variations (Figure 2). We find that DRAGON exhibits the ability to extrapolate and perform robust reasoning. For instance, DRAGON adjusts the entity attention weights and final predictions accordingly when we add a conjunction or negation about entities (A1, A2) or when we add extra context to an original question (B1→B2), but the existing models, RoBERTa and GreaseLM, struggle to predict the correct answers. As these questions are more complex than those typically seen in the CSQA training set, our insight is that while vanilla LMs (RoBERTa) and finetuning-only models (GreaseLM) have limitations in learning complex reasoning, KG-augmented pretraining (DRAGON) helps acquire generalizable reasoning abilities that extrapolate to harder test examples.

[Figure 2 panels (attention maps from DRAGON's first and final GNN layers, not reproduced here): (A1) Conjunction: "Where would you use a folding chair and store one?" (A. camp, B. school, C. beach trip); (A2) Negation + Conjunction: "Where would you use a folding chair but not store one?" (A. garage, B. school, C. beach trip); (B1) Single context: "You will buy a ticket for entering what building for entertainment?" (A. station, B. movie theater, C. concert hall); (B2) Multi context: the same question prefixed with "You don't enjoy watching pre-recorded performance." Each panel also reports the predictions of RoBERTa, GreaseLM, and DRAGON.]

Figure 2: Analysis of DRAGON's graph reasoning, where we visualize how graph attention weights and final predictions change given question variations. Darker and thicker edges indicate higher attention weights. DRAGON exhibits the ability to extrapolate and perform robust reasoning: it adjusts the entity attention weights and final predictions accordingly when conjunction or negation about entities is added (A1, A2) or when extra context is added to an original question (B1→B2), but the existing models, RoBERTa and GreaseLM, struggle to predict the correct answers. A1: DRAGON's final GNN layer shows strong attention to "school" but weak attention to "trip", likely because the question states "and store one"; hence, the chair is not used for a trip. A2: DRAGON shows strong attention to "trip" and "beach", likely because the question now states "but not store one"; hence, the chair is used for a trip. B1→B2: DRAGON's final GNN layer shows strong attention to "movie" in the original question (B1), but after the extra context "You don't enjoy watching pre-recorded performance." is added (B2), DRAGON shows strong attention to "live" and "concert", leading to the correctly adjusted prediction "concert hall".

One interpretation of these findings is that DRAGON leverages the KG's graph structure as a scaffold for performing complex reasoning. This insight is related to recent works that provide LMs with scratch space for intermediate reasoning [8, 65, 66].

Method          CosmosQA (10% train)  PIQA (10% train)
RoBERTa         72.2                  66.4
GreaseLM        73.0                  67.0
DRAGON (Ours)   77.9                  72.3

Table 3: Performance in a low-resource setting where 10% of the finetuning data is used. DRAGON attains large gains, suggesting its benefit for downstream data efficiency.
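As a rough illustration of the mechanics behind the Figure 2 analysis, the sketch below shows how per-edge attention weights can be read out from an off-the-shelf graph attention layer (torch_geometric's GATConv) and compared across two question variants. The toy graph, node names, and random features are placeholders; DRAGON's own GNN and its text-conditioned node features are not reproduced here.

```python
# Reading out and comparing per-edge graph-attention weights (illustrative only).
import torch
from torch_geometric.nn import GATConv

nodes = ["folding_chair", "school", "beach", "trip", "store"]
edge_index = torch.tensor([[0, 0, 0, 0],
                           [1, 2, 3, 4]])      # edges: folding_chair -> each other node
x_variant_a = torch.randn(len(nodes), 16)      # stand-in entity features for question A1
x_variant_b = torch.randn(len(nodes), 16)      # stand-in entity features for question A2

gat = GATConv(in_channels=16, out_channels=16, heads=1)
for name, x in [("A1 (conjunction)", x_variant_a), ("A2 (negation)", x_variant_b)]:
    # return_attention_weights=True yields the (possibly self-loop-augmented)
    # edge index together with one attention coefficient per edge.
    _, (ei, alpha) = gat(x, edge_index, return_attention_weights=True)
    for (src, dst), a in zip(ei.t().tolist(), alpha.squeeze(-1).tolist()):
        print(f"{name}: {nodes[src]} -> {nodes[dst]}: attention {a:.2f}")
```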
Method             CSQA   OBQA
GreaseLM           74.2   66.9
GreaseLM-Ex        73.9   66.2
DRAGON (Ours)      76.0   72.0
DRAGON-Ex (Ours)   76.3   72.8

Table 4: Downstream performance when model capacity (the number of text-KG fusion layers) is increased ("-Ex"). Increased capacity does not help the finetuning-only model (GreaseLM), but does help when pretrained (DRAGON), suggesting the promise of further scaling up DRAGON.

Ablation type           Ablation                            CSQA   OBQA
Pretraining objective   MLM + LinkPred (final)              76.0   72.0
                        MLM only                            74.3   67.2
                        LinkPred only                       73.8   66.4
LinkPred head           DistMult (final)                    76.0   72.0
                        TransE                              75.7   71.4
                        RotatE                              75.8   71.7
Cross-modal model       Bidirectional interaction (final)   76.0   72.0
                        Concatenate at end                  74.5   68.0
KG structure            Use graph (final)                   76.0   72.0
                        Convert to sentences                74.7   70.1

Table 5: Ablation study of DRAGON. Using the joint pretraining objective, MLM + LinkPred (§2.3), outperforms using either one alone. All variants of the LinkPred scoring model (DistMult, TransE, RotatE) outperform the baseline without LinkPred ("MLM only"), suggesting that DRAGON can be combined with various KG representation learning models. The cross-modal model with bidirectional modality interaction (§2.2) outperforms combining text and KG representations only at the end. Finally, using the KG as a graph outperforms converting the KG into sentences, suggesting the benefit of graph structure for reasoning.

3.4.2 Analysis: Effect of pretraining

Another key contribution of DRAGON (w.r.t. existing QA models like GreaseLM) is pretraining. Here we discuss when and why our pretraining is useful. Considering the three core factors in machine learning (data, task complexity, and model capacity), pretraining helps when the available downstream task data is small relative to the downstream task complexity or model capacity. Concretely, we find that DRAGON is especially helpful in the following three scenarios.

Downstream tasks with limited data. In Table 1, we find that DRAGON provides significant boosts over GreaseLM on downstream tasks with limited finetuning data available, such as ARC (3K training instances; +4% accuracy), Riddle (3K instances; +4% accuracy) and OBQA (5K instances; +5% accuracy). For other tasks, we also experimented with a low-resource setting where 10% of the finetuning data is used (Table 3). Here we also see that DRAGON attains significant gains over GreaseLM (+5% accuracy on PIQA), suggesting the improved data efficiency of DRAGON.

Complex downstream tasks. In Table 1, we find that DRAGON provides substantial gains over GreaseLM on downstream tasks involving more complex reasoning, such as CosmosQA and HellaSwag, where the inputs have longer contexts and more entities (and thus bigger local KGs). For these tasks, the improvements of GreaseLM over RoBERTa were small (+0.1% on CosmosQA), but DRAGON provides substantial boosts (+1.8%). Our insight is that through self-supervised pretraining with larger and more diverse data, DRAGON has learned richer text-KG interactions than GreaseLM, enabling it to solve more complex downstream tasks. Similarly, as seen in §3.4.1, DRAGON also attains large gains over GreaseLM on complex questions containing negation, conjunction and prepositional phrases (Table 2), and extrapolates to questions more complex than those seen in training sets (Figure 2).

Increased model capacity. In Table 4, we study downstream performance when the model capacity is increased: the number of text-KG fusion layers is raised from 5 to 7 for both GreaseLM and DRAGON. We find that increased capacity does not help the finetuning-only model (GreaseLM), as was also reported in the original GreaseLM paper, but it does help when the model is pretrained (DRAGON). This result reveals that increased model capacity can actually be beneficial when combined with pretraining, and suggests the promise of scaling DRAGON up further.
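For reference, a text-KG fusion layer of the kind counted in Table 4 can be sketched as follows: a transformer layer updates the token states, a graph-side module updates the KG node states, and a small MLP exchanges information between a special interaction token and a special interaction node, in the spirit of the GreaseLM/DRAGON encoder (§2.2). The layer sizes, the MLP stand-in for message passing, and the module names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of one bidirectional text-KG fusion layer (illustrative only).
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.text_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.node_layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.fuse = nn.Sequential(nn.Linear(2 * dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, 2 * dim))

    def forward(self, tokens: torch.Tensor, nodes: torch.Tensor):
        # tokens: [B, T, d] with the interaction token at position 0
        # nodes:  [B, N, d] with the interaction node at position 0
        tokens = self.text_layer(tokens)           # text side: self-attention over tokens
        nodes = nodes + self.node_layer(nodes)     # KG side: stand-in for GNN message passing
        # Exchange information between the two modalities through the interaction pair.
        mixed = self.fuse(torch.cat([tokens[:, 0], nodes[:, 0]], dim=-1))
        t_int, n_int = mixed.chunk(2, dim=-1)
        tokens = torch.cat([t_int.unsqueeze(1), tokens[:, 1:]], dim=1)
        nodes = torch.cat([n_int.unsqueeze(1), nodes[:, 1:]], dim=1)
        return tokens, nodes

# Stacking more such layers is the capacity increase studied in Table 4.
tokens, nodes = FusionLayer()(torch.randn(2, 8, 256), torch.randn(2, 5, 256))
```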
3.4.3 Analysis: Design choices of DRAGON

Pretraining objective (Table 5, top). The first important design choice of DRAGON is the joint pretraining objective, MLM + LinkPred (§2.3). Using the joint objective outperforms using MLM or LinkPred alone (+5% accuracy on OBQA). This suggests that having bidirectional self-supervised tasks on text and KG helps the model fuse the two modalities for reasoning.

Link prediction head choice (Table 5, middle 1). KG representation learning is an active area of research, and various KG triplet scoring models have been proposed (Equation 9). We hence experimented with different scoring models for DRAGON's link prediction head (§2.3). We find that while DistMult has a slight edge, all variants we tried (DistMult, TransE, RotatE) are effective, outperforming the baseline without LinkPred ("MLM only"). This result suggests the generality of DRAGON and its promise to be combined with various KG representation learning techniques.

Cross-modal model (Table 5, middle 2). Another core component of DRAGON is the cross-modal encoder with bidirectional text-KG fusion layers (§2.2). We find that if we ablate them and simply concatenate text and KG representations at the end, performance drops substantially. This result suggests that deep bidirectional fusion is crucial for modeling interactions between text and KG for reasoning.

KG structure (Table 5, bottom). The final key design choice of DRAGON is that we leverage the graph structure of KGs via a sequence-graph encoder and a link prediction objective. Here we experimented with an alternative pretraining method that drops the graph structure: we convert triplets in the local KG into sentences using a template [33], append them to the main text input, and perform vanilla MLM pretraining. We find that DRAGON substantially outperforms this variant (+2% accuracy on OBQA), which suggests that the graph structure of KGs helps the model perform reasoning.
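To illustrate the joint objective and the DistMult head discussed above, here is a minimal sketch of an MLM + LinkPred loss. The module names, tensor shapes, and the negative-triplet interface are assumptions made for illustration and do not reproduce DRAGON's released code.

```python
# Sketch of a joint pretraining loss: masked language modeling on the text side
# plus KG link prediction with a DistMult scoring head on the node side.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointPretrainingLoss(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int, num_relations: int):
        super().__init__()
        self.mlm_head = nn.Linear(hidden_dim, vocab_size)        # predicts masked tokens
        self.rel_emb = nn.Embedding(num_relations, hidden_dim)   # DistMult relation vectors

    def distmult_score(self, h, r, t):
        # DistMult: score(h, r, t) = sum_i h_i * r_i * t_i
        return (h * self.rel_emb(r) * t).sum(dim=-1)

    def forward(self, token_states, mlm_labels, node_states, pos_triples, neg_triples):
        # MLM term: cross-entropy over masked positions (labels = -100 elsewhere).
        logits = self.mlm_head(token_states)                     # [B, T, V]
        mlm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   mlm_labels.view(-1), ignore_index=-100)

        # LinkPred term: score held-out KG edges higher than corrupted (negative) ones.
        # Triples are [N, 3] index tensors of (head node, relation, tail node).
        pos = self.distmult_score(node_states[pos_triples[:, 0]], pos_triples[:, 1],
                                  node_states[pos_triples[:, 2]])
        neg = self.distmult_score(node_states[neg_triples[:, 0]], neg_triples[:, 1],
                                  node_states[neg_triples[:, 2]])
        link_loss = F.binary_cross_entropy_with_logits(
            torch.cat([pos, neg]),
            torch.cat([torch.ones_like(pos), torch.zeros_like(neg)]))

        return mlm_loss + link_loss   # joint objective: MLM + LinkPred
```

Swapping the DistMult scorer for a TransE or RotatE scoring function, as in the ablation, only changes the `distmult_score` method.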
4 Experiments: Biomedical domain

Biomedicine is a domain with extensive background knowledge [67, 68, 69, 1], and experts curate various knowledge bases for it [70, 17, 71, 72]. We hypothesize that these biomedical KGs can enable deeper understanding of and reasoning about biomedical text. With this motivation, we pretrain DRAGON on a biomedical corpus and KG, and evaluate it on biomedical downstream tasks.

Pretraining setup. For the text data, we use PubMed [73], a widely used corpus in biomedical LM training (e.g., BioBERT [74], PubMedBERT [75]). It contains the abstracts of biomedical papers on PubMed and amounts to 21GB of text. For the KG data, we use the Unified Medical Language System (UMLS) [17], a widely used knowledge graph in biomedicine. It has 300K nodes and 1M edges in total. For training, we follow the same procedure as the experiment in the general domain (§3.1), except that we initialize DRAGON's LM component with BioLinkBERT-Large [19], the state-of-the-art biomedical LM, instead of RoBERTa-Large. Note that while BioLinkBERT has "Link" in its name, this refers not to KG links but to the document citation links with which the model was originally pretrained.

Method              MedQA   PubMedQA   BioASQ
BioBERT [74]        36.7    60.2       84.1
PubMedBERT [75]     38.1    55.8       87.5
BioLinkBERT [19]    44.6    72.2       94.8
  + QAGNN           45.0    72.1       95.0
  + GreaseLM        45.1    72.4       94.9
DRAGON (Ours)       47.5    73.4       96.4

Table 6: Accuracy on biomedical NLP tasks. DRAGON outperforms all previous biomedical LMs.

Downstream evaluation tasks. We finetune and evaluate DRAGON on three popular biomedical NLP and reasoning benchmarks: MedQA-USMLE (MedQA) [76], PubMedQA [77], and BioASQ [78]. Appendix B.4 provides details on these tasks and data splits.

Baselines. We compare DRAGON with the vanilla LM (BioLinkBERT) and with LMs finetuned with the KG (QAGNN and GreaseLM seeded with BioLinkBERT).

Results. Table 6 summarizes model performance on the downstream tasks. Across tasks, DRAGON outperforms all the existing biomedical LMs and KG-augmented QA models, e.g., a +3% absolute accuracy boost over BioLinkBERT and +2% over GreaseLM on MedQA, achieving new state-of-the-art performance on these tasks. This result suggests significant efficacy of KG-augmented pretraining for biomedical reasoning tasks. Combined with the results in the general commonsense domain (§3.4), our experiments also suggest the domain generality of DRAGON as an effective pretraining method across domains with different combinations of text, KGs and seed LMs.

5 Conclusion

We presented DRAGON, a self-supervised pretraining method that learns a deeply bidirectional language-knowledge model from text and knowledge graphs (KGs) at scale. In both the general and biomedical domains, DRAGON outperforms existing language models and KG-augmented models on various NLP tasks, and exhibits strong performance on complex reasoning, such as answering questions involving long contexts or multi-step reasoning. One limitation of DRAGON is that it is currently an encoder model (analogous to BERT) and does not perform language generation. Important future research would be to extend DRAGON to generation and advance KG-enhanced language generation [28, 79].

Reproducibility

Pretrained models, code and data are available at https://github.com/michiyasunaga/dragon. Experiments are available at https://worksheets.codalab.org/worksheets/0xcf9cddffff864fb382e1a2f1393c8934.

Acknowledgment

We thank Rok Sosic, Hamed Nilforoshan, Michael Moor, Qian Huang, members of the Stanford SNAP, P-Lambda, and NLP groups, as well as our anonymous reviewers for valuable feedback. We also gratefully acknowledge the support of HAI Google Cloud Credits 1051203844499; DARPA under Nos. HR00112190039 (TAMI) and N660011924033 (MCS); ARO under Nos. W911NF-16-1-0342 (MURI) and W911NF-16-1-0171 (DURIP); NSF under Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), and IIS-2030477 (RAPID); NIH under No. R56LM013365; the Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Chan Zuckerberg Biohub, Amazon, JPMorgan Chase, Docomo, Hitachi, Intel, JD.com, KDDI, Toshiba, NEC, and United Health Group. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.

References

[1] Rishi Bommasani et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[4] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
[6] Denny Vrandečić and Markus Krötzsch. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 2014.
[7] Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
[8] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. QA-GNN: Reasoning with language models and knowledge graphs for question answering. In North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
[9] Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, and Jure Leskovec. GreaseLM: Graph reasoning enhanced language models for question answering. In International Conference on Learning Representations (ICLR), 2022.
[10] Hongyu Ren, Weihua Hu, and Jure Leskovec. Query2box: Reasoning over knowledge graphs in vector space using box embeddings. In International Conference on Learning Representations (ICLR), 2020.
[11] Hongyu Ren, Hanjun Dai, Bo Dai, Xinyun Chen, Michihiro Yasunaga, Haitian Sun, Dale Schuurmans, Jure Leskovec, and Denny Zhou. LEGO: Latent execution-guided reasoning for multi-hop question answering on knowledge graphs. In International Conference on Machine Learning (ICML), 2021.
[12] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. In Association for Computational Linguistics (ACL), 2019.
[13] Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In International Conference on Learning Representations (ICLR), 2020.
[14] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics (TACL), 2021.
[15] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. In North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
[16] Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, et al. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137, 2021.
[17] Olivier Bodenreider. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research, 2004.
[18] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[19] Michihiro Yasunaga, Jure Leskovec, and Percy Liang. LinkBERT: Pretraining language models with document links. In Association for Computational Linguistics (ACL), 2022.
[20] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. In International Conference on Machine Learning (ICML), 2020.
[21] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
[22] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021.
[23] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, V. Joshi, Sameer Singh, and Noah A. Smith. Knowledge enhanced contextual word representations. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[24] Corby Rosset, Chenyan Xiong, Minh Phan, Xia Song, Paul Bennett, and Saurabh Tiwary. Knowledge-aware language model pretraining. arXiv preprint arXiv:2007.00655, 2020.
[25] Tao Shen, Yi Mao, Pengcheng He, Guodong Long, Adam Trischler, and Weizhu Chen. Exploiting structured knowledge in text via graph-guided representation learning. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
[26] Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
[27] Donghan Yu, Chenguang Zhu, Yiming Yang, and Michael Zeng. JAKET: Joint pre-training of knowledge graph and language understanding. In AAAI Conference on Artificial Intelligence, 2022.
[28] Pei Ke, Haozhe Ji, Yu Ran, Xin Cui, Liwei Wang, Linfeng Song, Xiaoyan Zhu, and Minlie Huang. JointGT: Graph-text joint representation learning for text generation from knowledge graphs. In Findings of ACL, 2021.
[29] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and P. Wang. K-BERT: Enabling language representation with knowledge graph. In AAAI Conference on Artificial Intelligence, 2020.
[30] Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuan-Jing Huang, and Zheng Zhang. CoLAKE: Contextualized language and knowledge embedding. In International Conference on Computational Linguistics (COLING), 2020.
[31] Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, and Tong Xu. Integrating graph contextualized knowledge into pre-trained language models. In Findings of EMNLP, 2020.
[32] Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[33] Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren. Scalable multi-hop relational reasoning for knowledge-aware question answering. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
[34] Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[35] Kuan Wang, Yuyu Zhang, Diyi Yang, Le Song, and Tao Qin. GNN is a counter? Revisiting GNN for question answering. In International Conference on Learning Representations (ICLR), 2022.
[36] Todor Mihaylov and Anette Frank. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Association for Computational Linguistics (ACL), 2018.
[37] An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Association for Computational Linguistics (ACL), 2019.
[38] Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W. Cohen. Open domain question answering using early fusion of knowledge bases and text. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
[39] Haitian Sun, Tania Bedrax-Weiss, and William W. Cohen. PullNet: Open domain question answering with iterative retrieval on knowledge bases and text. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[40] Jun Yan, Mrigank Raman, Aaron Chan, Tianyu Zhang, Ryan Rossi, Handong Zhao, Sungchul Kim, Nedim Lipka, and Xiang Ren. Learning contextualized knowledge structures for commonsense reasoning. In Findings of ACL, 2021.
[41] Yueqing Sun, Qi Shi, Le Qi, and Yu Zhang. JointLK: Joint reasoning with language models and knowledge graphs for commonsense question answering. In North American Chapter of the Association for Computational Linguistics (NAACL), 2022.
[42] Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, and Xuedong Huang. Human parity on CommonsenseQA: Augmenting self-attention with external attention. In Association for Computational Linguistics (ACL), 2022.
[43] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning (ICML), 2016.
[44] Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[45] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
[46] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations (ICLR), 2015.
[47] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations (ICLR), 2019.
[48] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. Relation extraction with matrix factorization and universal schemas. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
[49] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Empirical Methods in Natural Language Processing (EMNLP), 2015.
[50] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
[51] Liang Yao, Chengsheng Mao, and Yuan Luo. KG-BERT: BERT for knowledge graph completion. arXiv preprint arXiv:1909.03193, 2019.
[52] Bosung Kim, Taesuk Hong, Youngjoong Ko, and Jungyun Seo. Multi-task learning for knowledge graph completion with pre-trained language models. In International Conference on Computational Linguistics (COLING), 2020.
[53] Da Li, Sen Yang, Kele Xu, Ming Yi, Yukai He, and Huaimin Wang. Multi-task pre-training language model for semantic network completion. arXiv preprint arXiv:2201.04843, 2022.
[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[55] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International Conference on Computer Vision (ICCV), 2015.
[56] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
[57] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
[58] Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, and Xiang Ren. RiddleSense: Reasoning about riddle questions featuring linguistic creativity and commonsense knowledge. In Findings of ACL, 2021.
[59] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
[60] Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[61] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Association for Computational Linguistics (ACL), 2019.
[62] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence, 2020.
[63] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. SocialIQA: Commonsense reasoning about social interactions. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[64] Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, and Yejin Choi. Abductive commonsense reasoning. In International Conference on Learning Representations (ICLR), 2020.
[65] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
[66] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[67] Elliot G. Brown, Louise Wood, and Sue Wood. The Medical Dictionary for Regulatory Activities (MedDRA). Drug Safety, 20(2):109-117, 1999.
[68] Carolyn E. Lipscomb. Medical Subject Headings (MeSH). Bulletin of the Medical Library Association, 88(3):265, 2000.
[69] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457-i466, 2018.
[70] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, et al. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.
[71] David S. Wishart, Yannick D. Feunang, An C. Guo, Elvis J. Lo, Ana Marcu, Jason R. Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Research, 2018.
[72] Camilo Ruiz, Marinka Zitnik, and Jure Leskovec. Identification of disease treatment mechanisms through the multiscale interactome. Nature Communications, 12(1):1-15, 2021.
[73] PubMed. https://pubmed.ncbi.nlm.nih.gov/.
[74] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2020.
[75] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
[76] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 2021.
[77] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[78] Anastasios Nentidis, Konstantinos Bougiatiotis, Anastasia Krithara, and Georgios Paliouras. Results of the seventh edition of the BioASQ challenge. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019.
[79] Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, and Meng Jiang. A survey of knowledge-enhanced text generation. ACM Computing Surveys (CSUR), 2022.
[80] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. Towards controllable biases in language generation. In Findings of EMNLP, 2020.
[81] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
[82] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of EMNLP, 2020.
[83] Ninareh Mehrabi, Pei Zhou, Fred Morstatter, Jay Pujara, Xiang Ren, and Aram Galstyan. Lawyers are dishonest? Quantifying representational harms in commonsense knowledge resources. arXiv preprint arXiv:2103.11320, 2021.