# Document-Level Relation Extraction with Reconstruction

Wang Xu¹, Kehai Chen² and Tiejun Zhao¹
¹Harbin Institute of Technology, Harbin, China
²National Institute of Information and Communications Technology, Kyoto, Japan
xuwang@hit-mtlab.net, khchen@nict.go.jp, tjzhao@hit.edu.cn

## Abstract

In document-level relation extraction (DocRE), graph structure is generally used to encode relation information in the input document to classify the relation category between each entity pair, and has greatly advanced the DocRE task over the past several years. However, the learned graph representation universally models relation information between all entity pairs, regardless of whether there are relationships between these entity pairs. Thus, entity pairs without relationships disperse the attention of the encoder-classifier DocRE model away from the pairs with relationships, which may further hinder the improvement of DocRE. To alleviate this issue, we propose a novel encoder-classifier-reconstructor model for DocRE. The reconstructor manages to reconstruct the ground-truth path dependencies from the graph representation, to ensure that the proposed DocRE model pays more attention to encoding entity pairs with relationships in the training. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference, which can further improve the performance of the DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed model can significantly improve the accuracy of relation extraction over a strong heterogeneous graph-based baseline. The code is publicly available at https://github.com/xwjim/DocRE-Rec.

## Introduction

Graph structure plays an important role in document-level relation extraction (DocRE) (Christopoulou, Miwa, and Ananiadou 2019; Sahu et al. 2019; Nan et al. 2020; Tang et al. 2020). Typically, an unstructured input document is first organized as a structured input graph (i.e., a homogeneous or heterogeneous graph) based on syntactic trees, coreference, or heuristic rules, thereby building relationships between entity pairs within and across multiple sentences of the input document. Neural networks (i.e., graph networks) are then used to iteratively encode the structured input graph as a graph representation that models relation information in the input document. The graph representation is fed into a classifier to classify the relation category between each entity pair, which has achieved state-of-the-art performance in DocRE (Christopoulou, Miwa, and Ananiadou 2019; Nan et al. 2020).

However, during the training of a DocRE model, the graph representation universally encodes relation information between all entity pairs, regardless of whether there are relationships between these entity pairs. For example, Figure 1 shows three entities in an input document: X-Files, Chris Carter, and Fox Mulder. Intuitively, they form three entity pairs: {X-Files, Chris Carter}, {X-Files, Fox Mulder}, and {Chris Carter, Fox Mulder}. The DocRE model learns the node representations of each entity pair to classify their relation. As seen, there exists a relationship between {Chris Carter, Fox Mulder} in the reference, indicating that there is naturally a reliable reasoning path from Chris Carter to Fox Mulder.
In comparison, there do not exist relationships between {X-Files, Chris Carter} or between {X-Files, Fox Mulder}, indicating that there are no reasoning paths between these pairs. However, the learned graph representation models the three path dependencies universally and does not consider whether there is a path dependency between a given target entity pair. As a result, {X-Files, Chris Carter} and {X-Files, Fox Mulder}, which have no relationships, disperse the attention of the DocRE model away from the learning of {Fox Mulder, Chris Carter}, which has a relationship; this may further hinder the improvement of the DocRE model.

To alleviate this issue, we propose a novel reconstructor method that enables the DocRE model to model the path dependency between an entity pair with a ground-truth relationship. To this end, the reconstructor generates a sequence of node representations on the path from one entity node to another entity node, and thereby maximizes the probability of the path if there is a ground-truth relationship between the entity pair and minimizes the probability otherwise. This allows the proposed DocRE model to pay more attention to the learning of entity pairs with relationships in the training, thereby learning an effective graph representation for the subsequent relation classification. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference, which can further improve the performance of the DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed method gained an improvement of 1.7 F1 points over a strong heterogeneous graph-based DocRE model, and in particular outperformed the recent state-of-the-art LSR model for DocRE (Nan et al. 2020).

[Figure 1: Heuristic rules are used to convert the input document into a heterogeneous graph. Then a graph attention network is applied to learn the graph representation. Finally, the node representations of entity pairs are used to classify their relationships. The example document reads: "[1] The X-Files was directed by David Nutter, and written by Chris Carter, Frank Spotnitz and Howard Gordon. ... [2] The show centers on FBI special agents Fox Mulder (David Duchovny) and Dana Scully (Gillian Anderson) who work on cases linked to the paranormal, called X-Files. ..."]

## Encoder-Classifier DocRE Baseline

In this section, based on the work of Christopoulou, Miwa, and Ananiadou (2019), we use heuristic rules to convert the input document into a heterogeneous graph without external syntactic knowledge. Moreover, a graph attention network is used to encode the heterogeneous graph instead of the edge-oriented graph network (Christopoulou, Miwa, and Ananiadou 2019), thereby implementing a strong and general baseline for DocRE.

### Heterogeneous Graph Construction

Formally, a given input document consists of $L$ sentences $\{S_1, S_2, \ldots, S_L\}$, each of which is a sequence of words $\{x^l_1, x^l_2, \ldots, x^l_J\}$ with length $J=|S_l|$. A bidirectional long short-term memory network (Bi-LSTM) reads the document word by word to generate a sequence of word vectors representing each sentence. We then apply the heterogeneous graph of Christopoulou, Miwa, and Ananiadou (2019) to the input document to build relationships between all entity pairs.

Specifically, the heterogeneous graph includes three distinct types of nodes: Mention Node, Entity Node, and Sentence Node. For example, Figure 1 shows an input document including two sentences (yellow indices) in which there are four mentions (blue) and three entities (green). The representation of each node is the average of the word vectors in the corresponding concept, thereby forming a set of node representations $\{v_1, v_2, \ldots, v_N\}$, where $N$ is the number of nodes. For edge connections, there are five distinct types of edges between pairs of nodes, following Christopoulou, Miwa, and Ananiadou (2019): Mention-Mention (MM), Mention-Sentence (MS), Mention-Entity (ME), Sentence-Sentence (SS), and Entity-Sentence (ES) edges. In addition, we add a Mention-Coreference (CO) edge between two mentions that refer to the same entity. According to the definitions above, an $N \times N$ adjacency matrix $E$ denotes the edge connections. Finally, the heterogeneous graph can be denoted as $G=\{V, E\}$, which keeps relation information between all entity pairs in the input document.
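To make the construction concrete, the following is a minimal NumPy sketch of the node and edge building described above. The input format (mention spans carrying entity ids), every helper name (`build_graph`, `sent_vecs`, `mentions`, `EDGE`), and the choice to connect all sentence nodes to each other are illustrative assumptions, not the authors' released preprocessing code.

```python
import numpy as np

EDGE = {"MM": 1, "MS": 2, "ME": 3, "SS": 4, "ES": 5, "CO": 6}  # 0 = no edge

def build_graph(sent_vecs, mentions, num_entities):
    """sent_vecs: one (len_l, d) array of Bi-LSTM word vectors per sentence.
    mentions: (sent_id, start, end, entity_id) tuples.
    Returns node representations V (N, d) and typed adjacency matrix E (N, N)."""
    nodes, info = [], []  # info: ("M", (sent, ent)) / ("E", ent) / ("S", sent)
    for (s, a, b, e) in mentions:                       # mention nodes: span average
        nodes.append(sent_vecs[s][a:b].mean(axis=0)); info.append(("M", (s, e)))
    for e in range(num_entities):                       # entity nodes: mention average
        own = [v for v, (t, p) in zip(nodes, info) if t == "M" and p[1] == e]
        nodes.append(np.mean(own, axis=0)); info.append(("E", e))
    for s, sv in enumerate(sent_vecs):                  # sentence nodes: word average
        nodes.append(sv.mean(axis=0)); info.append(("S", s))
    N = len(nodes)
    E = np.zeros((N, N), dtype=np.int64)
    def link(i, j, t): E[i, j] = E[j, i] = EDGE[t]
    for i in range(N):
        for j in range(i + 1, N):
            (ti, pi), (tj, pj) = info[i], info[j]
            if ti == "M" and tj == "M":
                if pi[1] == pj[1]:   link(i, j, "CO")   # mentions of the same entity
                elif pi[0] == pj[0]: link(i, j, "MM")   # mentions in the same sentence
            elif ti == "M" and tj == "E" and pi[1] == pj:
                link(i, j, "ME")
            elif ti == "M" and tj == "S" and pi[0] == pj:
                link(i, j, "MS")
            elif ti == "E" and tj == "S":
                if any(m[0] == pj and m[3] == pi for m in mentions):
                    link(i, j, "ES")                    # entity mentioned in the sentence
            elif ti == "S" and tj == "S":
                link(i, j, "SS")                        # sentence nodes linked (assumed)
    return np.stack(nodes), E
```

The returned pair corresponds to $G=\{V, E\}$ above; the integer entries of $E$ record the edge type that later selects the key/value transforms in the encoder.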
### Encoder

To learn an effective graph representation, we use a graph attention network (Guo, Zhang, and Lu 2019) to encode the feature representation of each node in the heterogeneous graph. Formally, given the outputs of all previous hops of reasoning $\{s^1_n, s^2_n, \ldots, s^{l-1}_n\}$, they are concatenated and then transformed into a fixed-dimensional vector as the input of the $l$-th hop of reasoning:

$$z^l_n = W^l_e\,[v_n : s^1_n : s^2_n : \cdots : s^{l-1}_n], \tag{1}$$

where $s^{l-1}_n \in \mathbb{R}^{d_0}$ and $W^l_e \in \mathbb{R}^{d_0 \times (l \cdot d_0)}$. Also, according to the edge matrix $E[n][a_c]=k$ ($0 \le a_c < N$, $k > 0$), the $C$ directly adjacent nodes of $v_n$ are $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$. We then use the self-attention mechanism (Vaswani et al. 2017) to capture the feature information of $v_n$ between $z^l_n$ and $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$:

$$s^l_n = \mathrm{softmax}\Big(\frac{z^l_n K^\top}{\sqrt{d_0}}\Big)V, \tag{2}$$

where $\{K, V\}$ are key and value matrices transformed from the representations of the directly adjacent nodes $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$ according to the edge type. After performing $L$ hops of reasoning, there is a sequence of annotations $\{s^1_n, s^2_n, \ldots, s^L_n\}$ encoding relation information in the input document. Finally, another non-linear layer is applied to integrate the reasoning information $\{s^1_n, s^2_n, \ldots, s^L_n\}$ and the node information $v_n$:

$$q_n = \mathrm{ReLU}(W_o\,[v_n : s^1_n : \cdots : s^L_n]), \tag{3}$$

where $W_o \in \mathbb{R}^{d_1 \times (d_0 (L+1))}$ and $q_n \in \mathbb{R}^{d_1}$. As a result, the heterogeneous graph $G$ is represented as $\{q_1, q_2, \ldots, q_N\}$.
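As a rough illustration of Eqs. (1)-(2), here is a single-head PyTorch sketch of one reasoning hop. The per-edge-type key/value projections are one plausible reading of "transformed ... according to the edge type"; the parameter shapes, the dense $(N, N)$ materialization, and the class name `ReasoningHop` are assumptions made for brevity.

```python
import math
import torch
import torch.nn as nn

class ReasoningHop(nn.Module):
    """One hop of the graph attention encoder (Eqs. 1-2), single-head sketch."""
    def __init__(self, d0, hop, num_edge_types):
        super().__init__()
        # Eq. (1): W_e^l maps [v_n : s_n^1 : ... : s_n^{hop-1}] back to d0 dims.
        self.We = nn.Linear(hop * d0, d0, bias=False)
        # Per-edge-type key/value transforms (index 0 is a dummy "no edge" slot).
        self.Wk = nn.Parameter(torch.randn(num_edge_types + 1, d0, d0) / math.sqrt(d0))
        self.Wv = nn.Parameter(torch.randn(num_edge_types + 1, d0, d0) / math.sqrt(d0))
        self.d0 = d0

    def forward(self, v, history, E):
        """v: (N, d0) node reps; history: list of s_n^1..s_n^{hop-1}, each (N, d0);
        E: (N, N) integer edge-type matrix, 0 = no edge."""
        z = self.We(torch.cat([v] + history, dim=-1))           # Eq. (1): (N, d0)
        # Keys/values from neighbours, transformed by the edge type's matrices;
        # dense (N, N, d0) tensors are fine for a sketch, wasteful at scale.
        K = torch.einsum("md,nmde->nme", z, self.Wk[E])
        V = torch.einsum("md,nmde->nme", z, self.Wv[E])
        scores = torch.einsum("nd,nmd->nm", z, K) / math.sqrt(self.d0)
        scores = scores.masked_fill(E == 0, float("-inf"))      # neighbours only
        att = torch.softmax(scores, dim=-1)                     # assumes >= 1 neighbour
        return torch.einsum("nm,nmd->nd", att, V)               # Eq. (2): s_n^hop
```

Stacking $L$ such hops, concatenating their outputs with $v_n$, and applying the ReLU layer of Eq. (3) then yields the node representations $q_n$.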
### Classifier

Given the heterogeneous graph representation $\{q_1, q_2, \ldots, q_N\}$, the two node representations of each entity pair are fed into the classifier to classify their relationship. Specifically, the classifier is a multi-layer perceptron (MLP) with a sigmoid function that calculates the relationship probability:

$$R(r) = P(r|\{e_i, e_j\}) = \mathrm{sigmoid}(\mathrm{MLP}([q_i : q_j])). \tag{4}$$

To train the DocRE model, binary cross-entropy is used to optimize the parameters of the neural networks over the triple examples (subject, object, relation) in the training data set (including $T$ documents), that is, $\{\{e^t_{1n}, e^t_{2n}, r^t_n\}_{n=1}^{N_t}\}_{t=1}^{T}$:

$$\mathrm{Loss}_c = -\sum_{t=1}^{T} \sum_{n=1}^{N_t} \big\{ r^t_n \log R(r^t_n) + (1-r^t_n)\log\big(1-R(r^t_n)\big) \big\}, \tag{5}$$

where $r^t_n \in \{0, 1\}$ indicates whether the entity pair has relation label $r$, and $N_t$ is the number of relations in the $t$-th document.

## Methodology

Intuitively, when a human understands a document describing relationships, he or she pays more attention to entity pairs with relationships than to ones without. Motivated by this observation, we propose a novel DocRE model with reconstruction (see Figure 2) that pays more attention to entity pairs with relationships, thus enhancing the accuracy of relation classification.

### Meta Path of Entity Pair

Generally, when there is a relationship between two entities, they should have a strong path dependency in the graph structure (or representation). In comparison, when there is no relationship between two entities, there is only a weak path dependency.¹ Thus, we explore reconstructing the path dependency between each entity pair from the learned graph representation. To this end, we first define three types of paths between two entity nodes in the graph representation as reconstruction candidates, according to meta-path information (Sun and Han 2013):

1) Meta Path 1 (Pattern Recognition): Two entities are connected through a sentence in this reasoning type. The relation schema is EM-MM-EM, for example the node sequence {7, 3, 4, 8} in Figure 1.

2) Meta Path 2 (Logical Reasoning): The relation between two entities is indirectly established by a bridge entity, which occurs in a sentence with each of the two entities separately. The relation schema is EM-MM-CO-MM-EM, for example the node sequence {7, 3, 4, 5, 6, 9} in Figure 1.

3) Meta Path 3 (Coreference Reasoning): Coreference resolution must be performed first to identify the target entities. A reference word refers to an entity that appears in the previous sentence, so the two entities occur in the same sentence implicitly. The relation schema is ES-SS-ES, for example the node sequence {7, 1, 2, 9} in Figure 1.

In fact, every entity pair has at least one of the three meta-paths. We select one meta-path type according to the priority meta-path 1 > meta-path 2 > meta-path 3. Generally, several instance paths may exist for the selected meta-path; we select the instance path that appears first in the document.

¹ If there is no path dependency between two target entities without a relationship, this may weaken the understanding of relationship information in the document.

[Figure 2: Model overview. The reconstructor manages to reconstruct the ground-truth path dependencies from the graph representation to ensure that the model pays more attention to entity pairs with relationships. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference.]

### Path Reconstruction

For each entity pair, one instance path is selected as the supervision for the reconstruction of the path dependency. In other words, there is only one supervision path $\phi_n=\{v_{b_1}, v_{b_2}, \ldots, v_{b_C}\}$ between each target pair $\{e_{1n}, e_{2n}\}$, where $C$ is the number of nodes on the path. To reconstruct the path dependency of each entity pair, we model the reconstructor as sequence generation. Specifically, we use an LSTM to compute a path hidden state $p_{b_c}$ from each node representation $q_{b_{c-1}}$ on the path $\phi_n$:

$$p_{b_c} = \mathrm{LSTM}(p_{b_{c-1}}, q_{b_{c-1}}). \tag{6}$$

Note that $p_{b_0}$ is initialized as a transform of $o_{ij}$, since it plays a key role in classification. $p_{b_c}$ is fed into a softmax layer to compute the probability of node $v_{b_c}$ given the preceding nodes on the path, $P(v_{b_c} \mid v_{b_1}, \ldots, v_{b_{c-1}})$.
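Although the derivation is cut off above, Eq. (6) together with the softmax it feeds can be sketched as follows. The initial transform of $o_{ij}$, the dot-product scoring layer, and the use of the head entity's representation as the first LSTM input are all assumptions; the training objective then maximizes the returned log-probability for pairs with a ground-truth relationship and minimizes it otherwise, as described in the introduction.

```python
import torch
import torch.nn as nn

class PathReconstructor(nn.Module):
    """Sketch of the path reconstructor: an LSTM over the supervision path (Eq. 6)."""
    def __init__(self, d1):
        super().__init__()
        self.init = nn.Linear(2 * d1, d1)   # p_{b_0} from o_ij ~ [q_i : q_j] (assumed)
        self.cell = nn.LSTMCell(d1, d1)     # Eq. (6): p_{b_c} = LSTM(p_{b_{c-1}}, q_{b_{c-1}})
        self.proj = nn.Linear(d1, d1)       # scoring layer for the softmax (assumed)

    def forward(self, q, pair, path):
        """q: (N, d1) node representations; pair: (i, j) entity-node indices;
        path: node indices [b_1, ..., b_C] of the supervision path phi_n.
        Returns log P(phi_n) = sum_c log P(v_{b_c} | v_{b_1}, ..., v_{b_{c-1}})."""
        i, j = pair
        h = torch.tanh(self.init(torch.cat([q[i], q[j]], dim=-1)))  # p_{b_0}
        c = torch.zeros_like(h)
        prev = q[i]                          # q_{b_0}: taken to be the head entity (assumed)
        logp = q.new_zeros(())
        for b in path:
            h, c = self.cell(prev.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            scores = q @ self.proj(h)        # score every node against p_{b_c}
            logp = logp + torch.log_softmax(scores, dim=-1)[b]
            prev = q[b]
        return logp
```

At inference time, this accumulated log-probability can serve as the relationship indicator that is combined with the classifier probability of Eq. (4), as described in the introduction.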