# Document-Level Relation Extraction with Reconstruction

Wang Xu¹, Kehai Chen² and Tiejun Zhao¹
¹Harbin Institute of Technology, Harbin, China
²National Institute of Information and Communications Technology, Kyoto, Japan
xuwang@hit-mtlab.net, khchen@nict.go.jp, tjzhao@hit.edu.cn

## Abstract

In document-level relation extraction (DocRE), graph structure is generally used to encode relation information in the input document to classify the relation category between each entity pair, and has greatly advanced the DocRE task over the past several years. However, the learned graph representation universally models relation information between all entity pairs, regardless of whether there are relationships between these entity pairs. Thus, entity pairs without relationships disperse the attention of the encoder-classifier DocRE model away from the pairs with relationships, which may further hinder the improvement of DocRE. To alleviate this issue, we propose a novel encoder-classifier-reconstructor model for DocRE. The reconstructor manages to reconstruct the ground-truth path dependencies from the graph representation, to ensure that the proposed DocRE model pays more attention to encoding entity pairs with relationships in the training. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference, which can further improve the performance of the DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed model can significantly improve the accuracy of relation extraction over a strong heterogeneous graph-based baseline. The code is publicly available at https://github.com/xwjim/DocRE-Rec.

## Introduction

Graph structure plays an important role in document-level relation extraction (DocRE) (Christopoulou, Miwa, and Ananiadou 2019; Sahu et al. 2019; Nan et al. 2020; Tang et al. 2020). Typically, an unstructured input document is first organized as a structured input graph (i.e., a homogeneous or heterogeneous graph) based on syntactic trees, coreference, or heuristic rules, thereby building relationships between entity pairs within and across multiple sentences of the input document. Neural networks (i.e., graph networks) are then used to iteratively encode the structured input graph as a graph representation that models relation information in the input document. The graph representation is fed into a classifier to classify the relation category between each entity pair, which has achieved state-of-the-art performance in DocRE (Christopoulou, Miwa, and Ananiadou 2019; Nan et al. 2020).

However, during the training of a DocRE model, the graph representation universally encodes relation information between all entity pairs, regardless of whether there are relationships between these entity pairs. For example, Figure 1 shows three entities in an input document: X-Files, Chris Carter, and Fox Mulder. Intuitively, they form three entity pairs: {X-Files, Chris Carter}, {X-Files, Fox Mulder}, and {Chris Carter, Fox Mulder}. The DocRE model learns the node representations of each entity pair to classify their relation. As seen, there exists a relationship between {Chris Carter, Fox Mulder} in the reference, indicating that there is naturally a reliable reasoning path from Chris Carter to Fox Mulder.
In comparison, there do not exist relationships between {X-Files, Chris Carter} or between {X-Files, Fox Mulder}, indicating that there are no reasoning paths between these pairs. However, the learned graph representation models the three path dependencies universally and does not consider whether there is a path dependency between a given target entity pair. As a result, {X-Files, Chris Carter} and {X-Files, Fox Mulder}, which have no relationships, disperse the attention of the DocRE model away from the learning of {Fox Mulder, Chris Carter}, which has a relationship; this may further hinder the improvement of the DocRE model.

To alleviate this issue, we propose a novel reconstructor method that enables the DocRE model to model the path dependency between an entity pair with a ground-truth relationship. To this end, the reconstructor generates a sequence of node representations on the path from one entity node to another entity node, and thereby maximizes the probability of the path if there is a ground-truth relationship between the entity pair and minimizes the probability otherwise. This allows the proposed DocRE model to pay more attention to the learning of entity pairs with relationships in the training, thereby learning an effective graph representation for the subsequent relation classification. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference, which can further improve the performance of the DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed method gained an improvement of 1.7 F1 points over a strong heterogeneous graph-based DocRE model, and in particular outperformed the recent state-of-the-art LSR model for DocRE (Nan et al. 2020).

[Figure 1: Heuristic rules are used to convert the input document into a heterogeneous graph. Then a graph attention network is applied to learn the graph representation. Finally, the node representations of entity pairs are used to classify their relationships. The example document reads: "[1] The X-Files was directed by David Nutter, and written by Chris Carter, Frank Spotnitz and Howard Gordon. ... [2] The show centers on FBI special agents Fox Mulder (David Duchovny) and Dana Scully (Gillian Anderson) who work on cases linked to the paranormal, called X-Files. ..."]

## Encoder-Classifier DocRE Baseline

In this section, based on the work of Christopoulou, Miwa, and Ananiadou (2019), we use heuristic rules to convert the input document into a heterogeneous graph without external syntactic knowledge. Moreover, a graph attention network is used to encode the heterogeneous graph instead of the edge-oriented graph network (Christopoulou, Miwa, and Ananiadou 2019), thereby implementing a strong and general baseline for DocRE.

### Heterogeneous Graph Construction

Formally, a given input document consists of $L$ sentences $\{S_1, S_2, \ldots, S_L\}$, each of which is a sequence of words $\{x^l_1, x^l_2, \ldots, x^l_J\}$ with length $J=|S_l|$. A bidirectional long short-term memory network (Bi-LSTM) reads the document word by word to generate a sequence of word vectors representing each sentence. We then apply the heterogeneous graph of Christopoulou, Miwa, and Ananiadou (2019) to the input document to build relationships between all entity pairs.

Specifically, the heterogeneous graph includes three distinct types of nodes: Mention Node, Entity Node, and Sentence Node. For example, Figure 1 shows an input document including two sentences (yellow indices) in which there are four mentions (blue) and three entities (green). The representation of each node is the average of the word vectors in the corresponding concept, thereby forming a set of node representations $\{v_1, v_2, \ldots, v_N\}$, where $N$ is the number of nodes. For edge connections, there are five distinct types of edges between pairs of nodes, following Christopoulou, Miwa, and Ananiadou (2019): Mention-Mention (MM), Mention-Sentence (MS), Mention-Entity (ME), Sentence-Sentence (SS), and Entity-Sentence (ES) edges. In addition, we add a Mention-Coreference (CO) edge between two mentions that refer to the same entity. According to the definitions above, an $N \times N$ adjacency matrix $E$ denotes the edge connections. Finally, the heterogeneous graph can be denoted as $G=\{V, E\}$, which keeps relation information between all entity pairs in the input document.
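To make the construction concrete, the following is a minimal NumPy sketch of the node and edge building described above. The input format (mention spans carrying entity ids), every helper name (`build_graph`, `sent_vecs`, `mentions`, `EDGE`), and the choice to connect all sentence nodes to each other are illustrative assumptions, not the authors' released preprocessing code.

```python
import numpy as np

EDGE = {"MM": 1, "MS": 2, "ME": 3, "SS": 4, "ES": 5, "CO": 6}  # 0 = no edge

def build_graph(sent_vecs, mentions, num_entities):
    """sent_vecs: one (len_l, d) array of Bi-LSTM word vectors per sentence.
    mentions: (sent_id, start, end, entity_id) tuples.
    Returns node representations V (N, d) and typed adjacency matrix E (N, N)."""
    nodes, info = [], []  # info: ("M", (sent, ent)) / ("E", ent) / ("S", sent)
    for (s, a, b, e) in mentions:                       # mention nodes: span average
        nodes.append(sent_vecs[s][a:b].mean(axis=0)); info.append(("M", (s, e)))
    for e in range(num_entities):                       # entity nodes: mention average
        own = [v for v, (t, p) in zip(nodes, info) if t == "M" and p[1] == e]
        nodes.append(np.mean(own, axis=0)); info.append(("E", e))
    for s, sv in enumerate(sent_vecs):                  # sentence nodes: word average
        nodes.append(sv.mean(axis=0)); info.append(("S", s))
    N = len(nodes)
    E = np.zeros((N, N), dtype=np.int64)
    def link(i, j, t): E[i, j] = E[j, i] = EDGE[t]
    for i in range(N):
        for j in range(i + 1, N):
            (ti, pi), (tj, pj) = info[i], info[j]
            if ti == "M" and tj == "M":
                if pi[1] == pj[1]:   link(i, j, "CO")   # mentions of the same entity
                elif pi[0] == pj[0]: link(i, j, "MM")   # mentions in the same sentence
            elif ti == "M" and tj == "E" and pi[1] == pj:
                link(i, j, "ME")
            elif ti == "M" and tj == "S" and pi[0] == pj:
                link(i, j, "MS")
            elif ti == "E" and tj == "S":
                if any(m[0] == pj and m[3] == pi for m in mentions):
                    link(i, j, "ES")                    # entity mentioned in the sentence
            elif ti == "S" and tj == "S":
                link(i, j, "SS")                        # sentence nodes linked (assumed)
    return np.stack(nodes), E
```

The returned pair corresponds to $G=\{V, E\}$ above; the integer entries of $E$ record the edge type that later selects the key/value transforms in the encoder.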
### Encoder

To learn an effective graph representation, we use a graph attention network (Guo, Zhang, and Lu 2019) to encode the feature representation of each node in the heterogeneous graph. Formally, given the outputs of all previous hops of reasoning $\{s^1_n, s^2_n, \ldots, s^{l-1}_n\}$, they are concatenated and then transformed into a fixed-dimensional vector as the input of the $l$-th hop of reasoning:

$$z^l_n = W^l_e\,[v_n : s^1_n : s^2_n : \cdots : s^{l-1}_n], \tag{1}$$

where $s^{l-1}_n \in \mathbb{R}^{d_0}$ and $W^l_e \in \mathbb{R}^{d_0 \times (l \cdot d_0)}$. Also, according to the edge matrix $E[n][a_c]=k$ ($0 \le a_c < N$, $k > 0$), the $C$ directly adjacent nodes of $v_n$ are $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$. We then use the self-attention mechanism (Vaswani et al. 2017) to capture the feature information of $v_n$ between $z^l_n$ and $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$:

$$s^l_n = \mathrm{softmax}\Big(\frac{z^l_n K^\top}{\sqrt{d_0}}\Big)V, \tag{2}$$

where $\{K, V\}$ are key and value matrices transformed from the representations of the directly adjacent nodes $\{z^l_{a_1}, z^l_{a_2}, \ldots, z^l_{a_C}\}$ according to the edge type. After performing $L$ hops of reasoning, there is a sequence of annotations $\{s^1_n, s^2_n, \ldots, s^L_n\}$ encoding relation information in the input document. Finally, another non-linear layer is applied to integrate the reasoning information $\{s^1_n, s^2_n, \ldots, s^L_n\}$ and the node information $v_n$:

$$q_n = \mathrm{ReLU}(W_o\,[v_n : s^1_n : \cdots : s^L_n]), \tag{3}$$

where $W_o \in \mathbb{R}^{d_1 \times (d_0 (L+1))}$ and $q_n \in \mathbb{R}^{d_1}$. As a result, the heterogeneous graph $G$ is represented as $\{q_1, q_2, \ldots, q_N\}$.
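As a rough illustration of Eqs. (1)-(2), here is a single-head PyTorch sketch of one reasoning hop. The per-edge-type key/value projections are one plausible reading of "transformed ... according to the edge type"; the parameter shapes, the dense $(N, N)$ materialization, and the class name `ReasoningHop` are assumptions made for brevity.

```python
import math
import torch
import torch.nn as nn

class ReasoningHop(nn.Module):
    """One hop of the graph attention encoder (Eqs. 1-2), single-head sketch."""
    def __init__(self, d0, hop, num_edge_types):
        super().__init__()
        # Eq. (1): W_e^l maps [v_n : s_n^1 : ... : s_n^{hop-1}] back to d0 dims.
        self.We = nn.Linear(hop * d0, d0, bias=False)
        # Per-edge-type key/value transforms (index 0 is a dummy "no edge" slot).
        self.Wk = nn.Parameter(torch.randn(num_edge_types + 1, d0, d0) / math.sqrt(d0))
        self.Wv = nn.Parameter(torch.randn(num_edge_types + 1, d0, d0) / math.sqrt(d0))
        self.d0 = d0

    def forward(self, v, history, E):
        """v: (N, d0) node reps; history: list of s_n^1..s_n^{hop-1}, each (N, d0);
        E: (N, N) integer edge-type matrix, 0 = no edge."""
        z = self.We(torch.cat([v] + history, dim=-1))           # Eq. (1): (N, d0)
        # Keys/values from neighbours, transformed by the edge type's matrices;
        # dense (N, N, d0) tensors are fine for a sketch, wasteful at scale.
        K = torch.einsum("md,nmde->nme", z, self.Wk[E])
        V = torch.einsum("md,nmde->nme", z, self.Wv[E])
        scores = torch.einsum("nd,nmd->nm", z, K) / math.sqrt(self.d0)
        scores = scores.masked_fill(E == 0, float("-inf"))      # neighbours only
        att = torch.softmax(scores, dim=-1)                     # assumes >= 1 neighbour
        return torch.einsum("nm,nmd->nd", att, V)               # Eq. (2): s_n^hop
```

Stacking $L$ such hops, concatenating their outputs with $v_n$, and applying the ReLU layer of Eq. (3) then yields the node representations $q_n$.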
### Classifier

Given the heterogeneous graph representation $\{q_1, q_2, \ldots, q_N\}$, the two node representations of each entity pair are fed into the classifier to classify their relationship. Specifically, the classifier is a multi-layer perceptron (MLP) with a sigmoid function that calculates the relationship probability:

$$R(r) = P(r|\{e_i, e_j\}) = \mathrm{sigmoid}(\mathrm{MLP}([q_i : q_j])). \tag{4}$$

To train the DocRE model, binary cross-entropy is used to optimize the parameters of the neural networks over the triple examples (subject, object, relation) in the training data set (including $T$ documents), that is, $\{\{e^t_{1n}, e^t_{2n}, r^t_n\}_{n=1}^{N_t}\}_{t=1}^{T}$:

$$\mathrm{Loss}_c = -\sum_{t=1}^{T} \sum_{n=1}^{N_t} \big\{ r^t_n \log R(r^t_n) + (1-r^t_n)\log\big(1-R(r^t_n)\big) \big\}, \tag{5}$$

where $r^t_n \in \{0, 1\}$ indicates whether the entity pair has relation label $r$, and $N_t$ is the number of relations in the $t$-th document.

## Methodology

Intuitively, when a human understands a document describing relationships, he or she pays more attention to entity pairs with relationships than to ones without. Motivated by this observation, we propose a novel DocRE model with reconstruction (see Figure 2) that pays more attention to entity pairs with relationships, thus enhancing the accuracy of relation classification.

### Meta Path of Entity Pair

Generally, when there is a relationship between two entities, they should have a strong path dependency in the graph structure (or representation). In comparison, when there is no relationship between two entities, there is only a weak path dependency.¹ Thus, we explore reconstructing the path dependency between each entity pair from the learned graph representation. To this end, we first define three types of paths between two entity nodes in the graph representation as reconstruction candidates, according to meta-path information (Sun and Han 2013):

1) Meta Path 1 (Pattern Recognition): Two entities are connected through a sentence in this reasoning type. The relation schema is EM-MM-EM, for example the node sequence {7, 3, 4, 8} in Figure 1.

2) Meta Path 2 (Logical Reasoning): The relation between two entities is indirectly established by a bridge entity, which occurs in a sentence with each of the two entities separately. The relation schema is EM-MM-CO-MM-EM, for example the node sequence {7, 3, 4, 5, 6, 9} in Figure 1.

3) Meta Path 3 (Coreference Reasoning): Coreference resolution must be performed first to identify the target entities. A reference word refers to an entity that appears in the previous sentence, so the two entities occur in the same sentence implicitly. The relation schema is ES-SS-ES, for example the node sequence {7, 1, 2, 9} in Figure 1.

In fact, every entity pair has at least one of the three meta-paths. We select one meta-path type according to the priority meta-path 1 > meta-path 2 > meta-path 3. Generally, several instance paths may exist for the selected meta-path; we select the instance path that appears first in the document.

¹ If there is no path dependency between two target entities without a relationship, this may weaken the understanding of relationship information in the document.

[Figure 2: Model overview. The reconstructor manages to reconstruct the ground-truth path dependencies from the graph representation to ensure that the model pays more attention to entity pairs with relationships. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference.]

### Path Reconstruction

For each entity pair, one instance path is selected as the supervision for the reconstruction of the path dependency. In other words, there is only one supervision path $\phi_n=\{v_{b_1}, v_{b_2}, \ldots, v_{b_C}\}$ between each target pair $\{e_{1n}, e_{2n}\}$, where $C$ is the number of nodes on the path. To reconstruct the path dependency of each entity pair, we model the reconstructor as sequence generation. Specifically, we use an LSTM to compute a path hidden state $p_{b_c}$ from each node representation $q_{b_{c-1}}$ on the path $\phi_n$:

$$p_{b_c} = \mathrm{LSTM}(p_{b_{c-1}}, q_{b_{c-1}}). \tag{6}$$

Note that $p_{b_0}$ is initialized as a transform of $o_{ij}$, since it plays a key role in classification. $p_{b_c}$ is fed into a softmax layer to compute the probability of node $v_{b_c}$ given the preceding nodes on the path, $P(v_{b_c} \mid v_{b_1}, \ldots, v_{b_{c-1}})$.
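Although the derivation is cut off above, Eq. (6) together with the softmax it feeds can be sketched as follows. The initial transform of $o_{ij}$, the dot-product scoring layer, and the use of the head entity's representation as the first LSTM input are all assumptions; the training objective then maximizes the returned log-probability for pairs with a ground-truth relationship and minimizes it otherwise, as described in the introduction.

```python
import torch
import torch.nn as nn

class PathReconstructor(nn.Module):
    """Sketch of the path reconstructor: an LSTM over the supervision path (Eq. 6)."""
    def __init__(self, d1):
        super().__init__()
        self.init = nn.Linear(2 * d1, d1)   # p_{b_0} from o_ij ~ [q_i : q_j] (assumed)
        self.cell = nn.LSTMCell(d1, d1)     # Eq. (6): p_{b_c} = LSTM(p_{b_{c-1}}, q_{b_{c-1}})
        self.proj = nn.Linear(d1, d1)       # scoring layer for the softmax (assumed)

    def forward(self, q, pair, path):
        """q: (N, d1) node representations; pair: (i, j) entity-node indices;
        path: node indices [b_1, ..., b_C] of the supervision path phi_n.
        Returns log P(phi_n) = sum_c log P(v_{b_c} | v_{b_1}, ..., v_{b_{c-1}})."""
        i, j = pair
        h = torch.tanh(self.init(torch.cat([q[i], q[j]], dim=-1)))  # p_{b_0}
        c = torch.zeros_like(h)
        prev = q[i]                          # q_{b_0}: taken to be the head entity (assumed)
        logp = q.new_zeros(())
        for b in path:
            h, c = self.cell(prev.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            scores = q @ self.proj(h)        # score every node against p_{b_c}
            logp = logp + torch.log_softmax(scores, dim=-1)[b]
            prev = q[b]
        return logp
```

At inference time, this accumulated log-probability can serve as the relationship indicator that is combined with the classifier probability of Eq. (4), as described in the introduction.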