# Ensemble Neural Relation Extraction with Adaptive Boosting

Dongdong Yang¹, Senzhang Wang² and Zhoujun Li³
¹ University of Southern California
² Nanjing University of Aeronautics and Astronautics
³ Beihang University
dongdony@usc.edu, szwang@nuaa.edu.cn, lizj@buaa.edu.cn

Abstract

Relation extraction has been widely studied to extract new relational facts from open corpora. Previous relation extraction methods are faced with the problem of wrong labels and noisy data, which substantially decrease the performance of the model. In this paper, we propose an ensemble neural network model, Adaptive Boosting LSTMs with Attention, to perform relation extraction more effectively. Specifically, our model first employs recurrent LSTM networks to embed each sentence. We then import attention into the LSTMs, considering that the words in a sentence do not contribute equally to its semantic meaning. Next, via adaptive boosting, we strategically build several such neural classifiers. By ensembling multiple such LSTM classifiers with adaptive boosting, we obtain a more effective and robust joint ensemble neural relation extractor. Experimental results on a real-world dataset demonstrate the superior performance of the proposed model, which improves the F1-score by about 8% compared to state-of-the-art models.

1 Introduction

Many NLP tasks have been built on knowledge bases such as Freebase and DBpedia. However, knowledge bases cannot cover all the facts in the real world. Therefore, it is essential to automatically extract more relational facts from open-domain corpora. Relation extraction (RE) aims at extracting, from the unstructured open corpus, new relation instances that are not yet contained in the knowledge bases. It aligns the entities in the open corpus with those in the knowledge bases and retrieves the entity relations from the real world. For example, if we aim to retrieve a relation from the raw text "Barack Obama married Michelle Obama 10 years ago", a naive approach would be to search news articles for indicative phrases such as "marry" or "spouse". However, the result may be wrong, since human language is inherently diverse and ambiguous.

Previous supervised RE methods require a large amount of relation training data labelled by hand. To address this issue, Mintz et al. [Mintz et al., 2009] proposed to align entities with a knowledge base so that instances can be extracted without a large labelled training corpus. However, their assumption that only one relation holds for a pair of entities is too strong. Therefore, later studies assumed that more than one relation could exist between a pair of entities. Hoffmann et al. [Hoffmann et al., 2011] proposed a multi-instance learning model with overlapping relations (MultiR) that combines a sentence-level extraction model with a component for aggregating the individual facts. Surdeanu et al. [Surdeanu et al., 2012] proposed a multi-instance multi-label learning model (MIML-RE) to jointly model all the instances of a pair of entities in text and all their labels. The major limitation of the above methods is that they cannot deeply capture the latent semantic information of the raw text. It is also challenging for them to seamlessly integrate semantic learning with feature selection to perform RE more accurately.
Recently, deep neural networks have been widely explored for relation extraction and have achieved significant performance improvements [Zeng et al., 2015; Lin et al., 2016]. Compared with traditional shallow models, deep models can capture the semantic information of a sentence more deeply. Lin et al. [Lin et al., 2016] employed a CNN with sentence-level attention over multiple instances to encode the semantics of sentences. Miwa and Bansal [Miwa and Bansal, 2016] used syntax-tree-based long short-term memory networks (LSTMs) on the sentence sequences. Ye et al. [Ye et al., 2017] proposed a unified relation extraction model that combines a CNN with a ranking loss that exploits class ties. However, the main issue of existing deep models is that their performance may not be stable, and they cannot effectively handle the quite imbalanced, noisy, and wrongly labeled data in relation extraction, even with a large number of parameters in the model.

To address the above issues, in this paper we propose a novel ensemble deep neural network model, Adaptive Boosting LSTMs with Attention (Ada-LSTMs), to extract relations from the corpus. Specifically, we first choose bi-directional long short-term memory networks to embed the forward and backward directions of a sentence for a better understanding of the sentence semantics. Considering the fact that the words in a sentence do not contribute equally to the sentence representation, we import an attention mechanism into the bi-directional LSTMs. Next, we construct multiple such LSTM classifiers and ensemble their results as the final prediction. Kim and Kang [Kim and Kang, 2010] showed that ensembles of neural networks perform better than a single neural network in prediction tasks. Motivated by their work, we import adaptive boosting and tightly couple it with deep neural networks to solve the relation extraction problem more effectively and robustly. The key role of adaptive boosting in our model is re-weighting during the training process: the weight of incorrectly classified samples increases, so these hard examples gain more attention and the classifier is forced to focus on them. Note that attention distinguishes the different importance of words within a sentence, while adaptive boosting uses sample weights to inform the training of the neural networks. The combination of the two can more precisely capture the semantic meaning of the sentences and represent them better, and thus helps us train a more accurate and robust model.

Figure 1: The framework of Ada-LSTMs contains three layers: the feature layer, the bi-directional stacked LSTMs layer with attention, and the adaptive boosting layer. si indicates the original input sentence with a pair of entities and their relation.

We summarize the contributions of this paper as follows. We propose a Multi-class Adaptive Boosting Neural Networks model, which to our knowledge is the first work that combines adaptive boosting and neural networks for relation extraction. We utilize adaptive boosting to tune the gradient descent in NN training; in this way, the large number of parameters in a single NN can be learned more robustly.
The ensembled results of multiple NN models can achieve more accurate and robust relation extraction. We evaluate the proposed model on a real data set, and the results demonstrate its superior performance, improving the F1-score by about 8% compared to state-of-the-art models.

2 Related Work

As an important and fundamental task in NLP, relation extraction has been studied extensively. Many approaches for RE have been developed, including distant supervision, deep learning, and others.

Distant supervision was first proposed by [Mintz et al., 2009] to address this issue. Mintz et al. aligned Freebase relations with the Wikipedia corpus to automatically extract instances from a large-scale corpus without hand-labeled annotation. Riedel et al. [Riedel et al., 2010] tagged all the sentences with at least one relation instead of only one. Hoffmann et al. [Hoffmann et al., 2011] also improved the previous work and aimed at solving the overlapping relation problem. Surdeanu et al. [Surdeanu et al., 2012] proposed a multi-instance multi-label method for relation extraction. Zheng et al. [Zheng et al., 2016] aggregated inter-sentence information to enhance relation extraction.

With neural networks flourishing in many fields of research, researchers also began to apply this technique to relation extraction. Zeng et al. [Zeng et al., 2014] first proposed a convolutional neural network (CNN) for relation classification. Zhang et al. [Zhang et al., 2015] proposed to utilize bi-directional long short-term memory networks to model a sentence with the sequential information of all its words. Recently, attention has been widely used in NLP tasks. Yang et al. [Yang et al., 2016] used a two-layer attention mechanism for document classification, which inspires us to focus on the word understanding level. Lin et al. [Lin et al., 2016] also added an attention layer to their CNN architecture and gained better extraction performance. Zhang et al. [Zhang et al., 2017] combined an LSTM sequence model with a form of entity position-aware attention for relation extraction.

Besides, ensemble learning is a well-known machine learning paradigm that, instead of learning a single hypothesis from the training data, strategically generates and combines multiple ones. Freund and Schapire [Freund et al., 1996] first proposed AdaBoost. Rokach [Rokach, 2010] surveyed techniques for strategically generating multiple models and combining them to improve the performance of many machine learning tasks. Li et al. [Li et al., 2017] also showed that ensemble techniques can be successfully used in transfer learning.

3 Methodology

Given a sentence si ∈ S, where S is a corpus, and its corresponding pair of entities ϕ = (e1, e2), our model aims at measuring the probability of each candidate relation Ωi ∈ L. L is defined as {1, 2, 3, ..., C}, where C is the number of relation classes. Figure 1 shows the overview of the model framework. The model mainly consists of three layers: the feature layer, the bi-directional LSTMs layer with attention, and the adaptive boosting layer. The essential notations used in this paper and their meanings are given in Table 1.

| Notation | Interpretation |
| --- | --- |
| Y | the final trained classification model |
| γt(x) | a neural network classifier |
| T | the number of trained neural network classifiers |
| αt | the weight of the neural classifier γt(x) |
| Dt | the weight vector over all samples at the t-th epoch |
| Dt(si) | the weight of sentence si at the t-th epoch |
| si | the i-th sentence in the corpus S |
| rwk | the k-th word in a sentence si |

Table 1: Notations and their meanings.
The feature layer vectorizes each sentence and embeds it as the input of the model. The bi-directional LSTMs layer with attention deeply captures the latent information of each sentence; the attention mechanism weights each phrase in a sentence and is learned during the training process. The adaptive boosting layer combines multiple classifiers to generate the final weighted joint function for classifying the relation. Next, we introduce the three layers of the proposed model in detail in the following sections.

3.1 Embedded Features

The embedded features contain word embeddings and position embeddings. We use these two embedded features as the input of the bi-directional long short-term memory networks and describe them as follows.

Word Embeddings
The inputs are raw words {rw1, rw2, ..., rwl}, where l is the length of the input sentence. Every raw word rwi is represented by a real-valued vector wi via word embedding, encoded by an embedding matrix M ∈ R^{da×|V|}, where V is a fixed-size vocabulary and da is the dimension of the word embedding. In our paper, we use the skip-gram model to train the word embeddings.

Position Embeddings
A position embedding encodes a word distance, i.e., the distance from the position of a word to the positions of the entities in a sentence. The position embedding matrix is denoted as P ∈ R^{lp×dp}, where lp is the number of distinct distances and dp is the dimension of the position embedding, as proposed by Ye et al. [Ye et al., 2017]. As there are two entities in a sentence whose distances to the word need to be measured, we obtain two dp-dimensional vectors. Therefore, the dimension of the word representation is dw = da + 2dp, and the final input vector for raw word rwi is xi = [wi, dp1, dp2], where dp1 and dp2 denote the position embeddings with respect to the two entities.
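As an illustration of how these embedded features could be assembled, the following PyTorch-style sketch concatenates the word embedding with the two position embeddings into dw = da + 2dp input vectors. It is a minimal sketch under our own assumptions: the module name and the number of distance buckets are illustrative, while the word dimension (50), position dimension (5), and vocabulary size (114,042) follow the settings reported later in the paper.

```python
import torch
import torch.nn as nn

class SentenceFeatures(nn.Module):
    """Concatenate word embeddings with two position embeddings (Section 3.1)."""

    def __init__(self, vocab_size=114042, num_distances=123, d_word=50, d_pos=5):
        super().__init__()
        # The word table could be initialised from pre-trained skip-gram vectors.
        self.word_emb = nn.Embedding(vocab_size, d_word)
        # One table per entity; indices are bucketed word-to-entity distances
        # (num_distances is a hypothetical bucket count).
        self.pos_emb_e1 = nn.Embedding(num_distances, d_pos)
        self.pos_emb_e2 = nn.Embedding(num_distances, d_pos)

    def forward(self, words, dist_e1, dist_e2):
        # words, dist_e1, dist_e2: LongTensors of shape (batch, sentence_length)
        x = torch.cat([self.word_emb(words),
                       self.pos_emb_e1(dist_e1),
                       self.pos_emb_e2(dist_e2)], dim=-1)
        return x  # (batch, sentence_length, d_word + 2 * d_pos)
```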
3.2 Multi-class Adaptive Boosting Neural Networks

The Multi-class Adaptive Boosting Long Short-term Memory Neural Networks (Ada-LSTMs) model is a joint model in which several neural networks are combined according to a weight vector α learned by the adaptive boosting algorithm. Before describing the model in detail, we explain the motivation behind it. We first analyze the distribution of a public dataset for relation extraction, which is widely used as the benchmark and was released by [Riedel et al., 2010]. The data distribution is quite unbalanced: among the 56 relation ties, 32 have fewer than 100 samples and 12 have more than 1,000 samples. Besides, as [Liu et al., 2016] discussed, the dataset suffers from wrong labels and noisy data. Thus it is difficult for a single model to achieve promising relation extraction results with such noisy and distorted training data, and it is essential to introduce a robust algorithm that alleviates the wrong-label issue and the distortions of the data.

In our model, we adopt a multi-class adaptive boosting method to improve the robustness of the neural networks for relation extraction. For the neural network part, we use LSTMs because they are naturally suitable for handling the sequential words in a sentence and capture their meanings well. For the ensemble learning part, AdaBoost is a widely used ensemble learning method that sequentially trains and ensembles multiple classifiers. The t-th classifier is trained with more emphasis on certain input samples, based on a probability distribution Dt that re-weights the samples. The original adaptive boosting [Freund et al., 1996] solves binary classification problems and processes the samples one by one. To make it fit our model, we make the modifications shown in Equations (1)-(8).

$$ Y(x) = f\Big(\sum_{t=1}^{T} \alpha_t \gamma_t(x)\Big) \qquad (1) $$

The final prediction model Y is obtained by weighted voting as shown in Equation (1), where αt is the weight of each classifier γt(x) in the final extractor Y(x) and the softmax function f predicts the labels of the relation types. Here we focus on the upper level of the model architecture; more details about the neural classifier γt(x) are given in the next section. Training the t-th classifier yields a hypothesis ht : X → L, where X is the space of input features and L = {1, ..., C} is the space of labels. The weight αt of each NN classifier γt(x) is computed from its training error εt on the training set as shown in Equation (2):

$$ \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} \qquad (2) $$

After the t-th round, the weighted error εt of the resulting classifier is calculated. During the training process, the weight αt of each classifier is learned through a parameter vector Dt, which holds the sample weights for one epoch. Different from [Freund et al., 1996], which assigns an equal initial value to each sentence in the dataset, our model assigns an equal value to each batch, where each batch contains the same number of sentences: D1(bi) = 1/n, where n is the number of batches and bi is the i-th batch of samples. D1 is the initialized weight vector in the first epoch; in our case the samples are processed batch by batch.

$$ \tau_j = \frac{1}{K}\sum_{k=1}^{K} I\big(\gamma(k) \neq y_k\big) \qquad (3) $$
$$ \epsilon_t = \frac{1}{n}\sum_{j=1}^{n} \tau_j \qquad (4) $$

Equations (3)-(4) show how to calculate the training error εt, which is used for updating the vector parameter Dt. I(γ(k) ≠ yk) indicates that the model output γ(k) is not equal to its true label yk in a batch; I is the error indicator and K is the batch size, so the errors within each batch are gathered. Finally, we average the error τj over all batches j as shown in Equation (4).

$$ c(x) = \begin{cases} e^{-\alpha_t}, & \tau < \tfrac{1}{2} \\ e^{\alpha_t}, & \tau \geq \tfrac{1}{2} \end{cases} \qquad (5) $$
$$ D_{t+1}(b_i) = \frac{D_t(b_i)}{Z_t}\, c(x) \qquad (6) $$

After calculating the weight αt of each classifier, we use it to update the vector Dt as shown in Equations (5)-(6), where Zt is a normalization constant. Dt is the weight vector for the samples at epoch t, and Dt+1 is computed from Dt by increasing the probability of incorrectly labeled samples. We maintain the weight Dt(bi) for batch bi during the learning process and then use it to inform the training of the neural networks by imposing a constraint on the gradient descent during back-propagation. Combining Equations (2)-(6) yields Equation (7):

$$ D_{t+1}(b_i) = \frac{D_t(b_i)}{Z_t}\, e^{-\alpha_t y_i \gamma_t(x_i)} \qquad (7) $$

More details on how re-weighting affects the neural networks are given in the following. During training, if the samples in a batch have been fitted well, the weight Dt(bi) drops; otherwise, if the samples are classified wrongly, the weight Dt(bi) increases so that the batch contributes more to the gradient descent while training the model. That is, on each round the weights of incorrectly classified samples are increased so that the classifier is forced to focus on the hard examples [Freund et al., 1996].
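To make the batch-level re-weighting concrete, here is a minimal NumPy sketch of one boosting round following our reading of Equations (2)-(7); the function name, the clipping of εt, and the exact closed form of αt are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def boosting_round(D_t, batch_errors):
    """One re-weighting round over n batches (cf. Equations (2)-(7)).

    D_t          : current weights over the n batches, summing to 1
    batch_errors : tau_j for each batch, i.e. the fraction of sentences in
                   batch j that the current classifier gamma_t labels wrongly
    """
    eps_t = float(np.mean(batch_errors))         # Eq. (4): averaged training error
    eps_t = float(np.clip(eps_t, 1e-12, 1 - 1e-12))  # keep the logarithm well defined
    alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)  # Eq. (2): classifier weight (assumed form)

    # Eq. (5): batches the classifier already handles well are down-weighted,
    # batches it still gets mostly wrong are up-weighted.
    c = np.where(batch_errors < 0.5, np.exp(-alpha_t), np.exp(alpha_t))
    D_next = D_t * c
    D_next /= D_next.sum()                       # Eq. (6): normalisation by Z_t
    return alpha_t, D_next

# Usage: D_1 is uniform over the n batches, as in the paper.
n = 4
D = np.full(n, 1.0 / n)
alpha, D = boosting_round(D, np.array([0.10, 0.60, 0.30, 0.70]))
```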
The weights Dt affect the gradient descent in back-propagation during training. We assign the parameter Dt(bi) to the gradient in back-propagation as shown in Equation (8); the network parameters are then updated via back-propagation with Dt, as shown in the architecture of Figure 1. In this way, the adaptive boosting algorithm informs the neural networks: we learn Dt as the weights of the samples and let it influence the networks. Finally, multiple NN classifiers are learned and combined into a joint relation extractor.

$$ \delta_{new} = \delta_{old} \cdot D_t \cdot \beta \qquad (8) $$

The pseudocode of the Ada-LSTMs model is given in Algorithm 1, where m is the total number of training samples, n is the number of batches, and εt is the training error on the training samples. δold is the final-layer derivative in back-propagation and δnew is the new derivative used to update the networks. β = 1/max(Dt) is the reciprocal of the maximal Dt(bi) value, where i is the index of the batch and s is a sentence in the corpus S; β is a coefficient that prevents Dt from becoming too small to update the NN. g is a mapping function, f is the softmax function, and Att-LSTMs denotes the LSTMs with selective attention model, which is described later.

Algorithm 1: Ada-LSTMs Model for Relation Extraction
Input: (s1, ϕ1, Ω1), (s2, ϕ2, Ω2), ..., (sm, ϕm, Ωm), where si ∈ S is a sentence in the sentence set S, ϕi is a pair of entities and Ωi ∈ L is their relation tie.
Output: final weighted extractor Y(x)
1: for t = 1 to T do
2:   initialize Dt on {1, ..., n}
3:   for s in S do
4:     look up embedding x for the words in s
5:     Att-LSTMs FORWARD(x)
6:     update δ based on Equation (8)
7:     Att-LSTMs BACKWARD
8:     calculate the training error εt of γt:
9:       εt = Pr_{Dt}[γt(xi) ≠ yi]
10:    select the classifier with the smallest error εt on Dt
11:    calculate αt and c(x) based on Equations (2)-(5)
12:    Dt+1 = g(Dt, αt, Ω, γt)
13:    γt : X → L
14:  end for
15: end for
16: final prediction model: Y(x) = f(Σ_{t=1}^{T} αt γt(x))

LSTMs with Selective Attention (Att-LSTMs)

In this part, as referenced in Algorithm 1, we elaborate the details of the proposed neural networks with selective attention (Att-LSTMs), i.e., attention-based long short-term memory networks. Recurrent neural networks have shown great strength in modeling sequential data [Miwa and Bansal, 2016]. Therefore, we make use of LSTMs to deeply learn the semantic meaning of a sentence, which is composed of a sequence of words, for relation extraction.

$$ \begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \big(W\,[x_t, h_{t-1}] + b\big), \qquad c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (9) $$
$$ h_t = o_t \odot \tanh(c_t) \qquad (10) $$

The LSTM unit is summarized in Equations (9)-(10). A sentence is initially vectorized into a sequence of encoded words {x1, x2, ..., xl} ⊂ R^{dw}, where l and dw are the length of the input sentence and the dimension of the word representations, respectively, and d denotes the dimensionality of the LSTM. As Equations (9)-(10) show, it, ft, ot, gt, ct, and ht are the input gate, forget gate, output gate, candidate memory, memory cell state, and hidden state of the cell at time t, respectively. The current memory cell state ct is the combination of ct−1 and gt, weighted by ft and it, respectively. σ denotes a non-linear activation function, and ⊙ denotes element-wise multiplication. In our implementation of relation extraction, an input sentence is tagged with the target entities and the relation type. For further use, we concatenate the hidden state vectors of the LSTMs from the two directions as the output vector ht = [→ht, ←ht] at time t. Combining the two directions of the sentence makes better use of the features for predicting the relation type.
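The pieces above come together in the inner loop of Algorithm 1 (lines 4-7): the batch is encoded, the loss is computed, and the back-propagated gradient is scaled by Dt(bi)·β as in Equation (8). The following PyTorch-style sketch is a hypothetical rendering of that step; `att_lstm` and `optimizer` are stand-ins, and scaling the loss by a scalar is used here as an equivalent way of scaling every gradient.

```python
import torch
import torch.nn.functional as F

def reweighted_train_step(att_lstm, optimizer, batch, D_t_bi, beta):
    """One gradient update for batch b_i (Algorithm 1, lines 4-7).

    D_t_bi : current boosting weight D_t(b_i) of this batch
    beta   : 1 / max_i D_t(b_i), so the scale does not collapse towards zero
    """
    words, dist_e1, dist_e2, labels = batch
    logits = att_lstm(words, dist_e1, dist_e2)   # Att-LSTMs forward pass
    loss = F.cross_entropy(logits, labels)       # cross-entropy training loss

    optimizer.zero_grad()
    # Multiplying the loss by a scalar multiplies every back-propagated gradient
    # by the same scalar, realising delta_new = delta_old * D_t(b_i) * beta.
    (loss * D_t_bi * beta).backward()
    optimizer.step()

    # Per-batch error tau_j, fed back into the boosting update of Eq. (3).
    with torch.no_grad():
        tau_j = (logits.argmax(dim=-1) != labels).float().mean().item()
    return tau_j
```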
We add an attention model [Xu et al., 2015] to the neural networks. The idea of attention is to select the most important pieces of information: since not all words contribute equally to the sentence representation, the key meaning of the sentence can be expressed by the informative words, which form a more effective vector representation via attention. We also apply dropout to both the attention layer and the bi-directional LSTMs layer. The attention mechanism is shown in Equations (11)-(12):

$$ \alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{l}\exp(e_{t'})} = \frac{\exp(f_a(h_t))}{\sum_{t'=1}^{l}\exp(f_a(h_{t'}))} \qquad (11) $$
$$ c = \sum_{t=1}^{l} \alpha_t h_t \qquad (12) $$

For each word position t, fa is a function learned during training; specifically, et = fa(ht) = σ(W ht + b), where W and b are learned during training and σ is a non-linear function. We then obtain a normalized importance weight αt through a softmax function, where l denotes the length of the sentence. Finally, we compute the sentence vector c as an adaptively weighted average of the state sequence ht. In this way, we selectively integrate information word by word with attention.

$$ L_0 = -\sum_{k=1}^{n}\sum_{i=1}^{C} y_{ki}\log(q_{ki}) \qquad (13) $$

Finally, we use cross entropy [De Boer et al., 2005] to design our loss function L0, as shown in Equation (13), where n is the total number of samples, C is the number of labels, and q = f(c), with c being the output of the attention layer and f the softmax function. Our training goal is to minimize L0.
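For concreteness, the word-level attention of Equations (11)-(12) over the concatenated bi-directional states can be sketched as follows; the module name, the choice of tanh for the non-linearity σ, and the hidden size (2 × 350, matching the LSTM unit size in Table 2) are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Selective attention over bi-LSTM states (Equations (11)-(12))."""

    def __init__(self, hidden_size=2 * 350):    # concatenated forward/backward states
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # e_t = f_a(h_t) = sigma(W h_t + b)

    def forward(self, h):
        # h: (batch, sentence_length, hidden_size), one state per word position
        e = torch.tanh(self.score(h))           # (batch, sentence_length, 1)
        alpha = torch.softmax(e, dim=1)         # Eq. (11): normalised word weights
        c = (alpha * h).sum(dim=1)              # Eq. (12): weighted sentence vector
        return c, alpha.squeeze(-1)
```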
3.3 Implementation Details

Learning Rate
We follow the method of [Kingma and Ba, 2014] to decay the learning rate. The adaptive learning-rate decay is defined as $lr_t = lr_{t-1}\cdot\frac{\sqrt{1-\beta_2^{t}}}{1-\beta_1^{t}}$, where lrt and lrt−1 are the current and previous learning rates, respectively.

L2 Regularization
L2 regularization imposes a penalty on the loss L0. For the training goal, we use the negative log likelihood of the relation labels for the pair of entities as the loss function. The L2 term is $L_2 = \lambda\sum_{i=1}^{n} W_i^2$. It should have the same order of magnitude as the loss so that the L2 regularization weighs neither too much nor too little in the training process; we set the constant λ based on this rule.

| Parameter | Value |
| --- | --- |
| Number of epochs | 40 |
| LSTM unit size | 350 |
| Dropout probability | 0.5 |
| Batch size | 50 |
| Position dimension | 5 |
| Word dimension | 50 |
| Unrolled steps of LSTMs | 70 |
| Number of neural networks | 20 |
| Initial learning rate | 10^-3 |
| L2 regularization coefficient | 10^-4 |

Table 2: Parameter settings.

4 Experiments

4.1 Dataset

We evaluate our model on the public dataset developed by [Riedel et al., 2010] (http://iesl.cs.umass.edu/riedel/ecml/). The dataset was generated by aligning the relations in Freebase with the New York Times (NYT) corpus and groups the entity relations of the NYT corpus into 56 relationships. After filtering part of the NA negative data, the training part is obtained by aligning the sentences from 2005 to 2006 in the NYT and contains 176,662 non-repeated sentences, among which there are 156,662 positive samples and 20,000 NA negative samples. The testing part is obtained from 2007 and contains 6,944 non-repeated samples, among which there are 6,444 positive samples and 500 NA negative samples.

4.2 Experiment Settings

Word Representations
Similar to [Ye et al., 2017], we keep the words that appear more than 100 times to construct the word dictionary; the vocabulary size of our dataset is 114,042. We use word2vec (http://code.google.com/p/word2vec) to train the word embeddings on the NYT corpus and set them to be 50-dimensional vectors. Additionally, each word vector is concatenated with two position embeddings (2 × 5 dimensions) to form its final representation.

Hyper-parameter Settings
Table 2 shows the parameter settings. We set some parameters empirically, such as the batch size, the word dimension, and the number of epochs. We set the weight of the L2 penalty to 10^-4 and the learning rate to 10^-3, both chosen from {10^-1, 10^-2, 10^-3, 10^-4, 10^-5}. We select 350 LSTM units based on an empirical parameter study over the set {250, 300, 350, 400, 450}. The selection of the number of classifiers is discussed in the experimental results.

4.3 Evaluation

To evaluate the proposed method, we select the following state-of-the-art methods for comparison through held-out evaluation:

Mintz [Mintz et al., 2009] is a traditional distant-supervision model that aligns relation data with Freebase.
MultiR [Hoffmann et al., 2011] is a multi-instance graphical model that handles the overlapping relations problem.
MIML [Surdeanu et al., 2012] jointly models both multiple instances and multiple relations.
CNN+ATT and PCNN+ATT [Lin et al., 2016] add an attention mechanism to the CNN and PCNN models proposed by Zeng et al. [Zeng et al., 2014; 2015]. Compared to CNN, PCNN adopts a convolutional architecture with piecewise max pooling to learn relevant features.
Rank+ExATT [Ye et al., 2017] aggregates a ranking method into the attention CNN model.

| P@N (%) | One 100 | One 200 | One 300 | One Avg | Two 100 | Two 200 | Two 300 | Two Avg | All 100 | All 200 | All 300 | All Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN+ATT | 76.2 | 65.2 | 60.8 | 67.4 | 76.2 | 65.7 | 62.1 | 68.0 | 76.2 | 68.6 | 59.8 | 68.2 |
| PCNN+ATT | 73.3 | 69.2 | 60.8 | 67.8 | 77.2 | 71.6 | 66.1 | 71.6 | 76.6 | 73.1 | 67.4 | 72.2 |
| Rank+ExATT | - | - | - | - | - | - | - | - | 83.5 | 82.2 | 78.7 | 81.5 |
| Ada-LSTMs | 82.0 | 81.0 | 76.7 | 79.9 | 85.0 | 80.5 | 77.6 | 81.0 | 95.0 | 92.5 | 92.0 | 93.1 |

Table 3: P@N comparison with state-of-the-art methods.

Figure 2: Precision-Recall curves of different methods.

In our experiments, we run the model for nearly 40 epochs, each consisting of 3,533 steps (batches). During the first 10 epochs the loss of the model drops quickly, after which it becomes relatively stable. Therefore, in the following experiments, we select the Ada-LSTMs model with 10-30 training rounds as the final joint extractor for relation extraction. We compare our Ada-LSTMs model with the above baselines; the Precision-Recall (PR) curves are shown in Figure 2. From the results, one can conclude that: (1) Our proposed method Ada-LSTMs outperforms all the baseline methods. The F1-score of our model is 0.54, which is the highest and outperforms the latest state-of-the-art model Rank+ExATT by nearly 8%. (2) Ada-LSTMs performs more robustly, since its precision-recall curve is smoother than those of the other methods: as recall increases, its precision decays noticeably more slowly, and even at low recall it does not drop as rapidly as the others.

We next evaluate our model via precision@N (P@N), i.e., the precision among the top N results, as shown in Table 3. One, Two, and All mean that we randomly select one, two, or all of the sentences for each entity pair, respectively. Here we only report the top 100, 200, and 300 precisions in our experiments.
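As a reference for the held-out evaluation, P@N can be computed from ranked extraction results as in the small sketch below, assuming each extracted fact comes with a confidence score and a binary correctness flag (hypothetical inputs, not part of the original evaluation code).

```python
def precision_at_n(scores, is_correct, n_values=(100, 200, 300)):
    """P@N: precision among the N highest-confidence extractions."""
    ranked = sorted(zip(scores, is_correct), key=lambda pair: pair[0], reverse=True)
    flags = [correct for _, correct in ranked]
    # Assumes len(scores) >= max(n_values).
    return {n: sum(flags[:n]) / float(n) for n in n_values}
```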
The experimental data of the other methods (CNN+ATT, PCNN+ATT, Rank+ExATT) are obtained from their published papers. The results show that our model outperforms all the baselines in P@N (One, Two, All, and Average). Compared to Rank+ExATT, the latest state-of-the-art model, our model achieves a significant improvement on the average of P@100, P@200, and P@300, by about 11.6%.

The effect of classifier number
To study the impact of the number of classifiers on model performance, we train Ada-LSTMs with different numbers of classifiers; the result is given in Figure 3. When the classifier number of Ada-LSTMs is relatively small, the performance increases significantly with the number of classifiers: Ada-LSTMs-10 > Ada-LSTMs-5 > Ada-LSTMs-1. However, when the classifier number becomes large, the improvement gets less significant, and the PR curves of Ada-LSTMs with 10, 20, 30, and 40 classifiers are quite similar. As mentioned, adaptive boosting plays two roles in our model: it ensembles the models, and it re-weights the gradient descent during the back-propagation of the neural networks.

Figure 3: Precision-Recall curves for different numbers of classifiers.

5 Conclusions

In this paper, we proposed to integrate attention-based LSTMs with an adaptive boosting model for relation extraction. Compared to previous models, the proposed model is more effective and robust. Experimental results on the widely used dataset show that our method significantly outperforms the baselines. In the future, it would be interesting to apply the proposed framework to other tasks, such as image retrieval and abstract extraction.

Acknowledgments

This work is supported in part by the Natural Science Foundation of China (Grant Nos. U1636211, 61602237, 61672081, 61370126), the Natural Science Foundation of Jiangsu Province (No. BK20171420), and the Beijing Advanced Innovation Center for Imaging Technology (No. BAICIT-2016001).

References

[De Boer et al., 2005] Pieter-Tjerk De Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19-67, 2005.
[Freund et al., 1996] Yoav Freund, Robert E. Schapire, et al. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pages 148-156, 1996.
[Hoffmann et al., 2011] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541-550, 2011.
[Kim and Kang, 2010] Myoung-Jong Kim and Dae-Ki Kang. Ensemble with neural networks for bankruptcy prediction. Expert Systems with Applications, 37(4):3373-3379, 2010.
[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Li et al., 2017] Dandan Li, Shuzhen Yao, Senzhang Wang, and Ying Wang. Cross-program design space exploration by ensemble transfer learning. In Proceedings of the 36th International Conference on Computer-Aided Design, pages 201-208, 2017.
[Lin et al., 2016] Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
[Liu et al., 2016] Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. Learning natural language inference using bidirectional LSTM model and inner-attention. arXiv preprint arXiv:1605.09090, 2016.
[Mintz et al., 2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 1003-1011, 2009.
[Miwa and Bansal, 2016] Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770, 2016.
[Riedel et al., 2010] Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148-163. Springer, 2010.
[Rokach, 2010] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1-39, 2010.
[Surdeanu et al., 2012] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455-465, 2012.
[Xu et al., 2015] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, pages 2048-2057, 2015.
[Yang et al., 2016] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[Ye et al., 2017] Hai Ye, Wenhan Chao, and Zhunchen Luo. Jointly extracting relations with class ties via effective deep ranking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
[Zeng et al., 2014] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. Relation classification via convolutional deep neural network. In Proceedings of the 24th International Conference on Computational Linguistics, pages 2335-2344, 2014.
[Zeng et al., 2015] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 17-21, 2015.
[Zhang et al., 2015] Shu Zhang, Dequan Zheng, Xinchen Hu, and Ming Yang. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pages 73-78, 2015.
[Zhang et al., 2017] Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35-45, 2017.
[Zheng et al., 2016] Hao Zheng, Zhoujun Li, Senzhang Wang, Zhao Yan, and Jianshe Zhou. Aggregating inter-sentence information to enhance relation extraction. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3108-3115, 2016.