# Learning Co-Substructures by Kernel Dependence Maximization

Sho Yokoi¹, Daichi Mochihashi², Ryo Takahashi¹, Naoaki Okazaki¹, Kentaro Inui¹
¹Tohoku University, Sendai, Japan
²The Institute of Statistical Mathematics, Tokyo, Japan
{yokoi, ryo.t, okazaki, inui}@ecei.tohoku.ac.jp, daichi@ism.ac.jp

Modeling associations between items in a dataset is a problem that is frequently encountered in data and knowledge mining research. Most previous studies have simply applied a predefined, fixed pattern for extracting the substructure of each item pair and then analyzed the associations between these substructures. Such fixed patterns may, however, fail to capture significant associations. We therefore propose the novel machine learning task of extracting a strongly associated substructure pair (co-substructure) from each input item pair. We call this task dependent co-substructure extraction (DCSE) and formalize it as a dependence maximization problem. We then discuss two critical issues with this task: the data sparsity problem and a huge search space. To address the data sparsity problem, we adopt the Hilbert-Schmidt independence criterion as an objective function. To improve search efficiency, we adopt the Metropolis-Hastings algorithm. We report the results of empirical evaluations, in which the proposed method is applied to acquiring and predicting narrative event pairs, an active task in the field of natural language processing.

## 1 Introduction

Modeling associations between items in a dataset is a general problem commonly addressed in a broad range of data and knowledge mining research. For example, the valuable natural language processing tasks of extracting narrative event pairs (e.g., ⟨X commit a crime, X be arrested⟩) as components of script knowledge [Chambers and Jurafsky, 2008] and learning the selectional preferences of predicates (e.g., food is preferred as an object of eat) [Resnik, 1997] model associations between event pairs and predicate-argument pairs, respectively.

The common approach to modeling associations usually involves three steps; we use the approach proposed by Chambers and Jurafsky (C&J) as an example. In the first step, associated pairs of items are collected from a dataset as positive samples; in the C&J method, these are sentence pairs that include co-referent people or objects from a text corpus (e.g., the sentence pair "Tomi killed Nancy." and "The police arrested Tomi immediately." is collected because of the co-referent Tomi in both sentences). In the second step, an abstract representation (substructure) is extracted from each item pair; the C&J method utilizes head predicates coupled with argument slots (e.g., ⟨X kill, arrest X⟩). In the third step, the association between the extracted substructure pairs is modeled; the C&J method utilizes pointwise mutual information (PMI) to measure the association. In this way, ⟨X kill, arrest X⟩ (for example) might be chosen as a narrative event pair because its PMI value is large; in contrast, event pairs with low PMI values, such as ⟨X kill, X graduate⟩, would be discarded.
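To make the third step concrete, here is a minimal Python sketch (our illustration, not code from the paper) that scores extracted substructure pairs by PMI estimated from co-occurrence counts; the toy pairs are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """Score substructure pairs (x, y) by pointwise mutual information,
    PMI(x, y) = log( N * c(x, y) / (c(x) * c(y)) ), estimated from counts."""
    n = len(pairs)
    joint = Counter(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    return {(x, y): math.log(n * c / (cx[x] * cy[y])) for (x, y), c in joint.items()}

# Toy usage: the frequently co-occurring pair gets a positive PMI,
# the incidental pair a negative one.
pairs = [("X kill", "arrest X")] * 3 + [("X eat", "X full")] * 3 + [("X kill", "X full")]
print(pmi_scores(pairs))
```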
We believe the second step mentioned above can be improved because, in most previous studies, intuitively defined, fixed patterns were simply applied for extracting substructures, without much optimization for the third step. For example, the C&J method uses a simple syntactic pattern, namely one predicate with a co-referent argument slot, such as X kill. With such a fixed pattern, however, the intended associations between item pairs may not be captured well. Fig. 1 shows an example of this, where we assume that we are mining narrative event pairs from the set D of sentence pairs.

[Figure 1: Example of the input and output of dependent co-substructure extraction (DCSE) for acquiring narrative event pairs.]

If we were to use the syntactic pattern of the C&J method, we would obtain event pairs such as ⟨X have, X full⟩ and ⟨X have, X sad⟩. However, the event representation X have is clearly too abstract to capture the full associations with the distinct events X full and X sad. Ideally, we want to acquire event pairs such as ⟨X have dinner, X full⟩ and ⟨X have trouble, X sad⟩, as illustrated by the set Z in Fig. 1. Therefore, we should ideally be able to flexibly choose substructures of arbitrary size that are the most appropriate for capturing the associations.

This issue motivates us to consider a new machine learning task, which we call dependent co-substructure extraction (DCSE). In this task, a pair of strongly associated substructures (x_i, y_i) is extracted for each input item pair (s_i, t_i), as illustrated in Fig. 1. We select an appropriate level of abstraction for the knowledge to mine based on the association strength. In Fig. 1, for example, we want to include "dinner" in the substructure of the first pair but not "at restaurant", using no predefined pattern.

In this paper, we first formalize the task of DCSE as a dependence maximization problem; then, we discuss two critical issues with the task: the data sparsity problem and the huge search space (Sec. 2). We propose adopting the Hilbert-Schmidt independence criterion (HSIC) as an objective function to cope with the data sparsity problem (Sec. 3) and the Metropolis-Hastings (MH) algorithm to boost search efficiency (Sec. 4). Finally, we demonstrate the superiority of the proposed method via experiments in two scenarios, namely knowledge acquisition and predicting narrative event pairs (Sec. 5).

## 2 DCSE as Dependence Maximization

First, we formalize our dependence maximization problem: for each given pair of items, find a strongly associated pair of substructures based on the dependence maximization principle, as follows.

**Dependent co-substructure extraction (DCSE).**
Given: A set D = {(s_i, t_i)}_{i=1}^N of item pairs, where s_i ∈ S and t_i ∈ T are raw items and each item pair (s_i, t_i) represents a specific relation of interest (e.g., co-reference or co-occurrence). S and T are the sets containing all the raw data {s} and {t}, respectively.
Find: A set Z = {(x_i, y_i)}_{i=1}^N of substructure pairs (co-substructures) that maximizes the dependence (given below), where x_i ⊑ s_i and y_i ⊑ t_i for each i, and x ⊑ s denotes that x is a substructure of s.

To estimate the dependence, we regard each solution Z as N independent samples drawn from some joint distribution:

$$ Z = \{(x_i, y_i)\}_{i=1}^{N} \sim P_{XY}. \tag{1} $$
Then, we estimate the dependence between X and Y by the distance between the joint density P_XY and the product of the marginals P_X P_Y.

There are several possible ways of measuring the distance between P_XY and P_X P_Y; one straightforward way is to employ mutual information (MI) as a dependence measure. MI has been utilized in various knowledge mining and machine learning studies to measure association strength and dependence [Church and Hanks, 1990; Maes et al., 1997; Turney, 2002; Torkkola, 2003; Peng et al., 2005]. Using MI, the dependence of Z = {(x_i, y_i)}_{i=1}^N is computed as

$$ \mathrm{MI}(Z) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} \tag{2} $$
$$ \qquad = \mathrm{KL}[\,P_{XY}\,\|\,P_X P_Y\,], \tag{3} $$

where KL[·‖·] denotes the Kullback-Leibler divergence between two distributions. Adopting MI in DCSE does, however, pose two critical problems:

**Data sparsity.** Our search space includes substructure pairs of arbitrary size. In the case of Fig. 1, for example, we may consider a specific substructure such as "She has big dinner" as a candidate substructure. Thus, we encounter a data sparsity problem if the probability distribution of substructures is naïvely estimated by counting occurrences in the data.

**Huge search space.** The search space of DCSE can be prohibitively large. The optimal co-substructure for a given input item pair depends on the co-substructure choices for the other item pairs, which makes searching difficult. In other words, one cannot reach the global optimum simply by locally choosing seemingly good co-substructures. There is, therefore, a need to devise an approximation method to improve search efficiency.

We propose solutions to these problems in the subsequent two sections.

## 3 Objective Function: HSIC

To cope with the data sparsity problem, we propose adopting the Hilbert-Schmidt independence criterion (HSIC) [Gretton et al., 2005], instead of MI, as the objective function. HSIC is a kernel-based dependence measure with low computational cost that has been used in a range of machine learning tasks such as feature selection [Song et al., 2012], dimensionality reduction [Fukumizu et al., 2009], and unsupervised object matching [Quadrianto et al., 2009].

Intuitively, HSIC can be seen as a smoothed version of MI, where the candidate substructure counts are smoothed using similarities between substructures. Consider the example in Fig. 1 again. Using HSIC, the (semantic) similarity between the substructures "have dinner" and "eat" can be taken into account when estimating the dependence of Z, which smooths the counts of those substructures.

Let X and Y be random variables with ranges 𝒳 and 𝒴, respectively (i.e., 𝒳 consists of all candidate substructures {x : x ⊑ s, s ∈ S}, and similarly for 𝒴), and let Z = {(x_i, y_i)}_{i=1}^N ⊆ 𝒳 × 𝒴 be a series of N independent observations drawn from the joint distribution P_XY. The HSIC value¹ of Z, which estimates the degree of dependence between X and Y, is

$$ \mathrm{HSIC}(Z; k, \ell) = \frac{1}{N^2}\operatorname{tr}(KHLH) = \frac{1}{N^2}\operatorname{tr}(\bar{K}\bar{L}). \tag{4} $$

In this equation, k: 𝒳 × 𝒳 → R and ℓ: 𝒴 × 𝒴 → R are positive definite kernels that serve as similarity functions over substructures; K = (k(x_i, x_j)) ∈ R^{N×N} and L = (ℓ(y_i, y_j)) ∈ R^{N×N} are Gram matrices, which serve as the similarity matrices given by the kernel functions k and ℓ; and K̄ = HKH ∈ R^{N×N} and L̄ = HLH ∈ R^{N×N} are centered Gram matrices, where H = (δ_ij − 1/N) ∈ R^{N×N}.

¹Precisely, HSIC measures the dependence between two random variables, X and Y; Eq. 4 gives an empirical estimator of HSIC.
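As a concrete reference for Eq. 4, the following is a minimal NumPy sketch of the empirical HSIC estimator (our illustration; the Gram matrices are assumed to be precomputed from whatever kernels k and ℓ are chosen).

```python
import numpy as np

def hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Empirical HSIC (Eq. 4): (1/N^2) tr(K H L H) for N x N Gram matrices K, L."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix, H_ij = delta_ij - 1/N
    K_bar = H @ K @ H                      # centered Gram matrix
    L_bar = H @ L @ H
    return float(np.trace(K_bar @ L_bar) / n**2)

# Usage: K[i, j] = k(x_i, x_j) and L[i, j] = l(y_i, y_j) for the current solution Z.
```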
To elaborate on the intuition that HSIC is a smoothed version of MI, let us first consider a smoothed version of PMI. We define pointwise HSIC (PHSIC): 𝒳 × 𝒴 → R for a given Z as follows:

$$ \mathrm{PHSIC}(x, y; Z) := \sum_{i=1}^{N} \bar{k}(x, x_i; \{x_n\})\,\bar{\ell}(y, y_i; \{y_n\}), \tag{5} $$

where {x_n} denotes {x_1, ..., x_N}. The function k̄(·, ·; {x_n}): 𝒳 × 𝒳 → R is defined by

$$ \bar{k}(x, x'; \{x_n\}) := k(x, x') - \frac{1}{N}\sum_{j=1}^{N} k(x, x_j) - \frac{1}{N}\sum_{i=1}^{N} k(x_i, x') + \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} k(x_i, x_j), \tag{6} $$

which gives the similarity between x and x′ centered in feature space; namely, if addition is defined on 𝒳, k̄(x, x′; {x_n}) equals k(x − x̄, x′ − x̄), where x̄ denotes the mean of {x_n}. The PHSIC value of x and y is essentially increased by the presence of other samples (x_i, y_i) that are similar to (x, y), i.e., x_i ≈ x and y_i ≈ y; this enables smoothing across similar items. Moreover, HSIC corresponds to the summation of PHSIC values, paralleling the relationship between MI and PMI (i.e., MI is the summation of PMI values). This is the sense in which HSIC can be seen as a smoothed version of MI. The relationship between MI and HSIC is summarized in Table 1.

Table 1: Relationship between MI and HSIC. ⊻ denotes exclusive or. When the condition indicated by the + sign is satisfied, the PMI/PHSIC value increases; when the condition indicated by the − sign is satisfied, the PMI/PHSIC value decreases. I[cond] = 1 if the condition is true and 0 otherwise. Note that the elements of the centered Gram matrix K̄ can be expressed as K̄_ij = k̄(x_i, x_j; {x_n}), and HSIC(Z) = (1/N²) Σ_ij K̄_ij L̄_ij.

| | consistency of (x, y) with (x_i, y_i) ∈ Z | consistency of (x, y) with Z | estimate of dependence |
|---|---|---|---|
| MI | +: x = x_i ∧ y = y_i;  −: x = x_i ⊻ y = y_i | PMI(x, y; Z) = log [ N Σ_i I[x = x_i ∧ y = y_i] / (Σ_i I[x = x_i] · Σ_i I[y = y_i]) ] | MI(Z) = (1/N) Σ_i PMI(x_i, y_i) |
| HSIC | +: k̄(x, x_i; {x_n}) ℓ̄(y, y_i; {y_n}) > 0;  −: k̄(x, x_i; {x_n}) ℓ̄(y, y_i; {y_n}) < 0 | PHSIC(x, y; Z) = Σ_i k̄(x, x_i; {x_n}) ℓ̄(y, y_i; {y_n}) | HSIC(Z) = (1/N²) Σ_i PHSIC(x_i, y_i) |

This smoothing is a strong advantage of HSIC. In knowledge mining and natural language processing, a wide range of methods can be used for estimating similarities between words, phrases, and their arbitrary substructures, from classical thesaurus-based methods to modern embedding-based ones. Using such a similarity function, one can take the co-occurrence ⟨eat, full⟩ into account when estimating the association strength of, for example, ⟨have dinner, full⟩. By solving the HSIC maximization problem, strongly associated co-substructures of arbitrary size can be extracted while coping with the data sparsity problem.

## 4 Search: Based on Metropolis-Hastings

In order to find a near-optimal Z in a huge search space, we adopt an approach based on Metropolis-Hastings (MH) sampling [Chib and Greenberg, 1995]. We consider the probability distribution

$$ p(Z; k, \ell, \beta) \propto \exp\big(\beta \cdot \mathrm{HSIC}(Z; k, \ell)\big), \tag{7} $$

where β is an inverse temperature parameter. The larger the HSIC value, the higher the probability of Z. By sampling Z from the distribution given by Eq. 7 while changing Z step by step, Z is expected to converge to its optimal value at a substantially lower computational cost than that of a full search (Fig. 2).

[Figure 2: Overview of Metropolis-Hastings sampling.]

The details of the sampling procedure are as follows.
1. Let Z = {(x_i, y_i)}_{i=1}^N be the current sample.
2. Draw a new candidate Z′ by changing only one substructure. Specifically, first draw x_i or y_i from Z = {(x_i, y_i)}_{i=1}^N uniformly: ∀i, p(x_i|Z) = p(y_i|Z) = 1/(2N). Then, propose an x′_i for the given x_i using a proposal distribution q(x′_i|x_i) (a specific example of q(x′_i|x_i) is given in Sec. 5.1). Thus, the proposal distribution for drawing a new candidate Z′ = {..., (x′_i, y_i), ...} from a given Z = {..., (x_i, y_i), ...} by changing only x_i is
$$ q(Z'|Z) = q(x'_i|x_i)\,p(x_i|Z) = \frac{1}{2N}\,q(x'_i|x_i). \tag{8} $$
3. Accept Z′ with probability min(1, r), where
$$ r = \frac{p(Z'; k, \ell, \beta)}{p(Z; k, \ell, \beta)}\cdot\frac{q(Z|Z')}{q(Z'|Z)} \tag{9} $$
$$ \quad = \exp\big(\beta\,(\mathrm{HSIC}(Z'; k, \ell) - \mathrm{HSIC}(Z; k, \ell))\big)\cdot\frac{q(x_i|x'_i)}{q(x'_i|x_i)}. \tag{10} $$
4. Repeat Steps 2-3.

Because HSIC is a kernel-based measure, its computational cost may be a concern. In practice, we need to compute the Gram matrices, at a cost of O(N²), only for the first iteration; for each subsequent iteration of MH sampling, it suffices to update only one row of the Gram matrix, at a cost of O(N). In addition, computing the HSIC takes only O(Nκ²) time via rank-κ incomplete Cholesky decomposition [Gretton et al., 2005] (Lemma 2).
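The following Python sketch illustrates this sampling loop (our illustration, not the authors' code). It assumes a user-supplied `propose(x)` implementing q(x′|x) that returns a candidate together with the forward and backward proposal probabilities, and for brevity it recomputes the full HSIC at every step instead of performing the O(N) row update described above.

```python
import random
import numpy as np

def mh_dcse(Z, kernel_k, kernel_l, propose, beta=1e8, n_steps=100_000):
    """Metropolis-Hastings search for a high-HSIC set of co-substructures Z.

    Z: list of (x, y) pairs; propose(x) -> (x_new, q_fwd, q_bwd), i.e. a candidate
    substructure plus the proposal probabilities q(x_new|x) and q(x|x_new).
    """
    def hsic(pairs):
        n = len(pairs)
        H = np.eye(n) - np.ones((n, n)) / n
        K = np.array([[kernel_k(xi, xj) for xj, _ in pairs] for xi, _ in pairs])
        L = np.array([[kernel_l(yi, yj) for _, yj in pairs] for _, yi in pairs])
        return np.trace(H @ K @ H @ L) / n**2   # equals (1/N^2) tr(KHLH)

    current = hsic(Z)
    for _ in range(n_steps):
        i = random.randrange(len(Z))           # pick one pair uniformly (prob. 1/(2N) per side)
        change_x = random.random() < 0.5       # change x_i or y_i
        x, y = Z[i]
        new, q_fwd, q_bwd = propose(x if change_x else y)
        Z_new = list(Z)
        Z_new[i] = (new, y) if change_x else (x, new)
        proposed = hsic(Z_new)
        r = np.exp(beta * (proposed - current)) * (q_bwd / q_fwd)   # Eq. 10
        if random.random() < min(1.0, r):      # accept with probability min(1, r)
            Z, current = Z_new, proposed
    return Z
```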
## 5 Experiments

We evaluated the performance of the proposed method in two scenarios.

*Knowledge acquisition* extracts a set of abstract representation pairs, Z = {(x, y)}, from a corpus. We feed an input D = {(s, t)} to a DCSE method and then verify whether the output Z is reasonable and interpretable. In order to examine the behavior of the proposed method, we perform knowledge acquisition on a synthetic dataset in Sec. 5.2.

*Prediction* constructs a model from Z and computes the relevance of a new pair (s, t) based on the relevance of its substructure pair (x, y). In other words, whether a new pair (s, t) is related is determined based on the score of its abstract representation (x, y). We report the results of experiments on real corpora in Sec. 5.3.

### 5.1 Experimental Settings

**Data Representation.** We adopted dependency trees as the representations of the inputs s_i and t_i, and their rooted subtrees as the representations of the output substructures x_i and y_i (Fig. 1).

**Kernel Function.** HSIC requires two positive definite kernels, k and ℓ (Eq. 4), that compute the similarity between two subtrees of dependency trees. We employed the cosine similarity² between vector representations, which has several desirable properties: (i) it is the most standard function for measuring similarity between phrases using word vectors; (ii) it has no hyperparameters; and (iii) its computational cost is low. Let x_i, x_j ∈ 𝒳 be rooted subtrees. We then defined a kernel function k by

$$ k(x_i, x_j) = \cos\big(v_{\mathrm{tree}}(x_i), v_{\mathrm{tree}}(x_j)\big), \tag{11} $$

and another kernel function ℓ analogously. The vector v_tree(x) is the average of the word vectors for the word set {w} constituting the subtree x:

$$ v_{\mathrm{tree}}(x) = \mathrm{average}\big(\{\,v(w) : w \in x,\ w \in V\,\}\big). \tag{12} $$

Here, v(·) denotes 300-dimensional pre-trained word vectors³, and V represents a vocabulary set⁴.

²cos(·, ·): R^d × R^d → R is a positive definite kernel that satisfies the application conditions of HSIC.
³https://code.google.com/archive/p/word2vec/
⁴Stop words in http://www.ranks.nl/stopwords/ are excluded.

If the product kernel k(·, ·) ℓ(·, ·) is characteristic on 𝒳 × 𝒴, then HSIC(X, Y; k, ℓ) = 0 if and only if X and Y are independent [Muandet et al., 2016]. Therefore, when testing independence with HSIC, the kernels should be characteristic (e.g., the Gaussian kernel or the Laplacian kernel). In this study, however, we are more interested in the case of dependence (where the HSIC value is large) rather than independence (where the HSIC value is small). The HSIC can therefore be estimated sufficiently well by considering only low-order moments of the probability distribution, and using a non-characteristic kernel has little negative effect. In fact, replacing the cosine kernel with the Gaussian kernel had almost no impact in our experiments (with the same setup otherwise).
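A minimal sketch of this kernel (ours, not the authors' code), assuming a subtree is given as its list of words, `word_vec` is a dict of pre-trained 300-dimensional vectors, and `stop_words` is the stop-word set:

```python
import numpy as np

def v_tree(subtree_words, word_vec, stop_words):
    """Average the word vectors of the in-vocabulary, non-stop words of a subtree (Eq. 12)."""
    vecs = [word_vec[w] for w in subtree_words if w in word_vec and w not in stop_words]
    return np.mean(vecs, axis=0)

def k(subtree_a, subtree_b, word_vec, stop_words):
    """Cosine-similarity kernel between two rooted subtrees (Eq. 11)."""
    va = v_tree(subtree_a, word_vec, stop_words)
    vb = v_tree(subtree_b, word_vec, stop_words)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```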
**Proposal Distribution.** MH sampling uses a proposal distribution q(x′|x) that suggests a next candidate x′ ⊑ s given the current candidate x ⊑ s (Eq. 8), and similarly uses q(y′|y) for y′. For a given x ⊑ s, let M(x) be the set of subtrees of s that are obtained by stretching or shrinking exactly one edge of x. The experiments used the q(x′|x) that yields 1/|M(x)| if x′ ∈ M(x) and 0 otherwise (Fig. 3).

[Figure 3: Proposal distribution q(x′|x) used in the experiments.]

### 5.2 Knowledge Acquisition from Synthetic Data

In order to verify that the proposed method yields reasonable and interpretable paired abstract representations, we prepared a small synthetic dataset consisting of 12 pairs of sentences.

**Results.** Fig. 4 shows the experimental results. The upper half shows the input D and the lower half shows the output Z obtained via the proposed method. For example, the method abstracted the first input (s_1, t_1) = ⟨They had breakfast at the eatery., They are full now.⟩ to (x_1, y_1) = ⟨had breakfast, full⟩ (number 1 in the figure). The heat maps in Fig. 4 represent the centered Gram matrices. For example, the bottom-left heat map shows the similarity matrix K̄ for {x_1, ..., x_N}; the element at (1, 9) represents the value of k̄(x_1, x_9) = k̄(had breakfast, had trouble).

[Figure 4: Results obtained via the knowledge acquisition task. The upper and lower halves respectively show the input D (12 sentence pairs, e.g., ⟨They had breakfast at the eatery., They are full now.⟩, ⟨He had lunch with his friends at eleven., He felt happy seeing his friends.⟩, ⟨He had trouble with his homework., He cried in despair.⟩) and the output Z (e.g., ⟨had breakfast, full⟩, ⟨had lunch friends, felt happy⟩, ⟨had trouble, cried despair⟩). The heat maps represent the centered Gram matrices.]

**Discussion.** The proposed method successfully found the common substructures in the inputs. For example, the method recognized the co-substructure ⟨have breakfast, full⟩ in the first block because it was common to many inputs. By contrast, the method pruned rare words that were unimportant for modeling the association; for example, it removed "at my house" in s_4 from the first block.

In addition, the proposed method handles surface variation flexibly by considering word similarity. Although the words "dinner" (in s_4) and "lunch" (in s_7) each appeared only once in the corpus, the method found an appropriate common co-substructure by recognizing that they are similar to "breakfast". The method ultimately recognized three clusters in the input data, namely ⟨eat meals, be full⟩, ⟨eat meals with friends, feel happy⟩, and ⟨have trouble, cry⟩, from the upper, middle, and lower four pairs, respectively. We can also confirm this behavior by observing K̄ (the bottom-left heat map in Fig. 4): the events within the first block are strongly similar to one another, as are those within the second block (the squares surrounded by red lines in Fig. 4).

Furthermore, we can observe that "(with) friends" remained in the abstract representation in the second block.
We infer that this is because removing that expression from the second block would result in the merging of the first and second blocks on the X side, whereas the information in the first block (full) and the second block (felt happy) on the Y side is entirely different. Such a merger would have been undesirable behavior, decreasing the ability to predict t from s. The proposed method prevented this behavior by observing the Gram matrices of X and Y; the red and blue frames in Fig. 4 indicate that the first and second blocks are not merged.

### 5.3 Prediction on Real Corpora

In order to demonstrate the effectiveness of the proposed method on a real dataset and task, we conducted an experiment on the pairwise classification of narrative event pairs. We first learned an association model with positive paired data gathered from corpora and then measured the model's prediction performance on a test dataset.

**Dataset.** Table 2 provides the data statistics for the prediction task. We used the following two corpora:
- The Gigaword Corpus⁵ [Graff and Cieri, 2003]: a large collection of English newswire text that has been used in several previous studies [Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Granroth-Wilding and Clark, 2016]. We used 17,781 documents published in the year 2000 from the New York Times (NYT) portion.
- The Andrew Lang Fairy Tale Corpus⁶: a small collection of children's stories that has been used in a previous study [Jans et al., 2012]. We used all 437 stories in this experiment.

⁵https://catalog.ldc.upenn.edu/ldc2003t05/
⁶http://www.mythfolklore.net/andrewlang/

Table 2: Data statistics for the prediction task.

| Corpus | Collection | #All | #Training | #Test (pos) | #Test (neg) |
|---|---|---|---|---|---|
| Gigaword | regular | 16,748 | 10,000 | 500 | 500 |
| Fairy Tale | 2-skip | 1,673 | 1,000 | 100 | 100 |

Applying Stanford CoreNLP version 3.7.0 [Manning et al., 2014] to the raw text of the corpora, we extracted sentence pairs sharing co-referring arguments. When handling the Gigaword and Fairy Tale corpora, we used regular bigrams and 2-skip bigrams, respectively [Jans et al., 2012]. Next, we filtered the sentence pairs using the following conditions (a sketch of this filter is given below):
- 4 ≤ (number of tokens in the sentence) ≤ 30;
- the POS tag of the root node of the dependency tree is in the set {VB, VBD, VBG, VBN, VBP, VBZ};
- the word at the root node of the dependency tree is not in the set {be, am, are, is, was, were, 'm, 're, 's}; and
- the position of every protagonist, as seen from the predicate verb, is in the set {nsubj, dobj}.

We collected all sentences satisfying the above conditions into a set of positive sentence pairs D^(all).
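A minimal sketch of the above filter (our illustration, not the authors' code); `ParsedSentence` and its fields are hypothetical stand-ins for whatever structure the CoreNLP output is converted into:

```python
from dataclasses import dataclass
from typing import List

ROOT_POS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
BE_FORMS = {"be", "am", "are", "is", "was", "were", "'m", "'re", "'s"}
PROTAGONIST_RELS = {"nsubj", "dobj"}

@dataclass
class ParsedSentence:            # hypothetical container for a dependency-parsed sentence
    tokens: List[str]
    root_pos: str                # POS tag of the dependency-tree root
    root_word: str               # word at the dependency-tree root
    protagonist_rels: List[str]  # dependency relations linking protagonists to the root verb

def keep(sent: ParsedSentence) -> bool:
    """Return True if the sentence passes all four filtering conditions."""
    return (4 <= len(sent.tokens) <= 30
            and sent.root_pos in ROOT_POS
            and sent.root_word.lower() not in BE_FORMS
            and all(rel in PROTAGONIST_RELS for rel in sent.protagonist_rels))
```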
Finally, we randomly chose positive sentence pairs from this set to construct the training set and the test set D_P^(test) (without overlap). We obtained pseudo-negative sentence pairs D_N^(test) = {(s′, t′)} for the test set by randomly extracting s′ and t′ from the positive sentence pairs D^(all) = {(s, t)}.

**Performance Measure: AUC.** We used the area under the receiver operating characteristic curve (AUC-ROC, or simply AUC) to evaluate the performance of the different scoring functions. This task is a pairwise binary classification problem and is essentially a version of the conventional cloze test for narrative event chains. In binary classification/ranking problems, AUC-ROC is commonly used as an evaluation metric; unlike the area under the precision-recall curve (AUC-PR), it is a stable and robust measure even when the ratio of positive to negative examples in the test set is skewed [Fawcett, 2006]. Given a set of positive examples D_P = {(s, t)} and a set of negative examples D_N = {(s′, t′)}, the AUC can be computed using any scoring function f: S × T → R as

$$ \mathrm{AUC} = \frac{1}{|D_P|\,|D_N|} \sum_{(s,t) \in D_P} \sum_{(s',t') \in D_N} I[f(s, t) > f(s', t')], \tag{13} $$

where I[cond] = 1 if the condition is true and 0 otherwise.

**Experimental Procedure.** Here, we explain the generic procedure for computing AUCs for both the proposed and baseline methods.

Training: train a model g: 𝒳 × 𝒴 → R.
1. Abstraction: generate abstract event pairs Z for a given training set D^(train).
2. Training: construct an association model g between abstract representations from Z.

Test: compute the score f(s, t) for each (s, t) in the test set.
1. Abstraction: convert the given pair (s, t) to an abstract representation (x, y) using the method in question.
2. Scoring: compute the score g(x, y) and regard it as the score f(s, t) for the original representation.

**Baseline Method 1 (C&J'08).** The first baseline method is the one proposed by [Chambers and Jurafsky, 2008]. To abstract raw sentences, it focuses on the heads of predicate verbs and the positions of protagonists (Sec. 1). The model g(x, y) uses PMI under Z:

$$ \mathrm{PMI}(x, y; Z) = \log \frac{N\,c(x, y)}{\sum_{y'} c(x, y')\,\sum_{x'} c(x', y)}, \tag{14} $$

where c(x, y) denotes the frequency of (x, y) in Z.

**Baseline Method 2 (Jans et al.'12).** The second baseline method is the one proposed by [Jans et al., 2012]. In this method, the abstract representations are identical to those of C&J'08. The model g(x, y) computes the logarithm of a conditional probability under Z:

$$ g(x, y) = \log p(y \mid x; Z) = \log \frac{c(x, y)}{\sum_{y'} c(x, y')}. \tag{15} $$

**Baseline Method 3 (C&J'08 + PHSIC).** We also consider a kernelized version of C&J'08 using PHSIC, which is, intuitively, smoothed PMI. We define the following kernel function between verb-dependency tuples (v, d):

$$ k\big((v_1, d_1), (v_2, d_2)\big) = \begin{cases} \cos\big(v(v_1), v(v_2)\big) & (d_1 = d_2) \\ 1 & (\text{otherwise}). \end{cases} \tag{16} $$

**Proposed Method (DCSE + PHSIC).** The proposed method uses DCSE, realized using HSIC and MH, for abstraction (Sec. 3 and Sec. 4), and PHSIC for the model. Note that DCSE is performed for each instance in the test set.

Training:
1. Given the training set D^(train), perform DCSE by maximizing the HSIC, generating abstract event pairs Z (Sec. 2). We ran the MH sampler with β = 10^8 to draw 7 × 10^5 and 2 × 10^5 samples for the Gigaword and Fairy Tale corpora, respectively.
2. The model is PHSIC under Z (Eq. 5).

Test:
1. Taking {(s, t)} ∪ Z as the input, perform DCSE with Z fixed to obtain (x, y) such that x ⊑ s, y ⊑ t.
2. f(s, t) is defined by g(x, y) = PHSIC(x, y; Z).

[Figure 5: ROC curves on the prediction task for (a) Gigaword and (b) Fairy Tale, comparing C&J'08, Jans et al.'12, C&J'08+PHSIC, and the proposed method.]

Table 3: AUC values for the prediction task. The best result in each column is shown in bold.

| Method | Abstraction | Model | Gigaword | Fairy Tale |
|---|---|---|---|---|
| C&J'08 | Fixed (C&J) | PMI | 0.553 | 0.596 |
| Jans et al.'12 | Fixed (C&J) | Conditional | 0.556 | 0.576 |
| C&J'08 + PHSIC | Fixed (C&J) | PHSIC | 0.518 | 0.518 |
| Proposed | DCSE | PHSIC | **0.633** | **0.646** |

**Results and Discussion.** Fig. 5 shows the ROC curves obtained for each method. The x- and y-axes denote the false positive rate and the true positive rate, respectively; the area under each curve corresponds to the AUC. Table 3 summarizes the AUC values of the different methods for each dataset.

The experimental results show that the proposed method outperforms all the existing methods on both datasets. Moreover, its prediction performance was better than that of the baseline methods over the entire ROC curve (Fig. 5). These results indicate that changing from a fixed abstract representation (C&J) to DCSE yields a considerable performance improvement in the prediction task.

A comparison between C&J and C&J+PHSIC highlights that there is no advantage in integrating PHSIC with the fixed abstract representation. The experimental results imply that simply applying PHSIC to a fixed abstract representation does not improve predictive performance (C&J'08 + PMI > C&J'08 + PHSIC). These facts further support the effectiveness of determining an abstract representation optimized for each instance (DCSE).

The experimental results for real corpora also show that the proposed method can capture significant associations from the original sentences more accurately than the existing methods. For instance, from the sentence pair ⟨Hasegawa had a team-high 10 wins last season., He pitched in with nine saves while Troy Percival was hurt and had an ERA of 3.57 in a team-leading 66 appearances.⟩, our method extracted ⟨had wins last season, pitched nine saves⟩, while the existing methods extracted the abstract representation ⟨X have, X pitch⟩, which cannot readily be interpreted.

## 6 Conclusion

In this paper, we have addressed the problem of determining abstract representations when modeling the associations between items in a dataset. We have proposed a new machine learning task called dependent co-substructure extraction (DCSE) that extracts strongly dependent substructure pairs from associated pairs. The proposed method incorporates HSIC (Sec. 3) and MH sampling (Sec. 4) in order to cope with the challenges of data sparsity and a huge search space, respectively. Our experimental results demonstrate the effectiveness of the new task and the proposed method in two scenarios, namely knowledge acquisition and predicting narrative event pairs (Sec. 5).

While we obtained favorable experimental results using a simple cosine kernel, the proposed method can utilize arbitrary kernel functions, such as the RBF kernel, tree kernels, and graph kernels, on arbitrary data structures such as sequences, graphs, and vectors. An intriguing direction for future work would be to adopt other data structures and kernel functions so that semantic similarities can be captured more precisely. Applying our method to various knowledge mining tasks would also be interesting.
Even with the relatively low computational cost of kernel-based measures, HSIC still faces a scalability problem. Although we conducted experiments on a dataset consisting of 10,000 pairs, we would like to train a better model on a larger dataset with, for example, more than a million pairs. An important task for future work, therefore, is to improve the scalability of the proposed method. Promising approaches toward this aim include various methods for approximating Gram matrices, such as random Fourier features.

## Acknowledgments

This work was supported by JST CREST Grant Number JPMJCR1513, Japan. We are grateful to Prof. K. Fukumizu and Prof. H. Kashima for giving us valuable advice. We would also like to thank Dr. R. Tian and S. Kobayashi for meaningful discussions.

## References

[Chambers and Jurafsky, 2008] Nathanael Chambers and Dan Jurafsky. Unsupervised Learning of Narrative Event Chains. In ACL, pages 789-797, 2008.

[Chambers and Jurafsky, 2009] Nathanael Chambers and Dan Jurafsky. Unsupervised Learning of Narrative Schemas and their Participants. In ACL, pages 602-610, 2009.

[Chib and Greenberg, 1995] Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49(4):327-335, 1995.

[Church and Hanks, 1990] Kenneth Ward Church and Patrick Hanks. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1):22-29, 1990.

[Fawcett, 2006] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861-874, 2006.

[Fukumizu et al., 2009] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Kernel dimension reduction in regression. Annals of Statistics, 37(4):1871-1905, 2009.

[Graff and Cieri, 2003] David Graff and Christopher Cieri. English Gigaword, LDC2003T05. Philadelphia: Linguistic Data Consortium, 2003.

[Granroth-Wilding and Clark, 2016] Mark Granroth-Wilding and Stephen Clark. What Happens Next? Event Prediction Using a Compositional Neural Network Model. In AAAI, pages 2727-2733, 2016.

[Gretton et al., 2005] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring Statistical Dependence with Hilbert-Schmidt Norms. In ALT, pages 63-77, 2005.

[Jans et al., 2012] Bram Jans, Steven Bethard, Ivan Vulić, and M. Francine Moens. Skip N-grams and Ranking Functions for Predicting Script Events. In EACL, pages 336-344, 2012.

[Maes et al., 1997] Frederik Maes, Andre Collignon, Dirk Vandermeulen, Guy Marchal, and Paul Suetens. Multimodality Image Registration by Maximization of Mutual Information. IEEE Trans. on Medical Imaging, 16(2):187-198, 1997.

[Manning et al., 2014] Christopher D. Manning, John Bauer, Jenny Finkel, Steven J. Bethard, Mihai Surdeanu, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL System Demonstrations, pages 55-60, 2014.

[Muandet et al., 2016] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. arXiv preprint arXiv:1605.09522, 2016.

[Peng et al., 2005] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.

[Quadrianto et al., 2009] Novi Quadrianto, Le Song, and Alex J. Smola. Kernelized sorting. In NIPS, pages 1289-1296, 2009.

[Resnik, 1997] Philip Stuart Resnik. Selectional Preference and Sense Disambiguation.
In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How, pages 52-57, 1997.

[Song et al., 2012] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature Selection via Dependence Maximization. Journal of Machine Learning Research, 13:1393-1434, 2012.

[Torkkola, 2003] Kari Torkkola. Feature Extraction by Non-Parametric Mutual Information Maximization. Journal of Machine Learning Research, 3:1415-1438, 2003.

[Turney, 2002] Peter D. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In ACL, pages 417-424, 2002.