# Multi-Channel Reverse Dictionary Model

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Lei Zhang1,2, Fanchao Qi1,2, Zhiyuan Liu1,2, Yasheng Wang3, Qun Liu3, Maosong Sun1,2

1Department of Computer Science and Technology, Tsinghua University
2Institute for Artificial Intelligence, Tsinghua University; Beijing National Research Center for Information Science and Technology
3Huawei Noah's Ark Lab

zhanglei9003@gmail.com, qfc17@mails.tsinghua.edu.cn, {liuzy, sms}@tsinghua.edu.cn, {wangyasheng, qun.liu}@huawei.com

A reverse dictionary takes the description of a target word as input and outputs the target word together with other words that match the description. Existing reverse dictionary methods cannot deal with highly variable input queries and low-frequency target words successfully. Inspired by the description-to-word inference process of humans, we propose the multi-channel reverse dictionary model, which can mitigate both problems simultaneously. Our model comprises a sentence encoder and multiple predictors. The predictors are expected to identify different characteristics of the target word from the input query. We evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions. Experimental results show that our model achieves state-of-the-art performance, and even outperforms the most popular commercial reverse dictionary system on the human-written description dataset. We also conduct quantitative analyses and a case study to demonstrate the effectiveness and robustness of our model. All the code and data of this work can be obtained at https://github.com/thunlp/MultiRD.

## Introduction

A regular (forward) dictionary maps words to definitions, while a reverse dictionary (Sierra 2000) does the opposite and maps descriptions to corresponding words.
In Figure 1, for example, a regular dictionary tells you that "expressway" is "a wide road that allows traffic to travel fast", and when you input "a road where cars go very quickly without stopping" to a reverse dictionary, it might return "expressway" together with other semantically similar words like "freeway".

[Figure 1: An example illustrating what a forward and a reverse dictionary are.]

Reverse dictionaries have great practical value. First and foremost, they can effectively address the tip-of-the-tongue problem (Brown and McNeill 1966), which severely afflicts many people, especially those who write a lot, such as researchers, writers and students. Additionally, reverse dictionaries can render assistance to new language learners who know a limited number of words. Moreover, reverse dictionaries are believed to be helpful to word selection (or word dictionary) anomia patients, people who can recognize and describe an object but fail to name it due to a neurological disorder (Benson 1979). In terms of natural language processing (NLP), reverse dictionaries can be used to evaluate the quality of sentence representations (Hill et al. 2016). They are also beneficial to tasks involving text-to-entity mapping, including question answering and information retrieval (Kartsaklis, Pilehvar, and Collier 2018).

There have been some successful commercial reverse dictionary systems, such as OneLook (https://onelook.com/thesaurus/), the most popular one, but their architecture is usually undisclosed proprietary knowledge. Some scientific research into building reverse dictionaries has also been conducted.

(Author notes: equal contribution indicated; work done during internship at Tsinghua University; corresponding author. Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)
Early work adopts sentence-matching methods, which utilize hand-engineered features to find the words whose stored definitions are most similar to the input query (Bilac et al. 2004; Zock and Bilac 2004; Méndez, Calvo, and Moreno-Armendáriz 2013; Shaw et al. 2013). But these methods cannot successfully cope with the main difficulty of reverse dictionaries: human-written input queries might differ widely from the target words' definitions.

Hill et al. (2016) propose a new method based on a neural language model (NLM). They employ an NLM as the sentence encoder to learn the representation of the input query, and return those words whose embeddings are closest to the input query's representation. The NLM-based reverse dictionary model alleviates the above-mentioned problem of variable input queries, but its performance is heavily dependent on the quality of word embeddings. According to Zipf's law (Zipf 1949), however, quite a few words are low-frequency and usually have poor embeddings, which undermines the overall performance of ordinary NLM-based models.

To tackle this issue, we propose the multi-channel reverse dictionary model, which is inspired by the description-to-word inference process of humans. Taking "expressway" as an example, when we forget what word means "a road where cars go very quickly", it may occur to us that the part-of-speech tag of the target word should be noun and that it belongs to the category of entity. We might also guess that the target word probably contains the morpheme "way". With knowledge of these characteristics, it is much easier for us to find the target word. Correspondingly, in our multi-channel reverse dictionary model, we employ multiple predictors to identify different characteristics of target words from input queries.

¹ https://onelook.com/thesaurus/
By doing this, target words with poor embeddings can still be picked out by their characteristics; moreover, words whose embeddings are close to the correct target word's but whose characteristics contradict the given description will be filtered out. We view each characteristic predictor as an information channel for searching for the target word, and take two types of channels into consideration: internal and external. The internal channels correspond to characteristics of the words themselves, including the part-of-speech (POS) tag and morphemes. The external channels reflect characteristics of target words related to external knowledge bases; we take account of two such characteristics, the word category and sememes. Word category information can be obtained from word taxonomy systems and usually corresponds to the genus words of definitions. A sememe is defined as the minimum semantic unit of human languages (Bloomfield 1926), similar to the concept of a semantic primitive (Wierzbicka 1996). The sememes of a word depict its meaning atomically and can also be predicted from the word's description.

More specifically, we adopt the well-established bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber 1997) with attention (Bahdanau, Cho, and Bengio 2015) as the basic framework and add four feature-specific characteristic predictors to it.

In experiments, we evaluate our model on English and Chinese datasets including both dictionary definitions and human-written descriptions, finding that our model achieves state-of-the-art performance. It is especially worth mentioning that, for the first time, OneLook is outperformed when the input queries are human-written descriptions.
In addition, to test our model under other realistic application scenarios such as crossword games, we provide it with prior knowledge about the target word, e.g., the initial letter, and find that this yields a substantial performance enhancement. We also conduct detailed quantitative analyses and a case study to demonstrate the effectiveness of our model as well as its robustness in handling polysemous and low-frequency words.

## Related Work

### Reverse Dictionary Models

Most existing reverse dictionary models are based on sentence-sentence matching, i.e., comparing the input query with stored word definitions and returning the word whose definition is most similar to the input query (Zock and Bilac 2004; Bilac et al. 2004). They usually use hand-engineered features, e.g., tf-idf, to measure sentence similarity, and leverage well-established information retrieval techniques to search for the target word (Shaw et al. 2013). Some of them utilize external knowledge bases like WordNet (Miller 1995) to enhance sentence similarity measurement by finding synonyms or other pairs of related words between the input query and stored definitions (Méndez, Calvo, and Moreno-Armendáriz 2013; Lam and Kalita 2013; Shaw et al. 2013).

Recent years have witnessed a growing number of reverse dictionary models that conduct sentence-word matching. Thorat and Choudhari (2016) present a node-graph architecture that can directly measure the similarity between the input query and any word in a word graph; however, it works only on a small lexicon (3,000 words). Hill et al. (2016) propose an NLM-based reverse dictionary model, which uses a bag-of-words (BOW) model or an LSTM to embed the input query into the semantic space of word embeddings, and returns the words whose embeddings are closest to the representation of the input query.
Following the NLM model, Morinaga and Yamaguchi (2018) incorporate category inference to eliminate irrelevant results and achieve better performance; Kartsaklis, Pilehvar, and Collier (2018) employ a graph of WordNet synsets and words in definitions to learn target word representations, together with a multi-sense LSTM to encode input queries, and claim to deliver state-of-the-art results; Hedderich et al. (2019) use multi-sense embeddings when encoding the queries, aiming to improve sentence representations of input queries; Pilehvar (2019) adopts sense embeddings to disambiguate senses of polysemous target words.

Our multi-channel model also uses an NLM to embed input queries. Compared with previous work, our model employs multiple predictors to identify characteristics of target words, which is consistent with the inference process of humans, and achieves significantly better performance.

### Applications of Dictionary Definitions

Dictionary definitions are handy resources for NLP research. Many studies utilize dictionary definitions to improve word embeddings (Noraset et al. 2017; Tissier, Gravier, and Habrard 2017; Bahdanau et al. 2017; Bosc and Vincent 2018; Scheepers, Kanoulas, and Gavves 2018). In addition, dictionary definitions are utilized in various applications including word sense disambiguation (Luo et al. 2018), knowledge representation learning (Xie et al. 2016), reading comprehension (Long et al. 2017) and knowledge graph generation (Silva, Freitas, and Handschuh 2018; Prokhorov, Pilehvar, and Collier 2019).

## Methodology

In this section, we first introduce some notation. Then we describe our basic framework, i.e., a BiLSTM with attention. Next we detail our multi-channel model with its two internal and two external predictors. The architecture of our model is illustrated in Figure 2.

We define $W$ as the vocabulary, $M$ as the set of all morphemes and $P$ as the set of all POS tags.
For a given word $w \in W$, its morpheme set is $M_w = \{m_1, \ldots, m_{|M_w|}\}$, where each morpheme $m_i \in M$ and $|\cdot|$ denotes the cardinality of a set. A word may have multiple senses, and each sense corresponds to a POS tag. Supposing $w$ has $n_w$ senses, the POS tags of its senses form its POS tag set $P_w = \{p_1, \ldots, p_{n_w}\}$, where each POS tag $p_i \in P$. In subsequent sections, we use lowercase boldface symbols for vectors and uppercase boldface symbols for matrices. For instance, $\mathbf{w}$ is the word vector of $w$ and $\mathbf{W}$ is a weight matrix.

### Basic Framework

The basic framework of our model is essentially similar to a sentence classification model, composed of a sentence encoder and a classifier. We select the bidirectional LSTM (BiLSTM) (Schuster and Paliwal 1997) as the sentence encoder, which encodes an input query into a vector. Different words in a sentence have different importance to the representation of the sentence; e.g., the genus words are more important than the modifiers in a definition. Therefore, we integrate an attention mechanism (Bahdanau, Cho, and Bengio 2015) into the BiLSTM to learn better sentence representations.

Formally, for an input query $Q = \{q_1, \ldots, q_{|Q|}\}$, we first pass the pre-trained word embeddings of its words $\mathbf{q}_1, \ldots, \mathbf{q}_{|Q|} \in \mathbb{R}^d$ to the BiLSTM, where $d$ is the dimension of word embeddings, and obtain two sequences of directional hidden states:

$$\{\overrightarrow{\mathbf{h}}_1, \ldots, \overrightarrow{\mathbf{h}}_{|Q|}\}, \{\overleftarrow{\mathbf{h}}_1, \ldots, \overleftarrow{\mathbf{h}}_{|Q|}\} = \mathrm{BiLSTM}(\mathbf{q}_1, \ldots, \mathbf{q}_{|Q|}), \tag{1}$$

where $\overrightarrow{\mathbf{h}}_i, \overleftarrow{\mathbf{h}}_i \in \mathbb{R}^l$ and $l$ is the dimension of directional hidden states. Then we concatenate the bi-directional hidden states to obtain non-directional hidden states:

$$\mathbf{h}_i = \mathrm{Concatenate}(\overrightarrow{\mathbf{h}}_i, \overleftarrow{\mathbf{h}}_i). \tag{2}$$

The final sentence representation is the weighted sum of the non-directional hidden states:

$$\mathbf{v} = \sum_{i=1}^{|Q|} \alpha_i \mathbf{h}_i, \tag{3}$$

where $\alpha_i$ is the attention weight:

$$\alpha_i = \mathbf{h}_t^\top \mathbf{h}_i, \quad \mathbf{h}_t = \mathrm{Concatenate}(\overrightarrow{\mathbf{h}}_{|Q|}, \overleftarrow{\mathbf{h}}_1). \tag{4}$$

[Figure 2: Multi-channel reverse dictionary model.]
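As a concrete sketch of Eqs. (2)-(4), the following numpy fragment combines precomputed directional hidden states into a sentence vector. The BiLSTM itself and any trained parameters are assumed and not shown, and the attention weights are raw dot products, exactly as Eq. (4) is written:

```python
import numpy as np

def attention_encode(h_fwd, h_bwd):
    """Combine directional BiLSTM states into a sentence vector (Eqs. 2-4).
    h_fwd, h_bwd: arrays of shape (|Q|, l) of forward/backward hidden states,
    assumed to come from an already-trained BiLSTM (not shown here)."""
    # Eq. (2): non-directional states h_i = [h_fwd_i ; h_bwd_i], shape (|Q|, 2l)
    h = np.concatenate([h_fwd, h_bwd], axis=1)
    # Eq. (4): h_t concatenates the last forward and the first backward state
    h_t = np.concatenate([h_fwd[-1], h_bwd[0]])
    # attention weights alpha_i = h_t . h_i, one scalar per query position
    alpha = h @ h_t
    # Eq. (3): sentence vector v = sum_i alpha_i * h_i
    return alpha @ h

rng = np.random.default_rng(0)
v = attention_encode(rng.standard_normal((5, 3)), rng.standard_normal((5, 3)))
print(v.shape)  # (6,)
```

With $|Q| = 5$ and $l = 3$, the resulting sentence vector lives in $\mathbb{R}^{2l} = \mathbb{R}^6$, matching the dimension expected by the predictors below.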
Next we map $\mathbf{v}$, the sentence vector of the input query, into the space of word embeddings, and calculate the confidence score of each word using the dot product:

$$\mathbf{v}_{word} = \mathbf{W}_{word}\mathbf{v} + \mathbf{b}_{word}, \quad sc_{w,word} = \mathbf{v}_{word}^\top \mathbf{w}, \tag{5}$$

where $sc_{w,word}$ is the confidence score of $w$, $\mathbf{W}_{word} \in \mathbb{R}^{d \times 2l}$ is a weight matrix, and $\mathbf{b}_{word} \in \mathbb{R}^d$ is a bias vector.

### Internal Channel: POS Tag Predictor

A dictionary definition or human-written description of a word is usually able to reflect the POS tag of the corresponding sense of the word. We believe that predicting the POS tag of the target word can alleviate the problem, found in existing reverse dictionary models, of returning words whose POS tags contradict the input query. We simply pass the sentence vector $\mathbf{v}$ of the input query to a single-layer perceptron:

$$\mathbf{sc}_{pos} = \mathbf{W}_{pos}\mathbf{v} + \mathbf{b}_{pos}, \tag{6}$$

where $\mathbf{sc}_{pos} \in \mathbb{R}^{|P|}$ records the prediction score of each POS tag, $\mathbf{W}_{pos} \in \mathbb{R}^{|P| \times 2l}$ is a weight matrix, and $\mathbf{b}_{pos} \in \mathbb{R}^{|P|}$ is a bias vector. The confidence score of $w$ from the POS tag channel is the sum of the prediction scores of $w$'s POS tags:

$$sc_{w,pos} = \sum_{p \in P_w} [\mathbf{sc}_{pos}]_{\mathrm{index}_{pos}(p)}, \tag{7}$$

where $[\mathbf{x}]_i$ denotes the $i$-th element of $\mathbf{x}$, and $\mathrm{index}_{pos}(p)$ returns the POS tag index of $p$.

### Internal Channel: Morpheme Predictor

Most words are complex words consisting of more than one morpheme. We find there exists a kind of local semantic correspondence between the morphemes of a word and its definition or description. For instance, the word "expressway" has two morphemes, "express" and "way", and its dictionary definition is "a wide road in a city on which cars can travel very quickly". We can observe that the words "road" and "quickly" semantically correspond to the morphemes "way" and "express" respectively. By predicting the morphemes of the target word from the input query, a reverse dictionary can capture compositional information about the target word, which is complementary to the contextual information of word embeddings. We therefore design a special morpheme predictor.
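All of the characteristic channels score a candidate word by summing the predicted scores of the word's own labels; a minimal sketch of this lookup-and-sum pattern, instantiated for the POS channel of Eq. (7) with toy, hypothetical scores and index mapping:

```python
def channel_score(scores, word_labels, index):
    """Lookup-and-sum scoring shared by the channels, here as in Eq. (7):
    a word's channel confidence is the sum of the predicted scores of the
    labels it carries (POS tags here; morphemes and sememes work alike)."""
    return sum(scores[index[lab]] for lab in word_labels)

# Toy POS example with hypothetical scores for P = {noun, verb, adj};
# in practice the scores come from the perceptron of Eq. (6).
sc_pos = [2.0, -1.0, 0.5]
pos_index = {"noun": 0, "verb": 1, "adj": 2}
print(channel_score(sc_pos, {"noun", "adj"}, pos_index))  # 2.5
```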
Different from the POS tag predictor, we allow each hidden state to be involved in morpheme prediction directly, and apply max-pooling to obtain the final morpheme prediction scores. Specifically, we feed each non-directional hidden state to a single-layer perceptron and obtain local morpheme prediction scores:

$$\mathbf{sc}^i_{mor} = \mathbf{W}_{mor}\mathbf{h}_i + \mathbf{b}_{mor}, \tag{8}$$

where $\mathbf{sc}^i_{mor} \in \mathbb{R}^{|M|}$ measures the semantic correspondence between the $i$-th word in the input query and each morpheme, $\mathbf{W}_{mor} \in \mathbb{R}^{|M| \times 2l}$ is a weight matrix, and $\mathbf{b}_{mor} \in \mathbb{R}^{|M|}$ is a bias vector. Then we apply max-pooling over all the local morpheme prediction scores to obtain global morpheme prediction scores:

$$[\mathbf{sc}_{mor}]_j = \max_{1 \le i \le |Q|} [\mathbf{sc}^i_{mor}]_j. \tag{9}$$

The confidence score of $w$ from the morpheme channel is:

$$sc_{w,mor} = \sum_{m \in M_w} [\mathbf{sc}_{mor}]_{\mathrm{index}_{mor}(m)}, \tag{10}$$

where $\mathrm{index}_{mor}(m)$ returns the morpheme index of $m$.

### External Channel: Word Category Predictor

Semantically related words often belong to different categories even though they have close word embeddings, e.g., "car" and "road". Word category information is helpful in eliminating semantically related but not similar words from the results of reverse dictionaries (Morinaga and Yamaguchi 2018). There are many available word taxonomy systems that provide hierarchical word category information, e.g., WordNet (Miller 1995). Some of them provide POS tag information as well, in which case the POS tag predictor can be removed.

We design a hierarchical predictor to calculate the prediction scores of word categories. Specifically, each word belongs to a certain category in each layer of the word hierarchy. We first compute the word category prediction score of each layer:

$$\mathbf{sc}_{cat,k} = \mathbf{W}_{cat,k}\mathbf{v} + \mathbf{b}_{cat,k}, \tag{11}$$

where $\mathbf{sc}_{cat,k} \in \mathbb{R}^{c_k}$ is the word category prediction score distribution of the $k$-th layer, $\mathbf{W}_{cat,k} \in \mathbb{R}^{c_k \times 2l}$ is a weight matrix, $\mathbf{b}_{cat,k} \in \mathbb{R}^{c_k}$ is a bias vector, and $c_k$ is the number of categories in the $k$-th layer.
Then the final confidence score of $w$ from the word category channel is the weighted sum of its category prediction scores over all the layers:

$$sc_{w,cat} = \sum_{k=1}^{K} \beta_k [\mathbf{sc}_{cat,k}]_{\mathrm{index}^{cat}_k(w)}, \tag{12}$$

where $K$ is the total number of layers in the word hierarchy, $\beta_k$ is a hyper-parameter controlling the relative weights, and $\mathrm{index}^{cat}_k(w)$ returns the category index of $w$ in the $k$-th layer.

### External Channel: Sememe Predictor

In linguistics, a sememe is the minimum semantic unit of natural languages (Bloomfield 1926). The sememes of a word can accurately depict the meaning of the word. HowNet (Dong and Dong 2003) is the most famous sememe knowledge base. It defines about 2,000 sememes and uses them to annotate more than 100,000 Chinese and English words by hand. HowNet and its sememe knowledge have been widely applied to various NLP tasks including sentiment analysis (Fu et al. 2013), word representation learning (Niu et al. 2017), semantic composition (Qi et al. 2019a), sequence modeling (Qin et al. 2019) and textual adversarial attacks (Zang et al. 2019).

The sememe annotation of a word in HowNet includes hierarchical sememe structures as well as relations between sememes. For simplicity, we extract a set of unstructured sememes for each word, in which case the sememes of a word can be regarded as multiple semantic labels of the word. We find there also exists local semantic correspondence between the sememes of a word and its description. Still taking "expressway" as an example, its annotated sememes in HowNet are "route" and "fast", which semantically correspond to the words "road" and "quickly" in its definition respectively. Therefore, we design a sememe predictor similar to the morpheme predictor.

Formally, we use $S$ to represent the set of all sememes. The sememe set of a word $w$ is $S_w = \{s_1, \ldots, s_{|S_w|}\}$.
We pass each hidden state to a single-layer perceptron to calculate local sememe prediction scores:

$$\mathbf{sc}^i_{sem} = \mathbf{W}_{sem}\mathbf{h}_i + \mathbf{b}_{sem}, \tag{13}$$

where $\mathbf{sc}^i_{sem} \in \mathbb{R}^{|S|}$ measures the semantic correspondence between the $i$-th word in the input query and each sememe, $\mathbf{W}_{sem} \in \mathbb{R}^{|S| \times 2l}$ is a weight matrix, and $\mathbf{b}_{sem}$ is a bias vector. The final sememe prediction scores are computed by max-pooling:

$$[\mathbf{sc}_{sem}]_j = \max_{1 \le i \le |Q|} [\mathbf{sc}^i_{sem}]_j. \tag{14}$$

The confidence score of $w$ from the sememe channel is:

$$sc_{w,sem} = \sum_{s \in S_w} [\mathbf{sc}_{sem}]_{\mathrm{index}_{sem}(s)}, \tag{15}$$

where $\mathrm{index}_{sem}(s)$ returns the sememe index of $s$.

### Multi-channel Reverse Dictionary Model

By combining the confidence scores of direct word prediction and indirect characteristic prediction, we obtain the final confidence score of a given word $w$ in our multi-channel reverse dictionary model:

$$sc_w = \lambda_{word}\, sc_{w,word} + \sum_{c \in C} \lambda_c\, sc_{w,c}, \tag{16}$$

where $C = \{pos, mor, cat, sem\}$ is the channel set, and $\lambda_{word}$ and $\lambda_c$ are hyper-parameters controlling the relative weights of the corresponding terms. As the training loss, we simply adopt the one-versus-all cross-entropy loss, inspired by sentence classification models.

| Model | Seen: med. rank | Seen: acc@1/10/100 | Seen: rank var. | Unseen: med. rank | Unseen: acc@1/10/100 | Unseen: rank var. | Desc.: med. rank | Desc.: acc@1/10/100 | Desc.: rank var. |
|---|---|---|---|---|---|---|---|---|---|
| OneLook | 0 | .66/.94/.95 | 200 | - | - | - | 5.5 | .33/.54/.76 | 332 |
| BOW | 172 | .03/.16/.43 | 414 | 248 | .03/.13/.39 | 424 | 22 | .13/.41/.69 | 308 |
| RNN | 134 | .03/.16/.44 | 375 | 171 | .03/.15/.42 | 404 | 17 | .14/.40/.73 | 274 |
| RDWECI | 121 | .06/.20/.44 | 420 | 170 | .05/.19/.43 | 420 | 16 | .14/.41/.74 | 306 |
| SuperSense | 378 | .03/.15/.36 | 462 | 465 | .02/.11/.31 | 454 | 115 | .03/.15/.47 | 396 |
| MS-LSTM | 0 | .92/.98/.99 | 65 | 276 | .03/.14/.37 | 426 | 1000 | .01/.04/.18 | 404 |
| BiLSTM | 25 | .18/.39/.63 | 363 | 101 | .07/.24/.49 | 401 | 5 | .25/.60/.83 | 214 |
| +Mor | 24 | .19/.41/.63 | 345 | 80 | .08/.26/.52 | 399 | 4 | .26/.62/.85 | 198 |
| +Cat | 19 | .19/.42/.68 | 309 | 68 | .08/.28/.54 | 362 | 4 | .30/.62/.85 | 206 |
| +Sem | 19 | .19/.43/.66 | 349 | 80 | .08/.26/.53 | 393 | 4 | .30/.64/.87 | 218 |
| Multi-channel | 16 | .20/.44/.71 | 310 | 54 | .09/.29/.58 | 358 | 2 | .32/.64/.88 | 203 |

Table 1: Overall reverse dictionary performance of all the models (median rank of target words, accuracy@1/10/100, rank variance).
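Under the assumption that the per-channel scores of a candidate word have already been computed, the fusion of Eq. (16) reduces to a weighted sum; a minimal sketch with all channel weights set to 1, as in the experiments:

```python
def final_score(sc_word, channel_scores, lambda_word=1.0, lambdas=None):
    """Eq. (16): sc_w = lambda_word * sc_{w,word} + sum_c lambda_c * sc_{w,c},
    with C = {pos, mor, cat, sem}. All weights default to 1."""
    if lambdas is None:
        lambdas = {c: 1.0 for c in channel_scores}
    return lambda_word * sc_word + sum(lambdas[c] * s
                                       for c, s in channel_scores.items())

# toy per-channel scores for one candidate word (hypothetical values)
channels = {"pos": 0.5, "mor": 0.25, "cat": 0.125, "sem": 0.125}
print(final_score(1.0, channels))  # 2.0
```

Ranking all words in the vocabulary by this score yields the model's output list.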
## Experiments

In this section, we evaluate the performance of our multi-channel reverse dictionary model. We also conduct detailed quantitative analyses as well as a case study to explore the influencing factors in the reverse dictionary task and to demonstrate the strengths and weaknesses of our model. We carry out experiments on both English and Chinese datasets, but due to limited space, we present the experiments on the Chinese dataset in the appendix.

We use the English dictionary definition dataset created by Hill et al. (2016) as the training set. It contains about 100,000 words and 900,000 word-definition pairs. We have three test sets: (1) the seen definition set, which contains 500 pairs of words and WordNet definitions existing in the training set and is used to assess the ability to recall previously encoded information; (2) the unseen definition set, which also contains 500 pairs of words and WordNet definitions, but the words together with all their definitions have been excluded from the training set; and (3) the description set, which consists of 200 pairs of words and human-written descriptions and is a benchmark dataset also created by Hill et al. (2016).

To obtain the morpheme information our model needs, we use Morfessor (Virpioja et al. 2013) to segment all the words into morphemes. For the word category information, we use the lexical names from WordNet (Miller 1995); there are 45 lexical names, and the total layer number of the word category hierarchy is 1. Since the lexical names already include POS tags, e.g., noun.animal, we remove the POS tag predictor from our model. We use HowNet as the source of sememes; it contains 43,321 English words manually annotated with 2,148 different sememes in total. We employ OpenHowNet (Qi et al. 2019b), the open data access API of HowNet, to obtain the sememes of words.
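The resources above can be wired into per-word training labels for the predictors; a minimal sketch with tiny hypothetical lookup tables standing in for the real Morfessor, WordNet and OpenHowNet outputs (the "expressway" entries follow the paper's running example):

```python
# Hypothetical stand-ins for the real resources described above.
MORPHEMES = {"expressway": ["express", "way"]}   # from Morfessor (assumed)
LEXNAMES  = {"expressway": "noun.artifact"}      # WordNet lexical name (assumed)
SEMEMES   = {"expressway": ["route", "fast"]}    # from HowNet (assumed)

def word_labels(word):
    """Collect the multi-label supervision for one word, one entry per channel.
    The lexical name already carries the POS, so no separate POS label is kept."""
    return {
        "mor": MORPHEMES.get(word, []),
        "cat": LEXNAMES.get(word),
        "sem": SEMEMES.get(word, []),
    }

print(word_labels("expressway")["sem"])  # ['route', 'fast']
```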
(Footnote: the definitions are extracted from five electronic resources: WordNet, The American Heritage Dictionary, The Collaborative International Dictionary of English, Wiktionary and Webster's.)

### Experimental Settings

**Baseline Methods.** We choose the following models as baselines: (1) OneLook, the most popular commercial reverse dictionary system, of which version 2.0 is used; (2) BOW and RNN with rank loss (Hill et al. 2016), both NLM-based; the former uses a bag-of-words model while the latter uses an LSTM; (3) RDWECI (Morinaga and Yamaguchi 2018), which incorporates category inference and is an improved version of BOW; (4) SuperSense (Pilehvar 2019), an improved version of BOW which substitutes pretrained sense embeddings for target word embeddings; (5) MS-LSTM (Kartsaklis, Pilehvar, and Collier 2018), an improved version of RNN which uses graph-based WordNet synset embeddings together with a multi-sense LSTM to predict synsets from descriptions and claims to produce state-of-the-art performance; and (6) BiLSTM, the basic framework of our multi-channel model.

**Hyper-parameters and Training.** For our model, the dimension of the non-directional hidden states is 300 × 2, the weights of the different channels are all set to 1, and the dropout rate is 0.5. For all models except MS-LSTM, we use 300-dimensional word embeddings pretrained on Google News with word2vec (https://code.google.com/archive/p/word2vec/), and the word embeddings are fixed during training. For all the other baseline methods, we use their recommended hyper-parameters. For training, we adopt Adam as the optimizer with an initial learning rate of 0.001 and a batch size of 128.

**Evaluation Metrics.** Following previous work, we use three evaluation metrics: the median rank of target words (lower is better), the accuracy that target words appear in the top 1/10/100 results (acc@1/10/100, higher is better), and the standard deviation of target words' ranks (rank variance, lower is better). Notice that MS-LSTM can only predict WordNet synsets.
Thus, we map the target words to their corresponding WordNet synsets (target synsets) and calculate the accuracy and rank variance of the target synsets.

| Prior Knowledge | Seen: med. rank | Seen: acc@1/10/100 | Seen: rank var. | Unseen: med. rank | Unseen: acc@1/10/100 | Unseen: rank var. | Desc.: med. rank | Desc.: acc@1/10/100 | Desc.: rank var. |
|---|---|---|---|---|---|---|---|---|---|
| None | 16 | .20/.44/.71 | 310 | 54 | .09/.29/.58 | 358 | 2.5 | .32/.64/.88 | 203 |
| POS Tag | 13 | .21/.45/.72 | 290 | 45 | .10/.31/.60 | 348 | 3 | .35/.65/.91 | 174 |
| Initial Letter | 1 | .39/.73/.90 | 270 | 4 | .26/.63/.85 | 348 | 0 | .62/.90/.97 | 160 |
| Word Length | 1 | .40/.71/.90 | 269 | 6 | .25/.56/.84 | 346 | 0 | .55/.85/.95 | 163 |

Table 2: Reverse dictionary performance with prior knowledge (median rank, accuracy@1/10/100, rank variance).

### Overall Experimental Results

Table 1 exhibits the reverse dictionary performance of all the models on the three test sets, where "+Mor", "+Cat" and "+Sem" represent the morpheme, word category and sememe predictors respectively. Notice that the performance of OneLook on the unseen definition set is meaningless, because we cannot exclude any definitions from its definition bank; hence we do not list the corresponding results. From the table, we can see:

(1) Compared with all the baseline methods other than OneLook, our multi-channel model achieves substantially better performance on the unseen definition set and the description set, which verifies the absolute superiority of our model in generalizing to novel, unseen input queries.

(2) OneLook significantly outperforms our model when the input queries are dictionary definitions. This result is expected, because the input dictionary definitions are already stored in the database of OneLook, and even simple text matching can easily handle this situation. However, the input queries of a reverse dictionary cannot be exact dictionary definitions in reality. On the description test set, our multi-channel model achieves better overall performance than OneLook.
Although OneLook yields slightly higher acc@1, this has limited value in terms of practical application, because people usually need to pick the proper word from several candidates, not to mention the fact that the acc@1 of OneLook is only 0.33.

(3) MS-LSTM performs very well on the seen definition set but badly on the description set, which manifests its limited generalization ability and practical value. Notice that when testing MS-LSTM, the search space is the whole synset list rather than the synset list of the test set, which causes the difference between the performance on the unseen definition set measured by us and that reported in the original work (Kartsaklis, Pilehvar, and Collier 2018).

(4) All the BiLSTM variants enhanced with different information channels (+Mor, +Cat and +Sem) perform better than the vanilla BiLSTM. These results prove the effectiveness of predicting characteristics of target words in the reverse dictionary task. Moreover, our multi-channel model achieves a further performance enhancement compared with the single-channel models, which demonstrates the potency of characteristic fusion and also verifies the efficacy of our multi-channel model.

(5) BOW performs better than RNN, which is consistent with the findings of Hill et al. (2016). However, BiLSTM far surpasses BOW as well as RNN. This verifies the necessity of bi-directional encoding in RNN models and also shows the potential of RNNs.

### Performance with Prior Knowledge

In practical applications of reverse dictionaries, extra information about the target word, in addition to the description, may be known. For example, we may remember the initial letter of the word we forget, or the length of the target word may be known, as in a crossword game. In this subsection, we evaluate the performance of our model given prior knowledge of target words, namely the POS tag, initial letter and word length.
More specifically, we extract the words satisfying the given prior knowledge from the top 1,000 results of our model, and then re-evaluate the performance. The results are shown in Table 2. We can see that any prior knowledge improves the performance of our model to a greater or lesser extent, which is expected. However, the performance boost brought by the initial letter and word length information is much bigger than that brought by the POS tag information. The possible reasons are as follows. The POS tag is already predicted within our multi-channel model, hence the improvement it brings is limited, which also demonstrates that our model does well at POS tag prediction. The initial letter and word length, by contrast, are hard to predict from a definition or description and are not considered in our model. Therefore, they can filter out many candidates and markedly increase performance.

### Analyses of Influencing Factors

In this subsection, we conduct quantitative analyses of the influencing factors in reverse dictionary performance. To make the results more accurate, we use a larger test set consisting of 38,113 words and 80,658 seen pairs of words and WordNet definitions. Since we are interested in the features of target words, we exclude MS-LSTM, which predicts WordNet synsets.

**Sense Number.** Figure 3 exhibits the acc@10 of all the models on words with different numbers of senses. It is obvious that the performance of all the models declines as the sense number increases, which indicates that polysemy is a difficulty in the reverse dictionary task. But our model displays outstanding robustness, and its performance hardly deteriorates even on the words with the most senses.

**Word Frequency.** Figure 4 displays all the models' performance on words within different ranges of word frequency ranking. We can see that the most frequent and most infrequent words are harder to predict for all the reverse dictionary models.
The most infrequent words usually have poor embeddings, which may damage the performance of NLM-based models. For the most frequent words, on the other hand, although their embeddings are better, they usually have more senses. We count the average sense number of each range, which is 5.6, 3.2, 2.6, 2.1, 1.7 and 1.4 respectively. The first range has a much larger average sense number, which explains its bad performance. Moreover, our model again demonstrates remarkable robustness.

[Figure 3: Acc@10 on words with different sense numbers (1, 2, 3, 4, 5, 6+). The numbers of words are 21,582, 8,266, 3,538, 1,691, 953 and 2,083 respectively.]

[Figure 4: Acc@10 on different ranges of word frequency rankings (0-5k, 5-10k, 10-20k, 20-30k, 30-40k, 40k+). The number of words in each range is 3,299, 3,243, 3,515, 3,300, 5,565 and 19,191 respectively.]

**Query Length.** The effect of query length on reverse dictionary performance is illustrated in Figure 5. When the input query has only one word, system performance is strikingly poor, especially for our multi-channel model. This is easy to explain: the information extracted from the input query is too limited. In this case, outputting the synonyms of the query word is likely to be a better choice.

[Figure 5: Acc@10 on different ranges of query length (1, 2, 3-5, 6-10, 11-15, 16-20, 21-25, 26+). The number of queries in each range is 672, 3,609, 21,113, 31,013, 14,684, 5,803, 2,316 and 1,448 respectively.]

### Case Study

| Word | Input Query |
|---|---|
| postnuptial | relating to events after a marriage |
| takeaway | concession made by a labor union to a company |

Table 3: Two reverse dictionary cases.

In this subsection, we give the two cases in Table 3 to display the strengths and weaknesses of our reverse dictionary model.
For the first word, "postnuptial", our model correctly predicts its morpheme "post" and its sememe GetMarried from the words "after" and "marriage" in the input query. Therefore, our model easily finds the correct answer. For the second case, the input query describes a rare sense of the word "takeaway". HowNet has no sememe annotation for this sense, and the morphemes of the word are not semantically related to any words in the query either. Our model cannot solve this kind of case, which is in fact hard to handle for all NLM-based models. In this situation, text matching methods, which return the words whose stored definitions are most similar to the input query, may help.

## Conclusion and Future Work

In this paper, we propose a multi-channel reverse dictionary model, which incorporates multiple predictors to predict the characteristics of target words from given input queries. Experimental results and analyses show that our model achieves state-of-the-art performance and also possesses outstanding robustness. In the future, we will try to combine our model with text matching methods to better tackle extreme cases, e.g., single-word input queries. In addition, we are considering extending our model to the cross-lingual reverse dictionary task. Moreover, we will explore the feasibility of transferring our model to related tasks such as question answering.

## Acknowledgements

This work is funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 61621136008 / DFG TRR-169. Furthermore, we thank the anonymous reviewers for their valuable comments and suggestions.

## References

Bahdanau, D.; Bosc, T.; Jastrzebski, S.; Grefenstette, E.; Vincent, P.; and Bengio, Y. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Benson, D. F. 1979.
Neurologic correlates of anomia. In Studies in Neurolinguistics. Elsevier. 293-328.
Bilac, S.; Watanabe, W.; Hashimoto, T.; Tokunaga, T.; and Tanaka, H. 2004. Dictionary search based on the target word description. In Proceedings of NLP.
Bloomfield, L. 1926. A set of postulates for the science of language. Language 2(3):153-164.
Bosc, T., and Vincent, P. 2018. Auto-encoding dictionary definitions into consistent word embeddings. In Proceedings of EMNLP.
Brown, R., and McNeill, D. 1966. The tip of the tongue phenomenon. Journal of Verbal Learning and Verbal Behavior 5(4):325-337.
Dong, Z., and Dong, Q. 2003. HowNet: a hybrid language and knowledge resource. In Proceedings of NLP-KE.
Fu, X.; Liu, G.; Guo, Y.; and Wang, Z. 2013. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems 37:186-195.
Hedderich, M. A.; Yates, A.; Klakow, D.; and de Melo, G. 2019. Using multi-sense vector embeddings for reverse dictionaries. In Proceedings of IWCS.
Hill, F.; Cho, K.; Korhonen, A.; and Bengio, Y. 2016. Learning to understand phrases by embedding the dictionary. TACL 4:17-30.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
Kartsaklis, D.; Pilehvar, M. T.; and Collier, N. 2018. Mapping text to knowledge graph entities using multi-sense LSTMs. In Proceedings of EMNLP.
Lam, K. N., and Kalita, J. K. 2013. Creating reverse bilingual dictionaries. In Proceedings of HLT-NAACL.
Long, T.; Bengio, E.; Lowe, R.; Cheung, J. C. K.; and Precup, D. 2017. World knowledge for reading comprehension: Rare entity prediction with hierarchical LSTMs using external descriptions. In Proceedings of EMNLP.
Luo, F.; Liu, T.; Xia, Q.; Chang, B.; and Sui, Z. 2018. Incorporating glosses into neural word sense disambiguation. In Proceedings of ACL.
Méndez, O.; Calvo, H.; and Moreno-Armendáriz, M. A. 2013. A reverse dictionary based on semantic analysis using WordNet.
In Proceedings of MICAI.
Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11):39-41.
Morinaga, Y., and Yamaguchi, K. 2018. Improvement of reverse dictionary by tuning word vectors and category inference. In Proceedings of ICIST.
Niu, Y.; Xie, R.; Liu, Z.; and Sun, M. 2017. Improved word representation learning with sememes. In Proceedings of ACL.
Noraset, T.; Liang, C.; Birnbaum, L.; and Downey, D. 2017. Definition modeling: Learning to define word embeddings in natural language. In Proceedings of AAAI.
Pilehvar, M. T. 2019. On the importance of distinguishing word meaning representations: A case study on reverse dictionary mapping. In Proceedings of NAACL-HLT.
Prokhorov, V.; Pilehvar, M. T.; and Collier, N. 2019. Generating knowledge graph paths from textual definitions using sequence-to-sequence models. In Proceedings of NAACL-HLT.
Qi, F.; Huang, J.; Yang, C.; Liu, Z.; Chen, X.; Liu, Q.; and Sun, M. 2019a. Modeling semantic compositionality with sememe knowledge. In Proceedings of ACL.
Qi, F.; Yang, C.; Liu, Z.; Dong, Q.; Sun, M.; and Dong, Z. 2019b. OpenHowNet: An open sememe-based lexical knowledge base. arXiv preprint arXiv:1901.09957.
Qin, Y.; Qi, F.; Ouyang, S.; Liu, Z.; Yang, C.; Wang, Y.; Liu, Q.; and Sun, M. 2019. Enhancing recurrent neural networks with sememes. arXiv preprint arXiv:1910.08910.
Scheepers, T.; Kanoulas, E.; and Gavves, E. 2018. Improving word embedding compositionality using lexicographic definitions. In Proceedings of WWW.
Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673-2681.
Shaw, R.; Datta, A.; VanderMeer, D. E.; and Dutta, K. 2013. Building a scalable database-driven reverse dictionary. TKDE 25:528-540.
Sierra, G. 2000. The onomasiological dictionary: a gap in lexicography. In Proceedings of the Ninth EURALEX International Congress.
Silva, V.; Freitas, A.; and Handschuh, S. 2018.
Building a knowledge graph from natural language definitions for interpretable text entailment recognition. In Proceedings of LREC.
Thorat, S., and Choudhari, V. 2016. Implementing a reverse dictionary, based on word definitions, using a node-graph architecture. In Proceedings of COLING.
Tissier, J.; Gravier, C.; and Habrard, A. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Proceedings of EMNLP.
Virpioja, S.; Smit, P.; Grönroos, S.-A.; and Kurimo, M. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Aalto University Publication.
Wierzbicka, A. 1996. Semantics: Primes and Universals. Oxford University Press, UK.
Xie, R.; Liu, Z.; Jia, J.; Luan, H.; and Sun, M. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI.
Zang, Y.; Yang, C.; Qi, F.; Liu, Z.; Zhang, M.; Liu, Q.; and Sun, M. 2019. Textual adversarial attack as combinatorial optimization. arXiv preprint arXiv:1910.12196.
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.
Zock, M., and Bilac, S. 2004. Word lookup on the basis of associations: from an idea to a roadmap. In Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries.