Explainable and Discourse Topic-aware Neural Language Understanding

Yatin Chaudhary 1,2   Hinrich Schütze 2   Pankaj Gupta 1

1 Corporate Technology, Machine Intelligence (MIC-DE), Siemens AG, Munich, Germany. 2 CIS, University of Munich (LMU), Munich, Germany. Correspondence to: Yatin Chaudhary.
Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Marrying topic models and language models exposes language understanding to a broader source of document-level context beyond sentences, via topics. While introducing topical semantics into language models, existing approaches incorporate latent document-topic proportions and ignore the topical discourse in the sentences of the document. This work extends that line of research by additionally introducing an explainable topic representation into language understanding, obtained from a set of key terms corresponding to each latent topic of the proportion. Moreover, we retain sentence-topic association along with document-topic association by modeling the topical discourse of every sentence in the document. We present a novel neural composite language modeling (NCLM) framework that exploits both the latent and explainable topics, along with sentence-level topical discourse, in a joint learning framework of topic and language models. Experiments over a range of tasks such as language modeling, word sense disambiguation, document classification, retrieval and text generation demonstrate the ability of the proposed model to improve language understanding.

1. Introduction

Topic models (TMs) such as LDA (Blei et al., 2001) provide document-level semantic knowledge in the form of topics, explaining the thematic structures hidden in a document collection. In doing so, they learn document-topic associations in a generative fashion by counting word occurrences across documents. Essentially, the generative framework assumes that each document is a mixture of latent topics, i.e., a topic proportion, and that each latent topic is a unique distribution over the words in the vocabulary. Beyond a document representation, topic models also offer interpretability via topics (a set of top key terms). Recently, neural topic models (Gupta et al., 2019b;a; Miao et al., 2016) have been shown to outperform LDA-based models; we therefore consider neural network based topic models in this work.

Language models (LMs) (Mikolov et al., 2010; Peters et al., 2018) have recently achieved success in natural language understanding by predicting the next (target) word in a sequence given its preceding and/or following context, accounting for linguistic structure such as word order. However, LMs are typically contextualized by an n-gram window or a single sentence, ignoring global semantics in context beyond the sentence boundary, especially when modeling documents. To capture long-term semantic dependencies, recent works (Wang et al., 2018; Lau et al., 2017; Dieng et al., 2017) have attempted to introduce document-level semantics into LMs at the sentence level by marrying topic and language models, e.g., augmenting LSTM-based LMs with a latent document-topic proportion (association) obtained from a topic model for the document in which the sentence appears.

Motivation 1: While augmenting LMs with topical semantics, existing approaches incorporate latent document-topic proportions and ignore an explanatory representation for each latent topic of the proportion.
Here, the explanatory representation of a topic refers to a vector representation obtained from a set of high-probability terms in its topic-word distribution. For example, in Figure 1(a) we run a topic model over a document of three sentences and discover a latent document-topic proportion ĥ_d as well as three topics (top-5 key terms each) correspondingly explaining each latent topic (T1, T2 and T3) of the proportion. Observe that the context in sent#2 alone cannot resolve the meaning of the word "chip". However, introducing ĥ_d together with the complementary explainable topics (collections of key terms) provides an abstract (latent) and a fine-grained (explanatory) outlook, respectively. To our knowledge, the scheme of augmenting LMs with both the latent document-topic proportion and explanatory topics remains unexplored.

Contribution 1: Complementing the latent document-topic proportion, we also leverage explanatory topics when augmenting LMs with topical semantics, in a neural composite language modeling (NCLM) framework consisting of a neural topic model (NTM) and a neural language model (NLM).

[Figure 1. Detailed illustration of: (a) Motivation #1: the top 5 key terms of each topic (T1: share, investing, sales, market; T2: computer, unix, linux, android, smartphone; T3: electronic, circuit, processor, silicon, transistor) provide a finer-grained outlook on document semantics than ĥ_d for predicting the word "chip" in sent#2 ("Production of chip is a multi-step process"); (b) Motivation #2: negative influence via sentence-level topical discourse mismatch, where the dominant topic (T1) of the generated topic proportion does not match the expected topic (T3).]

Motivation 2: A sentence in a document may have a different topical discourse than its neighboring sentences or the document itself. As illustrated in Figure 1(b), an NTM generates two different document-topic proportions (TP) for the input document d and for sent#2+sent#3 while sent#1 is being modeled in the NLM. Observe that sent#1 expects a topic proportion dominated by topic T3 (electronics), as in TP1; however, the NTM generates TP2 or TP3 from input d or sent#2+sent#3, respectively, where both document-topic proportions are dominated by topic T1 about marketing. Therefore, such topical discourse mismatch needs to be handled for each sentence in the document.

Contribution 2: In order to retain sentence-level topical semantics, we first extract a sentence-topic association, i.e., a sentence-level latent topic proportion, for each sentence using the NTM and then introduce it into the NLM in combination with the document-topic proportion (association).

Contribution 3: We evaluate the proposed NCLM framework over a range of tasks such as language modeling, word sense disambiguation, document classification and information retrieval. Experimental results suggest that both the explanatory topics and the sentence-topic association help in improving natural language understanding. An implementation of NCLM is available at: https://github.com/YatinChaudhary/NCLM.
2. Neural Language Model

Language modeling is the task of assigning a probability distribution over a sequence of words. Typically, language models (Peters et al., 2018) are applied at the sentence level. Consider a sentence s = {(w_m, y_m) | m = 1:M} of length M in a document d, where (w_m, y_m) is a tuple containing the indices of the input and output words in a vocabulary of size V. An LM computes the joint probability p(s), i.e., the likelihood of s, as a product of conditional probabilities:

p(s) = p(y_1, \ldots, y_M) = p(y_1) \prod_{m=2}^{M} p(y_m \mid y_{1:m-1})

where p(y_m | y_{1:m-1}) is the probability of word y_m conditioned on the preceding context y_{1:m-1}. RNN-based LMs capture linguistic properties in their recurrent hidden state r_m ∈ R^H and compute an output state o_m ∈ R^H for each y_m:

(o_m, r_m) = f(r_{m-1}, w_m); \quad p(y_m \mid y_{1:m-1}) = p(y_m \mid o_m)   (1)

where the function f(·) can be a standard LSTM (Hochreiter & Schmidhuber, 1997) or GRU (Cho et al., 2014) cell and H is the number of hidden units. As illustrated in Figure 2(c), the NLM component in our proposed NCLM framework is based on an LSTM cell, i.e., f = f_LSTM. The conditional p(y_m | o_m) is then computed using multinomial logistic regression:

p(y_m \mid o_m) = \frac{\exp(o_m^T U_{:,y_m} + a_{y_m})}{\sum_{j=1}^{V} \exp(o_m^T U_{:,j} + a_j)}   (2)

where U ∈ R^{H×V} and a ∈ R^V are NLM decoding parameters and V is the vocabulary size. Here, the input and output indices are related as y_m = w_{m+1}. Finally, the NLM computes the log-likelihood L_NLM of s as its training objective and maximizes it:

\mathcal{L}_{NLM} = \log p(y_1) + \sum_{m=2}^{M} \log p(y_m \mid o_m)   (3)
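To make equations (1)-(3) concrete, the following is a minimal PyTorch sketch of an LSTM language model step; the module and variable names (TinyLSTMLM, the vocabulary size, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TinyLSTMLM(nn.Module):
    """Minimal LSTM language model sketch for Eqs. (1)-(3); not the paper's implementation."""
    def __init__(self, vocab_size, emb_size=300, hidden_size=600):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)  # f_LSTM in Eq. (1)
        self.U = nn.Linear(hidden_size, vocab_size)                   # decoding params U, a in Eq. (2)

    def forward(self, w):                    # w: (batch, M) input word indices
        o, _ = self.lstm(self.emb(w))        # o: (batch, M, H) output states o_m
        return self.U(o)                     # unnormalized logits o_m^T U + a

# usage: maximizing L_NLM of Eq. (3) == minimizing token-level cross-entropy
model = TinyLSTMLM(vocab_size=10000)
w = torch.randint(0, 10000, (8, 30))         # inputs w_m
y = torch.randint(0, 10000, (8, 30))         # targets y_m = w_{m+1}
logits = model(w)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), y.reshape(-1))
loss.backward()
```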
3. Neural Composite Language Model

While an NLM captures sentence-level (short-range) linguistic properties, it tends to ignore the document-level (long-range) context across sentence boundaries. Khandelwal et al. (2018) have shown that even when multiple preceding sentences are given as context to predict the current word, it is difficult to capture long-term dependencies beyond a distance of about 200 words. Therefore, a composition of NLM and NTM provides broader document-level semantic awareness during sequence modeling by leveraging the document-topic proportion (association) extracted by the NTM. This complementary learning leads to improved language understanding, accounting for both sentence- and document-level semantics. See Table 1 for a description of the notation used.

Table 1. Description of the notation used in this work.

| Notation | Description |
|---|---|
| NTM, NLM | Neural Topic Model, Neural Language Model |
| V, Z, K | NLM vocabulary size, NTM vocabulary size, number of topics |
| LTR, ETR | Latent and Explainable Topic Representations |
| H, D_E | LSTM hidden size, word embedding size |
| d, s, y | a document, a sentence in d, a word in s |
| V ∈ R^Z | bag-of-words (BoW) representation of document d |
| d-s, s-y | d after removing s, s after removing y |
| W ∈ R^{K×Z}, U ∈ R^{H×V} | decoding matrices of NTM and NLM |
| N, ε ∈ R^K | Gaussian distribution, a sample from N |
| E ∈ R^{D_E×Z} | pre-trained word embedding matrix |
| µ, σ ∈ R^K | mean, variance of the approximate posterior N |
| r_m, o_m ∈ R^H | hidden, output vectors of the LSTM cell |
| h_{d-s}, z^att_{d-s} | LTR and ETR representations of document d-s |
| o^LTA_d, o^ETA_d, o^LETA_d ∈ R^H | topic composition of document d with o_m |
| t | a list of K topics with top-N words each |
| o^LTA_{d,s}, o^ETA_{d,s}, o^LETA_{d,s} ∈ R^H | topic composition of sentence s with o_m |

3.1. Neural Topic Model

In this work, the NTM (Figure 2(a)) is based on the Neural Variational Document Model (NVDM) proposed by Miao et al. (2016). It is an unsupervised generative model that learns to regenerate an input document V using a continuous latent topic representation h, which is sampled from a prior Gaussian distribution p(h). NVDM adopts the neural variational inference framework to compute a posterior Gaussian distribution q(h|V) that approximates the true prior p(h). Given a document d, let V ∈ R^Z be its bag-of-words (BoW) representation and v_i ∈ R^Z the one-hot representation of the i-th word of the vocabulary of size Z. The generative process of NVDM (Algorithm 1, lines 9-18) is:

Step 1: A latent topic representation h ∈ R^K is sampled by encoding V with an MLP encoder f_MLP followed by two linear projections l_1 and l_2, as shown in Figure 2(a). For each input V, the encoder network generates the mean µ(V) and deviation σ(V) that parameterize the approximate posterior q(h|V) as a diagonal Gaussian, and h is sampled from it (Algorithm 2, lines 13-20):

h \sim q(h \mid V) = N(h \mid \mu(V), \mathrm{diag}(\sigma^2(V)))

Step 2: Conditional word probabilities p(v_i | h) are computed independently for each word, using multinomial logistic regression with parameters shared across all documents:

p(v_i \mid h) = \frac{\exp(h^T W_{:,i} + b_i)}{\sum_{j=1}^{Z} \exp(h^T W_{:,j} + b_j)}   (4)

where W ∈ R^{K×Z} and b ∈ R^Z are NTM decoding parameters. The word probabilities p(v_i|h) are used to compute the document probability p(V|h) conditioned on h, and marginalizing over the latent representation h gives the likelihood p(V) of document d:

p(V) = \int_h p(h)\, p(V \mid h)\, dh \quad \text{and} \quad p(V \mid h) = \prod_{i=1}^{N_d} p(v_i \mid h)

where N_d is the number of words in document d. However, it is intractable to integrate over all possible configurations of h ~ p(h). Therefore, NVDM uses the neural variational inference framework and optimizes the evidence lower bound:

\mathcal{L}_{NTM} = \mathbb{E}_{q(h \mid V)}\Big[\sum_{i=1}^{N_d} \log p(v_i \mid h)\Big] - KL[q(h \mid V)\,\|\,p(h)]   (5)

Since L_NTM is a lower bound, i.e., log p(V) ≥ L_NTM, NVDM maximizes the document log-likelihood log p(V) by maximizing the bound itself. L_NTM can be maximized by back-propagating gradients w.r.t. the model parameters using samples drawn from the posterior q(h|V). NVDM assumes that both the prior p(h) and the posterior q(h|V) are Gaussian and hence employs the KL divergence KLD = KL[q(h|V)||p(h)] in equation 5 as a regularizer that conforms q(h|V) to the Gaussian prior.
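A minimal sketch of the NVDM encoder and the ELBO of equations (4)-(5), using the standard Gaussian reparameterization trick; the class name NVDMSketch and the layer sizes are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NVDMSketch(nn.Module):
    """Sketch of the NTM component (NVDM): encoder q(h|V) and decoder p(v_i|h)."""
    def __init__(self, vocab_z=2000, n_topics=100, hidden=500):
        super().__init__()
        self.f_mlp = nn.Sequential(nn.Linear(vocab_z, hidden), nn.Sigmoid())  # f_MLP
        self.l1 = nn.Linear(hidden, n_topics)        # mu(V)
        self.l2 = nn.Linear(hidden, n_topics)        # log sigma^2(V)
        self.W = nn.Linear(n_topics, vocab_z)        # decoder W, b of Eq. (4)

    def forward(self, V):                            # V: (batch, Z) bag-of-words counts
        pi = self.f_mlp(V)
        mu, logvar = self.l1(pi), self.l2(pi)
        eps = torch.randn_like(mu)                   # eps ~ N(0, I)
        h = mu + eps * torch.exp(0.5 * logvar)       # reparameterized sample h ~ q(h|V)
        log_p = F.log_softmax(self.W(h), dim=-1)     # log p(v_i|h), Eq. (4)
        rec = (V * log_p).sum(-1)                    # sum_i log p(v_i|h), weighted by word counts
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL[q(h|V) || N(0, I)]
        return h, -(rec - kld).mean()                # LTR h and negative ELBO (to minimize)

# toy usage with random BoW-like inputs
h, neg_elbo = NVDMSketch()(torch.rand(4, 2000))
```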
[Figure 2. Illustration of our proposed Neural Composite Language Modeling (NCLM) framework: (a) Neural Topic Model (NTM), (b) Latent and Explainable Topic Representation Extraction, and (c) Neural Language Model (NLM), used for the LM and WSD tasks.]

3.2. Topical Representation Extraction

To exploit document-level semantics while language modeling, we extract topics using NVDM and represent the semantics of the extracted topics in two forms.

Latent Topic Extraction: We sample the latent topic representation h ∈ R^K as shown in Figure 2(a) and Algorithm 2 (lines 13-20). Essentially, the topic vector h is an abstract (latent) representation of the topic-word distributions of the K topics and represents a document-topic proportion (association), i.e., a mixture of K latent topics describing the document being modeled. Precisely, each scalar value h_k ∈ R denotes the contribution of the k-th topic in representing document d by h. We name h the Latent Topic Representation (LTR) and denote it h_d for an input document d.

Explainable Topic Extraction: Beyond the latent topic proportion, we also extract explainable topics (a fine-grained description) obtained from the high-probability key terms of the topic-word distribution corresponding to each latent topic k. In doing so, we use the NTM decoding weight parameter W ∈ R^{K×Z}, i.e., a topic matrix in which each k-th row W_{k,:} ∈ R^Z is a distribution over vocabulary words for the k-th topic. As illustrated in Figure 2(b), we extract key terms for each topic using the utility TOPIC-EXTRACT. Algorithm 2 (lines 1-11) and Algorithm 2 (lines 22-30) describe the mechanism of learning and extracting the explainable topic representation using GET-ETR. Observe that TOPIC-EXTRACT filters out top key terms that do not appear in the document being modeled, in order to highlight the contribution of topical words shared between the topic-word distribution and the document itself. Specifically, TOPIC-EXTRACT returns K lists of key terms explaining each latent topic h_k, i.e., t = [t^k | k = 1:K], where t^k holds the top-N key terms of the k-th topic. We use a mask D to apply the filter:

t = \text{row-argmax}[W \odot D]_{1:topN}

where row-argmax returns the indices of the top-N values in each row of its input matrix, ⊙ is the element-wise Hadamard product, and D ∈ R^{K×Z} is an indicator matrix whose column D_{:,i} equals 1_K if the count of the i-th vocabulary word in V is non-zero and 0_K otherwise. Now, for each topic k, we look up pre-trained word embeddings (Bojanowski et al., 2017) in the matrix E ∈ R^{D_E×Z} for each word index in t^k and average them to compute the explanatory topic-embedding vector z^k:

z^k = \frac{1}{topN} \sum_{j=1}^{topN} \text{emb\_lookup}(E, t^k_j)

Finally, we take the weighted sum of the topic vectors z^k, using the document-topic proportion vector ĥ as weights, to compute z^att:

z^{att} = \sum_{k=1}^{K} z^k \cdot \hat{h}_k, \quad \text{where} \quad \hat{h} = \text{softmax}(h)

We name z^att the Explainable Attentive Topic Representation (ETR) and denote it z^att_d for a document d.
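The ETR computation above amounts to a masked top-N lookup followed by an embedding average and an attention-style weighted sum. Below is a small NumPy sketch under assumed shapes (K topics, Z topic-model vocabulary words, D_E embedding dimensions); it mirrors TOPIC-EXTRACT and GET-ETR but is not the authors' implementation.

```python
import numpy as np

def get_etr(W, V, E, h, top_n=5):
    """Sketch of TOPIC-EXTRACT + GET-ETR.
    W: (K, Z) topic matrix, V: (Z,) BoW counts, E: (D_E, Z) embeddings, h: (K,) LTR."""
    K, Z = W.shape
    D = np.tile((V > 0).astype(W.dtype), (K, 1))           # mask: 1 where the word occurs in the document
    masked = W * D                                          # Hadamard product W ⊙ D
    t = np.argsort(-masked, axis=1)[:, :top_n]              # row-argmax: top-N word indices per topic
    z = np.stack([E[:, t[k]].mean(axis=1) for k in range(K)])  # z^k: mean embedding of top-N terms
    h_hat = np.exp(h - h.max()); h_hat /= h_hat.sum()       # ĥ = softmax(h)
    return (h_hat[:, None] * z).sum(axis=0)                 # z^att: (D_E,) weighted sum over topics

# toy usage with random tensors
K, Z, D_E = 10, 200, 50
z_att = get_etr(np.random.rand(K, Z), np.random.randint(0, 3, Z),
                np.random.rand(D_E, Z), np.random.rand(K))
```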
3.3. Joint Topic and Language Model

To simplify notation in the remaining sections, we drop the position index from (w_m, y_m, o_m, r_m) in equations 1, 7 and 3 and simply write (w, y, o, r), since our method is independent of word position. In this section we describe the composition of a topical representation c ∈ {h_d, z^att_d} with the output vector o of the NLM, so that the NLM is aware of document-level semantics while language modeling. We denote the composition function by (o ◦ c): we first concatenate the two complementary representations o and c and then apply a projection,

\hat{o} = (o \circ c) = \text{sigmoid}([o; c]^T W_p + b_p)   (6)

where W_p ∈ R^{Ĥ×H} and b_p ∈ R^H are projection parameters and Ĥ = H + K. We then compute the prediction probability of the output word y as:

p(y \mid o, c) = \frac{\exp(\hat{o}^T U_{:,y} + a_y)}{\sum_{j=1}^{V} \exp(\hat{o}^T U_{:,j} + a_j)}   (7)

Using this composition scheme, we employ the two representations, LTR and ETR, exclusively or in combination within the NCLM framework. The proposed configurations are as follows.

Latent Topic-aware NLM: Existing works (Lau et al., 2017; Dieng et al., 2017; Wang et al., 2018) on marrying topic and language models leverage the latent document-topic representation h to incorporate document-level semantics into sequence modeling. Modeling in such a composite setting can be tricky: to prevent the NLM from memorizing the next word through the NTM input, prior works exclude the current sentence from the document before it is input to the NTM. Thus, for a given document d and a sentence s on the NLM side, we compute the LTR vector h_{d-s} by modeling the d − s sentences with the NTM. We then compose it with the output vector o of the NLM to obtain the representation o^LTA_d using equation 6, i.e., o^LTA_d = (o ◦ h_{d-s}). We name this composition scheme LTA-NLM; it serves as a baseline for our contributions.

Algorithm 1: Computation of the combined loss L
 1: Input: sentence s = {(w_m, y_m) | m = 1:M}
 2: Input: V ∈ R^Z, the BoW of document d-s containing N_{d-s} words
 3: Input: pretrained embedding matrix E
 4: Parameters: {W, U, b, a, f_MLP, l_1, l_2, f_LSTM}
 5: Hyper-parameters: {α, topN, g}
 6: Initialize: p(h) ← N(h | 0, diag(I))
 7: Initialize: p(V|h) ← 1; p(s|V) ← 1; r_0 ← 0
 8:
 9: Neural Topic Model:
10:   # Sample Latent Topic Representation (LTR) h
11:   h, q(h|V) ← SAMPLE-h(f_MLP, g, V, l_1, l_2, sigmoid)
12:   # Compute KL divergence between the true prior p(h) and q(h|V)
13:   KLD ← KL[q(h|V) || p(h)]
14:   for i = 1 to N_{d-s} do
15:     p(v_i|h) ← exp(h^T W_{:,i} + b_i) / Σ_{j=1}^{Z} exp(h^T W_{:,j} + b_j)
16:     p(V|h) ← p(V|h) · p(v_i|h)
17:   end for
18:   L_NTM ← log p(V|h) − KLD
19:   if ETA or LETA then
20:     # Extract Explainable Topic Representation (ETR)
21:     z^att_{d-s} ← GET-ETR(W, V, topN, h, E)
22:   end if
23:
24: Neural Composite Language Model:
25:   for m = 1 to M do
26:     o_m, r_m ← f_LSTM(r_{m-1}, w_m)
27:     # Composition of NTM and NLM
28:     if LTA then
29:       ô_m ← (o_m ◦ h_{d-s})
30:     else if ETA then
31:       ô_m ← (o_m ◦ z^att_{d-s})
32:     else if LETA then
33:       ô_m ← (o_m ◦ [h_{d-s}; z^att_{d-s}])
34:     end if
35:     p(y_m | o_m, V) ← exp(ô_m^T U_{:,y_m} + a_{y_m}) / Σ_{j=1}^{V} exp(ô_m^T U_{:,j} + a_j)
36:     p(s|V) ← p(s|V) · p(y_m | o_m, V)
37:   end for
38:   L_NLM ← log p(s|V)
39:   L ← α · L_NTM + (1 − α) · L_NLM

Explainable Topic-aware NLM: As discussed in section 3.2, the ETR explains each latent topic and provides a fine-grained descriptive outlook through a set of key terms. Complementary to the LTR, we use the ETR vector in composition with the NLM: we compose the ETR representation z^att_{d-s} of the d − s sentences of document d with the NLM output vector o to obtain o^ETA_d using equation 6, i.e., o^ETA_d = (o ◦ z^att_{d-s}). This newly composed vector o^ETA_d encodes fine-grained explainable topical semantics to be used in sequence modeling. We name this composition ETA-NLM. To our knowledge, none of the existing approaches to joint topic and language modeling leverage explainable topics, i.e., topic-word distributions, in NLMs; the proposed ETA-NLM is the first to exploit them.

Algorithm 2: Utility functions
 1: function GET-ETR(W, V, topN, h, E)
 2:   # Extract topN words from each topic belonging to d
 3:   t ← TOPIC-EXTRACT(W, V, topN)
 4:   # Embedding lookup and averaging to get topic embeddings
 5:   for k = 1 to K do
 6:     z^k ← (1/topN) Σ_{j=1}^{topN} emb_lookup(E, t^k_j)
 7:   end for
 8:   # Weighted sum of all topic embeddings
 9:   z^att ← Σ_{k=1}^{K} z^k · ĥ_k;  ĥ ← softmax(h)
10:   return z^att
11: end function
12:
13: function SAMPLE-h(f, g, V, l_1, l_2, act)
14:   # Sample h from a Gaussian distribution conditioned on V
15:   π ← act(f(V));  ε ~ N(ε | 0, diag(I))
16:   µ(V) ← l_1(π);  σ(V) ← l_2(π)
17:   q(h|V) ← N(h | µ(V), diag(σ²(V)))
18:   h ← µ(V) + ε ⊙ σ(V), so that h ~ q(h|V)
19:   return g(h), q(h|V)
20: end function
21:
22: function TOPIC-EXTRACT(W, V, topN)
23:   # Create mask matrix D ∈ R^{K×Z} initialized with 0
24:   for i = 1 to Z do
25:     replace all 0 with 1 in column D_{:,i} if the count of the i-th vocabulary word in V is non-zero
26:   end for
27:   # Take the Hadamard product and find the topN maximum values per row
28:   t ← row-argmax[W ⊙ D]_{1:topN}
29:   return t
30: end function
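Before moving to the combined configurations, here is a minimal PyTorch sketch of the composition step of equations (6)-(7): concatenate the NLM output state with a topic vector, project, and decode over the NLM vocabulary. The class name TopicComposition and the dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TopicComposition(nn.Module):
    """Sketch of Eqs. (6)-(7): ô = sigmoid([o; c] W_p + b_p), then softmax over the vocabulary."""
    def __init__(self, hidden=600, topic_dim=100, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(hidden + topic_dim, hidden)    # W_p, b_p in Eq. (6)
        self.decode = nn.Linear(hidden, vocab)               # U, a in Eq. (7)

    def forward(self, o, c):
        # o: (batch, M, H) NLM output states; c: (batch, topic_dim) LTR, ETR or [LTR; ETR]
        c = c.unsqueeze(1).expand(-1, o.size(1), -1)          # broadcast the topic vector over time steps
        o_hat = torch.sigmoid(self.proj(torch.cat([o, c], dim=-1)))
        return self.decode(o_hat)                             # logits for p(y | o, c)

# usage: compose LTR h_{d-s} (as in LTA-NLM) with LSTM output states
comp = TopicComposition()
logits = comp(torch.randn(8, 30, 600), torch.randn(8, 100))
```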
Latent and Explainable Topic-aware NLM: We now leverage the two complementary topical representations, the latent h_{d-s} and explainable z^att_{d-s} vectors, jointly. We concatenate them and compose the result with the output vector o of the NLM to obtain o^LETA_d using equation 6, i.e., o^LETA_d = (o ◦ [h_{d-s}; z^att_{d-s}]). We name this composition LETA-NLM, for its latent and explainable topic vectors.

3.4. Sentence-level Topical Discourse

As discussed in section 1 and illustrated in Figure 1(b), sentence-level topics are needed to avoid dominant-topic mismatch. Thus, we retain sentence-level topical discourse (SDT) by incorporating sentence-level topic associations (latent and/or explainable) while modeling the sentence in the NLM. To avoid memorization of the current word y being predicted, we remove it from the sentence s, i.e., s − y is the input to the NTM when computing the sentence topic proportion. Given the latent and explainable representations, we first extract the sentence-level LTR h_{s-y} and ETR z^att_{s-y} vectors and then concatenate them with the corresponding document-level LTR and/or ETR vectors before composing them with the NLM. This yields the following additional compositions for every sentence s in a document d:

LTA-NLM +SDT:  o^LTA_{d,s} = (o ◦ [h_{d-s}; h_{s-y}])
ETA-NLM +SDT:  o^ETA_{d,s} = (o ◦ [z^att_{d-s}; z^att_{s-y}])
LETA-NLM +SDT: o^LETA_{d,s} = (o ◦ [h_{d-s}; h_{s-y}; z^att_{d-s}; z^att_{s-y}])

These composed output vectors are likewise used to assign a probability to the output word y using equation 7. To summarize, we have presented six configurations of our proposed NCLM framework, based on the composition of latent and/or explainable representations as well as document-topic and sentence-topic associations:

p(y \mid o, c) = \frac{\exp(\hat{o}^T U_{:,y} + a_y)}{\sum_{j=1}^{V} \exp(\hat{o}^T U_{:,j} + a_j)}, \quad \text{where} \quad \hat{o} \in \{o^{LTA}_d, o^{ETA}_d, o^{LETA}_d, o^{LTA}_{d,s}, o^{ETA}_{d,s}, o^{LETA}_{d,s}\}

3.5. Training Objective

Training of the joint topic and language model is performed by maximizing the joint log-likelihood objective L, a linear combination of the log-likelihood of document d via the NTM and of sentence s via the NLM:

\mathcal{L} = \alpha\, \mathcal{L}_{NTM} + (1 - \alpha)\, \mathcal{L}_{NLM}

where α ∈ [0, 1] is a hyper-parameter that maintains a balance between NTM and NLM during joint training by updating the model parameters at different scales.
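As a concrete illustration of the inputs required by the +SDT variants of Section 3.4, the sketch below builds the bag-of-words vectors for d − s (the document minus the current sentence) and s − y (the sentence minus the current target word) that are fed to the NTM; the helper names and the simple token representation are assumptions, not the paper's preprocessing code.

```python
from collections import Counter
import numpy as np

def bow(tokens, vocab):
    """Bag-of-words count vector over the NTM vocabulary."""
    v = np.zeros(len(vocab))
    for t, c in Counter(tokens).items():
        if t in vocab:
            v[vocab[t]] += c
    return v

def ntm_inputs(doc_sents, s_idx, y_idx, vocab):
    """Return BoW of d-s (document without sentence s) and s-y (sentence without target word y)."""
    d_minus_s = [t for i, sent in enumerate(doc_sents) if i != s_idx for t in sent]
    s_minus_y = [t for j, t in enumerate(doc_sents[s_idx]) if j != y_idx]
    return bow(d_minus_s, vocab), bow(s_minus_y, vocab)

# usage with a toy 3-sentence document, predicting word index 2 of sentence 1
doc = [["an", "integrated", "circuit"], ["production", "of", "chip"], ["sales", "grow", "fast"]]
vocab = {w: i for i, w in enumerate(sorted({t for s in doc for t in s}))}
v_d_minus_s, v_s_minus_y = ntm_inputs(doc, s_idx=1, y_idx=2, vocab=vocab)
# combined objective (Sec. 3.5): loss = alpha * L_NTM + (1 - alpha) * L_NLM
```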
3.6. Computational Complexity of the NCLM Framework

NTM component complexity: The computational costs of extracting the latent and explainable topic representations of a document d are as follows.
1. Latent topic extraction: computing the latent topic representation (LTR) vector h_d ∈ R^K via the matrix projection on the encoder side costs O(KZ), where K is the number of topics.
2. Explainable topic extraction: computing the explainable attentive topic representation (ETR) vector z^att_d costs O(KZ + K(Z log Z + D_E · topN)), where (a) O(KZ) is the cost of computing the mask matrix D and taking the Hadamard product with the topic matrix W, (b) O(Z log Z) is the cost of sorting the k-th row W_{k,:} of the topic matrix W ∈ R^{K×Z}, and (c) O(D_E · topN) is the cost of extracting and adding the pre-trained word embeddings from E for the top-N key terms of the k-th topic. Therefore, asymptotically, the complexity of computing z^att_d is O(K(Z log Z + D_E · topN)).

NLM component complexity: The NLM component of our NCLM framework is implemented as a recurrent neural network with LSTM cell(s). For each input word w_m of a sentence s = {(w_m, y_m) | m = 1:M} in document d, the cost of the NLM component can be split into three parts:
1. Hidden state computation: computing the output state o_m via the LSTM cell costs O(H(H + H_i)), where H is the number of LSTM hidden units and H_i is the input word embedding size.
2. Topic composition: composing (◦), via concatenation and projection, the topic representation c ∈ {h_d, z^att_d, [h_d; z^att_d]} with the output state o_m costs O((H + H_c)H), where H_c ∈ {K, D_E, K + D_E} is the size of the topic representation c.
3. Word prediction: predicting the output word via multinomial logistic regression over the NLM vocabulary costs O(HV), where V is the NLM vocabulary size.

Therefore, the combined cost for all M words in sentence s is O_s = O(MH(H + H_i + H_c + V)).

Composite model complexity: Depending on the type of topic composition, our proposed models have the following complexities:
LTA-NLM:  O_s + O(KZ)
ETA-NLM:  O_s + O(K(Z log Z + D_E · topN))
LETA-NLM: O_s + O(K(Z log Z + D_E · topN))

Sentence-level topical discourse: For each output word y_m in sentence s, we additionally compute the latent/explainable topical representation of the sentence after removing y_m, i.e., s − y_m, to maintain topical discourse across sentences in document d. Hence M additional LTR/ETR vectors are computed per sentence, and the topic-side cost increases by a factor of M:
LTA-NLM +SDT:  O_s + O(MKZ)
ETA-NLM +SDT:  O_s + O(MK(Z log Z + D_E · topN))
LETA-NLM +SDT: O_s + O(MK(Z log Z + D_E · topN))

4. Experiments and Results

To demonstrate the positive influence of composing LTR and ETR representations in neural language modeling, we perform quantitative and qualitative evaluation of our proposed models on five NLP tasks.

4.1. Evaluation: Language Modeling

We present language modeling results of our proposed models on the APNEWS, IMDB and BNC datasets (Lau et al., 2017). For the NLM, we tokenize sentences and documents into words, lowercase all words and remove words that occur fewer than 10 times. For the NTM, we additionally remove stopwords, words occurring fewer than 100 times, and the top 0.1% most frequent words.

Table 2. Language modeling perplexity on three datasets under two NLM settings: S = small-NLM, L = large-NLM. "+ SDT" augments the model in the previous row; (*) marks scores taken from Wang et al. (2018), separating previous works from our models; GAIN(%) is the improvement of the best proposed model over the LSTM-LM baseline.

| MODEL | APNEWS S | APNEWS L | IMDB S | IMDB L | BNC S | BNC L |
|---|---|---|---|---|---|---|
| LDA+LSTM* | 54.83 | 50.17 | 69.62 | 62.78 | 96.38 | 87.28 |
| LCLM* | 54.18 | 50.63 | 67.78 | 67.86 | 87.47 | 80.68 |
| TopicRNN* | 54.12 | 50.01 | 66.45 | 60.14 | 93.55 | 84.12 |
| TDLM* | 52.65 | 48.21 | 63.82 | 58.59 | 86.43 | 80.58 |
| TCNLM* | 52.59 | 47.74 | 62.59 | 56.12 | 86.21 | 80.12 |
| LSTM-LM | 64.95 | 59.28 | 72.31 | 65.54 | 106.82 | 98.78 |
| LTA-NLM | 55.48 | 49.61 | 68.21 | 61.49 | 98.31 | 89.36 |
| + SDT | 48.23 | 42.85 | 63.81 | 58.90 | 90.36 | 80.30 |
| ETA-NLM | 49.34 | 43.19 | 59.20 | 51.40 | 95.62 | 87.22 |
| + SDT | 48.75 | 43.50 | 57.83 | 50.51 | 95.64 | 88.37 |
| LETA-NLM | 48.33 | 43.17 | 58.10 | 52.35 | 94.78 | 86.73 |
| + SDT | 42.98 | 39.41 | 56.65 | 51.05 | 88.30 | 81.12 |
| GAIN (%) | 33.8 | 33.5 | 21.6 | 22.1 | 17.3 | 18.7 |
We use standard language model perplexity as the evaluation measure for our proposed models. For data statistics and the time complexity of the experiments, refer to the supplementary material.

Experimental setup: We follow Wang et al. (2018) for our experimental setup; see the supplementary material for detailed hyperparameter settings. The sentence s being modeled on the NLM side is removed from document d on the NTM side. We use two settings of the NLM component: (1) small-NLM (1 layer, 600 hidden units) and (2) large-NLM (2 layers, 900 hidden units). We fix the NLM sequence length to 30; longer sentences are split into multiple sequences of length at most 30. We initialize the NLM input word embeddings with 300-dimensional pretrained embeddings from the word2vec model (Mikolov et al., 2013) trained on Google News. We perform an ablation study to find the best setting of the hyperparameters α and topN (see supplementary).

Baselines: We compare our proposed models with seven baselines: (i) LDA+LSTM: concatenating a pretrained LDA topic-proportion vector with LSTM-LM; (ii) LCLM (Wang & Cho, 2016); (iii) TopicRNN (Dieng et al., 2017); (iv) TDLM (Lau et al., 2017); (v) TCNLM (Wang et al., 2018); (vi) LSTM-LM: the NLM component of our proposed models; and (vii) LTA-NLM: our baseline model.

Results: Language modeling perplexity scores are reported in Table 2. All topic-composition models outperform the LSTM-LM baseline, which demonstrates the advantage of composing document topical semantics into the NLM. Three key observations: (i) ETA-NLM always performs better than LTA-NLM, yielding improvements of 11% (49.34 vs 55.48), 13.2% (59.20 vs 68.21) and 2.7% (95.62 vs 98.31) on APNEWS, IMDB and BNC, respectively, under the small-NLM configuration. This suggests that the ETR vector captures fine-grained document topical semantics more effectively than the LTR vector. (ii) LETA-NLM outperforms both LTA-NLM and ETA-NLM by exploiting the complementary semantics of the ETR and LTR vectors. (iii) Composing sentence-level topic representations (+SDT) further boosts performance by maintaining sentence-level topical discourse; LETA-NLM +SDT improves upon LTA-NLM by 22% (42.98 vs 55.48), 17% (56.65 vs 68.21) and 10% (88.30 vs 98.31) on APNEWS, IMDB and BNC, respectively, under the small-NLM configuration.
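For reference, the perplexity reported here is the standard exponentiated average negative log-likelihood per predicted word; a tiny sketch of the computation, assuming per-token log-probabilities are already available from a trained model.

```python
import math

def perplexity(token_log_probs):
    """Corpus-level perplexity: exp of the mean negative log p(y_m | context) over all tokens."""
    n = sum(len(sent) for sent in token_log_probs)
    nll = -sum(lp for sent in token_log_probs for lp in sent)
    return math.exp(nll / n)

# usage: two toy "sentences" of per-token natural-log probabilities
print(perplexity([[-2.1, -3.0, -1.2], [-2.5, -0.9]]))
```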
4.2. Evaluation: Topic Modeling

Topic models are typically evaluated using perplexity. However, in the context of our NCLM framework, where we compose topical semantics into language modeling, we instead investigate the quality of the topics generated by our composite models. Following Wang et al. (2018), we compute the topic coherence of the top 5/10/15/20 topic words of each topic using pairwise NPMI scores and average them to obtain an average coherence score. We use the same experimental setup and hyperparameter settings as in section 4.1.

Baselines: Following Wang et al. (2018), we compare the average coherence scores of our proposed models with the following baselines: (i) LDA (Blei et al., 2001); (ii) TDLM (Lau et al., 2017); (iii) TopicRNN (Dieng et al., 2017); (iv) TCNLM (Wang et al., 2018).

[Figure 3. Topic coherence score comparison of our proposed models and multiple baselines; (*) scores taken from Wang et al. (2018).]

Results: Average topic coherence scores are presented as a bar plot in Figure 3. Two key observations: (i) all of our proposed models outperform every baseline by a significant margin; (ii) however, there is no discernible pattern in the topic coherence scores across our proposed models, so an improvement in language modeling performance does not necessarily translate into improved topic coherence. For a qualitative overview, Table 3 shows two randomly chosen topics for each dataset. See the supplementary material for examples of topic-aware sentence generation.

Table 3. Top 5 words of two randomly selected topics extracted using the ETA-NLM model for the APNEWS, IMDB and BNC datasets.

| APNEWS (army) | APNEWS (legal) | IMDB (comedy) | IMDB (disney) | BNC (music) | BNC (art) |
|---|---|---|---|---|---|
| soldiers | jurors | jokes | daffy | album | art |
| infantry | trial | unfunny | cindrella | guitar | paintings |
| brigade | execution | satire | alladin | band | painting |
| veterans | jury | sandler | looney | music | museum |
| battalion | verdict | streep | bambi | pop | gallery |
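Topic coherence via NPMI is computed from word co-occurrence statistics in a reference corpus. The following is a small sketch under the assumption that document-frequency and co-occurrence counts are already available; it illustrates the general pairwise NPMI averaging and is not tied to the exact evaluation scripts used in the paper.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, doc_freq, co_doc_freq, n_docs, eps=1e-12):
    """Average pairwise NPMI over the top words of one topic.
    doc_freq[w]: #docs containing w; co_doc_freq[(w1, w2)]: #docs containing both."""
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2 = doc_freq[w1] / n_docs, doc_freq[w2] / n_docs
        p12 = co_doc_freq.get((w1, w2), 0) / n_docs + eps
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))          # normalization puts NPMI in [-1, 1]
    return sum(scores) / len(scores)

# toy usage with made-up counts over a 100-document corpus
df = {"circuit": 40, "processor": 30, "silicon": 20}
cdf = {("circuit", "processor"): 15, ("circuit", "silicon"): 10, ("processor", "silicon"): 8}
print(npmi_coherence(["circuit", "processor", "silicon"], df, cdf, n_docs=100))
```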
4.3. Evaluation: Text Classification

We evaluate the quality of the representations learned by our proposed models via document classification on three labeled datasets: 20Newsgroups (20NS), Reuters (R21578) and IMDB movie reviews (IMDB); see the supplementary material for data statistics. Based on the scores in Table 2, we employ our best-performing composite language models as static feature extractors. For each document d, we extract (1) the output state o_m for each input word via the NLM component, and (2) the document topic representation c ∈ {h_d, z^att_d, [h_d; z^att_d]} via the NTM component, depending on the model configuration. We then concatenate c with each o_m and use these as inputs to train the CNN-based text classifier proposed by Kim (2014). For the IMDB movie reviews dataset, we use our best models trained on the unlabeled IMDB dataset as feature extractors. Since 20NS and R21578 are news-domain datasets, we employ the best models trained on APNEWS, because of its larger corpus size compared to BNC.

Baselines: Using LTA-NLM, ETA-NLM and LETA-NLM as feature extractors, we obtain CNN-LTA, CNN-ETA and CNN-LETA, respectively. We compare these with: (1) CNN-Rand: a randomly initialized CNN text classifier (Kim, 2014) without embedding updates; (2) +Topic: CNN-Rand with the LTR vector h_d additionally concatenated to each input word embedding; and (3) CNN-LSTM: o_m extracted using a pre-trained LSTM-LM as input to the CNN classifier.

Results: Document classification results are presented in Table 4. Two notable findings: (1) CNN-Rand performs worst among all models, but incorporating document topical semantics (+Topic) improves classification on 20NS (.724 vs .721), R21578 (.699 vs .690) and IMDB (.891 vs .888), which shows the advantage of composing document topic representations during language modeling; (2) the best performance comes from CNN-ETA on 20NS (.775 vs .745) and R21578 (.763 vs .750), and from CNN-LETA on IMDB (.908 vs .899). This shows that ETA-NLM and LETA-NLM learn better representations than LTA-NLM and LSTM-LM and again suggests that the ETR vector captures fine-grained document semantics more effectively than the LTR vector.

Table 4. Text classification accuracy on three datasets. CNN refers to the model proposed by Kim (2014); +Topic augments the topic feature to the model in the row above.

| MODEL | 20NS | R21578 | IMDB |
|---|---|---|---|
| CNN-Rand | .721 | .690 | .888 |
| +Topic | .724 | .699 | .891 |
| CNN-LSTM | .745 | .750 | .899 |
| CNN-LTA | .753 | .759 | .907 |
| CNN-ETA | .775 | .763 | .903 |
| CNN-LETA | .770 | .750 | .908 |

4.4. Evaluation: Information Retrieval

We further evaluate the quality of the learned representations on a document retrieval task, using three labeled datasets: 20Newsgroups (20NS), Reuters (R21578) and AGnews. Following Gupta et al. (2019a), we treat all test documents as queries and retrieve a fraction of the training documents closest to each query under cosine similarity. We then compute the precision for each query as the fraction of retrieved documents with the same label as the query, and average over all queries to obtain the final precision score. Similar to text classification, for each document d of length N_d we extract the final output state o_{N_d} of the NLM component and concatenate it with c ∈ {h_d, z^att_d, [h_d; z^att_d]} extracted via the NTM component to obtain a composite representation; the cosine similarity of each query-document pair is computed over this composite representation. We employ our proposed models pre-trained on the APNEWS dataset as feature extractors and report precision for the top-5 and top-10 retrieved documents on each dataset.

Baselines: Document retrieval is commonly used to evaluate the applicability of topic models. We therefore compare the retrieval performance of the composite representations from our best-performing LTA-NLM, ETA-NLM and LETA-NLM models (per Table 2) against a baseline that uses the document LTR vector h_d extracted via the pre-trained NTM component of our composite language model.

Results: Document retrieval results are presented in Table 5. It is worth noting that: (1) all of our proposed composite models perform much better than the NTM component alone; (2) ETA-NLM and LETA-NLM perform much better than LTA-NLM, which reconfirms that the ETR vector is more descriptive than the LTR vector and supports the NLM in encoding long-range semantic dependencies; and (3) compared to LTA-NLM, ETA-NLM performs best on 20NS (.376 vs .355), while LETA-NLM performs best on R21578 (.664 vs .629) and AGnews (.694 vs .682).

Table 5. Information retrieval evaluation: average precision on three labeled datasets for top-5 (P@5) and top-10 (P@10) retrievals.

| MODEL | 20NS P@5 | 20NS P@10 | R21578 P@5 | R21578 P@10 | AGNEWS P@5 | AGNEWS P@10 |
|---|---|---|---|---|---|---|
| NTM | .198 | .190 | .581 | .567 | .607 | .600 |
| LTA-NLM | .264 | .217 | .585 | .558 | .682 | .666 |
| ETA-NLM | .287 | .242 | .590 | .562 | .683 | .665 |
| LETA-NLM | .281 | .236 | .615 | .589 | .694 | .675 |
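A small sketch of the retrieval evaluation described above (cosine similarity between composite query/document vectors, then precision@k against document labels); the function names are illustrative and the composite representations are assumed to be precomputed.

```python
import numpy as np

def precision_at_k(query_vecs, query_labels, doc_vecs, doc_labels, k=5):
    """Average fraction of the k nearest training documents (by cosine similarity)
    that share the query's label, as in the retrieval evaluation."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                                   # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k most similar documents
    hits = doc_labels[topk] == query_labels[:, None]
    return hits.mean()

# toy usage with random composite representations [o_{N_d}; c]
rng = np.random.default_rng(0)
print(precision_at_k(rng.normal(size=(10, 700)), rng.integers(0, 3, 10),
                     rng.normal(size=(50, 700)), rng.integers(0, 3, 50), k=5))
```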
4.5. Evaluation: Word Sense Disambiguation

Word sense disambiguation (WSD) is the task of predicting the appropriate semantic meaning (sense) of a word given its surrounding context. As in language modeling, a word can have semantic dependencies across sentence boundaries. We therefore show the applicability of our NCLM framework, which boosts correct sense prediction by exploiting document-level topical knowledge to capture long-range semantic dependencies. We focus on the English all-words WSD task, where the aim is to simultaneously predict the correct sense of each word in a given sentence, and use the evaluation framework proposed by Navigli et al. (2017) for training and evaluation.

Experimental setup and baselines: Following Raganato et al. (2017), we use a 1-layer bidirectional LSTM cell with 100 hidden units in the NLM component of our proposed models LTA-NLM, ETA-NLM and LETA-NLM. In the absence of a next-word prediction task, we use the full document context on the NTM side. Our models take all the words in a sentence as input and learn to predict the correct sense via multinomial logistic regression over the vocabulary of all unique senses present in the training data. Models are trained with a learning rate of 1e-3 and a batch size of 32, and predictions are evaluated using micro F1 score. We compare against the following baselines: (1) MFS: the most frequent sense extracted from WordNet (Miller, 1995); and (2) BiLSTM-LM: a language model using a 1-layer bidirectional LSTM cell with 100 hidden units.

Table 6. WSD evaluation results using micro F1 scores. SE07 serves as the development set; the remaining columns are test sets.

| MODEL | SE07 (dev) | SE13 | SE15 | SE2 | SE3 | ALL |
|---|---|---|---|---|---|---|
| MFS | 54.5 | 63.8 | 67.1 | 65.6 | 66.0 | 65.5 |
| BiLSTM-LM | 55.0 | 53.9 | 60.8 | 63.6 | 60.8 | 59.8 |
| LTA-NLM | 56.0 | 54.8 | 60.9 | 64.8 | 62.2 | 60.7 |
| ETA-NLM | 55.4 | 54.7 | 60.6 | 64.8 | 62.3 | 60.7 |
| LETA-NLM | 55.6 | 54.7 | 61.1 | 64.7 | 62.1 | 60.7 |

Results: WSD F1 scores are presented in Table 6. Averaging F1 scores over all test datasets, our proposed models outperform BiLSTM-LM (60.7 vs 59.8), which again confirms the advantage of document-level semantic knowledge in resolving sense ambiguities via composition.

5. Conclusion

We have presented a neural composite language modeling framework that leverages both latent and explainable topic representations by composing a neural language model and a neural topic model. Moreover, we have introduced sentence-topic association alongside document-topic association to retain sentence-level topical discourse. Experimental results on several language understanding tasks support our multi-fold contributions.

Acknowledgments

This research was supported by Bundeswirtschaftsministerium (bmwi.de), grant 01MD19003E (PLASS (plass.io)) at Siemens AG - CT Machine Intelligence, Munich, Germany.

References

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 601-608, 2001.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146, 2017.

Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014, pp. 1724-1734, 2014.

Dieng, A. B., Wang, C., Gao, J., and Paisley, J. W. TopicRNN: A recurrent neural network with long-range semantic dependency. In 5th International Conference on Learning Representations (ICLR 2017), 2017.

Gupta, P., Chaudhary, Y., Buettner, F., and Schütze, H. Document informed neural autoregressive topic models with distributional prior.
In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 6505-6512, 2019a.

Gupta, P., Chaudhary, Y., Buettner, F., and Schütze, H. textTOvec: Deep contextualized neural autoregressive topic models of language with distributed compositional prior. In 7th International Conference on Learning Representations (ICLR 2019), 2019b.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Khandelwal, U., He, H., Qi, P., and Jurafsky, D. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), pp. 284-294, 2018.

Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014, pp. 1746-1751, 2014.

Lau, J. H., Baldwin, T., and Cohn, T. Topically driven neural language model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 355-365, 2017.

Miao, Y., Yu, L., and Blunsom, P. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), pp. 1727-1736, 2016.

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH 2010, pp. 1045-1048, 2010.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations (ICLR 2013), Workshop Track, 2013.

Miller, G. A. WordNet: A lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

Navigli, R., Camacho-Collados, J., and Raganato, A. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pp. 99-110, 2017.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, pp. 2227-2237, 2018.

Raganato, A., Bovi, C. D., and Navigli, R. Neural sequence learning models for word sense disambiguation. In Proceedings of EMNLP 2017, pp. 1156-1167, 2017.

Wang, T. and Cho, K. Larger-context language modelling with recurrent neural network.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), 2016.

Wang, W., Gan, Z., Wang, W., Shen, D., Huang, J., Ping, W., Satheesh, S., and Carin, L. Topic compositional neural language model. In International Conference on Artificial Intelligence and Statistics (AISTATS 2018), pp. 356-365, 2018.