Extracting Topical Phrases from Clinical Documents

Yulan He
School of Engineering and Applied Science, Aston University, UK
y.he@cantab.net

In clinical documents, medical terms are often expressed as multi-word phrases. Traditional topic modelling approaches relying on the bag-of-words assumption are not effective in extracting topical themes from clinical documents. This paper proposes to first extract medical phrases using an off-the-shelf tool for medical concept mention extraction, and then train a topic model which takes a hierarchy of Pitman-Yor processes as priors for modelling the generation of phrases of arbitrary length. Experimental results on patient discharge summaries show that the proposed approach outperforms the state-of-the-art topical phrase extraction model on both perplexity and a topic coherence measure, and finds more interpretable topics.

Introduction

Topic models such as Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003) have been extensively used to automatically discover topical themes from text documents. However, most topic models rely on the bag-of-words assumption, ignoring word order, and the extracted topics are often listed as sequences of word unigrams, which can hinder the interpretability of the discovered topics. In clinical documents, medical terms are often expressed as multi-word phrases, for example, "blood glucose" and "white blood cell". These two phrases, if split into unigrams, would lose their original semantic meanings. They might also be grouped under the same topic by unigram topic models because of the common word "blood". For this reason, simply splitting documents into word unigrams generates ambiguous and spurious topics (Arnold and Speier 2012).

There are in general three categories of approaches to topical phrase extraction. The first is to pre-process documents by extracting phrases, either based on statistical measures for word collocation detection or through frequent pattern mining, and then run traditional topic models such as LDA on the bag-of-phrases. Since each extracted phrase is considered a single term, the resulting vocabulary is significantly enlarged, which leads to sparser data. The topical phrase mining method (ToPMine) (El-Kishky et al. 2014) also runs LDA on the bag-of-phrases, but it decomposes each phrase and imposes a constraint during LDA inference that all the constituent words within the same phrase must be assigned the same topic. However, this method cannot detect less frequent phrases that share common lower-order n-grams, for example, drugs with different dosage levels such as "ciprofloxacin 500 mg" and "ciprofloxacin 250 mg". The second category of approaches extracts topical phrases as a post-processing step to unigram topic models (Blei and Lafferty 2009; Danilevsky et al. 2014). Such approaches assume that words which are simultaneously labelled with the same topic many times can be grouped into a phrase. However, in unigram topic models, words within the same phrase may not be assigned the same topic. Moreover, as topic models are typically run on data with stopwords removed, the post-processing approaches have difficulty recognising phrases which contain stopwords, such as "short of breath".
The last category of approaches combines phrase boundary detection with topic inference in a unified model. Examples include the Topical N-Gram (TNG) model (Wang, McCallum, and Wei 2007), the phrase-discovering topic model (PDLDA) (Lindsey, Headden III, and Stipicevic 2012) and the N-gram Topic Segmentation model (Jameel and Lam 2013). However, these models are computationally expensive and the topical phrases they detect are often of lower quality.

In this paper, we propose a new approach which lies between categories 1 and 3 of the topical phrase extraction approaches above. We first extract medical phrases from patients' discharge summaries using an off-the-shelf medical concept mention extraction tool. The extracted phrases are guaranteed to be of high quality and are also clinically relevant. We then learn a topic model which takes a hierarchy of Pitman-Yor Processes (PYPs) as priors. This allows n-grams of arbitrary length to be captured naturally by taking into account word order within phrases. As will be shown in our experimental results, the proposed approach outperforms the other topical phrase models in terms of both perplexity and a topic coherence measure, and generates more interpretable topics.

We proceed to describe related work in topic modelling of clinical documents and phrase discovery topic models. We then discuss how we extract medical phrases and present our proposed Topical Phrase Model (TPM), followed by experimental results. Finally, we conclude the paper.

Related Work

Topic Modelling of Clinical Documents

Arnold and Speier (2012) proposed a topic model that captures temporal topic patterns in an individual patient's medical record. In such a model, each patient has his or her own timeline consisting of a subset of topics. In each of the patient's clinical reports, the hidden topics not only generate the observed words, but also generate the timestamp associated with each report. Paul and Dredze (2013) used a three-dimensional LDA variant to jointly model combinations of drug (marijuana, salvia, etc.), aspect (effects, chemistry, etc.) and route of administration (smoking, oral, etc.) for generating extractive summaries about drug usage from the web. In both models, documents are modelled as finite mixtures over an underlying set of latent topics which are inferred from word co-occurrence patterns with word order ignored. As such, some of the extracted topics are not necessarily clinically relevant, and post-processing is often required to filter out clinically irrelevant topics. For example, Zheng et al. (2006) proposed to identify topics relevant to biology by calculating the mutual information between the topics and the controlled vocabulary of Gene Ontology (GO) terms tagged in biomedical documents.

Other approaches perform pre-processing on clinical documents before applying the LDA model. For example, Yu et al. (2013) first extracted noun phrases from clinical documents and then ran LDA on the bag-of-noun-phrases. Lehman et al. (2014) extracted a set of UMLS codes from each patient's hospital discharge summary. They then trained a topic model on documents containing unordered sets of UMLS codes. The resulting topics have been shown to be easier to interpret. However, they restricted the UMLS codes to three categories only, either disease, symptom or finding, and ignored other UMLS codes and words. As such, the scope of the study is limited.
Phrase Discovery Topic Models

An early topic model that goes beyond bag-of-words is the Bigram Topic Model (BTM) (Wallach 2006), in which each word is generated from the distribution over words for the context defined by a latent topic and the previous word. Wang et al. (2007) proposed a Topical N-Gram (TNG) model which extended BTM by introducing a switch variable at each word position to signal either the start of a new n-gram or the continuation of a previously identified n-gram. In TNG, however, words within an n-gram do not share the same topic; post-processing is required to take the topic of the final word in an n-gram as the topic of the entire n-gram. Also, in TNG, the topic-specific bigram distributions do not share probability mass with their constituent unigram distributions. To overcome these drawbacks, Lindsey et al. (2012) proposed a phrase-discovering topic model (PDLDA) which uses hierarchical Pitman-Yor process (HPYP) priors for the topic-word matrix. Jameel and Lam (2013) proposed a topic segmentation model which can detect topics at the segment level and at the n-gram word level. All these models essentially involve an additional step of learning phrase boundaries apart from topic detection. Performing phrase segmentation and topic detection simultaneously is computationally more expensive, and phrases detected in this way are often of lower quality.

More recently, El-Kishky et al. (2014) proposed a topical phrase mining method called ToPMine which consists of two steps. It first discovers phrases from text using a method similar to the frequent pattern mining commonly used in association rule mining, and then trains an LDA model on the bag-of-phrases input under the constraint that words in the same phrase must be assigned the same topic.

Our proposed approach also follows a two-step process in which we first extract medical phrases and then train a topical phrase model on the bag-of-phrases input. However, unlike ToPMine, which does not explicitly model the generation of n-grams, our model naturally captures n-grams of arbitrary length through the use of HPYPs. Also, as phrase detection is separated from topic inference, our approach is computationally less complex than PDLDA or the N-gram Topic Segmentation model.

Extracting Medical Phrases

We extract medical phrases from text using a medical term extraction system built upon an open-source toolkit called MedTagger (http://www.ohnlp.org/index.php/MedTagger), which combines machine learning and knowledge bases to identify medical concept mentions in clinical text. MedTagger assigns each extracted medical concept mention to one of 15 semantic groups, which are further classified into 133 subgroups. MedTagger has consistently achieved an F-score of over 0.84 on the medical concept mention extraction task in various NLP challenges (Liu et al. 2012). An advantage of MedTagger is that it also performs concept mention normalisation. For example, an extracted medical term such as "SOB" can be identified as the medical phrase "short of breath". In cases where an extracted term can be mapped to multiple possible medical phrases by MedTagger, we simply take the first identified phrase as the normalised medical term. Figure 1 shows an example output generated by the medical phrase extraction system, where detected medical terms/phrases are highlighted in blue. Clicking on any medical term brings up a dialog box showing the annotation results, including the corresponding normalised phrase, its mapped semantic group, etc.

Figure 1: An example of medical term extraction result.
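The output of this step is, for each document, a flat word sequence together with the phrase boundaries. The sketch below illustrates one way such pre-processing could be organised; it is a minimal illustration, not the paper's pipeline. The `extract_phrase_spans` function and the tiny phrase lexicon stand in for MedTagger (whose actual API differs), and the resulting switch sequence `x` (where `x[i] = 1` marks a word continuing the phrase started by the previous word) is the quantity used later by the topic model.

```python
from typing import List, Tuple

# Tiny stand-in lexicon; in the paper this role is played by MedTagger's
# concept-mention extraction and normalisation.
PHRASE_LEXICON = ["short of breath", "blood glucose", "ciprofloxacin 500 mg"]

def extract_phrase_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of lexicon phrases found in text.
    Assumes matches do not overlap; a real extractor would handle that."""
    spans, lowered = [], text.lower()
    for phrase in PHRASE_LEXICON:
        start = lowered.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase)))
            start = lowered.find(phrase, start + 1)
    return sorted(spans)

def document_to_tokens(text: str, stopwords: set) -> Tuple[List[str], List[int]]:
    """Flatten a document into (words, switches): switches[i] = 1 means token i
    continues the phrase begun by the previous token, 0 otherwise."""
    words, switches, cursor = [], [], 0
    for start, end in extract_phrase_spans(text):
        for w in text[cursor:start].split():              # words outside phrases
            if w.lower() not in stopwords:
                words.append(w.lower()); switches.append(0)
        for i, w in enumerate(text[start:end].split()):   # words inside a phrase
            words.append(w.lower()); switches.append(0 if i == 0 else 1)
        cursor = end
    for w in text[cursor:].split():
        if w.lower() not in stopwords:
            words.append(w.lower()); switches.append(0)
    return words, switches

# Example: "patient short of breath overnight" ->
# words    = ['patient', 'short', 'of', 'breath', 'overnight']
# switches = [0, 0, 1, 1, 0]
```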
Topical Phrase Model (TPM)

Once the medical phrases have been identified, each document consists of a mixed sequence of unigram words and phrases. We propose in this section a Topical Phrase Model (TPM) which is able to learn hidden topics from text and model the generation of n-grams. Since TPM is built upon the Hierarchical Pitman-Yor Process (HPYP), we first describe HPYP before discussing the details of TPM.

Hierarchical Pitman-Yor Process (HPYP)

In previous work (Teh 2006), the hierarchical Pitman-Yor process language model (HPYLM) has been shown to recover exactly the formulation of interpolated Kneser-Ney (Chen and Goodman 1999), one of the best smoothing methods for n-gram language models. In HPYLM, a unigram word distribution $G_\emptyset$ is first generated from the PYP as:

$$G_\emptyset \sim \mathrm{PYP}(a_0, b_0, G_0), \qquad (1)$$

where $G_0$ is a uniform distribution over a fixed vocabulary $W$ of $V$ words, $G_0(w) = 1/V$ for all $w \in W$; $a_0$ is the discount parameter and $b_0$ is the concentration parameter, which controls the amount of variability of $G_\emptyset$ around the prior $G_0$. Bigram word distributions $\{G_u\}_{u \in W}$ are then generated using $G_\emptyset$ as the base distribution, $G_u \sim \mathrm{PYP}(a_1, b_1, G_\emptyset)$. Trigram word distributions $\{G_{uv}\}_{(u,v) \in W^2}$ can then be successively generated from $G_{uv} \sim \mathrm{PYP}(a_2, b_2, G_v)$. This process continues until the context length reaches $n-1$. In general, given a context $\mathbf{u}$ consisting of a sequence of up to $n-1$ words, the distribution over the current word $w$ is generated by

$$G_{\mathbf{u}} \sim \mathrm{PYP}(a_{n-1}, b_{n-1}, G_{\pi(\mathbf{u})}), \qquad (2)$$

where $\pi(\mathbf{u})$ is the suffix of $\mathbf{u}$ consisting of all but the first word.

The generative process of drawing words from the prior is analogous to the generalised Chinese Restaurant Process (CRP) (Pitman 2002), where a restaurant corresponds to each $G_{\mathbf{u}}$ and has an infinite number of tables, each with infinite seating capacity. Each table is served a dish chosen from the base distribution $G_{\pi(\mathbf{u})}$ (i.e., a distinct value drawn from the base distribution $G_{\pi(\mathbf{u})}$). A sequence of customers, corresponding to words drawn from $G_{\mathbf{u}}$, arrives at the restaurant. The first customer sits at the first table; each subsequent customer either chooses an occupied table with probability proportional to the number of customers already sitting there and shares the dish with them, or chooses a new table with probability proportional to a constant parameter and orders a dish from the base distribution. Choosing a dish is equivalent to sending the new table as a proxy customer to the parent restaurant in a recursive manner. This process repeats until the proxy customer chooses to sit at an existing table or there is no parent restaurant left.

In HPYLM, the probability of a word following context $\mathbf{u}$ given the seating arrangement is:

$$P_{\mathbf{u}}(w \mid \Lambda) = \frac{c_{\mathbf{u}w} - a\, t_{\mathbf{u}w}}{c_{\mathbf{u}\cdot} + b} + \frac{a\, t_{\mathbf{u}\cdot} + b}{c_{\mathbf{u}\cdot} + b}\, P_{\pi(\mathbf{u})}(w \mid \Lambda), \qquad (3)$$

where $c_{\mathbf{u}w}$ is the number of customers having dish $w$ in restaurant $\mathbf{u}$, $t_{\mathbf{u}w}$ is the number of tables serving $w$ in restaurant $\mathbf{u}$, and $\Lambda$ denotes the current seating arrangement. A dot is used to indicate marginal counts (i.e., $c_{\mathbf{u}\cdot} = \sum_w c_{\mathbf{u}w}$ and $t_{\mathbf{u}\cdot} = \sum_w t_{\mathbf{u}w}$). For the global base distribution, the predictive probability is $P(w \mid \Lambda) = G_0(w)$. Equation 3 corresponds to interpolated Kneser-Ney, which estimates the probability of word $w$ following context $\mathbf{u}$ by discounting the true count $c_{\mathbf{u}w}$ by a fixed amount $a\, t_{\mathbf{u}w}$ and interpolating the estimated probability of word $w$ with lower-order n-gram probabilities (Teh 2006).
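To make Equation 3 concrete, the sketch below computes the predictive probability of a single context node from its Chinese-restaurant counts, backing off to its parent. The class name, the dict-based bookkeeping and the fixed hyperparameter values are illustrative assumptions rather than the paper's implementation, and updating the counts when customers are added or removed is omitted.

```python
from collections import defaultdict

class PYPNode:
    """One restaurant G_u: keeps c_uw (customers per dish) and t_uw (tables per
    dish) and backs off to its parent distribution, as in Eq. (3)."""
    def __init__(self, discount, concentration, base):
        self.a, self.b, self.base = discount, concentration, base
        self.c = defaultdict(int)   # c_uw: customers eating dish w
        self.t = defaultdict(int)   # t_uw: tables serving dish w

    def prob(self, w):
        c_tot = sum(self.c.values())   # c_u.
        t_tot = sum(self.t.values())   # t_u.
        parent_p = self.base(w)        # P_{pi(u)}(w | seating), or G_0(w)
        if c_tot == 0:
            return parent_p            # empty restaurant: fall back to parent
        return ((self.c[w] - self.a * self.t[w]) / (c_tot + self.b)
                + (self.a * t_tot + self.b) / (c_tot + self.b) * parent_p)

# Usage: a unigram restaurant whose base is the uniform G_0(w) = 1/V, and a
# bigram restaurant for the context "blood" whose base is the unigram node.
V = 7738
unigram = PYPNode(discount=0.5, concentration=1.0, base=lambda w: 1.0 / V)
bigram_after_blood = PYPNode(0.5, 1.0, base=unigram.prob)
print(bigram_after_blood.prob("glucose"))  # backs off to unigram, then to 1/V
```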
Generative Process of TPM

We consider a Bayesian nonparametric version of a topic model in which a topic is assigned to each word token from a document-specific multinomial distribution and a word is generated from a topic-specific distribution taking PYPs as priors. Moreover, an HPYP is used, which allows the modelling of n-gram word sequences of arbitrary length. Compared to the Dirichlet prior commonly used in LDA, using PYPs as priors is more appropriate for natural language since the PYP can capture the fact that word frequencies in natural language follow a power law.

Figure 2: Topical Phrase Model (TPM).

Our proposed Topical Phrase Model is illustrated in Figure 2 and its generative process is described below:

For each topic $k \in \{1, \ldots, K\}$:
- First generate a unigram word distribution, $G^k_\emptyset \sim \mathrm{PYP}(a_0, b_0, G_0)$.
- Then, given a context $\mathbf{u}$ consisting of a sequence of up to $n-1$ words, generate an n-gram word distribution, $G^k_{\mathbf{u}} \sim \mathrm{PYP}(a_{n-1}, b_{n-1}, G^k_{\pi(\mathbf{u})})$, where $\pi(\mathbf{u})$ denotes the parent context of $\mathbf{u}$.

For each document $d \in \{1, \ldots, D\}$:
- Choose a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
- For each phrase $m \in \{1, \ldots, M_d\}$, and for each word within the phrase $l \in \{1, \ldots, L_{d,m}\}$:
  - If it is the first word in the phrase, i.e., $l = 1$:
    - Select a topic $z_{d,m,l} \sim \mathrm{Discrete}(\theta_d)$.
    - Draw a word $w_{d,m,l} \sim \mathrm{Discrete}(G^{z_{d,m,l}}_\emptyset)$.
  - Else:
    - Set $z_{d,m,l} = z_{d,m,l-1}$.
    - Draw a word $w_{d,m,l} \mid \mathbf{u} \sim \mathrm{Discrete}(G^{z_{d,m,l}}_{\mathbf{u}})$.

Note that in the plate diagram shown in Figure 2 we have explicitly drawn the phrase plate: a document $d$ contains a total of $M_d$ phrases and each phrase $m$ contains $L_{d,m}$ words. In our implementation of the Gibbs sampling procedure for TPM, we still loop over every single word of the $N_d$ words in document $d$. For each word $w_{di}$ at position $i$ of document $d$, we use a switch variable $x_{di}$ to indicate whether the word $w_{di}$ should be concatenated with the previous word $w_{d,i-1}$ to form a multi-term phrase. If $x_{di} = 0$, then $w_{di}$ is either the first word of a multi-term phrase or a single word on its own; a topic $z_{di}$ is sampled from the document-specific multinomial distribution and the word is drawn from a topic-specific unigram distribution. If $x_{di} = 1$, then the word $w_{di}$ is part of a multi-term phrase; its topic $z_{di}$ is taken to be the same as $z_{d,i-1}$ and the word is drawn from a topic-specific distribution conditioned on its context $\mathbf{u}$, which includes all the previous words in the phrase. Since we have already identified all the phrases as discussed in the previous section, the switch variable $x$ is observed and does not need to be sampled from data.

In TPM, we place PYPs as prior distributions over word probabilities. For each topic $k \in \{1, \ldots, K\}$, we generate n-gram word distributions by the PYP, $G^k_{\mathbf{u}} \sim \mathrm{PYP}(a_{n-1}, b_{n-1}, G^k_{\pi(\mathbf{u})})$. To build the Gibbs sampling algorithm, we first derive the joint distribution over words $\mathbf{w}$, topic assignments $\mathbf{z}$ and the table configuration $\mathbf{r}$. Letting $\Omega = \{a_0, b_0, \alpha\}$, we have:

$$P(\mathbf{z}, \mathbf{w}, \mathbf{r} \mid \Omega) = \prod_{k=1}^{K} P(\{G^k_{\mathbf{u}}\} \mid a_0, b_0, G_0) \prod_{d=1}^{D} P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} P(z_{di} \mid \theta_d)\, P(w_{di} \mid G^{z_{di}}_{\mathbf{u}}). \qquad (4)$$

If we omit $\Omega$ for clarity, let the index $y = (d, i)$ and let the subscript $\setminus y$ denote a quantity that excludes the counts at word position $i$ of document $d$, we have:

$$P(z_y = k \mid \mathbf{z}_{\setminus y}, \mathbf{w}_{\setminus y}, \mathbf{r}_{\setminus y}) \propto P(z_y = k \mid \mathbf{z}_{\setminus y})\, P^k_{\mathbf{u}}(w_y \mid \mathbf{w}_{\setminus y}, \mathbf{r}_{\setminus y}),$$

where $P(z_y = k \mid \mathbf{z}_{\setminus y}) = \frac{C_{d,k}^{\setminus y} + \alpha_k}{C_{d}^{\setminus y} + \sum_z \alpha_z}$ if the $y$th word is a unigram or the first word in a phrase. Here, $C_{d,k}$ is the number of times topic label $k$ has been assigned to word tokens in document $d$. If the $y$th word is part of a multi-term phrase, we simply take $z_y = z_{y-1}$ and do not sample a topic.
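A schematic of this sampling step is sketched below, under the assumption that the phrase boundaries (the switch variables x) are observed. The helper `topic_word_prob(k, context, w)` stands for the per-topic predictive probability described next (one PYP hierarchy per topic, as in the sketch after Equation 3); seating updates in the restaurants, the removal of the token's previous assignment and hyperparameter resampling are all omitted, and the names are illustrative rather than the paper's code.

```python
import random

def sample_topic(i, words, switches, z, doc_topic_counts, alpha,
                 topic_word_prob, K):
    """One Gibbs update for z[i] within a single document.
    switches[i] == 1 means word i continues the phrase begun at position i-1.
    Decrementing/incrementing the document-topic counts and the restaurant
    seating statistics for the resampled token is omitted for brevity."""
    if switches[i] == 1:
        z[i] = z[i - 1]            # phrase continuation: inherit topic, no sampling
        return
    w = words[i]
    # p(z_i = k | rest) is proportional to (C_{d,k} + alpha_k) * P^k_u(w_i);
    # the context u is empty because this is the first word of a phrase
    # (or a lone unigram), so the unigram restaurant of topic k is used.
    weights = [(doc_topic_counts[k] + alpha[k]) * topic_word_prob(k, (), w)
               for k in range(K)]
    z[i] = random.choices(range(K), weights=weights)[0]
```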
As we aim to capture n-grams under each topic, there is a hierarchy of PYP distributions modelling word contexts of different lengths for each topic. Letting $\Lambda$ denote the current seating arrangement, the probability of generating the next word $w$ from $G^k_{\mathbf{u}}$ can be computed recursively as:

$$P^k_{\mathbf{u}}(w \mid \Lambda) = \frac{c^k_{\mathbf{u}w} - a_{n-1}\, t^k_{\mathbf{u}w}}{c^k_{\mathbf{u}\cdot} + b_{n-1}} + \frac{a_{n-1}\, t^k_{\mathbf{u}\cdot} + b_{n-1}}{c^k_{\mathbf{u}\cdot} + b_{n-1}}\, P^k_{\pi(\mathbf{u})}(w \mid \Lambda).$$

For a word $w \in W$, the context $\mathbf{u}$ consists of a sequence of $n-1$ words, $\mathbf{u} \in W^{n-1}$, and $\pi(\mathbf{u})$ is the context consisting of all words in $\mathbf{u}$ except the first one. We use $c^k_{\mathbf{u}w}$ to denote the number of customers eating dish $w$ in restaurant $\mathbf{u}$ owned by topic $k$ (i.e., the number of occurrences of $w$ following $\mathbf{u}$ in topic $k$), $t^k_{\mathbf{u}w}$ to denote the number of tables serving dish $w$ in restaurant $\mathbf{u}$ owned by topic $k$, $c^k_{\mathbf{u}\cdot} = \sum_w c^k_{\mathbf{u}w}$ and $t^k_{\mathbf{u}\cdot} = \sum_w t^k_{\mathbf{u}w}$. For a unigram $w$,

$$P^k_{\emptyset}(w \mid \Lambda) = \frac{c^k_{\emptyset w} - a_0\, t^k_{\emptyset w}}{c^k_{\emptyset\cdot} + b_0} + \frac{a_0\, t^k_{\emptyset\cdot} + b_0}{c^k_{\emptyset\cdot} + b_0} \cdot \frac{1}{V},$$

where $\emptyset$ denotes the empty context (no previous word) and $V$ is the vocabulary size.

PDLDA also uses HPYP priors. However, PDLDA involves only a single hierarchical n-gram language model with each topic considered part of the context $\mathbf{u}$, i.e., for the current word $w_i$, its context is defined as $\mathbf{u} = \langle z_i, w_{i-1}, w_{i-2}, \ldots, w_{i-n+1} \rangle$, which consists of both the topic and the preceding $n-1$ words, even though topics are not part of the word vocabulary. Our proposed TPM instead generates a separate n-gram language model for each topic. Also, the hyperparameters $a_{n-1}$ and $b_{n-1}$ are set to different values for each context length $n$ (i.e., different depths in the HPYP) by auxiliary variable sampling (Teh et al. 2006), but are shared across all topics.

Experiments

We use the clinical record data released as part of the i2b2 Natural Language Processing Challenges for Clinical Records (Uzuner et al. 2010). The data contains a total of 1,243 de-identified discharge summaries. The original challenge focused on the identification of medications, their dosages, modes (routes) of administration, frequencies, durations and reasons for administration in discharge summaries. In our experiments, we focus on extracting topical phrases from discharge summaries. Each document is pre-processed to remove common stopwords and clinical stopwords such as "status report", "discharge", "Dr.", etc. We use our medical phrase extraction system built upon MedTagger to identify phrases in documents. We did not perform stemming. The total number of word tokens is 791,097. The vocabulary size is 7,738 in the bag-of-words representation and 32,893 in the bag-of-phrases representation; treating each phrase as a single token thus significantly enlarges the vocabulary. We train TPM with a maximum of 1,000 Gibbs sampling iterations and stop earlier if the total log-likelihood converges. We optimise all the hyperparameters, including $\alpha$ and $a_{n-1}, b_{n-1}$ for the different context lengths $n$ in the HPYP, every 50 iterations.

We compare our proposed approach with the following baselines:

LDA. We use the MALLET (http://mallet.cs.umass.edu/) implementation of the LDA model to extract topics from our pre-processed data, where medical phrases have already been identified and are treated as single tokens. For all the hyperparameters, we use the default settings and perform optimisation every 50 Gibbs sampling iterations.

TNG (Wang, McCallum, and Wei 2007). The Topical N-Gram model is used to simultaneously detect n-grams and infer topics.
Again, the MALLET implementation of TNG is used with the default hyperparameter settings.

ToPMine (El-Kishky et al. 2014). This approach first extracts phrases using a method similar to the frequent pattern mining commonly used in association rule mining, and then trains a modified LDA model on the bag-of-phrases input. It has been shown to outperform a number of phrase discovery topic models, including TNG and PDLDA.

Perplexity

Perplexity is commonly used to evaluate a topic model's predictive ability on unseen data. It is defined as the reciprocal geometric mean of the likelihood of a test corpus, i.e., $\mathrm{perplexity}(\mathcal{D}_{\mathrm{test}}) = \exp\!\left(-\sum_d \log p(\mathbf{w}_d) / \sum_d N_d\right)$. Lower perplexity implies better predictiveness and hence a better model. We use 10% of the data as a held-out set and compare how well the different models predict it.

Figure 3: A comparison of perplexity values of various topic models. (a) Perplexity vs. Gibbs sampling iterations. (b) Perplexity vs. different topic numbers.

Figure 3(a) shows the perplexity values versus Gibbs sampling iterations when the topic number is set to 50. It can be observed that TNG has better perplexity values than LDA. Both ToPMine and TPM perform significantly better than TNG and LDA, with much lower perplexity values. We also vary the number of topics and observe a general trend that perplexity values decrease with an increasing number of topics for all the models, as shown in Figure 3(b). TNG and LDA perform similarly, while ToPMine and TPM achieve much lower perplexities, with the best performance given by TPM.

Topical Coherence

Various topic coherence measures have been proposed to evaluate topics with regard to their understandability. It has recently been reported in an extensive study (Röder, Both, and Hinneburg 2015) that a new coherence measure, based on a combination of known approaches, gives the best results in terms of approximating human ratings of topic interpretability, outperforming all the other existing topic coherence measures, including the widely used measure based on the pointwise mutual information (PMI) of all word pairs among the given top topic words (Newman et al. 2010). In particular, the new coherence measure retrieves co-occurrence counts for the given words from Wikipedia using a sliding window of size 110. For each of the top n words in a given topic, the normalised PMI value with respect to every other top word is calculated from the co-occurrence counts. Thus, each top word is represented as a vector of normalised PMI values. The coherence of a topic is the arithmetic mean of the cosine similarities of all vector pairs.

We report in Figure 4 the topic coherence calculated on the top 10 words/phrases of each topic based on the method proposed in (Röder, Both, and Hinneburg 2015). It can be observed that, in general, the coherence value increases with the number of topics. TNG has the worst topic coherence values of all the models. ToPMine only slightly outperforms LDA. TPM consistently gives superior performance over all the other models for all topic settings.

Figure 4: Topic coherence measure vs. number of topics.
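A simplified sketch of this coherence computation is given below: each top word is represented by its vector of normalised PMI (NPMI) values against the other top words, estimated from sliding-window co-occurrence probabilities, and coherence is the mean cosine similarity over all vector pairs. The probability inputs are assumed to be precomputed from a reference corpus such as Wikipedia, and details of the full C_V measure of Röder et al. (segmentation, smoothing choices) are glossed over.

```python
import math
from itertools import combinations

def npmi(w1, w2, p_word, p_pair, eps=1e-12):
    """Normalised PMI of two words from window probabilities.
    p_word[w]: probability of a window containing w;
    p_pair[(w1, w2)]: probability of a window containing both."""
    joint = p_pair.get((w1, w2), p_pair.get((w2, w1), 0.0)) + eps
    pmi = math.log(joint / (p_word[w1] * p_word[w2]))
    return pmi / (-math.log(joint))

def topic_coherence(top_words, p_word, p_pair):
    """Mean pairwise cosine similarity of the words' NPMI vectors."""
    vectors = {w: [npmi(w, v, p_word, p_pair) for v in top_words]
               for w in top_words}

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    sims = [cosine(vectors[a], vectors[b])
            for a, b in combinations(top_words, 2)]
    return sum(sims) / len(sims)
```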
Qualitative Evaluation Results

Table 1: Topic examples extracted from the 50-topic runs. Each column shows the top 10 words/phrases ordered by likelihood.

LDA
Topic 1 | Topic 2 | Topic 3 | Topic 4
swallowing | infection | surgery | chest pain
transferred | vancomycin | repair | aspirin
intubated | antibiotics | wound | heart
intensive care | culture | signs | ischemia
speech | fever | removed | myocardial ischemia
wean | intervertebral | nasogastric tube | hypertension
extubated | culture blood | diets | set
tube feeds | fluid | female | lipitor
arrest | levaquin | hospital course | coronary artery disease
aspiration | gentamicin | diverticulitis | emergency department

ToPMine
Topic 1 | Topic 2 | Topic 3 | Topic 4
heart failure | chest pain | chest x ray | alert override override added
heart rate | shortness of breath | chest x ray showed | override notice override added
heart transplant | nausea and vomiting | physical therapy | reason for override aware
rate control | dyspnea on exertion | neck supple | previous override information
heart sounds | increased shortness of breath | showed no evidence | po ref potentially serious interactions
heart anatomy | pain control | chest x ray result | reason for override aware
distant heart sounds | substernal chest pain | chest x ray revealed | reason for override
respiratory distress | chest pressure | head and neck | alert override override added on
filled pressure | back pain | motor vehicle accident | order for coumadin
respiratory failure | atypical chest pain | neck supple no adenopathy | reason for override will monitor

TPM
Topic 1 | Topic 2 | Topic 3 | Topic 4
restrictive lung disease | right coronary artery | ciprofloxacin 500 mg | dressing changes
pleuritic chest pain | left upper extremity | levofloxacin 500 mg | superficial femoral artery
ferrous sulfate 325 mg | systemic vascular resistance | levofloxacin 250 mg | great toe
dyspnea on exertion | systolic ejection murmur | ciprofloxacin 250 mg | right fourth toe
vq scan | pulmonary vascular resistance | metronidazole 500 mg | vancomycin 250 mg
breath chest pain | flash pulmonary edema | chronic urinary tract infection | plastic surgery
interstitial lung disease | shortness of breath | white blood cell count | cellulitis of right foot
morbid obesity | transesophageal echocardiogram | benign prostatic hyperplasia | amputation of right foot
arterial blood gas | left internal mammary artery | recurrent urinary tract | split thickness skin graft
pulmonary function tests | coronary artery disease | irbesartan 150 mg | bone and bone

We list in Table 1 some example topics extracted from the 50-topic run. Since TNG has the lowest topic coherence scores of all the models, we do not list TNG topics owing to space constraints. It can be observed that the topic words listed under the LDA topics are still dominated by unigrams. This is not surprising since LDA was simply run on the bag-of-phrases: the occurrence frequencies of most phrases are usually much lower than those of unigram words, so only a few phrases appear among the top 10 words of each topic. ToPMine extracts phrases based on frequent pattern mining and tends to group phrases sharing common words into a topic. For example, most of the top words in Topic 1 share the common word "heart", while for Topics 3 and 4 the words shared among most top entries are "x ray" and "override", respectively. TPM, on the contrary, is able to detect topics comprising a diverse range of words. More interestingly, TPM can detect the symptoms, diagnosis methods and medications for certain diseases. For example, Topic 1 is about lung disease.
The top words include the disease names ("restrictive lung disease", "interstitial lung disease"), symptoms ("pleuritic chest pain", "dyspnea on exertion", "morbid obesity"), diagnosis methods ("vq scan", "arterial blood gas", "pulmonary function tests") and a possible medication ("ferrous sulfate 325 mg"). Also, some topics show drugs with different dosages. For example, Topic 3 includes the antibiotics ciprofloxacin and levofloxacin at different dosages, both of which are used to treat urinary tract infections. Detecting topics which consist of phrases at such a fine level of granularity would not be possible with LDA run on the bag-of-phrases or with other topic models that do not explicitly model the generation of n-grams.

Conclusions

In this paper, we have proposed a new approach which first detects high-quality phrases and then trains a topic model which explicitly models the generation of n-grams of arbitrary length. Compared to existing methods relying on frequent pattern mining for phrase identification, our approach can detect less frequent topical phrases that share common lower-order n-grams, for example, the same drug with different dosage levels. Also, since phrase detection is separated from topic inference, our approach is computationally less complicated than models which need to detect phrase boundaries and infer topics simultaneously. The experimental results show that our approach outperforms the state-of-the-art method in both perplexity and topic interpretability.

Acknowledgments

The author would like to thank Udochukwu Orizu for implementing a medical phrase extraction system based on MedTagger. The work is partly funded by Innovate UK under grant number 101779.

References

Arnold, C., and Speier, W. 2012. A topic model of clinical reports. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1031-1032.
Blei, D. M., and Lafferty, J. D. 2009. Visualizing topics with multi-word expressions. arXiv preprint arXiv:0907.1013.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3:993-1022.
Chen, S. F., and Goodman, J. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4):359-393.
Danilevsky, M.; Wang, C.; Desai, N.; Ren, X.; Guo, J.; and Han, J. 2014. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the SIAM International Conference on Data Mining (SDM).
El-Kishky, A.; Song, Y.; Wang, C.; Voss, C. R.; and Han, J. 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment 8(3):305-316.
Jameel, S., and Lam, W. 2013. An unsupervised topic segmentation model incorporating word order. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 203-212.
Lehman, L.-w.; Long, W.; Saeed, M.; and Mark, R. 2014. Latent topic discovery of clinical concepts from hospital discharge summaries of a heterogeneous patient cohort. In Proceedings of the 36th IEEE International Conference on Engineering in Medicine and Biology Society (EMBC), 1773-1776.
Lindsey, R. V.; Headden III, W. P.; and Stipicevic, M. J. 2012. A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 214-222.
Liu, H.; Wu, S. T.; Li, D.; Jonnalagadda, S.; Sohn, S.; Wagholikar, K.; Haug, P. J.; Huff, S. M.; and Chute, C. G. 2012. Towards a semantic lexicon for clinical natural language processing. In AMIA Annual Symposium Proceedings, volume 2012, 568-576.
Newman, D.; Lau, J. H.; Grieser, K.; and Baldwin, T. 2010. Automatic evaluation of topic coherence. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 100-108.
Paul, M. J., and Dredze, M. 2013. Drug extraction from the web: Summarizing drug experiences with multi-dimensional topic models. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 168-178.
Pitman, J. 2002. Combinatorial stochastic processes. Technical report, Department of Statistics, University of California at Berkeley.
Röder, M.; Both, A.; and Hinneburg, A. 2015. Exploring the space of topic coherence measures. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining (WSDM), 399-408.
Teh, Y. W.; Jordan, M. I.; Beal, M. J.; and Blei, D. M. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476).
Teh, Y. W. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL), 985-992.
Uzuner, O.; Solti, I.; Xia, F.; and Cadag, E. 2010. Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association 17(5):519-523.
Wallach, H. M. 2006. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 977-984.
Wang, X.; McCallum, A.; and Wei, X. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 697-702.
Yu, Z.; Johnson, T. R.; and Kavuluru, R. 2013. Phrase based topic modeling for semantic information processing in biomedicine. In Proceedings of the 12th IEEE International Conference on Machine Learning and Applications (ICMLA), volume 1, 440-445.
Zheng, B.; McLean, D. C.; and Lu, X. 2006. Identifying biological concepts from a protein-related corpus with a probabilistic topic model. BMC Bioinformatics 7(1):58.