Automatic Emergency Diagnosis with Knowledge-Based Tree Decoding

Ke Wang1,2, Xuyan Chen3, Ning Chen1,2 and Ting Chen1,2
1Institute for Artificial Intelligence, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2Tsinghua-Fuzhou Institute of Digital Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
3Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, Beijing, China
wangke18@mails.tsinghua.edu.cn, tingchen@tsinghua.edu.cn

Abstract

Automatic diagnosis based on clinical notes is critical, especially in the emergency department, where a fast and professional result is vital in ensuring proper and timely treatment. Previous works formalize this task as plain text classification and fail to utilize the medically significant tree structure of the International Classification of Diseases (ICD) coding system. Besides, external medical knowledge has rarely been used before; we explore it by extracting relevant materials from Wikipedia or Baidupedia. In this paper, we propose a knowledge-based tree decoding model (K-BTD), whose inference procedure is a top-down decoding process from the root node to leaf nodes. The stepwise inference procedure enables the model to give support for the decision at each step, which visualizes the diagnosis procedure and adds to the interpretability of final predictions. Experiments on real-world data from the emergency department of a large-scale hospital indicate that the proposed model outperforms all baselines in both micro-F1 and macro-F1, and reduces the semantic distance dramatically.

1 Introduction

The clinical note, an essential part of the Electronic Health Record (EHR), generally contains a patient's past medical history, chief complaints and current symptoms. Physicians need to study and train for years before they can give diagnoses independently, yet diagnosis remains time-consuming and error-prone. To address these drawbacks of human diagnosis, researchers have begun to study automatic diagnosis [Xiao et al., 2018]. Automatic diagnosis takes the raw text of clinical notes as input, and outputs the codes of diseases according to the ICD coding system [WHO, 1978], which is adopted in hospitals worldwide.

The ICD codes are naturally organized as a tree structure. The tree starts from a virtual root node, goes deeper through intermediate nodes and finally reaches diseases at leaf nodes. Only leaf nodes correspond to a specific disease with its ICD code; intermediate nodes represent a medical concept or a range of diseases.

Automatic diagnosis has become a popular research field recently [Perotte et al., 2013; Wang et al., 2016; Subotin and Davis, 2016], and the majority of existing works formalize it as a plain text classification task. For example, [Lipton et al., 2016] and [Li et al., 2018] use a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN) to predict diseases. To give support from clinical notes for final predictions, attention mechanisms are used in [Sha and Wang, 2017] and [Mullenbach et al., 2018]. Although existing models have improved the accuracy of disease diagnosis by a considerable margin, automatic diagnosis is still confronted with two major problems.

Negligence of the Medical Relationships among Diseases. Existing models generally treat diseases as mutually independent.
However, from a medical perspective, diseases are interconnected, and they are organized hierarchically in the ICD tree. The constraints of the hierarchical structure can prevent predictions from straying too far from the ground truth. For example, without hierarchical restrictions, a patient with acute myocardial infarction (a disease of the circulatory system) could be mistakenly diagnosed with acute gastroenteritis (a disease of the digestive system), which may result in loss of life and substantial economic damages.

Low Practicality of Support. Existing attention-based models can already tag words that have a deep impact on final predictions. However, this one-step attention cannot reveal a transparent reasoning process, and thus falls short in assisting doctors in making decisions. In general, diagnosis is a step-by-step procedure, in which the physician first locates the diseased organ and then uses learned knowledge together with information from the patient to stepwisely reach the final result. Each step inside the diagnosis procedure requires different support from the clinical notes.

To address these problems, we propose a knowledge-based tree decoding model named K-BTD, consisting of a Clinical Notes Encoder, a Knowledge Encoder, a Judge Net and a Fusion Net. K-BTD takes raw clinical notes as input, and its inference procedure is a top-down decoding process over the ICD tree. Each node inside the tree is equipped with external medical knowledge extracted from Wikipedia or Baidupedia. At each decoding step, the Judge Net decides whether to expand the children of the current node, and the Fusion Net aggregates information from multiple resources to promote further decoding. The whole process repeats recursively until there are no more children to expand. The stepwise decoding process imitates the diagnosis procedure of a human doctor, and at each step our model can give support for its decision, which visualizes the reasoning process and provides human doctors with better references.

We conduct experiments on real-world data from the emergency department of Beijing Tsinghua Changgung Hospital, a large-scale hospital in China. Experimental results indicate that our model achieves significant improvements over other state-of-the-art models in micro-F1, macro-F1 and semantic distance. We also show the superiority of our model in terms of interpretability. Ablation analysis and error analysis are conducted to verify the internal mechanism. To the best of our knowledge, this is the first empirical study to infer diagnoses with knowledge-based tree decoding.

2 Related Work

2.1 Automatic Diagnosis

Automatic diagnosis is a long-standing task in the field of medical informatics. Early works utilize machine learning models such as the hierarchical Support Vector Machine (SVM) [Perotte et al., 2013]. With the rapid development of deep learning technologies, researchers started to formalize it as a text classification task. Long Short-Term Memory (LSTM) [Lipton et al., 2016] and CNN [Li et al., 2018] are used to extract semantic features from textual content. Bag-of-words features and a disease correlation graph are explored in [Wang et al., 2016]. However, these methods can only extract shallow features and cannot give support for final predictions, which blocks them from practical application.
With the widespread adoption of attention mechanisms, researchers began to predict diseases together with their support by incorporating attention into deep learning models. [Mullenbach et al., 2018] adopts a per-label attention mechanism and allows the model to learn distinct representations for each disease label.

2.2 Tree-based Multi-label Classification

Tree-based multi-label classification is a branch of multi-label classification, and it is applicable when the label space has a hierarchical structure. To be specific, the label space is a tree where nodes represent nested semantic concepts whose specificity increases with depth. Its successful implementation can reduce a large discrete sample space to only a small number of candidate labels.

Some researchers focus on inducing a tree-structured label space to improve inference efficiency [Beygelzimer et al., 2009; Daumé III et al., 2017]. Others employ an already existing label space structure. For example, the natural tree structure of Medical Subject Headings (MeSH) is utilized by [Singh et al., 2018] in the MeSH tagging task. Some researchers have explored the tree structure of the ICD coding system. [Perotte et al., 2013] adopts hierarchical SVM to model the inclusion and exclusion relationships of diseases. [Kamkar et al., 2015] achieves stable and better feature selection based on the ICD tree structure. [Xie and Xing, 2018] applies a tree-of-sequences LSTM to model the latent representation of each node in the ICD tree from the textual descriptions of the ICD codes and their hierarchical structure. However, none of them formalizes automatic diagnosis as a tree-decoding procedure along the ICD tree.

3 Method

3.1 Problem Formulation

Considering that a patient can be diagnosed with more than one disease, we treat automatic diagnosis as a multi-label classification task over ICD-9 codes. ICD-9 is a standard version of the ICD coding system, which contains over 15,000 codes in its taxonomy.1 In our study, we only consider high-frequency ICD-9 codes such as the top-100 and top-150 codes, which take up more than 90% of all ICD-9 codes that appear. Each node in the ICD-9 tree is equipped with a piece of text representing the external knowledge extracted from Wikipedia or Baidupedia.

1 The ICD-9 tree structure can be found at https://bioportal.bioontology.org/ontologies/ICD9CM.

The input of our model, the clinical notes of a patient, is a word sequence $X = \{x_1, x_2, \ldots, x_N\}$, where $N$ is the length of sequence $X$. Let the ICD-9 codes for diseases be the label space $L$; the labeling task is to determine $y_l \in \{0, 1\}$ for all $l \in L$.

3.2 Model Overview

The clinical note from the patient is first encoded by the Clinical Notes Encoder to get the feature sequence $T = \{t_1, t_2, \ldots, t_N\}$ and the flowing vector $F_{root}$ for the root node. We call $F_n$ the flowing vector for node $n$ because it contains all information flowing from the root node to the current node $n$. Next, the tree decoding process starts from the root node of the ICD-9 tree. Each step of the decoding process is conducted on a particular node $n$ in the tree (e.g., the root). The Knowledge Encoder first encodes the external knowledge of each child of the current node $n$. Next, the Judge Net enumerates each child node and decides whether to expand it for further decoding. If a child node $m$ is chosen to be expanded, the Fusion Net aggregates the flowing vector $F_n$ and the external knowledge of node $m$ to get the flowing vector $F_m$ for child node $m$. The decoding process repeats recursively until there are no more nodes to expand. The expanded leaf nodes are chosen as final predictions. The overall framework is shown in Figure 1.

Figure 1: Overview of the proposed K-BTD model. The decoding process is conducted on node $n$, deciding whether to expand child node $m$ for further decoding; the $A_m$, $C_m$ and $T$ on the right side are calculated on the left side.
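To make the control flow of this top-down procedure concrete, the following is a minimal sketch of the recursive decoding loop. The `TreeNode` class and the three callables standing in for the Knowledge Encoder, Judge Net and Fusion Net are hypothetical illustrations of the components described above, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TreeNode:
    """Hypothetical ICD-9 tree node; leaves carry concrete disease codes."""
    icd9_code: str
    knowledge_text: str
    children: List["TreeNode"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children

def decode(node: TreeNode, F_n, T,
           knowledge_encoder: Callable, judge_net: Callable,
           fusion_net: Callable, eps: float = 0.45) -> List[str]:
    """Top-down decoding: expand each child whose overall score exceeds eps."""
    preds: List[str] = []
    for child in node.children:
        A_m, C_m = knowledge_encoder(child.knowledge_text)  # Section 3.4
        s = judge_net(F_n, A_m, C_m, T)                     # Section 3.5
        if s > eps:                                         # expand this child?
            if child.is_leaf:
                preds.append(child.icd9_code)               # leaf = final prediction
            else:
                F_m = fusion_net(F_n, A_m, C_m)             # Section 3.6
                preds += decode(child, F_m, T, knowledge_encoder,
                                judge_net, fusion_net, eps)
    return preds
```

Usage would start from the encoded note, e.g. `T, F_root = clinical_notes_encoder(note)` followed by `decode(icd9_root, F_root, T, ...)`; the threshold 0.45 matches the value used in the paper's experiments.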
3.3 Clinical Notes Encoder

Taking the word sequence $X$ of the clinical notes as input, the Clinical Notes Encoder computes the latent text representation through two layers: an embedding layer and an RNN layer.

We first convert each word $x_i$ to a vector with the pretrained Tencent AI Lab Embedding Corpus [Song et al., 2018], which is notably strong in the medical domain. We also tried building word embeddings from the dataset we adopt, but the performance was not as good. After the embedding layer, an RNN is used to extract contextual information from the clinical notes. Note that the RNN can take any form, such as a Gated Recurrent Unit (GRU), LSTM or BERT [Devlin et al., 2018]. After the RNN layer, we obtain the clinical notes feature sequence $T = \{t_1, t_2, \ldots, t_N\} \in \mathbb{R}^{N \times q}$, where $q$ is the dimension of the hidden state. The flowing vector for the root node is set as $F_{root} = t_N$.

3.4 Knowledge Encoder

For the Knowledge Encoder, we choose a CNN due to its high efficiency. The input of the Knowledge Encoder for node $m$ is the word sequence of its external medical knowledge $G_m = \{g_{m,1}, g_{m,2}, \ldots, g_{m,M}\}$, where $M$ is the sequence length. The same embedding method is applied to the sequence to get $\bar{G}_m = \{\bar{g}_{m,1}, \bar{g}_{m,2}, \ldots, \bar{g}_{m,M}\}$. The convolution operation uses a convolution matrix $W \in \mathbb{R}^{u \times (h \cdot k)}$, where $u$ is the number of filters, $h$ is the length of the sliding window and $k$ is the dimension of the word embeddings. The convolutional feature matrix $C_m \in \mathbb{R}^{M \times u}$ is calculated by:

$$C_{m,i} = W \cdot \bar{g}_{m,i:i+h-1} + b \tag{1}$$

where $\bar{g}_{m,i:i+h-1}$ is the concatenation of the word embeddings within the $i$-th sliding window, and $b \in \mathbb{R}^u$ is a bias vector.

To integrate the information from the convolutional feature matrix $C_m$, we adopt self-attention and calculate $A_m$, the feature vector of external knowledge for node $m$, as follows:

$$A_m = \sum_i \frac{e^{C_{m,i} W_a}}{\sum_j e^{C_{m,j} W_a}} \, C_{m,i} \tag{2}$$

where $W_a$ is a multi-layer perceptron.

3.5 Judge Net

Suppose the decoding process is now conducted on node $n$, and the network is deciding whether to expand child node $m$ for further decoding. The Judge Net calculates the possibility that child node $m$ will be expanded with two measurements, namely a semantic-level score and a word-level score.

Semantic-level score. We use the semantic-level score $s_1$ to represent the semantic-level interactions between the external knowledge feature vector $A_m$ and the aggregated information $F_n$ that has flowed to the current node $n$:

$$s_1 = \mathrm{Sigmoid}(Z_s W_1) \in \mathbb{R}, \quad Z_s = \mathrm{Concat}\left(F_n,\, A_m,\, F_n \odot A_m,\, |F_n - A_m|\right) \tag{3}$$

where $W_1$ is a multi-layer perceptron and $\odot$ indicates element-wise multiplication.
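The semantic-level score is straightforward to realize. Below is a small PyTorch-style sketch of Eq. (3); the two-layer perceptron standing in for $W_1$ and its layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SemanticScore(nn.Module):
    """Semantic-level score s1 from Eq. (3); a sketch assuming F_n and A_m
    share dimension d and W_1 is a small two-layer MLP (sizes illustrative)."""

    def __init__(self, d: int):
        super().__init__()
        self.w1 = nn.Sequential(          # plays the role of W_1
            nn.Linear(4 * d, d),
            nn.Tanh(),
            nn.Linear(d, 1),
        )

    def forward(self, F_n: torch.Tensor, A_m: torch.Tensor) -> torch.Tensor:
        # Z_s = Concat(F_n, A_m, F_n ⊙ A_m, |F_n - A_m|)
        z_s = torch.cat([F_n, A_m, F_n * A_m, (F_n - A_m).abs()], dim=-1)
        return torch.sigmoid(self.w1(z_s)).squeeze(-1)  # s1 in (0, 1)
```

For a batch of flowing vectors and knowledge vectors of shape `(B, d)`, this returns a `(B,)` tensor of scores.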
Word-level score. The semantic-level score only considers interactions between holistic feature vectors. Here, we use the word-level score $s_2$ to introduce interactions at the word level between the clinical notes feature sequence $T = \{t_1, t_2, \ldots, t_N\}$ and the convolutional feature sequence $\{C_{m,i}\}$.

First, we calculate the similarity matrix $S$ from $\{t_i\}$ and $\{C_{m,j}\}$ as follows:

$$S_{ij} = \tanh\left(t_i W_2 C_{m,j}^\top\right) \tag{4}$$

where $W_2 \in \mathbb{R}^{q \times u}$ is a parameter matrix. Then, row-wise and column-wise softmax are separately applied to $S$ to get $\gamma_{ij}$ and $\delta_{ij}$:

$$\gamma_{ij} = \frac{\exp(S_{ij})}{\sum_{v=1}^{M} \exp(S_{iv})} \tag{5}$$

$$\delta_{ij} = \frac{\exp(S_{ij})}{\sum_{v=1}^{N} \exp(S_{vj})} \tag{6}$$

We use $\xi_i = \sum_{j=1}^{M} \delta_{ij}$ to measure the importance of the $i$-th word in the clinical notes. Note that $\xi_i$ is only calculated for interpretability and is not involved in subsequent calculations. Next, we generate an intermediate representation $p_i \in \mathbb{R}^u$ by attentively aggregating $\{C_{m,j}\}$:

$$p_i = \sum_{j=1}^{M} \gamma_{ij} C_{m,j} \tag{7}$$

Then, we apply a GRU to process the generated intermediate representation $P = \{p_1, p_2, \ldots, p_N\} \in \mathbb{R}^{N \times u}$:

$$\tilde{P} = \mathrm{GRU}(P) \in \mathbb{R}^{N \times d_p} \tag{8}$$

where $d_p$ is the dimension of the hidden state. Next, we project $\tilde{P}$ to a compressed vector by max-pooling along the column axis to get the word-level score $s_2$:

$$s_2 = \mathrm{Sigmoid}\left(\mathrm{Maxpool}_{col}\left(\mathrm{Concat}(\tilde{P}, T)\right) W_3\right) \in \mathbb{R} \tag{9}$$

where $W_3$ is a multi-layer perceptron, and $T$ serves as a shortcut connection to facilitate the training process.

Overall score. To consider both scores simultaneously, we formulate the overall score $s$ as a weighted summation of the semantic-level score and the word-level score:

$$s = \lambda s_1 + (1 - \lambda) s_2 \tag{10}$$

where $\lambda \in [0, 1]$ is a hyper-parameter. If $s$ is greater than a preset threshold $\epsilon$, the child node $m$ will be expanded; if $s$ is smaller than $\epsilon$, the Judge Net checks the next child node of the current node $n$. $\epsilon$ is set to 0.45 in our experiments to balance precision and recall.

| Statistic | Value |
|---|---|
| # of admission records | 72,333 |
| # of unique ICD-9 codes | 2,156 |
| Ratio of top-100 codes | 91.6% |
| Ratio of top-150 codes | 95.2% |
| Avg. # of codes per admission (top-100) | 2.35 |
| Avg. # of codes per admission (top-150) | 2.43 |
| Avg. # of words per admission | 105.58 |
| Avg. # of words per external knowledge | 268.17 |

Table 1: Statistics of the dataset from Beijing Tsinghua Changgung Hospital, a large-scale hospital in China.

3.6 Fusion Net

The Fusion Net comes into operation only if a child node $m$ is chosen to be expanded by the Judge Net. Its purpose is to generate $F_m$ by fusing the information flowing to the parent node $n$ with the external knowledge of child node $m$. First, the weight $\beta_i$ for each position $i$ in the convolutional feature sequence $\{C_{m,i}\}$ is calculated via a dot product with the flowing vector $F_n$ and a subsequent softmax operation:

$$\alpha_i = F_n \cdot C_{m,i}, \qquad \beta_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{M} \exp(\alpha_j)} \tag{11}$$

Next, the flowing vector of child node $m$ is the weighted summation of $F_n$, $A_m$ and the weighted average of $\{C_{m,i}\}$:

$$F_m = F_n + \frac{s_2}{s_1 + s_2} A_m + \frac{s_1}{s_1 + s_2} \sum_{i=1}^{M} \beta_i C_{m,i} \tag{12}$$

The different weights for $A_m$ and $\sum_{i=1}^{M} \beta_i C_{m,i}$ are designed to introduce more information from child node $m$. If $s_1$ is smaller than $s_2$, the model needs to pay more attention to the word-level information; if $s_1$ is greater than $s_2$, the model should absorb more information from the semantic level.

3.7 Training

We use a layerwise multi-label cross entropy to train our network. The loss function is as follows:

$$\mathrm{Loss} = \sum_{d=1}^{MaxDepth} e^{\,MaxDepth - d} \sum_{k} L\left(y_{d,k},\, p_{d,k}\right) \tag{13}$$

where $MaxDepth$ is the depth of the ICD-9 tree, $L$ is the binary cross entropy loss, $p_{d,k}$ is the prediction of the Judge Net for node $k$ at layer $d$, and $y_{d,k}$ is the corresponding ground truth label. The exponential part is added to penalize early errors more and late errors less.
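For reference, a minimal PyTorch sketch of this layerwise loss follows. It assumes the exponential depth weight takes the form $e^{MaxDepth-d}$ and that predictions and labels are grouped per depth; both are assumptions made for illustration.

```python
import math
import torch.nn.functional as F

def layerwise_bce(preds_by_depth, labels_by_depth, max_depth: int):
    """Layerwise multi-label cross entropy with exponential depth weighting.

    preds_by_depth[d-1] / labels_by_depth[d-1] hold the Judge Net scores
    (in [0, 1]) and 0/1 labels for all nodes at depth d. The weight
    e^(MaxDepth - d) is an assumed form: shallow (early) errors cost more.
    """
    loss = 0.0
    for d, (p, y) in enumerate(zip(preds_by_depth, labels_by_depth), start=1):
        loss = loss + math.exp(max_depth - d) * F.binary_cross_entropy(p, y.float())
    return loss
```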
4 Experiments

4.1 Dataset Description

The data from the emergency department of Beijing Tsinghua Changgung Hospital, a large-scale hospital in China, was collected from 2015 to 2017, and it is the first large automatic diagnosis dataset in Chinese. The dataset contains unstructured clinical notes, including chief complaints, history of present illness and past medical history, as well as structured data such as auxiliary examination results. To keep consistent with previous work, we only utilize the unstructured part. Each admission record is tagged with one or more ICD-9 codes by licensed physicians, denoting the identified diseases. A summary of the dataset statistics is provided in Table 1.

We choose the top-100 and top-150 most frequent codes to conduct two separate experiments. When experimenting with the top-k frequent codes, we filter the dataset down to instances that have at least one of the top-k frequent codes. We conduct random five-fold cross-validation to examine the performance of our model as well as the baselines.

| Model | micro-F1 (top-100) | macro-F1 (top-100) | SD (top-100) | micro-F1 (top-150) | macro-F1 (top-150) | SD (top-150) |
|---|---|---|---|---|---|---|
| TFIDF+SVM | 0.6241 ± 0.0089 | 0.5027 ± 0.0076 | 1.818 ± 0.047 | 0.5987 ± 0.0077 | 0.4592 ± 0.0082 | 2.371 ± 0.061 |
| TextCNN | 0.6695 ± 0.0126 | 0.5239 ± 0.0097 | 1.315 ± 0.041 | 0.6409 ± 0.0117 | 0.4887 ± 0.0109 | 1.826 ± 0.057 |
| BERT | 0.6823 ± 0.0118 | 0.5459 ± 0.0104 | 1.250 ± 0.051 | 0.6598 ± 0.0097 | 0.5201 ± 0.0129 | 1.588 ± 0.072 |
| Deep Labeler | 0.6703 ± 0.0152 | 0.5357 ± 0.0193 | 1.291 ± 0.038 | 0.6385 ± 0.0137 | 0.4977 ± 0.0201 | 1.799 ± 0.089 |
| CAML | 0.6618 ± 0.0134 | 0.5375 ± 0.0093 | 1.282 ± 0.035 | 0.6402 ± 0.0161 | 0.5028 ± 0.0144 | 1.766 ± 0.093 |
| AIC | 0.6875 ± 0.0093 | 0.5602 ± 0.0081 | 1.142 ± 0.045 | 0.6557 ± 0.0096 | 0.5152 ± 0.0108 | 1.527 ± 0.059 |
| C-MemNNs (type #2) | 0.6855 ± 0.0177 | 0.5652 ± 0.0203 | 1.131 ± 0.058 | 0.6647 ± 0.0135 | 0.5227 ± 0.0121 | 1.476 ± 0.033 |
| Fact-Law (type #2) | 0.6785 ± 0.0147 | 0.5603 ± 0.0111 | 1.189 ± 0.040 | 0.6602 ± 0.0181 | 0.5185 ± 0.0166 | 1.481 ± 0.049 |
| K-BTD (LSTM) | 0.7025 ± 0.0104 | 0.5903 ± 0.0138 | 0.899 ± 0.014 | 0.6798 ± 0.0117 | 0.5462 ± 0.0142 | 1.182 ± 0.047 |
| K-BTD (GRU) | 0.7011 ± 0.0149 | 0.5942 ± 0.0126 | 0.920 ± 0.021 | 0.6813 ± 0.0206 | 0.5493 ± 0.0178 | 1.169 ± 0.051 |
| K-BTD (BERT) | 0.7085 ± 0.0128 | 0.5973 ± 0.0097 | 0.852 ± 0.036 | 0.6855 ± 0.0175 | 0.5472 ± 0.0134 | 1.159 ± 0.032 |

Table 3: Automatic diagnosis results on the data from the emergency department of Beijing Tsinghua Changgung Hospital. Values after the plus-minus sign denote standard deviations over 5-fold random data splits.

For external medical knowledge, we use Wikipedia and Baidupedia as resources. [Trevena, 2011] has demonstrated the reliability of medical articles in Wikipedia, and medical entries in Baidupedia are under the supervision of the National Health Commission of China.

The extraction of external medical knowledge proceeds in three steps. For each node in the ICD-9 tree, we first search for the corresponding Wikipedia page. If the Wikipedia page does not exist, we use the corresponding Baidupedia page instead. If the corresponding Baidupedia page does not exist either, we consider highly related candidates in Baidupedia: we rank the candidates by their similarity to the query term and select the one with the highest similarity. For Wikipedia pages, we use the content of the Signs and Symptoms section. For Baidupedia pages, we extract the material in the Clinical Manifestations and Cause of Diseases sections. An illustration of external medical knowledge is shown in Table 2.

Chronic Airway Obstruction (ICD-9 code: 496.0)
Clinical manifestations: Chronic cough is often the earliest symptom, and it may persist throughout the course of the disease. Cough ...
Cause of diseases: The risk factors that have been found can be roughly divided into external factors like environment ...

Table 2: Partially shown example of external medical knowledge from Baidupedia (translated from Chinese).
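A minimal sketch of this three-step fallback lookup is given below; `fetch_wikipedia`, `fetch_baidupedia` and `search_baidupedia` are hypothetical helpers, and the string-similarity measure is an assumption, as the paper does not specify one.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional

def similarity(a: str, b: str) -> float:
    # Simple character-level similarity as a stand-in; the paper does not
    # specify which similarity measure is used for ranking candidates.
    return SequenceMatcher(None, a, b).ratio()

def extract_knowledge(term: str,
                      fetch_wikipedia: Callable[[str], Optional[str]],
                      fetch_baidupedia: Callable[[str], Optional[str]],
                      search_baidupedia: Callable[[str], List[str]]) -> Optional[str]:
    """Three-step fallback lookup for a node's external knowledge text."""
    page = fetch_wikipedia(term)           # step 1: exact Wikipedia page
    if page is not None:
        return page                        # use the Signs and Symptoms section
    page = fetch_baidupedia(term)          # step 2: exact Baidupedia entry
    if page is not None:
        return page                        # Clinical Manifestations + Causes
    candidates = search_baidupedia(term)   # step 3: related candidates
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(c, term))
    return fetch_baidupedia(best)          # highest-similarity candidate
```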
4.2 Baselines

For comparison, we reproduce two major categories of baselines. The first category utilizes only clinical notes, and the second employs external medical knowledge as well.

For the first category, we choose TFIDF+SVM, TextCNN [Kim, 2014] and BERT [Devlin et al., 2018], which are classic models for text classification. Besides, three classic automatic diagnosis models, Deep Labeler [Li et al., 2018], AIC [Xie and Xing, 2018] and CAML [Mullenbach et al., 2018], are also adopted for comparison.

For the second category, we adopt C-MemNNs [Prakash et al., 2017], which uses multi-hop memory networks to infer diagnoses. As there are few works in automatic diagnosis that utilize external knowledge, we also adopt a classic reading comprehension model, the Fact-Law Attention Model [Luo et al., 2017] from the legal judgment domain, for comprehensive comparison. All baselines of type #2, as well as the proposed model, use the same external knowledge to ensure a fair comparison.

To better demonstrate the advantage of our model, we use LSTM, GRU and BERT as the RNN in the Clinical Notes Encoder and conduct three different experiments. The code for our model is publicly available at https://github.com/kaisadadi/K-BTD.

4.3 Evaluation Metrics

The distribution of diseases is highly imbalanced, so we evaluate our model with both micro-F1 and macro-F1. Besides, we propose another metric named semantic distance (SD), which is modified from [Singh et al., 2018]. SD is defined as:

$$SD = \frac{1}{|Y|} \sum_{u \in Y} \min_{v \in \hat{Y}} D(u, v) + \frac{1}{|\hat{Y}|} \sum_{u \in \hat{Y}} \min_{v \in Y} D(u, v) \tag{14}$$

where $Y$ is the set of target codes, $\hat{Y}$ is the set of predicted codes, and $D$ is a distance function that measures the shortest distance between two nodes in the ICD-9 tree. If $\hat{Y}$ is empty, SD is set to the maximum distance between two nodes in the ICD-9 tree. The semantic distance measures the distance between predictions and labels from a medical view: a mismatch with a small semantic distance is tolerable, while a large semantic distance might result in completely wrong operations or even death.

4.4 Experimental Settings

In the experiments, we set $\lambda$ to 0.5, $\epsilon$ to 0.45 and $d_p$ to 256. We adopt Adam [Kingma and Ba, 2015] for optimization. The mini-batch size is 64, and the learning rate is $10^{-5}$ for BERT-related models and $10^{-3}$ otherwise. Dropout is set to 0.5, and weight decay is $10^{-5}$.
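For concreteness, here is a sketch of the semantic distance of Eq. (14), computed over a child-to-parent map of the ICD-9 tree. The normalization by $|Y|$ and $|\hat{Y}|$ follows the formula above, and the parent-map tree representation is an assumption made for illustration.

```python
from typing import Dict, Set

def tree_distance(u: str, v: str, parent: Dict[str, str]) -> int:
    """Shortest path length between two tree nodes, via their lowest common
    ancestor; `parent` is a hypothetical child -> parent map of the ICD-9 tree."""
    ancestors = {}
    d, node = 0, u
    while True:                      # record every ancestor of u with its depth
        ancestors[node] = d
        if node not in parent:       # reached the root
            break
        node, d = parent[node], d + 1
    d, node = 0, v
    while node not in ancestors:     # climb from v until hitting u's chain
        node, d = parent[node], d + 1
    return d + ancestors[node]

def semantic_distance(Y: Set[str], Y_hat: Set[str],
                      parent: Dict[str, str], max_dist: int) -> float:
    """Eq. (14): average closest-match distance in both directions.
    An empty prediction set falls back to the maximum tree distance."""
    if not Y_hat:
        return float(max_dist)
    to_pred = sum(min(tree_distance(u, v, parent) for v in Y_hat) for u in Y)
    to_gold = sum(min(tree_distance(u, v, parent) for v in Y) for u in Y_hat)
    return to_pred / len(Y) + to_gold / len(Y_hat)
```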
Clinical Note: The patient developed upper abdominal pain and paroxysmal colic 12 hours ago with no obvious inducement. The symptoms worsened after 6 hours, accompanied by acid reflux, nausea and repeated vomiting of yellow watery content. The patient also has chills, sore throat, and fatigue all over. The patient took ibuprofen and the symptoms were not relieved ... Past history: allergy history (-), last menstruation: 2017-*-**. Menstruation is delayed.

Diagnosis: Gastroenteritis (ICD-9: 558.9), Reflux esophagitis (ICD-9: 530.1)

Figure 2: Evaluation of interpretability with a partially shown clinical note (translated from Chinese). Support is placed in the rectangular bars. The middle part shows the support and predictions from CAML, and the bottom part shows the stepwise inference results and support from our proposed model.

4.5 Experimental Results

We evaluate our model on real-world data from the emergency department of Beijing Tsinghua Changgung Hospital; experimental results are shown in Table 3. From the results we can see that: (1) K-BTD exceeds all baselines in both experiments, which demonstrates the effectiveness and robustness of our model. (2) K-BTD surpasses the baselines in macro-F1 by a considerable margin, which indicates that the utilization of the tree structure enables the model to make the right decisions on diseases that appear less commonly. (3) K-BTD reduces the semantic distance significantly compared with the baselines, which indicates that the results of our model are closer to the ground truth and can provide better references for human doctors. (4) K-BTD can benefit from research progress on text encoders: K-BTD is in fact a framework whose components can be updated easily, and we are confident that our model can reach higher performance with better text encoders in the future.

4.6 Evaluation of Interpretability

Because our model employs a tree decoding architecture, it can give support for the decision at each step, while conventional models can only provide support for final predictions. Here we demonstrate the interpretability of our model with an actual clinical note from the dataset, and compare it with another explainable model, CAML [Mullenbach et al., 2018]. For K-BTD, the support at each step is chosen from the words whose value $\xi_i$ is greater than a preset threshold. For CAML, we follow the original settings. The results are shown in Figure 2.

We can see from the results that both CAML and our model catch similar important words such as "vomiting" and "abdominal pain". However, without stepwise inference, CAML misdiagnoses upper respiratory infections because of the symptoms shared between diseases (e.g., sore throat). Besides, we can see that our model's decision at each step has different support. For example, the support for diseases of the digestive system is "abdominal" and "vomiting", while the support for diseases of the esophagus, stomach and duodenum is "acid reflux", "vomiting" and "chills". This observation demonstrates that our model pays attention to different parts of the clinical notes at different positions in the decoding process. The stepwise support brings more interpretability to final predictions and can provide human doctors with better assistance in diagnosis inference.
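For reference, the support selection described above can be sketched in a few lines; the threshold value below is an assumption, since the paper only states that it is preset.

```python
import torch
from typing import List

def support_words(delta: torch.Tensor, words: List[str],
                  threshold: float = 0.5) -> List[str]:
    """Select the support words for one decoding step.

    delta: (N, M) column-wise softmax attention from Eq. (6)
    words: the N tokens of the clinical note
    The threshold is illustrative; the paper uses an unspecified preset value.
    """
    xi = delta.sum(dim=1)  # xi_i = sum_j delta_ij, word importance from Eq. (6)
    return [w for w, score in zip(words, xi.tolist()) if score > threshold]
```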
4.7 Ablation Analysis

External medical knowledge is a critical part of our model. To further explore the role that external medical knowledge plays in the whole architecture, we conduct three experiments:

Experiment #1. Randomly mask 10% to 50% of the knowledge texts for all nodes in the ICD-9 tree.

Experiment #2. Randomly shuffle the knowledge among 10% to 50% of the nodes in the ICD-9 tree.

Experiment #3. Randomly replace the knowledge of 10% to 50% of the nodes in the ICD-9 tree with totally irrelevant materials.

We repeat each experiment five times to reduce the effect of randomness, and use the mean values as final results. In all three experiments, the model takes more epochs to converge, and the performance in both micro-F1 and macro-F1 during the initial three to five epochs decreases considerably. The results are shown in Figure 3. We observe that model performance decreases in all three experiments, and the degree of decline increases from experiment #1 to experiment #3. It is clear that noise in the external medical knowledge degrades model performance: the more severe the noise, the more the performance drops, which indicates that our model relies heavily on the quality of the external medical knowledge. We can reasonably expect that external knowledge of better quality would bring further improvement to model performance.

Figure 3: Results of ablation analysis. E1, E2 and E3 correspond to experiments #1, #2 and #3. "Micro" means the micro-F1 score and "macro" means the macro-F1 score.

4.8 Error Analysis

To give better insight into our model's performance, we calculate the ratio of first-occurring errors that appear at different depths of the ICD-9 tree. We define the depth of the root node to be 0; the maximum depth is 4. The results are shown in Table 4. For baseline models without tree decoding, we use the ICD-9 tree to locate the depth of the first-occurring error.

| Model | Depth 0 | Depth 1 | Depth 2 | Depth 3 | Depth 4 |
|---|---|---|---|---|---|
| K-BTD | 0% | 13% | 23% | 53% | 11% |
| BERT | 0% | 16% | 29% | 42% | 13% |
| AIC | 0% | 15% | 27% | 45% | 13% |
| C-MemNNs | 0% | 14% | 31% | 43% | 12% |

Table 4: Results of error analysis: ratio of first-occurring errors at each depth of the ICD-9 tree.

It is clear that K-BTD makes mistakes at deeper positions in the ICD-9 tree compared with the baselines. The deeper the first error occurs in the decoding process, the closer the predictions are to the ground truth, which again proves the effectiveness of our model. The reason for the low error ratio at depth 4 is that nodes at depth 3 have quite few children.

5 Conclusion

In this paper, we propose K-BTD, a knowledge-based tree decoding model for automatic emergency diagnosis. Specifically, we utilize the medically significant ICD-9 tree structure and formulate the task as a stepwise top-down decoding procedure. External knowledge is extracted to assist the decoding process, and we have demonstrated its importance to our model. The top-down decoding process enables our model to give different support for each decision, which adds to the interpretability and practicality of our model compared with existing single-step models. Experimental results on real-world data show that our model outperforms the baselines in all metrics, which proves its effectiveness and robustness. In the future, we will focus on increasing the diagnosis accuracy of infrequent diseases.
Acknowledgments

This work is supported by the National Natural Science Foundation of China (grants 61872218, 61721003, 61673241 and 61906105), the National Key R&D Program of China (2019YFB1404804), the Tsinghua-Fuzhou Institute of Digital Technology, the Beijing National Research Center for Information Science and Technology (BNRist), the Tsinghua University-Peking Union Medical College Hospital Initiative Scientific Research Program, and the Tsinghua University Big Data Research Center. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript. The authors would also like to thank Haoxi Zhong, Yudong Chen and Zhijing Wu for their sincere and helpful suggestions on the writing.

References

[Beygelzimer et al., 2009] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. Conditional probability tree estimation analysis and algorithms. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 51-58, 2009.

[Daumé III et al., 2017] Hal Daumé III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time one-against-some. In Proceedings of the Thirty-Fourth International Conference on Machine Learning, pages 923-932. JMLR.org, 2017.

[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[Kamkar et al., 2015] Iman Kamkar, Sunil Kumar Gupta, Dinh Phung, and Svetha Venkatesh. Stable feature selection for clinical prediction: Exploiting ICD tree structure using tree-lasso. Journal of Biomedical Informatics, 53:277-290, 2015.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746-1751, 2014.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations, 2015.

[Li et al., 2018] Min Li, Zhihui Fei, Min Zeng, Fangxiang Wu, Yaohang Li, Yi Pan, and Jianxin Wang. Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018.

[Lipton et al., 2016] Zachary Chase Lipton, David C. Kale, Charles Elkan, and Randall C. Wetzel. Learning to diagnose with LSTM recurrent neural networks. In 4th International Conference on Learning Representations, 2016.

[Luo et al., 2017] Bingfeng Luo, Yansong Feng, Jianbo Xu, Xiang Zhang, and Dongyan Zhao. Learning to predict charges for criminal cases with legal basis. arXiv preprint arXiv:1707.09168, 2017.

[Mullenbach et al., 2018] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1101-1111, 2018.

[Perotte et al., 2013] Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noémie Elhadad. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21(2):231-237, 2013.
[Prakash et al., 2017] Aaditya Prakash, Siyuan Zhao, Sadid A. Hasan, Vivek Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. Condensed memory networks for clinical diagnostic inferencing. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[Sha and Wang, 2017] Ying Sha and May D. Wang. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 233-240, 2017.

[Singh et al., 2018] Gaurav Singh, James Thomas, Iain James Marshall, John Shawe-Taylor, and Byron C. Wallace. Structured multi-label biomedical text tagging via attentive neural tree decoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2837-2842, 2018.

[Song et al., 2018] Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 175-180, 2018.

[Subotin and Davis, 2016] Michael Subotin and Anthony R. Davis. A method for modeling co-occurrence propensity of clinical codes with application to ICD-10-PCS auto-coding. Journal of the American Medical Informatics Association, 23(5):866-871, 2016.

[Trevena, 2011] Lyndal Trevena. WikiProject Medicine, 2011.

[Wang et al., 2016] Sen Wang, Xiaojun Chang, Xue Li, Guodong Long, Lina Yao, and Quan Z. Sheng. Diagnosis code assignment using sparsity-based disease correlation embedding. IEEE Transactions on Knowledge and Data Engineering, 28(12):3191-3202, 2016.

[WHO, 1978] WHO. International Classification of Diseases: [9th] Ninth Revision, Basic Tabulation List with Alphabetic Index. 1978.

[Xiao et al., 2018] Cao Xiao, Edward Choi, and Jimeng Sun. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association, 25(10):1419-1428, 2018.

[Xie and Xing, 2018] Pengtao Xie and Eric Xing. A neural architecture for automated ICD coding. In Proceedings of the Fifty-Sixth Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1066-1076, 2018.