The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Weakly-Supervised Hierarchical Text Classification

Yu Meng, Jiaming Shen, Chao Zhang, Jiawei Han
University of Illinois at Urbana-Champaign, Urbana, IL, USA
{yumeng5, js2, czhang82, hanj}@illinois.edu

Abstract

Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Recently, deep neural models have gained increasing popularity for text classification due to their expressive power and minimal requirement for feature engineering. However, applying deep neural networks to hierarchical text classification remains challenging, because they rely heavily on large amounts of training data and cannot easily determine the appropriate levels of documents in the hierarchical setting. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data but only easy-to-provide weak supervision signals such as a few class-related documents or keywords. Our method effectively leverages such weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During the training process, our model features a hierarchical neural structure, which mimics the given hierarchy and is capable of determining the proper levels for documents with a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.

Introduction

Hierarchical text classification, which aims at classifying text documents into classes that are organized into a hierarchy, is an important text mining and natural language processing task. Unlike flat text classification, hierarchical text classification considers the interrelationships among classes and allows for organizing documents into a natural hierarchical structure. It has a wide variety of applications such as semantic classification (Tang, Qin, and Liu 2015), question answering (Li and Roth 2002), and web search organization (Dumais and Chen 2000).

Traditional flat text classifiers (e.g., SVM, logistic regression) have been tailored in various ways for hierarchical text classification. Early attempts (Ceci and Malerba 2006) disregard the relationships among classes and treat hierarchical classification tasks as flat ones. Later approaches (Dumais and Chen 2000; Liu et al. 2005; Cai and Hofmann 2004) train a set of local classifiers and make predictions in a top-down manner, or design global hierarchical loss functions that regularize with the hierarchy. Most existing efforts for hierarchical text classification rely on traditional text classifiers. Recently, deep neural networks have demonstrated superior performance for flat text classification. Compared with traditional classifiers, deep neural networks (Kim 2014; Yang et al. 2016) largely reduce feature engineering efforts by learning distributed representations that capture text semantics. Meanwhile, they provide stronger expressive power than traditional classifiers, thereby yielding better performance when large amounts of training data are available.
Motivated by these appealing properties of deep neural networks, we explore using them for hierarchical text classification. Despite the success of deep neural models in flat text classification and their advantages over traditional classifiers, applying them to hierarchical text classification is nontrivial because of two major challenges. The first challenge is that training data deficiency prevents neural models from being adopted. Neural models are data hungry and require humans to provide large numbers of carefully-labeled documents for good performance. In many practical scenarios, however, hand-labeling excessive documents often requires domain expertise and can be too expensive to realize. The second challenge is to determine the most appropriate level for each document in the class hierarchy. In hierarchical text classification, documents do not necessarily belong to leaf nodes and may be better assigned to intermediate nodes. However, there is no simple way for existing deep neural networks to automatically determine the best granularity for a given document.

In this work, we propose a neural approach named WeSHClass, for Weakly-Supervised Hierarchical Text Classification, that addresses the above two challenges. Our approach is built upon deep neural networks, yet it requires only a small amount of weak supervision instead of excessive training data. Such weak supervision can be either a few (e.g., less than a dozen) labeled documents or class-correlated keywords, which can be easily provided by users. To leverage such weak supervision for effective classification, our approach employs a novel pretrain-and-refine paradigm. Specifically, in the pre-training step, we leverage user-provided seeds to learn a spherical distribution for each class, and then generate pseudo documents from a language model guided by the spherical distribution. In the refinement step, we iteratively bootstrap the global model on real unlabeled documents, which self-learns from its own high-confidence predictions.

WeSHClass automatically determines the most appropriate level during the classification process by explicitly modeling the class hierarchy. Specifically, we pre-train a local classifier at each node in the class hierarchy, and aggregate the classifiers into a global one using self-training. The global classifier is used to make final predictions in a top-down recursive manner. During recursive predictions, we introduce a novel blocking mechanism, which examines the distribution of a document over internal nodes and avoids mandatorily pushing general documents down to leaf nodes.

Our contributions are summarized as follows:

1. We design a method for hierarchical text classification using neural models under weak supervision. WeSHClass does not require large amounts of training documents but just easy-to-provide word-level or document-level weak supervision. In addition, it can be applied to different classification types (e.g., topics, sentiments).

2. We propose a pseudo document generation module that generates high-quality training documents based only on weak supervision sources. The generated documents serve as pseudo training data which, together with the subsequent self-training step, alleviate the training data bottleneck.

3. We propose a hierarchical neural model structure that mirrors the class taxonomy, together with its corresponding training method, which involves local classifier pre-training and global classifier self-training. The entire process is tailored for hierarchical text classification and automatically determines the most appropriate level of each document with a novel blocking mechanism.
4. We conduct a thorough evaluation on three real-world datasets from different domains to demonstrate the effectiveness of WeSHClass. We also perform several case studies to understand the properties of different components of WeSHClass.

Problem Formulation

We study hierarchical text classification involving tree-structured class categories. Specifically, each category can belong to at most one parent category and can have an arbitrary number of children categories. Following the definition in (Silla and Freitas 2010), we consider non-mandatory leaf prediction, wherein documents can be assigned to both internal and leaf categories in the hierarchy.

Traditional supervised text classification methods rely on large amounts of labeled documents for each class. In this work, we focus on text classification under weak supervision. Given a class taxonomy represented as a tree $T$, we ask the user to provide weak supervision sources (e.g., a few class-related keywords or documents) only for each leaf class in $T$. Then we propagate the weak supervision sources upwards in $T$ from leaves to root, so that the weak supervision sources of each internal class are an aggregation of the weak supervision sources of all its descendant leaf classes (a small sketch of this propagation is given at the end of this section). Specifically, given $M$ leaf classes, the supervision for each class comes from one of the following:

1. Word-level supervision: $S = \{S_j\}_{j=1}^{M}$, where $S_j = \{w_{j,1}, \ldots, w_{j,k}\}$ represents a set of $k$ keywords correlated with class $C_j$;

2. Document-level supervision: $D^L = \{D^L_j\}_{j=1}^{M}$, where $D^L_j = \{D_{j,1}, \ldots, D_{j,l}\}$ denotes a small set of $l$ ($l \ll$ corpus size) labeled documents in class $C_j$.

Now we are ready to formulate the hierarchical text classification problem. Given a text collection $D = \{D_1, \ldots, D_N\}$, a class category tree $T$, and weak supervision of either $S$ or $D^L$ for each leaf class in $T$, the weakly-supervised hierarchical text classification task aims to assign the most likely label $C_j \in T$ to each $D_i \in D$, where $C_j$ can be either an internal or a leaf class.
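For illustration, the following is a minimal Python sketch of the upward propagation of leaf supervision just described. The tree representation (a dict mapping each class to its children) and the function name are our own illustrative choices, not part of the paper.

```python
def propagate_supervision(children, leaf_supervision, root="root"):
    """Aggregate leaf-level weak supervision (keywords or labeled documents) upward,
    so that each internal class receives the union of its descendant leaves' supervision.

    children: dict mapping a class name to the list of its child class names (leaves map to []).
    leaf_supervision: dict mapping each leaf class to its set of keywords or documents.
    """
    supervision = {}

    def collect(node):
        kids = children.get(node, [])
        if not kids:                                   # leaf class: use the user-provided seeds
            supervision[node] = set(leaf_supervision.get(node, set()))
        else:                                          # internal class: union over descendants
            supervision[node] = set().union(*(collect(child) for child in kids))
        return supervision[node]

    collect(root)
    return supervision


# Example with a toy two-level taxonomy:
tree = {"root": ["Politics", "Sports"], "Politics": ["immigration"], "Sports": ["hockey", "tennis"]}
seeds = {"immigration": {"immigrants"}, "hockey": {"hockey"}, "tennis": {"tennis"}}
print(propagate_supervision(tree, seeds)["Sports"])    # {'hockey', 'tennis'}
```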
Pseudo Document Generation

To break the bottleneck of lacking abundant labeled data for model training, we leverage the user-given weak supervision to generate pseudo documents, which serve as pseudo training data for model pre-training. In this section, we first introduce how to leverage weak supervision sources to model class distributions in a spherical space, and then explain how to generate class-specific pseudo documents based on the class distributions and a language model.

Modeling Class Distribution. We model each class as a high-dimensional spherical probability distribution, which has been shown effective for various tasks (Zhang et al. 2017). We first train a Skip-Gram model (Mikolov et al. 2013) to learn $d$-dimensional vector representations for each word in the corpus. Since directional similarities between vectors are more effective in capturing semantic correlations (Banerjee et al. 2005; Levy, Goldberg, and Dagan 2015), we normalize all the $d$-dimensional word embeddings so that they reside on a unit sphere in $\mathbb{R}^d$. For each class $C_j \in T$, we model the semantics of $C_j$ as a mixture of von Mises-Fisher (movMF) distributions (Banerjee et al. 2005; Gopal and Yang 2014) in $\mathbb{R}^d$:

$$f(x \mid \Theta) = \sum_{h=1}^{m} \alpha_h f_h(x \mid \mu_h, \kappa_h) = \sum_{h=1}^{m} \alpha_h c_d(\kappa_h) e^{\kappa_h \mu_h^T x},$$

where $\Theta = \{\alpha_1, \ldots, \alpha_m, \mu_1, \ldots, \mu_m, \kappa_1, \ldots, \kappa_m\}$; for each $h \in \{1, \ldots, m\}$, $\kappa_h \geq 0$ and $\|\mu_h\| = 1$; and the normalization constant $c_d(\kappa_h)$ is given by

$$c_d(\kappa_h) = \frac{\kappa_h^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa_h)},$$

where $I_r(\cdot)$ denotes the modified Bessel function of the first kind at order $r$. We choose the number of components in the movMF distribution differently for leaf and internal classes:

For each leaf class $C_j$, we set the number of vMF components $m = 1$, and the resulting movMF distribution is equivalent to a single vMF distribution, whose two parameters, the mean direction $\mu$ and the concentration parameter $\kappa$, act as the semantic focus and concentration of $C_j$.

For each internal class $C_j$, we set the number of vMF components $m$ to be the number of its children classes. Recall that we only ask the user to provide weak supervision sources at the leaf classes, and the weak supervision sources of $C_j$ are aggregated from its children classes. The semantics of a parent class can thus be seen as a mixture of the semantics of its children classes.

We first retrieve a set of keywords for each class given the weak supervision sources, then fit a movMF distribution using the embedding vectors of the retrieved keywords. Specifically, the set of keywords is retrieved as follows: (1) when users provide related keywords $S_j$ for class $j$, we use the average embedding of these seed keywords to find the top-$n$ closest keywords in the embedding space; (2) when users provide documents $D^L_j$ that are correlated with class $j$, we extract $n$ representative keywords from $D^L_j$ using tf-idf weighting. The parameter $n$ above is set to the largest number that does not result in shared words across different classes. Compared to directly using the weak supervision signals, retrieving relevant keywords for modeling class distributions has a smoothing effect which makes our model less sensitive to the weak supervision sources.

Let $X$ be the set of embeddings of the $n$ retrieved keywords on the unit sphere, i.e., $X = \{x_i \in \mathbb{R}^d \mid x_i \text{ drawn from } f(x \mid \Theta), 1 \leq i \leq n\}$. We use the Expectation Maximization (EM) framework (Banerjee et al. 2005) to estimate the parameters $\Theta$ of the movMF distribution. In the E-step,

$$p(z_i = h \mid x_i, \Theta^{(t)}) = \frac{\alpha_h^{(t)} f_h(x_i \mid \mu_h^{(t)}, \kappa_h^{(t)})}{\sum_{h'=1}^{m} \alpha_{h'}^{(t)} f_{h'}(x_i \mid \mu_{h'}^{(t)}, \kappa_{h'}^{(t)})},$$

where $Z = \{z_1, \ldots, z_n\}$ is the set of hidden random variables that indicate the particular vMF distribution from which each point is sampled. In the M-step,

$$\alpha_h^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} p(z_i = h \mid x_i, \Theta^{(t)}), \qquad r_h^{(t+1)} = \sum_{i=1}^{n} p(z_i = h \mid x_i, \Theta^{(t)})\, x_i,$$

$$\mu_h^{(t+1)} = \frac{r_h^{(t+1)}}{\|r_h^{(t+1)}\|}, \qquad \frac{I_{d/2}(\kappa_h^{(t+1)})}{I_{d/2-1}(\kappa_h^{(t+1)})} = \frac{\|r_h^{(t+1)}\|}{\sum_{i=1}^{n} p(z_i = h \mid x_i, \Theta^{(t)})},$$

where we use the approximation procedure based on Newton's method (Banerjee et al. 2005) to derive an approximation of $\kappa_h^{(t+1)}$, because the implicit equation makes an analytic solution infeasible.
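To make the estimation concrete, here is a minimal numpy sketch of EM for a movMF distribution. In place of the Newton-based procedure it uses the closed-form concentration approximation from Banerjee et al. (2005), $\kappa \approx (\bar{r}d - \bar{r}^3)/(1 - \bar{r}^2)$, which they use as the starting point for Newton refinement. The function names and initialization are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function (numerically stable)

def log_vmf_density(X, mu, kappa, d):
    # log c_d(kappa) + kappa * mu^T x, with log I_v(kappa) = log ive(v, kappa) + kappa for kappa > 0
    log_cd = (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
             - (np.log(ive(d / 2 - 1, kappa)) + kappa)
    return log_cd + kappa * X.dot(mu)

def fit_movmf(X, m, n_iter=50):
    """Fit a mixture of m von Mises-Fisher distributions to the unit-norm rows of X via EM."""
    n, d = X.shape
    alpha = np.full(m, 1.0 / m)
    mu = X[np.random.choice(n, m, replace=False)]        # initialize means from data points
    kappa = np.full(m, 1.0)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        log_p = np.stack([np.log(alpha[h]) + log_vmf_density(X, mu[h], kappa[h], d)
                          for h in range(m)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        # M-step
        for h in range(m):
            alpha[h] = p[:, h].mean()
            r = (p[:, h, None] * X).sum(axis=0)
            r_norm = np.linalg.norm(r)
            mu[h] = r / r_norm
            r_bar = min(r_norm / p[:, h].sum(), 1 - 1e-6)   # guard against degenerate clusters
            # Banerjee et al. (2005) closed-form approximation of the concentration parameter
            kappa[h] = (r_bar * d - r_bar ** 3) / (1 - r_bar ** 2)
    return alpha, mu, kappa
```

For a leaf class, calling this with m = 1 reduces to fitting a single vMF distribution whose mean direction and concentration summarize the class semantics.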
Language Model Based Document Generation. After obtaining the distribution for each class, we use an LSTM-based language model (Sundermeyer, Schlüter, and Ney 2012) to generate meaningful pseudo documents. Specifically, we first train an LSTM language model on the entire corpus. To generate a pseudo document of class $C_j$, we sample an embedding vector from the movMF distribution of $C_j$ and use the closest word in the embedding space as the beginning word of the sequence. Then we feed the current sequence to the LSTM language model to generate the next word and attach it to the current sequence, recursively (in the case of long pseudo documents, we repeatedly generate several sequences and concatenate them to form the entire document). Since the beginning word of the pseudo document comes directly from the class distribution, the generated document is ensured to be correlated with $C_j$. By virtue of the mixture distribution modeling, the semantics of every child class (if any) of $C_j$ gets a chance to be included in the pseudo documents, so that the resulting trained neural model has better generalization ability.

The Hierarchical Classification Model

In this section, we introduce the hierarchical neural model and its training method under the weakly-supervised setting.

Local Classifier Pre-Training. We construct a neural classifier $M_p$ ($M_p$ could be any text classifier, such as a CNN or an RNN) for each class $C_p \in T$ that has two or more children classes. Intuitively, the classifier $M_p$ aims to classify the documents assigned to $C_p$ into its children classes for more fine-grained predictions. For each document $D_i$, the output of $M_p$ can be interpreted as $p(D_i \in C_c \mid D_i \in C_p)$, the conditional probability of $D_i$ belonging to each child class $C_c$ of $C_p$, given that $D_i$ is assigned to $C_p$. The local classifiers perform local text classification at internal nodes in the hierarchy, and serve as building blocks that are later ensembled into a global hierarchical classifier.

We generate $\beta$ pseudo documents per class and use them to pre-train the local classifiers, with the goal of providing each local classifier with a good initialization for the subsequent self-training step. To prevent the local classifiers from overfitting to pseudo documents and performing badly on real documents, we use pseudo labels instead of one-hot encodings in pre-training. Specifically, we use a hyperparameter $\alpha$ that accounts for the noise in pseudo documents, and set the pseudo label $l'_i$ for pseudo document $D'_i$ (we use $D'_i$ instead of $D_i$ to denote a pseudo document) as

$$l'_{ij} = \begin{cases} (1-\alpha) + \alpha/m & D'_i \text{ is generated from class } j \\ \alpha/m & \text{otherwise} \end{cases} \qquad (1)$$

where $m$ is the total number of children classes at the corresponding local classifier. After creating the pseudo labels, we pre-train each local classifier $M_p$ of class $C_p$ using the pseudo documents of each child class of $C_p$, by minimizing the KL divergence loss from the outputs $Y$ of $M_p$ to the pseudo labels $L'$, namely

$$\text{loss} = \text{KL}(L' \,\|\, Y) = \sum_{i} \sum_{j} l'_{ij} \log \frac{l'_{ij}}{y_{ij}}.$$

Global Classifier Self-Training. At each level $k$ in the class taxonomy, we need the network to output a probability distribution over all classes at that level. Therefore, we construct a global classifier $G_k$ by ensembling all local classifiers from the root to level $k$. The ensemble method is shown in Figure 1. The multiplication operation conducted between a parent classifier's output and a children classifier's output can be explained by the conditional probability formula:

$$p(D_i \in C_c) = p(D_i \in C_c \cap D_i \in C_p) = p(D_i \in C_c \mid D_i \in C_p)\, p(D_i \in C_p),$$

where $D_i$ is a document and $C_c$ is one of the children classes of $C_p$. This formula can be applied recursively, so that the final prediction is the product of all local classifiers' outputs on the path from the root to the destination node.

[Figure 1: Ensemble of local classifiers. For example, the root local classifier outputs p(Di ∈ Politics) = 0.05 and p(Di ∈ Sports) = 0.95; the level-1 local classifiers output p(Di ∈ Military | Di ∈ Politics) = 0.34 and p(Di ∈ Basketball | Di ∈ Sports) = 0.8; multiplying along the paths gives p(Di ∈ Military) = 0.05 × 0.34 = 0.017 and p(Di ∈ Basketball) = 0.95 × 0.8 = 0.76.]
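To make the path-multiplication concrete, here is a small Python sketch that turns local (conditional) outputs into a global distribution over every class in the hierarchy, in the spirit of Figure 1. The tree and classifier interfaces are illustrative assumptions, not the paper's code.

```python
def global_distribution(doc, children, local_classifiers):
    """Compute p(doc in C) for every class C by multiplying local classifier outputs
    along each root-to-node path, as in Figure 1.

    children: dict mapping a class to the list of its child classes ("root" at the top).
    local_classifiers: dict mapping each internal class to a function that returns the
        conditional distribution over its children, i.e. p(child | doc in parent).
    """
    probs = {"root": 1.0}
    frontier = ["root"]
    while frontier:
        nxt = []
        for parent in frontier:
            kids = children.get(parent, [])
            if not kids:
                continue                                  # leaf class: nothing to expand
            cond = local_classifiers[parent](doc)         # p(child | doc in parent)
            for child, p_cond in zip(kids, cond):
                probs[child] = probs[parent] * p_cond     # p(child) = p(child | parent) * p(parent)
            nxt.extend(kids)
        frontier = nxt
    return probs


# Toy example reproducing the numbers in Figure 1:
tree = {"root": ["Politics", "Sports"], "Politics": ["Military", "Gun Control"],
        "Sports": ["Hockey", "Basketball", "Tennis"]}
clfs = {"root": lambda d: [0.05, 0.95],
        "Politics": lambda d: [0.34, 0.66],
        "Sports": lambda d: [0.1, 0.8, 0.1]}
print(global_distribution("some document", tree, clfs)["Basketball"])   # ≈ 0.76, matching Figure 1
```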
Greedy top-down classification approaches propagate misclassifications made at higher levels to lower levels, where they can never be corrected. In contrast, the way we construct the global classifier assigns documents soft probabilities at each level, and the final class prediction is made by jointly considering all classifiers' outputs from the root to the current level via multiplication, which gives lower-level classifiers a chance to correct misclassifications made at higher levels.

At each level $k$ of the class taxonomy, we first ensemble all local classifiers from the root to level $k$ to form the global classifier $G_k$, and then use $G_k$'s predictions on all unlabeled real documents to refine it iteratively. Specifically, for each unlabeled document $D_i$, $G_k$ outputs a probability distribution $y_{ij}$ of $D_i$ belonging to each class $j$ at level $k$, and we set the pseudo labels to be (Xie, Girshick, and Farhadi 2016):

$$l'_{ij} = \frac{y_{ij}^2 / f_j}{\sum_{j'} y_{ij'}^2 / f_{j'}}, \qquad (2)$$

where $f_j = \sum_i y_{ij}$ is the soft frequency of class $j$. The pseudo labels reflect high-confidence predictions, and we use them to guide the fine-tuning of $G_k$, by iteratively (1) computing pseudo labels $L'$ based on $G_k$'s current predictions $Y$, and (2) minimizing the KL divergence loss from $Y$ to $L'$. This process terminates when less than $\delta\%$ of the documents in the corpus have class assignment changes. Since $G_k$ is the ensemble of local classifiers, they are fine-tuned simultaneously via back-propagation during self-training. We will demonstrate the advantages of using the global classifier over greedy approaches in the experiments.

Blocking Mechanism. In hierarchical classification, some documents should be classified into internal classes because they are more related to general topics than to any of the more specific topics; such documents should be blocked at the corresponding local classifier from being passed further down to children classes. When a document $D_i$ is classified into an internal class $C_j$, we use the output $q$ of $C_j$'s local classifier to determine whether or not $D_i$ should be blocked at the current class: if $q$ is close to a one-hot vector, it strongly indicates that $D_i$ should be classified into the corresponding child; if $q$ is close to a uniform distribution, it implies that $D_i$ is equally relevant or irrelevant to all the children of $C_j$ and is thus more likely a general document. Therefore, we use normalized entropy as the measure for blocking. Specifically, we block $D_i$ from being passed further down to $C_j$'s children if

$$-\frac{1}{\log m} \sum_{i=1}^{m} q_i \log q_i > \gamma, \qquad (3)$$

where $m \geq 2$ is the number of children of $C_j$ and $0 \leq \gamma \leq 1$ is a threshold value. When $\gamma = 1$, no documents are blocked and all documents are assigned to leaf classes.

Inference. The hierarchical classification model can be directly applied to classify unseen documents after training. When classifying an unseen document, the model directly outputs the probability distribution of that document belonging to each class at each level of the class hierarchy. The same blocking mechanism can be applied to determine the appropriate level to which the document should be assigned.

Algorithm Summary. Algorithm 1 puts the above pieces together and summarizes the overall model training process for hierarchical text classification. As shown, the overall training proceeds in a top-down manner, from the root to the final internal level. At each level, we generate pseudo documents and pseudo labels to pre-train each local classifier. Then we self-train the ensembled global classifier using its own predictions in an iterative manner. Finally, we apply the blocking mechanism to block general documents, and pass the remaining documents to the next level.

Algorithm 1: Overall Network Training
Input: A text collection D = {D_i}_{i=1}^{N}; a class category tree T; weak supervision W of either S or D^L for each leaf class in T.
Output: Class assignment C = {(D_i, C_i)}_{i=1}^{N}, where C_i ∈ T is the most specific class label for D_i.
1:  Initialize C ← ∅
2:  for k ← 0 to max_level − 1 do
3:      N ← all nodes at level k of T
4:      foreach node ∈ N do
5:          D′ ← pseudo document generation
6:          L′ ← Equation (1)
7:          pre-train node.classifier with D′, L′
8:      G_k ← ensemble of all classifiers from level 0 to k
9:      while not converged do
10:         L′ ← Equation (2)
11:         self-train G_k with D, L′
12:     D_B ← documents blocked based on Equation (3)
13:     C_B ← D_B's current class assignments
14:     C ← C ∪ (D_B, C_B)
15:     D ← D \ D_B
16: C′ ← D's current class assignments
17: C ← C ∪ (D, C′)
18: Return C
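The two quantities used inside the self-training loop of Algorithm 1, the sharpened targets of Equation (2) and the normalized-entropy blocking test of Equation (3), are simple to compute; a minimal numpy sketch follows (function names are ours).

```python
import numpy as np

def self_training_targets(Y):
    """Pseudo-label targets of Equation (2).
    Y: (num_docs, num_classes) array of current predicted probabilities at level k."""
    f = Y.sum(axis=0)                          # soft frequency f_j of each class
    weighted = Y ** 2 / f                      # squaring emphasizes confident predictions,
                                               # dividing by f_j discourages large classes
    return weighted / weighted.sum(axis=1, keepdims=True)

def is_blocked(q, gamma):
    """Blocking test of Equation (3): block a document at an internal class when the
    normalized entropy of its local classifier output q (over m >= 2 children) exceeds gamma."""
    q = np.asarray(q, dtype=float)
    entropy = -np.sum(q * np.log(q + 1e-12))   # small constant avoids log(0)
    return entropy / np.log(len(q)) > gamma
```

With gamma = 1 the test never fires, matching the setting used for the arXiv and Yelp Review datasets in the experiments below.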
Experiments

Experiment Settings

Datasets and Evaluation Metrics. We use three corpora from three different domains to evaluate the performance of our proposed method:

The New York Times (NYT): We crawl 13,081 news articles using the New York Times API (http://developer.nytimes.com/). This news corpus covers 5 super-categories and 25 sub-categories.

arXiv: We crawl paper abstracts from the arXiv website (https://arxiv.org/) and keep all abstracts that belong to only one category. Then we include all sub-categories with more than 1,000 documents out of the 3 largest super-categories and end up with 230,105 abstracts from 53 sub-categories.

Yelp Review: We use the Yelp Review Full dataset (Zhang, Zhao, and LeCun 2015) and take its testing portion as our dataset. The dataset contains 50,000 documents evenly distributed into 5 sub-categories, corresponding to user ratings from 1 star to 5 stars. We consider 1 and 2 stars as "negative", 3 stars as "neutral", and 4 and 5 stars as "positive", so we end up with 3 super-categories.

Table 1 provides the statistics of the three datasets; Tables 2 and 3 show some sample sub-categories of the NYT and arXiv datasets. We use Micro-F1 and Macro-F1 scores as metrics for classification performance.

Table 1: Dataset Statistics.
Corpus name  | # classes (level 1 + level 2) | # docs  | Avg. doc length
NYT          | 5 + 25                        | 13,081  | 778
arXiv        | 3 + 53                        | 230,105 | 129
Yelp Review  | 3 + 5                         | 50,000  | 157

Table 2: Sample subcategories of the NYT dataset.
Super-category (# children) | Sub-categories
Politics (9)                | abortion, surveillance, immigration, . . .
Arts (4)                    | dance, television, music, movies
Business (4)                | stocks, energy companies, economy, . . .
Science (2)                 | cosmos, environment
Sports (7)                  | hockey, basketball, tennis, golf, . . .

Table 3: Sample subcategories of the arXiv dataset.
Super-category (# children) | Sub-categories
Math (25)                   | math.NA, math.AG, math.FA, . . .
Physics (10)                | physics.optics, physics.flu-dyn, . . .
CS (18)                     | cs.CV, cs.GT, cs.IT, cs.AI, cs.DC, . . .

Baselines. We compare our proposed method with a wide range of baseline models, described below:

Hier-Dataless (Song and Roth 2014): Dataless hierarchical text classification (https://github.com/CogComp/cogcomp-nlp/tree/master/dataless-classifier) can only take word-level supervision sources. It embeds both class labels and documents in a semantic space using Explicit Semantic Analysis (Gabrilovich and Markovitch 2007) on Wikipedia articles, and assigns the nearest label to each document in the semantic space. We try both the top-down and the bottom-up approach, with and without the bootstrapping procedure, and report the best performance.

Hier-SVM (Dumais and Chen 2000; Liu et al. 2005): Hierarchical SVM can only take document-level supervision sources.
It decomposes the training task according to the class taxonomy, where each local SVM is trained to distinguish sibling categories that share the same parent node.

CNN (Kim 2014): The CNN text classification model (https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras) can only take document-level supervision sources.

WeSTClass (Meng et al. 2018): Weakly-supervised neural text classification can take both word-level and document-level supervision sources. It first generates bag-of-words pseudo documents for neural model pre-training, then bootstraps the model on unlabeled data.

No-global: This is a variant of WeSHClass without the global classifier, i.e., each document is pushed down with local classifiers in a greedy manner.

No-vMF: This is a variant of WeSHClass without using movMF distributions to model class semantics, i.e., we randomly select one word from the keyword set of each class as the beginning word when generating pseudo documents.

No-self-train: This is a variant of WeSHClass without the self-training module, i.e., after pre-training each local classifier, we directly ensemble them into a global classifier at each level to classify unlabeled documents.

Parameter Settings. For all datasets, we use the Skip-Gram model (Mikolov et al. 2013) to train 100-dimensional word embeddings for both movMF distribution modeling and classifier input embeddings. We set the pseudo label parameter α = 0.2, the number of pseudo documents per class for pre-training β = 500, and the self-training stopping criterion δ = 0.1. We set the blocking threshold γ = 0.9 for the NYT dataset, where general documents exist, and γ = 1 for the other two. Although our proposed method can use any neural model as local classifiers, we empirically find that CNN models always result in better performance than RNN models such as LSTM (Hochreiter and Schmidhuber 1997) and Hierarchical Attention Networks (Yang et al. 2016). Therefore, we report the performance of our method using a CNN model with one convolutional layer as the local classifiers. Specifically, the filter window sizes are 2, 3, 4, 5 with 20 feature maps each (see the configuration sketch after this subsection). Both the pre-training and the self-training steps are performed using SGD with batch size 256.

Weak Supervision Settings. The seed information we use as weak supervision for the different datasets is as follows: (1) when the supervision source is class-related keywords, we select 3 keywords for each leaf class; (2) when the supervision source is labeled documents, we randomly sample c documents of each leaf class from the corpus (c = 3 for NYT and arXiv; c = 10 for Yelp Review) and use them as the given labeled documents. To alleviate randomness, we repeat the document selection process 10 times and report the average and standard deviation of the performance. We list the keyword supervision of some sample classes for the NYT dataset as follows: Immigration (immigrants, immigration, citizenship); Dance (ballet, dancers, dancer); Environment (climate, wildlife, fish).
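Since the paper specifies the local classifier only at the level of hyperparameters, the following Keras sketch shows one way a local classifier with those settings could be assembled (100-dimensional embeddings, filter windows 2/3/4/5 with 20 feature maps each, KL-divergence loss, SGD, batch size 256). The vocabulary size, sequence length, and number of epochs are placeholders; this is an illustration under those assumptions, not the authors' code.

```python
from tensorflow.keras import layers, models, optimizers

def build_local_classifier(vocab_size, num_children, seq_len=200, emb_dim=100):
    """One local classifier: a single-convolutional-layer CNN over word embeddings,
    with a softmax over the children classes of the corresponding taxonomy node."""
    inp = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, emb_dim)(inp)
    pooled = []
    for window in (2, 3, 4, 5):                                # filter window sizes from the paper
        c = layers.Conv1D(filters=20, kernel_size=window, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    merged = layers.Concatenate()(pooled)
    out = layers.Dense(num_children, activation="softmax")(merged)
    model = models.Model(inp, out)
    # Pre-training and self-training both minimize KL divergence to pseudo labels with SGD.
    model.compile(optimizer=optimizers.SGD(), loss="kullback_leibler_divergence")
    return model

# Example: a local classifier for a node with 5 children, pre-trained on pseudo documents.
# clf = build_local_classifier(vocab_size=30000, num_children=5)
# clf.fit(pseudo_docs, pseudo_labels, batch_size=256, epochs=5)   # epochs is a placeholder
```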
Quantitative Comparison

We show the overall text classification results in Table 4. WeSHClass achieves the overall best performance among all the baselines on the three datasets. Notably, when the supervision source is class-related keywords, WeSHClass outperforms Hier-Dataless and WeSTClass, which shows that WeSHClass can better leverage word-level supervision sources in hierarchical text classification. When the supervision source is labeled documents, WeSHClass has not only higher average performance, but also better stability than the supervised baselines. This demonstrates that when training documents are extremely limited, WeSHClass can better leverage the insufficient supervision for good performance and is less sensitive to the choice of seed documents.

Comparing WeSHClass with its ablations, No-global, No-vMF and No-self-train, we observe the effectiveness of the following components: (1) the ensemble of local classifiers, (2) modeling class semantics as movMF distributions, and (3) self-training. The results demonstrate that all these components contribute to the performance of WeSHClass.

Component-Wise Evaluation

In this subsection, we conduct a series of breakdown experiments on the NYT dataset using class-related keywords as weak supervision to further investigate different components of our proposed method. We obtain similar results on the other two datasets.

Pseudo Document Generation. The quality of the generated pseudo documents is critical to our model, since high-quality pseudo documents provide a good model initialization. Therefore, we are interested in which pseudo document generation method gives our model the best initialization for the subsequent self-training step. We compare our document generation strategy (movMF + LSTM language model) with the following two methods:

Bag-of-words (Meng et al. 2018): The pseudo documents are generated from a mixture of a background unigram distribution and class-related keyword distributions.

Bag-of-words + reordering: We first generate bag-of-words pseudo documents as in the previous method, and then use the globally trained LSTM language model to reorder each pseudo document by greedily appending the word with the highest probability to the end of the current sequence. The beginning word is randomly chosen.

We showcase some generated pseudo document snippets of the class "politics" for the NYT dataset using the different methods in Table 5. The bag-of-words method generates pseudo documents without word order information; the bag-of-words method with reordering generates text of high quality at the beginning but poor quality near the end, probably because the proper words have already been used at the beginning and the remaining words are crowded implausibly at the end; our method generates text of high quality throughout.

To compare the generalization ability of the models pre-trained with different pseudo documents, we show their subsequent self-training process (at level 1) in Figure 2(a). We notice that our strategy not only makes self-training converge faster, but also leads to better final performance.

Global Classifier and Self-training. We proceed to study why using a self-trained global classifier built on the ensemble of local classifiers is better than a greedy approach. We show the self-training procedure of the global classifier at the final level in Figure 2(b), where we report the classification accuracy at level 1 (super-categories), at level 2 (sub-categories), and over all classes. Since at the final level all local classifiers are ensembled to construct the global classifier, self-training of the global classifier is the joint training of all local classifiers. The results show that ensembling the local classifiers for joint training improves the accuracy at all levels. If a greedy approach is used, however, higher-level classifiers are not updated during lower-level classification, and misclassifications at higher levels cannot be corrected.
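To illustrate why this joint fine-tuning is possible at all, the PyTorch sketch below (our own construction, not the paper's code) builds a level-2 global classifier as a differentiable product of local classifier outputs, so a single KL self-training loss on the global prediction, with targets computed as in Equation (2), back-propagates into every local classifier on the path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalClassifier(nn.Module):
    """Level-2 global classifier as a differentiable product of local classifiers.
    root_clf maps document features to logits over the level-1 classes; child_clfs[p]
    maps the same features to logits over the children of the p-th level-1 class,
    e.g. root_clf = nn.Linear(feat_dim, 2), child_clfs = [nn.Linear(feat_dim, 2),
    nn.Linear(feat_dim, 3)] for the taxonomy of Figure 1."""
    def __init__(self, root_clf, child_clfs):
        super().__init__()
        self.root_clf = root_clf
        self.child_clfs = nn.ModuleList(child_clfs)

    def forward(self, x):
        p_parent = F.softmax(self.root_clf(x), dim=-1)            # p(C_p | D_i)
        parts = []
        for p, clf in enumerate(self.child_clfs):
            p_child = F.softmax(clf(x), dim=-1)                   # p(C_c | D_i, C_p)
            parts.append(p_child * p_parent[:, p:p + 1])          # p(C_c | D_i)
        return torch.cat(parts, dim=-1)

def self_train_step(model, x, targets, optimizer):
    """One self-training step: because the loss is taken on the product distribution,
    gradients reach the root classifier and every child classifier simultaneously."""
    optimizer.zero_grad()
    pred = model(x)
    loss = F.kl_div(torch.log(pred + 1e-12), targets, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```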
Blocking During Self-training. We demonstrate the dynamics of the blocking mechanism during self-training. Figure 2(c) shows the average normalized entropy of the corresponding local classifier output for each document in the NYT dataset, and Figure 2(d) shows the total number of blocked documents during the self-training procedure at the final level. Recall that we enhance high-confidence predictions to refine our model during self-training. Therefore, the average normalized entropy decreases during self-training, implying that there is less uncertainty in the outputs of our model. Correspondingly, fewer documents are blocked, resulting in more available documents for self-training.

Table 4: Macro-F1 and Micro-F1 scores for all methods on three datasets, under two types of weak supervision. For document-level supervision (DOCS), averages and standard deviations over 10 runs are reported.
Methods       | NYT KW Macro | NYT KW Micro | NYT DOCS Macro | NYT DOCS Micro | arXiv KW Macro | arXiv KW Micro | arXiv DOCS Macro | arXiv DOCS Micro | Yelp KW Macro | Yelp KW Micro | Yelp DOCS Macro | Yelp DOCS Micro
Hier-Dataless | 0.593 | 0.811 | -             | -             | 0.374 | 0.594 | -             | -             | 0.284 | 0.312 | -             | -
Hier-SVM      | -     | -     | 0.142 (0.016) | 0.469 (0.012) | -     | -     | 0.049 (0.001) | 0.443 (0.006) | -     | -     | 0.220 (0.082) | 0.310 (0.113)
CNN           | -     | -     | 0.165 (0.027) | 0.329 (0.097) | -     | -     | 0.124 (0.014) | 0.456 (0.023) | -     | -     | 0.306 (0.028) | 0.372 (0.028)
WeSTClass     | 0.386 | 0.772 | 0.479 (0.027) | 0.728 (0.036) | 0.412 | 0.642 | 0.264 (0.016) | 0.547 (0.009) | 0.348 | 0.389 | 0.345 (0.027) | 0.388 (0.033)
No-global     | 0.618 | 0.843 | 0.520 (0.065) | 0.768 (0.100) | 0.442 | 0.673 | 0.264 (0.020) | 0.581 (0.017) | 0.391 | 0.424 | 0.369 (0.022) | 0.403 (0.016)
No-vMF        | 0.628 | 0.862 | 0.527 (0.031) | 0.825 (0.032) | 0.406 | 0.665 | 0.255 (0.015) | 0.564 (0.012) | 0.410 | 0.457 | 0.372 (0.029) | 0.407 (0.015)
No-self-train | 0.550 | 0.787 | 0.491 (0.036) | 0.769 (0.039) | 0.395 | 0.635 | 0.234 (0.013) | 0.535 (0.010) | 0.362 | 0.408 | 0.348 (0.030) | 0.382 (0.022)
WeSHClass     | 0.632 | 0.874 | 0.532 (0.015) | 0.827 (0.012) | 0.452 | 0.692 | 0.279 (0.010) | 0.585 (0.009) | 0.423 | 0.461 | 0.375 (0.021) | 0.410 (0.014)

Table 5: Sample generated pseudo document snippets of the class "politics" for the NYT dataset.
Doc # | Bag-of-words | Bag-of-words + reordering | movMF + LSTM language model
1 | "he s cup abortion bars have pointed use of lawsuits involving smoothen bettors rights in the federal exchange, limewire . . ." | "the clinicians pianists said that the legalizing of the profiling of the . . . abortion abortion abortion identification abortions . . ." | "abortion rights is often overlooked by the president s 30-feb format of a moonjock period that offered him the rules to . . ."
2 | "first tried to launch the agent in immigrants were in a lazar and lakshmi definition of yerxa riding this we get very coveted as . . ." | "majorities and clintons legalization, moderates and tribes lawfully . . . lawmakers clinics immigrants immigrants immigrants . . ." | "immigrants who had been headed to the united states in benghazi, libya, saying that mr. he making comments describing . . ."
3 | "the september crew members budget security administrator lat coequal representing a federal customer, identified the bladed . . ." | "the impasse of allowances overruns pensions entitlement . . . funding financing budgets budgets budgets budgets taxpayers . . ." | "budget increases on oil supplies have grown more than a ezio of its 20 percent of energy spaces, producing plans by 1 billion . . ."

[Figure 2: Component-wise evaluation on the NYT dataset. (a) Pseudo document generation: Macro-F1/Micro-F1 over self-training iterations for BOW, BOW-reorder, and LSTM; (b) Global classifier self-training: accuracy at level 1, level 2, and over all classes; (c) Average normalized entropy over self-training iterations; (d) Number of blocked documents over self-training iterations.]
Related Work

Weakly-Supervised Text Classification. There exist previous studies that use either word-based supervision or a limited amount of labeled documents as weak supervision sources for the text classification task. WeSTClass (Meng et al. 2018) leverages both types of supervision sources. It applies a similar procedure of pre-training the network with pseudo documents followed by self-training on unlabeled data. Descriptive LDA (Chen et al. 2015) applies an LDA model to infer Dirichlet priors from given keywords as category descriptions; the Dirichlet priors guide LDA to induce category-aware topics from unlabeled documents for classification. (Ganchev et al. 2010) propose to encode prior knowledge and indirect supervision as constraints on the posteriors of latent variable probabilistic models. Predictive text embedding (Tang, Qu, and Mei 2015) utilizes both labeled and unlabeled documents to learn text embeddings specifically for a task: labeled data and word co-occurrence information are first represented as a large-scale heterogeneous text network and then embedded into a low-dimensional space, and the learned embeddings are fed to logistic regression classifiers for classification. None of the above methods is specifically designed for hierarchical classification.

Hierarchical Text Classification. There have been efforts on using SVMs for hierarchical classification. (Dumais and Chen 2000; Liu et al. 2005) propose to use local SVMs that are trained to distinguish the children classes of the same parent node, so that the hierarchical classification task is decomposed into several flat classification tasks. (Cai and Hofmann 2004) define a hierarchical loss function and apply cost-sensitive learning to generalize SVM learning for hierarchical classification. A graph-CNN based deep learning model is proposed in (Peng et al. 2018) to convert text to graph-of-words, on which graph convolution operations are applied for feature extraction. FastXML (Prabhu and Varma 2014) is designed for extremely large label spaces; it learns a hierarchy of training instances and optimizes a ranking-based objective at each node of the hierarchy. The above methods rely heavily on the quantity and quality of training data for good performance, while WeSHClass does not require much training data but only weak supervision from users. Hierarchical dataless classification (Song and Roth 2014) uses class-related keywords as class descriptions, and projects classes and documents into the same semantic space by retrieving Wikipedia concepts. Classification can be performed in both top-down and bottom-up manners, by measuring the vector similarity between documents and classes. Although hierarchical dataless classification does not rely on massive training data either, its performance is highly influenced by the text similarity between the distant supervision source (Wikipedia) and the given unlabeled corpus.

Conclusions

We proposed WeSHClass, a weakly-supervised hierarchical text classification method. Our designed hierarchical network structure and training method can effectively leverage (1) different types of weak supervision sources to generate high-quality pseudo documents for better model generalization, and (2) the class taxonomy for better performance than flat methods and greedy approaches.
WeSHClass outperforms various supervised and weakly-supervised baselines on three datasets from different domains, which demonstrates the practical value of WeSHClass in real-world applications. In the future, it would be interesting to study what kinds of weak supervision are most effective for the hierarchical text classification task and how to combine multiple sources to achieve even better performance.

Acknowledgements

This research is sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). We thank the anonymous reviewers for their valuable and insightful feedback.

References

Banerjee, A.; Dhillon, I. S.; Ghosh, J.; and Sra, S. 2005. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research 6:1345–1382.
Cai, L., and Hofmann, T. 2004. Hierarchical document categorization with support vector machines. In CIKM.
Ceci, M., and Malerba, D. 2006. Classifying web documents in a hierarchy of categories: a comprehensive study. Journal of Intelligent Information Systems 28:37–78.
Chen, X.; Xia, Y.; Jin, P.; and Carroll, J. A. 2015. Dataless text classification with descriptive LDA. In AAAI.
Dumais, S. T., and Chen, H. 2000. Hierarchical classification of web content. In SIGIR.
Gabrilovich, E., and Markovitch, S. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI.
Ganchev, K.; Graça, J.; Gillenwater, J.; and Taskar, B. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research 11:2001–2049.
Gopal, S., and Yang, Y. 2014. Von Mises-Fisher clustering models. In ICML.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9:1735–1780.
Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP.
Levy, O.; Goldberg, Y.; and Dagan, I. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL 3:211–225.
Li, X., and Roth, D. 2002. Learning question classifiers. In COLING.
Liu, T.-Y.; Yang, Y.; Wan, H.; Zeng, H.-J.; Chen, Z.; and Ma, W.-Y. 2005. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations 7:36–43.
Meng, Y.; Shen, J.; Zhang, C.; and Han, J. 2018. Weakly-supervised neural text classification. In CIKM.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
Peng, H.; Li, J.; He, Y.; Liu, Y.; Bao, M.; Wang, L.; Song, Y.; and Yang, Q. 2018. Large-scale hierarchical text classification with recursively regularized deep graph-CNN. In WWW.
Prabhu, Y., and Varma, M. 2014. FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD.
Silla, C. N., and Freitas, A. A. 2010. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22:31–72.
Song, Y., and Roth, D. 2014. On dataless hierarchical text classification. In AAAI.
Sundermeyer, M.; Schlüter, R.; and Ney, H. 2012. LSTM neural networks for language modeling. In INTERSPEECH.
Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP.
Tang, J.; Qu, M.; and Mei, Q. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In KDD.
Xie, J.; Girshick, R. B.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In ICML.
Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A. J.; and Hovy, E. H. 2016. Hierarchical attention networks for document classification. In HLT-NAACL.
Zhang, C.; Liu, L.; Lei, D.; Yuan, Q.; Zhuang, H.; Hanratty, T.; and Han, J. 2017. TrioVecEvent: Embedding-based online local event detection in geo-tagged tweet streams. In KDD.
Zhang, X.; Zhao, J. J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In NIPS.