Conditional Bernoulli Mixtures for Multi-Label Classification

Cheng Li (CHENGLI@CCS.NEU.EDU), Bingyu Wang (RAINICY@CCS.NEU.EDU), Virgil Pavlu (VIP@CCS.NEU.EDU), Javed Aslam (JAA@CCS.NEU.EDU)
College of Computer and Information Science, Northeastern University, Boston, MA 02115, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Multi-label classification is an important machine learning task wherein one assigns a subset of candidate labels to an object. In this paper, we propose a new multi-label classification method based on Conditional Bernoulli Mixtures. Our proposed method has several attractive properties: it captures label dependencies; it reduces the multi-label problem to several standard binary and multi-class problems; it subsumes the classic independent binary prediction and power-set subset prediction methods as special cases; and it exhibits accuracy and/or computational complexity advantages over existing approaches. We demonstrate two implementations of our method using logistic regressions and gradient boosted trees, together with a simple training procedure based on Expectation Maximization. We further derive an efficient prediction procedure based on dynamic programming, thus avoiding the cost of examining an exponential number of potential label subsets. Experimental results show the effectiveness of the proposed method against competitive alternatives on benchmark datasets.

1. Introduction

Multi-label classification is an important machine learning task wherein one assigns a subset of candidate labels to an object. Such problems arise naturally in the context of medical diagnostics, text classification, image tagging, and a host of other areas. For example, in the medical domain, a patient may present multiple illnesses or undergo a procedure with multiple billing codes; in news categorization, an article can be associated with multiple categories; in image tagging, a photo may have multiple tags (see Figure 1).

Figure 1. An image from the NUS-WIDE multi-label dataset. Independent binary logistic regressions predict the labels {clouds, lake, sky, sunset, water}. Our proposed method correctly predicts {clouds, lake, sky, sunset, water, reflection}, because it captures the dependencies between reflection and the other labels.

Multi-label classification is particularly difficult in the common case when the labels are many and dependent. In this case, simple approaches such as independent binary label prediction and power-set subset prediction fail due to label dependencies and combinatorial explosion, respectively. More sophisticated approaches such as classifier chains, conditional dependency networks, and conditional random fields suffer from accuracy/computational-complexity trade-offs in training and/or prediction. We introduce Conditional Bernoulli Mixtures for multi-label classification to address these challenges, yielding multi-label classifiers that are flexible, accurate, and efficient.

Formally, in a multi-label classification problem, we are given a set of label candidates Y = {1, 2, ..., L}. Every data point x ∈ ℝ^D matches a subset of labels y ⊆ Y, which can equivalently be written as a binary vector y ∈ {0,1}^L, with each bit y_ℓ indicating the presence or absence of the corresponding label.
The goal of learning is to build a classifier h : ℝ^D → {0,1}^L which maps an instance to a subset of labels. The predicted label subset can be of arbitrary size; when the size is restricted to be 1, the problem is called multi-class classification. Furthermore, if the total number of label candidates L is 2, the problem becomes binary classification.

Various measures can be used to evaluate multi-label classifiers. Among them, three commonly used measures are subset accuracy, the Jaccard index, and Hamming loss. Let {(x_n, y_n)}_{n=1}^N be a multi-label dataset with ground-truth labels, and let {ŷ_n}_{n=1}^N be the predictions made by a classifier. Subset accuracy generalizes the conventional multi-class accuracy notion:

$$\frac{1}{N}\sum_{n=1}^{N} \mathbf{1}[y_n = \hat{y}_n], \qquad (1)$$

where 1[·] is the indicator function. In computing subset accuracy, a predicted subset is considered correct only when it matches the true subset exactly. Admittedly, this metric is very stringent. However, optimizing for subset accuracy is a very interesting algorithmic design and research problem, as it encourages classifiers to output coherent and complete predictions. The optimal prediction for subset accuracy is achieved by outputting the mode of the conditional joint distribution:

$$h^{*}(x) = \operatorname*{argmax}_{y}\; p(y\,|\,x). \qquad (2)$$

The Jaccard index (also called overlap) measures the overlap between the true subset and the predicted subset as the size of their intersection divided by the size of their union, and Hamming loss counts the number of bit-wise mistakes. The theoretical results in (Dembczyński et al., 2012; Koyejo et al., 2015) show that a multi-label classifier designed to optimize one measure may be suboptimal under other measures. In this paper, we focus on the multi-label classification task with subset accuracy as the target evaluation measure, but for completeness we also report the Jaccard index and Hamming loss results in the supplementary material.

According to (2), optimizing subset accuracy requires estimating the conditional joint distribution p(y|x), which captures conditional label dependencies given features. One naive approach is to assume conditional independence among labels, p(y|x) = ∏_{ℓ=1}^L p(y_ℓ|x), which reduces a multi-label problem to L binary classification problems. This approach is called Binary Relevance (BinRel) (Tsoumakas & Katakis, 2007), and is widely used due to its simplicity. One obvious disadvantage is that the individual label predictions can often be conflicting, such as tagging cat but not animal for an image. At the other extreme is the Power-Set (PowSet) approach (Tsoumakas & Katakis, 2007), which treats each label subset as a class and trains a multi-class classifier. As a consequence, one is restricted in practice to predicting only label subsets seen in the training data, always assigning zero probability to unseen subsets; even for many of the observed subsets there is likely a scarcity of training data. Power-Set handles label dependencies and its predictions are coherent, but the method is often infeasible due to the exponential number of label subsets.
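To make the two baseline reductions concrete, the sketch below fits them with scikit-learn logistic regressions. This is our own illustration, not the implementation used in this paper, and the helper names fit_binary_relevance and fit_power_set are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binary_relevance(X, Y):
    """Binary Relevance: one independent binary classifier per label,
    i.e. p(y|x) = prod_l p(y_l|x). Assumes every column of Y contains
    both 0s and 1s in the training data."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, l])
            for l in range(Y.shape[1])]

def fit_power_set(X, Y):
    """Power-Set: treat each distinct label subset as one class and train
    a single multi-class classifier over the subsets seen in training."""
    subset_ids = np.array([''.join(map(str, row)) for row in Y])
    return LogisticRegression(max_iter=1000).fit(X, subset_ids)
```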
Bernoulli Mixtures. Mixture models offer a flexible and powerful framework for many multivariate density estimation problems. A mixture generally has the form

$$p(y) = \sum_{k=1}^{K} \pi_k\, p(y;\beta_k), \qquad (3)$$

which approximates a complex joint p(y) by a weighted combination of K component densities p(y; β_k), each of which typically takes some simple density form parametrized by β_k. The Expectation Maximization (EM) algorithm can be employed to train such mixture models by iterating between estimating the mixture coefficients and fitting the component densities.

Bernoulli Mixtures (BM) are classic models for multi-dimensional binary variable density estimation (Lazarsfeld et al., 1968; McLachlan & Peel, 2004; Bishop, 2006), where learnability is achieved by assuming independence of the variables within each mixture component. Thus each component density p(y; β_k) is simply a product of Bernoulli densities and the overall model has the form

$$p(y) = \sum_{k=1}^{K} \pi_k \prod_{\ell=1}^{L} \mathrm{Ber}(y_\ell;\, \mu_{k\ell}), \qquad (4)$$

where Ber(y_ℓ; µ_kℓ) denotes the Bernoulli distribution with head probability µ_kℓ. BM has two attractive properties. First, dependencies: although the L variables are assumed to be independent inside each component, they are in general dependent in the mixture (they are forced to be independent only when K = 1). This can be verified by computing the following covariance matrix and observing that it is non-diagonal for K ≥ 2 (Bishop, 2006):

$$\mathrm{cov}(y) = \sum_{k=1}^{K} \pi_k\big[\Sigma_k + \mu_k \mu_k^{\top}\big] - E(y)E(y)^{\top}, \qquad (5)$$

where E(y) = Σ_{k=1}^K π_k µ_k, Σ_k = diag{µ_kℓ(1 − µ_kℓ)}, and µ_k = (µ_k1, µ_k2, ..., µ_kL)^⊤. Second, efficiency: the full factorization of each component makes fitting BM efficient via the EM algorithm.

In this paper, we propose a new multi-label classification method which approximates the conditional joint p(y|x) based on Bernoulli Mixtures. The method simultaneously learns the label dependencies from data and trains classifiers to account for such dependencies.

2. Conditional Bernoulli Mixtures

For multi-label classification, we model the conditional joint p(y|x) with a discriminative extension of BM, capturing conditional dependencies among binary labels given features. The analysis in (Dembczyński et al., 2012) shows that labels can be largely conditionally independent given features (i.e., p(y|x) ≈ ∏_{ℓ=1}^L p(y_ℓ|x)), even when labels are strongly marginally dependent (i.e., p(y) ≠ ∏_{ℓ=1}^L p(y_ℓ)), as long as each label is highly predictable from features. Therefore it is not necessary to capture correlations among easy-to-predict labels; it is sufficient to capture only those correlations which involve some hard-to-predict labels. Thus conditioning on features greatly reduces the need to estimate label correlations and makes learning much easier; conditioning on x is essential for a good approximation and effective training, without making prior assumptions about the form of label dependencies (as many other methods do). Making both the mixture coefficients and the Bernoulli distributions conditional on x, we obtain our proposed model, Conditional Bernoulli Mixtures (CBM):

$$p(y\,|\,x) = \sum_{k=1}^{K} \pi(z=k\,|\,x;\alpha) \prod_{\ell=1}^{L} b(y_\ell\,|\,x;\beta_{k\ell}), \qquad (6)$$

where α and β_kℓ (ℓ = 1, 2, ..., L; k = 1, 2, ..., K) are model parameters to be learned, and z is a hidden categorical indicator variable, such that z = k if the data point is assigned to component k. Here we use π and b to represent the conditional mixture membership distribution and the conditional binary label distribution, respectively. CBM reduces a multi-label problem to a multi-class problem and several binary problems, approaching p(y|x) in a divide-and-conquer manner: the categorical distribution π(z|x; α) assigns each instance x to one of K components probabilistically. Its goal is to divide the feature space into several regions such that each region has only weak conditional label correlations and can be approximated by a simple component.
This gating function π(z|x; α) can be instantiated by any multi-class classifier which provides probability estimates, such as a multinomial logistic regression with parameters α. Inside each region k, the local conditional joint density is approximated by a product of conditional marginal densities. Each local binary label predictor b(y_ℓ|x; β_kℓ) estimates the probability of label y_ℓ under component k for data point x, and can be instantiated by any binary classifier which provides probability estimates, such as a binary logistic regression with parameters β_kℓ. All K components together with the gating function are learned jointly in order to break the global label correlation into simple forms.

At one extreme, if CBM has only one component (and hence π(z|x; α) plays no role), all labels are conditionally independent and CBM degenerates to Binary Relevance. At the other extreme, if CBM assigns one component to each unique label subset and fixes each b(y_ℓ = 1|x; β_kℓ) to the corresponding binary constant (1 if label ℓ is in the subset and 0 otherwise), then the overall CBM model simply selects one label subset from all possible subsets, which is conceptually the same as the Power-Set approach. By varying the number of components and the complexity of each component, CBM provides a continuous spectrum between these two extremes. Just like Binary Relevance and Power-Set, CBM is purely a reduction method. The main advantage of reduction methods compared with new models specifically designed for the multi-label problem is that the reduction approach makes it easy to incorporate and reuse many well-developed multi-class and binary classifiers. In Section 3, we demonstrate two different instantiations of CBM, one with logistic regressions and the other with gradient boosted trees.

Figure 2. The top 4 most influential components for the test image in Figure 1. π values indicate component mixing coefficients. Each bar represents an individual label probability in one component. Labels with near-zero probabilities are not displayed here.

An Illustrative Example. Before diving into the model training details, we illustrate how CBM performs joint label classification with an example. Figure 1 shows a test image from the NUS-WIDE dataset, for which the matched label reflection is missed by Binary Relevance but is captured by CBM. For this image, the most influential components produced by CBM are shown in Figure 2. Simply averaging the individual label probabilities by the component mixing weights gives the conditional marginals p(lake|x) = 0.56, p(water|x) = 0.69, p(sunset|x) = 0.66, and p(reflection|x) = 0.32, which indicate that, unlike lake, water, and sunset, the label reflection is not by itself deemed probable by CBM. However, from the CBM joint density p(y|x) we can also infer the Pearson correlation coefficients ρ_reflection,lake = 0.50, ρ_reflection,water = 0.40, ρ_reflection,sunset = 0.17 and observe that reflection is positively correlated with lake, water, and sunset. In fact, the joint probability asserts that the subset {clouds, lake, sky, sunset, water, reflection} is the most probable one, with probability 0.09. By contrast, the subset {clouds, lake, sky, sunset, water} has a lower probability of 0.06. Therefore, although individually unlikely, the label reflection is correctly added to the most likely subset due to label correlations.
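For concreteness, the joint and marginal probabilities used in this example follow directly from (6) once the gate probabilities π_k and the per-component label probabilities µ_kℓ have been computed for the image. The NumPy sketch below (the array layout and function names are ours) shows the two computations:

```python
import numpy as np

def cbm_joint(pi, mu, y):
    """p(y|x) under Eq. (6): sum_k pi_k * prod_l Ber(y_l; mu_kl).
    pi: (K,) gate probabilities for this x
    mu: (K, L) per-component label probabilities for this x
    y:  (L,) binary vector encoding a candidate label subset"""
    component_probs = np.prod(np.where(y == 1, mu, 1.0 - mu), axis=1)  # (K,)
    return float(pi @ component_probs)

def cbm_marginals(pi, mu):
    """Per-label marginals p(y_l = 1 | x) = sum_k pi_k * mu_kl."""
    return pi @ mu
```

Comparing the two competing subsets above amounts to two calls to cbm_joint, while cbm_marginals performs the averaging of individual label probabilities by the mixing weights described in the example.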
3. Training CBM with EM

CBM can be trained by maximum likelihood estimation on a given dataset {(x_n, y_n)}_{n=1}^N. Below we first derive a generic EM algorithm that works for any instantiation of the components, and then consider two concrete instantiations. To make the mathematical derivation clear, we use different symbols for random variables and values of random variables: we write Y_n when treating the labeling of x_n as an unknown random variable, and y_n when referring to its specific labeling assignment given in the training set. The likelihood of the dataset is

$$\prod_{n=1}^{N}\left\{\sum_{k=1}^{K}\left[\pi(z_n=k\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k\ell})\right]\right\}.$$

Since the model contains hidden variables, we use EM to minimize an upper bound on the negative log likelihood. Denoting the posterior membership distribution p(z_n|x_n, y_n) as Γ(z_n) = (γ_n^1, γ_n^2, ..., γ_n^K), the upper bound can be written as

$$\sum_{n=1}^{N} \mathrm{KL}\big(\Gamma(z_n)\,\|\,\pi(z_n|x_n;\alpha)\big) + \sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{\ell=1}^{L}\gamma_n^k\, \mathrm{KL}\big(\mathrm{Ber}(Y_{n\ell};\, y_{n\ell})\,\|\,b(Y_{n\ell}|x_n;\beta_{k\ell})\big), \qquad (7)$$

where Ber(Y_nℓ; y_nℓ) is the Bernoulli distribution with head probability y_nℓ.

E Step: Re-estimate the posterior membership probability of each data point belonging to each component:

$$\gamma_n^k = \frac{\pi(z_n=k\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k\ell})}{\sum_{k'=1}^{K}\pi(z_n=k'\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k'\ell})}. \qquad (8)$$

M Step: Update all model parameters. The bound (7) shows that the optimization of all parameters decomposes nicely into a series of separate optimization problems:

$$\alpha^{\mathrm{new}} = \operatorname*{argmin}_{\alpha} \sum_{n=1}^{N} \mathrm{KL}\big(\Gamma(z_n)\,\|\,\pi(z_n|x_n;\alpha)\big), \qquad (9)$$

$$\beta_{k\ell}^{\mathrm{new}} = \operatorname*{argmin}_{\beta_{k\ell}} \sum_{n=1}^{N} \gamma_n^k\, \mathrm{KL}\big(\mathrm{Ber}(Y_{n\ell};\, y_{n\ell})\,\|\,b(Y_{n\ell}|x_n;\beta_{k\ell})\big). \qquad (10)$$

The optimization problem defined in (9) is a multi-class classification problem with target class distribution (soft labels) Γ(z_n) = (γ_n^1, γ_n^2, ..., γ_n^K) for x_n. The training goal for π(z|x; α) is to assign (probabilistically) each data point to a component based only on its features, such that the assignment matches the posterior component membership determined by both features and true labels. The optimization problem defined in (10) is a weighted binary classification problem: x_n has target label y_nℓ and weight γ_n^k. The prediction component b(y_ℓ|x; β_kℓ) is trained as a binary classifier for label ℓ in component k, based only on the training data within (soft) component k.

To train CBM, we iterate between the E step and the M step until the upper bound (7) converges (see Algorithm 1).

Algorithm 1: Generic Training for CBM
1: repeat
2:   E Step
3:   for n = 1, 2, ..., N; k = 1, 2, ..., K do
4:     update γ_n^k as in (8)
5:   M Step
6:   update α as in (9)
7:   for k = 1, 2, ..., K; ℓ = 1, 2, ..., L do
8:     update β_kℓ as in (10)
9: until convergence

In practice, the EM algorithm can get stuck in local optima, so careful initialization is often necessary for good performance. Due to the natural connection between CBM and BM, one simple way of initializing CBM is to first fit a BM for label density estimation alone, without looking at features. BM can be trained very quickly by a simpler EM algorithm (Bishop, 2006). We can run several random starts for a set of BM models, and select the one with the best training objective score to initialize the CBM training procedure.
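The E step (8) is a direct posterior computation. The NumPy sketch below (our own array layout; the gate and label probabilities are assumed to have been produced by the current π and b models) illustrates it with the usual log-space stabilization; the M step then passes Γ(z_n) as soft targets to the multi-class learner in (9) and γ_n^k as instance weights to the binary learners in (10).

```python
import numpy as np

def e_step(gate_probs, label_probs, Y):
    """Posterior membership gamma[n, k] = p(z_n = k | x_n, y_n) as in Eq. (8).
    gate_probs:  (N, K) array of pi(z = k | x_n) from the current gate model
    label_probs: (K, N, L) array of b(y_l = 1 | x_n) from the current binary models
    Y:           (N, L) binary array of observed label subsets"""
    N, K = gate_probs.shape
    log_like = np.zeros((N, K))
    for k in range(K):
        p = np.clip(label_probs[k], 1e-12, 1 - 1e-12)
        # log prod_l Ber(y_nl; b_kl) for every instance n under component k
        log_like[:, k] = (Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum(axis=1)
    log_post = np.log(np.clip(gate_probs, 1e-12, 1.0)) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=1, keepdims=True)
```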
3.1. CBM with Logistic Regression Learners

In the simplest form, all underlying models in CBM are linear. We employ a multinomial logistic regression for the gating function π, and a binary logistic regression for each predictor b. The outer loop of the training procedure is the EM algorithm described above. In each M step, all objectives defined in (9) and (10) are convex (although the overall EM optimization is non-convex). To optimize each logistic regression, we compute the gradient of the corresponding objective with respect to its parameters and update all parameters using gradient-based optimization methods. Here we choose the L-BFGS method (Nocedal & Wright, 2006), a state-of-the-art optimization method for large-scale logistic regression. We can start from the existing model and perform a few incremental update steps until it converges. To avoid over-fitting, we also add L2 regularization (Gaussian priors) on all parameters.

CBM+LR Complexity. The overall complexity of the training procedure for CBM is dominated by the updates in the M steps and depends on the complexities of the binary and multi-class learning algorithms used. For CBM with logistic regression (LR) learners, we can measure the complexity of each L-BFGS (or gradient descent) update over all model parameters. If N is the number of instances, D the number of features, L the number of labels, C the number of unique label subsets in the training set, and K the number of CBM components, then each CBM+LR update takes O(KLDN). For comparison, each update takes O(LDN) in Binary Relevance with LR, and O(CDN) in Power-Set with LR. For large datasets, typically KL < C, and CBM is slower than Binary Relevance but faster than Power-Set.

3.2. CBM with Gradient Boosting Learners

Many datasets require non-linear decision boundaries, in which case the CBM model with logistic regressions may not have enough expressive power. To make CBM non-linear, we use gradient boosted trees (GB) for both π and b. The original gradient boosted trees algorithm described in (Friedman, 2001) is designed for standard multi-class problems and does not take label distributions or instance weights as inputs, so some modifications are necessary. The target label distribution is easy to deal with: one still computes the functional gradient of the objective with respect to each ensemble scoring function, where the objective is defined by the target distribution. To handle instance weights, one ignores them when calculating functional gradients, and performs a weighted least-squares fit when fitting each regression tree to the gradients. Unlike logistic regression, where all parameters can easily be updated, gradient boosting adds new trees to the ensemble while keeping old trees untouched. Thus the M step here is slightly different from before: rather than re-adjusting all parameters to optimize the objective, boosting improves the objective by adding a few more trees. This amounts to a partial M step, and the resulting EM algorithm is usually called the generalized EM algorithm (Gupta & Chen, 2011). We emphasize that the use of boosting as an underlying non-linear classifier here is just for demonstration purposes; other classifiers such as neural networks could also be used. This is in direct contrast to AdaBoost.MH and AdaBoost.MR (Schapire & Singer, 2000), which specifically employ boosting to optimize Hamming loss and rank loss.
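With either learner family, the weighted subproblems in (10) map onto ordinary training with instance weights. As a rough sketch, assuming scikit-learn logistic regression as the binary learner (a production implementation would warm-start from the current parameters and handle labels that are constant in the training set):

```python
from sklearn.linear_model import LogisticRegression

def m_step_binary(X, Y, gamma):
    """Update every b(y_l | x; beta_kl) as in Eq. (10): for each component k and
    label l, fit a binary classifier on label l with instance weights gamma[:, k]."""
    L = Y.shape[1]
    K = gamma.shape[1]
    return [[LogisticRegression(max_iter=1000).fit(X, Y[:, l],
                                                   sample_weight=gamma[:, k])
             for l in range(L)]
            for k in range(K)]
```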
3.3. Validation of Training Procedure on Artificial Data

To validate the training procedure, we fit a CBM model on artificial data generated by a true CBM and check whether the training procedure can recover the true parameters. The true CBM has K = 3 components and uses logistic regressions for both π and b. Each of the N = 15,000 data points x has D = 7 features generated independently from Gaussian distributions. We test two different ways of generating the ground-truth label subset y: argmax, where the most likely y is always selected; and sampling, where y is sampled from the true CBM distribution. For each of these two datasets, we also make a noisy variant with Gaussian noise added to the logistic regression scores. Test subset accuracy results are presented in Table 1. They show that if the true distribution is indeed generated by a CBM, then the proposed training procedure is quite effective in recovering the distribution.

Table 1. Test subset accuracy on artificial data.

| Methods | y=argmax | y=argmax +noise | y=sample | y=sample +noise |
|---|---|---|---|---|
| Upper bound | 100 | 76.4 | 65.2 | 48.1 |
| CBM+LR: K=2 | 89.9 | 69.3 | 60.2 | 45.8 |
| CBM+LR: K=3 | 98.4 | 75.8 | 65.0 | 48.0 |
| BinRel+LR | 82.0 | 49.9 | 48 | 40.0 |
| PowSet+LR | 97.5 | 74.1 | 64.6 | 47.3 |

4. Fast Prediction by Dynamic Programming

According to (2), making the optimal prediction in terms of subset accuracy for a given x requires finding the most probable label subset y* = argmax_y p(y|x). There are 2^L label subset candidates, and it is intractable to evaluate the probability of each of them. Many multi-label methods suffer from this intractability of exact inference (see Section 5). Fortunately, the special structure of CBM allows it to make predictions efficiently, with either sampling or dynamic programming. As a common preprocessing step, for a given x we first compute π_k = π(z = k|x; α) and µ_kℓ = b(y_ℓ = 1|x; β_kℓ), for all k = 1, 2, ..., K and ℓ = 1, 2, ..., L.

Prediction by Sampling. The CBM density form (6) suggests a natural sampling strategy for a label subset y. We first sample a component k according to the mixture coefficients π_1, ..., π_K. Then, from this component, we sample each label y_ℓ independently with probability µ_kℓ. The procedure can be repeated multiple times to generate a set of y candidates, from which we pick the most probable one. The sampling strategy works best when the most probable label subset has a high probability. Sampling is easy to implement, but does not guarantee that the predicted y is indeed the optimal one.
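A minimal sketch of this sampling strategy (NumPy; pi and mu denote the precomputed π_k and µ_kℓ for the given x, and the candidate count is a free choice of ours):

```python
import numpy as np

def predict_by_sampling(pi, mu, num_samples=100, rng=None):
    """Draw a component from pi, then each label independently from that
    component's Bernoullis; keep the sampled subset with the highest overall
    CBM probability. pi: (K,) mixture coefficients, mu: (K, L) label probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    best_y, best_p = None, -1.0
    for _ in range(num_samples):
        k = rng.choice(len(pi), p=pi)
        y = (rng.random(mu.shape[1]) < mu[k]).astype(int)
        p = float(pi @ np.prod(np.where(y == 1, mu, 1.0 - mu), axis=1))  # p(y|x)
        if p > best_p:
            best_y, best_p = y, p
    return best_y, best_p
```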
Prediction by Dynamic Programming and Pruning. For the overall probability p(y|x) to be high, there must exist a component k for which the component probability ∏_{ℓ=1}^L b(y_ℓ|x; β_kℓ) is high. On the other hand, one can show that the y maximizing the overall probability does not necessarily maximize any component probability. To find y*, we design a dynamic programming procedure FIND-NEXT-HIGHEST() that enumerates label subsets in decreasing probability order within each component, and we then iterate round-robin across components until we are certain that the unchecked subsets will never produce a high overall probability.

Algorithm 2: Prediction by Dynamic Programming and Pruning
1: Input: π_k, µ_kℓ, for k = 1, 2, ..., K and ℓ = 1, 2, ..., L
2: Initialize the candidate component set S = {1, 2, ..., K}
3: Initialize the maximum overall probability M = 0
4: for k = 1, 2, ..., K do
5:   Initialize the maximum component probability G_k
6:   Initialize the priority queue Q_k
7: while S ≠ ∅ do
8:   for k ∈ S do
9:     y = Q_k.FIND-NEXT-HIGHEST()
10:    Let p = Σ_{m=1}^K π_m ∏_{ℓ=1}^L Ber(y_ℓ; µ_mℓ) and q = ∏_{ℓ=1}^L Ber(y_ℓ; µ_kℓ)
11:    if p > M then
12:      Set M = p and y* = y
13:    if π_k q ≤ M/K or π_k q + Σ_{r≠k} π_r G_r ≤ M then
14:      Remove k from S
15: Output: y*

16: function FIND-NEXT-HIGHEST()
17:   y = Q_k.dequeue()
18:   for ℓ = 1, 2, ..., L do
19:     Generate y′ by flipping the ℓ-th bit of y
20:     if y′ is unseen then
21:       Q_k.enqueue(y′)
22:   Output: y

For a component k, the y with the highest component probability is the subset containing precisely those ℓ with µ_kℓ ≥ 1/2. To produce a ranked list, we use a max priority queue Q_k to store candidate label subsets and their associated component probabilities. Initially, Q_k contains the y with the highest component probability. The t-th call to Q_k.FIND-NEXT-HIGHEST() returns the y with the t-th highest component probability. The overall prediction algorithm (Algorithm 2) iterates over all components, checks the next candidate (Lines 9-10), updates the best label subset found so far (Lines 11-12), and prunes the remaining subset candidates in a component's ranked list (Lines 13-14). The algorithm terminates when all components' ranked lists have been pruned, and is guaranteed to return the most probable y*. In practice, we observe that the algorithm rarely visits elements deeper than rank 10 in the component ranked lists.
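The per-component enumeration behind FIND-NEXT-HIGHEST can be realized with a max-heap keyed by component log-probability. The generator below is a sketch of that lazy enumeration for a single component (the round-robin iteration and the pruning conditions of Algorithm 2 are omitted; the names are ours):

```python
import heapq
import numpy as np

def subsets_by_component_prob(mu_k):
    """Lazily yield label subsets of one component in decreasing order of their
    component probability prod_l Ber(y_l; mu_kl): start from the most probable
    subset and expand candidates by single-bit flips, as in FIND-NEXT-HIGHEST."""
    mu = np.clip(np.asarray(mu_k, dtype=float), 1e-12, 1 - 1e-12)

    def logp(y):
        arr = np.asarray(y)
        return float(np.sum(arr * np.log(mu) + (1 - arr) * np.log(1 - mu)))

    start = tuple(int(m >= 0.5) for m in mu)         # the component's most probable subset
    heap, seen = [(-logp(start), start)], {start}
    while heap:                                       # callers stop once pruning applies
        neg_lp, y = heapq.heappop(heap)               # next-highest candidate
        yield np.array(y), float(np.exp(-neg_lp))
        for l in range(len(mu)):                      # enqueue unseen one-bit flips of y
            y2 = y[:l] + (1 - y[l],) + y[l + 1:]
            if y2 not in seen:
                seen.add(y2)
                heapq.heappush(heap, (-logp(y2), y2))
```

In Algorithm 2, one such queue is kept per component and each candidate is scored under the full mixture until the pruning conditions empty the candidate set.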
5. Related Work

Modeling label dependencies has long been regarded as a central theme in multi-label classification. Estimating the high-dimensional conditional joint p(y|x) is very challenging, and various approaches have been proposed to tackle this problem based on different approximations.

Classifier Chains. According to the chain rule, the joint probability p(y|x) can be decomposed into a product of conditionals p(y_1|x) p(y_2|x, y_1) ··· p(y_L|x, y_1, ..., y_{L−1}). This reduces a multi-label learning problem to L binary learning problems, each of which learns a new label given all previous labels. During prediction, finding the exact joint mode is intractable. Classifier Chains (CC) (Read et al., 2011) classify labels greedily in a sequence: label y_ℓ is decided by maximizing p(y_ℓ|x, y_1, ..., y_{ℓ−1}), and then becomes a feature used in the prediction of label y_{ℓ+1}. This greedy prediction procedure has three issues: 1) the predicted subset can be far from the joint mode (Dembczyński et al., 2011); 2) errors in early label predictions propagate to subsequent label predictions; 3) the overall prediction depends on the chain order. To address the first two issues, Probabilistic Classifier Chains (PCC) replace the greedy search strategy with more accurate search strategies, such as exhaustive search (Cheng et al., 2010), ϵ-approximate search (Dembczynski et al., 2012), beam search (Kumar et al., 2012; 2013), or A* search (Mena et al., 2015). To address the third issue, Ensembles of Classifier Chains (ECC) (Read et al., 2011) average several predictions made by different chains, wherein the averaging can take place at either the individual label level (ECC-label) or the label subset level (ECC-subset). Using dynamic programming to find the optimal chain order (Liu & Tsang, 2015) has also been proposed recently.

A similar reduction method named Conditional Dependency Networks (CDN) estimates p(y|x) based on full conditionals and Gibbs sampling (Guo & Gu, 2011). During learning, one binary classifier is trained for each full conditional p(y_ℓ|x, y_1, ..., y_{ℓ−1}, y_{ℓ+1}, ..., y_L). During prediction, Gibbs sampling is used to find the mode of the joint. The method's major limitation is that it cannot handle perfectly correlated or anti-correlated labels: consider binary classification posed as a multi-label problem with only 2 exhaustive and exclusive labels. A perfect model for p(y_1 = 1|x, y_2) is 1 − y_2; in other words, the feature information is completely ignored. The same applies to p(y_2|x, y_1), so the prediction will inevitably fail. In general, Gibbs sampling may fail in the presence of perfect correlations or anti-correlations since the resulting stochastic process is not ergodic; when relations are imperfect but very strong, Gibbs sampling may need a very long time to converge.

(Gasse et al., 2015) factorize labels into independent factors based on statistical conditional independence tests and a Markov boundary learning algorithm. Its limitation, as pointed out by the authors, is that the statistical tests are ineffective when the ratio between samples and labels is not large enough. Their method outperforms the Power-Set baseline on synthetic datasets, but not on any real dataset. (Zhang & Zhang, 2010) propose to use a Bayesian network to encode the conditional dependencies of the labels as well as the feature set, with the features as the common parent of all labels.

Conditional Random Fields (CRF) (Sutton & McCallum, 2006) offer a general framework for structured prediction problems based on undirected graphical models. In multi-label classification, labels can form densely connected graphs, so restrictions are imposed in order to make training and inference tractable. For problems involving only hard label relations (exclusive or hierarchical), a special CRF model has been proposed (Deng et al., 2014); this works only when label dependencies are strict and known a priori. pair-CRF limits the scope of potential functions to label pairs, and does not model higher-order label interactions (Ghamrawi & McCallum, 2005). Exact inference and prediction in pair-wise CRF requires checking all possible label subsets, which is intractable for datasets with many labels. As shown in their experimental results, the most effective way of doing approximate inference and prediction is to consider only label subsets that occur in the training set. But this eliminates the possibility of predicting unseen subsets. Our CBM model does not have this limitation (see Section 4).

There are also algorithms which exploit label structures but do not estimate p(y|x) explicitly. Examples include Multi-label k-Nearest Neighbors (Zhang & Zhou, 2007), the Compressed Sensing based method (Hsu et al., 2009), Principal Label Space Transformation (Tai & Lin, 2012), Conditional Principal Label Space Transformation (Chen & Lin, 2012), and Compressed Labeling (Zhou et al., 2012). Various mixture models exist for multi-label document classification or supervised topic modeling (McCallum, 1999; Ueda & Saito, 2002; McAuliffe & Blei, 2008; Yang et al., 2009; Ramage et al., 2009; Puurula, 2011; Kim et al., 2012).
Our CBM model differs from these models in two aspects: 1) These models are generative in nature, i.e., they model p(x), where x is typically a bag-of-words representation of a document. By contrast, our CBM is purely discriminative, since it targets p(y|x) directly. The principal advantage of the discriminative approach over the generative approach, as stated in (Sutton & McCallum, 2011), is that the former does not make overly simplistic independence assumptions among features, and is thus better suited to including rich, overlapping features. 2) Most of these models are designed specifically for text data, while our method can be applied to any multi-label classification problem. A Conditional Multinomial Mixture model has been proposed for Superset Label Learning (Liu & Dietterich, 2012).

The architecture of CBM also mimics that of Mixture of Experts (ME) (Jacobs et al., 1991; Jordan & Jacobs, 1994), in which a gate model divides the input space into disjoint regions probabilistically and an expert model generates the output in each region. ME has mainly been used in regression and multi-class problems (Yuksel et al., 2012). CBM can be viewed as a multi-label extension of ME with a particular factorization of the labels inside each expert.

6. Experiments

We perform experiments on five commonly used and relatively large multi-label datasets: SCENE, TMC2007, MEDIAMILL, and NUS-WIDE from Mulan (http://mulan.sourceforge.net), and RCV1 (topics subset 1) from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html). For the sake of reproducibility, we adopt the train/test splits provided by Mulan and LIBSVM. Dataset details are shown at the top of Table 2.

We compare Conditional Bernoulli Mixtures (CBM) with the following methods which estimate p(y|x): Binary Relevance (BinRel), Power-Set (PowSet), Classifier Chains (CC), Probabilistic Classifier Chains with beam search (PCC), Ensembles of Classifier Chains with label voting (ECC-label) and subset voting (ECC-subset), Conditional Dependency Networks (CDN), and pair-wise Conditional Random Fields (pair-CRF). CDN is taken from the MEKA package (http://meka.sourceforge.net); the rest are based on our implementations. Our implementations of CBM and several baselines (PowSet, PCC, CRF, etc.) are available at https://github.com/cheng-li/pyramid. For all methods which require a base learner, we employ logistic regression as a common base learner. Additionally, we test gradient boosted trees (GB) as a non-linear base learner in conjunction with BinRel, PowSet, and CBM. Both LR and pair-CRF are L2-regularized. Hyperparameter tuning is done by cross-validation on the training set
(see the supplementary material for details). For methods involving random initializations or sampling, the reported results are averaged over 3 runs.

Table 2. Test subset accuracy of different methods on five datasets. All numbers are in percentages. Best performances are bolded.

| dataset | SCENE | RCV1 | TMC2007 | MEDIAMILL | NUS-WIDE |
|---|---|---|---|---|---|
| domain | image | text | text | video | image |
| #labels / #label subsets | 6 / 15 | 103 / 799 | 22 / 1341 | 101 / 6555 | 81 / 18K |
| #features / #data points | 294 / 2407 | 47K / 6000 | 49K / 29K | 120 / 44K | 128 / 270K |
| BinRel (LR) | 51.5 | 40.4 | 25.3 | 9.6 | 24.7 |
| PowSet (LR) | 68.1 | 50.2 | 28.2 | 9.0 | 26.6 |
| CC (LR) | 62.9 | 48.2 | 26.2 | 10.9 | 26.0 |
| PCC (LR) | 64.8 | 48.3 | 26.8 | 10.9 | 26.3 |
| ECC-label (LR) | 60.6 | 46.5 | 26.0 | 11.3 | 26.0 |
| ECC-subset (LR) | 63.1 | 49.2 | 25.9 | 11.5 | 26.0 |
| CDN (LR) | 59.9 | 12.6 | 16.8 | 5.4 | 17.1 |
| pair-CRF (linear) | 68.8 | 46.4 | 28.1 | 10.3 | 26.4 |
| CBM (LR) | 69.7 | 49.9 | 28.7 | 13.5 | 27.3 |
| BinRel (GB) | 59.3 | 30.1 | 25.4 | 11.2 | 24.4 |
| PowSet (GB) | 70.5 | 38.2 | 23.1 | 10.1 | 23.6 |
| CBM (GB) | 70.5 | 43.0 | 27.5 | 14.1 | 26.5 |

6.1. Results

Test subset accuracy on the five datasets is shown in Table 2, grouped by base learner. On four datasets, the highest subset accuracy is achieved by one of the CBM instantiations; on the remaining dataset, CBM is close to the best one. This demonstrates the effectiveness of CBM for optimizing subset accuracy. On the SCENE and MEDIAMILL datasets, CBM+GB performs better than CBM+LR, which shows the benefit of being able to incorporate non-linear components. By contrast, pair-CRF is restricted to linear functions, lacking the flexibility of CBM. For completeness, we present the Jaccard index and Hamming loss results in the supplementary material. CBM also achieves the highest Jaccard index on three out of five datasets, which is reasonable given that CBM only targets subset accuracy and multi-label optimality is measure-dependent. We also report all algorithms' running times in the supplementary material. CBM requires significantly less training time than PowSet on datasets with a large number of possible label subsets.

6.2. Impact of Number of Components

Figure 3 shows how increasing the number of components K affects test performance for CBM on the TMC dataset. When K = 1, CBM only estimates marginals and is equivalent to BinRel (the slight difference in performance is due to implementation details of CBM; see the supplementary material). Increasing K from 1 to 15 quickly makes CBM a better joint estimator and lets it outperform BinRel. Increasing K further gives smaller improvements, and the performance plateaus as K reaches about 30. Other datasets show similar trends.

Figure 3. Test subset accuracy on the TMC dataset with varying number of components K for CBM+LR (left) and CBM+GB (right), compared with BinRel, PowSet, and pair-CRF.

7. Conclusion

In this paper, we propose a new multi-label classification method using Conditional Bernoulli Mixtures. It captures label dependencies; it reduces a multi-label problem to binary and multi-class problems; and it subsumes the classic independent binary prediction and power-set subset prediction methods as special cases. A simple Expectation Maximization training procedure is presented, together with an efficient prediction procedure based on dynamic programming. Experimental results show the effectiveness of the proposed method against competitive alternatives on benchmark multi-label datasets.

Acknowledgements

We thank the reviewers for their helpful comments, Shivani Agarwal for her suggestions on related work, and Pavel Metrikov for discussions on the model.
This work has been generously supported through a grant from the Massachusetts General Hospital Physicians Organization, MGH 507222.

References

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

Chen, Yao-Nan and Lin, Hsuan-Tien. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pp. 1529-1537, 2012.

Cheng, Weiwei, Hüllermeier, Eyke, and Dembczynski, Krzysztof J. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 279-286, 2010.

Dembczyński, Krzysztof, Waegeman, Willem, and Hüllermeier, Eyke. Joint mode estimation in multi-label classification by chaining. In CoLISD 2011 (Collective Learning and Inference on Structured Data 2011), 2011.

Dembczyński, Krzysztof, Waegeman, Willem, Cheng, Weiwei, and Hüllermeier, Eyke. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.

Dembczynski, Krzysztof, Waegeman, Willem, and Hüllermeier, Eyke. An analysis of chaining in multi-label classification. In ECAI, pp. 294-299, 2012.

Deng, Jia, Ding, Nan, Jia, Yangqing, Frome, Andrea, Murphy, Kevin, Bengio, Samy, Li, Yuan, Neven, Hartmut, and Adam, Hartwig. Large-scale object classification using label relation graphs. In Computer Vision - ECCV 2014, pp. 48-64. Springer, 2014.

Friedman, Jerome H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.

Gasse, Maxime, Aussem, Alexandre, and Elghazel, Haytham. On the optimality of multi-label classification under subset zero-one loss for distributions satisfying the composition property. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2531-2539, 2015.

Ghamrawi, Nadia and McCallum, Andrew. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 195-200. ACM, 2005.

Guo, Yuhong and Gu, Suicheng. Multi-label classification using conditional dependency networks. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence, volume 22, pp. 1300, 2011.

Gupta, Maya R and Chen, Yihua. Theory and Use of the EM Algorithm. Now Publishers Inc, 2011.

Hsu, Daniel, Kakade, Sham, Langford, John, and Zhang, Tong. Multi-label prediction via compressed sensing. In NIPS, volume 22, pp. 772-780, 2009.

Jacobs, Robert A, Jordan, Michael I, Nowlan, Steven J, and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Jordan, Michael I and Jacobs, Robert A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.

Kim, Dongwoo, Kim, Suin, and Oh, Alice. Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. arXiv preprint arXiv:1206.4658, 2012.

Koyejo, Oluwasanmi O, Natarajan, Nagarajan, Ravikumar, Pradeep K, and Dhillon, Inderjit S. Consistent multilabel classification. In Advances in Neural Information Processing Systems, pp. 3303-3311, 2015.

Kumar, Abhishek, Vembu, Shankar, Menon, Aditya Krishna, and Elkan, Charles. Learning and inference in probabilistic classifier chains with beam search. In Machine Learning and Knowledge Discovery in Databases, pp. 665-680. Springer, 2012.

Kumar, Abhishek, Vembu, Shankar, Menon, Aditya Krishna, and Elkan, Charles. Beam search algorithms for multilabel learning.
Machine Learning, 92(1):65-89, 2013.

Lazarsfeld, Paul Felix, Henry, Neil W, and Anderson, Theodore Wilbur. Latent Structure Analysis. Houghton Mifflin, Boston, 1968.

Liu, Liping and Dietterich, Thomas G. A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems, pp. 557-565, 2012.

Liu, Weiwei and Tsang, Ivor. On the optimality of classifier chain for multi-label classification. In Advances in Neural Information Processing Systems, pp. 712-720, 2015.

McAuliffe, Jon D and Blei, David M. Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121-128, 2008.

McCallum, Andrew. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, pp. 1-7, 1999.

McLachlan, Geoffrey and Peel, David. Finite Mixture Models. John Wiley & Sons, 2004.

Mena, Deiner, Montañés, Elena, Quevedo, José R, and Del Coz, Juan José. Using A* for inference in probabilistic classifier chains. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3707-3713. AAAI Press, 2015.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer Science & Business Media, 2006.

Puurula, Antti. Mixture models for multi-label text classification. In 10th New Zealand Computer Science Research Student Conference, 2011.

Ramage, Daniel, Hall, David, Nallapati, Ramesh, and Manning, Christopher D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pp. 248-256. Association for Computational Linguistics, 2009.

Read, Jesse, Pfahringer, Bernhard, Holmes, Geoff, and Frank, Eibe. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.

Schapire, Robert E and Singer, Yoram. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135-168, 2000.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, pp. 93-128, 2006.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields. Machine Learning, 4(4):267-373, 2011.

Tai, Farbound and Lin, Hsuan-Tien. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508-2542, 2012.

Tsoumakas, Grigorios and Katakis, Ioannis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007:1-13, 2007.

Ueda, Naonori and Saito, Kazumi. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems, pp. 721-728, 2002.

Yang, Shuang-Hong, Zha, Hongyuan, and Hu, Bao-Gang. Dirichlet-Bernoulli alignment: A generative model for multi-class multi-label multi-instance corpora. In Advances in Neural Information Processing Systems, pp. 2143-2150, 2009.

Yuksel, Seniha Esen, Wilson, Joseph N, and Gader, Paul D. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177-1193, 2012.

Zhang, Min-Ling and Zhang, Kun. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 999-1008. ACM, 2010.

Zhang, Min-Ling and Zhou, Zhi-Hua. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038-2048, 2007.
Zhou, Tianyi, Tao, Dacheng, and Wu, Xindong. Compressed labeling on distilled labelsets for multi-label learning. Machine Learning, 88(1-2):69-126, 2012.