Conditional Bernoulli Mixtures for Multi-Label Classification

Cheng Li (CHENGLI@CCS.NEU.EDU), Bingyu Wang (RAINICY@CCS.NEU.EDU), Virgil Pavlu (VIP@CCS.NEU.EDU), Javed Aslam (JAA@CCS.NEU.EDU)
College of Computer and Information Science, Northeastern University, Boston, MA 02115, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Multi-label classification is an important machine learning task wherein one assigns a subset of candidate labels to an object. In this paper, we propose a new multi-label classification method based on Conditional Bernoulli Mixtures. Our proposed method has several attractive properties: it captures label dependencies; it reduces the multi-label problem to several standard binary and multi-class problems; it subsumes the classic independent binary prediction and power-set subset prediction methods as special cases; and it exhibits accuracy and/or computational complexity advantages over existing approaches. We demonstrate two implementations of our method using logistic regressions and gradient boosted trees, together with a simple training procedure based on Expectation Maximization. We further derive an efficient prediction procedure based on dynamic programming, thus avoiding the cost of examining an exponential number of potential label subsets. Experimental results show the effectiveness of the proposed method against competitive alternatives on benchmark datasets.

1. Introduction

Multi-label classification is an important machine learning task wherein one assigns a subset of candidate labels to an object. Such problems arise naturally in the context of medical diagnostics, text classification, image tagging, and a host of other areas. For example, in the medical domain, a patient may present multiple illnesses or undergo a procedure with multiple billing codes; in news categorization, an article can be associated with multiple categories; in image tagging, a photo may have multiple tags (see Figure 1).

Figure 1. An image from the NUS-WIDE multi-label dataset. Independent binary logistic regressions predict the labels {clouds, lake, sky, sunset, water}. Our proposed method correctly predicts {clouds, lake, sky, sunset, water, reflection}, because it captures the dependencies between reflection and the other labels.

Multi-label classification is particularly difficult in the common case when the labels are many and dependent. In this case, simple approaches such as independent binary label prediction and power-set subset prediction fail due to label dependencies and combinatorial explosion, respectively. More sophisticated approaches such as classifier chains, conditional dependency networks, and conditional random fields suffer from accuracy/computational-complexity trade-offs in training and/or prediction. We introduce Conditional Bernoulli Mixtures for multi-label classification to address these challenges, yielding multi-label classifiers that are flexible, accurate, and efficient.

Formally, in a multi-label classification problem, we are given a set of label candidates Y = {1, 2, ..., L}. Every data point x ∈ ℝ^D matches a subset of labels y ⊆ Y, which can equivalently be written as a binary vector y ∈ {0,1}^L, with each bit y_ℓ indicating the presence or absence of the corresponding label.
The goal of learning is to build a classifier h : ℝ^D → {0,1}^L which maps an instance to a subset of labels. The predicted label subset can be of arbitrary size; when the size is restricted to be 1, the problem is called multi-class classification. Furthermore, if the total number of label candidates L is 2, the problem becomes binary classification.

Various measures can be used to evaluate multi-label classifiers. Among them, three commonly used measures are subset accuracy, the Jaccard index, and Hamming loss. Let {(x_n, y_n)}_{n=1}^N be a multi-label dataset with ground-truth labels, and let {ŷ_n}_{n=1}^N be the predictions made by a classifier. Subset accuracy generalizes the conventional multi-class accuracy notion:

$$\frac{1}{N}\sum_{n=1}^{N} \mathbf{1}[y_n = \hat{y}_n], \qquad (1)$$

where 1[·] is the indicator function. In computing subset accuracy, a predicted subset is considered correct only when it matches the true subset exactly. Admittedly, this metric is very stringent. However, optimizing for subset accuracy is a very interesting algorithmic design and research problem, as it encourages classifiers to output coherent and complete predictions. The optimal prediction for subset accuracy is achieved by outputting the mode of the conditional joint distribution:

$$h^{*}(x) = \operatorname*{argmax}_{y}\; p(y\,|\,x). \qquad (2)$$

The Jaccard index (also called overlap) measures the overlap between the true subset and the predicted subset as the size of their intersection divided by the size of their union, and Hamming loss counts the number of bit-wise mistakes. The theoretical results in (Dembczyński et al., 2012; Koyejo et al., 2015) show that a multi-label classifier designed to optimize one measure may be suboptimal under other measures. In this paper, we focus on the multi-label classification task with subset accuracy as the target evaluation measure, but for completeness we also report the Jaccard index and Hamming loss results in the supplementary material.

According to (2), optimizing subset accuracy requires estimating the conditional joint distribution p(y|x), which captures conditional label dependencies given features. One naive approach is to assume conditional independence among labels, p(y|x) = ∏_{ℓ=1}^L p(y_ℓ|x), which reduces a multi-label problem to L binary classification problems. This approach is called Binary Relevance (BinRel) (Tsoumakas & Katakis, 2007), and is widely used due to its simplicity. One obvious disadvantage is that the individual label predictions can often be conflicting, such as tagging cat but not animal for an image. At the other extreme is the Power-Set (PowSet) approach (Tsoumakas & Katakis, 2007), which treats each label subset as a class and trains a multi-class classifier. As a consequence, one is restricted in practice to predicting only label subsets seen in the training data, always assigning zero probability to unseen subsets; even for many of the observed subsets there is likely a scarcity of training data. Power-Set handles label dependencies and its predictions are coherent, but the method is often infeasible due to the exponential number of label subsets.
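To make the two baseline reductions concrete, the sketch below fits them with scikit-learn logistic regressions. This is our own illustration, not the implementation used in this paper, and the helper names fit_binary_relevance and fit_power_set are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binary_relevance(X, Y):
    """Binary Relevance: one independent binary classifier per label,
    i.e. p(y|x) = prod_l p(y_l|x). Assumes every column of Y contains
    both 0s and 1s in the training data."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, l])
            for l in range(Y.shape[1])]

def fit_power_set(X, Y):
    """Power-Set: treat each distinct label subset as one class and train
    a single multi-class classifier over the subsets seen in training."""
    subset_ids = np.array([''.join(map(str, row)) for row in Y])
    return LogisticRegression(max_iter=1000).fit(X, subset_ids)
```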
Bernoulli Mixtures. Mixture models offer a flexible and powerful framework for many multivariate density estimation problems. A mixture generally has the form

$$p(y) = \sum_{k=1}^{K} \pi_k\, p(y;\beta_k), \qquad (3)$$

which approximates a complex joint p(y) by a weighted combination of K component densities p(y; β_k), each of which typically takes some simple density form parametrized by β_k. The Expectation Maximization (EM) algorithm can be employed to train such mixture models by iterating between estimating the mixture coefficients and fitting the component densities.

Bernoulli Mixtures (BM) are classic models for multi-dimensional binary variable density estimation (Lazarsfeld et al., 1968; McLachlan & Peel, 2004; Bishop, 2006), where learnability is achieved by assuming independence of the variables within each mixture component. Thus each component density p(y; β_k) is simply a product of Bernoulli densities and the overall model has the form

$$p(y) = \sum_{k=1}^{K} \pi_k \prod_{\ell=1}^{L} \mathrm{Ber}(y_\ell;\, \mu_{k\ell}), \qquad (4)$$

where Ber(y_ℓ; µ_kℓ) denotes the Bernoulli distribution with head probability µ_kℓ. BM has two attractive properties. First, dependencies: although the L variables are assumed to be independent inside each component, they are in general dependent in the mixture (they are forced to be independent only when K = 1). This can be verified by computing the following covariance matrix and observing that it is non-diagonal for K ≥ 2 (Bishop, 2006):

$$\mathrm{cov}(y) = \sum_{k=1}^{K} \pi_k\big[\Sigma_k + \mu_k \mu_k^{\top}\big] - E(y)E(y)^{\top}, \qquad (5)$$

where E(y) = Σ_{k=1}^K π_k µ_k, Σ_k = diag{µ_kℓ(1 − µ_kℓ)}, and µ_k = (µ_k1, µ_k2, ..., µ_kL)^⊤. Second, efficiency: the full factorization of each component makes fitting BM efficient via the EM algorithm.

In this paper, we propose a new multi-label classification method which approximates the conditional joint p(y|x) based on Bernoulli Mixtures. The method simultaneously learns the label dependencies from data and trains classifiers to account for such dependencies.

2. Conditional Bernoulli Mixtures

For multi-label classification, we model the conditional joint p(y|x) with a discriminative extension of BM, capturing conditional dependencies among binary labels given features. The analysis in (Dembczyński et al., 2012) shows that labels can be largely conditionally independent given features (i.e., p(y|x) ≈ ∏_{ℓ=1}^L p(y_ℓ|x)), even when labels are strongly marginally dependent (i.e., p(y) ≠ ∏_{ℓ=1}^L p(y_ℓ)), as long as each label is highly predictable from features. Therefore it is not necessary to capture correlations among easy-to-predict labels; it is sufficient to capture only those correlations which involve some hard-to-predict labels. Thus conditioning on features greatly reduces the need to estimate label correlations and makes learning much easier; conditioning on x is essential for a good approximation and effective training, without making prior assumptions about the form of label dependencies (as many other methods do). Making both the mixture coefficients and the Bernoulli distributions conditional on x, we obtain our proposed model, Conditional Bernoulli Mixtures (CBM):

$$p(y\,|\,x) = \sum_{k=1}^{K} \pi(z=k\,|\,x;\alpha) \prod_{\ell=1}^{L} b(y_\ell\,|\,x;\beta_{k\ell}), \qquad (6)$$

where α and β_kℓ (ℓ = 1, 2, ..., L; k = 1, 2, ..., K) are model parameters to be learned, and z is a hidden categorical indicator variable, such that z = k if the data point is assigned to component k. Here we use π and b to represent the conditional mixture membership distribution and the conditional binary label distribution, respectively. CBM reduces a multi-label problem to a multi-class problem and several binary problems, approaching p(y|x) in a divide-and-conquer manner: the categorical distribution π(z|x; α) assigns each instance x to one of K components probabilistically. Its goal is to divide the feature space into several regions such that each region has only weak conditional label correlations and can be approximated by a simple component.
This gating function π(z|x; α) can be instantiated by any multi-class classifier which provides probability estimates, such as a multinomial logistic regression with parameters α. Inside each region k, the local conditional joint density is approximated by a product of conditional marginal densities. Each local binary label predictor b(y_ℓ|x; β_kℓ) estimates the probability of label y_ℓ under component k for data point x, and can be instantiated by any binary classifier which provides probability estimates, such as a binary logistic regression with parameters β_kℓ. All K components together with the gating function are learned jointly in order to break the global label correlation into simple forms.

At one extreme, if CBM has only one component (and hence π(z|x; α) plays no role), all labels are conditionally independent and CBM degenerates to Binary Relevance. At the other extreme, if CBM assigns one component to each unique label subset and fixes each b(y_ℓ = 1|x; β_kℓ) to the corresponding binary constant (1 if label ℓ is in the subset and 0 otherwise), then the overall CBM model simply selects one label subset from all possible subsets, which is conceptually the same as the Power-Set approach. By varying the number of components and the complexity of each component, CBM provides a continuous spectrum between these two extremes. Just like Binary Relevance and Power-Set, CBM is purely a reduction method. The main advantage of reduction methods compared with new models specifically designed for the multi-label problem is that the reduction approach makes it easy to incorporate and reuse many well-developed multi-class and binary classifiers. In Section 3, we demonstrate two different instantiations of CBM, one with logistic regressions and the other with gradient boosted trees.

Figure 2. The top 4 most influential components for the test image in Figure 1. π values indicate component mixing coefficients. Each bar represents an individual label probability in one component. Labels with near-zero probabilities are not displayed here.

An Illustrative Example. Before diving into the model training details, we illustrate how CBM performs joint label classification with an example. Figure 1 shows a test image from the NUS-WIDE dataset, for which the matched label reflection is missed by Binary Relevance but is captured by CBM. For this image, the most influential components produced by CBM are shown in Figure 2. Simply averaging the individual label probabilities by the component mixing weights gives the conditional marginals p(lake|x) = 0.56, p(water|x) = 0.69, p(sunset|x) = 0.66, and p(reflection|x) = 0.32, which indicate that, unlike lake, water, and sunset, the label reflection is not by itself deemed probable by CBM. However, from the CBM joint density p(y|x) we can also infer the Pearson correlation coefficients ρ_reflection,lake = 0.50, ρ_reflection,water = 0.40, ρ_reflection,sunset = 0.17 and observe that reflection is positively correlated with lake, water, and sunset. In fact, the joint probability asserts that the subset {clouds, lake, sky, sunset, water, reflection} is the most probable one, with probability 0.09. By contrast, the subset {clouds, lake, sky, sunset, water} has a lower probability of 0.06. Therefore, although individually unlikely, the label reflection is correctly added to the most likely subset due to label correlations.
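For concreteness, the joint and marginal probabilities used in this example follow directly from (6) once the gate probabilities π_k and the per-component label probabilities µ_kℓ have been computed for the image. The NumPy sketch below (the array layout and function names are ours) shows the two computations:

```python
import numpy as np

def cbm_joint(pi, mu, y):
    """p(y|x) under Eq. (6): sum_k pi_k * prod_l Ber(y_l; mu_kl).
    pi: (K,) gate probabilities for this x
    mu: (K, L) per-component label probabilities for this x
    y:  (L,) binary vector encoding a candidate label subset"""
    component_probs = np.prod(np.where(y == 1, mu, 1.0 - mu), axis=1)  # (K,)
    return float(pi @ component_probs)

def cbm_marginals(pi, mu):
    """Per-label marginals p(y_l = 1 | x) = sum_k pi_k * mu_kl."""
    return pi @ mu
```

Comparing the two competing subsets above amounts to two calls to cbm_joint, while cbm_marginals performs the averaging of individual label probabilities by the mixing weights described in the example.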
3. Training CBM with EM

CBM can be trained by maximum likelihood estimation on a given dataset {(x_n, y_n)}_{n=1}^N. Below we first derive a generic EM algorithm that works for any instantiation of the components, and then consider two concrete instantiations. To make the mathematical derivation clear, we use different symbols for random variables and values of random variables: we write Y_n when treating the labeling of x_n as an unknown random variable, and y_n when referring to its specific labeling assignment given in the training set. The likelihood of the dataset is

$$\prod_{n=1}^{N}\left\{\sum_{k=1}^{K}\left[\pi(z_n=k\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k\ell})\right]\right\}.$$

Since the model contains hidden variables, we use EM to minimize an upper bound on the negative log likelihood. Denoting the posterior membership distribution p(z_n|x_n, y_n) as Γ(z_n) = (γ_n^1, γ_n^2, ..., γ_n^K), the upper bound can be written as

$$\sum_{n=1}^{N} \mathrm{KL}\big(\Gamma(z_n)\,\|\,\pi(z_n|x_n;\alpha)\big) + \sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{\ell=1}^{L}\gamma_n^k\, \mathrm{KL}\big(\mathrm{Ber}(Y_{n\ell};\, y_{n\ell})\,\|\,b(Y_{n\ell}|x_n;\beta_{k\ell})\big), \qquad (7)$$

where Ber(Y_nℓ; y_nℓ) is the Bernoulli distribution with head probability y_nℓ.

E Step: Re-estimate the posterior membership probability of each data point belonging to each component:

$$\gamma_n^k = \frac{\pi(z_n=k\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k\ell})}{\sum_{k'=1}^{K}\pi(z_n=k'\,|\,x_n;\alpha)\prod_{\ell=1}^{L} b(y_{n\ell}\,|\,x_n;\beta_{k'\ell})}. \qquad (8)$$

M Step: Update all model parameters. The bound (7) shows that the optimization of all parameters decomposes nicely into a series of separate optimization problems:

$$\alpha^{\mathrm{new}} = \operatorname*{argmin}_{\alpha} \sum_{n=1}^{N} \mathrm{KL}\big(\Gamma(z_n)\,\|\,\pi(z_n|x_n;\alpha)\big), \qquad (9)$$

$$\beta_{k\ell}^{\mathrm{new}} = \operatorname*{argmin}_{\beta_{k\ell}} \sum_{n=1}^{N} \gamma_n^k\, \mathrm{KL}\big(\mathrm{Ber}(Y_{n\ell};\, y_{n\ell})\,\|\,b(Y_{n\ell}|x_n;\beta_{k\ell})\big). \qquad (10)$$

The optimization problem defined in (9) is a multi-class classification problem with target class distribution (soft labels) Γ(z_n) = (γ_n^1, γ_n^2, ..., γ_n^K) for x_n. The training goal for π(z|x; α) is to assign (probabilistically) each data point to a component based only on its features, such that the assignment matches the posterior component membership determined by both features and true labels. The optimization problem defined in (10) is a weighted binary classification problem: x_n has target label y_nℓ and weight γ_n^k. The prediction component b(y_ℓ|x; β_kℓ) is trained as a binary classifier for label ℓ in component k, based only on the training data within (soft) component k.

To train CBM, we iterate between the E step and the M step until the upper bound (7) converges (see Algorithm 1).

Algorithm 1: Generic Training for CBM
1: repeat
2:   E Step
3:   for n = 1, 2, ..., N; k = 1, 2, ..., K do
4:     update γ_n^k as in (8)
5:   M Step
6:   update α as in (9)
7:   for k = 1, 2, ..., K; ℓ = 1, 2, ..., L do
8:     update β_kℓ as in (10)
9: until convergence

In practice, the EM algorithm can get stuck in local optima, so careful initialization is often necessary for good performance. Due to the natural connection between CBM and BM, one simple way of initializing CBM is to first fit a BM for label density estimation alone, without looking at features. BM can be trained very quickly by a simpler EM algorithm (Bishop, 2006). We can run several random starts for a set of BM models, and select the one with the best training objective score to initialize the CBM training procedure.
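The E step (8) is a direct posterior computation. The NumPy sketch below (our own array layout; the gate and label probabilities are assumed to have been produced by the current π and b models) illustrates it with the usual log-space stabilization; the M step then passes Γ(z_n) as soft targets to the multi-class learner in (9) and γ_n^k as instance weights to the binary learners in (10).

```python
import numpy as np

def e_step(gate_probs, label_probs, Y):
    """Posterior membership gamma[n, k] = p(z_n = k | x_n, y_n) as in Eq. (8).
    gate_probs:  (N, K) array of pi(z = k | x_n) from the current gate model
    label_probs: (K, N, L) array of b(y_l = 1 | x_n) from the current binary models
    Y:           (N, L) binary array of observed label subsets"""
    N, K = gate_probs.shape
    log_like = np.zeros((N, K))
    for k in range(K):
        p = np.clip(label_probs[k], 1e-12, 1 - 1e-12)
        # log prod_l Ber(y_nl; b_kl) for every instance n under component k
        log_like[:, k] = (Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum(axis=1)
    log_post = np.log(np.clip(gate_probs, 1e-12, 1.0)) + log_like
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=1, keepdims=True)
```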
3.1. CBM with Logistic Regression Learners

In the simplest form, all underlying models in CBM are linear. We employ a multinomial logistic regression for the gating function π, and a binary logistic regression for each predictor b. The outer loop of the training procedure is the EM algorithm described above. In each M step, all objectives defined in (9) and (10) are convex (although the overall EM optimization is non-convex). To optimize each logistic regression, we compute the gradient of the corresponding objective with respect to its parameters and update all parameters using gradient-based optimization methods. Here we choose the L-BFGS method (Nocedal & Wright, 2006), a state-of-the-art optimization method for large-scale logistic regression. We can start from the existing model and perform a few incremental update steps until it converges. To avoid over-fitting, we also add L2 regularization (Gaussian priors) on all parameters.

CBM+LR Complexity. The overall complexity of the training procedure for CBM is dominated by the updates in the M steps and depends on the complexities of the binary and multi-class learning algorithms used. For CBM with logistic regression (LR) learners, we can measure the complexity of each L-BFGS (or gradient descent) update over all model parameters. If N is the number of instances, D the number of features, L the number of labels, C the number of unique label subsets in the training set, and K the number of CBM components, then each CBM+LR update takes O(KLDN). For comparison, each update takes O(LDN) in Binary Relevance with LR, and O(CDN) in Power-Set with LR. For large datasets, typically KL < C, and CBM is slower than Binary Relevance but faster than Power-Set.

3.2. CBM with Gradient Boosting Learners

Many datasets require non-linear decision boundaries, in which case the CBM model with logistic regressions may not have enough expressive power. To make CBM non-linear, we use gradient boosted trees (GB) for both π and b. The original gradient boosted trees algorithm described in (Friedman, 2001) is designed for standard multi-class problems and does not take label distributions or instance weights as inputs, so some modifications are necessary. The target label distribution is easy to deal with: one still computes the functional gradient of the objective with respect to each ensemble scoring function, where the objective is defined by the target distribution. To handle instance weights, one ignores them when calculating functional gradients, and performs a weighted least-squares fit when fitting each regression tree to the gradients. Unlike logistic regression, where all parameters can easily be updated, gradient boosting adds new trees to the ensemble while keeping old trees untouched. Thus the M step here is slightly different from before: rather than re-adjusting all parameters to optimize the objective, boosting improves the objective by adding a few more trees. This amounts to a partial M step, and the resulting EM algorithm is usually called the generalized EM algorithm (Gupta & Chen, 2011). We emphasize that the use of boosting as an underlying non-linear classifier here is just for demonstration purposes; other classifiers such as neural networks could also be used. This is in direct contrast to AdaBoost.MH and AdaBoost.MR (Schapire & Singer, 2000), which specifically employ boosting to optimize Hamming loss and rank loss.
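With either learner family, the weighted subproblems in (10) map onto ordinary training with instance weights. As a rough sketch, assuming scikit-learn logistic regression as the binary learner (a production implementation would warm-start from the current parameters and handle labels that are constant in the training set):

```python
from sklearn.linear_model import LogisticRegression

def m_step_binary(X, Y, gamma):
    """Update every b(y_l | x; beta_kl) as in Eq. (10): for each component k and
    label l, fit a binary classifier on label l with instance weights gamma[:, k]."""
    L = Y.shape[1]
    K = gamma.shape[1]
    return [[LogisticRegression(max_iter=1000).fit(X, Y[:, l],
                                                   sample_weight=gamma[:, k])
             for l in range(L)]
            for k in range(K)]
```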
3.3. Validation of Training Procedure on Artificial Data

To validate the training procedure, we fit a CBM model on artificial data generated by a true CBM and check whether the training procedure can recover the true parameters. The true CBM has K = 3 components and uses logistic regressions for both π and b. Each of the N = 15,000 data points x has D = 7 features generated independently from Gaussian distributions. We test two different ways of generating the ground-truth label subset y: argmax, where the most likely y is always selected; and sampling, where y is sampled from the true CBM distribution. For each of these two datasets, we also make a noisy variant with Gaussian noise added to the logistic regression scores. Test subset accuracy results are presented in Table 1. They show that if the true distribution is indeed generated by a CBM, then the proposed training procedure is quite effective in recovering the distribution.

Table 1. Test subset accuracy on artificial data.

| Methods | y=argmax | y=argmax +noise | y=sample | y=sample +noise |
|---|---|---|---|---|
| Upper bound | 100 | 76.4 | 65.2 | 48.1 |
| CBM+LR: K=2 | 89.9 | 69.3 | 60.2 | 45.8 |
| CBM+LR: K=3 | 98.4 | 75.8 | 65.0 | 48.0 |
| BinRel+LR | 82.0 | 49.9 | 48 | 40.0 |
| PowSet+LR | 97.5 | 74.1 | 64.6 | 47.3 |

4. Fast Prediction by Dynamic Programming

According to (2), making the optimal prediction in terms of subset accuracy for a given x requires finding the most probable label subset y* = argmax_y p(y|x). There are 2^L label subset candidates, and it is intractable to evaluate the probability of each of them. Many multi-label methods suffer from this intractability of exact inference (see Section 5). Fortunately, the special structure of CBM allows it to make predictions efficiently, with either sampling or dynamic programming. As a common preprocessing step, for a given x we first compute π_k = π(z = k|x; α) and µ_kℓ = b(y_ℓ = 1|x; β_kℓ), for all k = 1, 2, ..., K and ℓ = 1, 2, ..., L.

Prediction by Sampling. The CBM density form (6) suggests a natural sampling strategy for a label subset y. We first sample a component k according to the mixture coefficients π_1, ..., π_K. Then, from this component, we sample each label y_ℓ independently with probability µ_kℓ. The procedure can be repeated multiple times to generate a set of y candidates, from which we pick the most probable one. The sampling strategy works best when the most probable label subset has a high probability. Sampling is easy to implement, but does not guarantee that the predicted y is indeed the optimal one.
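A minimal sketch of this sampling strategy (NumPy; pi and mu denote the precomputed π_k and µ_kℓ for the given x, and the candidate count is a free choice of ours):

```python
import numpy as np

def predict_by_sampling(pi, mu, num_samples=100, rng=None):
    """Draw a component from pi, then each label independently from that
    component's Bernoullis; keep the sampled subset with the highest overall
    CBM probability. pi: (K,) mixture coefficients, mu: (K, L) label probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    best_y, best_p = None, -1.0
    for _ in range(num_samples):
        k = rng.choice(len(pi), p=pi)
        y = (rng.random(mu.shape[1]) < mu[k]).astype(int)
        p = float(pi @ np.prod(np.where(y == 1, mu, 1.0 - mu), axis=1))  # p(y|x)
        if p > best_p:
            best_y, best_p = y, p
    return best_y, best_p
```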
Prediction by Dynamic Programming and Pruning. For the overall probability p(y|x) to be high, there must exist a component k for which the component probability ∏_{ℓ=1}^L b(y_ℓ|x; β_kℓ) is high. On the other hand, one can show that the y maximizing the overall probability does not necessarily maximize any component probability. To find y*, we design a dynamic programming procedure FIND-NEXT-HIGHEST() that enumerates label subsets in decreasing probability order within each component, and we then iterate round-robin across components until we are certain that the unchecked subsets will never produce a high overall probability.

Algorithm 2: Prediction by Dynamic Programming and Pruning
1: Input: π_k, µ_kℓ, for k = 1, 2, ..., K and ℓ = 1, 2, ..., L
2: Initialize the candidate component set S = {1, 2, ..., K}
3: Initialize the maximum overall probability M = 0
4: for k = 1, 2, ..., K do
5:   Initialize the maximum component probability G_k
6:   Initialize the priority queue Q_k
7: while S ≠ ∅ do
8:   for k ∈ S do
9:     y = Q_k.FIND-NEXT-HIGHEST()
10:    Let p = Σ_{m=1}^K π_m ∏_{ℓ=1}^L Ber(y_ℓ; µ_mℓ) and q = ∏_{ℓ=1}^L Ber(y_ℓ; µ_kℓ)
11:    if p > M then
12:      Set M = p and y* = y
13:    if π_k q ≤ M/K or π_k q + Σ_{r≠k} π_r G_r ≤ M then
14:      Remove k from S
15: Output: y*

16: function FIND-NEXT-HIGHEST()
17:   y = Q_k.dequeue()
18:   for ℓ = 1, 2, ..., L do
19:     Generate y′ by flipping the ℓ-th bit of y
20:     if y′ is unseen then
21:       Q_k.enqueue(y′)
22:   Output: y

For a component k, the y with the highest component probability is the subset containing precisely those ℓ with µ_kℓ ≥ 1/2. To produce a ranked list, we use a max priority queue Q_k to store candidate label subsets and their associated component probabilities. Initially, Q_k contains the y with the highest component probability. The t-th call to Q_k.FIND-NEXT-HIGHEST() returns the y with the t-th highest component probability. The overall prediction algorithm (Algorithm 2) iterates over all components, checks the next candidate (Lines 9-10), updates the best label subset found so far (Lines 11-12), and prunes the remaining subset candidates in a component's ranked list (Lines 13-14). The algorithm terminates when all components' ranked lists have been pruned, and is guaranteed to return the most probable y*. In practice, we observe that the algorithm rarely visits elements deeper than rank 10 in the component ranked lists.
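The per-component enumeration behind FIND-NEXT-HIGHEST can be realized with a max-heap keyed by component log-probability. The generator below is a sketch of that lazy enumeration for a single component (the round-robin iteration and the pruning conditions of Algorithm 2 are omitted; the names are ours):

```python
import heapq
import numpy as np

def subsets_by_component_prob(mu_k):
    """Lazily yield label subsets of one component in decreasing order of their
    component probability prod_l Ber(y_l; mu_kl): start from the most probable
    subset and expand candidates by single-bit flips, as in FIND-NEXT-HIGHEST."""
    mu = np.clip(np.asarray(mu_k, dtype=float), 1e-12, 1 - 1e-12)

    def logp(y):
        arr = np.asarray(y)
        return float(np.sum(arr * np.log(mu) + (1 - arr) * np.log(1 - mu)))

    start = tuple(int(m >= 0.5) for m in mu)         # the component's most probable subset
    heap, seen = [(-logp(start), start)], {start}
    while heap:                                       # callers stop once pruning applies
        neg_lp, y = heapq.heappop(heap)               # next-highest candidate
        yield np.array(y), float(np.exp(-neg_lp))
        for l in range(len(mu)):                      # enqueue unseen one-bit flips of y
            y2 = y[:l] + (1 - y[l],) + y[l + 1:]
            if y2 not in seen:
                seen.add(y2)
                heapq.heappush(heap, (-logp(y2), y2))
```

In Algorithm 2, one such queue is kept per component and each candidate is scored under the full mixture until the pruning conditions empty the candidate set.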
5. Related Work

Modeling label dependencies has long been regarded as a central theme in multi-label classification. Estimating the high-dimensional conditional joint p(y|x) is very challenging, and various approaches have been proposed to tackle this problem based on different approximations.

Classifier Chains. According to the chain rule, the joint probability p(y|x) can be decomposed into a product of conditionals p(y_1|x) p(y_2|x, y_1) ··· p(y_L|x, y_1, ..., y_{L−1}). This reduces a multi-label learning problem to L binary learning problems, each of which learns a new label given all previous labels. During prediction, finding the exact joint mode is intractable. Classifier Chains (CC) (Read et al., 2011) classify labels greedily in a sequence: label y_ℓ is decided by maximizing p(y_ℓ|x, y_1, ..., y_{ℓ−1}), and then becomes a feature used in the prediction of label y_{ℓ+1}. This greedy prediction procedure has three issues: 1) the predicted subset can be far from the joint mode (Dembczyński et al., 2011); 2) errors in early label predictions propagate to subsequent label predictions; 3) the overall prediction depends on the chain order. To address the first two issues, Probabilistic Classifier Chains (PCC) replace the greedy search strategy with more accurate search strategies, such as exhaustive search (Cheng et al., 2010), ϵ-approximate search (Dembczynski et al., 2012), beam search (Kumar et al., 2012; 2013), or A* search (Mena et al., 2015). To address the third issue, Ensembles of Classifier Chains (ECC) (Read et al., 2011) average several predictions made by different chains, wherein the averaging can take place at either the individual label level (ECC-label) or the label subset level (ECC-subset). Using dynamic programming to find the optimal chain order (Liu & Tsang, 2015) has also been proposed recently.

A similar reduction method named Conditional Dependency Networks (CDN) estimates p(y|x) based on full conditionals and Gibbs sampling (Guo & Gu, 2011). During learning, one binary classifier is trained for each full conditional p(y_ℓ|x, y_1, ..., y_{ℓ−1}, y_{ℓ+1}, ..., y_L). During prediction, Gibbs sampling is used to find the mode of the joint. The method's major limitation is that it cannot handle perfectly correlated or anti-correlated labels: consider binary classification posed as a multi-label problem with only 2 exhaustive and exclusive labels. A perfect model for p(y_1 = 1|x, y_2) is 1 − y_2; in other words, the feature information is completely ignored. The same applies to p(y_2|x, y_1), so the prediction will inevitably fail. In general, Gibbs sampling may fail in the presence of perfect correlations or anti-correlations since the resulting stochastic process is not ergodic; when relations are imperfect but very strong, Gibbs sampling may need a very long time to converge.

(Gasse et al., 2015) factorize labels into independent factors based on statistical conditional independence tests and a Markov boundary learning algorithm. Its limitation, as pointed out by the authors, is that the statistical tests are ineffective when the ratio between samples and labels is not large enough. Their method outperforms the Power-Set baseline on synthetic datasets, but not on any real dataset. (Zhang & Zhang, 2010) propose to use a Bayesian network to encode the conditional dependencies of the labels as well as the feature set, with the features as the common parent of all labels.

Conditional Random Fields (CRF) (Sutton & McCallum, 2006) offer a general framework for structured prediction problems based on undirected graphical models. In multi-label classification, labels can form densely connected graphs, so restrictions are imposed in order to make training and inference tractable. For problems involving only hard label relations (exclusive or hierarchical), a special CRF model has been proposed (Deng et al., 2014); this works only when label dependencies are strict and known a priori. pair-CRF limits the scope of potential functions to label pairs, and does not model higher-order label interactions (Ghamrawi & McCallum, 2005). Exact inference and prediction in pair-wise CRF requires checking all possible label subsets, which is intractable for datasets with many labels. As shown in their experimental results, the most effective way of doing approximate inference and prediction is to consider only label subsets that occur in the training set. But this eliminates the possibility of predicting unseen subsets. Our CBM model does not have this limitation (see Section 4).

There are also algorithms which exploit label structures but do not estimate p(y|x) explicitly. Examples include Multi-label k-Nearest Neighbors (Zhang & Zhou, 2007), the Compressed Sensing based method (Hsu et al., 2009), Principal Label Space Transformation (Tai & Lin, 2012), Conditional Principal Label Space Transformation (Chen & Lin, 2012), and Compressed Labeling (Zhou et al., 2012). Various mixture models exist for multi-label document classification or supervised topic modeling (McCallum, 1999; Ueda & Saito, 2002; McAuliffe & Blei, 2008; Yang et al., 2009; Ramage et al., 2009; Puurula, 2011; Kim et al., 2012).
Our CBM model differs from these models in two aspects: 1) These models are generative in nature, i.e., they model p(x), where x is typically a bag-of-words representation of a document. By contrast, our CBM is purely discriminative, since it targets p(y|x) directly. The principal advantage of the discriminative approach over the generative approach, as stated in (Sutton & McCallum, 2011), is that the former does not make overly simplistic independence assumptions among features, and is thus better suited to including rich, overlapping features. 2) Most of these models are designed specifically for text data, while our method can be applied to any multi-label classification problem. A Conditional Multinomial Mixture model has been proposed for Superset Label Learning (Liu & Dietterich, 2012).

The architecture of CBM also mimics that of Mixture of Experts (ME) (Jacobs et al., 1991; Jordan & Jacobs, 1994), in which a gate model divides the input space into disjoint regions probabilistically and an expert model generates the output in each region. ME has mainly been used in regression and multi-class problems (Yuksel et al., 2012). CBM can be viewed as a multi-label extension of ME with a particular factorization of the labels inside each expert.

6. Experiments

We perform experiments on five commonly used and relatively large multi-label datasets: SCENE, TMC2007, MEDIAMILL, and NUS-WIDE from Mulan (http://mulan.sourceforge.net), and RCV1 (topics subset 1) from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html). For the sake of reproducibility, we adopt the train/test splits provided by Mulan and LIBSVM. Dataset details are shown at the top of Table 2.

We compare Conditional Bernoulli Mixtures (CBM) with the following methods which estimate p(y|x): Binary Relevance (BinRel), Power-Set (PowSet), Classifier Chains (CC), Probabilistic Classifier Chains with beam search (PCC), Ensembles of Classifier Chains with label voting (ECC-label) and subset voting (ECC-subset), Conditional Dependency Networks (CDN), and pair-wise Conditional Random Fields (pair-CRF). CDN is taken from the MEKA package (http://meka.sourceforge.net); the rest are based on our implementations. Our implementations of CBM and several baselines (PowSet, PCC, CRF, etc.) are available at https://github.com/cheng-li/pyramid. For all methods which require a base learner, we employ logistic regression as a common base learner. Additionally, we test gradient boosted trees (GB) as a non-linear base learner in conjunction with BinRel, PowSet, and CBM. Both LR and pair-CRF are L2-regularized. Hyperparameter tuning is done by cross-validation on the training set
(see the supplementary material for details). For methods involving random initializations or sampling, the reported results are averaged over 3 runs.

Table 2. Test subset accuracy of different methods on five datasets. All numbers are in percentages. Best performances are bolded.

| dataset | SCENE | RCV1 | TMC2007 | MEDIAMILL | NUS-WIDE |
|---|---|---|---|---|---|
| domain | image | text | text | video | image |
| #labels / #label subsets | 6 / 15 | 103 / 799 | 22 / 1341 | 101 / 6555 | 81 / 18K |
| #features / #data points | 294 / 2407 | 47K / 6000 | 49K / 29K | 120 / 44K | 128 / 270K |
| BinRel (LR) | 51.5 | 40.4 | 25.3 | 9.6 | 24.7 |
| PowSet (LR) | 68.1 | 50.2 | 28.2 | 9.0 | 26.6 |
| CC (LR) | 62.9 | 48.2 | 26.2 | 10.9 | 26.0 |
| PCC (LR) | 64.8 | 48.3 | 26.8 | 10.9 | 26.3 |
| ECC-label (LR) | 60.6 | 46.5 | 26.0 | 11.3 | 26.0 |
| ECC-subset (LR) | 63.1 | 49.2 | 25.9 | 11.5 | 26.0 |
| CDN (LR) | 59.9 | 12.6 | 16.8 | 5.4 | 17.1 |
| pair-CRF (linear) | 68.8 | 46.4 | 28.1 | 10.3 | 26.4 |
| CBM (LR) | 69.7 | 49.9 | 28.7 | 13.5 | 27.3 |
| BinRel (GB) | 59.3 | 30.1 | 25.4 | 11.2 | 24.4 |
| PowSet (GB) | 70.5 | 38.2 | 23.1 | 10.1 | 23.6 |
| CBM (GB) | 70.5 | 43.0 | 27.5 | 14.1 | 26.5 |

6.1. Results

Test subset accuracy on the five datasets is shown in Table 2, grouped by base learner. On four datasets, the highest subset accuracy is achieved by one of the CBM instantiations; on the remaining dataset, CBM is close to the best one. This demonstrates the effectiveness of CBM for optimizing subset accuracy. On the SCENE and MEDIAMILL datasets, CBM+GB performs better than CBM+LR, which shows the benefit of being able to incorporate non-linear components. By contrast, pair-CRF is restricted to linear functions, lacking the flexibility of CBM. For completeness, we present the Jaccard index and Hamming loss results in the supplementary material. CBM also achieves the highest Jaccard index on three out of five datasets, which is reasonable given that CBM only targets subset accuracy and multi-label optimality is measure-dependent. We also report all algorithms' running times in the supplementary material. CBM requires significantly less training time than PowSet on datasets with a large number of possible label subsets.

6.2. Impact of Number of Components

Figure 3 shows how increasing the number of components K affects test performance for CBM on the TMC dataset. When K = 1, CBM only estimates marginals and is equivalent to BinRel (the slight difference in performance is due to implementation details of CBM; see the supplementary material). Increasing K from 1 to 15 quickly makes CBM a better joint estimator and lets it outperform BinRel. Increasing K further gives smaller improvements, and the performance plateaus as K reaches about 30. Other datasets show similar trends.

Figure 3. Test subset accuracy on the TMC dataset with varying number of components K for CBM+LR (left) and CBM+GB (right), compared with BinRel, PowSet, and pair-CRF.

7. Conclusion

In this paper, we propose a new multi-label classification method using Conditional Bernoulli Mixtures. It captures label dependencies; it reduces a multi-label problem to binary and multi-class problems; and it subsumes the classic independent binary prediction and power-set subset prediction methods as special cases. A simple Expectation Maximization training procedure is presented, together with an efficient prediction procedure based on dynamic programming. Experimental results show the effectiveness of the proposed method against competitive alternatives on benchmark multi-label datasets.

Acknowledgements

We thank the reviewers for their helpful comments, Shivani Agarwal for her suggestions on related work, and Pavel Metrikov for discussions on the model.
This work has been generously supported through a grant from the Massachusetts General Hospital Physicians Organization, MGH 507222.

References

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

Chen, Yao-Nan and Lin, Hsuan-Tien. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pp. 1529-1537, 2012.

Cheng, Weiwei, Hüllermeier, Eyke, and Dembczynski, Krzysztof J. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 279-286, 2010.

Dembczyński, Krzysztof, Waegeman, Willem, and Hüllermeier, Eyke. Joint mode estimation in multi-label classification by chaining. In CoLISD 2011 (Collective Learning and Inference on Structured Data 2011), 2011.

Dembczyński, Krzysztof, Waegeman, Willem, Cheng, Weiwei, and Hüllermeier, Eyke. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.

Dembczynski, Krzysztof, Waegeman, Willem, and Hüllermeier, Eyke. An analysis of chaining in multi-label classification. In ECAI, pp. 294-299, 2012.

Deng, Jia, Ding, Nan, Jia, Yangqing, Frome, Andrea, Murphy, Kevin, Bengio, Samy, Li, Yuan, Neven, Hartmut, and Adam, Hartwig. Large-scale object classification using label relation graphs. In Computer Vision - ECCV 2014, pp. 48-64. Springer, 2014.

Friedman, Jerome H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.

Gasse, Maxime, Aussem, Alexandre, and Elghazel, Haytham. On the optimality of multi-label classification under subset zero-one loss for distributions satisfying the composition property. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2531-2539, 2015.

Ghamrawi, Nadia and McCallum, Andrew. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 195-200. ACM, 2005.

Guo, Yuhong and Gu, Suicheng. Multi-label classification using conditional dependency networks. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence, volume 22, pp. 1300, 2011.

Gupta, Maya R and Chen, Yihua. Theory and Use of the EM Algorithm. Now Publishers Inc, 2011.

Hsu, Daniel, Kakade, Sham, Langford, John, and Zhang, Tong. Multi-label prediction via compressed sensing. In NIPS, volume 22, pp. 772-780, 2009.

Jacobs, Robert A, Jordan, Michael I, Nowlan, Steven J, and Hinton, Geoffrey E. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.

Jordan, Michael I and Jacobs, Robert A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.

Kim, Dongwoo, Kim, Suin, and Oh, Alice. Dirichlet process with mixed random measures: a nonparametric topic model for labeled data. arXiv preprint arXiv:1206.4658, 2012.

Koyejo, Oluwasanmi O, Natarajan, Nagarajan, Ravikumar, Pradeep K, and Dhillon, Inderjit S. Consistent multilabel classification. In Advances in Neural Information Processing Systems, pp. 3303-3311, 2015.

Kumar, Abhishek, Vembu, Shankar, Menon, Aditya Krishna, and Elkan, Charles. Learning and inference in probabilistic classifier chains with beam search. In Machine Learning and Knowledge Discovery in Databases, pp. 665-680. Springer, 2012.

Kumar, Abhishek, Vembu, Shankar, Menon, Aditya Krishna, and Elkan, Charles. Beam search algorithms for multilabel learning.
Machine Learning, 92(1):65-89, 2013.

Lazarsfeld, Paul Felix, Henry, Neil W, and Anderson, Theodore Wilbur. Latent Structure Analysis. Houghton Mifflin, Boston, 1968.

Liu, Liping and Dietterich, Thomas G. A conditional multinomial mixture model for superset label learning. In Advances in Neural Information Processing Systems, pp. 557-565, 2012.

Liu, Weiwei and Tsang, Ivor. On the optimality of classifier chain for multi-label classification. In Advances in Neural Information Processing Systems, pp. 712-720, 2015.

McAuliffe, Jon D and Blei, David M. Supervised topic models. In Advances in Neural Information Processing Systems, pp. 121-128, 2008.

McCallum, Andrew. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning, pp. 1-7, 1999.

McLachlan, Geoffrey and Peel, David. Finite Mixture Models. John Wiley & Sons, 2004.

Mena, Deiner, Montañés, Elena, Quevedo, José R, and Del Coz, Juan José. Using A* for inference in probabilistic classifier chains. In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3707-3713. AAAI Press, 2015.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer Science & Business Media, 2006.

Puurula, Antti. Mixture models for multi-label text classification. In 10th New Zealand Computer Science Research Student Conference, 2011.

Ramage, Daniel, Hall, David, Nallapati, Ramesh, and Manning, Christopher D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pp. 248-256. Association for Computational Linguistics, 2009.

Read, Jesse, Pfahringer, Bernhard, Holmes, Geoff, and Frank, Eibe. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.

Schapire, Robert E and Singer, Yoram. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135-168, 2000.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, pp. 93-128, 2006.

Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields. Machine Learning, 4(4):267-373, 2011.

Tai, Farbound and Lin, Hsuan-Tien. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508-2542, 2012.

Tsoumakas, Grigorios and Katakis, Ioannis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 2007:1-13, 2007.

Ueda, Naonori and Saito, Kazumi. Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems, pp. 721-728, 2002.

Yang, Shuang-Hong, Zha, Hongyuan, and Hu, Bao-Gang. Dirichlet-Bernoulli alignment: A generative model for multi-class multi-label multi-instance corpora. In Advances in Neural Information Processing Systems, pp. 2143-2150, 2009.

Yuksel, Seniha Esen, Wilson, Joseph N, and Gader, Paul D. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177-1193, 2012.

Zhang, Min-Ling and Zhang, Kun. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 999-1008. ACM, 2010.

Zhang, Min-Ling and Zhou, Zhi-Hua. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038-2048, 2007.
Zhou, Tianyi, Tao, Dacheng, and Wu, Xindong. Compressed labeling on distilled labelsets for multi-label learning. Machine Learning, 88(1-2):69-126, 2012.