Extreme Multi-label Classification from Aggregated Labels

Yanyao Shen (1), Hsiang-fu Yu (2), Sujay Sanghavi (1,2), Inderjit Dhillon (2,3)

(1) Department of ECE, The University of Texas at Austin, TX, USA. (2) Amazon, CA, USA. (3) Department of Computer Science, The University of Texas at Austin, TX, USA. Correspondence to: Yanyao Shen.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input, from a very large universe of possible labels. We consider XMC in the setting where labels are available only for groups of samples - but not for individual ones. Current XMC approaches are not built for such multi-instance multi-label (MIML) training data, and MIML approaches do not scale to XMC sizes. We develop a new and scalable algorithm to impute individual-sample labels from the group labels; this can be paired with any existing XMC method to solve the aggregated label problem. We characterize the statistical properties of our algorithm under mild assumptions, and provide a new end-to-end framework for MIML as an extension. Experiments on both aggregated-label XMC and MIML tasks show the advantages over existing approaches.

1. Introduction

Extreme multi-label classification (XMC) is the problem of finding the relevant labels for an input from a very large universe of possible labels. XMC has wide applications in machine learning, including product categorization (Agrawal et al., 2013; Yu et al., 2014), webpage annotation (Partalas et al., 2015) and hash-tag suggestion (Denton et al., 2015), where both the sample size and the label size are extremely large. Recently, many XMC methods have been proposed with new benchmark results on standard datasets (Prabhu et al., 2018a; Guo et al., 2019; Jain et al., 2019).

The XMC problem, like many other modern machine learning problems, often requires a large amount of data. As the size of the data grows, the annotation of the data becomes less accurate, and large-scale data annotation with high quality becomes increasingly expensive. As a result, modern machine learning applications need to deal with certain types of weak supervision, including partial but noisy labeling and active labeling. These scenarios lead to the exploration of advanced learning methods including semi-(self-)supervised learning, robust learning and active learning.

In this paper, we study a typical weak supervision setting for XMC named Aggregated Label eXtreme Multi-label Classification (AL-XMC), where only aggregated labels are provided for a group of samples. AL-XMC is of interest in many practical scenarios where directly annotated training data cannot be extracted easily, which is often due to the way the data is organized. For example, Wikipedia contains a set of annotated labels for every wiki page, and can be used by an XMC algorithm for the task of tagging a new wiki page. However, if one is interested in predicting keywords for a new wiki paragraph, there is no such directly annotated data. Similarly, in e-commerce, the attributes of a product may not be provided directly, but the attributes of the brand of the product may be easier to extract. To summarize, it is often easier to get aggregated annotations belonging to a group of samples. This is known as the multi-instance multi-label (MIML) (Zhou et al., 2012) problem in the non-extreme label size setting.
AL-XMC raises new challenges that standard approaches are not able to address. Because of the enormously large label size, directly using MIML methods leads to computation and memory issues. On the other hand, standard XMC approaches suffer from two main problems when directly applied to AL-XMC: (1) higher computation cost due to the increased number of positive labels, and (2) worse performance due to ignoring the aggregation structure. In this work, we propose an Efficient AGgregated Label lEarning algorithm (EAGLE) that assigns labels to each sample by learning label embeddings based on the structure of the aggregation. More specifically, the key ingredient of EAGLE follows the simple principle that a label embedding should be close to the embedding of at least one of the sample points in every positively labeled group. We first formulate such an estimator, then design an iterative algorithm that takes projected gradient steps to approximate it. As a by-product, our algorithm naturally extends to the non-XMC setting as a new end-to-end framework for the MIML problem. Our main contributions include:

- We propose to study AL-XMC, which has significant impact on modern machine learning applications.
- We propose an efficient and robust algorithm, EAGLE, with low computation cost (Section 4) that can be paired with any existing XMC method for solving AL-XMC.
- We provide theoretical analysis for EAGLE and show the benefit of label assignment, the property of the estimator and the convergence of our iterative update (Section 5).
- The proposed method can be easily extended to the regular (non-extreme) MIML setting (Section 6). Our solution can be viewed as a co-attention mechanism between labels and samples. We empirically show its benefit over the previous MIML framework in Section 7.

2. Related Work

Extreme multi-label classification (XMC). The most classic and straightforward approach for XMC is based on the One-Vs-All (OVA) method (Yen et al., 2016; Babbar & Schölkopf, 2017; Liu et al., 2017; Yen et al., 2017), which treats each label separately and learns a classifier for each label. OVA has been shown to achieve high accuracy, but the computation is too expensive for extremely large label sets. Tree-based methods, on the other hand, try to improve the efficiency of OVA by using hierarchical representations for samples or labels (Agrawal et al., 2013; Prabhu & Varma, 2014; Jain et al., 2016; Si et al., 2017; Siblini et al., 2018; Prabhu et al., 2018b). Among these approaches, label partitioning based methods, including Parabel (Prabhu et al., 2018b), have achieved leading performance with training cost sub-linear in the number of labels. Apart from tree-based methods, embedding based methods (Zhang et al., 2018; Chang et al., 2019; You et al., 2019; Guo et al., 2019) have been studied recently in the context of XMC in order to better use the textual features. In general, while embedding based methods may learn a better representation and use the contextual information better than tf-idf, the scalability of these approaches is worse than that of tree-based methods. Recently, Medini et al. (2019) applied sketching to learn XMC models with a label size at the scale of 50 million, and connections between softmax, the negative sampling approach (Jain et al., 2019), and hierarchical softmax with partitioned label trees are shown in (Bamler & Mandt, 2020; Wydmuch et al., 2018).
Multi-instance multi-label learning (MIML). MIML (Zhang & Zhang, 2007) is a general setting that includes both multi-instance learning (MIL) (Dietterich et al., 1997; Maron & Lozano-Pérez, 1998) and multi-label learning (MLL) (McCallum, 1999; Zhang & Zhou, 2013). AL-XMC can be categorized as a special MIML setting with extreme label size. Recently, Feng and Zhou (2017) proposed the general deep MIML architecture with a concept layer and two max-pooling layers to align with the multi-instance nature of the input. In contrast, our approach learns label representations and uses them as one branch of the input. On the other hand, Ilse et al. (2018) adopt the attention mechanism for multi-instance learning. Similar attention-based mechanisms are later used in learning with sets (Lee et al., 2019), but focus on a different problem. Our label assignment based algorithm EAGLE can be viewed as an analogue to these attention-based mechanisms, while having major differences from previous work: EAGLE builds on the intuition that attention truly happens between the label representation and the sample representation, while previous methods do not. The idea of jointly considering the sample and label spaces also exists in multi-label classification problems in vision (Weston et al., 2011; Frome et al., 2013). While sharing the similar idea of learning a joint input-label space, our work addresses the multi-instance learning challenges as well as scalability in the XMC setting.

Others. AL-XMC is also related to a line of theoretical work on learning with shuffled labels and permutation estimation (Collier & Dalalyan, 2016; Pananjady et al., 2017b; Abid et al., 2017; Pananjady et al., 2017a; Hsu et al., 2017; Haghighatshoar & Caire, 2017), where the labels of all samples are provided without correspondences. Our work uniquely focuses on an aggregation structure where we know the group-wise correspondences, and our targeted applications have extreme label sizes that make even classically efficient estimators hard to compute. Another line of work studies learning with noisy labels (Natarajan et al., 2013; Liu & Tao, 2015), where one is interested in identifying the subset of correctly labeled samples (Shen & Sanghavi, 2019b), but there is no group-to-group structure. More broadly, in the natural language processing context, indirect supervision (Chang et al., 2010; Wang & Poon, 2018) tries to address the label scarcity problem where large-scale coarse annotations are provided with very limited fine-grained annotations.

3. Problem Setup and Preliminaries

In this section, we first provide a brief overview of XMC, whose definition helps us formulate the aggregated label XMC (AL-XMC) problem. Based on the formulation, we use one toy example to illustrate the shortcomings of existing XMC approaches when applied to AL-XMC.

XMC and AL-XMC formulation. An XMC problem can be defined by {X, Y}, where X ∈ R^{n×d} is the feature matrix for all n samples, and Y ∈ {0,1}^{n×l} is the sample-to-label binary annotation matrix with label size l (if sample i is annotated by label k, then Y_{i,k} = 1).

Figure 1: A toy example of AL-XMC. Labels 1 & 2 are both tagged for each of the groups {s1, s2}, {s3, s4}, {s5, s6}. Samples s1, s3, s5 all have feature x, while s2, s4, s6 all have feature −x. A good model should identify that s1, s3, s5 and s2, s4, s6 belong to the two labels, respectively (order does not matter). Ignoring the aggregation structure makes standard XMC approaches fail completely.
For the AL-XMC problem, however, such a clean annotation matrix is not available. Instead, aggregated labels are available for subsets of samples. We use m intermediate nodes to represent this aggregation, where each node is connected to a subset of samples and gets annotated by multiple labels. More specifically, AL-XMC can be described by {X, Y1, Y2}, where the original annotation matrix is replaced by two binary matrices Y1 ∈ {0,1}^{n×m} and Y2 ∈ {0,1}^{m×l}. Y1 captures how the samples are grouped, while Y2 captures the labels for the aggregated samples. The goal is to use {X, Y1, Y2} to learn a good extreme multi-label classifier. Let g = nnz(Y1)/m denote the average group size. In general, the larger g is, the weaker the annotation quality becomes. For convenience, let N, M, L be the sets of samples, intermediate nodes, and labels, respectively. Let Nj and Lj be the sets of samples and labels linked to intermediate node j, respectively, for every j ∈ M, and let Mk be the set of nodes in M connected to label k ∈ L.¹ Let xi⊤ be the i-th row of X for i ∈ Nj, j ∈ M, and let XS be the submatrix of X that includes the rows with indices in the set S ⊆ N.²

¹ We slightly abuse the notation and use Nj, Mk, Lj for the corresponding index sets as well.
² We summarize all notations in the Appendix.

Deficiency of existing XMC approaches. Existing XMC approaches can be applied to AL-XMC by treating the product Y1Y2 directly as the annotation matrix, and learning a model using {X, Y1Y2}. While using all the labeling information, this simple treatment ignores the possible incorrect correspondences due to label aggregation. To see the problem with this treatment, we take the XMC method Parabel (Prabhu et al., 2018b) as an example. The deficiency holds for all standard XMC approaches in general, but it is convenient to illustrate it for a specific XMC method. Parabel is a scalable algorithm that achieves good performance on standard XMC datasets based on a partitioned label tree: it first calculates label representations L = Y⊤X that summarize the positive samples associated with every label (step 1); next, a balanced hierarchical partitioning of the labels is learned using L (step 2); finally, a hierarchical probabilistic model is learned given the label tree (step 3). Notice that the label embedding Parabel would use for AL-XMC, if we naively use Y1Y2 as the annotation matrix, is given by:

L = (Y1Y2)⊤ X.  (1)

Consider an example with n samples and 2 labels, where X ∈ R^{n×d}, Y1 ∈ {0,1}^{n×(n/2)} and Y2 ∈ {0,1}^{(n/2)×2} are defined as follows: the rows of X alternate between x⊤ and −x⊤, Y1 = I_{n/2} ⊗ 1_2, and Y2 = 1_{(n/2)×2}, where ⊗ is the Kronecker product, 1_d (1_{d1×d2}) is an all-ones vector (matrix) with dimension d (d1×d2), and I_d is the identity matrix of size d. A pictorial explanation is shown in Figure 1. The embedding calculated using the above X, Y1, Y2 gives L = 0 and loses all the information. With this label embedding, the clustering algorithm in step 2 and the probabilistic model in step 3 would fail to learn anything useful. However, a good model for the above setting should classify samples with features close to x as label 1 and samples with features close to −x as label 2 (or vice versa). Such a failure of classic XMC approaches motivates us to provide algorithms that are robust for the AL-XMC problem.
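To make the failure mode concrete, here is a minimal numpy sketch of the toy example above; the variable names and dimensions are our own choices, and the check simply confirms that the naive embedding of Eq. (1) collapses to zero.

```python
import numpy as np

# Toy example of Figure 1: alternating features x and -x, groups of size 2,
# and both labels tagged on every group.
n, d = 6, 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)

X = np.stack([x if i % 2 == 0 else -x for i in range(n)])   # n x d feature matrix
Y1 = np.kron(np.eye(n // 2), np.ones((2, 1)))               # n x (n/2): sample-to-group links
Y2 = np.ones((n // 2, 2))                                   # (n/2) x 2: both labels per group

# Naive label embedding of Eq. (1), treating Y1 @ Y2 as the annotation matrix.
L_naive = (Y1 @ Y2).T @ X
print(np.allclose(L_naive, 0.0))   # True: x and -x cancel, so all label information is lost
```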
4. Algorithms

The main insight we draw from the toy example above is that ignoring the aggregation structure may lead to serious information loss. Therefore, we propose an efficient and robust label embedding learning algorithm to address this. We start with the guiding principle of our approach, and then explain the algorithmic details.

Given an XMC dataset with aggregated labels, our key idea for finding the embedding of each label is the following: the embedding of label k ∈ L should be close to the embedding of at least one of the samples in Nj, for every j ∈ Mk. The closeness here can be any general characterization of similarity, e.g., the standard cosine similarity. According to this rule, in the previous toy example, the optimal label embeddings for the two labels are (x, −x) or (−x, x), instead of (0, 0). More formally, the label embedding of label k ∈ L is calculated based on the following:

êk = arg max_{e: ‖e‖ = 1} Σ_{j∈Mk} max_{i∈Nj} ⟨xi, e⟩,  (2)

where Mk, Nj are as defined in Section 3. The goal of (2) is to find a representation of label k that is robust even when every group has samples not related to k. However, finding the estimator in (2) is in general hard, since the number of possible combinations for choosing the maximum in each group is exponential in n. We provide an iterative algorithm that alternates between the following two steps for T rounds to approximate this estimator: (i) identify the closest sample in each group given the current label embedding, and (ii) update the label embedding in the direction determined by all currently closest samples. This is formally described in Algorithm 1.

Algorithm 1 GROUP_ROBUST_LABEL_REPR (GRLR)
Inputs: Mk, {Nj}_{j∈Mk}, X. Output: label embedding ek.
Initialize: set e0 to a unit-norm vector, where Proj(x) := x/‖x‖.
for t = 1, ..., T do
  for j ∈ Mk do
    v_{i,j} ← ⟨e_{t−1}, xi⟩, ∀ i ∈ Nj.
    a_{i,j} ← 1[v_{i,j} == max_{i'∈Nj} v_{i',j}], ∀ i ∈ Nj.
  end for
  g_t ← Σ_{j∈Mk} Σ_{i∈Nj} a_{i,j} xi;  e_t ← Proj(e_{t−1} + λ g_t).
end for
Return: e_T.

The complete algorithm, Efficient AGgregated Label lEarning (EAGLE), is formally described in Algorithm 2; its output can be directly fed into any standard XMC solver. Given the label embeddings and a set of samples connected to the same intermediate node, each positive label is assigned to the sample with the highest similarity.

Algorithm 2 EAGLE
Inputs: X, Y1 ∈ {0,1}^{n×m}, Y2 ∈ {0,1}^{m×l}. Output: a filtered XMC dataset.
Yfilter ← 0_{n×l}, N ← [n], M ← [m], L ← [l].
Nj ← {i ∈ N | Y1_{i,j} == 1}, ∀ j ∈ M.
for k ∈ L do
  Mk ← {j ∈ M | Y2_{j,k} == 1}.
  ek ← GRLR(Mk, {Nj}_{j∈Mk}, X).
end for
for j ∈ M do
  Yfilter(arg max_{i∈Nj} ⟨ek, xi⟩, k) ← 1, ∀ k ∈ Lj.
end for
Return: {X, Yfilter}.

Notice that calculating the label embedding using Algorithm 1 is equivalent to Parabel's label embedding in (1) if g = 1, i.e., the standard XMC setting. For general g, (1) may perform well if each sample in Nj contributes equally to the label k ∈ L, for every node j ∈ Mk. However, this is not always the case.

Complexity. Computational efficiency is one of the main benefits of EAGLE. Notice that for each label, only the samples belonging to its positively labeled groups are used. Let d̄ be the average feature sparsity and assume each sample has O(log l) labels; then the number of positive samples per label is O((n log l / l) · g). Therefore, the total complexity for learning all labels' embeddings is O((n d̄ log l / l) · g · l) = O(n d̄ g log l). On the other hand, the time complexity of Parabel (one of the most efficient XMC solvers) is O(n d̄ log l) for step 1, O(l d̄ log l) for step 2 and O(n d̄ log l) for step 3 (Prabhu et al., 2018b). Therefore, pairing EAGLE with any standard XMC solver for solving AL-XMC adds a very affordable pre-processing cost.
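As a concrete reference, the sketch below implements our reading of Algorithms 1 and 2 in plain numpy. The function names, the unit-vector initialization and the step size lam are our choices, and dense arrays are used for clarity; it is meant as an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def proj(x):
    """Proj(x) := x / ||x||."""
    return x / (np.linalg.norm(x) + 1e-12)

def grlr(Mk, N, X, T=20, lam=0.1):
    """Algorithm 1 (GRLR), as we read it: iteratively move the label embedding
    toward the closest sample in every positively labeled group.
    Mk: group ids for label k; N[j]: sample ids in group j; X: n x d features."""
    e = proj(np.ones(X.shape[1]))                 # unit-norm initialization (our choice)
    for _ in range(T):
        g = np.zeros(X.shape[1])
        for j in Mk:
            idx = np.asarray(N[j])
            scores = X[idx] @ e                   # v_{i,j} = <e_{t-1}, x_i>
            g += X[idx[np.argmax(scores)]]        # a_{i,j} selects the argmax sample
        e = proj(e + lam * g)                     # projected gradient step
    return e

def eagle(X, N, M_of_label, L_of_group):
    """Algorithm 2 (EAGLE): learn label embeddings, then assign each positive
    label of a group to the single closest sample in that group."""
    n, l = X.shape[0], len(M_of_label)
    E = np.stack([grlr(M_of_label[k], N, X) for k in range(l)])   # l x d embeddings
    Y_filter = np.zeros((n, l))
    for j, labels in L_of_group.items():
        idx = np.asarray(N[j])
        for k in labels:
            Y_filter[idx[np.argmax(X[idx] @ E[k])], k] = 1
    return Y_filter
```

Per label, only the samples in its positively labeled groups are touched, which is what yields the O(n d̄ g log l) total cost discussed above.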
5. Analysis

In this section, we provide theoretical analysis and explanations for the proposed algorithm EAGLE of Section 4. We start by comparing two estimators under a simplified regression setting, to explain when assigning labels to each sample is helpful. We then analyze the statistical property of the label embedding estimator defined in (2) in Theorem 3, and the one-step convergence of the key step in Algorithm 1 in Theorem 4.

In EAGLE, a learned label embedding is used to assign each label to the closest sample in its aggregated group. Therefore, we start by justifying when label assignment would help. Since the multi-label classification setting may complicate the analysis, we instead analyze a simplified regression scenario. Let Z ∈ R^{n×l} be the responses of all n samples in X. Given B⋆ ∈ R^{d×l}, each group in Z is generated according to Z_{Nj} = Πj (X_{Nj} B⋆ + Ej), j ∈ M, where Ej is a noise matrix and Πj is an unknown permutation matrix. For simplicity, we assume each group includes g samples and the aggregation structure can be described by Y1 = I_m ⊗ 1_g, Y2 = Z, with m = n/g. If each row in Z is a one-hot vector, Y2 becomes a binary matrix and {X, Y1, Y2} corresponds to the standard AL-XMC problem. Our goal here is to recover the model parameter B⋆, with ‖B⋆‖ = 1 for convenience. We assume each row in Ej independently follows N(0, σe² I_l), and each sample feature is generated according to xi = x̄j + di for i ∈ Nj, j ∈ M, where x̄j ∼ N(0, σ1² I_d) describes the center of each group and di ∼ N(0, σ2² I_d) captures the deviation within the group. Notice that the special case σ1 ≪ σ2 corresponds to all samples being i.i.d. spherical Gaussians; in the other extreme, σ2 ≪ σ1 corresponds to samples within each group being well clustered. We consider the following two estimators:

B̂NoAS = LR({(xi, zi)}_{i∈[n]}),  B̂AS = LR(∪_{j∈M} ∪_{i∈Nj} {(x_{i'}, zi)}),  where i' = arg min_{i'∈Nj} ‖zi − B̂NoAS⊤ x_{i'}‖.  (3)

Here, B̂NoAS corresponds to the baseline approach that learns a model without label assignment. On the other hand, B̂AS corresponds to the estimator we learn after assigning each output in a group to the closest instance, based on the residual norm under B̂NoAS. We have the following result describing the properties of the two estimators:

Theorem 1. Given the two estimators in (3), let R1 = ‖B̂NoAS − B⋆‖ and R2 = ‖B̂AS − B⋆‖. For n ≥ c0 √g · d log² d, the bounds (4) on R1 and (5) on R2 hold with high probability (i.e., with probability at least 1 − n^{−c1}).

We can see the pros and cons of the two estimators from the above theorem. The first term in (5) is the rate achieved by the maximum likelihood estimator with all correspondences given (known Πj's), while the second term is a bias term due to incorrect assignment. This bias term gets smaller as the measurement noise and the estimation error become smaller. In (4), when σ1 ≫ σ2, B̂NoAS achieves the same rate as the optimal estimator, but when σ1 ≪ σ2, the rate degrades from n^{−1/2} to (n/g)^{−1/2}. This shows that B̂NoAS is close to optimal when the clustering quality is high (the within-group deviation is small); on the other hand, B̂AS is nearly optimal regardless of the clustering quality, while having an additional bias term that depends on the measurement noise.
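The comparison in Theorem 1 can be reproduced qualitatively with a short simulation in the spirit of Section 7.1. The sketch below is ours: the dimensions, constants and the use of np.linalg.lstsq for the LR step are arbitrary choices made under the data model stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
m, g, d, l = 200, 5, 10, 3                    # groups, group size, feature dim, output dim (our choices)
sigma1, sigma2, sigma_e = 1.0, 5.0, 1.0
n = m * g

def fit(A, Y):
    """Least-squares regression, standing in for the LR step in Eq. (3)."""
    return np.linalg.lstsq(A, Y, rcond=None)[0]

B_star = rng.normal(size=(d, l)); B_star /= np.linalg.norm(B_star)
centers = sigma1 * rng.normal(size=(m, d))
X = np.repeat(centers, g, axis=0) + sigma2 * rng.normal(size=(n, d))
Z = X @ B_star + sigma_e * rng.normal(size=(n, l))
for j in range(m):                             # within each group, responses arrive in unknown order
    sl = slice(j * g, (j + 1) * g)
    Z[sl] = Z[sl][rng.permutation(g)]

B_noas = fit(X, Z)                             # NoAS: ignore the unknown permutations entirely

X_as = X.copy()                                # AS: assign each response to the sample with
for j in range(m):                             # the smallest residual under B_noas, then refit
    sl = slice(j * g, (j + 1) * g)
    resid = np.linalg.norm(Z[sl][:, None, :] - (X[sl] @ B_noas)[None, :, :], axis=2)
    X_as[sl] = X[sl][np.argmin(resid, axis=1)]
B_as = fit(X_as, Z)

print(np.linalg.norm(B_noas - B_star), np.linalg.norm(B_as - B_star))
```

Increasing sigma2 relative to sigma1 (more heterogeneity within groups) should favor B_as, while a very small sigma2 favors B_noas, matching the discussion above.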
Next, we analyze the property of the estimator in (2), since our iterative algorithm tries to approximate it. We assume that for each label k ∈ L there is some ground-truth embedding e⋆k, and each sample is associated with one of the labels. For a sample i with ground-truth label k, its feature vector can be decomposed as xi = e⋆k + εi, where εi is a noise term. Without loss of generality, we only need to focus on recovering a single label k ∈ L. Further, for simplicity, let us assume that both xi and e⋆k have unit Euclidean norm. We introduce the following definition that describes the properties of the data:

Definition 2. Define δ = min_{k1 ≠ k2} ‖e⋆_{k1} − e⋆_{k2}‖ to be the minimum separation between each pair of ground-truth label embeddings. Define f(γ), for γ ∈ [0, 1], to be the maximum influence of the noise over any subset S ⊆ M with |S|/|M| ≤ γ, and let f = f(1). Define q = max_{k1 ≠ k2} |M_{k1} ∩ M_{k2}| / min{|M_{k1}|, |M_{k2}|} to be the maximum overlap between the sets of intermediate nodes associated with two labels.

Given the above definition, we have the following result showing the property of the estimator in (2):

Theorem 3. With q, δ > 0, the estimator in (2) satisfies

⟨e⋆k, êk⟩ ≥ 1 − r f − (2r + 2) f²,  (6)

where r depends on δ and q. Theorem 3 characterizes the consistency of the estimator as the noise goes to zero. It also quantifies the influence of the minimum separation as well as the maximum overlap between labels: a smaller δ and a larger q both lead to a harder identification problem, which is reflected in an increasing r in (6).

We then provide the following one-step convergence analysis for each iteration of Algorithm 1.

Theorem 4. Given label k and the current iterate e_t, let S_t^good be the set of groups in which e_t is closest to a sample belonging to label k, and denote |S_t^good| / |Mk| = α_t. The next iterate given by Algorithm 1 satisfies a lower bound on ⟨e⋆k, e_{t+1}⟩ that is a convex combination, with weights α_t and 1 − α_t, of terms corresponding to the correctly and incorrectly selected samples; a sufficient condition for contraction is a lower bound on ⟨e⋆k, e_t⟩.

The key idea of the proof is based on the following fact about our algorithm: if a sample that does not come from label k is selected by e_t, then the distance from the selected incorrect sample to label k's embedding can be controlled by its distance to e_t plus the distance from e_t to e⋆k, where the first term is small because of the selection rule. Please find the details of the proof in the Appendix. Theorem 4 shows how each iterate gets closer to the ground-truth label embedding. Since our algorithm does not require any assumption on the group size, we do not show the connection between α_t and e_t (which requires more restrictive assumptions), but instead provide a sufficient condition that illustrates when the next iterate would improve. Notice that as the group size becomes larger, the signal becomes weaker and α_t is smaller in general.

Figure 2: Illustration of extending EAGLE to the standard MIML framework. The upper branch is the original deep MIML framework (Feng & Zhou, 2017), where the Deep-MIML module converts X to V2 as in (7). In the lower branch, EAGLE first learns label embeddings using Algorithm 1 by re-organizing the MIML dataset. Then, a soft-assignment mask is generated based on the inner product between the multi-instance inputs and the label embeddings. The masked MI logit is an element-wise multiplication of the mask and the original MI logit. A max-pooling operation over instances on the masked MI logit gives the final logit for prediction, similar to the original deep MIML.

6. Extensions

In the previous sections, EAGLE was proposed for AL-XMC in the extreme label setting. In the non-extreme case, EAGLE naturally leads to a solution for general MIML problems.
Deep-MIML Network. Feng and Zhou (2017) propose a general deep-MIML network: given a multi-instance input X with shape g × d, the network first transforms it into a g × k × l tensor V1 through a fully connected layer and a ReLU layer, where l is the label size and k is an additional dimension called the concept size (Feng & Zhou, 2017), used to improve the model capacity. A max-pooling operation over all concepts is applied to V1 to obtain a g × l matrix V2 (which we call the multi-instance logit). Finally, max-pooling over all instances is applied to V2 to give the final length-l logit prediction Ŷ. This network can be summarized as:

X --(FC + ReLU)--> V1 --(max-pooling over concepts)--> V2 --(max-pooling over samples)--> Ŷ.  (7)

A co-attention framework. Our idea can be directly applied to modify the deep-MIML network structure, as shown in Figure 2. The main idea is to add a soft-assignment mask to the original multi-instance logit, where this mask mimics the label assignment in EAGLE. After learning the label embedding L ∈ R^{l×d} from the dataset using Algorithm 1, the mask M ∈ R^{g×l} is calculated by M = g · Softmax(β · X L⊤), where the softmax operation applies to each column of M. As a result, M_{i,j} indicates the affinity between instance i and label j. Notice that β controls the hardness of the assignment, and the special case β = 0 corresponds to the standard deep-MIML framework. Interestingly, this mask can also be interpreted as an attention weight matrix, which is then multiplied element-wise with the multi-instance logit matrix V2. While there is other literature using attention for MIL (Ilse et al., 2018), none of the existing methods uses a robust calculation of the label embedding as the input to the attention. The proposed co-attention framework is easily interpretable since both labels and samples lie in the same representation space, with the theoretical justifications shown in Section 5. Notice that the co-attention framework in Figure 2 can be trained end-to-end.
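Below is a forward-pass sketch of the two branches in Figure 2, written in numpy. We write the hardness parameter as beta and use random weights; the shapes and names are our assumptions, and training (the end-to-end part) is omitted.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def deep_miml_forward(X, W1, b1, g, k, l):
    """Original deep-MIML branch: FC + ReLU to a g x k x l tensor, then max over concepts."""
    V1 = np.maximum(X @ W1 + b1, 0.0).reshape(g, k, l)   # g x (k*l) -> g x k x l
    return V1.max(axis=1)                                # MI logit V2: g x l

def co_attention_forward(X, W1, b1, L_emb, g, k, l, beta=1.0):
    """Masked branch of Figure 2: a soft-assignment mask built from the label
    embeddings is multiplied element-wise with the MI logit before instance pooling."""
    V2 = deep_miml_forward(X, W1, b1, g, k, l)
    M = g * softmax(beta * (X @ L_emb.T), axis=0)        # g x l, softmax over instances per label
    return (M * V2).max(axis=0)                          # final length-l logit

# beta = 0 recovers the plain deep-MIML prediction: the mask becomes all ones.
g, d, k, l = 4, 16, 3, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(g, d)); W1 = rng.normal(size=(d, k * l)); b1 = np.zeros(k * l)
L_emb = rng.normal(size=(l, d))
print(np.allclose(co_attention_forward(X, W1, b1, L_emb, g, k, l, beta=0.0),
                  deep_miml_forward(X, W1, b1, g, k, l).max(axis=0)))   # True
```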
7. Experimental Results

In this section, we empirically verify the effectiveness of EAGLE from multiple standpoints. First, we run simulations to verify and explain the benefit of label assignment as analyzed in Theorem 1. Next, we run synthetic experiments on standard XMC datasets to understand the advantages of EAGLE under multiple aggregation rules. Lastly, for the natural extension of EAGLE to the non-extreme setting (as mentioned in Section 6), we study multiple MIML tasks and show the benefit of EAGLE over the standard MIML solution. We include details of the experimental settings and more comparison results in the Appendix.

7.1. Simulations

We design a toy regression task to better explain the performance of our approach from an empirical perspective. Our data generating process strictly follows the setting in Theorem 1. We set σ1 = σe = 1.0 and vary σ2 from 0.0 to 10.0, which corresponds to the heterogeneity within each group changing from low to high.

Figure 3: Regression task with aggregated outputs. The advantage of AS (the estimator with label assignment) over NoAS (the estimator without label assignment) matches the result in Theorem 1. The y-axis is the root mean square (RMS) value normalized by the RMS of the maximum likelihood estimator with known correspondences (lower is better).

Results. In Figure 3, as the deviation within each sample group increases, AS performs much better, which is due to the √g difference in the error rate between (4) and the first term in (5). On the other hand, AS may perform slightly worse than NoAS in the well-clustered setting, which is due to the second term in (5). See another toy classification task with similar observations in the Appendix.

7.2. Extreme Multi-label Experiments

We first verify our idea on 4 standard extreme classification tasks³ (1 small, 2 mid-size and 1 large), whose detailed statistics are shown in Table 1.

³ http://manikvarma.org/downloads/XC/XMLRepository.html

Table 1: Statistics of the 4 XMC datasets. The sample size column includes the training & test sets. The last column reports the precisions obtained with the clean datasets, which can be thought of as the oracle performance for an XMC dataset with aggregated labels.

| Dataset | # features | # labels | Sample size (train / test) | Avg samples/label | Avg labels/sample | Std. precision (P@1/3/5) |
|---|---|---|---|---|---|---|
| EurLex-4K | 5,000 | 3,993 | 15,539 / 3,809 | 25.73 | 5.31 | 82.71 / 69.42 / 58.14 |
| Wiki-10K | 101,938 | 30,938 | 14,146 / 6,616 | 8.52 | 18.64 | 84.31 / 72.57 / 63.39 |
| AmazonCat-13K | 203,882 | 13,330 | 1,186,239 / 306,782 | 448.57 | 5.04 | 93.03 / 79.16 / 64.52 |
| Wiki-325K | 1,617,899 | 325,056 | 1,778,351 / 587,084 | 17.46 | 3.19 | 66.04 / 43.63 / 33.05 |

For all tasks, the samples are grouped under different rules, including: (i) random clustering: each group of samples is randomly selected; (ii) hierarchical clustering: samples are hierarchically clustered using k-means. Each sample in the original XMC dataset belongs to exactly one of the groups.
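For concreteness, here is a small helper (ours, not the authors' code) that builds the aggregated-label version {Y1, Y2} of a standard XMC dataset by random grouping, mirroring the R-4 / R-10 setups; dense numpy arrays are used for readability even though real XMC annotation matrices are sparse.

```python
import numpy as np

def aggregate_random(Y, group_size, seed=0):
    """Turn a standard XMC annotation matrix Y (n x l, 0/1) into aggregated labels
    {Y1, Y2} by randomly grouping samples; each sample joins exactly one group."""
    n, l = Y.shape
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    m = int(np.ceil(n / group_size))
    Y1 = np.zeros((n, m), dtype=np.int8)
    for j in range(m):
        Y1[order[j * group_size:(j + 1) * group_size], j] = 1
    Y2 = (Y1.T @ Y > 0).astype(np.int8)   # a group is tagged with the union of its samples' labels
    return Y1, Y2

# Example: 10 samples, 6 labels, groups of 4 (the last group may be smaller).
Y = (np.random.default_rng(1).random((10, 6)) < 0.3).astype(np.int8)
Y1, Y2 = aggregate_random(Y, group_size=4)
print(Y1.sum(axis=1))   # each sample belongs to exactly one group
```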
As described in Section 4, EAGLE learns the label embeddings and assigns every label of a group to one of its samples based on the embeddings. Then, we run Parabel and compare the final performance. Notice that it is possible to assign labels more cleverly; however, since we focus on the quality of the label embedding learned through EAGLE, we stick to this simple assignment rule. We consider (i) Baseline: Parabel without label assignment; (ii) EAGLE-0: EAGLE without label learning (T = 0); and (iii) EAGLE: EAGLE with label learning (T = 20 by default).

Table 2: Comparing Baseline (B), EAGLE-0 (E0, EAGLE without label learning) and EAGLE (E) on small/mid/large-size XMC datasets with aggregated labels. O stands for an oversized model (> 5 GB). R-4/R-10 randomly selects 4/10 samples in each group and observes their aggregated labels. C clusters samples based on hierarchical k-means; the cluster depth is determined based on the sample size (8 for EurLex-4k and Wiki-10k, 16 for AmazonCat-13k and Wiki-325k).

| Setting | Metric | EurLex-4k (B / E0 / E) | Wiki-10k (B / E0 / E) | AmazonCat-13k (B / E0 / E) | Wiki-325k (B / E0 / E) |
|---|---|---|---|---|---|
| R-4 | P@1 | 76.58 / 80.16 / 80.87 | 73.07 / 78.28 / 80.38 | 80.62 / 78.83 / 81.34 | 34.09 / 62.45 / 60.60 |
| R-4 | P@3 | 61.36 / 63.49 / 65.33 | 60.84 / 63.81 / 66.15 | 65.81 / 67.45 / 69.97 | 33.34 / 41.96 / 40.23 |
| R-4 | P@5 | 49.50 / 51.01 / 52.88 | 53.31 / 54.66 / 57.22 | 54.06 / 54.80 / 56.98 | 26.05 / 31.23 / 29.80 |
| R-10 | P@1 | 62.95 / 62.58 / 67.56 | 64.87 / 65.21 / 68.89 | 56.42 / 60.71 / 63.47 | O / 52.43 / 54.81 |
| R-10 | P@3 | 47.72 / 44.07 / 48.03 | 51.13 / 50.33 / 53.60 | 47.74 / 49.05 / 51.59 | O / 34.02 / 35.80 |
| R-10 | P@5 | 37.59 / 33.25 / 35.91 | 42.89 / 41.22 / 44.68 | 40.74 / 38.06 / 39.47 | O / 24.84 / 26.18 |
| C | P@1 | 17.94 / 39.11 / 43.11 | 16.34 / 39.13 / 40.25 | 74.65 / 56.41 / 56.19 | O / 45.03 / 46.39 |
| C | P@3 | 16.27 / 28.02 / 30.08 | 16.08 / 30.70 / 31.21 | 64.85 / 48.32 / 48.07 | O / 27.98 / 28.96 |
| C | P@5 | 14.83 / 22.46 / 23.78 | 15.96 / 25.69 / 26.09 | 53.61 / 40.01 / 39.93 | O / 20.53 / 21.27 |

Results. We report the performance using the standard Precision@1/3/5 metrics in Table 2. From the empirical results, we find that EAGLE performs better than EAGLE-0 almost consistently, across all tasks and all grouping methods, and is much better than Baseline, which ignores the aggregation structure. Baseline performs much better only on AmazonCat-13k with hierarchical clustering, which is because of the low heterogeneity within each cluster, as theoretically explained by our Theorem 1. Notice that the precision on the standard AmazonCat-13k reaches 93.04, which implies that the samples are easily separated.

Figure 4: Ablation study: comparing Precision@1 between Baseline (no label assignment), EAGLE-0 (EAGLE without label learning) and EAGLE (EAGLE with label learning). The y-axis is the percentage of precision decrement relative to the oracle performance (trained with known sample-label correspondences). We study different factors, including (left) the group size in random grouping; (middle) the hierarchical clustering depth; (right) the heterogeneity within groups changing from low (hierarchical clustering) to high (random grouping).

Furthermore, we also provide an ablation study on EurLex-4K in Figure 4 to understand the influence of the group size and the clustering rule. We report the decrement percentage over a model trained with known correspondences. As a sanity check, as the size of the group gets smaller and the annotation gets finer in Figure 4-(a) & (b), all methods have 0% decrement. More interestingly, in the other regime of coarser annotations, (a) & (b) show that the benefit of EAGLE-0 varies when the clustering rule changes, while the benefit of EAGLE is consistent. The consistency also holds when we change the heterogeneity within groups by injecting noise into the feature representation when running the hierarchical clustering algorithm, as shown in Figure 4-(c).

7.3. MIML Experiments

First, we run a set of synthetic experiments on the standard MNIST & Fashion-MNIST image datasets, where we use the raw 784-dimensional vector as the representation. Each sample in our training set consists of g random digit/clothing images and the set of corresponding labels (we set g = 4, 50). We then test the accuracy on the standard test set. On the other hand, we collect standard Yelp customer review data from the web. Our goal is to predict the tags of a restaurant based on the customer reviews. We choose 10 labels with balanced positive samples and report the overall accuracy. Notice that each single review can be split into multiple sentences; as a result, we formulate it as an MIML problem similar to (Feng & Zhou, 2017). We retrieve the feature of each instance using InferSent⁴, an off-the-shelf sentence embedding method. We randomly collect 20k reviews with 4 sentences for training, 10k reviews with a single sentence for validation and 10k reviews with a single sentence for testing. We report the top-1 precision. For both tasks, we use a two-layer feed-forward neural network as the base model, identical to the setting in (Feng & Zhou, 2017). We compare Deep-MIML with the extensions of EAGLE-0 and EAGLE for MIML, as illustrated in Figure 2. We first do a hyper-parameter search for Deep-MIML to find the best learning rate and the ideal number of epochs (the epoch number with the best validation performance). Then we fix those hyper-parameters and use them for all algorithms.

⁴ https://github.com/facebookresearch/InferSent

Table 3: Prediction accuracy on multiple MIML tasks.

| | MNIST | Fashion | Yelp |
|---|---|---|---|
| group size | 4 / 50 | 4 / 50 | 4 |
| Deep-MIML | 94.70 / 33.33 | 84.70 / 19.00 | 40.69 |
| EAGLE-0 | 94.82 / 36.10 | 84.89 / 27.62 | 45.82 |
| EAGLE | 94.82 / 38.46 | 85.09 / 28.65 | 46.25 |

Results. The performance of EAGLE on multiple MIML tasks is shown in Table 3.
We see a 5.6% absolute improvement over Deep-MIML on the Yelp dataset. There is consistent improvement on the image tasks as well. Notice that the improvement for the large group size is much more significant than for the small group size. This is because the original Deep-MIML framework is able to handle the easy MIML tasks well, but is less effective for difficult tasks. Figure 5 visualizes the learned label embeddings for the Fashion-MNIST dataset. All these results corroborate the advantage of using label embeddings at the beginning of feed-forward neural networks in MIML.

Figure 5: Visualization of learned label embeddings on the Fashion-MNIST dataset.

8. Conclusions & Discussion

In this paper, we study XMC with aggregated labels, and propose the first efficient algorithm, EAGLE, that advances standard XMC methods in most settings. Our work leaves open several interesting issues to study in the future. First, while we use positively labeled groups to learn label embeddings, what would be the most efficient way to also learn from, or sample, negatively labeled groups? Second, is there a way to estimate the clustering quality and adjust the hyper-parameters accordingly? Moving forward, we believe the co-attention framework we proposed in Section 6 can help design deeper neural network architectures for MIML with better performance and interpretability.

References

Abid, A., Poon, A., and Zou, J. Linear regression with shuffled labels. arXiv preprint arXiv:1705.01342, 2017.

Agrawal, R., Gupta, A., Prabhu, Y., and Varma, M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web, pp. 13-24, 2013.

Babbar, R. and Schölkopf, B. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 721-729, 2017.

Bamler, R. and Mandt, S. Extreme classification via adversarial softmax approximation. In International Conference on Learning Representations, 2020.

Chang, M.-W., Srikumar, V., Goldwasser, D., and Roth, D. Structured output learning with indirect supervision. In International Conference on Machine Learning, pp. 199-206, 2010.

Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. X-BERT: Extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331, 2019.

Collier, O. and Dalalyan, A. S. Minimax rates in permutation estimation for feature matching. The Journal of Machine Learning Research, 17(1):162-192, 2016.

Denton, E., Weston, J., Paluri, M., Bourdev, L., and Fergus, R. User conditional hashtag prediction for images. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1731-1740, 2015.

Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31-71, 1997.

Feng, J. and Zhou, Z.-H. Deep MIML network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121-2129, 2013.
Guo, C., Mousavi, A., Wu, X., Holtmann-Rice, D. N., Kale, S., Reddi, S., and Kumar, S. Breaking the glass ceiling for embedding-based classifiers for large output spaces. In Advances in Neural Information Processing Systems, pp. 4944-4954, 2019.

Haghighatshoar, S. and Caire, G. Signal recovery from unlabeled samples. IEEE Transactions on Signal Processing, 66(5):1242-1257, 2017.

Hsu, D. J., Shi, K., and Sun, X. Linear regression without correspondence. In Advances in Neural Information Processing Systems, pp. 1531-1540, 2017.

Ilse, M., Tomczak, J., and Welling, M. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pp. 2132-2141, 2018.

Jain, H., Prabhu, Y., and Varma, M. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935-944, 2016.

Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 528-536. ACM, 2019.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744-3753, 2019.

Liu, J., Chang, W.-C., Wu, Y., and Yang, Y. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115-124, 2017.

Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447-461, 2015.

Maron, O. and Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, pp. 570-576, 1998.

McCallum, A. Multi-label text classification with a mixture model trained by EM. In AAAI Workshop on Text Learning, pp. 1-7, 1999.

Medini, T. K. R., Huang, Q., Wang, Y., Mohan, V., and Shrivastava, A. Extreme classification in log memory using count-min sketch: A case study of Amazon search with 50M products. In Advances in Neural Information Processing Systems, pp. 13244-13254, 2019.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems, pp. 1196-1204, 2013.

Pananjady, A., Wainwright, M. J., and Courtade, T. A. Denoising linear models with permuted data. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 446-450. IEEE, 2017a.

Pananjady, A., Wainwright, M. J., and Courtade, T. A. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286-3300, 2017b.

Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M.-R., and Galinari, P. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.

Prabhu, Y. and Varma, M. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 263-272, 2014.
Prabhu, Y., Kag, A., Gopinath, S., Dahiya, K., Harsola, S., Agrawal, R., and Varma, M. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 441-449. ACM, 2018a.

Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 993-1002. International World Wide Web Conferences Steering Committee, 2018b.

Shen, Y. and Sanghavi, S. Iterative least trimmed squares for mixed linear regression. In Advances in Neural Information Processing Systems, pp. 6076-6086, 2019a.

Shen, Y. and Sanghavi, S. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739-5748, 2019b.

Si, S., Zhang, H., Keerthi, S. S., Mahajan, D., Dhillon, I. S., and Hsieh, C.-J. Gradient boosted decision trees for high dimensional sparse output. In International Conference on Machine Learning, pp. 3182-3190. JMLR.org, 2017.

Siblini, W., Kuntz, P., and Meyer, F. CRAFTML, an efficient clustering-based random forest for extreme multi-label learning. In International Conference on Machine Learning, pp. 4664-4673, 2018.

Wang, H. and Poon, H. Deep probabilistic logic: A unifying framework for indirect supervision. arXiv preprint arXiv:1808.08485, 2018.

Weston, J., Bengio, S., and Usunier, N. WSABIE: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

Wydmuch, M., Jasinska, K., Kuznetsov, M., Busa-Fekete, R., and Dembczynski, K. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 6355-6366, 2018.

Yen, I. E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. PPDSparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 545-553, 2017.

Yen, I. E.-H., Huang, X., Ravikumar, P., Zhong, K., and Dhillon, I. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, pp. 3069-3077, 2016.

You, R., Zhang, Z., Dai, S., and Zhu, S. HAXMLNet: Hierarchical attention network for extreme multi-label text classification. arXiv preprint arXiv:1904.12578, 2019.

Yu, H.-F., Jain, P., Kar, P., and Dhillon, I. Large-scale multi-label learning with missing labels. In International Conference on Machine Learning, pp. 593-601, 2014.

Zhang, M.-L. and Zhou, Z.-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2013.

Zhang, W., Yan, J., Wang, X., and Zha, H. Deep extreme multi-label learning. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, pp. 100-107. ACM, 2018.

Zhang, Z.-L. and Zhang, M.-L. Multi-instance multi-label learning with application to scene classification. In Advances in Neural Information Processing Systems, pp. 1609-1616, 2007.

Zhou, Z.-H., Zhang, M.-L., Huang, S.-J., and Li, Y.-F. Multi-instance multi-label learning. Artificial Intelligence, 176(1):2291-2320, 2012.