# Accelerating Extreme Classification via Adaptive Feature Agglomeration

Ankit Jalan and Purushottam Kar
Department of CSE, IIT Kanpur, INDIA
aankitjalan@gmail.com, purushot@cse.iitk.ac.in

Extreme classification seeks to assign each data point the most relevant labels from a universe of a million or more labels. This task faces the dual challenge of high precision and scalability, with millisecond-level prediction times being a benchmark. We propose DEFRAG, an adaptive feature agglomeration technique to accelerate extreme classification algorithms. DEFRAG distinguishes itself from past works on feature clustering and selection by scaling to millions of features, and is especially beneficial when feature sets are sparse, as is typical of recommendation and multi-label datasets. The method comes with provable performance guarantees and performs efficient, task-driven agglomeration to reduce feature dimensionality by an order of magnitude or more. Experiments show that DEFRAG not only reduces the training and prediction times of several leading extreme classification algorithms by as much as 40%, but can also be used for feature reconstruction to address the problem of missing features, as well as to offer superior coverage of rare labels.

## 1 Introduction

The task of assigning data points one or more labels from a vast universe of millions of labels is often referred to as the extreme classification problem. Although reminiscent of the classical multi-label learning problem, the emphasis on addressing extremely large label spaces distinguishes extreme classification. Recent advances in extreme classification have allowed problems such as ranking, recommendation and retrieval to be viewed and formulated as multi-label problems, indeed with millions of labels. This focus on extremely large label sets has given us state-of-the-art methods for product recommendation [Jain et al., 2016], search advertising [Prabhu et al., 2018], and video recommendation [Weston et al., 2013], and has also led to advances in our understanding of scalable optimization [Prabhu et al., 2018] and of distributed and parallel processing [Yen et al., 2017; Babbar and Schölkopf, 2017]. Recent advances have utilized a variety of techniques, such as label embeddings, random forests and binary relevance, which we review in Section 3. Nevertheless, extreme classification algorithms continue to face several challenges, enumerated below.

1. Precision: data points often have very few labels relevant to them (e.g. of the millions of products on sale, very few would interest any given customer). It is challenging to accurately identify these few relevant labels among the millions of irrelevant ones.
2. Prediction: for use in live recommendation systems, extremely rapid predictions are expected, typically within milliseconds. This often restricts the algorithmic techniques that can be used to computationally frugal ones.
3. Processing: extreme classification datasets often contain millions of data points, each itself represented as a million-dimensional vector. It is challenging to offer scalable training on such large datasets.
4. Parity: huge label sets exhibit power-law behavior, with most labels being rare, i.e. relevant to very few data points. This causes algorithms to focus only on popular labels and neglect the vast majority of rare ones, which is detrimental for recommendation outcomes.
### 1.1 Our Contributions

In this work, we develop the DEFRAG method and its variants to address these specific challenges for a large family of algorithms. Our contributions are summarized below.

1. We propose the DEFRAG algorithm, which accelerates extreme classification algorithms by performing efficient feature agglomeration on datasets with millions of features. DEFRAG performs agglomeration by constructing a balanced hierarchy, which offers faster and better agglomerates than traditional clustering methods.
2. We show that DEFRAG provably preserves the performance of several extreme classification algorithms.
3. We exploit DEFRAG's agglomerates in a novel manner to develop the REFRAG algorithm, which addresses the parity problem by performing efficient label re-ranking. This vastly improves the coverage of existing algorithms by accurately predicting extremely rare labels.
4. We develop the FIAT algorithm to perform scalable feature imputation, which preserves prediction accuracy even when a large fraction of data features are removed.
5. We perform extensive experimentation on large-scale datasets to establish that DEFRAG not only offers significant reductions in training and prediction times, but that it does so with little or no reduction in precision.

## 2 Problem Formulation and Notation

The training data is provided as $n$ labeled data points $(x^i, y^i)$, $i = 1, \ldots, n$, where $x^i \in \mathbb{R}^d$ is the feature vector and $y^i \in \{0, 1\}^L$ is the label vector. There may be several (up to $L$) labels associated with each data point. Extreme classification datasets exhibit extreme sparsity in feature and label vectors. Let $\hat{d}$ denote the average number of non-zero features per data point and $\hat{L}$ denote the average number of active labels per data point. Section 6 shows that $\hat{d} \ll d$ and $\hat{L} \ll L$. We denote the feature matrix by $X = [x^1, \ldots, x^n] \in \mathbb{R}^{d \times n}$ and the label matrix by $Y = [y^1, \ldots, y^n] \in \{0, 1\}^{L \times n}$.

Let $\mathcal{F} = \{F_1, \ldots, F_K\}$ denote any $K$-partition of the feature set $[d]$, i.e. $F_i \cap F_j = \emptyset$ if $i \neq j$ and $\bigcup_{k=1}^K F_k = [d]$. Let $d_k := |F_k|$ denote the size of the $k$th cluster. For any vector $z \in \mathbb{R}^d$, let $z_j$ denote its $j$th coordinate. For any set $F_k \in \mathcal{F}$, let $z_{F_k} := [z_j]_{j \in F_k} \in \mathbb{R}^{d_k}$ denote the (shorter) vector containing only the coordinates from the set $F_k$.

Feature Agglomeration. This involves creating clusters of features and then summing up the features within each cluster. If $\mathcal{F}$ is a partition of the features $[d]$, then for every cluster $F_k \in \mathcal{F}$, we create a single "super-feature". Thus, given a vector $z \in \mathbb{R}^d$, we can create an agglomerated vector $z[\mathcal{F}] \in \mathbb{R}^K$ (abbreviated to just $\tilde{z}$ for the sake of notational simplicity) with just $K$ features using the clustering $\mathcal{F}$. The $k$th dimension of $\tilde{z}$ is $\tilde{z}_k = \sum_{j \in F_k} z_j$ for $k = 1, \ldots, K$. The DEFRAG algorithm automatically learns relevant feature clusters $\mathcal{F}$.
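To make the agglomeration operation concrete, here is a minimal sketch (not the authors' code) of how a fixed feature partition can be applied to a sparse feature matrix. The function and variable names (`agglomerate`, `cluster_of`) are our own, chosen for illustration; the partition itself would normally be learnt by DEFRAG.

```python
# Minimal sketch of the feature agglomeration step from Section 2: features within
# each cluster of a K-partition are summed into one "super-feature".
import numpy as np
import scipy.sparse as sp

def agglomerate(X, cluster_of):
    """Agglomerate the features of X (d x n, one column per data point).

    cluster_of[j] = k means that feature j belongs to cluster F_k (0-based).
    Returns a K x n sparse matrix whose k-th row equals sum_{j in F_k} X[j, :].
    """
    d = X.shape[0]
    K = int(cluster_of.max()) + 1
    # Sparse indicator matrix A (K x d) with A[k, j] = 1 iff feature j is in cluster F_k.
    A = sp.csr_matrix((np.ones(d), (cluster_of, np.arange(d))), shape=(K, d))
    # Summing features within each cluster is a single sparse matrix product.
    return A @ X

# Toy usage: d = 6 features, n = 2 data points, K = 3 clusters.
X = sp.random(6, 2, density=0.5, format="csc", random_state=0)
cluster_of = np.array([0, 0, 1, 1, 2, 2])
X_agg = agglomerate(X, cluster_of)   # shape (3, 2)
assert X_agg.nnz <= X.nnz            # agglomeration never densifies a vector
```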
## 3 Related Works

We discuss relevant works in extreme classification, as well as scalable clustering and feature agglomeration techniques, here.

Binary Relevance. Also known as one-vs-all methods, these techniques, for example DiSMEC [Babbar and Schölkopf, 2017], PPDSparse [Yen et al., 2017], and ProXML [Babbar and Schölkopf, 2019], learn $L$ binary classifiers: for each label $l \in [L]$, a binary classifier is learnt to distinguish data points that contain label $l$ from those that do not. Binary relevance methods offer some of the highest precision values among extreme classification algorithms [Prabhu et al., 2018]. However, despite advances in parallel training and active-set methods, they still incur training and prediction times that are prohibitive for most applications.

Label/Feature Embedding. These techniques project feature and/or label vectors onto a low-dimensional space, i.e. $x^i \mapsto \hat{x}^i$, $y^i \mapsto \hat{y}^i$ where $\hat{x}^i, \hat{y}^i \in \mathbb{R}^p$ and $p \ll \min\{d, L\}$, using random or learnt projections. Prediction and training are performed in the low-dimensional space $\mathbb{R}^p$ for speed. These methods, such as SLEEC [Bhatia et al., 2015], AnnexML [Tagami, 2017] and LEML [Yu et al., 2014], offer strong theoretical guarantees, but are usually forced to choose a moderate value of $p$ to maintain scalability. This often results in low precision values and causes these methods to struggle on rare labels.

Figure 1: An illustration of the feature agglomeration process, in which a partition $\mathcal{F} = \{F_1, \ldots, F_K\}$ of the features is used to create $K$ super-features.

Data Partitioning. These techniques learn a decision tree over the data points, which are hierarchically clustered into several leaves, with the hope that similar data points, i.e. those with similar label vectors, end up in the same leaf. A simple classifier (usually constant) performs label prediction at a leaf. These methods, such as PfastreXML [Jain et al., 2016], FastXML [Prabhu and Varma, 2014] and CRAFTML [Siblini et al., 2018], offer fast prediction times since prediction is logarithmic in the number of leaves of a balanced tree.

Label Partitioning. These methods instead learn to organize labels into (overlapping) clusters, using hierarchical partitioning techniques. Prediction is done by taking a data point to one or more of the leaves of the tree and using a simple method, such as 1-vs-all among the labels present at that leaf. These methods, such as PLT [Jasinska et al., 2016], Parabel [Prabhu et al., 2018] and LPSR [Weston et al., 2013], offer fast prediction times due to the tree structure, as well as high precision by using a 1-vs-all classifier at the leaves, as Parabel does.

Large Scale (Feature) Clustering. Clustering, as well as feature clustering and agglomeration, are well-studied topics. Past works include techniques for scalable balanced k-means using alternating minimization, such as SCBC [Banerjee and Ghosh, 2006] and BCLS [Liu et al., 2017], scalable spectral clustering using landmarking, such as LSC [Chen and Cai, 2011], and scalable information-theoretic clustering, such as ITDC [Dhillon et al., 2003]. We compare DEFRAG against all these algorithms; they were chosen since they were able to scale to at least the smallest datasets in our experiments.

## 4 DEFRAG: aDaptive Extreme FeatuRe AGglomeration

We now describe the DEFRAG method, discuss its key advantages, and then develop the REFRAG method for rare-label prediction and the FIAT method for feature imputation. Recall from Section 2 that, given a $K$-partition $\mathcal{F}$ of the features $[d]$, feature agglomeration takes each cluster $F_k \in \mathcal{F}$ and agglomerates all features $j \in F_k$ by summing up their feature values. Given a dataset with $d$-dimensional features, DEFRAG first clusters these features into balanced clusters, with each cluster containing, say, no more than $d_0$ features. Suppose this process results in $K$ clusters. DEFRAG then uses feature agglomeration (see Section 2 and Figure 1) to obtain $K$-dimensional features for all data points in the dataset, which are then used for training and testing.
Algorithm 1 DEFRAG: Make-Tree
Input: feature set $S \subseteq [d]$, representative vectors $z_i \in \mathbb{R}^p$ for each feature $i \in S$, maximum leaf size $d_0$
Output: a tree with each leaf holding up to $d_0$ features
1: if $|S| \le d_0$ then
2:   $n \leftarrow$ Make-Leaf($S$)  // no need to split
3: else
4:   $n \leftarrow$ Make-Internal-Node()
5:   $\{S_+, S_-\} \leftarrow$ Balanced-Split($S, \{z_i, i \in S\}$)  // balanced k-means or nDCG split
6:   $n_+ \leftarrow$ Make-Tree($S_+, \{z_i, i \in S_+\}, d_0$)
7:   $n_- \leftarrow$ Make-Tree($S_-, \{z_i, i \in S_-\}, d_0$)
8:   $n$.Left-Child $\leftarrow n_+$
9:   $n$.Right-Child $\leftarrow n_-$
10: end if
11: return the root node $n$ of this tree

DEFRAG first creates a representative vector for each feature $j \in [d]$ and then performs hierarchical clustering on these vectors (see Algorithm 1) to obtain feature clusters. At each internal node of the hierarchy, the features at that node are split into two children nodes of equal size, either by solving a balanced spherical 2-means problem or by minimizing a ranking loss such as nDCG [Prabhu and Varma, 2014], a variant we call DEFRAG-N (please see the full version for details). This process continues until a node is left with at most $d_0$ features, in which case the node is made a leaf. We now discuss two methods to construct the representative vectors.

DEFRAG-X. This variant clusters together co-occurring features, e.g. $j, j' \in [d]$ such that data points with a non-zero (or high) value for feature $j$ also have a non-zero (or high) value for feature $j'$. DEFRAG-X represents each feature $j \in [d]$ as an $n$-dimensional vector $p_j = [x^1_j, \ldots, x^n_j] \in \mathbb{R}^n$, i.e. as the list of values that feature takes across all data points.

DEFRAG-XY. This variant clusters together co-predictive features, e.g. $j, j' \in [d]$ such that data points where feature $j$ is non-zero have similar labels to data points where feature $j'$ is non-zero. To do so, DEFRAG-XY represents each feature $j \in [d]$ as an $L$-dimensional vector $q_j = \sum_{i=1}^n x^i_j y^i \in \mathbb{R}^L$, essentially a weighted aggregate of the label vectors of the data points where feature $j$ is non-zero.
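The following is a minimal sketch (not the authors' implementation) of this clustering step: it builds one representative vector per feature (DEFRAG-X uses the rows of $X$, DEFRAG-XY the rows of $XY^\top$) and recursively splits the features into two equal halves until at most $d_0$ features remain. The helper names (`representatives`, `balanced_split`, `make_tree`) and the sort-based rebalancing rule are our own simplifications of Algorithm 1's Balanced-Split routine, whose exact form is left to the full version.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

def representatives(X, Y=None):
    """Rows are per-feature representatives: p_j (rows of X) or q_j (rows of X Y^T)."""
    X = sp.csr_matrix(X)
    return X if Y is None else X @ sp.csr_matrix(Y).T

def balanced_split(Z, iters=10, rng=np.random.default_rng(0)):
    """Split the rows of Z into two equal halves via spherical 2-means with rebalancing."""
    Zn = normalize(Z)                                   # unit-normalize rows (cosine similarity)
    centers = Zn[rng.choice(Zn.shape[0], 2, replace=False)].toarray()
    for _ in range(iters):
        sim = np.asarray(Zn @ centers.T)                # similarity of every row to both centers
        order = np.argsort(-(sim[:, 0] - sim[:, 1]))    # most "left-like" rows first
        left, right = order[: Zn.shape[0] // 2], order[Zn.shape[0] // 2 :]
        for c, idx in enumerate((left, right)):         # recompute normalized cluster centers
            mu = np.asarray(Zn[idx].mean(axis=0)).ravel()
            centers[c] = mu / (np.linalg.norm(mu) + 1e-12)
    return left, right

def make_tree(feat_ids, Z, d0=8):
    """Return the leaves (feature clusters) of the balanced hierarchy of Algorithm 1."""
    if len(feat_ids) <= d0:
        return [feat_ids]
    left, right = balanced_split(Z[feat_ids])
    return make_tree(feat_ids[left], Z, d0) + make_tree(feat_ids[right], Z, d0)

# Toy usage: 32 features, 100 data points, 5 labels; DEFRAG-XY style representatives.
X = sp.random(32, 100, density=0.1, format="csr", random_state=0)
Y = sp.random(5, 100, density=0.3, format="csr", random_state=1)
clusters = make_tree(np.arange(32), representatives(X, Y), d0=8)
```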
### 4.1 Advantages of DEFRAG

DEFRAG suits high-dimensional, sparse data better than dimensionality reduction techniques such as PCA or random projections.

1. Applying feature agglomeration to a vector involves only summing up coordinates of that vector and is much cheaper than performing PCA or a random projection.
2. Linear projections densify vectors, so methods like LEML and SLEEC are compelled to use a small embedding dimension (around 500) for the sake of scalability, which leads to information loss. Feature agglomeration, however, does not densify vectors: if a vector $x \in \mathbb{R}^d$ has only $s$ non-zero coordinates, then for any feature $K$-clustering $\mathcal{F}$, the vector $x[\mathcal{F}] \in \mathbb{R}^K$ cannot have more than $s$ non-zero coordinates. This allows DEFRAG to operate with relatively large values of $K$ (e.g. $K = d/8$ is the default in our experiments since we set $d_0 = 8$) without worrying about memory or time issues. DEFRAG thus offers mildly agglomerated vectors which preserve much of the information of the original vector, yet offer speedups due to the reduced dimensionality.
3. Feature agglomeration has an implicit weight-tying effect: a linear model learnt over the agglomerated features effectively places the same weight on all features belonging to a cluster $F_k \in \mathcal{F}$. This reduces the capacity of the model and can improve generalization error. Section 5 shows that feature agglomeration using a feature clustering $\mathcal{F}$ with small clustering error provably preserves the performance of all linear models. Specifically, for every model $w \in \mathbb{R}^d$ over the original vectors, there must exist a model $\tilde{w} \in \mathbb{R}^K$ over the agglomerated vectors such that for any vector $x \in \mathbb{R}^d$, we have $w^\top x \approx \tilde{w}^\top x[\mathcal{F}]$. Thus, similar 1-vs-all models, trees and label partitions can be learnt over $x[\mathcal{F}]$.

We note that all the algorithms discussed in Section 3 ultimately use just linear models as components (e.g. binary relevance methods learn $L$ linear classifiers, one per label; embedding methods learn linear projections; data and label partitioning methods learn linear models to split internal tree nodes; etc.).

Task adaptivity. DEFRAG-XY takes labels into account in its feature representation, which makes it task-adaptive, unlike dimensionality reduction or clustering methods such as k-means and PCA, which do not consider labels. Indeed, we will see that on many datasets DEFRAG-XY outperforms DEFRAG-X, which also does not take labels into account.

Novelty, Speed and Scalability. Hierarchical feature agglomeration is novel in the context of extreme classification, although hierarchical data partitioning (Parabel) and hierarchical label partitioning (PfastreXML) have been successfully attempted before. The representative vectors created by DEFRAG are themselves sparse, and hierarchical feature agglomeration offers speedy feature clustering. DEFRAG's overhead on the training process is thus very small.

Time Complexity. Let $\mathrm{nnz}(X) = n\hat{d}$ be the number of non-zero elements in the feature matrix $X$. Computing the feature representations $p_j$, $j \in [d]$, takes $O(\mathrm{nnz}(X))$ time. The total time taken to perform balanced spherical 2-means clustering for all nodes at a given level of the tree is $O(\mathrm{nnz}(X))$ as well. Since DEFRAG performs balanced splits, there can be at most $O(\log d)$ levels in the tree, giving a total time complexity of $O(\mathrm{nnz}(X) \log d)$.

### 4.2 FIAT: Feature Imputation via AgglomeraTion

Co-occurrence based feature imputation is popularly used to overcome the problem of missing features. However, this becomes prohibitive in extreme classification settings since the standard co-occurrence matrix $C = XX^\top$ is too dense to store and operate with. We exploit the feature clusters offered by DEFRAG to create FIAT, a scalable co-occurrence based feature imputation algorithm. For any feature cluster $F_k \in \mathcal{F}$, let $X_{F_k} \in \mathbb{R}^{d \times n}$ denote the matrix that retains only those rows of $X$ that belong to the cluster $F_k$ (with all other rows zeroed out). Given this, we compute a pseudo co-occurrence matrix $C_{\mathcal{F}} = \sum_{k=1}^K X_{F_k} X_{F_k}^\top \in \mathbb{R}^{d \times d}$. Note that $C_{\mathcal{F}}$ has a block-diagonal structure with only up to $d^2/K$ non-zero entries, where $K$ is the number of clusters. Thus, it is much cheaper to store and operate with. Given a feature vector $x \in \mathbb{R}^d$ that we suspect has missing features, we perform feature imputation on it by simply calculating $C_{\mathcal{F}} x$.

### 4.3 REFRAG: REranking via FeatuRe AGglomeration

Algorithms frequently neglect rare labels (those that occur in very few data points) in favor of popular ones [Wei and Li, 2018]. To address this, we propose an efficient re-ranking solution based on the pseudo co-occurrence matrix $C_{\mathcal{F}}$ described earlier. First compute the matrix product $C_{\mathcal{F}} X Y^\top \in \mathbb{R}^{d \times L}$. The $l$th column of this matrix, $l \in [L]$, can be interpreted as a prototype data point $\xi^l \in \mathbb{R}^d$ for the label $l$. These prototypes can be used to compute the affinity score of a test data point $x^t$ to a label $l \in [L]$ as $\exp(-\gamma \|x^t - \xi^l\|_2)$. Once a base classification algorithm such as Parabel or DiSMEC has given scores for the test point $x^t$ with respect to the various labels, instead of predicting the highest-scoring labels right away, we combine the classifier scores with these affinity scores and then make the predictions.
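The following is a minimal sketch (not the authors' code) of FIAT and REFRAG built on top of a DEFRAG clustering. The helper names, the value of `gamma`, and the way classifier scores are combined with affinities (a simple product below) are our own illustrative assumptions; the paper only states that the two kinds of scores are combined.

```python
import numpy as np
import scipy.sparse as sp

def pseudo_cooccurrence(X, clusters):
    """C_F = sum_k X_{F_k} X_{F_k}^T, assembled as one sparse block-diagonal matrix."""
    blocks = [X[F_k] @ X[F_k].T for F_k in clusters]   # one d_k x d_k block per cluster
    order = np.concatenate(clusters)                   # block ordering of the original feature ids
    P = sp.csr_matrix((np.ones(len(order)), (order, np.arange(len(order)))),
                      shape=(X.shape[0], len(order)))  # permutation back to original ids
    return P @ sp.block_diag(blocks, format="csr") @ P.T

def fiat_impute(C_F, x):
    """FIAT: impute missing features of x within its clusters via C_F x."""
    return C_F @ x

def refrag_rerank(C_F, X, Y, x_t, clf_scores, gamma=1e-3):
    """REFRAG: combine base-classifier scores with prototype affinities (toy, dense inputs)."""
    prototypes = C_F @ X @ Y.T                         # d x L, one prototype per label
    diff = prototypes - x_t.reshape(-1, 1)             # broadcast the test point against all labels
    affinity = np.exp(-gamma * np.linalg.norm(diff, axis=0))
    return clf_scores * affinity                       # assumed combination rule (illustrative)

# Toy usage with d = 6 features, n = 20 points, L = 4 labels, 3 feature clusters.
rng = np.random.default_rng(0)
X = sp.random(6, 20, density=0.4, format="csr", random_state=0)
Y = sp.random(4, 20, density=0.5, format="csr", random_state=1)
clusters = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
C_F = pseudo_cooccurrence(X, clusters)
x_filled = fiat_impute(C_F, rng.random(6))
scores = refrag_rerank(C_F.toarray(), X.toarray(), Y.toarray(),
                       x_t=rng.random(6), clf_scores=rng.random(4))
```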
We note that a similar approach was proposed by [Jain et al., 2016], who did achieve enhanced performance on rare labels. However, whereas their method requires an optimization problem to be solved to obtain the prototypes, we have a closed-form expression for the prototypes in our model, given the efficiently computable pseudo co-occurrence matrix $C_{\mathcal{F}}$. Due to lack of space, further algorithmic details as well as proofs of the theorems in Section 5 are presented in the full version of the paper, available at the URL given below.

Full Version Link: http://arxiv.org/abs/1905.11769

## 5 Performance Guarantees

In this section we establish that DEFRAG provably preserves the performance of extreme classification algorithms. For any vector $v \in \mathbb{R}^p$ we will utilize the orthogonal decomposition $v = v^\parallel + v^\perp$, where $v^\parallel$ is the component of $v$ along the all-ones vector $\mathbf{1}_p = (1, \ldots, 1)^\top \in \mathbb{R}^p$ and $v^\perp$ is the component orthogonal to it, i.e. $\mathbf{1}_p^\top v^\perp = 0$. At the core of our results is the following lemma. Given a real-valued matrix $Z \in \mathbb{R}^{d \times p}$ for some $p > 0$ and a $K$-partition $\mathcal{F}$ of the feature set $[d]$, we let $Z_k \in \mathbb{R}^{d_k \times p}$ denote the matrix formed out of the rows of $Z$ that correspond to the partition $F_k$.

Lemma 1. Given any matrix $Z \in \mathbb{R}^{d \times p}$ and any $K$-partition $\mathcal{F} = \{F_1, \ldots, F_K\}$ of $[d]$, suppose there exist vectors $\mu^1, \ldots, \mu^K \in \mathbb{R}^p$ such that $Z_k = \mathbf{1}_{d_k} (\mu^k)^\top + \Delta_k$, where $\mathbf{1}_{d_k} := (1, \ldots, 1)^\top \in \mathbb{R}^{d_k}$. Then for every $w \in \mathbb{R}^d$ and every $k \in [K]$, there must exist a real value $c_{w,k} \in \mathbb{R}$ such that
$$(w_{F_k} - c_{w,k}\,\mathbf{1}_{d_k})^\top Z_k Z_k^\top (w_{F_k} - c_{w,k}\,\mathbf{1}_{d_k}) \le \|\Delta_k\|_F^2\, \|w_{F_k}\|_2^2.$$

Lemma 1 will be used below to show that, if a group of features is well clustered, then it is possible to tie together the weights corresponding to those features in every linear model.

Theorem 2. Upon executing DEFRAG-X with a feature matrix $X = [x^1, \ldots, x^n]$ and label matrix $Y = [y^1, \ldots, y^n]$, suppose we obtain a feature $K$-partition $\mathcal{F} = \{F_1, \ldots, F_K\}$ with $\mathrm{err}_k$ denoting the Euclidean clustering error within the $k$th cluster. Then for any loss function $\ell(\cdot)$ that is $L$-Lipschitz and for every linear model $w \in \mathbb{R}^d$, there must exist a model $\tilde{w} \in \mathbb{R}^K$ such that for all subsets of data points $S \subseteq [n]$, writing $\tilde{x}^i := x^i[\mathcal{F}]$,
$$\sqrt{\sum_{i \in S} \big(\ell(w^\top x^i; y^i) - \ell(\tilde{w}^\top \tilde{x}^i; y^i)\big)^2} \le L \sum_{k=1}^K \|w_{F_k}\|_2 \sqrt{\mathrm{err}_k}.$$

To simplify this result, let $w_0 = \max_{k \in [K]} \|w_{F_k}\|_2$ (since cluster sizes $d_k$ are typically small, $w_0$ is small too) and use the fact that $\mathrm{err}_k \ge 0$ to get
$$\sqrt{\sum_{i \in S} \big(\ell(w^\top x^i; y^i) - \ell(\tilde{w}^\top \tilde{x}^i; y^i)\big)^2} \le L\, w_0 \sum_{k=1}^K \sqrt{\mathrm{err}_k}.$$

In the full version, we show that DEFRAG-XY preserves the performance of label clustering methods such as Parabel. A few points are notable about the above results.

Uniform Model Preservation. Theorem 2 guarantees that if the clustering error is small (and DEFRAG does minimize clustering error), then for every possible linear model $w \in \mathbb{R}^d$ over the original features $x^i$, we can learn a model $\tilde{w} \in \mathbb{R}^K$ over the agglomerated features $\tilde{x}^i$ such that both models behave similarly with respect to any Lipschitz loss function. Note that Theorem 2 holds simultaneously for all linear models $w$, and is thus agnostic to the algorithm used to learn $w$.

Classifier Preservation. Most leading algorithms (Parabel, DiSMEC, PfastreXML, PPDSparse, SLEEC) construct classifiers by learning several linear models using the hinge loss, exponential loss, etc., all of which are Lipschitz. By preserving the performance of all such individual linear models, DEFRAG preserves the overall performance of these algorithms too. Note that Theorem 2 holds uniformly over all subsets $S \subseteq [n]$ of data points, which is useful since these algorithms often learn several linear models on various subsets of the data.
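To see how such a tied model $\tilde{w}$ can arise, the following illustrative calculation (our own, not taken from the paper) reads a tied model off a given $w$ by averaging its weights within each cluster; Lemma 1 only asserts that some constant $c_{w,k}$ exists, and the within-cluster mean is a natural choice for illustration.

```latex
% Illustrative weight-tying sketch (our own): tie weights to the cluster mean and
% decompose the prediction error cluster by cluster.
\begin{align*}
\tilde{w}_k &:= c_{w,k} := \tfrac{1}{d_k} \textstyle\sum_{j \in F_k} w_j, \qquad k = 1, \ldots, K,\\
w^\top x - \tilde{w}^\top x[\mathcal{F}]
  &= \sum_{k=1}^{K} \left( w_{F_k}^\top x_{F_k} - c_{w,k}\, \mathbf{1}_{d_k}^\top x_{F_k} \right)
   = \sum_{k=1}^{K} \left( w_{F_k} - c_{w,k}\, \mathbf{1}_{d_k} \right)^\top x_{F_k}.
\end{align*}
```

Applying Lemma 1 with $Z = X$ (so that $Z_k$ collects the rows of $X$ indexed by $F_k$ and $Z_k Z_k^\top = \sum_i x^i_{F_k} (x^i_{F_k})^\top$) bounds the sum over data points of the squared per-cluster terms by $\|\Delta_k\|_F^2\, \|w_{F_k}\|_2^2$, which is small whenever the features in $F_k$ are well clustered.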
Graceful Adaptivity. Suppose that for a model $w$, the weights within a cluster $F_k$ are similar, i.e. $w_{F_k} \approx \bar{w}\,\mathbf{1}_{d_k}$ for some $\bar{w} \in \mathbb{R}$. This implies $w_{F_k}^\perp \approx \mathbf{0}$, and the contribution of this cluster to the total error will be very small. This indicates that if some of the original weights are anyway tied together, DEFRAG automatically offers extremely accurate reconstructions.

## 6 Experimental Results

We studied the effects of using DEFRAG variants with several extreme classification algorithms, and also compared DEFRAG with other clustering algorithms. Our implementation of DEFRAG is available at the URL given below.

Code Link: https://github.com/purushottamkar/defrag/

Datasets and benchmark implementations. All datasets, train-test splits, and implementations of extreme classification algorithms were sourced from the Extreme Classification Repository [Bhatia et al., 2019]. Implementations of clustering algorithms were sourced from the respective authors whenever possible. For SCBC, LSC and ITDC, public implementations were not available and scalable implementations were created in Python.

| Method | LMI | Bal. | Ent. | Time (min) | P1 (%) | P3 (%) | P5 (%) |
|---|---|---|---|---|---|---|---|
| EURLex-4K | | | | | | | |
| ITDC | 0.47 | Inf | 0.87 | 0.7 | 73.85 | 59.63 | 49.19 |
| SCBC | 0.39 | 321 | 0.75 | 0.55 | 71.96 | 59.50 | 49.41 |
| LSC | 0.60 | 130 | 0.93 | 5.7 | 71.44 | 58.68 | 48.71 |
| BCLS | 0.50 | Inf | 0.91 | 2.5 | 74.14 | 60.96 | 50.53 |
| DEFRAG-X | 0.37 | 1.11 | 0.99 | 0.03 | 78.97 | 65.68 | 54.46 |
| Wiki10-31K | | | | | | | |
| ITDC | 0.52 | Inf | 0.88 | 23 | 82.03 | 69.72 | 60.07 |
| DEFRAG-G | 0.46 | 1.08 | 0.99 | 0.48 | 82.72 | 69.62 | 60.23 |
| DEFRAG-X | 0.36 | 1.08 | 0.99 | 0.45 | 84.99 | 73.47 | 63.91 |

Table 1: A comparison of DEFRAG with other clustering algorithms on clustering quality (definitions of the clustering metrics are in the full version), as measured by loss of mutual information (LMI), balance factor (Bal.), normalized entropy (Ent.) and clustering time, as well as classification performance when the Parabel algorithm was executed on the agglomerated features given by each clustering algorithm. BCLS, LSC and SCBC could not scale to Wiki10. A balance factor of Inf indicates the presence of an empty cluster. DEFRAG not only outperforms the other clustering algorithms in terms of clustering quality and classification accuracy, but also offers clustering times that can be an order of magnitude smaller. DEFRAG-G denotes the DEFRAG-X algorithm executed on word features learnt by the GloVe algorithm [Pennington et al., 2014]; DEFRAG could not be outperformed even by carefully crafted word vector representations like GloVe. The clustering time for DEFRAG-G does not include the time taken to extract (dense) GloVe word features from raw text.

Hyperparameters. Where available, the hyperparameter settings recommended by the authors were used for all methods. Where unavailable, a fine grid search was performed over a reasonable range to offer adequately tuned hyperparameters to the methods. DEFRAG had its only hyperparameter, the maximum size of a feature cluster $d_0$ (see Algorithm 1), fixed to 8.

Comparison with other clustering methods. Table 1 compares DEFRAG with other clustering algorithms on clustering quality, execution time and classification performance (please see the full version for definitions of the clustering metrics). Features were agglomerated according to the feature clusters given by each algorithm and Parabel was executed on them. DEFRAG handily outperforms all other methods.
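As an illustration of how agglomerated features plug into downstream training, the following sketch (our own, not the paper's pipeline) trains a simple one-vs-rest linear classifier as a stand-in for an extreme classifier such as Parabel, whose implementation is distributed separately. The random cluster assignment stands in for a partition learnt by DEFRAG, and all names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n, L, K = 200, 500, 10, 25                    # features, data points, labels, clusters

X = sp.random(d, n, density=0.05, format="csr", random_state=0)   # d x n feature matrix
Y = (rng.random((n, L)) < 0.1).astype(int)                        # n x L multilabel matrix
cluster_of = rng.integers(0, K, size=d)          # stand-in for DEFRAG's learnt partition

# Agglomerate features (same indicator-matrix trick as the sketch in Section 2).
A = sp.csr_matrix((np.ones(d), (cluster_of, np.arange(d))), shape=(K, d))
X_agg = (A @ X).T.tocsr()                        # n x K agglomerated features

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_agg[:400], Y[:400])                    # train on the agglomerated features
scores = clf.predict_proba(X_agg[400:])          # n_test x L label scores
top5 = np.argsort(-scores, axis=1)[:, :5]        # predict the 5 highest-scoring labels
```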
Dataset-wise and method-wise performance. Table 3 presents the outcome of using DEFRAG with several leading algorithms on 8 datasets. On Wiki10 and Delicious, DEFRAG-XY + Parabel offers the best overall performance across all methods. More generally, the table shows 21 instances, across the 8 datasets, of how DEFRAG performs with various algorithms. In 3 of these instances, DEFRAG outperforms the base method (EURLex-PPDSparse, Wiki10-Parabel and Delicious-Parabel); in 7 others, DEFRAG lags by less than 2.5%; in 8 others, it lags by less than 5%. Only in 3 cases is the lag greater than 5%. DEFRAG variants do seem to work best with the Parabel method.

Trade-offs offered by DEFRAG. It is easy to see that if we create a small number of clusters $K$, by setting $d_0$ to a large value, then the agglomerated vectors will be lower dimensional and, as such, offer faster training/prediction and smaller model sizes. However, this may cause a dip in prediction accuracy. Figure 2 shows that DEFRAG variants offer attractive trade-offs in this respect.

[Figure 2 panels: Precision@1 vs. model size (GB) on Delicious-200K, Precision@5 vs. test time on WikiLSHTC-325K, Precision@1 vs. test time on AmazonCat-13K, and Precision@1 vs. train time, comparing Parabel with DEFRAG-X, DEFRAG-X3, DEFRAG-XY and DEFRAG-XY3.]

Figure 2: Trade-offs offered by DEFRAG variants. The maximum cluster size $d_0$ in the DEFRAG variants was varied among the values 32, 16, 8 (default), and 4. The plots show how this affects prediction accuracy, training/test time and model sizes. The black line shows Parabel's default performance, with the black triangle marking its model size, test time, etc. Aggressive clustering (e.g. $d_0 = 32$) gives faster training/prediction times and smaller model sizes, but also some drop in accuracy. DEFRAG-X3 and DEFRAG-XY3 refer to an ensemble of 3 independent realizations of DEFRAG (please see the full version for more details and additional trade-off plots).

| Method | P1 | P3 | P5 | N1 | N3 | N5 |
|---|---|---|---|---|---|---|
| Wiki10-31K | | | | | | |
| PfastreXML | 19.02 | 18.34 | 18.43 | 19.02 | 18.49 | 18.52 |
| REFRAG | 20.56 | 19.51 | 19.26 | 20.56 | 19.74 | 19.54 |
| Delicious-200K | | | | | | |
| PfastreXML | 3.15 | 3.87 | 4.43 | 3.15 | 3.68 | 4.06 |
| REFRAG | 7.34 | 8.05 | 8.66 | 7.34 | 7.86 | 8.27 |

Table 2: REFRAG with propensity-scored metrics. N1, N3 and N5 refer to propensity-weighted nDCG@k. Values taken from [Bhatia et al., 2019].

Rare-label prediction with REFRAG. Table 2 shows that REFRAG offers much better propensity-weighted metrics [Jain et al., 2016] (which down-weigh popular labels and emphasize rare ones) than PfastreXML, which also attempts label re-ranking. Figure 3 shows that REFRAG achieves much better coverage (3.85) than Parabel (1.23) on Delicious and, in general, predicts rare labels far more accurately. Figure 3 also shows that FIAT offers resilience to feature erasures.

## Acknowledgments

The authors thank the reviewers for comments on improving the presentation of the paper. The authors are also thankful to the lab-team members at the CSE department, IIT Kanpur, especially M. Bagga, S. Malhotra, N. Yadav and B. K. Mishra, for their support in running the experiments. P. K. thanks Microsoft Research India and Tower Research for research grants.
[Figure 3 panels: macro precision vs. label percentile on the Delicious (CP: 0.60, CPF: 0.56, CPR: 0.72) and Wiki10 (CP: 1.29, CPF: 2.81, CPR: 4.50) datasets, comparing Parabel, PfastreXML and Parabel + REFRAG-X; and Precision@1 vs. percentage of erased features on Delicious-200K and Wiki10, comparing FIAT, DEFRAG-X3 and Parabel.]

Figure 3: REFRAG offers far superior coverage of rare labels than Parabel on the Wiki10 and Delicious datasets. Please see the full version for definitions of coverage and other details. In the last two plots, a fraction of the features was randomly erased from the feature vectors of all test data points. The FIAT algorithm is more robust to such erasures than the default Parabel algorithm, and the gap between FIAT and Parabel widens as erasures become more common. Please see the full version for more details and plots.

| Method | P1 (%) | P3 (%) | P5 (%) | Total Time (hr) | Train Time (hr) | Test Time (ms) | Model Size (GB) |
|---|---|---|---|---|---|---|---|
| EURLex-4K | | | | | | | |
| PfastreXML | 70.41 | 59.22 | 50.56 | 0.07 | 0.07 | 1.26 | 0.26 |
| DEFRAG-X | 68.50 | 56.57 | 47.78 | 0.05 | 0.05 | 1.55 | 0.19 |
| SLEEC | 72.96 | 56.03 | 45.49 | 0.06 | 0.06 | 1.62 | 0.70 |
| DEFRAG-X | 67.89 | 51.55 | 42.04 | 0.03 | 0.03 | 1.05 | 0.31 |
| DiSMEC | 82.85 | 70.37 | 58.69 | 0.04 | 0.04 | 1.1 | 0.08 |
| DEFRAG-X | 79.12 | 66.39 | 54.97 | 0.02 | 0.02 | 0.5 | 0.02 |
| PPDSparse | 72.90 | 57.1 | 45.8 | 0.010 | 0.010 | 0.03 | 0.01 |
| DEFRAG-X | 71.40 | 57.9 | 47.4 | 0.006 | 0.006 | 0.05 | 0.01 |
| Parabel | 82.28 | 68.81 | 57.58 | 0.010 | 0.010 | 0.73 | 0.03 |
| DEFRAG-X | 78.97 | 65.68 | 54.46 | 0.009 | 0.008 | 0.59 | 0.01 |
| DEFRAG-XY | 79.23 | 65.77 | 54.65 | 0.009 | 0.008 | 0.59 | 0.01 |
| ProXML | 83.40 | 70.90 | 59.10 | - | - | - | - |
| Wiki10-31K | | | | | | | |
| PfastreXML | 75.67 | 64.55 | 57.35 | 0.27 | 0.27 | 11.94 | 1.12 |
| DEFRAG-X | 69.79 | 58.54 | 52.52 | 0.14 | 0.13 | 11.74 | 0.80 |
| SLEEC | 84.28 | 72.05 | 61.80 | 0.38 | 0.38 | 6.00 | 3.9 |
| DEFRAG-X | 83.87 | 70.35 | 59.76 | 0.17 | 0.17 | 3.50 | 2.0 |
| DiSMEC | 84.12 | 74.71 | 65.94 | 1.48 | 1.48 | 42 | 7.1 |
| DEFRAG-X | 82.30 | 72.14 | 63.78 | 0.66 | 0.65 | 15 | 1.5 |
| PPDSparse | 74.68 | 60.03 | 49.12 | 0.59 | 0.59 | 2.2 | 0.04 |
| DEFRAG-X | 63.55 | 50.42 | 41.20 | 0.50 | 0.50 | 3.7 | 0.03 |
| Parabel | 84.19 | 72.46 | 63.37 | 0.13 | 0.13 | 2.04 | 0.18 |
| DEFRAG-X | 84.99 | 73.47 | 63.91 | 0.10 | 0.09 | 1.46 | 0.14 |
| DEFRAG-XY | 85.08 | 73.76 | 64.06 | 0.11 | 0.09 | 1.47 | 0.14 |
| Amazon-670K | | | | | | | |
| PfastreXML | 36.90 | 34.22 | 32.10 | 5.70 | 5.70 | 6.10 | 10.98 |
| DEFRAG-X | 32.67 | 30.27 | 28.40 | 2.70 | 2.69 | 7.21 | 9.40 |
| SLEEC | 32.48 | 28.87 | 26.31 | 2.22 | 2.22 | 1.43 | 8.0 |
| DEFRAG-X | 31.40 | 28.04 | 25.69 | 1.63 | 1.63 | 1.62 | 4.2 |
| Parabel | 44.92 | 39.77 | 35.98 | 0.24 | 0.24 | 0.81 | 1.94 |
| DEFRAG-X | 42.71 | 37.71 | 33.93 | 0.23 | 0.21 | 0.76 | 1.68 |
| DEFRAG-XY | 42.62 | 37.72 | 33.94 | 0.23 | 0.21 | 0.77 | 1.66 |
| DiSMEC | 44.70 | 39.70 | 36.10 | - | - | - | - |
| ProXML | 43.50 | 38.70 | 35.30 | - | - | - | - |
| AmazonCat-13K | | | | | | | |
| PfastreXML | 85.56 | 75.19 | 62.84 | 11.66 | 11.66 | 0.54 | 19.02 |
| DEFRAG-X | 84.71 | 73.48 | 61.19 | 7.01 | 7.00 | 0.53 | 16.17 |
| DiSMEC | 93.80 | 79.07 | 64.05 | 6.68 | 6.68 | 1.45 | 6.0 |
| DEFRAG-X | 89.39 | 74.90 | 60.67 | 4.19 | 4.18 | 0.89 | 1.1 |
| Parabel | 93.06 | 79.15 | 64.51 | 0.43 | 0.43 | 0.62 | 0.61 |
| DEFRAG-X | 91.70 | 77.25 | 62.79 | 0.44 | 0.40 | 0.57 | 0.39 |
| DEFRAG-XY | 92.36 | 78.20 | 63.55 | 0.43 | 0.40 | 0.58 | 0.38 |
| Delicious-200K | | | | | | | |
| Parabel | 46.86 | 40.08 | 36.70 | 5.33 | 5.33 | 2.22 | 6.36 |
| DEFRAG-X | 47.23 | 40.53 | 37.19 | 3.23 | 3.16 | 1.05 | 4.83 |
| DEFRAG-XY | 47.61 | 40.90 | 37.66 | 3.34 | 3.12 | 1.06 | 4.76 |
| PfastreXML | 41.72 | 37.83 | 35.58 | - | - | - | - |
| DiSMEC | 45.50 | 38.70 | 35.50 | - | - | - | - |
| WikiLSHTC-325K | | | | | | | |
| PfastreXML | 58.47 | 37.70 | 27.57 | 11.23 | 11.23 | 2.66 | 14.20 |
| DEFRAG-X | 50.86 | 32.08 | 23.40 | 6.03 | 6.03 | 2.14 | 12.82 |
| Parabel | 65.04 | 43.24 | 32.05 | 0.58 | 0.58 | 0.91 | 3.09 |
| DEFRAG-X | 59.49 | 39.25 | 29.20 | 0.56 | 0.50 | 0.79 | 2.50 |
| DEFRAG-XY | 61.38 | 40.42 | 29.99 | 0.54 | 0.50 | 0.78 | 2.44 |
| PPDSparse | 64.08 | 41.26 | 30.12 | - | - | - | - |
| DiSMEC | 64.40 | 42.50 | 31.50 | - | - | - | - |
| ProXML | 63.60 | 41.50 | 30.80 | - | - | - | - |
| Wikipedia-500K | | | | | | | |
| Parabel | 68.70 | 49.57 | 38.64 | 5.13 | 5.13 | 3.11 | 5.68 |
| DEFRAG-X | 65.15 | 44.96 | 34.85 | 3.27 | 3.14 | 1.62 | 5.25 |
| DEFRAG-XY | 64.73 | 44.79 | 34.76 | 3.31 | 3.20 | 1.62 | 5.22 |
| DiSMEC | 70.20 | 50.60 | 39.70 | - | - | - | - |
| ProXML | 69.00 | 49.10 | 38.80 | - | - | - | - |
| Amazon-3M | | | | | | | |
| Parabel | 47.42 | 44.66 | 42.55 | 3.14 | 3.14 | 0.73 | 31.43 |
| DEFRAG-X | 45.68 | 42.85 | 40.76 | 2.93 | 2.83 | 0.66 | 25.34 |
| DEFRAG-XY | 45.11 | 42.36 | 40.30 | 2.87 | 2.80 | 0.63 | 25.22 |
| PfastreXML | 43.83 | 41.81 | 40.09 | - | - | - | - |

Table 3: DEFRAG's performance when used with various extreme classification algorithms. Each DEFRAG-X/DEFRAG-XY row reports the base method from the row above it executed on DEFRAG's agglomerated features; the total time for DEFRAG rows includes the clustering as well as the training time. On EURLex, Wiki10 and Delicious, DEFRAG actually achieves better classification accuracy than the base classifier itself. DEFRAG allows expensive 1-vs-all methods like DiSMEC and PPDSparse to be executed in a scalable manner, with training time reductions of up to 40% on AmazonCat and model size reductions of up to 20% on WikiLSHTC. Precision values on certain datasets are reported for the sake of easy comparison; these were not obtained in our experiments and are sourced from the original publications: [Babbar and Schölkopf, 2017], [Babbar and Schölkopf, 2019], [Bhatia et al., 2019]. For all datasets but Wikipedia-500K and Amazon-3M, DEFRAG was executed as an ensemble of 3 independent realizations (details in the full version). Except for DiSMEC on AmazonCat (which required 12 cores to execute scalably), all times are reported on a single core.

## References

[Babbar and Schölkopf, 2017] Rohit Babbar and Bernhard Schölkopf. DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification. In 10th ACM International Conference on Web Search and Data Mining (WSDM), 2017.

[Babbar and Schölkopf, 2019] Rohit Babbar and Bernhard Schölkopf. Data Scarcity, Robustness and Extreme Multi-label Classification. Machine Learning, 2019. https://doi.org/10.1007/s10994-019-05791-5.

[Banerjee and Ghosh, 2006] Arindam Banerjee and Joydeep Ghosh. Scalable Clustering Algorithms with Balancing Constraints. Data Mining and Knowledge Discovery, 13(3):365-395, 2006.

[Bhatia et al., 2015] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse Local Embeddings for Extreme Multi-label Classification. In 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015.

[Bhatia et al., 2019] Kush Bhatia, Kunal Dahiya, Himanshu Jain, Yashoteja Prabhu, and Manik Varma. The Extreme Classification Repository. http://manikvarma.org/downloads/XC/XMLRepository.html, 2019. Retrieved on February 15, 2019.

[Chen and Cai, 2011] Xinlei Chen and Deng Cai. Large Scale Spectral Clustering with Landmark-Based Representation. In 25th AAAI Conference on Artificial Intelligence (AAAI), 2011.

[Dhillon et al., 2003] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification. Journal of Machine Learning Research, 3:1265-1287, 2003.

[Jain et al., 2016] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. In 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[Jasinska et al., 2016] Kalina Jasinska, Krzysztof Dembczyński, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hüllermeier. Extreme F-Measure Maximization using Sparse Probability Estimates. In 33rd International Conference on Machine Learning (ICML), 2016.
[Liu et al., 2017] Hanyang Liu, Junwei Han, Feiping Nie, and Xuelong Li. Balanced Clustering with Least Square Regression. In 31st AAAI Conference on Artificial Intelligence (AAAI), 2017.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[Prabhu and Varma, 2014] Yashoteja Prabhu and Manik Varma. FastXML: A Fast, Accurate and Stable Tree-classifier for eXtreme Multi-label Learning. In 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2014.

[Prabhu et al., 2018] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising. In 27th International World Wide Web Conference (WWW), 2018.

[Siblini et al., 2018] Wissam Siblini, Pascale Kuntz, and Frank Meyer. CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning. In 35th International Conference on Machine Learning (ICML), 2018.

[Tagami, 2017] Yukihiro Tagami. AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. In 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2017.

[Wei and Li, 2018] Tong Wei and Yu-Feng Li. Does Tail Label Help for Large-Scale Multi-Label Learning. In 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018.

[Weston et al., 2013] Jason Weston, Ameesh Makadia, and Hector Yee. Label Partitioning For Sublinear Ranking. In 30th International Conference on Machine Learning (ICML), 2013.

[Yen et al., 2017] Ian E. H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit S. Dhillon, and Eric Xing. PPDSparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2017.

[Yu et al., 2014] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale Multi-label Learning with Missing Labels. In 31st International Conference on Machine Learning (ICML), 2014.