Published as a conference paper at ICLR 2020

EXTREME CLASSIFICATION VIA ADVERSARIAL SOFTMAX APPROXIMATION

Robert Bamler & Stephan Mandt
Department of Computer Science
University of California, Irvine
{rbamler,mandt}@uci.edu

ABSTRACT

Training a classifier over a large number of classes, known as extreme classification, has become a topic of major interest with applications in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost proportional to the number of classes C, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due to a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in C, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large-scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.

1 INTRODUCTION

In many problems in science, healthcare, or e-commerce, one is interested in training classifiers over an enormous number of classes: a problem known as extreme classification (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018). For softmax (aka multinomial) regression, each gradient step incurs a cost proportional to the number of classes C.
As this may be prohibitively expensive for large C, recent research has explored more scalable softmax approximations which circumvent the linear scaling in C. Progress in accelerating the training procedure and thereby scaling up extreme classification promises to dramatically improve, e.g., advertising (Prabhu et al., 2018), recommender systems, ranking algorithms (Bhatia et al., 2015; Jain et al., 2016), and medical diagnostics (Bengio et al., 2019; Lippert et al., 2017; Baumel et al., 2018).

While scalable softmax approximations have been proposed, each one has its drawbacks. The most popular approach, due to its simplicity, is negative sampling (Mnih & Hinton, 2009; Mikolov et al., 2013), which turns the problem into a binary classification between so-called positive samples from the data set and negative samples that are drawn at random from some (usually uniform) distribution over the class labels. While negative sampling makes the updates cheaper since computing the gradient no longer scales with C, it induces additional gradient noise that leads to a poor signal-to-noise ratio of the stochastic gradient estimate. Improving the signal-to-noise ratio in negative sampling while still enabling cheap gradients would dramatically enhance the speed of convergence.

In this paper, we present an algorithm that inherits the cheap gradient updates from negative sampling while still preserving much of the gradient signal of the original softmax regression problem. Our approach rests on the insight that the signal-to-noise ratio in negative sampling is poor since there is no association between input features and their artificial labels. If negative samples were harder to discriminate from positive ones, a learning algorithm would obtain a better gradient signal close to the optimum. Here, we make these arguments mathematically rigorous and propose a non-uniform sampling scheme for scalably approximating a softmax classification scheme.
Instead of sampling labels uniformly, our algorithm uses an adversarial auxiliary model to draw fake labels that are more realistic by taking the input features of the data into account. We prove that such a procedure reduces the gradient noise of the algorithm, and in fact minimizes the gradient variance in the limit where the auxiliary model optimally mimics the data distribution.

A useful adversarial model should require only little overhead to be fitted to the data, and it needs to be able to generate negative samples quickly in order to enable inexpensive gradient updates. We propose a probabilistic version of a decision tree that has these properties. As a side result of our approach, we show how such an auxiliary model can be constructed and efficiently trained. Since it is almost hyperparameter-free, it does not cause extra complications when tuning models.

The final problem that we tackle is to remove the bias that the auxiliary model causes relative to our original softmax classification. Negative sampling is typically described as a softmax approximation; however, only uniform negative sampling correctly approximates the softmax. In this paper, we show that the bias due to non-uniform negative sampling can be easily removed at test time.

The structure of our paper reflects our main contributions as follows:

1. We present a new scalable softmax approximation (Section 2). We show that non-uniform sampling from an auxiliary model can improve the signal-to-noise ratio. The best performance is achieved when this sampling mechanism is adversarial, i.e., when it generates fake labels that are hard to discriminate from the true ones. To allow for efficient training, such adversarial samples need to be generated at a rate sublinear (e.g., logarithmic) in C.

2. We design a new, simple adversarial auxiliary model that satisfies the above requirements (Section 3).
The model is based on a probabilistic version of a decision tree. It can be efficiently pre-trained and included into our approach, and requires only minimal tuning.

3. We present mathematical proofs that (i) the best signal-to-noise ratio in the gradient is obtained if the auxiliary model best reflects the true dependencies between input features and labels, and that (ii) the involved bias to the softmax approximation can be exactly quantified and cheaply removed at test time (Section 4).

4. We present experiments on two classification data sets that show that our method outperforms all baselines by at least one order of magnitude in training speed (Section 5).

We discuss related work in Section 6 and summarize our approach in Section 7.

2 AN ADVERSARIAL SOFTMAX APPROXIMATION

We propose an efficient algorithm to train a classifier over a large set of classes, using an asymptotic equivalence between softmax classification and negative sampling (Subsection 2.1). To speed up convergence, we generalize this equivalence to model-based negative sampling in Subsection 2.2.

2.1 ASYMPTOTIC EQUIVALENCE OF SOFTMAX CLASSIFICATION AND NEGATIVE SAMPLING

Softmax Classification (Notation). We consider a training data set D = {(x_i, y_i)}_{i=1:N} of N data points with K-dimensional feature vectors x_i ∈ R^K. Each data point has a single label y_i ∈ Y from a discrete label set Y. A softmax classifier is defined by a set of functions {ξ_y}_{y∈Y} that map a feature vector x and model parameters θ to a score ξ_y(x, θ) ∈ R for each label y. Its loss function is

$$\ell_\text{softmax}(\theta) = \sum_{(x,y)\in\mathcal{D}} \Big[ -\xi_y(x,\theta) + \log \sum_{y'\in\mathcal{Y}} e^{\xi_{y'}(x,\theta)} \Big]. \qquad (1)$$

While the first term encourages high scores ξ_y(x, θ) for the correct labels y, the second term encourages low scores for all labels y′ ∈ Y, thus preventing degenerate solutions that set all scores to infinity. Unfortunately, the sum over y′ ∈ Y makes gradient-based minimization of ℓ_softmax(θ) expensive if the label set Y is large.
Assuming that evaluating a single score ξ_y(x, θ) takes O(K) time, each gradient step costs O(KC), where C = |Y| is the size of the label set.

Negative Sampling. Negative sampling turns classification over a large label set Y into binary classification between so-called positive and negative samples. One draws positive samples (x, y) from the training set and constructs negative samples (x, y′) by drawing random labels y′ from some noise distribution p_n. One then trains a logistic regression by minimizing the stochastic loss function

$$\ell_\text{neg.sampl.}(\varphi) = \sum_{(x,y)\in\mathcal{D}} \big[ -\log \sigma(\xi_y(x,\varphi)) - \log \sigma(-\xi_{y'}(x,\varphi)) \big] \quad \text{where } y' \sim p_n \qquad (2)$$

with the sigmoid function σ(z) = 1/(1 + e^{−z}). Here, we used the same score functions ξ_y as in Eq. 1 but introduced different model parameters φ so that we can distinguish the two models. Gradient steps for ℓ_neg.sampl.(φ) cost only O(K) time as there is no sum over all labels y′ ∈ Y.

Asymptotic Equivalence. The models in Eqs. 1 and 2 are exactly equivalent in the nonparametric limit, i.e., if the function class x ↦ ξ_y(x, θ) is flexible enough to map x to any possible score. A further requirement is that p_n in Eq. 2 is the uniform distribution over Y. If both conditions hold, it follows that if θ* and φ* minimize Eq. 1 and Eq. 2, respectively, they learn identical scores,

$$\xi_y(x,\theta^*) = \xi_y(x,\varphi^*) + \text{const.} \quad \text{(for uniform } p_n\text{)}. \qquad (3)$$

As a consequence, one is free to choose the loss function that is easier to minimize. While gradient steps are cheaper by a factor of O(C) for negative sampling, the randomly drawn negative samples increase the variance of the stochastic gradient estimator and worsen the signal-to-noise ratio of the gradient, slowing down convergence. The next section combines the strengths of both approaches.

2.2 ADVERSARIAL NEGATIVE SAMPLING

Overview. We propose a generalized variant of negative sampling that reduces the gradient noise.
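As a concrete baseline for what follows, the NumPy sketch below evaluates single-data-point versions of the two losses from Section 2.1 (Eqs. 1 and 2) for a linear score model ξ_y(x) = w_y·x; all names and sizes are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 8, 1000                       # feature dimension, number of classes
W = rng.normal(size=(C, K))          # linear scores: xi_y(x) = W[y] @ x
x, y = rng.normal(size=K), 3         # one positive sample (x, y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Softmax loss (Eq. 1) for one data point: the log-sum-exp runs over all C
# classes, so each gradient step touches O(KC) parameters.
scores = W @ x
loss_softmax = -scores[y] + np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()

# Negative-sampling loss (Eq. 2): one positive and one uniformly drawn
# negative label, so the gradient touches only O(K) parameters.
y_neg = rng.integers(C)
loss_neg = -np.log(sigmoid(W[y] @ x)) - np.log(sigmoid(-(W[y_neg] @ x)))
```

The stabilized log-sum-exp (subtracting `scores.max()`) avoids overflow but does not change the O(KC) cost per step.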
The main idea is to train with negative samples y′ that are hard to distinguish from positive samples. We draw y′ from a conditional noise distribution p_n(y′|x) using an auxiliary model. This introduces a bias, which we remove at prediction time. In summary, our proposed approach consists of three steps:

1. Parameterize the noise distribution p_n(y′|x) by an auxiliary model and fit it to the data set.

2. Train a classifier via negative sampling (Eq. 2) using adversarial negative samples from the auxiliary model fitted in Step 1 above. For our proposed auxiliary model, drawing a negative sample costs only O(k log C) time with some k < K, i.e., it is sublinear in C.

3. The resulting model has a bias. When making predictions, remove the bias by mapping it to an unbiased softmax classifier using the generalized asymptotic equivalence in Eq. 5 below.

We elaborate on the above Step 1 in Section 3. In the present section, we focus instead on Step 2 and its dependency on the choice of noise distribution p_n, and on the bias removal (Step 3).

Why Adversarial Noise Improves Learning. We first provide some intuition why uniform negative sampling is not optimal, and how sampling from a non-uniform noise distribution may improve the gradient signal. We argue that the poor gradient signal is caused by the fact that negative samples are too easy to distinguish from positive samples. A data set with many categories is typically comprised of several hierarchical clusters, with large clusters of generic concepts and small sub-clusters of specialized concepts. When drawing negative samples uniformly across the data set, the correct label will likely belong to a different generic concept than the negative sample. For example, an image classifier will therefore quickly learn to distinguish, e.g., dogs from bicycles, but since negative samples from the same cluster are rare, it takes much longer to learn the differences between a Boston Terrier and a French Bulldog.
The model quickly learns to assign very low scores ξ_{y′}(x, φ) ≪ 0 to such obviously wrong labels, making their contribution to the gradient exponentially small,

$$\|\nabla_\varphi \log \sigma(-\xi_{y'}(x,\varphi))\|_2 = \sigma(\xi_{y'}(x,\varphi))\, \|\nabla_\varphi \xi_{y'}(x,\varphi)\|_2 \approx e^{\xi_{y'}(x,\varphi)}\, \|\nabla_\varphi \xi_{y'}(x,\varphi)\|_2 \quad \text{for } \xi_{y'}(x,\varphi) \ll 0. \qquad (4)$$

A similar vanishing gradient problem was pointed out for word embeddings by Chen et al. (2018). There, the vanishing gradient is due to different word frequencies, and a popular solution is therefore to draw negative samples from a nonuniform but unconditional noise distribution p_n(y′) based on the empirical word frequencies (Mikolov et al., 2013). This introduces a bias, which does not matter for word embeddings since the focus is not on classification but rather on learning useful representations.

Going beyond frequency-adjusted negative sampling, we show that one can drastically improve the procedure by generating negative samples from an auxiliary model. We therefore propose to generate negative samples y′ ∼ p_n(y′|x) conditioned on the input feature x. This has the advantage that the distribution of negative samples can be made much more similar to the distribution of positive samples, leading to a better signal-to-noise ratio. One consequence is that the introduced bias can no longer be ignored, which is what we address next.

Bias Removal. Negative sampling with a nonuniform noise distribution introduces a bias. For a given input feature vector x, labels y′ with a high noise probability p_n(y′|x) are frequently drawn as negative samples, causing the model to learn a low score ξ_{y′}(x, φ*). Conversely, a low p_n(y′|x) leads to an inflated score ξ_{y′}(x, φ*). It turns out that this bias can be easily quantified via a generalization of Eq. 3. We prove in Theorem 1 (Section 4) that in the nonparametric limit for arbitrary p_n(y|x),

$$\xi_y(x,\theta^*) = \xi_y(x,\varphi^*) + \log p_n(y|x) + \text{const.} \quad \text{(nonparametric limit and arbitrary } p_n\text{)}. \qquad (5)$$

Eq.
5 is an asymptotic equivalence between softmax classification (Eq. 1) and generalized negative sampling (Eq. 2). While strict equality holds only in the nonparametric limit, many models are flexible enough that Eq. 5 holds approximately in practice. Eq. 5 allows us to make unbiased predictions by mapping biased negative sampling scores ξ_y(x, φ*) to unbiased softmax scores ξ_y(x, θ*). There is no need to solve for the corresponding model parameters θ*; the scores ξ_y(x, θ*) suffice for predictions.

Regularization. In practice, softmax classification typically requires a regularizer with some strength λ > 0 to prevent overfitting. With the asymptotic equivalence in Eq. 5, regularizing the softmax scores ξ_y(x, θ) is similar to regularizing ξ_y(x, φ) + log p_n(y|x) in the proposed generalized negative sampling method. We thus propose to use the following regularized variant of Eq. 2,

$$\ell^\text{(reg.)}_\text{neg.sampl.}(\varphi) = \sum_{(x,y)\in\mathcal{D}} \Big[ -\log \sigma(\xi_y(x,\varphi)) + \lambda \big( \xi_y(x,\varphi) + \log p_n(y|x) \big)^2 - \log \sigma(-\xi_{y'}(x,\varphi)) + \lambda \big( \xi_{y'}(x,\varphi) + \log p_n(y'|x) \big)^2 \Big]; \quad y' \sim p_n(y'|x). \qquad (6)$$

Comparison to GANs. The use of adversarial negative samples, i.e., negative samples that are designed to confuse the logistic regression in Eq. 2, bears some resemblance to generative adversarial networks (GANs) (Goodfellow et al., 2014). The crucial difference is that GANs are generative models, whereas we train a discriminative model over a discrete label space Y. The generator p_n in our setup only needs to find a rough approximation of the (conditional) label distribution because the final predictive scores in Eq. 5 combine the generator scores log p_n(y|x) with the more expressive discriminator scores ξ_y(x, φ*). This allows us to use a very restrictive but efficient generator model (see Section 3 below) that we can keep constant while training the discriminator.
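At prediction time, the bias removal from Eq. 5 amounts to adding the auxiliary model's log-likelihood back onto the learned scores. A minimal sketch with made-up numbers (both arrays are purely illustrative stand-ins for trained quantities):

```python
import numpy as np

# Hypothetical quantities for one input x over C = 5 labels: `xi` are the
# discriminator scores xi_y(x, phi*) learned with non-uniform negative
# sampling, `log_pn` the auxiliary model's log pn(y | x).
xi = np.array([2.0, -1.0, 0.5, 0.1, -0.3])
log_pn = np.log(np.array([0.05, 0.60, 0.10, 0.20, 0.05]))

# Eq. 5: unbiased softmax scores equal xi + log pn up to a y-independent
# constant; the constant cancels in both the argmax and the softmax.
debiased = xi + log_pn
prediction = int(np.argmax(debiased))

probs = np.exp(debiased - debiased.max())
probs /= probs.sum()                 # normalized predictive distribution
```

Note that label 1 has the highest raw noise probability, yet the debiased argmax can still land elsewhere; only the combined score ξ + log p_n is meaningful for prediction.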
By contrast, the focus in GANs is on finding the best possible generator, which requires concurrent training of a generator and a discriminator via a potentially unstable nested min-max optimization.

3 CONDITIONAL GENERATION OF ADVERSARIAL SAMPLES

Having proposed a general approach for improved negative sampling with an adversarial auxiliary model p_n (Section 2), we now describe a simple construction for such a model that satisfies all requirements. The model is essentially a probabilistic version of a decision tree which is able to conditionally generate negative samples by ancestral sampling. Readers who prefer to proceed can skip this section without losing the main thread of the paper.

Our auxiliary model has the following properties: (i) it can be efficiently fitted to the training data D, requiring minimal hyperparameter tuning and subleading computational overhead over the training of the main model; (ii) drawing negative samples y′ ∼ p_n(y′|x) scales only as O(log |Y|), thus improving over the linear scaling of the softmax loss function (Eq. 1); and (iii) the log likelihood log p_n(y|x) can be evaluated explicitly so that we can apply the bias removal in Eq. 5. Satisfying requirements (i) and (ii) on model efficiency comes at the cost of some model performance. This is an acceptable trade-off since the performance of p_n affects only the quality of negative samples.

Model. Our auxiliary model for p_n is inspired by the Hierarchical Softmax model due to Morin & Bengio (2005). It is a balanced probabilistic binary decision tree, where each leaf node is mapped uniquely to a label y ∈ Y. A decision tree imposes a hierarchical structure on Y, which can impede performance if it does not reflect any semantic structure in Y. Morin & Bengio (2005) rely on an explicitly provided semantic hierarchical structure, or ontology.
Since an ontology is often not available, we instead construct a hierarchical structure in a data-driven way. Our method has some similarity to the approach by Mnih & Hinton (2009), but it is more principled in that we fit both the model parameters and the hierarchical structure by maximizing a single log likelihood function.

To sample from the model, one walks from the tree's root to some leaf. At each node ν, one makes a binary decision ζ ∈ {±1} whether to continue to the right child (ζ = +1) or to the left child (ζ = −1). Given a feature vector x, we model the likelihood of these decisions as σ(ζ(w_ν^⊤x + b_ν)), where the weight vector w_ν and the scalar bias b_ν are model parameters associated with node ν. Denoting the unique path π_y from the root node ν_0 to the leaf node associated with label y as a sequence of nodes and binary decisions, π_y ≡ ((ν_0, ζ_0), (ν_1, ζ_1), …), the log likelihood of the training set D is thus

$$\mathcal{L} = \sum_{(x,y)\in\mathcal{D}} \log p_n(y|x) = \sum_{(x,y)\in\mathcal{D}} \sum_{(\nu,\zeta)\in\pi_y} \log \sigma\big( \zeta (w_\nu^\top x + b_\nu) \big). \qquad (7)$$

Greedy Model Fitting. We maximize the likelihood L in Eq. 7 over (i) the model parameters w_ν and b_ν of all nodes ν, and (ii) the hierarchical structure, i.e., the mapping between labels y and leaf nodes. The latter involves an exponentially large search space, making exact maximization intractable. We use a greedy approximation where we recursively split the label set Y into halves and associate each node ν with a subset Y_ν ⊆ Y. We start at the root node ν_0 with Y_{ν_0} = Y and finish at the leaves with a single label per leaf. For each node ν, we maximize the terms in L that depend on w_ν and b_ν. These terms correspond to data points with a label y ∈ Y_ν, leading to the objective

$$\mathcal{L}_\nu = \sum_{(x,y)\in\mathcal{D}:\, y\in\mathcal{Y}_\nu} \log \sigma\big( \zeta_y (w_\nu^\top x + b_\nu) \big). \qquad (8)$$

We alternate between a continuous maximization of L_ν over w_ν and b_ν, and a discrete maximization over the binary indicators ζ_y ∈ {±1} that define how we split Y_ν into two equally sized halves.
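Ancestral sampling from the tree described above visits one node per level, so a negative sample costs O(k log C). A sketch of that walk (the heap-order node layout and all names are our own illustration, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_label(x, weights, biases, depth, rng):
    """Draw y' ~ pn(y'|x) by ancestral sampling in a balanced binary tree
    over C = 2**depth labels; returns (label, log pn(y'|x)).

    `weights` has shape (2**depth - 1, k) and `biases` shape (2**depth - 1,),
    holding the node parameters (w_nu, b_nu) in heap order, root at index 0.
    """
    node, log_pn = 0, 0.0
    for _ in range(depth):
        p_right = sigmoid(weights[node] @ x + biases[node])
        go_right = rng.random() < p_right
        log_pn += np.log(p_right if go_right else 1.0 - p_right)
        node = 2 * node + (2 if go_right else 1)   # heap-order children
    return node - (2 ** depth - 1), log_pn         # leaf index -> label
```

Because the per-decision probabilities are accumulated along the path, the same walk also yields log p_n(y′|x) for free, which is exactly what the bias removal in Eq. 5 needs.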
The continuous optimization is over a convex function and it converges quickly to machine precision with Newton ascent, which is free of hyperparameters like learning rates. For the discrete optimization, we note that changing ζ_y for any y ∈ Y_ν from −1 to +1 (or from +1 to −1) increases (or decreases) L_ν by

$$\sum_{x\in\mathcal{D}_y} \big[ \log \sigma(w_\nu^\top x + b_\nu) - \log \sigma(-w_\nu^\top x - b_\nu) \big] = \sum_{x\in\mathcal{D}_y} \big( w_\nu^\top x + b_\nu \big). \qquad (9)$$

Here, the sums over D_y run over all data points in D with label y, and the second equality is an algebraic identity of the sigmoid function. We maximize L_ν over all ζ_y under the boundary condition that the split be into equally sized halves by setting ζ_y ← +1 for the half of y ∈ Y_ν with the largest value of Eq. 9 and ζ_y ← −1 for the other half. If this changes any ζ_y, then we switch back to the continuous optimization. Otherwise, we have reached a local optimum for node ν, and we proceed to the next node.

Technical Details. In the interest of clarity, the above description left out the following details. Most importantly, to prioritize efficiency over accuracy, we preprocess the feature vectors x and project them to a smaller space R^k with k < K using principal component analysis (PCA). Sampling from p_n thus costs only O(k log |Y|) time. This dimensionality reduction only affects the quality of negative samples. The main model (Eq. 2) still operates on the full feature space R^K. Second, we add a quadratic regularizer λ_n(‖w_ν‖² + b_ν²) to L_ν, with strength λ_n set by cross validation. Third, we introduce uninhabited padding labels if |Y| is not a power of two. We ensure that p_n(ỹ|x) = 0 for all padding labels ỹ by setting b_ν to a very large positive or negative value if either of ν's children contains only padding labels. Finally, we initialize the optimization with b_ν ← 0 and by setting w_ν ∈ R^k to the dominant eigenvector of the covariance matrix of the set of vectors {∑_{x∈D_y} x}_{y∈Y_ν}.

4 THEORETICAL ASPECTS

We formalize and prove the two main premises of the algorithm proposed in Section 2.2.
Theorem 1 below states the equivalence between softmax classification and negative sampling (Eq. 5), and Theorem 2 formalizes the claim that adversarial negative samples maximize the signal-to-noise ratio.

Theorem 1. In the nonparametric limit, the optimal model parameters θ* and φ* that minimize ℓ_softmax(θ) in Eq. 1 and ℓ_neg.sampl.(φ) in Eq. 2, respectively, satisfy Eq. 5 for all x in the data set and all y ∈ Y. Here, the const. term on the right-hand side of Eq. 5 is independent of y.

Proof. Minimizing ℓ_softmax(θ) fits the maximum likelihood estimate of a model with likelihood p_θ(y|x) = e^{ξ_y(x,θ)}/Z_θ(x) with normalization Z_θ(x) = ∑_{y′∈Y} e^{ξ_{y′}(x,θ)}. In the nonparametric limit, the score functions ξ_y(x, θ) are arbitrarily flexible, allowing for a perfect fit, thus

$$p_\mathcal{D}(y|x) = p_{\theta^*}(y|x) = e^{\xi_y(x,\theta^*)} / Z_{\theta^*}(x) \quad \text{(nonparametric limit)}. \qquad (10)$$

Similarly, ℓ_neg.sampl.(φ) is the maximum likelihood objective of a binomial model that discriminates positive from negative samples. The nonparametric limit admits again a perfect fit so that the learned ratio of positive rate σ(ξ_y(x, φ)) to negative rate σ(−ξ_y(x, φ)) equals the empirical ratio,

$$\frac{p_\mathcal{D}(y|x)}{p_n(y|x)} = \frac{\sigma(\xi_y(x,\varphi^*))}{\sigma(-\xi_y(x,\varphi^*))} = e^{\xi_y(x,\varphi^*)} \quad \text{(nonparametric limit)} \qquad (11)$$

where we used the identity σ(z)/σ(−z) = e^z. Inserting Eq. 10 for p_D(y|x) and taking the logarithm leads to Eq. 5. Here, the const. term works out to log Z_{θ*}(x), which is indeed independent of y.

Signal-to-Noise Ratio. In preparation for Theorem 2 below, we define a quantitative measure for the signal-to-noise ratio (SNR) in stochastic gradient descent (SGD). In the vicinity of the minimum φ* of a loss function ℓ(φ), the gradient g ≈ H_ℓ(φ − φ*) is approximately proportional to the Hessian H_ℓ of ℓ at φ*. SGD estimates g via stochastic gradient estimates ĝ, whose noise is measured by the covariance matrix Cov[ĝ, ĝ]. Thus, the eigenvalues {η_i} of the matrix A := H_ℓ Cov[ĝ, ĝ]^{−1} measure the SNR along different directions in parameter space.
We define an overall scalar SNR η as

$$\eta := \frac{1}{\sum_i 1/\eta_i} = \frac{1}{\mathrm{Tr}[A^{-1}]} = \frac{1}{\mathrm{Tr}\big[ \mathrm{Cov}[\hat g, \hat g]\, H_\ell^{-1} \big]}. \qquad (12)$$

Here, we sum over the inverses 1/η_i rather than η_i so that η ≤ min_i η_i, and thus maximizing η encourages large values for all η_i. The definition in Eq. 12 has the useful property that η is invariant under arbitrary invertible reparameterization of φ. Expressing φ in terms of new model parameters φ̃ maps H_ℓ to J^⊤H_ℓJ and Cov[ĝ, ĝ] to J^⊤Cov[ĝ, ĝ]J, where J := ∂φ/∂φ̃ is the Jacobian. Inserting into Eq. 12 and using the cyclic property of the trace, Tr[PQ] = Tr[QP], all Jacobians cancel.

Theorem 2. For negative sampling (Eq. 2) in the nonparametric limit, the signal-to-noise ratio η defined in Eq. 12 is maximal if p_n = p_D, i.e., for adversarial negative samples.

Proof. In the nonparametric limit, the scores ξ_y(x, φ) can be regarded as independent variables for all x and y. We therefore treat the scores directly as model parameters, using the invariance of η under reparameterization. Using only Eq. 2, Eq. 11, and properties of the σ-function, we show in Appendix A.1 that the Hessian of the loss function is diagonal in this coordinate system, and given by

$$H_\ell = \mathrm{diag}(\alpha_{x,y}) \quad \text{with} \quad \alpha_{x,y} = p_n(y|x)\, \sigma(\xi_y(x,\varphi^*)) \qquad (13)$$

and that the noise covariance matrix is block diagonal, Cov[ĝ, ĝ] = diag(C_x), with blocks

$$C_x = N\, \mathrm{diag}(\alpha_{x,:}) - 2N\, \alpha_{x,:} \alpha_{x,:}^\top \qquad (14)$$

where α_{x,:} ≡ (α_{x,y})_{y∈Y} denotes a |Y|-dimensional column vector. Thus, the trace in Eq. 12 is

$$\mathrm{Tr}\big[ \mathrm{Cov}[\hat g, \hat g]\, H_\ell^{-1} \big] = \sum_x \mathrm{Tr}\big[ C_x\, \mathrm{diag}(\alpha_{x,:})^{-1} \big] = N \sum_x \mathrm{Tr}\big[ I - 2\, \alpha_{x,:} \alpha_{x,:}^\top \mathrm{diag}(\alpha_{x,:})^{-1} \big] = N \sum_x \Big( |\mathcal{Y}| - 2 \sum_{y\in\mathcal{Y}} \alpha_{x,y} \Big). \qquad (15)$$

We thus have to maximize ∑_{y∈Y} α_{x,y} for each x in the training set. We find from Eq. 13 and Eq. 11,

$$\sum_{y\in\mathcal{Y}} \alpha_{x,y} \overset{(13)}{=} \mathbb{E}_{p_n(y|x)}\big[ \sigma(\xi_y(x,\varphi^*)) \big] = \mathbb{E}_{p_n(y|x)}\Big[ \frac{1}{1 + e^{-\xi_y(x,\varphi^*)}} \Big] \overset{(11)}{=} \mathbb{E}_{p_n(y|x)}\Big[ f\Big( \frac{p_\mathcal{D}(y|x)}{p_n(y|x)} \Big) \Big] \qquad (16)$$

with f(z) := 1/(1 + 1/z). Using Jensen's inequality for the concave function f, we find that the right-hand side of Eq. 16 has the upper bound f(E_{p_n(y|x)}[p_D(y|x)/p_n(y|x)]) = f(1) = 1/2, which it reaches precisely if the argument of f in Eq. 16 is a constant, i.e., iff p_n(y|x) = p_D(y|x) for all y ∈ Y.
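The SNR definition in Eq. 12 and its reparameterization invariance can be checked numerically. In the sketch below, `H` and `Cov` are arbitrary positive-definite stand-ins for the Hessian and the gradient-noise covariance, not quantities from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Arbitrary positive-definite stand-ins for H_l and Cov[g_hat, g_hat].
A = rng.normal(size=(d, d)); H = A @ A.T + d * np.eye(d)
B = rng.normal(size=(d, d)); Cov = B @ B.T + d * np.eye(d)

def snr(H, Cov):
    # Eq. 12: eta = 1 / Tr[Cov H^{-1}]
    return 1.0 / np.trace(Cov @ np.linalg.inv(H))

# A reparameterization with Jacobian J maps H to J^T H J and Cov to
# J^T Cov J; by the cyclic property of the trace, the Jacobians cancel.
J = rng.normal(size=(d, d)) + d * np.eye(d)   # (almost surely) invertible
eta = snr(H, Cov)
eta_reparam = snr(J.T @ H @ J, J.T @ Cov @ J)
```

Algebraically, Tr[(J^⊤ Cov J)(J^⊤ H J)^{−1}] = Tr[J^⊤ Cov H^{−1} J^{−⊤}] = Tr[Cov H^{−1}], so `eta` and `eta_reparam` agree up to floating-point error.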
Table 1: Sizes of data sets and hyperparameters. N = number of training points; C = number of categories (after preprocessing); ρ = learning rate; λ = regularizer; σ₀² = prior variance.

Data set        Size of data set  adv. neg. s.  uniform    freq.     NCE       A&R       OvE
                                  (proposed)    neg. s.    neg. s.
Wikipedia-500K  N=1,646,302       ρ=0.01        ρ=0.001    ρ=0.003   ρ=0.01    ρ=0.03    ρ=0.02
                C=217,240         λ=0.001       λ=0.0001   λ=10⁻⁵    λ=0.003   σ₀²=0.1   σ₀²=1
Amazon-670K     N=490,449         ρ=0.01        ρ=0.01     ρ=0.003   ρ=0.01    ρ=0.1     ρ=0.03
                C=213,874         λ=0.001       λ=0.0003   λ=10⁻⁵    λ=0.001   σ₀²=10    σ₀²=10

5 EXPERIMENTS

We evaluated the proposed adversarial negative sampling method on two established benchmarks by comparing speed of convergence and predictive performance against five different baselines.

Datasets, Preprocessing and Model. We used the Wikipedia-500K and Amazon-670K data sets from the Extreme Classification Repository (Bhatia et al.) with K = 512-dimensional XML-CNN features (Liu et al., 2017) downloaded from (Saxena). As both data sets contain multiple labels per data point, we follow the approach in (Ruiz et al., 2018) and keep only the first label for each data point. Table 1 shows the resulting sizes. We fit a linear model with scores ξ_y(x, φ) = x^⊤w_y + b_y, where the model parameters φ are the weight vectors w_y ∈ R^K and biases b_y ∈ R for each label y.

Baselines. We compare our proposed method to five baselines: (i) standard negative sampling with a uniform noise distribution; (ii) negative sampling with an unconditional noise distribution p_n(y′) set to the empirical label frequencies; (iii) noise contrastive estimation (NCE, see below); (iv) Augment and Reduce (Ruiz et al., 2018); and (v) One vs. Each (Titsias, 2016). We do not compare to full softmax classification, which would be unfeasible on the large data sets (see Table 1; a single epoch of optimizing the full softmax loss would scale as O(NCK)).
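As a rough back-of-the-envelope check of the O(NCK) claim, one can plug in the Wikipedia-500K sizes from Table 1 together with the auxiliary model's projected feature dimension k = 16; the operation counts below are crude estimates of score evaluations per epoch, not measured costs:

```python
import math

# Sizes from Table 1 (Wikipedia-500K); k = 16 is the auxiliary model's
# projected feature dimension.
N, C, K = 1_646_302, 217_240, 512
k = 16

# Full softmax (Eq. 1): every label's score for every data point per epoch.
full_softmax_ops = N * C * K                   # O(NCK)

# Negative sampling (Eq. 2): one positive and one negative score per data
# point, plus an O(k log C) tree walk to draw the adversarial negative.
neg_sampling_ops = N * (2 * K + k * math.ceil(math.log2(C)))

speedup = full_softmax_ops / neg_sampling_ops  # per-epoch cost ratio
```

Even under this coarse accounting, the per-epoch cost ratio is several orders of magnitude, which is why full softmax is excluded as a baseline on these data sets.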
However, we provide additional results that compare softmax against negative sampling on a smaller data set in Appendix A.2.

NCE (Gutmann & Hyvärinen, 2010) is sometimes used as a synonym for negative sampling in the literature, but the original proposal of NCE is more general and allows for a nonuniform base distribution. We use our trained auxiliary model (Section 3) for the base distribution of NCE. Compared to our proposed method, NCE uses the base distribution only during training and not for predictions. Therefore, NCE has to re-learn everything that is already captured by the base distribution. This is less of an issue in the original setup for which NCE was proposed, namely unsupervised density estimation over a continuous space. By contrast, training a supervised classifier effectively means training a separate model for each label y ∈ Y, which is expensive if Y is large. Thus, having to re-learn what the base distribution already captures is potentially wasteful.

Hyperparameters. We tuned the hyperparameters for each method individually using the validation set. Table 1 shows the resulting hyperparameters. For the proposed method and baselines (i)-(iii), we used an Adagrad optimizer (Duchi et al., 2011) and considered learning rates ρ ∈ {0.0003, 0.001, 0.003, 0.01, 0.03} and regularizer strengths (see Eq. 6) λ ∈ {10⁻⁵, 3×10⁻⁵, …, 0.03}. For Augment and Reduce and One vs. Each, we used the implementation published by the authors (Ruiz), and tuned the learning rate ρ and prior variance σ₀². For the auxiliary model, we used a feature dimension of k = 16 and regularizer strength λ_n = 0.1 for both data sets.

Results. Figure 1 shows our results on the Wikipedia-500K data set (left two plots) and the Amazon-670K data set (right two plots). For each data set, we plot the predictive log likelihood per test data point (first and third plot) and the predictive accuracy (second and fourth plot).
[Figure 1: four panels plotting predictive log likelihood and predictive accuracy against training time (hours) for Wikipedia-500K and Amazon-670K. Caption: Learning curves for our proposed adversarial negative sampling method (green) and for five different baselines on two large data sets (see Table 1). Legend: adversarial neg. sampl. (proposed); freq. based neg. sampl.; NCE (Gutmann et al., 2010); uniform neg. sampl.; A&R (Ruiz et al., 2018); OvE (Titsias, 2016).]

The green curve in each plot shows our proposed adversarial negative sampling method. Both our method and NCE (orange) start slightly shifted to the right to account for the time to fit the auxiliary model. Our main observation is that the proposed method converges orders of magnitude faster and reaches better accuracies (second and fourth plot in Figure 1) than all baselines. On the (smaller) Amazon-670K data set, standard uniform and frequency based negative sampling reach a slightly higher predictive log likelihood, but our method performs considerably better in terms of predictive accuracy on both data sets. This may be understood by noting that the predictive accuracy is very sensitive to the precise scores of the highest ranked labels, as a small change in these scores can affect which label is ranked highest. With adversarial negative sampling, the training procedure focuses on getting the scores of the highest ranked labels right, thus improving in particular the predictive accuracy.

6 RELATED WORK

Efficient Evaluation of the Softmax Loss Function. Methods to speed up evaluation of Eq. 1 include augmenting the model by adding auxiliary latent variables that can be marginalized over analytically (Galy-Fajou et al., 2019; Wenzel et al., 2019; Ruiz et al., 2018; Titsias, 2016).
More closely related to our work are methods based on negative sampling (Mnih & Hinton, 2009; Mikolov et al., 2013) and noise contrastive estimation (Gutmann & Hyvärinen, 2010). Generalizations of negative sampling to non-uniform noise distributions have been proposed, e.g., in (Zhang & Zweigenbaum, 2018; Chen et al., 2018; Wang et al., 2014; Gutmann & Hyvärinen, 2010). Our method differs from these proposals by drawing the negative samples from a conditional distribution that takes the input feature into account, and by requiring the model to learn only correlations that are not already captured by the noise distribution. We further derive the optimal distribution for negative samples, and we propose an efficient way to approximate it via an auxiliary model.

Adversarial training (Miyato et al., 2017) is a popular method for training deep generative models (Tu, 2007; Goodfellow et al., 2014). By contrast, our method trains a discriminative model over a discrete set of labels (see also our comparison to GANs at the end of Section 2.2).

A different sampling-based approximation of softmax classification is sampled softmax (Bengio et al., 2003). It directly approximates the sum over classes y in the loss (Eq. 1) by sampling, which is biased even for a uniform sampling distribution. A nonuniform sampling distribution can remove or reduce the bias (Bengio & Senécal, 2008; Blanc & Rendle, 2018; Rawat et al., 2019). By contrast, our method uses negative sampling, and it uses a nonuniform distribution to reduce the gradient variance.

Decision Trees. Decision trees (Somvanshi & Chavan, 2016) are popular in the extreme classification literature (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018; Weston et al., 2013; Bhatia et al., 2015; Jasinska et al., 2016). Our proposed method employs a probabilistic decision tree that is similar to Hierarchical Softmax (Morin & Bengio, 2005; Mikolov et al., 2013).
While decision trees allow for efficient training and sampling in O(log C) time, their hierarchical architecture imposes a structural bias. Our proposed method therefore trains a more expressive model on top of the decision tree to correct for this structural bias.

7 CONCLUSIONS

We proposed a simple method to train a classifier over a large set of labels. Our method is based on a scalable approximation of the softmax loss function via a generalized form of negative sampling. By generating adversarial negative samples from an auxiliary model, we proved that we maximize the signal-to-noise ratio of the stochastic gradient estimate. We further showed that, while the auxiliary model introduces a bias, this bias can be removed at test time. We believe that, due to its simplicity, our method can be widely used, and we publish the code1 of both the main and the auxiliary model.

1https://github.com/mandt-lab/adversarial-negative-sampling

ACKNOWLEDGEMENTS

Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FW-HTF-RM), and Qualcomm.

REFERENCES

Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web, pp. 13–24. ACM, 2013.

Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noemie Elhadad. Multi-label classification of patient notes: A case study on ICD code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Samy Bengio, Krzysztof Dembczynski, Thorsten Joachims, Marius Kloft, and Manik Varma. Extreme classification (Dagstuhl Seminar 18291). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.

Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model.
IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pp. 1–9, 2003.

Kush Bhatia, Kunal Dahiya, Himanshu Jain, Yashoteja Prabhu, and Manik Varma. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html. Accessed: 2019-05-23.

Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.

Guy Blanc and Steffen Rendle. Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning, pp. 589–598, 2018.

Long Chen, Fajie Yuan, Joemon M Jose, and Weinan Zhang. Improving negative sampling for word representation using self-embedded features. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 99–107. ACM, 2018.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Théo Galy-Fajou, Florian Wenzel, Christian Donner, and Manfred Opper. Multi-class Gaussian process classification made conjugate: Efficient inference via data augmentation. In Uncertainty in Artificial Intelligence, 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Himanshu Jain, Yashoteja Prabhu, and Manik Varma.
Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–944. ACM, 2016.

Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hüllermeier. Extreme F-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.

Christoph Lippert, Riccardo Sabatini, M Cyrus Maher, Eun Yong Kang, Seunghak Lee, Okan Arikan, Alena Harley, Axel Bernal, Peter Garst, Victor Lavrenko, et al. Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, 114(38):10166–10171, 2017.

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM, 2017.

Eneldo Loza Mencia and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 50–65. Springer, 2008.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. In International Conference on Learning Representations, 2017.

Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pp. 1081–1088, 2009.

Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252.
Citeseer, 2005.

Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 263–272. ACM, 2014.

Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 993–1002. International World Wide Web Conferences Steering Committee, 2018.

Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random Fourier features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Francisco JR Ruiz. Augment and Reduce GitHub repository. https://github.com/franrruiz/augment-reduce. Accessed: 2019-05-23.

Francisco JR Ruiz, Michalis K Titsias, Adji B Dieng, and David M Blei. Augment and reduce: Stochastic inference for large categorical distributions. In International Conference on Machine Learning, pp. 4400–4409, 2018.

Siddhartha Saxena. XML-CNN GitHub repository. https://github.com/siddsax/XML-CNN. Accessed: 2019-05-23.

Wissam Siblini, Pascale Kuntz, and Frank Meyer. CraftML, an efficient clustering-based random forest for extreme multi-label learning. In The 35th International Conference on Machine Learning (ICML 2018), 2018.

Madan Somvanshi and Pranjali Chavan. A review of machine learning techniques using decision tree and support vector machine. In 2016 International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–7. IEEE, 2016.

Michalis K Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Advances in Neural Information Processing Systems, pp. 4161–4169, 2016.

Zhuowen Tu. Learning generative models via discriminative approaches.
In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2007.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient Gaussian process classification using Pólya-Gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5417–5424, 2019.

Jason Weston, Ameesh Makadia, and Hector Yee. Label partitioning for sublinear ranking. In International Conference on Machine Learning, pp. 181–189, 2013.

Zheng Zhang and Pierre Zweigenbaum. GNEG: Graph-based negative sampling for word2vec. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 566–571, 2018.

A.1 DETAILS OF THE PROOF OF THEOREM 2

In the nonparametric limit, the score functions $\xi_y(x,\varphi)$ are so flexible that they can take arbitrary values for all $x$ in the data set and all $y \in \mathcal Y$. Taking advantage of the invariance of $\eta$ under reparameterization, we parameterize the model directly by its scores. We use the shorthand $\xi_{x,y} \equiv \xi_y(x,\varphi)$, and we denote the collection of all scores over all $x$ and $y \in \mathcal Y$ by boldface $\boldsymbol\xi \equiv (\xi_{x,y})_{x,y}$.

Hessian. Eq. 2 defines the loss $\ell_{\text{neg.sampl.}}$ as a stochastic function. SGD minimizes its expectation,
$$\ell(\boldsymbol\xi) := \mathbb E\big[\ell_{\text{neg.sampl.}}(\boldsymbol\xi)\big] = -\sum_x \sum_{y\in\mathcal Y} \Big[ p_{\mathcal D}(y|x)\, \log\sigma(\xi_{x,y}) + p_n(y|x)\, \log\sigma(-\xi_{x,y}) \Big] \quad \text{(A1)}$$
where the sum over $x$ runs over all feature vectors in the training set. We obtain the gradient
$$g_{x,y} \equiv \partial_{\xi_{x,y}}\, \ell(\boldsymbol\xi) = -p_{\mathcal D}(y|x)\, \sigma(-\xi_{x,y}) + p_n(y|x)\, \sigma(\xi_{x,y}) \quad \text{(A2)}$$
where we used the relation $\partial_z \log\sigma(z) = \sigma(-z)$. The gradient is a vector whose components span all combinations of $x$ and $y$. The Hessian matrix $H_\ell$ contains the derivatives of each gradient component $g_{x,y}$ by each coordinate $\xi_{x',y'}$. Since $g_{x,y}$ in Eq.
A2 depends only on the single coordinate $\xi_{x,y}$, only the diagonal parts of the Hessian are nonzero, i.e., the components with $x' = x$ and $y' = y$. Thus,
$$H_\ell = \mathrm{diag}(\alpha_{x,y}) \quad \text{with} \quad \alpha_{x,y} = \partial_{\xi_{x,y}}\, g_{x,y}. \quad \text{(A3)}$$
Using the identity $\partial_z \sigma(z) = \sigma(z)\,\sigma(-z)$, we find
$$\alpha_{x,y} = \big[ p_{\mathcal D}(y|x) + p_n(y|x) \big]\, \sigma(-\xi_{x,y})\, \sigma(\xi_{x,y}). \quad \text{(A4)}$$
Since we evaluate the Hessian in the nonparametric limit at the minimum of the loss, the scores $\xi_{x,y}$ satisfy Eq. 11, i.e.,
$$p_{\mathcal D}(y|x)\, \sigma(-\xi_{x,y}) = p_n(y|x)\, \sigma(\xi_{x,y}). \quad \text{(A5)}$$
This allows us to simplify Eq. A4 by eliminating $p_{\mathcal D}$,
$$\alpha_{x,y} = p_n(y|x)\, \sigma(\xi_{x,y})\, \big[ \sigma(\xi_{x,y}) + \sigma(-\xi_{x,y}) \big] = p_n(y|x)\, \sigma(\xi_{x,y}). \quad \text{(A6)}$$
Eqs. A3 and A6 together prove Eq. 13 of the main text.

Noise Covariance Matrix. SGD uses estimates $\hat\ell$ of the loss function in Eq. A1, obtained by drawing a positive sample $(x,y) \sim \mathcal D$ and a label for the negative sample $y' \sim p_n(y'|x)$, thus
$$\hat\ell(\boldsymbol\xi) = -N \big[ \log\sigma(\xi_{x,y}) + \log\sigma(-\xi_{x,y'}) \big] \quad \text{(A7)}$$
where the factor of $N \equiv |\mathcal D|$ arises because the sum over $x$ in Eq. A1 scales proportionally to the size of the data set $\mathcal D$ (in practice one typically normalizes the loss function by $N$, which does not affect the signal-to-noise ratio). One uses $\hat\ell$ to obtain unbiased gradient estimates $\hat g$. We introduce new symbols $\tilde x$ and $\tilde y$ for the components $\hat g_{\tilde x,\tilde y}$ of the gradient estimate to avoid confusion with the $x$ and $y$ drawn from the data set and the $y'$ drawn from the noise distribution in Eq. A7 above. Since the scores are independent variables in the nonparametric limit, the derivative $\partial_{\xi_{\tilde x,\tilde y}}\, \xi_{x,y}$ is one if $\tilde x = x$ and $\tilde y = y$, and zero otherwise. We denote this by the indicator functions $\mathbb 1_{\tilde x = x}$ and $\mathbb 1_{\tilde y = y}$. Thus, we obtain
$$\hat g_{\tilde x,\tilde y} \equiv \partial_{\xi_{\tilde x,\tilde y}}\, \hat\ell(\boldsymbol\xi) = -N \big[ \sigma(-\xi_{x,y})\, \mathbb 1_{\tilde y = y} - \sigma(\xi_{x,y'})\, \mathbb 1_{\tilde y = y'} \big]\, \mathbb 1_{\tilde x = x}. \quad \text{(A8)}$$
We evaluate the covariance matrix of $\hat g$ at the minimum of the loss function. Here, $\mathbb E[\hat g] \equiv g = 0$, and thus $\mathrm{Cov}[\hat g, \hat g] \equiv \mathbb E[\hat g\, \hat g^\top] - \mathbb E[\hat g]\, \mathbb E[\hat g]^\top$ simplifies to $\mathbb E[\hat g\, \hat g^\top]$. Introducing yet another pair of indices $\tilde x'$ and $\tilde y'$ to distinguish the two factors of $\hat g$, we denote the components of the covariance matrix as
$$\mathrm{Cov}\big[\hat g_{\tilde x,\tilde y},\, \hat g_{\tilde x',\tilde y'}\big] \equiv \mathbb E_{(x,y)\sim\mathcal D,\; y'\sim p_n} \big[ \hat g_{\tilde x,\tilde y}\, \hat g_{\tilde x',\tilde y'} \big].$$
(A9)
Here, the expectation is over $p_{\mathcal D}(x,y)\, p_n(y'|x) = p_{\mathcal D}(x)\, p_{\mathcal D}(y|x)\, p_n(y'|x)$. We start with the evaluation of the expectation over $x$, using $\mathbb E_{x\sim p_{\mathcal D}}[\,\cdot\,] = \frac{1}{N} \sum_x [\,\cdot\,]$, where the sum runs over all $x$ in the data set. If $\tilde x \neq x$ or $\tilde x' \neq x$, then either one of the two gradient estimates $\hat g$ in the expectation on the right-hand side of Eq. A9 vanishes. Therefore, only terms with $\tilde x = \tilde x' = x$ contribute, and the covariance matrix is block diagonal in $x$, as claimed in Eq. 14 of the main text. The blocks $C_x$ of the block diagonal matrix have entries
$$(C_x)_{\tilde y,\tilde y'} \equiv \mathrm{Cov}\big[\hat g_{x,\tilde y},\, \hat g_{x,\tilde y'}\big] = \frac{1}{N}\, \mathbb E_{p_{\mathcal D}(y|x)\, p_n(y'|x)} \big[ \hat g_{x,\tilde y}\, \hat g_{x,\tilde y'} \big], \quad \text{(A10)}$$
where we find for the product $\hat g_{x,\tilde y}\, \hat g_{x,\tilde y'}$, by inserting Eq. A8 and multiplying out the terms,
$$\hat g_{x,\tilde y}\, \hat g_{x,\tilde y'} = N^2 \Big[ \sigma(-\xi_{x,y})^2\, \mathbb 1_{\tilde y = y}\, \mathbb 1_{\tilde y' = y} + \sigma(\xi_{x,y'})^2\, \mathbb 1_{\tilde y = y'}\, \mathbb 1_{\tilde y' = y'} - \sigma(-\xi_{x,y})\, \sigma(\xi_{x,y'})\, \mathbb 1_{\tilde y = y}\, \mathbb 1_{\tilde y' = y'} - \sigma(\xi_{x,y'})\, \sigma(-\xi_{x,y})\, \mathbb 1_{\tilde y = y'}\, \mathbb 1_{\tilde y' = y} \Big]. \quad \text{(A11)}$$
Taking the expectation in Eq. A10 leads to the following substitutions:
$$\mathbb 1_{\tilde y = y} \to p_{\mathcal D}(\tilde y|x); \qquad \mathbb 1_{\tilde y' = y} \to p_{\mathcal D}(\tilde y'|x); \qquad \mathbb 1_{\tilde y = y'} \to p_n(\tilde y|x); \qquad \mathbb 1_{\tilde y' = y'} \to p_n(\tilde y'|x). \quad \text{(A12)}$$
Thus, we find
$$(C_x)_{\tilde y,\tilde y'} = N \Big[ \big( p_{\mathcal D}(\tilde y|x)\, \sigma(-\xi_{x,\tilde y})^2 + p_n(\tilde y|x)\, \sigma(\xi_{x,\tilde y})^2 \big)\, \mathbb 1_{\tilde y = \tilde y'} - p_{\mathcal D}(\tilde y|x)\, p_n(\tilde y'|x)\, \sigma(-\xi_{x,\tilde y})\, \sigma(\xi_{x,\tilde y'}) - p_n(\tilde y|x)\, p_{\mathcal D}(\tilde y'|x)\, \sigma(\xi_{x,\tilde y})\, \sigma(-\xi_{x,\tilde y'}) \Big]. \quad \text{(A13)}$$
Using Eq. A5, we can again eliminate $p_{\mathcal D}$:
$$(C_x)_{\tilde y,\tilde y'} = N \Big[ \big( p_n(\tilde y|x)\, \sigma(\xi_{x,\tilde y})\, \sigma(-\xi_{x,\tilde y}) + p_n(\tilde y|x)\, \sigma(\xi_{x,\tilde y})^2 \big)\, \mathbb 1_{\tilde y = \tilde y'} - 2\, p_n(\tilde y|x)\, \sigma(\xi_{x,\tilde y})\, p_n(\tilde y'|x)\, \sigma(\xi_{x,\tilde y'}) \Big] = N \Big[ \alpha_{x,\tilde y}\, \big( \sigma(-\xi_{x,\tilde y}) + \sigma(\xi_{x,\tilde y}) \big)\, \mathbb 1_{\tilde y = \tilde y'} - 2\, \alpha_{x,\tilde y}\, \alpha_{x,\tilde y'} \Big] = N \Big[ \alpha_{x,\tilde y}\, \mathbb 1_{\tilde y = \tilde y'} - 2\, \alpha_{x,\tilde y}\, \alpha_{x,\tilde y'} \Big]. \quad \text{(A14)}$$
Eq. A14 is the component-wise explicit form of Eq. 14 of the main text.

A.2 EXPERIMENTAL COMPARISON BETWEEN SOFTMAX CLASSIFICATION AND NEGATIVE SAMPLING

We provide additional experimental results that evaluate the performance gap between negative sampling and full softmax classification on a smaller data set. Theorem 1 states an equivalence between negative sampling and softmax classification.
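As a quick sanity check on the derivation in A.1, the sigmoid identities used in Eqs. A2 and A4, as well as the closed form A6 for the Hessian diagonal at the optimum (Eq. A5), can be verified numerically. The sketch below uses our own notation and finite differences; it is illustrative only.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eps = 1e-6

# Identity used in Eq. A2: d/dz log sigma(z) = sigma(-z)
z = 0.7
d_log_sig = (np.log(sigmoid(z + eps)) - np.log(sigmoid(z - eps))) / (2 * eps)
assert abs(d_log_sig - sigmoid(-z)) < 1e-8

# Identity used in Eq. A4: d/dz sigma(z) = sigma(z) * sigma(-z)
d_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(d_sig - sigmoid(z) * sigmoid(-z)) < 1e-8

# At the optimum, p_D * sigma(-xi) = p_n * sigma(xi) (Eq. A5), which is
# solved by xi = log(p_D / p_n). There, the curvature of Eq. A4 reduces
# to the simple form of Eq. A6, alpha = p_n * sigma(xi).
p_D, p_n = 0.3, 0.1
xi = np.log(p_D / p_n)
alpha_A4 = (p_D + p_n) * sigmoid(-xi) * sigmoid(xi)
alpha_A6 = p_n * sigmoid(xi)
assert abs(alpha_A4 - alpha_A6) < 1e-10
```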
However, this equivalence strictly holds only (i) in the nonparametric limit, (ii) without regularization, and (iii) if the optimizer really finds the global minimum of the loss function. In practice, all three assumptions hold only approximately.

Data Set and Preprocessing. To evaluate the performance gap experimentally, we used the EURLex-4K data set (Bhatia et al.; Mencia & Fürnkranz, 2008), which is small enough to admit direct optimization of the softmax loss function. Similar to the preprocessing of the two main data sets described in Section 5 of the main text, we converted the multi-label classification problem into a single-label classification problem by selecting the label with the smallest ID for each data point and discarding any data points without labels. We split off 10% of the training set for validation and report results on the provided test set. This resulted in a training set with N = 13,960 data points and C = 3,687 categories. As in the main paper, we reduced the feature dimension to K = 512 (using PCA for simplicity here).

Model and Hyperparameters. The goal of these experiments is to evaluate the performance gap due to negative sampling in general. We therefore fitted the same affine linear model as described in Section 5 of the main text using the full softmax loss function (Eq. 1) and the simplest form of negative sampling (Eq. 2), i.e., negative sampling with a uniform noise distribution. We added a quadratic regularizer with strength $\lambda$ to both loss functions.

For both methods, we tested the same hyperparameter combinations as in Section 5 on the validation set using early stopping. For softmax, we extended the range of tested learning rates up to $\rho = 10$, as higher learning rates turned out to perform better for this method (which can be understood as a consequence of the low gradient noise).
The optimal hyperparameters for softmax turned out to be a learning rate of $\rho = 0.3$ and regularization strength $\lambda = 3 \times 10^{-4}$. For negative sampling, we found $\rho = 3 \times 10^{-3}$ and $\lambda = 3 \times 10^{-4}$.

Results. We evaluated the predictive accuracy for both methods. With the full softmax method, we obtain 33.6% correct predictions on the test set, whereas the predictive accuracy drops to 26.4% with negative sampling. This suggests that, when possible, minimizing the full softmax loss function should be preferred. However, in many cases, the softmax loss function is too expensive.
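The cost difference that makes full softmax impractical at scale is easy to see in the gradient structure: a full softmax gradient step updates all C weight rows, while a negative-sampling step touches only the rows of the positive and the sampled negative label. The following sketch (our own illustrative names, not the paper's code) makes this concrete for a linear model.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
C, K = 10_000, 32
W = np.zeros((C, K))
x = rng.normal(size=K)

def full_softmax_grad(W, x, y):
    """Gradient of the full softmax NLL w.r.t. W: dense, all C rows nonzero."""
    s = W @ x
    probs = np.exp(s - s.max())
    probs /= probs.sum()
    grad = np.outer(probs, x)    # row c gets probs[c] * x
    grad[y] -= x                 # minus x for the observed label
    return grad

def neg_sampling_grad(W, x, y_pos, y_neg):
    """Gradient of one negative-sampling term: only two rows are nonzero."""
    grad = np.zeros_like(W)
    grad[y_pos] = -sigmoid(-W[y_pos] @ x) * x
    grad[y_neg] = sigmoid(W[y_neg] @ x) * x
    return grad

g_full = full_softmax_grad(W, x, y=0)
g_neg = neg_sampling_grad(W, x, y_pos=0, y_neg=42)
rows_touched_full = np.count_nonzero(np.any(g_full != 0, axis=1))
rows_touched_neg = np.count_nonzero(np.any(g_neg != 0, axis=1))
```

For a generic feature vector, the full softmax gradient has all 10,000 rows nonzero, whereas the negative-sampling gradient touches exactly two, which is why the latter's per-step cost is independent of C.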