# Learning from eXtreme Bandit Feedback

Romain Lopez¹, Inderjit S. Dhillon²,³, Michael I. Jordan¹,²
¹ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
² Amazon.com
³ Department of Computer Science, The University of Texas at Austin
{romain lopez, jordan}@cs.berkeley.edu, inderjit@cs.utexas.edu

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure named Policy Optimization for eXtreme Models (POXM) for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.

## Introduction

In the classical supervised learning paradigm, it is assumed that every data point is accompanied by a label. Such labels provide a very strong notion of feedback, where the learner is able to assess not only the loss associated with the action that they have chosen but can also assess losses of actions that they did not choose. A useful weakening of this paradigm involves considering so-called bandit feedback, where the training data simply provides evaluations of selected actions without delineating the correct action. Bandit feedback is often viewed as the province of reinforcement learning, but it is also possible to combine bandit feedback with supervised learning by considering a batch setting in which each data point is accompanied by an evaluation and there is no temporal component. This is the Batch Learning from Bandit Feedback (BLBF) problem (Swaminathan and Joachims 2015a). Of particular interest is the off-policy setting where the training data is provided by a logging policy, which differs from the learner's policy and differs from the optimal policy.
Such problems arise in many real-world settings, including supply chains, online markets, and recommendation systems (Rahul, Dahiya, and Singh 2019), where abundant data is available in a logged format but not in a classical supervised learning format. Another difficulty with the classical notion of a label is that real-world problems often involve huge action spaces. This is the case, for example, in real-world recommendation systems where there may be billions of products and hundreds of millions of consumers. Not only is the cardinality of the action space challenging both from a computational point of view and a statistical point of view, but even the semantics of the labels can become obscure: it can be difficult to place an item conceptually in one and only one category. Such challenges have motivated the development of eXtreme multi-label classification (XMC) and eXtreme Regression (XR) (Bhatia et al. 2016) methods, which focus on computational scalability issues and target settings involving millions of labels. These methods have had real-world applications in domains such as e-commerce (Agrawal et al. 2013) and dynamic search advertising (Prabhu et al. 2018, 2020).

We assert that the issues of bandit feedback and extreme-scale action spaces are related. Indeed, it is when action spaces are large that it is particularly likely that feedback will only be partial. Moreover, large action spaces tend to support multiple tasks and grow in size and scope over time, making it likely that available data will be in the form of a logging policy and not a single target input-output mapping. We also note that the standard methodology for accommodating the difference between the logging policy and an optimal policy needs to be considered carefully in the setting of large action spaces. Indeed, the standard methodology is some form of importance sampling (Swaminathan and Joachims 2015a), and importance sampling estimators can run aground when their variance is too high (see, e.g., Lefortier et al. (2016)). Such variance is likely to be particularly virulent in large action spaces. Some examples of the XMC framework do treat labels as subject to random variation (Jain, Prabhu, and Varma 2016), but only with the goal of improving the prediction of rare labels; they do not tackle the broader problem of learning from logging policies in extreme-scale action spaces. It is precisely this broader problem that is our focus in the current paper.

The literature on offline policy learning in Reinforcement Learning (RL) has also been concerned with correcting for implicit feedback bias (see, e.g., Degris, White, and Sutton (2012)). This line of work differs from ours, however, in that the focus in RL is not on extremely large action spaces, and RL is often based on simulators rather than logging policies (Chen et al. 2019c; Bai, Guan, and Wang 2019). Closest to our work is the work of Chen et al. (2019b), who propose to use offline policy gradients on a large action space (millions of items). Their method relies, however, on a proprietary action embedding, unavailable to us.

After a brief overview of BLBF and XMC, we present a new form of BLBF that blends bandit feedback with multi-label classification. We introduce a novel assumption, specific to the XMC setting, in which most actions are irrelevant (i.e., incur a null reward) for a particular instance.
This motivates a Rao-Blackwellized (Casella and Robert 1996) estimator of the policy value for which only a small set of relevant actions per instance are considered. We refer to this approach as selective importance sampling (sIS). We provide a theoretical analysis of the bias-variance tradeoff of the sIS estimator compared to naive importance sampling. In practice, the selected actions for the sIS estimator are the top-p actions from the logging policy, where p can be adjusted from the data. We derive a novel learning method based on the sIS estimator, which we refer to as Policy Optimization for eXtreme Models (POXM). Finally, we propose a modification of a state-of-the-art neural XMC method, AttentionXML (You et al. 2019b), to learn from bandit feedback. Using a supervised-learning-to-bandit conversion (Dudik, Langford, and Li 2011), we benchmark POXM against BanditNet (Joachims, Swaminathan, and de Rijke 2018), a partial matching scheme from Wang et al. (2016), and a supervised learning baseline on three XMC datasets (EUR-Lex, Wiki10-31K and Amazon-670K) (Bhatia et al. 2016). We show that naive application of the state-of-the-art method BanditNet (Joachims, Swaminathan, and de Rijke 2018) sometimes improves over the logging policy, but only marginally. Conversely, POXM provides substantial improvement over the logging policy as well as supervised learning baselines.

## eXtreme Multi-label Classification (XMC)

Multi-label classification aims at assigning a relevant subset $Y \subseteq [L]$ to an instance $x$, where $[L] = \{1, \ldots, L\}$ denotes the set of $L$ possible labels. XMC is a specific case of multi-label classification in which we further assume that all $Y$ are small subsets of a massive collection (i.e., generally $|Y|/L < 0.01$). Naive one-versus-all approaches to multi-label classification usually do not scale to such a large number of labels, and ad hoc methods are often employed. Furthermore, the marginal distribution of labels across all instances exhibits a long tail, which causes additional statistical challenges.

Algorithmic approaches to XMC include optimized one-versus-all methods (Babbar and Schölkopf 2017, 2019; Yen et al. 2017, 2016), embedding-based methods (Bhatia et al. 2015; Tagami 2017; Guo et al. 2019), probabilistic label tree-based methods (Prabhu et al. 2018; Jasinska et al. 2016; Khandagale, Xiao, and Babbar 2020; Wydmuch et al. 2018) and deep learning-based methods (You et al. 2019b; Liu et al. 2017; You et al. 2019a; Chang et al. 2019). Each algorithm usually proposes a specific approach to model the text as well as to deal with tail labels. For example, Babbar and Schölkopf (2019) use a robust SVM approach on TF-IDF features. PfastreXML (Jain, Prabhu, and Varma 2016) assumes a particular noise model for the observed labels and proposes to weight the importance of tail labels. AttentionXML (You et al. 2019b) uses a bidirectional LSTM to embed the raw text as well as a multi-label attention mechanism to help capture the most relevant part of the input text for each label. For datasets with large $L$, AttentionXML trains one model per layer of a shallow and wide probabilistic label tree using a small set of candidate labels.

## Batch Learning from Bandit Feedback (BLBF)

We assume that the instance $x$ is sampled from a distribution $P(x)$. The action for this particular instance is a unique label $y \in [L]$, sampled from the logging policy $\rho(y \mid x)$, and a feedback value $r \in \mathbb{R}$ is observed. Repeating this data collection process yields the dataset $[(x_i, y_i, r_i)]_{i=1}^{n}$.
The BLBF problem consists in maximizing the expected reward $V(\pi)$ of a policy $\pi$. We use importance sampling (IS) to estimate $V(\pi)$ from data collected by the logging policy as follows:

$$\hat{V}_{\mathrm{IS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\rho(y_i \mid x_i)}\, r_i. \tag{1}$$

Classically, identifying the optimal policy via this estimator is infeasible without a thorough exploration of the action space by the logging policy (Langford, Strehl, and Wortman 2008). More specifically, the IS estimator $\hat{V}_{\mathrm{IS}}(\pi)$ requires the following basic assumption for there to be any hope of asymptotic optimality:

**Assumption 1.** There exists a scalar $\epsilon > 0$ such that for all $x \in \mathbb{R}^d$ and $y \in [L]$, $\rho(y \mid x) > \epsilon$.

The IS estimator has high variance when $\pi$ assigns actions that are infrequent under $\rho$; hence a variety of regularization schemes have been developed, based on risk-upper-bound minimization, to control variance. Examples of upper bounds include empirical Bernstein concentration bounds (Swaminathan and Joachims 2015a) and various divergence-based bounds (Atan, Zame, and van der Schaar 2018; Wu and Wang 2018; Johansson, Shalit, and Sontag 2016; Lopez et al. 2020). Another common strategy for reducing the variance is to posit a model of the reward function and use it as a baseline in a doubly robust estimator (Dudik, Langford, and Li 2011; Su et al. 2019).

A recurrent issue with BLBF is that the policy may avoid actions in the training set when the rewards are not scaled properly; this is the phenomenon of propensity overfitting. Swaminathan and Joachims (2015b) tackled this problem via the self-normalized importance sampling estimator (SNIS), in which IS estimates are normalized by the average importance weight. SNIS is invariant to translation of the rewards and may be used as a safeguard against propensity overfitting. BanditNet (Joachims, Swaminathan, and de Rijke 2018) made this approach amenable to stochastic optimization by translating the reward distribution:

$$\hat{V}_{\mathrm{BN}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\rho(y_i \mid x_i)}\, [r_i - \lambda], \tag{2}$$

and selecting $\lambda$ over a small grid based on the SNIS estimate of the policy value.

Learning an XMC algorithm from bandit feedback requires offline learning from slates $Y$, where each element of the slate comes from a large action space. Swaminathan et al. (2017) propose a pseudo-inverse estimator for offline learning from combinatorial bandits. However, such an approach is intractable for large action spaces as it requires inverting a matrix whose size is linear in the number of actions. Another line of work focuses on offline evaluation and learning of semi-bandits for ranking (Li et al. 2018; Joachims, Swaminathan, and Schnabel 2017), but only with a small number of actions. In internet marketing applications, a partial matching strategy between instances and relevant actions has been applied to policy evaluation on real-world data (Wang et al. 2016; Li, Kim, and Zitouni 2015). More recently, Chen et al. (2019b) proposed a top-k off-policy correction method for a real-world recommender system. Their approach deals with millions of actions, although it treats label embeddings as given, whereas learning such embeddings is in general a hard problem for XMC.

## Bandit Feedback and Multi-label Classification

We consider a setting in which the algorithm (e.g., a policy for a recommendation system) observes side information $x \in \mathbb{R}^d$ and is allowed to output a subset $Y \subseteq [L]$ of the $L$ possible labels. Side information is sampled independently at each round from a distribution $P(x)$. We assume that the subset $Y$ has fixed size $|Y| = \ell$, which allows us to adopt the slate notation $Y = (y_1, \ldots, y_\ell)$.
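To make the estimators above concrete, here is a minimal NumPy sketch of the IS, SNIS, and BanditNet-style objectives of Eqs. (1)–(2). It assumes that the logged propensities $\rho(y_i \mid x_i)$ were recorded at data-collection time; the array names and helper functions are illustrative and not part of the paper's implementation.

```python
import numpy as np

def is_estimate(pi_probs, rho_probs, rewards):
    """Vanilla importance sampling estimate of V(pi), as in Eq. (1).

    pi_probs  : pi(y_i | x_i) under the target policy, shape (n,)
    rho_probs : logged propensities rho(y_i | x_i), shape (n,)
    rewards   : observed feedback r_i, shape (n,)
    """
    weights = pi_probs / rho_probs
    return np.mean(weights * rewards)

def snis_estimate(pi_probs, rho_probs, rewards):
    """Self-normalized IS (SNIS): normalize by the average importance weight."""
    weights = pi_probs / rho_probs
    return np.sum(weights * rewards) / np.sum(weights)

def banditnet_estimate(pi_probs, rho_probs, rewards, lam):
    """BanditNet-style translated objective, as in Eq. (2): IS applied to (r - lambda)."""
    weights = pi_probs / rho_probs
    return np.mean(weights * (rewards - lam))
```

In the BanditNet procedure described above, the translation $\lambda$ would be chosen over a small grid by re-evaluating the learned policy with the SNIS estimate.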
The algorithm observes noisy feedback for each label, $R = (r_1, \ldots, r_\ell)$, and we further assume that the joint distribution over $R$ decomposes as $P(R \mid x, Y) = \prod_{j=1}^{\ell} P(r_j \mid x, y_j)$. We denote the conditional mean reward as a function: $\delta(x, y) = \mathbb{E}[r \mid x, y]$. In the case of multi-label classification, this feedback can be formed with random variables indicating whether each individual label is inside the true set of labels for each datapoint (Gentile and Orabona 2014). More concretely, feedback may be formed from sale or click information (Chen et al. 2019b,c).

We are interested in optimizing a policy $\pi(Y \mid x)$ from offline data. Accessible data is sampled according to an existing algorithm, the logging policy $\rho(Y \mid x)$. We assume that both joint distributions over the slate decompose into an auto-regressive process. For example, for $\pi$ we assume:

$$\pi(Y \mid x) = \prod_{j=1}^{\ell} \pi(y_j \mid x, y_{1:j-1}). \tag{3}$$

Introducing this decomposition does not result in any loss of generality, as long as the action order is identifiable (otherwise, one would need to consider all possible orderings (Kool, van Hoof, and Welling 2020b)). This is a reasonable hypothesis because the order of the actions may also be logged as supplementary information. We now define the value of a policy $\pi$ as:

$$V(\pi) = \mathbb{E}_{P(x)}\, \mathbb{E}_{\pi(Y \mid x)}\, \mathbb{E}\left[\mathbf{1}^\top R \mid x, Y\right]. \tag{4}$$

In our setting, the reward decomposes as a sum of independent contributions of each individual action. The reward may in principle be generalized to be rank dependent, or to consider interactions between items (Gentile and Orabona 2014), but this is beyond the scope of this work.

A general approach for offline policy learning is to estimate $V(\pi)$ from logged data using importance sampling (Swaminathan and Joachims 2015a). As emphasized in Swaminathan et al. (2017), the combinatorial size of the action space, $\Omega(L^{\ell})$, may yield an impractical variance for importance sampling. This is particularly the case for XMC, where typical values of $L$ are at a minimum in the thousands. A natural strategy to improve over the IS estimator on the slate $Y$ is to exploit the additive reward decomposition in Eq. (4). Along with the factorization of the policy in Eq. (3), we may reformulate the policy value as:

$$V(\pi) = \mathbb{E}_{P(x)} \sum_{j=1}^{\ell} \mathbb{E}_{\pi(y_{1:j} \mid x)}\, \delta(x, y_j). \tag{5}$$

The benefit of this new decomposition is that instead of performing importance sampling on $Y$, we can now use $\ell$ IS estimators, each with a better bias-variance tradeoff.

Unbiased estimation of $V(\pi)$ in Eq. (5) via importance sampling still requires Assumption 1. The logging policy must therefore explore a large action space. However, most actions are unrelated to a given context, and deploying an online logging policy that satisfies Assumption 1 may yield a poor customer experience.

## Learning from eXtreme Bandit Feedback

We now explore alternative assumptions for the logging policy that may be more suitable to the setting of very large action spaces. We formalize the notion that most actions are irrelevant using the following assumption:

**Assumption 2** (Sparse feedback condition). The individual feedback random variable $r$ takes values in the bounded interval $[\underline{\delta}, \overline{\delta}]$. For all $x \in \mathbb{R}^d$, the label set $[L]$ can be partitioned as $[L] = \Psi(x) \cup \Psi_0(x)$ such that for all actions $y \in \Psi_0(x)$, the expected reward is minimal: $\delta(x, y) = \underline{\delta}$.

We refer to the function $\Psi$ as an action selector, as it maps a context to a set of relevant actions. Throughout the manuscript, we use the notation $\Lambda_0$ to refer to the pointwise set complement of any action selector $\Lambda$. Intuitively, we are interested in the case where $|\Psi(x)| \ll L$ for all $x$.
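To fix ideas, the following NumPy sketch computes the per-position IS estimate implied by Eqs. (3)–(5), assuming that per-position propensities $\rho(y_{i,j} \mid x_i, y_{i,1:j-1})$ were logged alongside the slates; the array names are illustrative and not from the paper's implementation. Under Assumption 2, most actions contribute only the minimal reward, which is precisely the structure that the selective estimator developed next exploits.

```python
import numpy as np

def slate_is_estimate(pi_probs, rho_probs, rewards):
    """Per-position IS estimate of the slate policy value, following Eq. (5).

    All arrays have shape (n, ell); entry (i, j) refers to the j-th action
    y_{i,j} of the logged slate for instance x_i:
      pi_probs  : pi(y_{i,j} | x_i, y_{i,1:j-1}) under the target policy
      rho_probs : the corresponding logged propensities
      rewards   : the per-action feedback r_{i,j}
    """
    weights = pi_probs / rho_probs                     # one IS weight per slate position
    return np.mean(np.sum(weights * rewards, axis=1))  # sum over positions, average over instances
```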
Assumption 2 is implicitly used in online marketing applications of offline policy evaluation, formulated as a partial matching between actions and instances (Wang et al. 2016; Li, Kim, and Zitouni 2015). Notably, this assumption can be viewed as a mixed-bandit feedback setting, in which we observe feedback for all of $\Psi_0(x)$ but only for one selected action inside of $\Psi(x)$. Under Assumption 2, the IS estimator will be unbiased for all logging policies that satisfy the following relaxed assumption:

**Assumption 3** ($\Psi$-overlap condition). There exists a scalar $\epsilon > 0$ such that for all $x \in \mathbb{R}^d$ and $y \in \Psi(x)$, $\rho(y \mid x) > \epsilon$.

Batch learning from bandit feedback may be possible under this assumption, as long as the logging policy explores a set of actions large enough to cover the actions from $\Psi$ but small enough to avoid exploring too many suboptimal actions. Furthermore, Assumption 2 also reveals the existence of $\Psi_0(x)$, a sufficient statistic for estimating the reward on the irrelevant actions. Appealing to Rao-Blackwellization (Casella and Robert 1996), we can incorporate this information to estimate each of the $\ell$ terms of Eq. (5) (e.g., in the case $\ell = 1$ and $\underline{\delta} = 0$):

$$V(\pi) = \mathbb{E}_{P(x)} \left[ \pi\left(\Psi(x) \mid x\right)\, \mathbb{E}_{\pi(y \mid x)}\left[\delta(x, y) \mid y \in \Psi(x)\right] \right]. \tag{6}$$

The decomposition in Eq. (6) suggests that when the action selector $\Psi$ is known, one can estimate $V(\pi)$ via importance sampling for the conditional expectation of the rewards with respect to the event $\{y \in \Psi(x)\}$. Intuitively, this means that one can modify the importance sampling scheme to only include a relevant subset of labels and ignore all the others. Without loss of generality, we assume that $\underline{\delta} = 0$ in the remainder of this manuscript.

In practice, the oracle action selector $\Psi$ is unknown and needs to be estimated from the data. It may be hard to infer the smallest $\Psi$ such that Assumption 2 is satisfied. Conversely, a trivial action selector including all actions is valid (it does include all relevant actions) but is ultimately impractical. As a flexible compromise, we will replace $\Psi$ in Eq. (6) by any action selector $\Phi$ and study the bias-variance tradeoff of the resulting plugin estimator.

Let $\rho$ be a logging policy with a large enough support to satisfy Assumption 3. Let $\Phi$ be an action selector such that $\Phi(x) \subseteq \operatorname{supp} \rho(\cdot \mid x)$ almost surely in $x$, where $\operatorname{supp}$ denotes the support of a probability distribution. The role of $\Phi$ is to prune out actions to maintain an optimal bias-variance tradeoff. In the case $\ell = 1$, the $\Phi$-selective importance sampling (sIS) estimator $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ for action selector $\Phi$ can be written as:

$$\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi\left(y_i \mid x_i, y \in \Phi(x_i)\right)}{\rho(y_i \mid x_i)}\, r_i. \tag{7}$$

Its bias and variance depend on how different the policy $\pi$ is from the logging policy $\rho$ (as in classical BLBF), but also on the degree of overlap of $\Phi$ with $\Psi$:

**Theorem 1** (Bias-variance tradeoff of selective importance sampling). Let $R$ and $\rho$ satisfy Assumptions 2 and 3. Let $\Phi$ be an action selector such that $\Phi(x) \subseteq \operatorname{supp} \rho(\cdot \mid x)$ almost surely in $x$. The bias of the sIS estimator satisfies:

$$\left| \mathbb{E}\left[\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)\right] - V(\pi) \right| \le \overline{\delta}\, \kappa(\pi, \Psi, \Phi), \tag{8}$$

where $\kappa(\pi, \Psi, \Phi) = \mathbb{E}_{P(x)}\, \pi\left(\Psi(x) \cap \Phi_0(x) \mid x\right)$ quantifies the overlap between the oracle action selector $\Psi$ and the proposed action selector $\Phi$, weighted by the policy $\pi$. The performance of the two estimators can be compared as follows:

$$\mathrm{MSE}\left[\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)\right] \le \mathrm{MSE}\left[\hat{V}_{\mathrm{IS}}(\pi)\right] + \frac{2\,\overline{\delta}^{\,2}\, \kappa(\pi, \Psi, \Phi)}{n} - \frac{\sigma^2}{n}\, \mathbb{E}_{P(x)}\left[\frac{\pi^2\left(\Phi_0(x) \mid x\right)}{\rho\left(\Phi_0(x) \mid x\right)}\right], \tag{9}$$

where $\sigma^2 = \inf_{(x, y) \in \mathbb{R}^d \times [L]} \mathbb{E}[r^2 \mid x, y]$. We provide the complete proof of this theorem in the appendix (see https://arxiv.org/abs/2009.12947 for supplementary information).
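As a concrete illustration of Eq. (7), the sketch below computes the sIS estimate when the target policy is parametrized by per-action scores and $\Phi(x_i)$ is given as a set of candidate indices (for instance, the top-p actions of the logger, as proposed below). The softmax parametrization and all names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

def sis_estimate(pi_scores, rho_probs, rewards, logged_actions, phi_sets):
    """Selective importance sampling estimate of V(pi), following Eq. (7) (ell = 1).

    pi_scores      : (n, L) unnormalized scores of the target policy
    rho_probs      : (n,)   logged propensities rho(y_i | x_i)
    rewards        : (n,)   observed feedback r_i
    logged_actions : (n,)   logged action indices y_i
    phi_sets       : list of n integer arrays, the selected actions Phi(x_i)
    """
    n = len(rewards)
    total = 0.0
    for i in range(n):
        phi = np.asarray(phi_sets[i])
        # If the logged action falls outside Phi(x_i), then
        # pi(y_i | x_i, y in Phi(x_i)) = 0 and the term contributes nothing.
        if logged_actions[i] not in phi:
            continue
        # Restrict pi to Phi(x_i): softmax over the selected scores only.
        scores = pi_scores[i, phi]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        pi_restricted = probs[np.flatnonzero(phi == logged_actions[i])[0]]
        total += pi_restricted / rho_probs[i] * rewards[i]
    return total / n
```

The skipped branch reflects the fact that actions outside $\Phi(x_i)$ receive zero mass under the restricted policy, which is exactly how the sIS estimator ignores presumably irrelevant actions.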
As expected from Rao-Blackwellization, we see that if $\Phi$ completely covers $\Psi$ (i.e., for all $x \in \mathbb{R}^d$, $\Psi(x) \subseteq \Phi(x)$), then $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ is unbiased and has more favorable performance than $\hat{V}_{\mathrm{IS}}(\pi)$. Admittedly, Eq. (9) shows that both estimators have similar mean-squared error when $\pi$ puts no mass on potentially irrelevant actions $y \in \Phi_0(x)$. However, during the process of learning the optimal policy, or in the event of propensity overfitting, we expect $\pi$ to put non-zero mass on potentially irrelevant actions $y \in \Phi_0(x)$ with positive probability in $x$. For these reasons, we expect $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ to provide significant improvement over $\hat{V}_{\mathrm{IS}}(\pi)$ for policy learning.

Even though Eq. (9) provides insight into the performance of sIS, it unfortunately cannot be used directly to select $\Phi$. We instead propose a greedy heuristic based on a small number of candidate action selectors. For example, $\Phi_p(x)$ corresponds to the top-p labels for instance $x$ according to the logging policy. With this approach, the bias of the $\hat{V}^{\Phi_p}_{\mathrm{sIS}}(\pi)$ estimator is a decreasing function of $p$, as the overlap with $\Psi$ increases. Furthermore, the variance increases with $p$ as long as the added actions are irrelevant. In practice, we use a small grid search over $p \in \{10, 20, 50, 100\}$ and choose the optimal $p$ with the SNIS estimator, as in BanditNet. We believe this is a reasonable approach whenever the logging policy ranks the relevant items sufficiently high (e.g., within the top-p for p between 5 and 100), though it can certainly be improved.

Figure 1: Expected R@5 and CDF of the logging policy for the top-k actions on each XMC dataset. Exploration is limited to a subset of relevant actions.

## Policy Optimization for eXtreme Models

We apply sIS to each of the $\ell$ terms of the policy value from Eq. (5) in order to estimate $V(\pi)$ from bandit feedback, $(x_i, Y_i, R_i)_{i=1}^{n}$, and learn an optimal policy. As an additional step to reduce the variance, we prune the importance sampling weights of earlier slate actions, following Achiam et al. (2017):

$$\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{\ell} \frac{\pi_\Phi\left(y_{i,j} \mid x_i, y_{i,1:j-1}\right)}{\rho\left(y_{i,j} \mid x_i, y_{i,1:j-1}\right)}\, r_{i,j}, \tag{10}$$

where $\pi_\Phi$ designates the distribution $\pi$ restricted to the set $\Phi(x)$ for every $x$. By Sklar's theorem (Sklar 1959), the joint distribution $\pi(Y \mid x)$ can be constructed from its one-dimensional marginals via a copula. In this work, we focus on the case of ordered sampling without replacement in order to respect an important design restriction: the slate $Y$ must not have redundant actions. For the $j$-th slate component, the relevant conditional probability is formed from the base marginal probabilities $\pi(y \mid x)$ as follows:

$$\pi_\Phi\left(y_j \mid x, y_{1:j-1}\right) = \frac{\pi(y_j \mid x)}{\sum_{y \in \Phi(x)} \pi(y \mid x) - \sum_{k=1}^{j-1} \pi(y_k \mid x)}.$$
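To illustrate this renormalization, here is a minimal sketch that computes the conditional propensities $\pi_\Phi(y_j \mid x, y_{1:j-1})$ of an ordered slate drawn without replacement from the selected set $\Phi(x)$ (e.g., the top-p actions of the logging policy). The function and variable names are illustrative assumptions, not part of the POXM codebase.

```python
import numpy as np

def poxm_slate_propensities(pi_marginals, slate, phi):
    """Conditional probabilities pi_Phi(y_j | x, y_{1:j-1}) for an ordered slate
    sampled without replacement from the selected action set Phi(x).

    pi_marginals : (L,) marginal policy probabilities pi(y | x)
    slate        : sequence of action indices (y_1, ..., y_ell), all inside phi
    phi          : integer array of selected actions Phi(x), e.g., top-p of the logger
    """
    remaining_mass = pi_marginals[np.asarray(phi)].sum()  # total mass of Phi(x)
    probs = []
    for y_j in slate:
        probs.append(pi_marginals[y_j] / remaining_mass)  # renormalize over not-yet-drawn actions
        remaining_mass -= pi_marginals[y_j]               # remove y_j before the next position
    return np.array(probs)
```

Dividing these conditional probabilities by the corresponding logged propensities and weighting by the per-action feedback $r_{i,j}$, as in Eq. (10), then yields the POXM training objective.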