# Learning from eXtreme Bandit Feedback

Romain Lopez¹, Inderjit S. Dhillon²,³, Michael I. Jordan¹,²
¹ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
² Amazon.com
³ Department of Computer Science, The University of Texas at Austin
{romain lopez, jordan}@cs.berkeley.edu, inderjit@cs.utexas.edu

We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure named Policy Optimization for eXtreme Models (POXM) for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.

## Introduction

In the classical supervised learning paradigm, it is assumed that every data point is accompanied by a label. Such labels provide a very strong notion of feedback, where the learner is able to assess not only the loss associated with the action that they have chosen but can also assess losses of actions that they did not choose. A useful weakening of this paradigm involves considering so-called bandit feedback, where the training data simply provides evaluations of selected actions without delineating the correct action. Bandit feedback is often viewed as the province of reinforcement learning, but it is also possible to combine bandit feedback with supervised learning by considering a batch setting in which each data point is accompanied by an evaluation and there is no temporal component. This is the Batch Learning from Bandit Feedback (BLBF) problem (Swaminathan and Joachims 2015a). Of particular interest is the off-policy setting where the training data is provided by a logging policy, which differs from the learner's policy and differs from the optimal policy.
Such problems arise in many real-world settings, including supply chains, online markets, and recommendation systems (Rahul, Dahiya, and Singh 2019), where abundant data is available in a logged format but not in a classical supervised learning format. Another difficulty with the classical notion of a label is that real-world problems often involve huge action spaces. This is the case, for example, in real-world recommendation systems where there may be billions of products and hundreds of millions of consumers. Not only is the cardinality of the action space challenging both from a computational point of view and a statistical point of view, but even the semantics of the labels can become obscure: it can be difficult to place an item conceptually in one and only one category. Such challenges have motivated the development of eXtreme multi-label classification (XMC) and eXtreme Regression (XR) (Bhatia et al. 2016) methods, which focus on computational scalability issues and target settings involving millions of labels. These methods have had real-world applications in domains such as e-commerce (Agrawal et al. 2013) and dynamic search advertising (Prabhu et al. 2018, 2020).

We assert that the issues of bandit feedback and extreme-scale action spaces are related. Indeed, it is when action spaces are large that it is particularly likely that feedback will only be partial. Moreover, large action spaces tend to support multiple tasks and grow in size and scope over time, making it likely that available data will be in the form of a logging policy and not a single target input-output mapping. We also note that the standard methodology for accommodating the difference between the logging policy and an optimal policy needs to be considered carefully in the setting of large action spaces. Indeed, the standard methodology is some form of importance sampling (Swaminathan and Joachims 2015a), and importance sampling estimators can run aground when their variance is too high (see, e.g., Lefortier et al. (2016)). Such variance is likely to be particularly virulent in large action spaces. Some examples of the XMC framework do treat labels as subject to random variation (Jain, Prabhu, and Varma 2016), but only with the goal of improving the prediction of rare labels; they do not tackle the broader problem of learning from logging policies in extreme-scale action spaces. It is precisely this broader problem that is our focus in the current paper.

The literature on offline policy learning in Reinforcement Learning (RL) has also been concerned with correcting for implicit feedback bias (see, e.g., Degris, White, and Sutton (2012)). This line of work differs from ours, however, in that the focus in RL is not on extremely large action spaces, and RL is often based on simulators rather than logging policies (Chen et al. 2019c; Bai, Guan, and Wang 2019). Closest to our work is the work of Chen et al. (2019b), who propose to use offline policy gradients on a large action space (millions of items). Their method relies, however, on a proprietary action embedding, unavailable to us.

After a brief overview of BLBF and XMC, we present a new form of BLBF that blends bandit feedback with multi-label classification. We introduce a novel assumption, specific to the XMC setting, in which most actions are irrelevant (i.e., incur a null reward) for a particular instance.
This motivates a Rao-Blackwellized (Casella and Robert 1996) estimator of the policy value for which only a small set of relevant actions per instance are considered. We refer to this approach as selective importance sampling (sIS). We provide a theoretical analysis of the bias-variance tradeoff of the sIS estimator compared to naive importance sampling. In practice, the selected actions for the sIS estimator are the top-p actions from the logging policy, where p can be adjusted from the data. We derive a novel learning method based on the sIS estimator, which we refer to as Policy Optimization for eXtreme Models (POXM). Finally, we propose a modification of a state-of-the-art neural XMC method, AttentionXML (You et al. 2019b), to learn from bandit feedback. Using a supervised-learning-to-bandit conversion (Dudik, Langford, and Li 2011), we benchmark POXM against BanditNet (Joachims, Swaminathan, and de Rijke 2018), a partial matching scheme from Wang et al. (2016), and a supervised learning baseline on three XMC datasets (EUR-Lex, Wiki10-31K and Amazon-670K) (Bhatia et al. 2016). We show that naive application of the state-of-the-art method BanditNet (Joachims, Swaminathan, and de Rijke 2018) sometimes improves over the logging policy, but only marginally. Conversely, POXM provides substantial improvement over the logging policy as well as supervised learning baselines.

## eXtreme Multi-label Classification (XMC)

Multi-label classification aims at assigning a relevant subset $Y \subseteq [L]$ to an instance $x$, where $[L] = \{1, \ldots, L\}$ denotes the set of $L$ possible labels. XMC is a specific case of multi-label classification in which we further assume that all $Y$ are small subsets of a massive collection (i.e., generally $|Y|/L < 0.01$). Naive one-versus-all approaches to multi-label classification usually do not scale to such a large number of labels, and ad hoc methods are often employed. Furthermore, the marginal distribution of labels across all instances exhibits a long tail, which causes additional statistical challenges.

Algorithmic approaches to XMC include optimized one-versus-all methods (Babbar and Schölkopf 2017, 2019; Yen et al. 2017, 2016), embedding-based methods (Bhatia et al. 2015; Tagami 2017; Guo et al. 2019), probabilistic label tree-based methods (Prabhu et al. 2018; Jasinska et al. 2016; Khandagale, Xiao, and Babbar 2020; Wydmuch et al. 2018) and deep learning-based methods (You et al. 2019b; Liu et al. 2017; You et al. 2019a; Chang et al. 2019). Each algorithm usually proposes a specific approach to model the text as well as to deal with tail labels. For example, Babbar and Schölkopf (2019) use a robust SVM approach on TF-IDF features. PfastreXML (Jain, Prabhu, and Varma 2016) assumes a particular noise model for the observed labels and proposes to weight the importance of tail labels. AttentionXML (You et al. 2019b) uses a bidirectional LSTM to embed the raw text as well as a multi-label attention mechanism to help capture the most relevant part of the input text for each label. For datasets with large $L$, AttentionXML trains one model per layer of a shallow and wide probabilistic label tree using a small set of candidate labels.

## Batch Learning from Bandit Feedback (BLBF)

We assume that the instance $x$ is sampled from a distribution $P(x)$. The action for this particular instance is a unique label $y \in [L]$, sampled from the logging policy $\rho(y \mid x)$, and a feedback value $r \in \mathbb{R}$ is observed. Repeating this data collection process yields the dataset $[(x_i, y_i, r_i)]_{i=1}^{n}$.
The BLBF problem consists in maximizing the expected reward $V(\pi)$ of a policy $\pi$. We use importance sampling (IS) to estimate $V(\pi)$ from data collected by the logging policy as follows:

$$\hat{V}_{\mathrm{IS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\rho(y_i \mid x_i)}\, r_i. \tag{1}$$

Classically, identifying the optimal policy via this estimator is infeasible without a thorough exploration of the action space by the logging policy (Langford, Strehl, and Wortman 2008). More specifically, the IS estimator $\hat{V}_{\mathrm{IS}}(\pi)$ requires the following basic assumption for there to be any hope of asymptotic optimality:

**Assumption 1.** There exists a scalar $\epsilon > 0$ such that for all $x \in \mathbb{R}^d$ and $y \in [L]$, $\rho(y \mid x) > \epsilon$.

The IS estimator has high variance when $\pi$ assigns actions that are infrequent under $\rho$; hence a variety of regularization schemes have been developed, based on risk-upper-bound minimization, to control variance. Examples of upper bounds include empirical Bernstein concentration bounds (Swaminathan and Joachims 2015a) and various divergence-based bounds (Atan, Zame, and van der Schaar 2018; Wu and Wang 2018; Johansson, Shalit, and Sontag 2016; Lopez et al. 2020). Another common strategy for reducing the variance is to posit a model of the reward function and use it as a baseline in a doubly robust estimator (Dudik, Langford, and Li 2011; Su et al. 2019).

A recurrent issue with BLBF is that the policy may avoid actions in the training set when the rewards are not scaled properly; this is the phenomenon of propensity overfitting. Swaminathan and Joachims (2015b) tackled this problem via the self-normalized importance sampling estimator (SNIS), in which IS estimates are normalized by the average importance weight. SNIS is invariant to translation of the rewards and may be used as a safeguard against propensity overfitting. BanditNet (Joachims, Swaminathan, and de Rijke 2018) made this approach amenable to stochastic optimization by translating the reward distribution:

$$\hat{V}_{\mathrm{BN}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\rho(y_i \mid x_i)}\, [r_i - \lambda], \tag{2}$$

and selecting $\lambda$ over a small grid based on the SNIS estimate of the policy value.

Learning an XMC algorithm from bandit feedback requires offline learning from slates $Y$, where each element of the slate comes from a large action space. Swaminathan et al. (2017) propose a pseudo-inverse estimator for offline learning from combinatorial bandits. However, such an approach is intractable for large action spaces as it requires inverting a matrix whose size is linear in the number of actions. Another line of work focuses on offline evaluation and learning of semi-bandits for ranking (Li et al. 2018; Joachims, Swaminathan, and Schnabel 2017), but only with a small number of actions. In internet marketing applications, a partial matching strategy between instances and relevant actions has been applied to policy evaluation on real-world data (Wang et al. 2016; Li, Kim, and Zitouni 2015). More recently, Chen et al. (2019b) proposed a top-k off-policy correction method for a real-world recommender system. Their approach deals with millions of actions, although it treats label embeddings as given, whereas learning such embeddings is in general a hard problem for XMC.

## Bandit Feedback and Multi-label Classification

We consider a setting in which the algorithm (e.g., a policy for a recommendation system) observes side information $x \in \mathbb{R}^d$ and is allowed to output a subset $Y \subseteq [L]$ of the $L$ possible labels. Side information is sampled independently at each round from a distribution $P(x)$. We assume that the subset $Y$ has fixed size $|Y| = \ell$, which allows us to adopt the slate notation $Y = (y_1, \ldots, y_\ell)$.
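To make the estimators above concrete, here is a minimal NumPy sketch of the IS, SNIS, and BanditNet-style objectives of Eqs. (1)–(2). It assumes that the logged propensities $\rho(y_i \mid x_i)$ were recorded at data-collection time; the array names and helper functions are illustrative and not part of the paper's implementation.

```python
import numpy as np

def is_estimate(pi_probs, rho_probs, rewards):
    """Vanilla importance sampling estimate of V(pi), as in Eq. (1).

    pi_probs  : pi(y_i | x_i) under the target policy, shape (n,)
    rho_probs : logged propensities rho(y_i | x_i), shape (n,)
    rewards   : observed feedback r_i, shape (n,)
    """
    weights = pi_probs / rho_probs
    return np.mean(weights * rewards)

def snis_estimate(pi_probs, rho_probs, rewards):
    """Self-normalized IS (SNIS): normalize by the average importance weight."""
    weights = pi_probs / rho_probs
    return np.sum(weights * rewards) / np.sum(weights)

def banditnet_estimate(pi_probs, rho_probs, rewards, lam):
    """BanditNet-style translated objective, as in Eq. (2): IS applied to (r - lambda)."""
    weights = pi_probs / rho_probs
    return np.mean(weights * (rewards - lam))
```

In the BanditNet procedure described above, the translation $\lambda$ would be chosen over a small grid by re-evaluating the learned policy with the SNIS estimate.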
The algorithm observes noisy feedback for each label, $R = (r_1, \ldots, r_\ell)$, and we further assume that the joint distribution over $R$ decomposes as $P(R \mid x, Y) = \prod_{j=1}^{\ell} P(r_j \mid x, y_j)$. We denote the conditional mean reward as a function: $\delta(x, y) = \mathbb{E}[r \mid x, y]$. In the case of multi-label classification, this feedback can be formed with random variables indicating whether each individual label is inside the true set of labels for each datapoint (Gentile and Orabona 2014). More concretely, feedback may be formed from sale or click information (Chen et al. 2019b,c).

We are interested in optimizing a policy $\pi(Y \mid x)$ from offline data. Accessible data is sampled according to an existing algorithm, the logging policy $\rho(Y \mid x)$. We assume that both joint distributions over the slate decompose into an auto-regressive process. For example, for $\pi$ we assume:

$$\pi(Y \mid x) = \prod_{j=1}^{\ell} \pi(y_j \mid x, y_{1:j-1}). \tag{3}$$

Introducing this decomposition does not result in any loss of generality, as long as the action order is identifiable (otherwise, one would need to consider all possible orderings (Kool, van Hoof, and Welling 2020b)). This is a reasonable hypothesis because the order of the actions may also be logged as supplementary information. We now define the value of a policy $\pi$ as:

$$V(\pi) = \mathbb{E}_{P(x)}\, \mathbb{E}_{\pi(Y \mid x)}\, \mathbb{E}\left[\mathbf{1}^\top R \mid x, Y\right]. \tag{4}$$

In our setting, the reward decomposes as a sum of independent contributions of each individual action. The reward may in principle be generalized to be rank dependent, or to consider interactions between items (Gentile and Orabona 2014), but this is beyond the scope of this work.

A general approach for offline policy learning is to estimate $V(\pi)$ from logged data using importance sampling (Swaminathan and Joachims 2015a). As emphasized in Swaminathan et al. (2017), the combinatorial size of the action space, $\Omega(L^{\ell})$, may yield an impractical variance for importance sampling. This is particularly the case for XMC, where typical values of $L$ are at a minimum in the thousands. A natural strategy to improve over the IS estimator on the slate $Y$ is to exploit the additive reward decomposition in Eq. (4). Along with the factorization of the policy in Eq. (3), we may reformulate the policy value as:

$$V(\pi) = \mathbb{E}_{P(x)} \sum_{j=1}^{\ell} \mathbb{E}_{\pi(y_{1:j} \mid x)}\, \delta(x, y_j). \tag{5}$$

The benefit of this new decomposition is that instead of performing importance sampling on $Y$, we can now use $\ell$ IS estimators, each with a better bias-variance tradeoff.

Unbiased estimation of $V(\pi)$ in Eq. (5) via importance sampling still requires Assumption 1. The logging policy must therefore explore a large action space. However, most actions are unrelated to a given context, and deploying an online logging policy that satisfies Assumption 1 may yield a poor customer experience.

## Learning from eXtreme Bandit Feedback

We now explore alternative assumptions for the logging policy that may be more suitable to the setting of very large action spaces. We formalize the notion that most actions are irrelevant using the following assumption:

**Assumption 2** (Sparse feedback condition). The individual feedback random variable $r$ takes values in the bounded interval $[\underline{\delta}, \overline{\delta}]$. For all $x \in \mathbb{R}^d$, the label set $[L]$ can be partitioned as $[L] = \Psi(x) \cup \Psi_0(x)$ such that for all actions $y \in \Psi_0(x)$, the expected reward is minimal: $\delta(x, y) = \underline{\delta}$.

We refer to the function $\Psi$ as an action selector, as it maps a context to a set of relevant actions. Throughout the manuscript, we use the notation $\Lambda_0$ to refer to the pointwise set complement of any action selector $\Lambda$. Intuitively, we are interested in the case where $|\Psi(x)| \ll L$ for all $x$.
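To fix ideas, the following NumPy sketch computes the per-position IS estimate implied by Eqs. (3)–(5), assuming that per-position propensities $\rho(y_{i,j} \mid x_i, y_{i,1:j-1})$ were logged alongside the slates; the array names are illustrative and not from the paper's implementation. Under Assumption 2, most actions contribute only the minimal reward, which is precisely the structure that the selective estimator developed next exploits.

```python
import numpy as np

def slate_is_estimate(pi_probs, rho_probs, rewards):
    """Per-position IS estimate of the slate policy value, following Eq. (5).

    All arrays have shape (n, ell); entry (i, j) refers to the j-th action
    y_{i,j} of the logged slate for instance x_i:
      pi_probs  : pi(y_{i,j} | x_i, y_{i,1:j-1}) under the target policy
      rho_probs : the corresponding logged propensities
      rewards   : the per-action feedback r_{i,j}
    """
    weights = pi_probs / rho_probs                     # one IS weight per slate position
    return np.mean(np.sum(weights * rewards, axis=1))  # sum over positions, average over instances
```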
Assumption 2 is implicitly used in online marketing applications of offline policy evaluation, formulated as a partial matching between actions and instances (Wang et al. 2016; Li, Kim, and Zitouni 2015). Notably, this assumption can be viewed as a mixed-bandit feedback setting, in which we observe feedback for all of $\Psi_0(x)$ but only for one selected action inside of $\Psi(x)$. Under Assumption 2, the IS estimator will be unbiased for all logging policies that satisfy the following relaxed assumption:

**Assumption 3** ($\Psi$-overlap condition). There exists a scalar $\epsilon > 0$ such that for all $x \in \mathbb{R}^d$ and $y \in \Psi(x)$, $\rho(y \mid x) > \epsilon$.

Batch learning from bandit feedback may be possible under this assumption, as long as the logging policy explores a set of actions large enough to cover the actions from $\Psi$ but small enough to avoid exploring too many suboptimal actions. Furthermore, Assumption 2 also reveals the existence of $\Psi_0(x)$, a sufficient statistic for estimating the reward on the irrelevant actions. Appealing to Rao-Blackwellization (Casella and Robert 1996), we can incorporate this information to estimate each of the $\ell$ terms of Eq. (5) (e.g., in the case $\ell = 1$ and $\underline{\delta} = 0$):

$$V(\pi) = \mathbb{E}_{P(x)} \left[ \pi\left(\Psi(x) \mid x\right)\, \mathbb{E}_{\pi(y \mid x)}\left[\delta(x, y) \mid y \in \Psi(x)\right] \right]. \tag{6}$$

The decomposition in Eq. (6) suggests that when the action selector $\Psi$ is known, one can estimate $V(\pi)$ via importance sampling for the conditional expectation of the rewards with respect to the event $\{y \in \Psi(x)\}$. Intuitively, this means that one can modify the importance sampling scheme to only include a relevant subset of labels and ignore all the others. Without loss of generality, we assume that $\underline{\delta} = 0$ in the remainder of this manuscript.

In practice, the oracle action selector $\Psi$ is unknown and needs to be estimated from the data. It may be hard to infer the smallest $\Psi$ such that Assumption 2 is satisfied. Conversely, a trivial action selector including all actions is valid (it does include all relevant actions) but is ultimately impractical. As a flexible compromise, we will replace $\Psi$ in Eq. (6) by any action selector $\Phi$ and study the bias-variance tradeoff of the resulting plugin estimator.

Let $\rho$ be a logging policy with a large enough support to satisfy Assumption 3. Let $\Phi$ be an action selector such that $\Phi(x) \subseteq \operatorname{supp} \rho(\cdot \mid x)$ almost surely in $x$, where $\operatorname{supp}$ denotes the support of a probability distribution. The role of $\Phi$ is to prune out actions to maintain an optimal bias-variance tradeoff. In the case $\ell = 1$, the $\Phi$-selective importance sampling (sIS) estimator $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ for action selector $\Phi$ can be written as:

$$\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi\left(y_i \mid x_i, y \in \Phi(x_i)\right)}{\rho(y_i \mid x_i)}\, r_i. \tag{7}$$

Its bias and variance depend on how different the policy $\pi$ is from the logging policy $\rho$ (as in classical BLBF), but also on the degree of overlap of $\Phi$ with $\Psi$:

**Theorem 1** (Bias-variance tradeoff of selective importance sampling). Let $R$ and $\rho$ satisfy Assumptions 2 and 3. Let $\Phi$ be an action selector such that $\Phi(x) \subseteq \operatorname{supp} \rho(\cdot \mid x)$ almost surely in $x$. The bias of the sIS estimator satisfies:

$$\left| \mathbb{E}\left[\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)\right] - V(\pi) \right| \le \overline{\delta}\, \kappa(\pi, \Psi, \Phi), \tag{8}$$

where $\kappa(\pi, \Psi, \Phi) = \mathbb{E}_{P(x)}\, \pi\left(\Psi(x) \cap \Phi_0(x) \mid x\right)$ quantifies the overlap between the oracle action selector $\Psi$ and the proposed action selector $\Phi$, weighted by the policy $\pi$. The performance of the two estimators can be compared as follows:

$$\mathrm{MSE}\left[\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)\right] \le \mathrm{MSE}\left[\hat{V}_{\mathrm{IS}}(\pi)\right] + \frac{2\,\overline{\delta}^{\,2}\, \kappa(\pi, \Psi, \Phi)}{n} - \frac{\sigma^2}{n}\, \mathbb{E}_{P(x)}\left[\frac{\pi^2\left(\Phi_0(x) \mid x\right)}{\rho\left(\Phi_0(x) \mid x\right)}\right], \tag{9}$$

where $\sigma^2 = \inf_{(x, y) \in \mathbb{R}^d \times [L]} \mathbb{E}[r^2 \mid x, y]$. We provide the complete proof of this theorem in the appendix (see https://arxiv.org/abs/2009.12947 for supplementary information).
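As a concrete illustration of Eq. (7), the sketch below computes the sIS estimate when the target policy is parametrized by per-action scores and $\Phi(x_i)$ is given as a set of candidate indices (for instance, the top-p actions of the logger, as proposed below). The softmax parametrization and all names are assumptions made for this example, not the paper's implementation.

```python
import numpy as np

def sis_estimate(pi_scores, rho_probs, rewards, logged_actions, phi_sets):
    """Selective importance sampling estimate of V(pi), following Eq. (7) (ell = 1).

    pi_scores      : (n, L) unnormalized scores of the target policy
    rho_probs      : (n,)   logged propensities rho(y_i | x_i)
    rewards        : (n,)   observed feedback r_i
    logged_actions : (n,)   logged action indices y_i
    phi_sets       : list of n integer arrays, the selected actions Phi(x_i)
    """
    n = len(rewards)
    total = 0.0
    for i in range(n):
        phi = np.asarray(phi_sets[i])
        # If the logged action falls outside Phi(x_i), then
        # pi(y_i | x_i, y in Phi(x_i)) = 0 and the term contributes nothing.
        if logged_actions[i] not in phi:
            continue
        # Restrict pi to Phi(x_i): softmax over the selected scores only.
        scores = pi_scores[i, phi]
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        pi_restricted = probs[np.flatnonzero(phi == logged_actions[i])[0]]
        total += pi_restricted / rho_probs[i] * rewards[i]
    return total / n
```

The skipped branch reflects the fact that actions outside $\Phi(x_i)$ receive zero mass under the restricted policy, which is exactly how the sIS estimator ignores presumably irrelevant actions.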
As expected from Rao-Blackwellization, we see that if $\Phi$ completely covers $\Psi$ (i.e., for all $x \in \mathbb{R}^d$, $\Psi(x) \subseteq \Phi(x)$), then $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ is unbiased and has more favorable performance than $\hat{V}_{\mathrm{IS}}(\pi)$. Admittedly, Eq. (9) shows that both estimators have similar mean-squared error when $\pi$ puts no mass on potentially irrelevant actions $y \in \Phi_0(x)$. However, during the process of learning the optimal policy, or in the event of propensity overfitting, we expect $\pi$ to put non-zero mass on potentially irrelevant actions $y \in \Phi_0(x)$ with positive probability in $x$. For these reasons, we expect $\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi)$ to provide significant improvement over $\hat{V}_{\mathrm{IS}}(\pi)$ for policy learning.

Even though Eq. (9) provides insight into the performance of sIS, it unfortunately cannot be used directly to select $\Phi$. We instead propose a greedy heuristic based on a small number of candidate action selectors. For example, $\Phi_p(x)$ corresponds to the top-p labels for instance $x$ according to the logging policy. With this approach, the bias of the $\hat{V}^{\Phi_p}_{\mathrm{sIS}}(\pi)$ estimator is a decreasing function of $p$, as the overlap with $\Psi$ increases. Furthermore, the variance increases with $p$ as long as the added actions are irrelevant. In practice, we use a small grid search over $p \in \{10, 20, 50, 100\}$ and choose the optimal $p$ with the SNIS estimator, as in BanditNet. We believe this is a reasonable approach whenever the logging policy ranks the relevant items sufficiently high (e.g., within the top-p for p between 5 and 100), though it can certainly be improved.

Figure 1: Expected R@5 and CDF of the logging policy for the top-k actions on each XMC dataset. Exploration is limited to a subset of relevant actions.

## Policy Optimization for eXtreme Models

We apply sIS to each of the $\ell$ terms of the policy value from Eq. (5) in order to estimate $V(\pi)$ from bandit feedback, $(x_i, Y_i, R_i)_{i=1}^{n}$, and learn an optimal policy. As an additional step to reduce the variance, we prune the importance sampling weights of earlier slate actions, following Achiam et al. (2017):

$$\hat{V}^{\Phi}_{\mathrm{sIS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{\ell} \frac{\pi_\Phi\left(y_{i,j} \mid x_i, y_{i,1:j-1}\right)}{\rho\left(y_{i,j} \mid x_i, y_{i,1:j-1}\right)}\, r_{i,j}, \tag{10}$$

where $\pi_\Phi$ designates the distribution $\pi$ restricted to the set $\Phi(x)$ for every $x$. By Sklar's theorem (Sklar 1959), the joint distribution $\pi(Y \mid x)$ can be constructed from its one-dimensional marginals via a copula. In this work, we focus on the case of ordered sampling without replacement in order to respect an important design restriction: the slate $Y$ must not have redundant actions. For the $j$-th slate component, the relevant conditional probability is formed from the base marginal probabilities $\pi(y \mid x)$ as follows:

$$\pi_\Phi\left(y_j \mid x, y_{1:j-1}\right) = \frac{\pi(y_j \mid x)}{\sum_{y \in \Phi(x)} \pi(y \mid x) - \sum_{k=1}^{j-1} \pi(y_k \mid x)}.$$
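To illustrate this renormalization, here is a minimal sketch that computes the conditional propensities $\pi_\Phi(y_j \mid x, y_{1:j-1})$ of an ordered slate drawn without replacement from the selected set $\Phi(x)$ (e.g., the top-p actions of the logging policy). The function and variable names are illustrative assumptions, not part of the POXM codebase.

```python
import numpy as np

def poxm_slate_propensities(pi_marginals, slate, phi):
    """Conditional probabilities pi_Phi(y_j | x, y_{1:j-1}) for an ordered slate
    sampled without replacement from the selected action set Phi(x).

    pi_marginals : (L,) marginal policy probabilities pi(y | x)
    slate        : sequence of action indices (y_1, ..., y_ell), all inside phi
    phi          : integer array of selected actions Phi(x), e.g., top-p of the logger
    """
    remaining_mass = pi_marginals[np.asarray(phi)].sum()  # total mass of Phi(x)
    probs = []
    for y_j in slate:
        probs.append(pi_marginals[y_j] / remaining_mass)  # renormalize over not-yet-drawn actions
        remaining_mass -= pi_marginals[y_j]               # remove y_j before the next position
    return np.array(probs)
```

Dividing these conditional probabilities by the corresponding logged propensities and weighting by the per-action feedback $r_{i,j}$, as in Eq. (10), then yields the POXM training objective.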