# DEEP LEARNING WITH LOGGED BANDIT FEEDBACK

Published as a conference paper at ICLR 2018

Thorsten Joachims, Cornell University, tj@cs.cornell.edu
Adith Swaminathan, Microsoft Research, adswamin@microsoft.com
Maarten de Rijke, University of Amsterdam, derijke@uva.nl

ABSTRACT

We propose a new output layer for deep neural networks that permits the use of logged contextual bandit feedback for training. Such contextual bandit feedback can be available in huge quantities (e.g., logs of search engines, recommender systems) at little cost, opening up a path for training deep networks on orders of magnitude more data. To this effect, we propose a counterfactual risk minimization approach for training deep networks using an equivariant empirical risk estimator with variance regularization, BanditNet, and show how the resulting objective can be decomposed in a way that allows stochastic gradient descent training. We empirically demonstrate the effectiveness of the method by showing how deep networks, ResNets in particular, can be trained for object recognition without conventionally labeled images.

1 INTRODUCTION

Log data can be recorded from online systems such as search engines, recommender systems, or online stores at little cost and in huge quantities. For concreteness, consider the interaction logs of an ad-placement system for banner ads. Such logs typically contain a record of the input to the system (e.g., features describing the user, banner ad, and page), the action that was taken by the system (e.g., a specific banner ad that was placed), and the feedback furnished by the user (e.g., clicks on the ad, or monetary payoff). This feedback, however, provides only partial information, namely contextual-bandit feedback limited to the actions taken by the system. We do not get to see how the user would have responded if the system had chosen a different action (e.g., other ads or banner types). Thus, the feedback for all other actions the system could have taken is typically not known. This makes learning from log data fundamentally different from traditional supervised learning, where correct predictions and a loss function provide feedback for all actions.

In this paper, we propose a new output layer for deep neural networks that allows training on logged contextual bandit feedback. By circumventing the need for full-information feedback, our approach opens a new and intriguing pathway for acquiring knowledge at unprecedented scale, giving deep neural networks access to this abundant and ubiquitous type of data. Similarly, it enables the application of deep learning even in domains where manually labeling full-information feedback is not viable. In contrast to online learning with contextual bandit feedback (e.g., Williams, 1992; Agarwal et al., 2014), we perform batch learning from bandit feedback (BLBF) (Beygelzimer & Langford, 2009; Swaminathan & Joachims, 2015a;b;c), and the algorithm does not require the ability to make interactive interventions.

At the core of the new output layer for BLBF training of deep neural networks lies a counterfactual training objective that replaces the conventional cross-entropy objective. Our approach, called BanditNet, follows the view of a deep neural network as a stochastic policy. We propose a counterfactual risk minimization (CRM) objective that is based on an equivariant estimator of the true error that only requires propensity-logged contextual bandit feedback.
This makes our training objective fundamentally different from the conventional cross-entropy objective for supervised classification, which requires full-information feedback. Equivariance in our context means that the learning result is invariant to additive translations of the loss, and it is more formally defined in Section 3.2. To enable large-scale training, we show how this training objective can be decomposed to allow stochastic gradient descent (SGD) optimization.

In addition to the theoretical derivation of BanditNet, we present an empirical evaluation that verifies the applicability of the theoretical argument. It demonstrates how a deep neural network architecture can be trained in the BLBF setting. In particular, we derive a BanditNet version of ResNet (He et al., 2016) for visual object classification. Despite using potentially much cheaper data, we find that Bandit-ResNet can achieve the same classification performance, given sufficient amounts of contextual bandit feedback, as a ResNet trained with cross-entropy on conventionally (full-information) annotated images. To easily enable experimentation on other applications, we share an implementation of BanditNet (http://www.joachims.org/banditnet/).

2 RELATED WORK

Several recent works have studied weak supervision approaches for deep learning. Weak supervision has been used to pre-train good image features (Joulin et al., 2016) and for information retrieval (Dehghani et al., 2017). Closely related works have recently studied label corruption on CIFAR-10 (Zhang et al., 2016). However, all these approaches use weak supervision or corruption to construct noisy proxies for labels, and then proceed with traditional supervised training (using cross-entropy or mean-squared-error loss) on these proxies. In contrast, we work in the BLBF setting, which is an orthogonal data source, and we modify the loss functions optimized by deep nets to directly implement risk minimization.

Virtually all previous methods that can learn from logged bandit feedback employ some form of risk minimization principle (Vapnik, 1998) over a model class. Most of these methods (Beygelzimer & Langford, 2009; Bottou et al., 2013; Swaminathan & Joachims, 2015a) employ an inverse propensity scoring (IPS) estimator (Rosenbaum & Rubin, 1983) as the empirical risk and use stochastic gradient descent (SGD) to optimize the estimate over large datasets. Recently, the self-normalized estimator (Trotter & Tukey, 1956) has been shown to be a more suitable estimator for BLBF (Swaminathan & Joachims, 2015c). The self-normalized estimator, however, is not amenable to stochastic optimization and scales poorly with dataset size. In our work, we demonstrate how we can efficiently optimize a reformulation of the self-normalized estimator using SGD.

Previous BLBF methods focus on simple model classes: log-linear and exponential models (Swaminathan & Joachims, 2015a) or tree-based reductions (Beygelzimer & Langford, 2009). In contrast, we demonstrate how current deep learning models can be trained effectively via batch learning from bandit feedback (BLBF), and we compare against existing approaches on a benchmark dataset (Krizhevsky & Hinton, 2009). Our work, together with independent concurrent work (Serban et al., 2017), demonstrates success with off-policy variants of the REINFORCE (Williams, 1992) algorithm.
In particular, our algorithm employs a Lagrangian reformulation of the self-normalized estimator, and the objective and gradients of this reformulation are similar in spirit to the updates of the REINFORCE algorithm. This connection sheds new light on the role of the baseline hyper-parameters in REINFORCE: rather than simply reducing the variance of policy gradients, our work proposes a constructive algorithm for selecting the baseline in the off-policy setting, and it suggests that the baseline is instrumental in creating an equivariant counterfactual learning objective.

3 BANDITNET: COUNTERFACTUAL RISK MINIMIZATION FOR DEEP NETS

To formalize the problem of batch learning from bandit feedback for deep neural networks, consider the contextual bandit setting where a policy $\pi$ takes as input $x \in X$ and outputs an action $y \in Y$. In response, we observe the loss (or payoff) $\delta(x, y)$ of the selected action $y$, where $\delta(x, y)$ is an arbitrary (unknown) function that maps actions and contexts to a bounded real number. For example, in display advertising, the context $x$ could be a representation of the user and page, $y$ denotes the displayed ad, and $\delta(x, y)$ could be the monetary payoff from placing the ad (zero if no click, or a dollar amount if clicked). The contexts are drawn i.i.d. from a fixed but unknown distribution $\Pr(X)$.

In this paper, a (deep) neural network is viewed as implementing a stochastic policy $\pi$. We can think of such a network policy as a conditional distribution $\pi_w(Y \mid x)$ over actions $y \in Y$, where $w$ are the parameters of the network. The network makes a prediction by sampling an action $y \sim \pi_w(Y \mid x)$, where deterministic $\pi_w(Y \mid x)$ are a special case. As we will show as part of the empirical evaluation, many existing network architectures are compatible with this stochastic-policy view. For example, any network $f_w(x, y)$ with a softmax output layer

$$\pi_w(y \mid x) = \frac{\exp(f_w(x, y))}{\sum_{y' \in Y} \exp(f_w(x, y'))} \quad (1)$$

can be re-purposed as a conditional distribution from which one can sample actions, instead of interpreting it as a conditional likelihood as in full-information supervised learning. The goal of learning is to find a policy $\pi_w$ that minimizes the risk (analogously: maximizes the payoff) defined as

$$R(\pi_w) = \mathbb{E}_{x \sim \Pr(X)} \, \mathbb{E}_{y \sim \pi_w(Y \mid x)}\left[\delta(x, y)\right]. \quad (2)$$

Any data collected from an interactive system depends on the policy $\pi_0$ that was running on the system at the time, determining which actions $y$ and losses $\delta(x, y)$ are observed. We call $\pi_0$ the logging policy, and for simplicity we assume that it is stationary. The logged data $D$ consists of $n$ tuples of observed context $x_i \sim \Pr(X)$, action $y_i \sim \pi_0(Y \mid x_i)$ taken by the logging policy, the probability of this action $p_i \equiv \pi_0(y_i \mid x_i)$, which we call the propensity, and the received loss $\delta_i \equiv \delta(x_i, y_i)$:

$$D = \left[(x_1, y_1, p_1, \delta_1), \ldots, (x_n, y_n, p_n, \delta_n)\right]. \quad (3)$$

We will now discuss how we can use this logged contextual bandit feedback to train a neural network policy $\pi_w(Y \mid x)$ that has low risk $R(\pi_w)$.

3.1 COUNTERFACTUAL RISK MINIMIZATION

While conditional maximum likelihood is a standard approach for training deep neural networks, it requires that the loss $\delta(x_i, y)$ is known for all $y \in Y$. However, we only know $\delta(x_i, y_i)$ for the particular $y_i$ chosen by the logging policy $\pi_0$. We therefore take a different approach following Langford et al. (2008) and Swaminathan & Joachims (2015b), where we directly minimize an empirical risk that can be estimated from the logged bandit data $D$. A minimal sketch of the stochastic-policy view and the logged-data format is given below.
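To make the setup concrete, the following minimal PyTorch sketch illustrates how a scoring network with a softmax output layer can be used as a stochastic policy (Eq. (1)) and how a logged tuple is structured. Names such as `SoftmaxPolicy` and `scoring_net` are ours, chosen only for illustration, and are not part of the paper's released implementation.

```python
import torch
import torch.nn.functional as F

class SoftmaxPolicy(torch.nn.Module):
    """Wraps any scoring network f_w(x, .) as a stochastic policy pi_w(Y | x)."""
    def __init__(self, scoring_net):
        super().__init__()
        self.scoring_net = scoring_net  # maps a batch of contexts to one score per action

    def action_probs(self, x):
        return F.softmax(self.scoring_net(x), dim=-1)  # Eq. (1)

    def sample(self, x):
        probs = self.action_probs(x)
        y = torch.multinomial(probs, num_samples=1).squeeze(-1)       # y ~ pi_w(Y | x)
        propensity = probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)    # pi_w(y | x)
        return y, propensity

# Logged bandit data D: one tuple (x_i, y_i, p_i, delta_i) per logged interaction,
# where p_i is the propensity pi_0(y_i | x_i) recorded by the logging policy (Eq. (3)).
```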
This approach is called counterfactual risk minimization (CRM) (Swaminathan & Joachims, 2015b), since for any policy $\pi_w$ it addresses the counterfactual question of how well that policy would have performed, had it been used instead of $\pi_0$.

While minimizing an empirical risk as an estimate of the true risk $R(\pi_w)$ is a common principle in machine learning (Vapnik, 1998), getting a reliable estimate based on the training data $D$ produced by $\pi_0$ is not straightforward. The logged bandit data $D$ is not only incomplete (i.e., we lack knowledge of $\delta(x_i, y)$ for many $y \in Y$ that $\pi_w$ would have chosen differently from $\pi_0$), but it is also biased (i.e., the actions preferred by $\pi_0$ are over-represented). This is why existing work on training deep neural networks either requires full knowledge of the loss function, or requires the ability to interactively draw new samples $y_i \sim \pi_w(Y \mid x_i)$ for any new policy $\pi_w$. In our setting we can do neither: we have a fixed dataset $D$ that is limited to samples from $\pi_0$. To nevertheless get a useful estimate of the empirical risk, we explicitly address both the bias and the variance of the risk estimate.

To correct for sampling bias and handle missing data, we approach the risk estimation problem using importance sampling and thus remove the distribution mismatch between $\pi_0$ and $\pi_w$ (Langford et al., 2008; Owen, 2013; Swaminathan & Joachims, 2015b):

$$R(\pi_w) = \mathbb{E}_{x \sim \Pr(X)} \, \mathbb{E}_{y \sim \pi_w(Y \mid x)}\left[\delta(x, y)\right] = \mathbb{E}_{x \sim \Pr(X)} \, \mathbb{E}_{y \sim \pi_0(Y \mid x)}\left[\delta(x, y) \frac{\pi_w(y \mid x)}{\pi_0(y \mid x)}\right]. \quad (4)$$

The latter expectation can be estimated on a sample $D$ of $n$ bandit-feedback examples using the following IPS estimator (Langford et al., 2008; Owen, 2013; Swaminathan & Joachims, 2015b):

$$\hat{R}_{\text{IPS}}(\pi_w) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}. \quad (5)$$

This IPS estimator is unbiased and has bounded variance if the logging policy has full support in the sense that $\forall x, y: \pi_0(y \mid x) \geq \epsilon > 0$. While at first glance it may seem natural to directly train the parameters $w$ of a network to optimize this IPS estimate as an empirical risk, there are at least three obstacles to overcome. First, we will argue in the following section that the naive IPS estimator's lack of equivariance makes it sub-optimal for use as an empirical risk for high-capacity models. Second, we have to find an efficient algorithm for minimizing the empirical risk, especially making it accessible to stochastic gradient descent (SGD) optimization. And, finally, we are faced with an unusual type of bias-variance trade-off, since the distance from the exploration policy impacts the variance of the empirical risk estimate for different $w$.
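For reference, the IPS estimate of Eq. (5) itself is straightforward to compute; the difficulties listed above concern its use as a training objective rather than its evaluation. The following sketch continues the earlier `SoftmaxPolicy` snippet; the function name and tensor layout are ours.

```python
import torch

def ips_risk(policy, x, y, propensity, delta):
    """Naive IPS estimate of Eq. (5) on a batch of logged tuples (x_i, y_i, p_i, delta_i)."""
    probs = policy.action_probs(x)                            # pi_w(. | x_i)
    pi_w = probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)      # pi_w(y_i | x_i)
    weights = pi_w / propensity                               # importance weights
    return (delta * weights).mean()
```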
3.2 EQUIVARIANT COUNTERFACTUAL RISK MINIMIZATION

While Eq. (5) provides an unbiased empirical risk estimate, it exhibits the possibly severe problem of propensity overfitting when directly optimized within a learning algorithm (Swaminathan & Joachims, 2015c). It is a problem of overfitting to the choices $y_i$ of the logging policy, and it occurs on top of the normal overfitting to the $\delta_i$. Propensity overfitting is linked to the lack of equivariance of the IPS estimator. While the minimizer of the true risk $R(\pi_w)$ does not change when translating the loss by a constant (i.e., $\forall x, y: \delta(x, y) + c$), by linearity of expectation,

$$c + \min_w \mathbb{E}_{x \sim \Pr(X)} \, \mathbb{E}_{y \sim \pi_w(Y \mid x)}\left[\delta(x, y)\right] = \min_w \mathbb{E}_{x \sim \Pr(X)} \, \mathbb{E}_{y \sim \pi_w(Y \mid x)}\left[\delta(x, y) + c\right], \quad (6)$$

the minimizer of the IPS-estimated empirical risk $\hat{R}_{\text{IPS}}(\pi_w)$ can change dramatically for finite training samples, and

$$c + \min_w \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \;\neq\; \min_w \frac{1}{n} \sum_{i=1}^{n} (\delta_i + c) \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}. \quad (7)$$

Intuitively, when $c$ shifts the losses to be positive numbers, policies $\pi_w$ that put as little probability mass as possible on the observed actions have low risk estimates. If $c$ shifts the losses to the negative range, the exact opposite is the case. For either choice of $c$, the policy eventually selected by the learning algorithm can be dominated by where $\pi_0$ happens to sample data, not by which actions have low loss.

The following self-normalized IPS estimator (SNIPS) addresses the propensity overfitting problem (Swaminathan & Joachims, 2015c) and is equivariant:

$$\hat{R}_{\text{SNIPS}}(\pi_w) = \frac{\frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}}{\frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}}. \quad (8)$$

In addition to being equivariant, this estimate can also have substantially lower variance than Eq. (5), since it exploits the knowledge that the denominator

$$S = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \quad (9)$$

always has expectation 1:

$$\int\!\!\int \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \, \pi_0(y_i \mid x_i) \Pr(x_i) \, dy_i \, dx_i = \int 1 \cdot \Pr(x_i) \, dx_i = 1. \quad (10)$$

The SNIPS estimator uses this knowledge as a multiplicative control variate (Swaminathan & Joachims, 2015c). While the SNIPS estimator has some bias, this bias asymptotically vanishes at a rate of $O(\frac{1}{n})$ (Hesterberg, 1995). Using the SNIPS estimator as our empirical risk implies that we need to solve the following optimization problem for training:

$$\hat{w} = \underset{w \in \Re^N}{\arg\min} \; \hat{R}_{\text{SNIPS}}(\pi_w). \quad (11)$$

Thus, we now turn to designing efficient optimization methods for this training objective.
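For completeness, evaluating the SNIPS estimate of Eq. (8) together with its control variate (Eq. (9)) is again straightforward; the difficulty addressed next is optimizing it with SGD. The sketch below continues the earlier snippets and uses our own function name.

```python
import torch

def snips_risk(policy, x, y, propensity, delta):
    """SNIPS estimate of Eq. (8); also returns the control variate S of Eq. (9).

    Intended for full-dataset (or evaluation-time) use: Appendix C shows that this
    ratio cannot be optimized with unbiased minibatch gradients.
    """
    probs = policy.action_probs(x)
    pi_w = probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    weights = pi_w / propensity
    control_variate = weights.mean()                 # concentrates around 1
    return (delta * weights).mean() / control_variate, control_variate
```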
3.3 TRAINING ALGORITHM

Unfortunately, the training objective in Eq. (11) does not permit stochastic gradient descent (SGD) optimization in the given form (see Appendix C), which presents an obstacle to efficient and effective training of the network. To remedy this problem, we will now develop a reformulation that retains both the desirable properties of the SNIPS estimator and the ability to reuse established SGD training algorithms.

Instead of optimizing a ratio as in Eq. (11), we will reformulate the problem into a series of constrained optimization problems. Let $\hat{w}$ be a solution of Eq. (11), and at that solution let $S^*$ be the value of the control variate for $\pi_{\hat{w}}$ as defined in Eq. (9). For simplicity, assume that the minimizer $\hat{w}$ is unique. If we knew $S^*$, we could equivalently solve the following constrained optimization problem:

$$\hat{w} = \underset{w \in \Re^N}{\arg\min} \; \frac{1}{n}\sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \quad \text{subject to} \quad \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = S^*. \quad (12)$$

Of course, we do not actually know $S^*$. However, we can do a grid search in $\{S_1, \ldots, S_k\}$ for $S^*$ and solve the above optimization problem for each value, giving us a set of solutions $\{\hat{w}_1, \ldots, \hat{w}_k\}$. Note that $S$ is just a one-dimensional quantity, and that the sensible range we need to search for $S^*$ concentrates around 1 as $n$ increases (see Appendix B). To find the overall (approximate) $\hat{w}$ that optimizes the SNIPS estimate, we then simply take the minimum:

$$\hat{w} = \underset{(\hat{w}_j, S_j)}{\arg\min} \; \frac{\frac{1}{n}\sum_{i=1}^{n} \delta_i \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}}{S_j}. \quad (13)$$

This still leaves the question of how to solve each equality-constrained risk minimization problem using SGD. Fortunately, we can perform an equivalent search for $S^*$ without constrained optimization. To this effect, consider the Lagrangian of the constrained optimization problem in Eq. (12) with $S_j$ in the constraint instead of $S^*$:

$$L(w, \lambda) = \frac{1}{n}\sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} - \lambda \left(\frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} - S_j\right) = \frac{1}{n}\sum_{i=1}^{n} (\delta_i - \lambda) \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} + \lambda S_j.$$

The variable $\lambda$ is an unconstrained Lagrange multiplier. To find the minimum of Eq. (12) for a particular $S_j$, we need to minimize $L(w, \lambda)$ w.r.t. $w$ and maximize it w.r.t. $\lambda$:

$$\hat{w}_j = \underset{w \in \Re^N}{\arg\min} \; \max_{\lambda} \; L(w, \lambda). \quad (14)$$

However, we are not actually interested in the constrained solution of Eq. (12) for any specific $S_j$. We are merely interested in exploring a certain range $S \in [S_1, S_k]$ in our search for $S^*$. So, we can reverse the roles of $\lambda$ and $S$, where we keep $\lambda$ fixed and determine the corresponding $S$ in hindsight. In particular, for each $\lambda_j \in \{\lambda_1, \ldots, \lambda_k\}$ we solve

$$\hat{w}_j = \underset{w \in \Re^N}{\arg\min} \; L(w, \lambda_j). \quad (15)$$

Note that the solution $\hat{w}_j$ does not depend on $S_j$, so we can compute $S_j$ after we have found the minimum $\hat{w}_j$. In particular, we can determine the $S_j$ that corresponds to the given $\lambda_j$ using the necessary optimality conditions,

$$\frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n} (\delta_i - \lambda_j)\, \frac{1}{\pi_0(y_i \mid x_i)}\, \frac{\partial \pi_w(y_i \mid x_i)}{\partial w} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \lambda_j} = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} - S_j = 0, \quad (16)$$

by solving the second equality of Eq. (16). In this way, the sequence of $\lambda_j$ produces solutions $\hat{w}_j$ corresponding to a sequence of $\{S_1, \ldots, S_k\}$. To identify the sensible range of $S$ to explore, we can make use of the fact that Eq. (9) concentrates around its expectation of 1 for each $\pi_w$ as $n$ increases. Theorem 2 in Appendix B provides a characterization of how large the range needs to be. Furthermore, we can steer the exploration of $S$ via $\lambda$, since the resulting $S$ changes monotonically with $\lambda$:

$$(\lambda_a < \lambda_b) \;\text{and}\; (\hat{w}_a, \hat{w}_b \text{ are not equivalent optima of Eq. (15)}) \;\Rightarrow\; (S_a < S_b). \quad (17)$$

A more formal statement and proof are given as Theorem 1 in Appendix A. In the simplest form, one could therefore perform a grid search on $\lambda$, but more sophisticated search methods are possible too.

After this reformulation, the key computational problem is finding the solution of Eq. (15) for each $\lambda_j$. Note that in this unconstrained optimization problem, the Lagrange multiplier effectively translates the loss values in the conventional IPS estimate:

$$\hat{w}_j = \underset{w}{\arg\min} \; \frac{1}{n}\sum_{i=1}^{n} (\delta_i - \lambda_j) \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = \underset{w}{\arg\min} \; \hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w). \quad (18)$$

Figure 1: Learning curve of BanditNet. The x-axis is the number of bandit-feedback examples used for training, the y-axis is the test error rate; the compared curves are Bandit-ResNet and a full-information ResNet trained with cross-entropy. Given enough bandit feedback, Bandit-ResNet converges to the skyline performance.

We denote this $\lambda$-translated IPS estimate with $\hat{R}^{\lambda}_{\text{IPS}}(\pi_w)$. Note that each such optimization problem is now in the form required for SGD, where we merely weight the derivative of the stochastic policy network $\pi_w(y \mid x)$ by a factor $(\delta_i - \lambda_j)/\pi_0(y_i \mid x_i)$. This opens the door for re-purposing existing fast methods for training deep neural networks, and we demonstrate experimentally that SGD with momentum is able to optimize our objective scalably.

Similar loss translations have previously been used in on-policy reinforcement learning (Williams, 1992), where they are motivated as minimizing the variance of the gradient estimate (Weaver & Tao, 2001; Greensmith et al., 2004). However, the situation is different in the off-policy setting we consider. First, we cannot sample new roll-outs from the current policy under consideration, which means we cannot use the standard variance-optimal estimator used in REINFORCE. Second, we tried using the (estimated) expected loss of the learned policy as the baseline, as is commonly done in REINFORCE, but we will see in the experiment section that this value for $\lambda$ is far from optimal. Finally, it is unclear whether gradient variance, as opposed to variance of the ERM objective, is really the key issue in batch learning from bandit feedback. In this sense, our approach provides a rigorous justification and a constructive way of picking the value of $\lambda$ in the off-policy setting, namely the value for which the corresponding $S_j$ minimizes Eq. (13). In addition, one can further add variance regularization (Swaminathan & Joachims, 2015b) to improve the robustness of the risk estimate in Eq. (18) (see Appendix D for details).
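The λ-translated IPS objective of Eq. (18) is what makes standard minibatch SGD applicable. The following sketch, continuing the earlier snippets, shows the per-batch loss one would backpropagate through, together with an outer loop over candidate λ values whose results would then be compared via Eq. (13). It is a minimal illustration under our own assumptions (a `loader` yielding `(x, y, propensity, delta)` batches and a `make_policy` factory), not the authors' released implementation.

```python
import torch

def banditnet_loss(policy, x, y, propensity, delta, lam):
    """Lambda-translated IPS objective of Eq. (18) on a minibatch of logged tuples."""
    probs = policy.action_probs(x)
    pi_w = probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)      # pi_w(y_i | x_i)
    return ((delta - lam) * pi_w / propensity).mean()

def train_banditnet(make_policy, loader, lambdas, epochs=30, lr=0.1):
    """Train one policy per candidate lambda; the caller then picks the one whose
    full-data SNIPS estimate (Eq. (13)) is smallest."""
    candidates = []
    for lam in lambdas:
        policy = make_policy()
        opt = torch.optim.SGD(policy.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for x, y, propensity, delta in loader:
                opt.zero_grad()
                banditnet_loss(policy, x, y, propensity, delta, lam).backward()
                opt.step()
        candidates.append((lam, policy))
    return candidates
```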
4 EMPIRICAL EVALUATION

The empirical evaluation is designed to address three key questions. First, it verifies that deep models can indeed be trained effectively using our approach. Second, we compare how the same deep neural network architecture performs under different types of data and training objectives, in particular conventional cross-entropy training using full-information data. In order to be able to make this comparison, we focus on synthetic contextual bandit feedback for training BanditNet that is sampled from the full-information labels. Third, we explore the effectiveness and fidelity of the approximate SNIPS objective.

For the following BanditNet experiments, we adapted the ResNet20 architecture (He et al., 2016) by replacing the conventional cross-entropy objective with our counterfactual risk minimization objective. We evaluate the performance of this Bandit-ResNet on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, where we can compare training on full-information data with training on bandit feedback, and where there is a full-information test set for estimating prediction error.

To simulate logged bandit feedback, we perform the standard supervised-to-bandit conversion (Beygelzimer & Langford, 2009), as sketched in the snippet below. We use a hand-coded logging policy that achieves about 49% error rate on the training data, which is substantially worse than what we hope to achieve after learning. This emulates a real-world scenario where one would bootstrap an operational system with a mediocre policy (e.g., derived from a small hand-labeled dataset) and then deploy it to log bandit feedback. This logged bandit feedback data is then used to train the Bandit-ResNet. We evaluate the trained model using the error rate on the held-out (full-information) test set. We compare this model against the skyline of training a conventional ResNet using the full-information feedback from the 50,000 training examples.
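The supervised-to-bandit conversion can be simulated along the following lines. This is a sketch under our own assumptions about data handling: `logging_policy_probs` stands in for any stochastic logging policy (e.g., one derived from a weak hand-coded classifier), and the 0/1 loss of the logged action plays the role of δ.

```python
import torch

def supervised_to_bandit(x, true_labels, logging_policy_probs):
    """Simulate logged bandit feedback from a fully labeled dataset.

    logging_policy_probs: tensor [batch, num_classes] with pi_0(. | x_i).
    Returns (x_i, y_i, p_i, delta_i), where delta_i is the 0/1 loss of the logged action.
    """
    y = torch.multinomial(logging_policy_probs, num_samples=1).squeeze(-1)   # y_i ~ pi_0
    propensity = logging_policy_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    delta = (y != true_labels).float()   # loss is observed only for the logged action
    return x, y, propensity, delta
```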
The only parameter we adjusted for Bandit-Res Net is lowering the learning rate to 0.1 and slowing down the learning rate schedule. The latter was done to avoid confounding the Bandit-Res Net results with potential effects from early stopping, and we report test performance after 1000 training epochs, which is well beyond the point of convergence in all runs. Learning curve. Figure 1 shows the prediction error of the Bandit-Res Net as more and more bandit feedback is provided for training. First, even though the logging policy that generated the bandit feedback has an error rate of 49%, the prediction error of the policy learned by the Bandit-Res Net is substantially better. It is between 13% and 8.2%, depending on the amount of training data. Second, the horizontal line is the performance of a conventional Res Net trained on the full-information training set. It serves as a skyline of how good Bandit-Res Net could possibly get given that it is sampling bandit feedback from the same full-information training set. The learning curve in Figure 1 shows that Bandit-Res Net converges to the skyline performance given enough bandit feedback training data, providing strong evidence that our training objective and method can effectively extract the available information provided in the bandit feedback. Effect of the choice of Lagrange multiplier. The left-hand plot in Figure 2 shows the test error of solutions ˆwj depending on the value of the Lagrange multiplier λj used during training. It shows that λ in the range 0.8 to 1.0 results in good prediction performance, but that performance degrades outside this area. The SNIPS estimates in the right-hand plot of Figure 2 roughly reflects this optimal range, given empirical support for both the SNIPS estimator and the use of Eq. (13). We also explored two other methods for selecting λ. First, we used the straightforward IPS estimator as the objective (i.e., λ = 0), which leads to prediction performance worse than that of the logging policy (not shown). Second, we tried using the (estimated) expected loss of the learned policy as the baseline as is commonly done in REINFORCE. As Figure 1 shows, it is between 0.130 and 0.083 for the best policies we found. Figure 2 (left) shows that these baseline values are well outside of the optimum range. Also shown in the right-hand plot of Figure 2 is the value of the control variate in the denominator of the SNIPS estimate. As expected, it increases from below 1 to above 1 as λ is increased. Note that large deviations of the control variate from 1 are a sign of propensity overfitting (Swaminathan & Joachims, 2015c). In particular, for all solutions ˆwj the estimated standard error of the control variate Sj was less than 0.013, meaning that the normal 95% confidence interval for each Sj is contained in [0.974, 1.026]. If we see a ˆwj with control variate Sj outside this range, we should be suspicious of propensity overfitting to the choices of the logging policy and discard this solution. Published as a conference paper at ICLR 2018 5 CONCLUSIONS AND FUTURE WORK We proposed a new output layer for deep neural networks that enables the use of logged contextual bandit feedback for training. This type of feedback is abundant and ubiquitous in the form of interaction logs from autonomous systems, opening up the possibility of training deep neural networks on unprecedented amounts of data. In principle, this new output layer can replace the conventional cross-entropy layer for any network architecture. 
5 CONCLUSIONS AND FUTURE WORK

We proposed a new output layer for deep neural networks that enables the use of logged contextual bandit feedback for training. This type of feedback is abundant and ubiquitous in the form of interaction logs from autonomous systems, opening up the possibility of training deep neural networks on unprecedented amounts of data. In principle, this new output layer can replace the conventional cross-entropy layer for any network architecture. We provide a rigorous derivation of the training objective, linking it to an equivariant counterfactual risk estimator that enables counterfactual risk minimization. Most importantly, we show how the resulting training objective can be decomposed and reformulated to make it feasible for SGD training. We find that the BanditNet approach applied to the ResNet architecture achieves predictive accuracy comparable to conventional full-information training for visual object recognition.

The paper opens up several directions for future work. First, it enables many new applications where contextual bandit feedback is readily available. Second, in settings where it is infeasible to log propensity-scored data, it would be interesting to combine BanditNet with propensity estimation techniques. Third, there may be improvements to BanditNet, such as smarter search techniques for S, more efficient counterfactual estimators beyond SNIPS, and the ability to handle continuous outputs.

ACKNOWLEDGMENTS

This research was supported in part by NSF Award IIS-1615706, a gift from Bloomberg, the Criteo Faculty Research Award program, and the Netherlands Organisation for Scientific Research (NWO) under project nr. 612.001.116. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

REFERENCES

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML, 2014.

A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In KDD, pp. 129-138, 2009.

L. Bottou, J. Peters, J. Quinonero-Candela, D. Charles, M. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 14:3207-3260, 2013.

M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft. Neural ranking models with weak supervision. In SIGIR, pp. 65-74, 2017.

E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 5:1471-1530, 2004.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

T. Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37:185-194, 1995.

A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, pp. 67-84, 2016.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

J. Langford, A. Strehl, and J. Wortman. Exploration scavenging. In ICML, pp. 528-535, 2008.

A. B. Owen. Monte Carlo theory, methods and examples. 2013.

P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-55, 1983.

I. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, M. Pieper, S. Chandar, N. R. Ke, S. Mudumba, A. de Brebisson, J. M. R. Sotelo, D. Suhubdy, V. Michalski, A. Nguyen, J. Pineau, and Y. Bengio. A deep reinforcement learning chatbot. arXiv e-prints, September 2017.

A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pp. 814-823, 2015a.

A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 16:1731-1755, 2015b.
A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In NIPS, 2015c.

H. F. Trotter and J. W. Tukey. Conditional Monte Carlo for normal samples. In Symposium on Monte Carlo Methods, pp. 64-79, 1956.

V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.

L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. In UAI, pp. 538-545, 2001.

R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), May 1992.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.

A APPENDIX: STEERING THE EXPLORATION OF S THROUGH λ

Theorem 1. Let $\lambda_a < \lambda_b$ and let

$$\hat{w}_a = \underset{w}{\arg\min} \; \hat{R}^{\lambda_a}_{\text{IPS}}(\pi_w), \quad (19)$$
$$\hat{w}_b = \underset{w}{\arg\min} \; \hat{R}^{\lambda_b}_{\text{IPS}}(\pi_w). \quad (20)$$

If the optima $\hat{w}_a$ and $\hat{w}_b$ are not equivalent in the sense that $\hat{R}^{\lambda_a}_{\text{IPS}}(\pi_{\hat{w}_a}) \neq \hat{R}^{\lambda_a}_{\text{IPS}}(\pi_{\hat{w}_b})$ and $\hat{R}^{\lambda_b}_{\text{IPS}}(\pi_{\hat{w}_a}) \neq \hat{R}^{\lambda_b}_{\text{IPS}}(\pi_{\hat{w}_b})$, then

$$S_a < S_b. \quad (21)$$

Proof. Abbreviate $f(w) = \frac{1}{n}\sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$ and $g(w) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$. Then

$$\hat{R}^{\lambda}_{\text{IPS}}(\pi_w) = f(w) - \lambda\, g(w), \quad (22)$$

where $g(w)$ corresponds to the value of the control variate $S$. Since $\hat{w}_a$ and $\hat{w}_b$ are not equivalent optima, we know that

$$f(\hat{w}_a) - \lambda_a\, g(\hat{w}_a) < f(\hat{w}_b) - \lambda_a\, g(\hat{w}_b), \quad (23)$$
$$f(\hat{w}_b) - \lambda_b\, g(\hat{w}_b) < f(\hat{w}_a) - \lambda_b\, g(\hat{w}_a). \quad (24)$$

Adding the two inequalities and solving implies that

$$f(\hat{w}_a) - \lambda_a\, g(\hat{w}_a) + f(\hat{w}_b) - \lambda_b\, g(\hat{w}_b) < f(\hat{w}_b) - \lambda_a\, g(\hat{w}_b) + f(\hat{w}_a) - \lambda_b\, g(\hat{w}_a) \quad (25)$$
$$\lambda_a\, g(\hat{w}_a) + \lambda_b\, g(\hat{w}_b) > \lambda_a\, g(\hat{w}_b) + \lambda_b\, g(\hat{w}_a) \quad (26)$$
$$(\lambda_b - \lambda_a)\, g(\hat{w}_b) > (\lambda_b - \lambda_a)\, g(\hat{w}_a) \quad (27)$$
$$g(\hat{w}_b) > g(\hat{w}_a) \quad (28)$$
$$S_b > S_a. \quad (29)$$

B APPENDIX: CHARACTERIZING THE RANGE OF S TO EXPLORE

Theorem 2. Let $p \leq \pi_0(y \mid x)$ be a lower bound on the propensity for the logging policy. Then constraining the solution of Eq. (11) to the $w$ with control variate $S \in [1 - \epsilon, 1 + \epsilon]$ for a training set of size $n$ will not exclude the minimizer of the true risk $w^* = \arg\min_{w \in W} R(\pi_w)$ in the policy space $W$ with probability at least

$$1 - 2\exp\left(-2 n \epsilon^2 p^2\right). \quad (30)$$

Proof. For the optimal $w^*$, let

$$S^* = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_{w^*}(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} \quad (31)$$

be the control variate in the denominator of the SNIPS estimator. $S^*$ is a random variable that is an average of $n$ bounded random variables, each lying between 0 and

$$\max_{x,y} \frac{\pi_{w^*}(y \mid x)}{\pi_0(y \mid x)} \leq \frac{1}{p}.$$

We can bound the probability that the control variate $S^*$ of the optimum $w^*$ lies outside of $[1 - \epsilon, 1 + \epsilon]$ via Hoeffding's inequality:

$$P(|S^* - 1| \geq \epsilon) \leq 2\exp\left(\frac{-2 n^2 \epsilon^2}{n\,(1/p)^2}\right) = 2\exp\left(-2 n \epsilon^2 p^2\right). \quad (34)$$

The same argument applies to any individual policy $\pi_w$, not just $w^*$. Note, however, that it can still be highly likely that at least one policy $\pi_w$ with $w \in W$ shows a large deviation in the control variate for high-capacity $W$, which can lead to propensity overfitting when using the naive IPS estimator.
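As a concrete illustration of this bound, suppose the logging policy's propensities are bounded below by $p = 0.1$ and we allow a slack of $\epsilon = 0.1$ on a training set of $n = 50{,}000$ examples; these numbers are ours, chosen only for illustration, and do not correspond to the experiments in the paper. Plugging into Eq. (30),

$$P(|S^* - 1| \geq \epsilon) \leq 2\exp\left(-2 \cdot 50000 \cdot 0.1^2 \cdot 0.1^2\right) = 2\exp(-10) \approx 9.1 \times 10^{-5},$$

so restricting the search to $S \in [0.9, 1.1]$ would exclude the true-risk minimizer only with negligible probability.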
C APPENDIX: WHY DIRECT STOCHASTIC OPTIMIZATION OF RATIO ESTIMATORS IS NOT POSSIBLE

Suppose we have a dataset of $n$ BLBF samples $D = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\}$, where each instance is an i.i.d. sample from the data-generating distribution. In the sequel we will consider two datasets of $n+1$ samples, $D' = D \cup \{(x', y', \delta', p')\}$ and $D'' = D \cup \{(x'', y'', \delta'', p'')\}$, where $(x', y', \delta', p') \neq (x'', y'', \delta'', p'')$ and $(x', y', \delta', p'), (x'', y'', \delta'', p'') \notin D$. For notational convenience, let

$$f_i := \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}, \quad \nabla f_i := \nabla_w f_i; \qquad g_i := \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}, \quad \nabla g_i := \nabla_w g_i.$$

First consider the vanilla IPS risk estimate of Eq. (5):

$$\hat{R}_{\text{IPS}}(\pi_w) = \frac{1}{n}\sum_{i=1}^{n} \delta_i \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} = \frac{1}{n}\sum_{i=1}^{n} f_i.$$

To optimize this estimate using stochastic optimization, we must construct an unbiased gradient estimate. That is, we randomly select one sample from $D$ and compute a gradient $\alpha((x_i, y_i, \delta_i, p_i))$, and we require that

$$\nabla_w \hat{R}_{\text{IPS}}(\pi_w) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i = \mathbb{E}_{i \sim D}\left[\alpha((x_i, y_i, \delta_i, p_i))\right].$$

Here the expectation is over our random choice of one out of $n$ samples. Observe that $\alpha((x_i, y_i, \delta_i, p_i)) = \nabla f_i$ suffices (and indeed, this corresponds to vanilla SGD):

$$\mathbb{E}_{i \sim D}\left[\alpha((x_i, y_i, \delta_i, p_i))\right] = \sum_{i=1}^{n} \frac{1}{n}\,\alpha((x_i, y_i, \delta_i, p_i)) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i = \nabla_w \hat{R}_{\text{IPS}}(\pi_w).$$

Other choices of $\alpha(\cdot)$ can also produce unbiased gradient estimates, and this leads to the study of stochastic variance-reduced gradient optimization. Now let us attempt to construct an unbiased gradient estimate for Eq. (8):

$$\hat{R}_{\text{SNIPS}}(\pi_w) = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} g_i}.$$

Suppose such a gradient estimate exists, $\beta((x_i, y_i, \delta_i, p_i))$. Then,

$$\nabla_w \hat{R}_{\text{SNIPS}}(\pi_w) = \nabla_w \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} g_i} = \mathbb{E}_{i \sim D}\left[\beta((x_i, y_i, \delta_i, p_i))\right] = \frac{1}{n}\sum_{i=1}^{n} \beta((x_i, y_i, \delta_i, p_i)).$$

This identity must hold for any sample of BLBF instances, in particular for $D'$ and $D''$:

$$\nabla_w \frac{\sum_{i=1}^{n} f_i + f'}{\sum_{i=1}^{n} g_i + g'} = \frac{1}{n+1}\left[\sum_{i=1}^{n}\beta((x_i, y_i, \delta_i, p_i)) + \beta((x', y', \delta', p'))\right],$$
$$\nabla_w \frac{\sum_{i=1}^{n} f_i + f''}{\sum_{i=1}^{n} g_i + g''} = \frac{1}{n+1}\left[\sum_{i=1}^{n}\beta((x_i, y_i, \delta_i, p_i)) + \beta((x'', y'', \delta'', p''))\right].$$

Subtracting these two equations,

$$\nabla_w \frac{\sum_{i=1}^{n} f_i + f'}{\sum_{i=1}^{n} g_i + g'} - \nabla_w \frac{\sum_{i=1}^{n} f_i + f''}{\sum_{i=1}^{n} g_i + g''} = \frac{1}{n+1}\left[\beta((x', y', \delta', p')) - \beta((x'', y'', \delta'', p''))\right].$$

The left-hand side clearly depends on $\{(x_i, y_i, \delta_i, p_i)\}_{i=1}^{n}$ in general, while the right-hand side does not. This contradiction indicates that no construction of $\beta$ that only looks at a sub-sample of the data can yield an unbiased gradient estimate of $\hat{R}_{\text{SNIPS}}(\pi_w)$.

D APPENDIX: VARIANCE REGULARIZATION

Unlike in conventional supervised learning, a counterfactual empirical risk estimator like $\hat{R}_{\text{IPS}}(\pi_w)$ can have vastly different variances $\text{Var}(\hat{R}_{\text{IPS}}(\pi_w))$ for different $\pi_w$ in the hypothesis space (and $\hat{R}_{\text{SNIPS}}(\pi_w)$ as well) (Swaminathan & Joachims, 2015b). Intuitively, the closer the particular $\pi_w$ is to the exploration policy $\pi_0$, the larger the effective sample size (Owen, 2013) will be and the smaller the variance of the empirical risk estimate. For the optimization problems we solve in Eq. (18), this means that we should trust the $\lambda$-translated risk estimate $\hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w)$ more for some $w$ than for others, as we use $\hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w)$ only as a proxy for finding the policy that minimizes its expected value (i.e., the true loss). To this effect, generalization error bounds that account for this variance difference (Swaminathan & Joachims, 2015b) motivate a new type of overfitting control. This leads to the following training objective (Swaminathan & Joachims, 2015b), which can be thought of as a more reliable version of Eq. (18):

$$\hat{w}_j = \underset{w}{\arg\min} \left\{ \hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w) + \gamma \sqrt{\widehat{\text{Var}}\big(\hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w)\big)} \right\}. \quad (35)$$

Here, $\widehat{\text{Var}}(\hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w))$ is the estimated variance of $\hat{R}^{\lambda_j}_{\text{IPS}}(\pi_w)$ on the training data, and $\gamma$ is a regularization constant to be selected via cross-validation. The intuition behind this objective is that we optimize the upper confidence interval, which depends on the variance of the risk estimate for each $\pi_w$. While this objective again does not permit SGD optimization in its given form, it has been shown that a Taylor-majorization can be used to successively upper bound the objective in Eq. (35), and that typically a small number of iterations suffices to converge to a local optimum (Swaminathan & Joachims, 2015b).
Each such Taylor-majorization is again of the form

$$A\, \frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)} + B \left(\frac{\pi_w(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\right)^2$$

for easily computable constants $A$ and $B$ (Swaminathan & Joachims, 2015b), which allows for SGD optimization.
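To illustrate the variance-regularized objective of Eq. (35) in code, one way to estimate it on logged data is sketched below. This is a sketch under our reading of the objective (the estimator's variance is approximated by the sample variance of the per-example terms divided by the sample size); the Taylor-majorization machinery of Swaminathan & Joachims (2015b) is not reproduced, so the snippet is only suitable for evaluation or small-scale full-batch optimization.

```python
import torch

def variance_regularized_risk(policy, x, y, propensity, delta, lam, gamma):
    """Estimate Eq. (35): lambda-translated IPS risk plus gamma times its estimated std. dev."""
    probs = policy.action_probs(x)
    pi_w = probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    terms = (delta - lam) * pi_w / propensity     # per-example contributions to Eq. (18)
    n = terms.shape[0]
    risk = terms.mean()
    est_var = terms.var() / n                     # assumed estimate of Var(R_hat); our choice
    return risk + gamma * torch.sqrt(est_var)
```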