# Efficient Counterfactual Learning from Bandit Feedback

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Yusuke Narita (Yale University, yusuke.narita@yale.edu), Shota Yasui (Cyber Agent Inc., yasui_shota@cyberagent.co.jp), Kohei Yata (Yale University, kohei.yata@yale.edu)

What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.

## 1 Introduction

Interactive bandit systems (e.g., personalized education and medicine, ad/news/recommendation/search platforms) produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendations more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning: the result of each log is only observed for the action chosen by the system (e.g., the presented news) but not for all the other actions the system could have taken. Moreover, the log entries are biased in that the logs overrepresent actions favored by the system.

A potential solution to this problem is an A/B test that compares the performance of counterfactual systems. However, A/B testing counterfactual systems is often technically or managerially infeasible, since deploying a new policy is time- and money-consuming and entails a risk of failure. This leads us to the problem of counterfactual (off-policy) evaluation and learning, where one aims to use batch data collected by a logging policy to estimate the value of a counterfactual policy or algorithm without employing it (Li et al. 2010; Strehl et al. 2010; Li et al. 2011; 2012; Bottou et al. 2013; Swaminathan and Joachims 2015a; 2015b; Wang, Agarwal, and Dudik 2017; Swaminathan et al. 2017). Such evaluation allows us to compare the performance of counterfactual policies and decide which policy should be deployed in the field. This alternative approach thus avoids the problems of the naive A/B test approach.

**Method.** For off-policy evaluation with log data of bandit feedback, this paper develops and empirically implements a variance minimization technique. Variance reduction and statistical efficiency are important for minimizing the uncertainty we face in decision making. Indeed, an important open question raised by Li (2015) is how to achieve statistically more efficient (even optimal) offline estimation from batch bandit data. This question motivates a set of studies that bound and characterize the variances of particular estimators (Dudík et al. 2014; Li, Munos, and Szepesvári 2015; Thomas, Theocharous, and Ghavamzadeh 2015; Munos et al. 2016; Thomas and Brunskill 2016; Agarwal et al. 2017).
We study this statistical efficiency question in the context of offline policy value estimation with log data from a class of contextual bandit algorithms. This class includes most of the widely used algorithms such as contextual $\epsilon$-Greedy and Thompson Sampling, as well as their non-contextual analogs and random A/B testing. We allow the logging policy to be unknown, degenerate (non-stochastic), and time-varying, all of which are salient in real-world bandit applications. We also allow the evaluation target policy to be degenerate, again a key feature of real-life situations.

We consider offline estimators for the expected reward from a counterfactual policy. Our estimators can also be used for estimating the average treatment effect. Our estimators are variations of well-known inverse probability weighting estimators (Horvitz and Thompson (1952), Rosenbaum and Rubin (1983), and the modern studies cited above), except that we use an estimated propensity score (logging policy) even if we know the true propensity score. We show the following result, building upon Bickel et al. (1993), Hirano, Imbens, and Ridder (2003), and Ackerberg et al. (2014) among others:

**Theoretical Result 1.** Our estimators minimize the variance among all reasonable estimators. More precisely, our estimators minimize the asymptotic variance among all asymptotically normal estimators (in the standard statistical sense defined in Section 3).

We also provide estimators for the asymptotic variances of our estimators, thus allowing analysts to calculate the variance in practice. In contrast to Result 1, we also find:

**Theoretical Result 2.** Standard estimators using the true propensity score (logging policy) have larger asymptotic variances than our estimators.

Perhaps counterintuitively, therefore, the policy-maker should use an estimated propensity score even when she knows the true one.

**Application.** We empirically apply our estimators to evaluate and optimize the design of online advertisement formats. Our application is based on proprietary data provided by Cyber Agent Inc., the second largest Japanese advertisement company with about 6 billion USD market capitalization (as of November 2018). This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to users. Their algorithm produces logged bandit data. We use these data and our estimators to optimize the advertisement design for maximizing the click-through rate (CTR). In particular, we estimate how much the CTR would be improved by a counterfactual policy of choosing the best action (advertisement) for each context (user characteristics). We first obtain the following result:

**Empirical Result A.** Consistent with Theoretical Results 1-2, our estimators produce narrower confidence intervals for the counterfactual policy's CTR than a benchmark using the known propensity score (Swaminathan and Joachims 2015b).

This result is reported in Figure 1, where the confidence intervals using the True Propensity Score (Benchmark) are wider than the confidence intervals using propensity scores estimated by Gradient Boosting, Random Forest, or Ridge Logistic Regression. Thanks to this variance reduction, we conclude that the logging policy's CTR is below the confidence interval of the hypothetical policy of choosing the best advertisement for each context. This leads us to the following bottom line:

**Empirical Result B.**
Unlike the benchmark estimator, our estimator predicts that the hypothetical policy would statistically significantly improve the CTR by 10-15% (compared to the logging policy).

Empirical Results A and B therefore show that our estimator can substantially reduce the uncertainty we face in real-world policy-making.

Figure 1: Improving Ad Design with Lower Uncertainty. (Figure omitted; it plots expected reward estimates with confidence bars for the logging policy and the counterfactual policy of choosing the best action by context, for each propensity score estimator: True Propensity Score (Benchmark), Gradient Boosting Machine, Ridge Logistic Regression, and Random Forest.) Notes: This figure shows estimates of the expected CTRs of the logging policy and a counterfactual policy of choosing the best action for each context. CTRs are multiplied by a constant for confidentiality reasons. We obtain these estimates by the self-normalized inverse probability weighting estimator using the true propensity score (benchmark following Swaminathan and Joachims (2015b)) or estimated propensity scores (our proposal), both of which are defined and analyzed in Section 3. Bars indicate 95% confidence intervals based on our asymptotic variance estimators developed in Section 4.

## 2 Setup

### 2.1 Data Generating Process

We consider a general multi-armed contextual bandit setting. There is a set of $m + 1$ actions (equivalently, arms or treatments), $A = \{0, \ldots, m\}$, that the decision maker can choose from. Let $Y(\cdot) : A \to \mathbb{R}$ denote a potential reward function that maps actions into rewards or outcomes, where $Y(a)$ is the reward when action $a$ is chosen (e.g., whether an advertisement as an action results in a click). Let $X$ denote context or covariates (e.g., the user's demographic profile and browsing history) that the decision maker observes when picking an action. We denote the set of contexts by $\mathcal{X}$. We think of $(Y(\cdot), X)$ as a random vector with unknown distribution $G$.

We consider log data coming from the following data generating process (DGP), which is similar to those used in the literature on the offline evaluation of contextual bandit algorithms (Li et al. 2010; Strehl et al. 2010; Li et al. 2011; 2012; Swaminathan and Joachims 2015a; 2015b; Swaminathan et al. 2017). We observe data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ with $T$ observations. $D_t \equiv (D_{t0}, \ldots, D_{tm})$, where $D_{ta}$ is a binary variable indicating whether action $a$ is chosen in round $t$. $Y_t$ denotes the reward observed in round $t$, i.e., $Y_t \equiv \sum_{a=0}^m D_{ta} Y_t(a)$. $X_t$ denotes the context observed in round $t$.

A key feature of our DGP is that the data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ are divided into $B$ batches, where different batches may use different choice probabilities (propensity scores). Let $X_t^b \in \{1, 2, \ldots, B\}$ denote a random variable indicating the batch to which round $t$ belongs. We treat this batch number as one of the context variables and write $X_t = (\tilde{X}_t, X_t^b)$, where $\tilde{X}_t$ is the vector of context variables other than $X_t^b$. Let $p_t = (p_{t0}, \ldots, p_{tm}) \in \Delta(A)$ denote the potentially unknown probability vector indicating the probability that each action is chosen in round $t$. Here $\Delta(A) \equiv \{(p_a)_{a \in A} : \sum_a p_a = 1\}$, with $p_a$ being the probability that action $a$ is chosen. A contextual bandit algorithm is a sequence $\{F_b\}_{b=1}^B$ of distribution functions of the choice probabilities $p_t$ conditional on $\tilde{X}_t$, where $F_b : \tilde{\mathcal{X}} \to \Delta(\Delta(A))$ for $b \in \{1, 2, \ldots, B\}$, $\tilde{\mathcal{X}}$ is the support of $\tilde{X}$, and $\Delta(\Delta(A))$ is the set of distributions over $\Delta(A)$. $F_b$ takes context $\tilde{X}_t$ as input and returns a distribution of the probability vector $p_t$ in rounds of batch $b$. $F_b$ can vary across batches but does not change across rounds within batch $b$.
We assume that the log data are generated by a contextual bandit algorithm $\{F_b\}_{b=1}^B$ as follows:

- In each round $t = 1, \ldots, T$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$.
- Re-order round numbers so that they are monotonically increasing in their batch numbers $X_t^b$.
- In each round $t$ within batch $b \in \{1, 2, \ldots, B\}$ and given $\tilde{X}_t$, the probability vector $p_t = (p_{t0}, \ldots, p_{tm})$ is drawn from $F_b(\cdot \mid \tilde{X}_t)$.
- An action is randomly chosen based on the probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.

Here, the contextual bandit algorithm $\{F_b\}_{b=1}^B$ and the realized probability vectors $\{p_t\}_{t=1}^T$ may or may not be known to the analyst. We also allow the realization of $p_t$ to be degenerate, i.e., a certain action may be chosen with probability 1 at a point in time.

**Examples.** This DGP allows for many popular bandit algorithms, as the following examples illustrate. In each of the examples below, the contextual bandit algorithm $F_b$ is degenerate and produces a particular probability vector $p_t$ for sure.

**Example 1 (Random A/B testing).** We always choose each action uniformly at random: $p_{ta} = \frac{1}{m+1}$ always holds for any $a \in A$ and any $t = 1, \ldots, T$.

In the remaining examples, at every batch $b$, the algorithm uses the history of observations from the previous $b - 1$ batches to estimate the mean and the variance of the potential reward under each action conditional on each context: $\mu(a|x) \equiv E[Y(a) \mid \tilde{X} = x]$ and $\sigma^2(a|x) \equiv V[Y(a) \mid \tilde{X} = x]$. We denote the estimates using the history up to batch $b - 1$ by $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$. See Li et al. (2012) and Dimakopoulou, Athey, and Imbens (2017) for possible estimators based on generalized linear models and generalized random forests, respectively. The initial estimates, $\hat{\mu}_0(a|x)$ and $\hat{\sigma}^2_0(a|x)$, are set to any values.

**Example 2 ($\epsilon$-Greedy).** In each round within batch $b$, we choose the best action based on $\hat{\mu}_{b-1}(a|\tilde{X}_t)$ with probability $1 - \epsilon_b$ and choose the other actions uniformly at random with probability $\epsilon_b$:

$$p_{ta} = \begin{cases} 1 - \epsilon_b & \text{if } a = \operatorname{argmax}_{a' \in A} \hat{\mu}_{b-1}(a'|\tilde{X}_t) \\ \frac{\epsilon_b}{m} & \text{otherwise.} \end{cases}$$

**Example 3 (Thompson Sampling using Gaussian priors).** In each round within batch $b$, we sample the potential reward $y_t(a)$ from the distribution $N(\hat{\mu}_{b-1}(a|\tilde{X}_t), \hat{\sigma}^2_{b-1}(a|\tilde{X}_t))$ for each action, and choose the action with the highest sampled potential reward, $\operatorname{argmax}_{a' \in A} y_t(a')$. As a result, this algorithm chooses actions with the following probabilities:

$$p_{ta} = \Pr\{a = \operatorname{argmax}_{a' \in A} y_t(a')\},$$

where $(y_t(0), \ldots, y_t(m))' \sim N(\hat{\mu}_{b-1}(\tilde{X}_t), \hat{\Sigma}_{b-1}(\tilde{X}_t))$, $\hat{\mu}_{b-1}(x) = (\hat{\mu}_{b-1}(0|x), \ldots, \hat{\mu}_{b-1}(m|x))'$, and $\hat{\Sigma}_{b-1}(x) = \operatorname{diag}(\hat{\sigma}^2_{b-1}(0|x), \ldots, \hat{\sigma}^2_{b-1}(m|x))$.

In Examples 2 and 3, $p_t$ depends on the random realization of the estimates $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$, and so does the associated $F_b$. If the data are sufficiently large, the uncertainty in the estimates vanishes: $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$ converge to $\mu_{b-1}(a|x) \equiv E[Y(a) \mid \tilde{X} = x, X^b \le b-1]$ and $\sigma^2_{b-1}(a|x) \equiv V[Y(a) \mid \tilde{X} = x, X^b \le b-1]$, respectively. In this case, $F_b$ becomes nonrandom since it depends on the fixed functions $\mu_{b-1}(a|x)$ and $\sigma^2_{b-1}(a|x)$. In the following analysis, we consider this large-sample scenario and assume that $F_b$ is nonrandom.

To make the notation simpler, we put $\{F_b\}_{b=1}^B$ together into a single distribution $F : \mathcal{X} \to \Delta(\Delta(A))$ obtained by $F(\cdot \mid \tilde{X}, X^b = b) = F_b(\cdot \mid \tilde{X})$ for each $b \in \{1, 2, \ldots, B\}$. We use this to rewrite our DGP as follows:

- In each round $t = 1, \ldots, T$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$.
- Given $X_t$, the probability vector $p_t = (p_{t0}, \ldots, p_{tm})$ is drawn from $F(\cdot \mid X_t)$.
- An action is randomly chosen based on the probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.
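To make the batched DGP and the $\epsilon$-Greedy logging policy of Example 2 concrete, the following is a minimal simulation sketch in Python. The reward function `true_ctr`, the binary context, and all parameter values are hypothetical placeholders of our own, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T, B, eps = 3, 10_000, 5, 0.2          # actions {0,...,m}, rounds, batches, epsilon_b

def true_ctr(x, a):
    # Hypothetical E[Y(a) | X = x], used only to simulate rewards
    return 0.05 + 0.02 * a * x

# i.i.d. binary contexts; the batch number X^b is treated as part of the context,
# and rounds are re-ordered so batch numbers are monotonically increasing
X = rng.binomial(1, 0.5, size=T)
batch = np.sort(rng.integers(1, B + 1, size=T))

mu_hat = np.zeros((2, m + 1))             # \hat\mu_{b-1}(a|x), initialized arbitrarily
Y = np.zeros(T)
D = np.zeros((T, m + 1), dtype=int)

for b in range(1, B + 1):
    for t in np.where(batch == b)[0]:
        # epsilon-Greedy choice probabilities p_t given \hat\mu_{b-1} (Example 2)
        best = int(np.argmax(mu_hat[X[t]]))
        p_t = np.full(m + 1, eps / m)
        p_t[best] = 1.0 - eps
        a = rng.choice(m + 1, p=p_t)      # draw the action choice D_t from p_t
        D[t, a] = 1
        Y[t] = rng.binomial(1, true_ctr(X[t], a))   # observed reward Y_t
    # update \hat\mu_b(a|x) from all batches up to b (used within batch b+1)
    for x in (0, 1):
        for a in range(m + 1):
            chosen = (batch <= b) & (X == x) & (D[:, a] == 1)
            if chosen.any():
                mu_hat[x, a] = Y[chosen].mean()
```

The resulting arrays `Y`, `X`, `batch`, and `D` have the structure of the logged data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ described above.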
Define $p_{0a}(x) \equiv \Pr_{D \sim p,\ p \sim F}(D_a = 1 \mid X = x)$ for each $a$, and let $p_0(x) = (p_{00}(x), \ldots, p_{0m}(x))$. This is the choice probability vector conditional on each context. We call $p_0(\cdot)$ the logging policy or the propensity score. $F$ is common for all rounds regardless of the batch to which they belong. Thus $p_t$ and $D_t$ are i.i.d. across rounds. Because $(Y_t(\cdot), X_t)$ is i.i.d. and $Y_t = \sum_{a=0}^m D_{ta} Y_t(a)$, each observation $(Y_t, X_t, D_t)$ is i.i.d. Note also that $D_t$ is independent of $Y_t(\cdot)$ conditional on $X_t$.

### 2.2 Parameters of Interest

We are interested in using the log data to estimate the expected reward from any given counterfactual policy $\pi : \mathcal{X} \to \Delta(A)$, which chooses a distribution of actions given each context:

$$V^\pi \equiv E_{(Y(\cdot),X)\sim G}\Big[\sum_{a=0}^m Y(a)\,\pi(a|X)\Big] = E_{(Y(\cdot),X)\sim G,\ D\sim p_0(X)}\Big[\sum_{a=0}^m \frac{Y(a)\,D_a\,\pi(a|X)}{p_{0a}(X)}\Big], \tag{1}$$

where the last equality uses the independence of $D$ and $Y(\cdot)$ conditional on $X$ and the definition of $p_0(\cdot)$. Here, $\Delta(A) \equiv \{(p_a) \in \mathbb{R}^{m+1} \mid \sum_a p_a \le 1\}$. We allow the counterfactual policy $\pi$ to be degenerate, i.e., $\pi$ may choose a particular action with probability 1.

Depending on the choice of $\pi$, $V^\pi$ represents a variety of parameters of interest. When we set $\pi(a|x) = 1$ for a particular action $a$ and $\pi(a'|x) = 0$ for all $a' \in A \setminus \{a\}$ and all $x \in \mathcal{X}$, $V^\pi$ equals $E_{(Y(\cdot),X)\sim G}[Y(a)]$, the expected reward from action $a$. When we set $\pi(a|x) = 1$, $\pi(0|x) = -1$, and $\pi(a'|x) = 0$ for all $a' \in A \setminus \{0, a\}$ and all $x \in \mathcal{X}$, $V^\pi$ equals $E_{(Y(\cdot),X)\sim G}[Y(a) - Y(0)]$, the average treatment effect of action $a$ over action 0. Such treatment effects are of scientific and policy interest in medical and social sciences. Business and managerial interests also motivate treatment effect estimation. For example, when a company implements a bandit algorithm using a particular reward measure like an immediate purchase, the company is often interested in treatment effects on other outcomes like long-term user retention.

## 3 Efficient Value Estimation

We consider the efficient estimation of the expected reward from a counterfactual policy, $V^\pi$, using an estimator consisting of two steps. In the first step, we nonparametrically estimate the propensity score vector $p_0(\cdot)$ by a consistent estimator. Possible estimators include machine learning algorithms such as gradient boosting, as well as nonparametric sieve estimators and kernel regression estimators, as detailed in Section 3.2. In the second step, we plug the estimated propensity score $\hat p(\cdot)$ into the sample analogue of expression (1) to estimate $V^\pi$ (in practice, some trimming or thresholding of $\hat p(\cdot)$ may be desirable for numerical stability):

$$\hat V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{\hat p_a(X_t)}.$$

Alternatively, one can use a self-normalized estimator inspired by Swaminathan and Joachims (2015b) when $\sum_{a=0}^m \pi(a|x) = 1$ for all $x \in \mathcal{X}$:

$$\hat V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{\hat p_a(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{\hat p_a(X_t)}}.$$

Swaminathan and Joachims (2015b) suggest that $\hat V^\pi_{SN}$ tends to be less biased than $\hat V^\pi$ in small samples. Unlike Swaminathan and Joachims (2015b), however, we use the estimated propensity score rather than the true one.

The above estimators estimate a scalar parameter $V^\pi$ defined as a function of the distribution of $(Y(\cdot), X)$, on which we impose no parametric assumption. Our estimators therefore attempt to solve a semiparametric estimation problem, i.e., a partly parametric and partly nonparametric estimation problem.
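The two plug-in estimators above are straightforward to compute once the propensity score has been estimated. Below is a minimal NumPy sketch; the function names and the array layout (rows are rounds, columns are actions, with `pi` and `p_hat` already evaluated at each round's context) are our own assumptions.

```python
import numpy as np

def value_ipw(Y, D, pi, p_hat):
    """Plug-in estimator \\hat{V}^pi: average over rounds of
    sum_a Y_t * D_ta * pi(a|X_t) / \\hat{p}_a(X_t).
    Shapes: Y (T,), D (T, m+1) one-hot, pi (T, m+1), p_hat (T, m+1)."""
    w = D * pi / p_hat
    return float(np.mean(np.sum(Y[:, None] * w, axis=1)))

def value_ipw_sn(Y, D, pi, p_hat):
    """Self-normalized version \\hat{V}^pi_SN, for policies with sum_a pi(a|x) = 1."""
    w = D * pi / p_hat
    num = np.mean(np.sum(Y[:, None] * w, axis=1))
    den = np.mean(np.sum(w, axis=1))
    return float(num / den)
```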
For this semiparametric estimation problem, we first derive the semiparametric efficiency bound, a bound on how efficient and precise the estimation of the parameter can be, which is a semiparametric analog of the Cramer-Rao bound (Bickel et al. 1993). The asymptotic variance of any asymptotically normal estimator is no smaller than the semiparametric efficiency bound. Following standard statistical terminology, we say that an estimator $\hat\theta$ for a parameter $\theta$ is asymptotically normal if $\sqrt{T}(\hat\theta - \theta) \xrightarrow{d} N(0, \Sigma)$ as $T \to \infty$, where $\xrightarrow{d}$ denotes convergence in distribution and $N(0, \Sigma)$ denotes a normally distributed random variable with mean 0 and variance $\Sigma$. We call $\Sigma$ the asymptotic variance of $\hat\theta$. The semiparametric efficiency bound for $\theta$ is a lower bound on the asymptotic variance of asymptotically normal estimators; the Supplementary Material provides a formal definition of the semiparametric efficiency bound.

We show that the above estimators achieve the semiparametric efficiency bound, i.e., they minimize the asymptotic variance among all asymptotically normal estimators. Our analysis uses a couple of regularity conditions. We first assume that the logging policy $p_0(\cdot)$ ex ante chooses every action with a positive probability for every context.

**Assumption 1.** There exists some $\underline{p}$ such that $0 < \underline{p} \le \Pr_{D \sim p,\ p \sim F}(D_a = 1 \mid X = x) = p_{0a}(x)$ for any $x \in \mathcal{X}$ and for $a = 0, \ldots, m$.

Note that Assumption 1 is consistent with the possibility that the realization of $p_{ta}$ takes on the value 0 or 1 (as long as it takes on positive values with a positive probability). We also assume the existence of finite second moments of the potential rewards.

**Assumption 2.** $E[Y(a)^2] < \infty$ for $a = 0, \ldots, m$.

The following lemma provides the semiparametric efficiency bound for $V^\pi$. All proofs are in the Supplementary Material.

**Lemma 1 (Semiparametric Efficiency Bound).** Under Assumptions 1 and 2, the semiparametric efficiency bound for $V^\pi$, the expected reward from counterfactual policy $\pi$, is

$$E\Big[\sum_{a=0}^m \frac{V[Y(a)\mid X]\,\pi(a|X)^2}{p_{0a}(X)} + \big(\theta(X) - V^\pi\big)^2\Big],$$

where $\theta(X) = \sum_{a=0}^m E[Y(a)\mid X]\,\pi(a|X)$ is the expected reward from policy $\pi$ conditional on $X$.

Lemma 1 implies the semiparametric efficiency bounds for the expected reward from each action and for the average treatment effect, since they are special cases of $V^\pi$.

**Corollary 1.** Suppose that Assumptions 1 and 2 hold. Then the semiparametric efficiency bound for the expected reward from each action, $E[Y(a)]$, is

$$E\Big[\frac{V[Y(a)\mid X]}{p_{0a}(X)} + \big(E[Y(a)\mid X] - E[Y(a)]\big)^2\Big].$$

The semiparametric efficiency bound for the average treatment effect, $E[Y(a) - Y(0)]$, is

$$E\Big[\frac{V[Y(0)\mid X]}{p_{00}(X)} + \frac{V[Y(a)\mid X]}{p_{0a}(X)} + \big(E[Y(a) - Y(0)\mid X] - E[Y(a) - Y(0)]\big)^2\Big].$$

Our proposed estimators are two-step generalized-method-of-moments estimators and are asymptotically normal under some regularity conditions, one of which requires that the convergence rate of $\hat p(\cdot)$ be faster than $n^{1/4}$ (Newey 1994; Chen 2007). Given the asymptotic normality of the estimators, we find that they achieve the semiparametric efficiency bound, building upon Ackerberg et al. (2014) among others.

**Theorem 1 (Efficient Estimators).** Suppose that Assumptions 1 and 2 hold and that $\hat p(\cdot)$ is a consistent estimator for $p_0(\cdot)$. Then the asymptotic variances of $\hat V^\pi$ and $\hat V^\pi_{SN}$ achieve the semiparametric efficiency bound for $V^\pi$ (provided in Lemma 1).

### 3.1 Inefficient Value Estimation

In some environments, we know the true $p_0(\cdot)$ or observe the realization of the probability vectors $\{p_t\}_{t=1}^T$. In this case, an alternative way to estimate $V^\pi$ is to use the sample analogue of expression (1) without estimating the propensity score.
If we know $p_0(\cdot)$, a possible estimator is

$$\tilde V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{p_{0a}(X_t)}.$$

If we observe the realization of $\{p_t\}_{t=1}^T$, we may use

$$\check V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{p_{ta}}.$$

When $\sum_{a=0}^m \pi(a|x) = 1$ for all $x \in \mathcal{X}$, it is again possible to use their self-normalized versions:

$$\tilde V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{p_{0a}(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{p_{0a}(X_t)}}, \qquad \check V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{p_{ta}}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{p_{ta}}}.$$

These intuitive estimators turn out to be less efficient than the estimators with the estimated propensity score, as the following result shows.

**Theorem 2 (Inefficient Estimators).** Suppose that the propensity score $p_0(\cdot)$ is known and we observe the realization of $\{p_t\}_{t=1}^T$. Suppose also that Assumptions 1 and 2 hold and that $\hat p(\cdot)$ is a consistent estimator for $p_0(\cdot)$. Then, the asymptotic variances of $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are no smaller than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$. Generically, $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are strictly less efficient than $\hat V^\pi$ and $\hat V^\pi_{SN}$ in the following sense.

1. If $\Pr\big(E[Y(a)\mid X]\,\frac{\pi(a|X)}{p_{0a}(X)} \ne \theta(X) \text{ for some } a\big) > 0$, then the asymptotic variances of $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are strictly larger than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$.
2. If $\Pr\big(E[Y(a)^2\mid X]\,\pi(a|X)^2\big(E[\tfrac{1}{p_a}\mid X] - \tfrac{1}{p_{0a}(X)}\big) \ne 0 \text{ for some } a\big) > 0$, then the asymptotic variances of $\check V^\pi$ and $\check V^\pi_{SN}$ are strictly larger than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$.

The condition in Part 1 of Theorem 2 is about the dominating term in the difference between $\hat V^\pi$ and $\tilde V^\pi$. The proofs of Theorems 1 and 2 show that the asymptotic variance of $\hat V^\pi$ is the asymptotic variance of

$$\tilde V^\pi - \frac{1}{T}\sum_{t=1}^T \Big[\sum_{a=0}^m E[Y(a)\mid X_t]\,\frac{\pi(a|X_t)}{p_{0a}(X_t)}\,D_{ta} - \theta(X_t)\Big].$$

Part 1 of Theorem 2 requires that the second term not be always zero, so that the asymptotic variance of $\hat V^\pi$ is different from that of $\tilde V^\pi$. As long as the two variances are not the same, $\hat V^\pi$ achieves variance reduction.

Part 2 of Theorem 2 requires that $E[\tfrac{1}{p_{ta}} \mid X_t] - \tfrac{1}{E[p_{ta}\mid X_t]}\ \big(= E[\tfrac{1}{p_{ta}}\mid X_t] - \tfrac{1}{p_{0a}(X_t)}\big) \ne 0$ with a positive probability. This means that $p_t$ is not always the same as the true propensity score $p_0(X_t)$, i.e., $F(\cdot\mid X_t)$ is not degenerate (recall that $p_t$ is drawn from $F(\cdot\mid X_t)$, whose expected value is $p_0(X_t)$). Under this condition, $\check V^\pi$ has a strictly larger asymptotic variance than $\tilde V^\pi$ and $\hat V^\pi$.

Theorems 1 and 2 suggest that we should use an estimated propensity score regardless of whether the true propensity score is known. To develop some intuition for this result, consider a simple situation where the context $X_t$ always takes some constant value $x$. Suppose that we are interested in estimating the expected reward from action $a$, $E[Y(a)]$. Since $X_t$ is constant across rounds, a natural nonparametric estimator for $p_{0a}(x)$ is the proportion of rounds in which action $a$ was chosen: $\hat p_a(x) = \frac{\sum_{t=1}^T D_{ta}}{T}$. The estimator using the estimated propensity score is

$$\frac{1}{T}\sum_{t=1}^T \frac{Y_t D_{ta}}{\hat p_a(x)} = \frac{1}{\sum_{t=1}^T D_{ta}}\sum_{t=1}^T Y_t D_{ta}.$$

The estimator using the true propensity score is

$$\frac{1}{T}\sum_{t=1}^T \frac{Y_t D_{ta}}{p_{0a}(x)} = \frac{1}{T\,p_{0a}(x)}\sum_{t=1}^T Y_t D_{ta}.$$

When action $a$ happens to be chosen frequently in a sample, so that $\sum_{t=1}^T D_{ta}$ is larger, the absolute value of $\sum_{t=1}^T Y_t D_{ta}$ also tends to be larger in the sample. Because of this positive correlation between $\sum_{t=1}^T D_{ta}$ and the absolute value of $\sum_{t=1}^T Y_t D_{ta}$, $\hat V^\pi$ has a smaller variance than $\tilde V^\pi$, whose denominator has no such correlation with the numerator. Similar intuition applies to the comparison between $\check V^\pi$ and $\hat V^\pi$.
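This intuition can be checked numerically. The following sketch simulates the constant-context case above and compares, across Monte Carlo replications, the estimator that divides by the estimated score $\hat p_a(x)$ with the one that divides by the true score $p_{0a}(x)$; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p0a, mu_a, n_sims = 500, 0.3, 0.5, 5_000   # hypothetical sample size, score, mean reward

est_with_phat, est_with_p0 = [], []
for _ in range(n_sims):
    Da = rng.binomial(1, p0a, size=T)          # D_ta: whether action a is chosen in round t
    Y = rng.normal(mu_a, 1.0, size=T) * Da     # observed rewards (zero when a is not chosen)
    if Da.sum() == 0:
        continue
    est_with_phat.append(Y.sum() / Da.sum())   # divides by T * \hat{p}_a(x) = sum_t D_ta
    est_with_p0.append(Y.sum() / (T * p0a))    # divides by T * p_0a(x)

print("MC variance, estimated score:", np.var(est_with_phat))
print("MC variance, true score:     ", np.var(est_with_p0))
```

With these settings the estimated-score version typically shows a visibly smaller Monte Carlo variance, in line with Theorem 2.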
### 3.2 How to Estimate Propensity Scores?

There are several options for the first-step estimation of the propensity score.

1. A sieve Least Squares (LS) estimator:
$$\hat p_a(\cdot) = \operatorname*{argmin}_{p_a(\cdot) \in H^a_T} \sum_{t=1}^T \big(D_{ta} - p_a(X_t)\big)^2,$$
where $H^a_T = \{p_a(x) = \sum_{j=1}^{k^a_T} q_{aj}(x)\lambda_{aj} = q^{k^a_T}(x)'\lambda_a\}$ and $k^a_T \to \infty$ as $T \to \infty$. Here $\{q_{aj}\}_{j=1}^\infty$ is a set of known basis functions defined on $\mathcal{X}$, and $q^{k^a_T}(\cdot) = (q_{a1}(\cdot), \ldots, q_{a k^a_T}(\cdot))'$.

2. A sieve Logit Maximum Likelihood estimator:
$$\hat p(\cdot) = \operatorname*{argmax}_{p(\cdot) \in H_T} \sum_{t=1}^T \sum_{a=0}^m D_{ta} \log p_a(X_t),$$
where $H_T = \Big\{p : \mathcal{X} \to (0,1)^{m+1} : p_a(x) = \frac{\exp(R^{k_T}(x)'\lambda_a)}{1 + \sum_{a'=1}^m \exp(R^{k_T}(x)'\lambda_{a'})} \text{ for } a = 1, \ldots, m,\ p_0(x) = 1 - \sum_{a=1}^m p_a(x)\Big\}$. Here $R^{k_T}(\cdot) = (R_1(\cdot), \ldots, R_{k_T}(\cdot))'$ and $\{R_j\}_{j=1}^\infty$ is a set of basis functions.

3. Prediction of $D_{ta}$ by $X_t$ using a modern machine learning algorithm such as random forest, ridge logistic regression, or gradient boosting.

The above estimators are known to be consistent with a convergence rate faster than $n^{1/4}$ under regularity conditions (Newey 1997; Cattaneo 2010; Knight and Fu 2000; Blanchard, Lugosi, and Vayatis 2003; Bühlmann and Van De Geer 2011; Wager and Athey 2018).

How should one choose a propensity score estimator? We prefer an estimated score to the true one because it corrects the discrepancy between the realized action assignment in the data and the assignment predicted by the true score. To achieve this goal, a good propensity score estimator should fit the data better than the true score does, which means that the estimator should overfit to some extent. As a concrete example, in our empirical analysis, random forest produces a larger (worse) variance than gradient boosting and ridge logistic regression (see Figure 1 and Table 1). This is because random forest fits the data worse, which is due to its bagging aspect preventing it from overfitting. In general, however, we do not know which propensity score estimator achieves the best degree of overfitting. We would therefore suggest that the analyst try different estimators to determine which one is most efficient.

## 4 Estimating Asymptotic Variance

We often need to estimate the asymptotic variance of the above estimators. For example, variance estimation is crucial for determining whether a counterfactual policy is statistically significantly better than the logging policy. We propose an estimator that uses the sample analogue of an expression of the asymptotic variance. As shown in the proof of Theorem 1, the asymptotic variance of $\hat V^\pi$ and $\hat V^\pi_{SN}$ is

$$E\big[\big(g(Y, X, D, V^\pi, p_0) + \alpha(X, D, p_0, \mu_0)\big)^2\big],$$

where $\mu_0 : \mathcal{X} \to \mathbb{R}^{m+1}$ is such that $\mu_0(a|x) = E[Y(a) \mid X = x]$ for each $a$ and $x$,

$$g(Y, X, D, \theta, p) = \sum_{a=0}^m \frac{Y\, D_a\, \pi(a|X)}{p_a(X)} - \theta, \qquad \alpha(X, D, p, \mu) = -\sum_{a=0}^m \frac{\mu(a|X)\,\pi(a|X)}{p_a(X)}\big(D_a - p_a(X)\big).$$

We estimate this asymptotic variance in two steps. In the first step, we obtain estimates of $V^\pi$ and $p_0$ using the method in Section 3. In addition, we estimate $\mu_0(a|x)$ by a nonparametric regression of $Y_t$ on $X_t$ using the subsample with $D_{ta} = 1$ for each $a$. Denote the estimate by $\hat\mu$. For this regression, one may use a sieve Least Squares estimator or machine learning algorithms; in the empirical application below, we use ridge logistic regression. In the second step, we plug the estimates of $V^\pi$, $p_0$, and $\mu_0$ into the sample analogue of $E[(g(Y,X,D,V^\pi,p_0) + \alpha(X,D,p_0,\mu_0))^2]$ to estimate the asymptotic variance. When we use $\hat V^\pi$:

$$\widehat{AVar}(\hat V^\pi) \equiv \frac{1}{T}\sum_{t=1}^T \big(g(Y_t, X_t, D_t, \hat V^\pi, \hat p) + \alpha(X_t, D_t, \hat p, \hat\mu)\big)^2.$$

When we use $\hat V^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\hat V^\pi$ with $\hat V^\pi_{SN}$ in the above expression.
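As a sketch of how the first-step propensity estimation (option 3 above) and the plug-in variance formula might be wired together in practice, consider the following. The helper names are our own, a ridge-penalized multinomial logit stands in for any of the listed estimators, and the sign convention for $\alpha$ follows the reconstruction above.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def estimate_propensity(X_feat, A):
    """First step: predict the chosen action A_t from X_t. A ridge-penalized
    multinomial logit is used here; gradient boosting or random forest (option 3)
    would be drop-in replacements."""
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_feat, A)
    p_hat = model.predict_proba(X_feat)            # (T, m+1); columns follow model.classes_
    return np.clip(p_hat, 1e-3, 1.0)               # trimming for numerical stability

def avar_and_ci(Y, D, pi, p_hat, mu_hat, v_hat, level=0.95):
    """Plug-in estimate of E[(g + alpha)^2] and the implied confidence interval.
    mu_hat: (T, m+1) estimates of E[Y(a) | X_t], e.g., from ridge logistic regression."""
    g = np.sum(Y[:, None] * D * pi / p_hat, axis=1) - v_hat
    alpha = -np.sum(mu_hat * pi / p_hat * (D - p_hat), axis=1)
    avar = np.mean((g + alpha) ** 2)
    half = stats.norm.ppf(0.5 + level / 2) * np.sqrt(avar / len(Y))
    return avar, (v_hat - half, v_hat + half)
```

Here `v_hat` would be the output of the value estimator from Section 3, so the returned interval is the kind of 95% confidence interval reported in Figure 1 and Table 1.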
This asymptotic variance estimator is a two-step generalized-method-of-moments estimator, and it is consistent under the condition that the first-step estimator of $(V^\pi, p_0, \mu_0)$ is consistent and some regularity conditions (Newey 1994).

It is easier to estimate the asymptotic variance of $\tilde V^\pi$ and $\tilde V^\pi_{SN}$, which use the true propensity score. Their asymptotic variance is $E[g(Y, X, D, V^\pi, p_0)^2]$ by the standard central limit theorem. When we use $\tilde V^\pi$, we estimate this asymptotic variance by

$$\widehat{AVar}(\tilde V^\pi) = \frac{1}{T}\sum_{t=1}^T g(Y_t, X_t, D_t, \tilde V^\pi, p_0)^2.$$

When we use $\tilde V^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\tilde V^\pi$ with $\tilde V^\pi_{SN}$ in the above expression.

## 5 Real-World Application

We apply our estimators described in Sections 3 and 4 to empirically evaluate and optimize the design of online advertisements. This application uses proprietary data provided by Cyber Agent Inc., which we described in the introduction. This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to user impressions (there are four design choices). This algorithm produces logged bandit data. We use this logged bandit data and our estimators to improve their advertisement design for maximizing the click-through rate (CTR). In the notation of our theoretical framework, the reward $Y$ is a click, the action $a$ is one of the four possible advertisement designs, and the context $X$ consists of the user and ad characteristics used by the company's logging policy.

| Propensity Score Estimator | Existing Logging Policy: CI for Expected Reward | Shrinkage in CI | Policy of Choosing Best Action by Context: CI for Expected Reward | Shrinkage in CI |
|---|---|---|---|---|
| True Score (Benchmark) | 1.036 ± 0.083 | 0 | 1.140 ± 0.171 | 0 |
| Gradient Boosting Machine | 1.047 ± 0.064 | 22.5% | 1.197 ± 0.131 | 23.4% |
| Ridge Logistic Regression | 0.987 ± 0.058 | 30.2% | 1.108 ± 0.113 | 34.1% |
| Random Forest | 1.036 ± 0.077 | 6.62% | 1.194 ± 0.159 | 7.41% |
| Sample Size | 57,619 | | | |

Table 1: Improving Ad Design with Lower Uncertainty. Notes: The first and third numeric columns of this table show 95% confidence intervals of the expected CTRs $V^\pi$ of the logging policy and a hypothetical policy of choosing the best action (ad) for each context. CTRs are multiplied by a constant for confidentiality reasons. We obtain the CTR estimates by the self-normalized inverse probability weighting estimator $\tilde V^\pi_{SN}$ using the true propensity score (Swaminathan and Joachims 2015b) or the estimated propensity score ($\hat V^\pi_{SN}$ in Section 3). We estimate standard errors and confidence intervals based on the method described in Section 4. The second and fourth numeric columns show the size of the reduction in confidence interval length, i.e., the value $\alpha$ such that the length of the confidence interval equals $100(1-\alpha)\%$ of the length of the confidence interval using the true propensity score.

The logging policy (the company's existing contextual bandit algorithm) works as follows. For each round, the logging policy first randomly samples each action's predicted reward from a beta distribution. This beta distribution is parametrized by the predicted CTR for each context, where the CTR prediction is based on a Factorization Machine (Rendle 2010). The logging policy then chooses the action (advertisement) with the largest sampled reward prediction. The logging policy and the underlying CTR prediction stay the same for all rounds in each day. Each day therefore plays the role of a batch in the model of Section 2. This somewhat nonstandard logging policy and the resulting log data are an example of our DGP in Section 2.
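For concreteness, the logging policy just described can be sketched as follows. The paper only states that the beta distribution is parametrized by the Factorization-Machine-predicted CTR, so the mean-and-concentration parametrization and all numbers below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def logging_policy_choice(predicted_ctr, concentration=100.0):
    """Sketch of the described logging policy: sample each action's reward from a
    beta distribution tied to its predicted CTR, then pick the argmax.
    The mean/concentration parametrization below is an assumption, not the
    company's actual implementation."""
    p = np.asarray(predicted_ctr, dtype=float)
    samples = rng.beta(p * concentration, (1.0 - p) * concentration)
    return int(np.argmax(samples))

# four ad designs with hypothetical predicted CTRs
print(logging_policy_choice([0.031, 0.028, 0.035, 0.030]))
```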
This logging policy may have room for improvement for several reasons. First, the logging policy randomly samples advertisements and does not necessarily choose the advertisement with the best predicted CTR. Second, the logging policy uses a predictive Factorization Machine for its CTR prediction, which may differ from the causal CTR (the causal effect of each advertisement on the probability of a click).

To improve on the logging policy, we first estimate the propensity score by random forest, ridge logistic regression, or gradient boosting (implemented with XGBoost). These estimators are known to satisfy the regularity conditions (e.g., consistency) required for our theoretical results, as explained in Section 3.2. With the estimated propensity score, we then use our estimator $\hat V^\pi_{SN}$ to estimate the expected reward from two possible policies: (1) the logging policy and (2) a counterfactual policy that chooses the action (advertisement) predicted to maximize the CTR conditional on each context. To implement this counterfactual policy, we estimate $E[Y(a)\mid X]$ by ridge logistic regression for each action $a$ and context $X$ used by the logging policy (we apply one-hot encoding to categorical variables in $X$). Given each context $X$, the counterfactual policy then chooses the action with the highest estimated value of $E[Y(a)\mid X]$.

Importantly, we use separate data sets for the two estimation tasks (one for the best actions and the other for the expected reward from the hypothetical policy). Specifically, we use data logged during April 20-26, 2018 for estimating the best actions and data logged during April 27-29 for estimating the expected reward. This data separation allows us to avoid overfitting and the resulting overestimation of the CTR gains from the counterfactual policy.

As a benchmark, we also estimate the same expected rewards with Swaminathan and Joachims (2015b)'s self-normalized estimator $\tilde V^\pi_{SN}$, which uses the true propensity score. The resulting estimates show the following result:

**Empirical Result A.** Consistent with Theorems 1-2, our estimator $\hat V^\pi_{SN}$ with the estimated score is statistically more efficient than the benchmark $\tilde V^\pi_{SN}$ with the true score.

This result is reported in Figure 1 and Table 1, where the confidence intervals for the predicted CTR using the True Propensity Score (Benchmark) are less precise (wider) than those using estimated propensity scores, regardless of which of the three score estimators is used. The magnitude of this shrinkage in the confidence intervals and standard errors is 6-34%, depending on how the propensity score is estimated. This variance reduction allows us to conclude that the logging policy's CTR is below the lower bound of the confidence interval of the hypothetical policy, giving us confidence in the following implication:

**Empirical Result B.** Compared to the logging policy, the hypothetical policy (choosing the best advertisement given each context) statistically significantly improves the CTR by 10-15% at the 5% significance level.
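As an illustration of the counterfactual-policy construction described above (a ridge logistic regression of the click on the one-hot-encoded context, fit separately for each action on the earlier data split), a possible sketch is the following; the function names and data layout are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def fit_reward_models(X_train, A_train, Y_train, n_actions):
    """Estimate E[Y(a)|X] by ridge logistic regression, one model per action,
    using only the rounds of the earlier data split in which that action was logged.
    Assumes both click outcomes appear for every action in the training split."""
    enc = OneHotEncoder(handle_unknown="ignore").fit(X_train)
    Xd = enc.transform(X_train)
    models = []
    for a in range(n_actions):
        rows = np.asarray(A_train) == a
        models.append(LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
                      .fit(Xd[rows], Y_train[rows]))
    return enc, models

def best_action_policy(enc, models, X_eval, n_actions):
    """Counterfactual policy pi: probability 1 on the action with the highest
    estimated E[Y(a)|X] for each context in the later evaluation split."""
    Xd = enc.transform(X_eval)
    scores = np.column_stack([m.predict_proba(Xd)[:, 1] for m in models])
    pi = np.zeros((Xd.shape[0], n_actions))
    pi[np.arange(Xd.shape[0]), scores.argmax(axis=1)] = 1.0
    return pi
```

The resulting `pi` matrix, together with the estimated propensity scores, is what the value estimator $\hat V^\pi_{SN}$ and the variance estimator of Section 4 take as inputs on the evaluation split.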
## 6 Conclusion

We have investigated the most statistically efficient use of batch bandit data for estimating the expected reward from a counterfactual policy. Our estimators minimize the asymptotic variance among all asymptotically normal estimators (Theorem 1). By contrast, standard estimators have larger asymptotic variances (Theorem 2). We have also applied our estimators to improve online advertisement design. Compared to the frontier benchmark $\tilde V^\pi_{SN}$, our reward estimator $\hat V^\pi_{SN}$ provides the company with more statistical confidence in how to improve on its existing bandit algorithm (Empirical Results A and B). The hypothetical policy of choosing the best advertisement given user characteristics would improve the click-through rate by 10-15% at the 5% significance level. These empirical results highlight the practical value of Theorems 1-2.

**Acknowledgments.** We are grateful to seminar participants at the ICML/IJCAI/AAMAS Workshop on Machine Learning for Causal Inference, Counterfactual Prediction, and Autonomous Action (Causal ML) and the RIKEN Center for Advanced Intelligence Project, especially Junya Honda, Masaaki Imaizumi, Atsushi Iwasaki, Kohei Kawaguchi, and Junpei Komiyama.

## References

- Ackerberg, D.; Chen, X.; Hahn, J.; and Liao, Z. 2014. Asymptotic Efficiency of Semiparametric Two-step GMM. Review of Economic Studies 81(3):919-943.
- Agarwal, A.; Basu, S.; Schnabel, T.; and Joachims, T. 2017. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers. KDD, 687-696.
- Bickel, P. J.; Klaassen, C. A. J.; Ritov, Y.; and Wellner, J. A. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
- Blanchard, G.; Lugosi, G.; and Vayatis, N. 2003. On the Rate of Convergence of Regularized Boosting Classifiers. Journal of Machine Learning Research 4(Oct):861-894.
- Bottou, L.; Peters, J.; Quiñonero-Candela, J.; Charles, D. X.; Chickering, D. M.; Portugaly, E.; Ray, D.; Simard, P.; and Snelson, E. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14(1):3207-3260.
- Bühlmann, P., and Van De Geer, S. 2011. Statistics for High-dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
- Cattaneo, M. D. 2010. Efficient Semiparametric Estimation of Multi-valued Treatment Effects under Ignorability. Journal of Econometrics 155(2):138-154.
- Chen, X. 2007. Large Sample Sieve Estimation of Semi-nonparametric Models. Handbook of Econometrics 6:5549-5632.
- Dimakopoulou, M.; Athey, S.; and Imbens, G. 2017. Estimation Considerations in Contextual Bandits. arXiv.
- Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly Robust Policy Evaluation and Optimization. Statistical Science 29:485-511.
- Hirano, K.; Imbens, G. W.; and Ridder, G. 2003. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 71(4):1161-1189.
- Horvitz, D. G., and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260):663-685.
- Knight, K., and Fu, W. 2000. Asymptotics for Lasso-type Estimators. Annals of Statistics 1356-1378.
- Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-bandit Approach to Personalized News Article Recommendation. WWW, 661-670.
- Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. WSDM, 297-306.
- Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. Journal of Machine Learning Research: Workshop and Conference Proceedings 26:19-36.
- Li, L.; Munos, R.; and Szepesvári, C. 2015. Toward Minimax Off-policy Value Estimation. AISTATS, 608-616.
- Li, L. 2015. Offline Evaluation and Optimization for Interactive Systems. WSDM.
- Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and Efficient Off-policy Reinforcement Learning. NIPS, 1054-1062.
- Newey, W. K. 1994. The Asymptotic Variance of Semiparametric Estimators. Econometrica 62(6):1349-1382.
- Newey, W. K. 1997. Convergence Rates and Asymptotic Normality for Series Estimators. Journal of Econometrics 79(1):147-168.
- Rendle, S. 2010. Factorization Machines. ICDM, 995-1000.
- Rosenbaum, P. R., and Rubin, D. B. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70(1):41-55.
- Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from Logged Implicit Exploration Data. NIPS, 2217-2225.
- Swaminathan, A., and Joachims, T. 2015a. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. Journal of Machine Learning Research 16:1731-1755.
- Swaminathan, A., and Joachims, T. 2015b. The Self-normalized Estimator for Counterfactual Learning. NIPS, 3231-3239.
- Swaminathan, A.; Krishnamurthy, A.; Agarwal, A.; Dudik, M.; Langford, J.; Jose, D.; and Zitouni, I. 2017. Off-policy Evaluation for Slate Recommendation. NIPS, 3635-3645.
- Thomas, P., and Brunskill, E. 2016. Data-efficient Off-policy Policy Evaluation for Reinforcement Learning. ICML, 2139-2148.
- Thomas, P. S.; Theocharous, G.; and Ghavamzadeh, M. 2015. High-Confidence Off-Policy Evaluation. AAAI, 3000-3006.
- Wager, S., and Athey, S. 2018. Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. Journal of the American Statistical Association 113(523):1228-1242.
- Wang, Y.-X.; Agarwal, A.; and Dudik, M. 2017. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. ICML, 3589-3597.