# Efficient Counterfactual Learning from Bandit Feedback

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Yusuke Narita (Yale University, yusuke.narita@yale.edu), Shota Yasui (Cyber Agent Inc., yasui_shota@cyberagent.co.jp), Kohei Yata (Yale University, kohei.yata@yale.edu)

What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.

## 1 Introduction

Interactive bandit systems (e.g., personalized education and medicine, ad/news/recommendation/search platforms) produce log data valuable for evaluating and redesigning the systems. For example, the logs of a news recommendation system record which news article was presented and whether the user read it, giving the system designer a chance to make its recommendations more relevant. Exploiting log data is, however, more difficult than conventional supervised machine learning: the result of each log is only observed for the action chosen by the system (e.g., the presented news) but not for all the other actions the system could have taken. Moreover, the log entries are biased in that the logs overrepresent actions favored by the system.

A potential solution to this problem is an A/B test that compares the performance of counterfactual systems. However, A/B testing counterfactual systems is often technically or managerially infeasible, since deploying a new policy is time- and money-consuming and entails a risk of failure. This leads us to the problem of counterfactual (off-policy) evaluation and learning, where one aims to use batch data collected by a logging policy to estimate the value of a counterfactual policy or algorithm without employing it (Li et al. 2010; Strehl et al. 2010; Li et al. 2011; 2012; Bottou et al. 2013; Swaminathan and Joachims 2015a; 2015b; Wang, Agarwal, and Dudik 2017; Swaminathan et al. 2017). Such evaluation allows us to compare the performance of counterfactual policies and decide which policy should be deployed in the field. This alternative approach thus avoids the problems of the naive A/B test approach.

**Method.** For off-policy evaluation with log data of bandit feedback, this paper develops and empirically implements a variance minimization technique. Variance reduction and statistical efficiency are important for minimizing the uncertainty we face in decision making. Indeed, an important open question raised by Li (2015) is how to achieve statistically more efficient (even optimal) offline estimation from batch bandit data. This question motivates a set of studies that bound and characterize the variances of particular estimators (Dudík et al. 2014; Li, Munos, and Szepesvári 2015; Thomas, Theocharous, and Ghavamzadeh 2015; Munos et al. 2016; Thomas and Brunskill 2016; Agarwal et al. 2017).
We study this statistical efficiency question in the context of offline policy value estimation with log data from a class of contextual bandit algorithms. This class includes most of the widely used algorithms such as contextual $\epsilon$-Greedy and Thompson Sampling, as well as their non-contextual analogs and random A/B testing. We allow the logging policy to be unknown, degenerate (non-stochastic), and time-varying, all of which are salient in real-world bandit applications. We also allow the evaluation target policy to be degenerate, again a key feature of real-life situations.

We consider offline estimators for the expected reward from a counterfactual policy. Our estimators can also be used for estimating the average treatment effect. Our estimators are variations of well-known inverse probability weighting estimators (Horvitz and Thompson (1952), Rosenbaum and Rubin (1983), and the modern studies cited above), except that we use an estimated propensity score (logging policy) even if we know the true propensity score. We show the following result, building upon Bickel et al. (1993), Hirano, Imbens, and Ridder (2003), and Ackerberg et al. (2014) among others:

**Theoretical Result 1.** Our estimators minimize the variance among all reasonable estimators. More precisely, our estimators minimize the asymptotic variance among all asymptotically normal estimators (in the standard statistical sense defined in Section 3).

We also provide estimators for the asymptotic variances of our estimators, thus allowing analysts to calculate the variance in practice. In contrast to Result 1, we also find:

**Theoretical Result 2.** Standard estimators using the true propensity score (logging policy) have larger asymptotic variances than our estimators.

Perhaps counterintuitively, therefore, the policy-maker should use an estimated propensity score even when she knows the true one.

**Application.** We empirically apply our estimators to evaluate and optimize the design of online advertisement formats. Our application is based on proprietary data provided by Cyber Agent Inc., the second largest Japanese advertisement company with about 6 billion USD market capitalization (as of November 2018). This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to users. Their algorithm produces logged bandit data. We use these data and our estimators to optimize the advertisement design for maximizing the click-through rate (CTR). In particular, we estimate how much the CTR would be improved by a counterfactual policy of choosing the best action (advertisement) for each context (user characteristics). We first obtain the following result:

**Empirical Result A.** Consistent with Theoretical Results 1-2, our estimators produce narrower confidence intervals for the counterfactual policy's CTR than a benchmark using the known propensity score (Swaminathan and Joachims 2015b).

This result is reported in Figure 1, where the confidence intervals using the True Propensity Score (Benchmark) are wider than the confidence intervals using propensity scores estimated by Gradient Boosting, Random Forest, or Ridge Logistic Regression. Thanks to this variance reduction, we conclude that the logging policy's CTR is below the confidence interval of the hypothetical policy of choosing the best advertisement for each context. This leads us to the following bottom line:

**Empirical Result B.**
Unlike the benchmark estimator, our estimator predicts that the hypothetical policy would statistically significantly improve the CTR by 10-15% (compared to the logging policy).

Empirical Results A and B therefore show that our estimator can substantially reduce the uncertainty we face in real-world policy-making.

Figure 1: Improving Ad Design with Lower Uncertainty. (Figure omitted; it plots expected reward estimates with confidence bars for the logging policy and the counterfactual policy of choosing the best action by context, for each propensity score estimator: True Propensity Score (Benchmark), Gradient Boosting Machine, Ridge Logistic Regression, and Random Forest.) Notes: This figure shows estimates of the expected CTRs of the logging policy and a counterfactual policy of choosing the best action for each context. CTRs are multiplied by a constant for confidentiality reasons. We obtain these estimates by the self-normalized inverse probability weighting estimator using the true propensity score (benchmark following Swaminathan and Joachims (2015b)) or estimated propensity scores (our proposal), both of which are defined and analyzed in Section 3. Bars indicate 95% confidence intervals based on our asymptotic variance estimators developed in Section 4.

## 2 Setup

### 2.1 Data Generating Process

We consider a general multi-armed contextual bandit setting. There is a set of $m + 1$ actions (equivalently, arms or treatments), $A = \{0, \ldots, m\}$, that the decision maker can choose from. Let $Y(\cdot) : A \to \mathbb{R}$ denote a potential reward function that maps actions into rewards or outcomes, where $Y(a)$ is the reward when action $a$ is chosen (e.g., whether an advertisement as an action results in a click). Let $X$ denote context or covariates (e.g., the user's demographic profile and browsing history) that the decision maker observes when picking an action. We denote the set of contexts by $\mathcal{X}$. We think of $(Y(\cdot), X)$ as a random vector with unknown distribution $G$.

We consider log data coming from the following data generating process (DGP), which is similar to those used in the literature on the offline evaluation of contextual bandit algorithms (Li et al. 2010; Strehl et al. 2010; Li et al. 2011; 2012; Swaminathan and Joachims 2015a; 2015b; Swaminathan et al. 2017). We observe data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ with $T$ observations. $D_t \equiv (D_{t0}, \ldots, D_{tm})$, where $D_{ta}$ is a binary variable indicating whether action $a$ is chosen in round $t$. $Y_t$ denotes the reward observed in round $t$, i.e., $Y_t \equiv \sum_{a=0}^m D_{ta} Y_t(a)$. $X_t$ denotes the context observed in round $t$.

A key feature of our DGP is that the data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ are divided into $B$ batches, where different batches may use different choice probabilities (propensity scores). Let $X_t^b \in \{1, 2, \ldots, B\}$ denote a random variable indicating the batch to which round $t$ belongs. We treat this batch number as one of the context variables and write $X_t = (\tilde{X}_t, X_t^b)$, where $\tilde{X}_t$ is the vector of context variables other than $X_t^b$. Let $p_t = (p_{t0}, \ldots, p_{tm}) \in \Delta(A)$ denote the potentially unknown probability vector indicating the probability that each action is chosen in round $t$. Here $\Delta(A) \equiv \{(p_a)_{a \in A} : \sum_a p_a = 1\}$, with $p_a$ being the probability that action $a$ is chosen. A contextual bandit algorithm is a sequence $\{F_b\}_{b=1}^B$ of distribution functions of the choice probabilities $p_t$ conditional on $\tilde{X}_t$, where $F_b : \tilde{\mathcal{X}} \to \Delta(\Delta(A))$ for $b \in \{1, 2, \ldots, B\}$, $\tilde{\mathcal{X}}$ is the support of $\tilde{X}$, and $\Delta(\Delta(A))$ is the set of distributions over $\Delta(A)$. $F_b$ takes context $\tilde{X}_t$ as input and returns a distribution of the probability vector $p_t$ in rounds of batch $b$. $F_b$ can vary across batches but does not change across rounds within batch $b$.
We assume that the log data are generated by a contextual bandit algorithm $\{F_b\}_{b=1}^B$ as follows:

- In each round $t = 1, \ldots, T$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$.
- Re-order round numbers so that they are monotonically increasing in their batch numbers $X_t^b$.
- In each round $t$ within batch $b \in \{1, 2, \ldots, B\}$ and given $\tilde{X}_t$, the probability vector $p_t = (p_{t0}, \ldots, p_{tm})$ is drawn from $F_b(\cdot \mid \tilde{X}_t)$.
- An action is randomly chosen based on the probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.

Here, the contextual bandit algorithm $\{F_b\}_{b=1}^B$ and the realized probability vectors $\{p_t\}_{t=1}^T$ may or may not be known to the analyst. We also allow the realization of $p_t$ to be degenerate, i.e., a certain action may be chosen with probability 1 at a point in time.

**Examples.** This DGP allows for many popular bandit algorithms, as the following examples illustrate. In each of the examples below, the contextual bandit algorithm $F_b$ is degenerate and produces a particular probability vector $p_t$ for sure.

**Example 1 (Random A/B testing).** We always choose each action uniformly at random: $p_{ta} = \frac{1}{m+1}$ always holds for any $a \in A$ and any $t = 1, \ldots, T$.

In the remaining examples, at every batch $b$, the algorithm uses the history of observations from the previous $b - 1$ batches to estimate the mean and the variance of the potential reward under each action conditional on each context: $\mu(a|x) \equiv E[Y(a) \mid \tilde{X} = x]$ and $\sigma^2(a|x) \equiv V[Y(a) \mid \tilde{X} = x]$. We denote the estimates using the history up to batch $b - 1$ by $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$. See Li et al. (2012) and Dimakopoulou, Athey, and Imbens (2017) for possible estimators based on generalized linear models and generalized random forests, respectively. The initial estimates, $\hat{\mu}_0(a|x)$ and $\hat{\sigma}^2_0(a|x)$, are set to any values.

**Example 2 ($\epsilon$-Greedy).** In each round within batch $b$, we choose the best action based on $\hat{\mu}_{b-1}(a|\tilde{X}_t)$ with probability $1 - \epsilon_b$ and choose the other actions uniformly at random with probability $\epsilon_b$:

$$p_{ta} = \begin{cases} 1 - \epsilon_b & \text{if } a = \operatorname{argmax}_{a' \in A} \hat{\mu}_{b-1}(a'|\tilde{X}_t) \\ \frac{\epsilon_b}{m} & \text{otherwise.} \end{cases}$$

**Example 3 (Thompson Sampling using Gaussian priors).** In each round within batch $b$, we sample the potential reward $y_t(a)$ from the distribution $N(\hat{\mu}_{b-1}(a|\tilde{X}_t), \hat{\sigma}^2_{b-1}(a|\tilde{X}_t))$ for each action, and choose the action with the highest sampled potential reward, $\operatorname{argmax}_{a' \in A} y_t(a')$. As a result, this algorithm chooses actions with the following probabilities:

$$p_{ta} = \Pr\{a = \operatorname{argmax}_{a' \in A} y_t(a')\},$$

where $(y_t(0), \ldots, y_t(m))' \sim N(\hat{\mu}_{b-1}(\tilde{X}_t), \hat{\Sigma}_{b-1}(\tilde{X}_t))$, $\hat{\mu}_{b-1}(x) = (\hat{\mu}_{b-1}(0|x), \ldots, \hat{\mu}_{b-1}(m|x))'$, and $\hat{\Sigma}_{b-1}(x) = \operatorname{diag}(\hat{\sigma}^2_{b-1}(0|x), \ldots, \hat{\sigma}^2_{b-1}(m|x))$.

In Examples 2 and 3, $p_t$ depends on the random realization of the estimates $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$, and so does the associated $F_b$. If the data are sufficiently large, the uncertainty in the estimates vanishes: $\hat{\mu}_{b-1}(a|x)$ and $\hat{\sigma}^2_{b-1}(a|x)$ converge to $\mu_{b-1}(a|x) \equiv E[Y(a) \mid \tilde{X} = x, X^b \le b-1]$ and $\sigma^2_{b-1}(a|x) \equiv V[Y(a) \mid \tilde{X} = x, X^b \le b-1]$, respectively. In this case, $F_b$ becomes nonrandom since it depends on the fixed functions $\mu_{b-1}(a|x)$ and $\sigma^2_{b-1}(a|x)$. In the following analysis, we consider this large-sample scenario and assume that $F_b$ is nonrandom.

To make the notation simpler, we put $\{F_b\}_{b=1}^B$ together into a single distribution $F : \mathcal{X} \to \Delta(\Delta(A))$ obtained by $F(\cdot \mid \tilde{X}, X^b = b) = F_b(\cdot \mid \tilde{X})$ for each $b \in \{1, 2, \ldots, B\}$. We use this to rewrite our DGP as follows:

- In each round $t = 1, \ldots, T$, $(Y_t(\cdot), X_t)$ is i.i.d. drawn from distribution $G$.
- Given $X_t$, the probability vector $p_t = (p_{t0}, \ldots, p_{tm})$ is drawn from $F(\cdot \mid X_t)$.
- An action is randomly chosen based on the probability vector $p_t$, creating the action choice $D_t$ and the associated reward $Y_t$.
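To make the batched DGP and the $\epsilon$-Greedy logging policy of Example 2 concrete, the following is a minimal simulation sketch in Python. The reward function `true_ctr`, the binary context, and all parameter values are hypothetical placeholders of our own, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T, B, eps = 3, 10_000, 5, 0.2          # actions {0,...,m}, rounds, batches, epsilon_b

def true_ctr(x, a):
    # Hypothetical E[Y(a) | X = x], used only to simulate rewards
    return 0.05 + 0.02 * a * x

# i.i.d. binary contexts; the batch number X^b is treated as part of the context,
# and rounds are re-ordered so batch numbers are monotonically increasing
X = rng.binomial(1, 0.5, size=T)
batch = np.sort(rng.integers(1, B + 1, size=T))

mu_hat = np.zeros((2, m + 1))             # \hat\mu_{b-1}(a|x), initialized arbitrarily
Y = np.zeros(T)
D = np.zeros((T, m + 1), dtype=int)

for b in range(1, B + 1):
    for t in np.where(batch == b)[0]:
        # epsilon-Greedy choice probabilities p_t given \hat\mu_{b-1} (Example 2)
        best = int(np.argmax(mu_hat[X[t]]))
        p_t = np.full(m + 1, eps / m)
        p_t[best] = 1.0 - eps
        a = rng.choice(m + 1, p=p_t)      # draw the action choice D_t from p_t
        D[t, a] = 1
        Y[t] = rng.binomial(1, true_ctr(X[t], a))   # observed reward Y_t
    # update \hat\mu_b(a|x) from all batches up to b (used within batch b+1)
    for x in (0, 1):
        for a in range(m + 1):
            chosen = (batch <= b) & (X == x) & (D[:, a] == 1)
            if chosen.any():
                mu_hat[x, a] = Y[chosen].mean()
```

The resulting arrays `Y`, `X`, `batch`, and `D` have the structure of the logged data $\{(Y_t, X_t, D_t)\}_{t=1}^T$ described above.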
Define $p_{0a}(x) \equiv \Pr_{D \sim p,\ p \sim F}(D_a = 1 \mid X = x)$ for each $a$, and let $p_0(x) = (p_{00}(x), \ldots, p_{0m}(x))$. This is the choice probability vector conditional on each context. We call $p_0(\cdot)$ the logging policy or the propensity score. $F$ is common for all rounds regardless of the batch to which they belong. Thus $p_t$ and $D_t$ are i.i.d. across rounds. Because $(Y_t(\cdot), X_t)$ is i.i.d. and $Y_t = \sum_{a=0}^m D_{ta} Y_t(a)$, each observation $(Y_t, X_t, D_t)$ is i.i.d. Note also that $D_t$ is independent of $Y_t(\cdot)$ conditional on $X_t$.

### 2.2 Parameters of Interest

We are interested in using the log data to estimate the expected reward from any given counterfactual policy $\pi : \mathcal{X} \to \Delta(A)$, which chooses a distribution of actions given each context:

$$V^\pi \equiv E_{(Y(\cdot),X)\sim G}\Big[\sum_{a=0}^m Y(a)\,\pi(a|X)\Big] = E_{(Y(\cdot),X)\sim G,\ D\sim p_0(X)}\Big[\sum_{a=0}^m \frac{Y(a)\,D_a\,\pi(a|X)}{p_{0a}(X)}\Big], \tag{1}$$

where the last equality uses the independence of $D$ and $Y(\cdot)$ conditional on $X$ and the definition of $p_0(\cdot)$. Here, $\Delta(A) \equiv \{(p_a) \in \mathbb{R}^{m+1} \mid \sum_a p_a \le 1\}$. We allow the counterfactual policy $\pi$ to be degenerate, i.e., $\pi$ may choose a particular action with probability 1.

Depending on the choice of $\pi$, $V^\pi$ represents a variety of parameters of interest. When we set $\pi(a|x) = 1$ for a particular action $a$ and $\pi(a'|x) = 0$ for all $a' \in A \setminus \{a\}$ and all $x \in \mathcal{X}$, $V^\pi$ equals $E_{(Y(\cdot),X)\sim G}[Y(a)]$, the expected reward from action $a$. When we set $\pi(a|x) = 1$, $\pi(0|x) = -1$, and $\pi(a'|x) = 0$ for all $a' \in A \setminus \{0, a\}$ and all $x \in \mathcal{X}$, $V^\pi$ equals $E_{(Y(\cdot),X)\sim G}[Y(a) - Y(0)]$, the average treatment effect of action $a$ over action 0. Such treatment effects are of scientific and policy interest in medical and social sciences. Business and managerial interests also motivate treatment effect estimation. For example, when a company implements a bandit algorithm using a particular reward measure like an immediate purchase, the company is often interested in treatment effects on other outcomes like long-term user retention.

## 3 Efficient Value Estimation

We consider the efficient estimation of the expected reward from a counterfactual policy, $V^\pi$, using an estimator consisting of two steps. In the first step, we nonparametrically estimate the propensity score vector $p_0(\cdot)$ by a consistent estimator. Possible estimators include machine learning algorithms such as gradient boosting, as well as nonparametric sieve estimators and kernel regression estimators, as detailed in Section 3.2. In the second step, we plug the estimated propensity score $\hat p(\cdot)$ into the sample analogue of expression (1) to estimate $V^\pi$ (in practice, some trimming or thresholding of $\hat p(\cdot)$ may be desirable for numerical stability):

$$\hat V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{\hat p_a(X_t)}.$$

Alternatively, one can use a self-normalized estimator inspired by Swaminathan and Joachims (2015b) when $\sum_{a=0}^m \pi(a|x) = 1$ for all $x \in \mathcal{X}$:

$$\hat V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{\hat p_a(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{\hat p_a(X_t)}}.$$

Swaminathan and Joachims (2015b) suggest that $\hat V^\pi_{SN}$ tends to be less biased than $\hat V^\pi$ in small samples. Unlike Swaminathan and Joachims (2015b), however, we use the estimated propensity score rather than the true one.

The above estimators estimate a scalar parameter $V^\pi$ defined as a function of the distribution of $(Y(\cdot), X)$, on which we impose no parametric assumption. Our estimators therefore attempt to solve a semiparametric estimation problem, i.e., a partly parametric and partly nonparametric estimation problem.
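The two plug-in estimators above are straightforward to compute once the propensity score has been estimated. Below is a minimal NumPy sketch; the function names and the array layout (rows are rounds, columns are actions, with `pi` and `p_hat` already evaluated at each round's context) are our own assumptions.

```python
import numpy as np

def value_ipw(Y, D, pi, p_hat):
    """Plug-in estimator \\hat{V}^pi: average over rounds of
    sum_a Y_t * D_ta * pi(a|X_t) / \\hat{p}_a(X_t).
    Shapes: Y (T,), D (T, m+1) one-hot, pi (T, m+1), p_hat (T, m+1)."""
    w = D * pi / p_hat
    return float(np.mean(np.sum(Y[:, None] * w, axis=1)))

def value_ipw_sn(Y, D, pi, p_hat):
    """Self-normalized version \\hat{V}^pi_SN, for policies with sum_a pi(a|x) = 1."""
    w = D * pi / p_hat
    num = np.mean(np.sum(Y[:, None] * w, axis=1))
    den = np.mean(np.sum(w, axis=1))
    return float(num / den)
```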
For this semiparametric estimation problem, we first derive the semiparametric efficiency bound, a bound on how efficient and precise the estimation of the parameter can be, which is a semiparametric analog of the Cramer-Rao bound (Bickel et al. 1993). The asymptotic variance of any asymptotically normal estimator is no smaller than the semiparametric efficiency bound. Following standard statistical terminology, we say that an estimator $\hat\theta$ for a parameter $\theta$ is asymptotically normal if $\sqrt{T}(\hat\theta - \theta) \xrightarrow{d} N(0, \Sigma)$ as $T \to \infty$, where $\xrightarrow{d}$ denotes convergence in distribution and $N(0, \Sigma)$ denotes a normally distributed random variable with mean 0 and variance $\Sigma$. We call $\Sigma$ the asymptotic variance of $\hat\theta$. The semiparametric efficiency bound for $\theta$ is a lower bound on the asymptotic variance of asymptotically normal estimators; the Supplementary Material provides a formal definition of the semiparametric efficiency bound.

We show that the above estimators achieve the semiparametric efficiency bound, i.e., they minimize the asymptotic variance among all asymptotically normal estimators. Our analysis uses a couple of regularity conditions. We first assume that the logging policy $p_0(\cdot)$ ex ante chooses every action with a positive probability for every context.

**Assumption 1.** There exists some $\underline{p}$ such that $0 < \underline{p} \le \Pr_{D \sim p,\ p \sim F}(D_a = 1 \mid X = x) = p_{0a}(x)$ for any $x \in \mathcal{X}$ and for $a = 0, \ldots, m$.

Note that Assumption 1 is consistent with the possibility that the realization of $p_{ta}$ takes on the value 0 or 1 (as long as it takes on positive values with a positive probability). We also assume the existence of finite second moments of the potential rewards.

**Assumption 2.** $E[Y(a)^2] < \infty$ for $a = 0, \ldots, m$.

The following lemma provides the semiparametric efficiency bound for $V^\pi$. All proofs are in the Supplementary Material.

**Lemma 1 (Semiparametric Efficiency Bound).** Under Assumptions 1 and 2, the semiparametric efficiency bound for $V^\pi$, the expected reward from counterfactual policy $\pi$, is

$$E\Big[\sum_{a=0}^m \frac{V[Y(a)\mid X]\,\pi(a|X)^2}{p_{0a}(X)} + \big(\theta(X) - V^\pi\big)^2\Big],$$

where $\theta(X) = \sum_{a=0}^m E[Y(a)\mid X]\,\pi(a|X)$ is the expected reward from policy $\pi$ conditional on $X$.

Lemma 1 implies the semiparametric efficiency bounds for the expected reward from each action and for the average treatment effect, since they are special cases of $V^\pi$.

**Corollary 1.** Suppose that Assumptions 1 and 2 hold. Then the semiparametric efficiency bound for the expected reward from each action, $E[Y(a)]$, is

$$E\Big[\frac{V[Y(a)\mid X]}{p_{0a}(X)} + \big(E[Y(a)\mid X] - E[Y(a)]\big)^2\Big].$$

The semiparametric efficiency bound for the average treatment effect, $E[Y(a) - Y(0)]$, is

$$E\Big[\frac{V[Y(0)\mid X]}{p_{00}(X)} + \frac{V[Y(a)\mid X]}{p_{0a}(X)} + \big(E[Y(a) - Y(0)\mid X] - E[Y(a) - Y(0)]\big)^2\Big].$$

Our proposed estimators are two-step generalized-method-of-moments estimators and are asymptotically normal under some regularity conditions, one of which requires that the convergence rate of $\hat p(\cdot)$ be faster than $n^{1/4}$ (Newey 1994; Chen 2007). Given the asymptotic normality of the estimators, we find that they achieve the semiparametric efficiency bound, building upon Ackerberg et al. (2014) among others.

**Theorem 1 (Efficient Estimators).** Suppose that Assumptions 1 and 2 hold and that $\hat p(\cdot)$ is a consistent estimator for $p_0(\cdot)$. Then the asymptotic variances of $\hat V^\pi$ and $\hat V^\pi_{SN}$ achieve the semiparametric efficiency bound for $V^\pi$ (provided in Lemma 1).

### 3.1 Inefficient Value Estimation

In some environments, we know the true $p_0(\cdot)$ or observe the realization of the probability vectors $\{p_t\}_{t=1}^T$. In this case, an alternative way to estimate $V^\pi$ is to use the sample analogue of expression (1) without estimating the propensity score.
If we know $p_0(\cdot)$, a possible estimator is

$$\tilde V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{p_{0a}(X_t)}.$$

If we observe the realization of $\{p_t\}_{t=1}^T$, we may use

$$\check V^\pi \equiv \frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t\, D_{ta}\, \pi(a|X_t)}{p_{ta}}.$$

When $\sum_{a=0}^m \pi(a|x) = 1$ for all $x \in \mathcal{X}$, it is again possible to use their self-normalized versions:

$$\tilde V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{p_{0a}(X_t)}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{p_{0a}(X_t)}}, \qquad \check V^\pi_{SN} \equiv \frac{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{Y_t D_{ta}\pi(a|X_t)}{p_{ta}}}{\frac{1}{T}\sum_{t=1}^T \sum_{a=0}^m \frac{D_{ta}\pi(a|X_t)}{p_{ta}}}.$$

These intuitive estimators turn out to be less efficient than the estimators with the estimated propensity score, as the following result shows.

**Theorem 2 (Inefficient Estimators).** Suppose that the propensity score $p_0(\cdot)$ is known and we observe the realization of $\{p_t\}_{t=1}^T$. Suppose also that Assumptions 1 and 2 hold and that $\hat p(\cdot)$ is a consistent estimator for $p_0(\cdot)$. Then, the asymptotic variances of $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are no smaller than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$. Generically, $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are strictly less efficient than $\hat V^\pi$ and $\hat V^\pi_{SN}$ in the following sense.

1. If $\Pr\big(E[Y(a)\mid X]\,\frac{\pi(a|X)}{p_{0a}(X)} \ne \theta(X) \text{ for some } a\big) > 0$, then the asymptotic variances of $\tilde V^\pi$, $\check V^\pi$, $\tilde V^\pi_{SN}$, and $\check V^\pi_{SN}$ are strictly larger than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$.
2. If $\Pr\big(E[Y(a)^2\mid X]\,\pi(a|X)^2\big(E[\tfrac{1}{p_a}\mid X] - \tfrac{1}{p_{0a}(X)}\big) \ne 0 \text{ for some } a\big) > 0$, then the asymptotic variances of $\check V^\pi$ and $\check V^\pi_{SN}$ are strictly larger than those of $\hat V^\pi$ and $\hat V^\pi_{SN}$.

The condition in Part 1 of Theorem 2 is about the dominating term in the difference between $\hat V^\pi$ and $\tilde V^\pi$. The proofs of Theorems 1 and 2 show that the asymptotic variance of $\hat V^\pi$ is the asymptotic variance of

$$\tilde V^\pi - \frac{1}{T}\sum_{t=1}^T \Big[\sum_{a=0}^m E[Y(a)\mid X_t]\,\frac{\pi(a|X_t)}{p_{0a}(X_t)}\,D_{ta} - \theta(X_t)\Big].$$

Part 1 of Theorem 2 requires that the second term not be always zero, so that the asymptotic variance of $\hat V^\pi$ is different from that of $\tilde V^\pi$. As long as the two variances are not the same, $\hat V^\pi$ achieves variance reduction.

Part 2 of Theorem 2 requires that $E[\tfrac{1}{p_{ta}} \mid X_t] - \tfrac{1}{E[p_{ta}\mid X_t]}\ \big(= E[\tfrac{1}{p_{ta}}\mid X_t] - \tfrac{1}{p_{0a}(X_t)}\big) \ne 0$ with a positive probability. This means that $p_t$ is not always the same as the true propensity score $p_0(X_t)$, i.e., $F(\cdot\mid X_t)$ is not degenerate (recall that $p_t$ is drawn from $F(\cdot\mid X_t)$, whose expected value is $p_0(X_t)$). Under this condition, $\check V^\pi$ has a strictly larger asymptotic variance than $\tilde V^\pi$ and $\hat V^\pi$.

Theorems 1 and 2 suggest that we should use an estimated propensity score regardless of whether the true propensity score is known. To develop some intuition for this result, consider a simple situation where the context $X_t$ always takes some constant value $x$. Suppose that we are interested in estimating the expected reward from action $a$, $E[Y(a)]$. Since $X_t$ is constant across rounds, a natural nonparametric estimator for $p_{0a}(x)$ is the proportion of rounds in which action $a$ was chosen: $\hat p_a(x) = \frac{\sum_{t=1}^T D_{ta}}{T}$. The estimator using the estimated propensity score is

$$\frac{1}{T}\sum_{t=1}^T \frac{Y_t D_{ta}}{\hat p_a(x)} = \frac{1}{\sum_{t=1}^T D_{ta}}\sum_{t=1}^T Y_t D_{ta}.$$

The estimator using the true propensity score is

$$\frac{1}{T}\sum_{t=1}^T \frac{Y_t D_{ta}}{p_{0a}(x)} = \frac{1}{T\,p_{0a}(x)}\sum_{t=1}^T Y_t D_{ta}.$$

When action $a$ happens to be chosen frequently in a sample, so that $\sum_{t=1}^T D_{ta}$ is larger, the absolute value of $\sum_{t=1}^T Y_t D_{ta}$ also tends to be larger in the sample. Because of this positive correlation between $\sum_{t=1}^T D_{ta}$ and the absolute value of $\sum_{t=1}^T Y_t D_{ta}$, $\hat V^\pi$ has a smaller variance than $\tilde V^\pi$, whose denominator has no such correlation with the numerator. Similar intuition applies to the comparison between $\check V^\pi$ and $\hat V^\pi$.
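This intuition can be checked numerically. The following sketch simulates the constant-context case above and compares, across Monte Carlo replications, the estimator that divides by the estimated score $\hat p_a(x)$ with the one that divides by the true score $p_{0a}(x)$; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p0a, mu_a, n_sims = 500, 0.3, 0.5, 5_000   # hypothetical sample size, score, mean reward

est_with_phat, est_with_p0 = [], []
for _ in range(n_sims):
    Da = rng.binomial(1, p0a, size=T)          # D_ta: whether action a is chosen in round t
    Y = rng.normal(mu_a, 1.0, size=T) * Da     # observed rewards (zero when a is not chosen)
    if Da.sum() == 0:
        continue
    est_with_phat.append(Y.sum() / Da.sum())   # divides by T * \hat{p}_a(x) = sum_t D_ta
    est_with_p0.append(Y.sum() / (T * p0a))    # divides by T * p_0a(x)

print("MC variance, estimated score:", np.var(est_with_phat))
print("MC variance, true score:     ", np.var(est_with_p0))
```

With these settings the estimated-score version typically shows a visibly smaller Monte Carlo variance, in line with Theorem 2.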
### 3.2 How to Estimate Propensity Scores?

There are several options for the first-step estimation of the propensity score.

1. A sieve Least Squares (LS) estimator:
$$\hat p_a(\cdot) = \operatorname*{argmin}_{p_a(\cdot) \in H^a_T} \sum_{t=1}^T \big(D_{ta} - p_a(X_t)\big)^2,$$
where $H^a_T = \{p_a(x) = \sum_{j=1}^{k^a_T} q_{aj}(x)\lambda_{aj} = q^{k^a_T}(x)'\lambda_a\}$ and $k^a_T \to \infty$ as $T \to \infty$. Here $\{q_{aj}\}_{j=1}^\infty$ is a set of known basis functions defined on $\mathcal{X}$, and $q^{k^a_T}(\cdot) = (q_{a1}(\cdot), \ldots, q_{a k^a_T}(\cdot))'$.

2. A sieve Logit Maximum Likelihood estimator:
$$\hat p(\cdot) = \operatorname*{argmax}_{p(\cdot) \in H_T} \sum_{t=1}^T \sum_{a=0}^m D_{ta} \log p_a(X_t),$$
where $H_T = \Big\{p : \mathcal{X} \to (0,1)^{m+1} : p_a(x) = \frac{\exp(R^{k_T}(x)'\lambda_a)}{1 + \sum_{a'=1}^m \exp(R^{k_T}(x)'\lambda_{a'})} \text{ for } a = 1, \ldots, m,\ p_0(x) = 1 - \sum_{a=1}^m p_a(x)\Big\}$. Here $R^{k_T}(\cdot) = (R_1(\cdot), \ldots, R_{k_T}(\cdot))'$ and $\{R_j\}_{j=1}^\infty$ is a set of basis functions.

3. Prediction of $D_{ta}$ by $X_t$ using a modern machine learning algorithm such as random forest, ridge logistic regression, or gradient boosting.

The above estimators are known to be consistent with a convergence rate faster than $n^{1/4}$ under regularity conditions (Newey 1997; Cattaneo 2010; Knight and Fu 2000; Blanchard, Lugosi, and Vayatis 2003; Bühlmann and Van De Geer 2011; Wager and Athey 2018).

How should one choose a propensity score estimator? We prefer an estimated score to the true one because it corrects the discrepancy between the realized action assignment in the data and the assignment predicted by the true score. To achieve this goal, a good propensity score estimator should fit the data better than the true score does, which means that the estimator should overfit to some extent. As a concrete example, in our empirical analysis, random forest produces a larger (worse) variance than gradient boosting and ridge logistic regression (see Figure 1 and Table 1). This is because random forest fits the data worse, which is due to its bagging aspect preventing it from overfitting. In general, however, we do not know which propensity score estimator achieves the best degree of overfitting. We would therefore suggest that the analyst try different estimators to determine which one is most efficient.

## 4 Estimating Asymptotic Variance

We often need to estimate the asymptotic variance of the above estimators. For example, variance estimation is crucial for determining whether a counterfactual policy is statistically significantly better than the logging policy. We propose an estimator that uses the sample analogue of an expression of the asymptotic variance. As shown in the proof of Theorem 1, the asymptotic variance of $\hat V^\pi$ and $\hat V^\pi_{SN}$ is

$$E\big[\big(g(Y, X, D, V^\pi, p_0) + \alpha(X, D, p_0, \mu_0)\big)^2\big],$$

where $\mu_0 : \mathcal{X} \to \mathbb{R}^{m+1}$ is such that $\mu_0(a|x) = E[Y(a) \mid X = x]$ for each $a$ and $x$,

$$g(Y, X, D, \theta, p) = \sum_{a=0}^m \frac{Y\, D_a\, \pi(a|X)}{p_a(X)} - \theta, \qquad \alpha(X, D, p, \mu) = -\sum_{a=0}^m \frac{\mu(a|X)\,\pi(a|X)}{p_a(X)}\big(D_a - p_a(X)\big).$$

We estimate this asymptotic variance in two steps. In the first step, we obtain estimates of $V^\pi$ and $p_0$ using the method in Section 3. In addition, we estimate $\mu_0(a|x)$ by a nonparametric regression of $Y_t$ on $X_t$ using the subsample with $D_{ta} = 1$ for each $a$. Denote the estimate by $\hat\mu$. For this regression, one may use a sieve Least Squares estimator or machine learning algorithms; in the empirical application below, we use ridge logistic regression. In the second step, we plug the estimates of $V^\pi$, $p_0$, and $\mu_0$ into the sample analogue of $E[(g(Y,X,D,V^\pi,p_0) + \alpha(X,D,p_0,\mu_0))^2]$ to estimate the asymptotic variance. When we use $\hat V^\pi$:

$$\widehat{AVar}(\hat V^\pi) \equiv \frac{1}{T}\sum_{t=1}^T \big(g(Y_t, X_t, D_t, \hat V^\pi, \hat p) + \alpha(X_t, D_t, \hat p, \hat\mu)\big)^2.$$

When we use $\hat V^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\hat V^\pi$ with $\hat V^\pi_{SN}$ in the above expression.
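As a sketch of how the first-step propensity estimation (option 3 above) and the plug-in variance formula might be wired together in practice, consider the following. The helper names are our own, a ridge-penalized multinomial logit stands in for any of the listed estimators, and the sign convention for $\alpha$ follows the reconstruction above.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def estimate_propensity(X_feat, A):
    """First step: predict the chosen action A_t from X_t. A ridge-penalized
    multinomial logit is used here; gradient boosting or random forest (option 3)
    would be drop-in replacements."""
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_feat, A)
    p_hat = model.predict_proba(X_feat)            # (T, m+1); columns follow model.classes_
    return np.clip(p_hat, 1e-3, 1.0)               # trimming for numerical stability

def avar_and_ci(Y, D, pi, p_hat, mu_hat, v_hat, level=0.95):
    """Plug-in estimate of E[(g + alpha)^2] and the implied confidence interval.
    mu_hat: (T, m+1) estimates of E[Y(a) | X_t], e.g., from ridge logistic regression."""
    g = np.sum(Y[:, None] * D * pi / p_hat, axis=1) - v_hat
    alpha = -np.sum(mu_hat * pi / p_hat * (D - p_hat), axis=1)
    avar = np.mean((g + alpha) ** 2)
    half = stats.norm.ppf(0.5 + level / 2) * np.sqrt(avar / len(Y))
    return avar, (v_hat - half, v_hat + half)
```

Here `v_hat` would be the output of the value estimator from Section 3, so the returned interval is the kind of 95% confidence interval reported in Figure 1 and Table 1.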
This asymptotic variance estimator is a two-step generalized-method-of-moments estimator, and it is consistent under the condition that the first-step estimator of $(V^\pi, p_0, \mu_0)$ is consistent and some regularity conditions (Newey 1994).

It is easier to estimate the asymptotic variance of $\tilde V^\pi$ and $\tilde V^\pi_{SN}$, which use the true propensity score. Their asymptotic variance is $E[g(Y, X, D, V^\pi, p_0)^2]$ by the standard central limit theorem. When we use $\tilde V^\pi$, we estimate this asymptotic variance by

$$\widehat{AVar}(\tilde V^\pi) = \frac{1}{T}\sum_{t=1}^T g(Y_t, X_t, D_t, \tilde V^\pi, p_0)^2.$$

When we use $\tilde V^\pi_{SN}$, its asymptotic variance estimator is obtained by replacing $\tilde V^\pi$ with $\tilde V^\pi_{SN}$ in the above expression.

## 5 Real-World Application

We apply our estimators described in Sections 3 and 4 to empirically evaluate and optimize the design of online advertisements. This application uses proprietary data provided by Cyber Agent Inc., which we described in the introduction. This company uses a contextual bandit algorithm to determine the visual design of advertisements assigned to user impressions (there are four design choices). This algorithm produces logged bandit data. We use this logged bandit data and our estimators to improve their advertisement design for maximizing the click-through rate (CTR). In the notation of our theoretical framework, the reward $Y$ is a click, the action $a$ is one of the four possible advertisement designs, and the context $X$ consists of the user and ad characteristics used by the company's logging policy.

| Propensity Score Estimator | Existing Logging Policy: CI for Expected Reward | Shrinkage in CI | Policy of Choosing Best Action by Context: CI for Expected Reward | Shrinkage in CI |
|---|---|---|---|---|
| True Score (Benchmark) | 1.036 ± 0.083 | 0 | 1.140 ± 0.171 | 0 |
| Gradient Boosting Machine | 1.047 ± 0.064 | 22.5% | 1.197 ± 0.131 | 23.4% |
| Ridge Logistic Regression | 0.987 ± 0.058 | 30.2% | 1.108 ± 0.113 | 34.1% |
| Random Forest | 1.036 ± 0.077 | 6.62% | 1.194 ± 0.159 | 7.41% |
| Sample Size | 57,619 | | | |

Table 1: Improving Ad Design with Lower Uncertainty. Notes: The first and third numeric columns of this table show 95% confidence intervals of the expected CTRs $V^\pi$ of the logging policy and a hypothetical policy of choosing the best action (ad) for each context. CTRs are multiplied by a constant for confidentiality reasons. We obtain the CTR estimates by the self-normalized inverse probability weighting estimator $\tilde V^\pi_{SN}$ using the true propensity score (Swaminathan and Joachims 2015b) or the estimated propensity score ($\hat V^\pi_{SN}$ in Section 3). We estimate standard errors and confidence intervals based on the method described in Section 4. The second and fourth numeric columns show the size of the reduction in confidence interval length, i.e., the value $\alpha$ such that the length of the confidence interval equals $100(1-\alpha)\%$ of the length of the confidence interval using the true propensity score.

The logging policy (the company's existing contextual bandit algorithm) works as follows. For each round, the logging policy first randomly samples each action's predicted reward from a beta distribution. This beta distribution is parametrized by the predicted CTR for each context, where the CTR prediction is based on a Factorization Machine (Rendle 2010). The logging policy then chooses the action (advertisement) with the largest sampled reward prediction. The logging policy and the underlying CTR prediction stay the same for all rounds in each day. Each day therefore plays the role of a batch in the model of Section 2. This somewhat nonstandard logging policy and the resulting log data are an example of our DGP in Section 2.
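For concreteness, the logging policy just described can be sketched as follows. The paper only states that the beta distribution is parametrized by the Factorization-Machine-predicted CTR, so the mean-and-concentration parametrization and all numbers below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def logging_policy_choice(predicted_ctr, concentration=100.0):
    """Sketch of the described logging policy: sample each action's reward from a
    beta distribution tied to its predicted CTR, then pick the argmax.
    The mean/concentration parametrization below is an assumption, not the
    company's actual implementation."""
    p = np.asarray(predicted_ctr, dtype=float)
    samples = rng.beta(p * concentration, (1.0 - p) * concentration)
    return int(np.argmax(samples))

# four ad designs with hypothetical predicted CTRs
print(logging_policy_choice([0.031, 0.028, 0.035, 0.030]))
```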
This logging policy may have room for improvement for several reasons. First, the logging policy randomly samples advertisements and does not necessarily choose the advertisement with the best predicted CTR. Second, the logging policy uses a predictive Factorization Machine for its CTR prediction, which may differ from the causal CTR (the causal effect of each advertisement on the probability of a click).

To improve on the logging policy, we first estimate the propensity score by random forest, ridge logistic regression, or gradient boosting (implemented with XGBoost). These estimators are known to satisfy the regularity conditions (e.g., consistency) required for our theoretical results, as explained in Section 3.2. With the estimated propensity score, we then use our estimator $\hat V^\pi_{SN}$ to estimate the expected reward from two possible policies: (1) the logging policy and (2) a counterfactual policy that chooses the action (advertisement) predicted to maximize the CTR conditional on each context. To implement this counterfactual policy, we estimate $E[Y(a)\mid X]$ by ridge logistic regression for each action $a$ and context $X$ used by the logging policy (we apply one-hot encoding to categorical variables in $X$). Given each context $X$, the counterfactual policy then chooses the action with the highest estimated value of $E[Y(a)\mid X]$.

Importantly, we use separate data sets for the two estimation tasks (one for the best actions and the other for the expected reward from the hypothetical policy). Specifically, we use data logged during April 20-26, 2018 for estimating the best actions and data logged during April 27-29 for estimating the expected reward. This data separation allows us to avoid overfitting and the resulting overestimation of the CTR gains from the counterfactual policy.

As a benchmark, we also estimate the same expected rewards with Swaminathan and Joachims (2015b)'s self-normalized estimator $\tilde V^\pi_{SN}$, which uses the true propensity score. The resulting estimates show the following result:

**Empirical Result A.** Consistent with Theorems 1-2, our estimator $\hat V^\pi_{SN}$ with the estimated score is statistically more efficient than the benchmark $\tilde V^\pi_{SN}$ with the true score.

This result is reported in Figure 1 and Table 1, where the confidence intervals for the predicted CTR using the True Propensity Score (Benchmark) are less precise (wider) than those using estimated propensity scores, regardless of which of the three score estimators is used. The magnitude of this shrinkage in the confidence intervals and standard errors is 6-34%, depending on how the propensity score is estimated. This variance reduction allows us to conclude that the logging policy's CTR is below the lower bound of the confidence interval of the hypothetical policy, giving us confidence in the following implication:

**Empirical Result B.** Compared to the logging policy, the hypothetical policy (choosing the best advertisement given each context) statistically significantly improves the CTR by 10-15% at the 5% significance level.
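As an illustration of the counterfactual-policy construction described above (a ridge logistic regression of the click on the one-hot-encoded context, fit separately for each action on the earlier data split), a possible sketch is the following; the function names and data layout are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def fit_reward_models(X_train, A_train, Y_train, n_actions):
    """Estimate E[Y(a)|X] by ridge logistic regression, one model per action,
    using only the rounds of the earlier data split in which that action was logged.
    Assumes both click outcomes appear for every action in the training split."""
    enc = OneHotEncoder(handle_unknown="ignore").fit(X_train)
    Xd = enc.transform(X_train)
    models = []
    for a in range(n_actions):
        rows = np.asarray(A_train) == a
        models.append(LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
                      .fit(Xd[rows], Y_train[rows]))
    return enc, models

def best_action_policy(enc, models, X_eval, n_actions):
    """Counterfactual policy pi: probability 1 on the action with the highest
    estimated E[Y(a)|X] for each context in the later evaluation split."""
    Xd = enc.transform(X_eval)
    scores = np.column_stack([m.predict_proba(Xd)[:, 1] for m in models])
    pi = np.zeros((Xd.shape[0], n_actions))
    pi[np.arange(Xd.shape[0]), scores.argmax(axis=1)] = 1.0
    return pi
```

The resulting `pi` matrix, together with the estimated propensity scores, is what the value estimator $\hat V^\pi_{SN}$ and the variance estimator of Section 4 take as inputs on the evaluation split.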
## 6 Conclusion

We have investigated the most statistically efficient use of batch bandit data for estimating the expected reward from a counterfactual policy. Our estimators minimize the asymptotic variance among all asymptotically normal estimators (Theorem 1). By contrast, standard estimators have larger asymptotic variances (Theorem 2). We have also applied our estimators to improve online advertisement design. Compared to the frontier benchmark $\tilde V^\pi_{SN}$, our reward estimator $\hat V^\pi_{SN}$ provides the company with more statistical confidence in how to improve on its existing bandit algorithm (Empirical Results A and B). The hypothetical policy of choosing the best advertisement given user characteristics would improve the click-through rate by 10-15% at the 5% significance level. These empirical results highlight the practical value of Theorems 1-2.

**Acknowledgments.** We are grateful to seminar participants at the ICML/IJCAI/AAMAS Workshop on Machine Learning for Causal Inference, Counterfactual Prediction, and Autonomous Action (Causal ML) and the RIKEN Center for Advanced Intelligence Project, especially Junya Honda, Masaaki Imaizumi, Atsushi Iwasaki, Kohei Kawaguchi, and Junpei Komiyama.

## References

- Ackerberg, D.; Chen, X.; Hahn, J.; and Liao, Z. 2014. Asymptotic Efficiency of Semiparametric Two-step GMM. Review of Economic Studies 81(3):919-943.
- Agarwal, A.; Basu, S.; Schnabel, T.; and Joachims, T. 2017. Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers. KDD, 687-696.
- Bickel, P. J.; Klaassen, C. A. J.; Ritov, Y.; and Wellner, J. A. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
- Blanchard, G.; Lugosi, G.; and Vayatis, N. 2003. On the Rate of Convergence of Regularized Boosting Classifiers. Journal of Machine Learning Research 4(Oct):861-894.
- Bottou, L.; Peters, J.; Quiñonero-Candela, J.; Charles, D. X.; Chickering, D. M.; Portugaly, E.; Ray, D.; Simard, P.; and Snelson, E. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14(1):3207-3260.
- Bühlmann, P., and Van De Geer, S. 2011. Statistics for High-dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
- Cattaneo, M. D. 2010. Efficient Semiparametric Estimation of Multi-valued Treatment Effects under Ignorability. Journal of Econometrics 155(2):138-154.
- Chen, X. 2007. Large Sample Sieve Estimation of Semi-nonparametric Models. Handbook of Econometrics 6:5549-5632.
- Dimakopoulou, M.; Athey, S.; and Imbens, G. 2017. Estimation Considerations in Contextual Bandits. arXiv.
- Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly Robust Policy Evaluation and Optimization. Statistical Science 29:485-511.
- Hirano, K.; Imbens, G. W.; and Ridder, G. 2003. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score. Econometrica 71(4):1161-1189.
- Horvitz, D. G., and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260):663-685.
- Knight, K., and Fu, W. 2000. Asymptotics for Lasso-type Estimators. Annals of Statistics 1356-1378.
- Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-bandit Approach to Personalized News Article Recommendation. WWW, 661-670.
- Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. WSDM, 297-306.
- Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. Journal of Machine Learning Research: Workshop and Conference Proceedings 26:19-36.
- Li, L.; Munos, R.; and Szepesvári, C. 2015. Toward Minimax Off-policy Value Estimation. AISTATS, 608-616.
- Li, L. 2015. Offline Evaluation and Optimization for Interactive Systems. WSDM.
- Munos, R.; Stepleton, T.; Harutyunyan, A.; and Bellemare, M. 2016. Safe and Efficient Off-policy Reinforcement Learning. NIPS, 1054-1062.
- Newey, W. K. 1994. The Asymptotic Variance of Semiparametric Estimators. Econometrica 62(6):1349-1382.
- Newey, W. K. 1997. Convergence Rates and Asymptotic Normality for Series Estimators. Journal of Econometrics 79(1):147-168.
- Rendle, S. 2010. Factorization Machines. ICDM, 995-1000.
- Rosenbaum, P. R., and Rubin, D. B. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70(1):41-55.
- Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from Logged Implicit Exploration Data. NIPS, 2217-2225.
- Swaminathan, A., and Joachims, T. 2015a. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. Journal of Machine Learning Research 16:1731-1755.
- Swaminathan, A., and Joachims, T. 2015b. The Self-normalized Estimator for Counterfactual Learning. NIPS, 3231-3239.
- Swaminathan, A.; Krishnamurthy, A.; Agarwal, A.; Dudik, M.; Langford, J.; Jose, D.; and Zitouni, I. 2017. Off-policy Evaluation for Slate Recommendation. NIPS, 3635-3645.
- Thomas, P., and Brunskill, E. 2016. Data-efficient Off-policy Policy Evaluation for Reinforcement Learning. ICML, 2139-2148.
- Thomas, P. S.; Theocharous, G.; and Ghavamzadeh, M. 2015. High-Confidence Off-Policy Evaluation. AAAI, 3000-3006.
- Wager, S., and Athey, S. 2018. Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. Journal of the American Statistical Association 113(523):1228-1242.
- Wang, Y.-X.; Agarwal, A.; and Dudik, M. 2017. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. ICML, 3589-3597.