# Balanced Linear Contextual Bandits

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Maria Dimakopoulou,¹ Zhengyuan Zhou,² Susan Athey,³ Guido Imbens³
¹Department of Management Science & Engineering, Stanford University
²Department of Electrical Engineering, Stanford University
³Graduate School of Business, Stanford University
madima@stanford.edu, zyzhou@stanford.edu, athey@stanford.edu, imbens@stanford.edu

## Abstract

Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match the state-of-the-art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.

## Introduction

Contextual bandits seek to learn a personalized treatment assignment policy in the presence of treatment effects that vary with observed contextual features. In such settings, there is a need to balance the exploration of actions for which there is limited knowledge in order to improve performance in the future against the exploitation of existing knowledge in order to attain better performance in the present (see (Bubeck and Cesa-Bianchi 2012) for a survey). Since large amounts of data can be required to learn how the benefits of alternative treatments vary with individual characteristics, contextual bandits can play an important role in making experimentation and learning more efficient. Several successful contextual bandit designs have been proposed (Auer 2003), (Li et al. 2010), (Agrawal and Goyal 2013), (Agarwal et al. 2014), (Bastani and Bayati 2015). The existing literature has provided regret bounds (e.g., the general bounds of (Russo and Van Roy 2014), the bounds of (Rigollet and Zeevi 2010), (Perchet and Rigollet 2013), (Slivkins 2014) in the case of non-parametric functions of arm rewards), has demonstrated successful applications (e.g., news article recommendations (Li et al. 2010) or mobile health (Lei, Tewari, and Murphy 2017)), and has proposed system designs to apply these algorithms in practice (Agarwal et al. 2016).

In the contextual setting, one does not expect to see many future observations with the same context as the current observation, and so the value of learning from pulling an arm for this context accrues when that observation is used to estimate the outcome from this arm for a different context in the future. Therefore, the performance of contextual bandit algorithms can be sensitive to the estimation method of the outcome model or the exploration method used. In the initial phases of learning when samples are small, biases are likely to arise in estimating the outcome model using data from previous non-uniform assignments of contexts to arms.
The bias issue is aggravated in the case of a mismatch between the generative model and the functional form used for estimation of the outcome model, or similarly, when the heterogeneity in treatment effects is too complex to estimate well with small datasets. In that case, methods that proceed under the assumption that the functional form for the outcome model is correct may be overly optimistic about the extent of the learning so far, and emphasize exploitation over exploration. Another case where biases can arise occurs when training observations from certain regions of the context space are scarce (e.g., prejudice in training data if a non-representative set of users arrives in initial batches of data). These problems are common in real-world settings, such as in survey experiments in the domain of social sciences or in applications to health, recommender systems, or education. For example, early adopters of an online course may have different characteristics than later adopters.

Reweighting or balancing methods address model misspecification by making the estimation doubly robust, i.e., robust against misspecification of the reward function (important here), and robust against the specification of the propensity score (not as important here because in the bandit setting we know the propensity score). The term doubly-robust comes from the extensive literature on offline policy evaluation (Scharfstein, Rotnitzky, and Robins 1999); it means in our case that when comparing two policies using historical data, we get consistent estimates of the average difference in outcomes for segments of the context whether or not we have a well-specified model of rewards, as long as we have a good model of the arm assignment policy (i.e., accurate propensity scores). Because in a contextual bandit the learner controls the arm assignment policy conditional on the observed context, we therefore have access to accurate propensity scores even in small samples. So, even when the reward model is severely misspecified, the learner can use the propensity scores to obtain unbiased estimates of the reward function for each range of values of the context.

We suggest the integration of balancing methods from the causal inference literature (Imbens and Rubin 2015) in online contextual bandits. We focus on the domain of linear online contextual bandits with provable guarantees, such as LinUCB (Li et al. 2010) and LinTS (Agrawal and Goyal 2013), and we propose two new algorithms, balanced linear UCB (BLUCB) and balanced linear Thompson sampling (BLTS). BLTS and BLUCB build on LinTS and LinUCB respectively and extend them in a way that makes them less prone to problems of bias. The balancing will lead to lower estimated precision in the reward functions, and thus will emphasize exploration longer than the conventional linear TS and UCB algorithms, leading to more robust estimates.

The balancing technique is well-known in machine learning, especially in domain adaptation and studies in learning-theoretic frameworks (Huang et al. 2007), (Zadrozny 2004), (Cortes, Mansour, and Mohri 2010). There are a number of recent works that approach contextual bandits through the framework of causality (Bareinboim, Forney, and Pearl 2015), (Bareinboim and Pearl 2015), (Forney, Pearl, and Bareinboim 2017), (Lattimore, Lattimore, and Reid 2016).
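To fix ideas before surveying this literature, the following is a minimal, hypothetical sketch (ours, not from the paper) of the inverse-propensity-weighted off-policy value estimate that underlies the doubly-robust methods discussed above. The data-generating process, function names and parameters are purely illustrative.

```python
# Minimal sketch (illustrative, not the paper's code): inverse-propensity
# weighting gives an unbiased estimate of a policy's value from logged bandit
# data as long as the logging propensities are accurate.
import numpy as np

rng = np.random.default_rng(0)
n, K = 5000, 3

# Logged data: contexts, the arm a uniform logging policy chose (so its
# propensity is known), and the observed reward for that arm only.
x = rng.normal(size=(n, 2))
propensity = np.full((n, K), 1.0 / K)
arms = rng.integers(0, K, size=n)
true_means = np.stack([x[:, 0], x[:, 1], -x[:, 0]], axis=1)
rewards = true_means[np.arange(n), arms] + rng.normal(scale=0.1, size=n)

def ipw_value(target_policy, arms, rewards, propensity):
    """Inverse-propensity-weighted estimate of a target policy's value."""
    chosen = (target_policy == arms)                      # logged arm matches target
    w = chosen / propensity[np.arange(len(arms)), arms]   # importance weights
    return np.mean(w * rewards)

# Evaluate the policy that always picks arm 0 without ever deploying it.
print(ipw_value(np.zeros(n, dtype=int), arms, rewards, propensity))
```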
There is also a significant body of research that leverages balancing for offline evaluation and learning of contextual bandit or reinforcement learning policies from logged data (Strehl et al. 2010), (Dudík, Langford, and Li 2011), (Li et al. 2012), (Dudík et al. 2014), (Li et al. 2014), (Swaminathan and Joachims 2015), (Jiang and Li 2016), (Thomas and Brunskill 2016), (Athey and Wager 2017), (Kallus 2017), (Wang, Agarwal, and Dudík 2017), (Deshpande et al. 2017), (Kallus and Zhou 2018), (Zhou, Athey, and Wager 2018). In the offline setting, the complexity of the historical assignment policy is taken as given, and thus the difficulty of the offline evaluation and learning of optimal policies is taken as given. Therefore, these results lie at the opposite end of the spectrum from our work, which focuses on the online setting.

Methods for reducing the bias due to adaptive data collection have also been studied for non-contextual multi-armed bandits (Villar, Bowden, and Wason 2015), (Nie et al. 2018), but the nature of the estimation in contextual bandits is qualitatively different. Importance weighted regression in contextual bandits was first mentioned in (Agarwal et al. 2014), but without a systematic motivation, analysis and evaluation. To our knowledge, our paper is the first work to integrate balancing in the online contextual bandit setting, to perform a large-scale evaluation of it against direct estimation method baselines with theoretical guarantees, and to provide a theoretical characterization of balanced contextual bandits that match the regret bound of their direct method counterparts. The effect of importance weighted regression is also evaluated in (Bietti, Agarwal, and Langford 2018), but this is a successor to the extended version of our paper (Dimakopoulou et al. 2017).

We prove that the regret bound of BLTS is $O\!\left(d\sqrt{KT^{1+\epsilon}}/\epsilon\right)$ and that the regret bound of BLUCB is $O\!\left(\sqrt{TdK}\right)$, where $d$ is the number of features in the context, $K$ is the number of arms and $T$ is the horizon. Our regret bounds for BLTS and BLUCB match the existing state-of-the-art regret bounds for LinTS (Agrawal and Goyal 2013) and LinUCB (Chu et al. 2011) respectively.

We provide extensive and convincing empirical evidence for the effectiveness of BLTS and BLUCB (in comparison to LinTS and LinUCB) by considering the problem of multiclass classification with bandit feedback. Specifically, we transform a K-class classification task into a K-armed contextual bandit (Dudík, Langford, and Li 2011) and we use 300 public benchmark datasets for our evaluation. It is important to point out that, even though BLTS and LinTS share the same theoretical guarantee, BLTS outperforms LinTS empirically. Similarly, BLUCB has a strong empirical advantage over LinUCB. In bandits, this phenomenon is not uncommon. For instance, it is well-known that even though the existing UCB bounds are often tighter than those of Thompson sampling, Thompson sampling performs better in practice than UCB (Chapelle and Li 2011). We find that this is also the case for balanced linear contextual bandits, as in our evaluation BLTS has a strong empirical advantage over BLUCB. Overall, in this large-scale evaluation, BLTS outperforms LinUCB, BLUCB and LinTS. In our empirical evaluation, we also consider a synthetic example that simulates in a simple way two issues of bias that often arise in practice, training data with non-representative contexts and model misspecification, and find again that BLTS is the most effective in escaping these biases.
## Problem Formulation & Algorithms

### Contextual Bandit Setting

In the stochastic contextual bandit setting, there is a finite set of arms, $a \in \mathcal{A}$, with cardinality $K$. At every time $t$, the environment produces $(x_t, r_t) \sim \mathcal{D}$, where $x_t$ is a $d$-dimensional context vector and $r_t = (r_t(1), \ldots, r_t(K))$ is the reward associated with each arm in $\mathcal{A}$. The contextual bandit chooses arm $a_t \in \mathcal{A}$ for context $x_t$ and observes the reward only for the chosen arm, $r_t(a_t)$. The optimal assignment for context $x_t$ is $a_t^* = \arg\max_a \{\mathbb{E}[r_t(a) \mid x_t = x]\}$. The expected cumulative regret over horizon $T$ is defined as $R(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} \left(r_t(a_t^*) - r_t(a_t)\right)\right]$. At each time $t = 1, \ldots, T$, the contextual bandit assigns arm $a_t$ to context $x_t$ based on the history of observations up to that time, $\{(x_\tau, a_\tau, r_\tau(a_\tau))\}_{\tau=1}^{t-1}$. The goal is to find the assignment rule that minimizes $R(T)$.

### Linear Contextual Bandits

Linear contextual bandits rely on modeling and estimating the reward distribution corresponding to each arm $a \in \mathcal{A}$ given context $x_t = x$. Specifically, the expected reward is assumed to be a linear function of the context $x_t$ with some unknown coefficient vector $\theta_a$, $\mathbb{E}[r_t(a) \mid x_t = x] = x^\top \theta_a$, and the variance is typically assumed to be constant, $\mathbb{V}[r_t(a) \mid x_t = x] = \sigma_a^2$. In the setting we are studying, there are $K$ models to be estimated, as many as the arms in $\mathcal{A}$. At every time $t$, this estimation of $\theta_a$ is done separately for each arm $a$ on the history of observations corresponding to this arm, $(X_a, r_a) = \{(x_t, r_t(a_t)),\ t : a_t = a\}$.

Algorithm 1: Balanced Linear Thompson Sampling
1: Input: regularization parameter $\lambda > 0$, propensity score threshold $\gamma \in (0, 1)$, constant $\alpha$ (default is 1)
2: Set $\hat\theta_a \leftarrow$ null, $B_a \leftarrow$ null, $\forall a \in \mathcal{A}$
3: Set $X_a \leftarrow$ empty matrix, $r_a \leftarrow$ empty vector, $\forall a \in \mathcal{A}$
4: for $t = 1, 2, \ldots, T$ do
5:   if $\exists a \in \mathcal{A}$ s.t. $\hat\theta_a =$ null or $B_a =$ null then
6:     Select $a \sim \mathrm{Uniform}(\mathcal{A})$
7:   else
8:     Draw $\tilde\theta_a$ from $\mathcal{N}\!\left(\hat\theta_a, \alpha^2 V(\hat\theta_a)\right)$ for all $a \in \mathcal{A}$
9:     Select $a = \arg\max_{a \in \mathcal{A}} x_t^\top \tilde\theta_a$
10:  end if
11:  Observe reward $r_t(a)$
12:  Set $W_a \leftarrow$ empty matrix
13:  for $\tau \in \{1, \ldots, t\}$ s.t. $a_\tau = a$ do
14:    Compute $p_a(x_\tau)$ and set $w = 1/\max(\gamma, p_a(x_\tau))$
15:    $W_a \leftarrow \mathrm{diag}(W_a, w)$
16:  end for
17:  $X_a \leftarrow [X_a : x_t^\top]$
18:  $B_a \leftarrow X_a^\top W_a X_a + \lambda I$
19:  $r_a \leftarrow [r_a : r_t(a)]$
20:  $\hat\theta_a \leftarrow B_a^{-1} X_a^\top W_a r_a$
21:  $V(\hat\theta_a) \leftarrow B_a^{-1} (r_a - X_a \hat\theta_a)^\top W_a (r_a - X_a \hat\theta_a)$
22: end for

Thompson Sampling (Thompson 1933), (Scott 2010), (Agrawal and Goyal 2012), (Russo et al. 2018) and Upper Confidence Bounds (UCB) (Lai and Robbins 1985), (Auer, Cesa-Bianchi, and Fischer 2002) are two different methods which are highly effective in dealing with the exploration vs. exploitation trade-off in multi-armed bandits. LinTS (Agrawal and Goyal 2013) and LinUCB (Li et al. 2010) are linear contextual bandit algorithms associated with Thompson sampling and UCB respectively. At time $t$, LinTS and LinUCB apply ridge regression with regularization parameter $\lambda$ to the history of observations $(X_a, r_a)$ for each arm $a \in \mathcal{A}$, in order to obtain an estimate $\hat\theta_a$ and its variance $V(\hat\theta_a)$. For the new context $x_t$, $\hat\theta_a$ and its variance are used by LinTS and LinUCB to obtain the conditional mean $\hat\mu_a(x_t) = x_t^\top \hat\theta_a$ of the reward associated with each arm $a \in \mathcal{A}$, and its variance $V(\hat\mu_a(x_t)) = x_t^\top V(\hat\theta_a) x_t$. LinTS assumes that the expected reward $\mu_a(x_t)$ associated with arm $a$ conditional on the context $x_t$ is Gaussian, $\mathcal{N}\!\left(\hat\mu_a(x_t), \alpha^2 V(\hat\mu_a(x_t))\right)$, where $\alpha$ is an appropriately chosen constant. LinTS draws a sample $\tilde\mu_a(x_t)$ from the distribution of each arm $a$, and context $x_t$ is then assigned to the arm with the highest sample, $a_t = \arg\max_a \{\tilde\mu_a(x_t)\}$.
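For concreteness, here is a minimal sketch (ours, not the authors' code) of one LinTS step as just described: a per-arm ridge estimate, a posterior draw, and an argmax. The function name `lin_ts_choose` and the history format are illustrative assumptions.

```python
# Illustrative sketch of one LinTS step: per-arm ridge regression, a draw from
# the Gaussian around the estimate, and selection of the arm with the highest
# sampled mean reward. Not the authors' implementation.
import numpy as np

def lin_ts_choose(x_t, histories, d, lam=1.0, alpha=1.0, rng=None):
    """histories: dict arm -> (X_a, r_a), with X_a of shape (n_a, d), r_a of shape (n_a,)."""
    rng = rng or np.random.default_rng()
    sampled = {}
    for a, (X_a, r_a) in histories.items():
        B = X_a.T @ X_a + lam * np.eye(d)              # ridge design matrix
        theta_hat = np.linalg.solve(B, X_a.T @ r_a)    # ridge estimate of theta_a
        V = np.linalg.inv(B)                           # (scaled) variance of theta_hat
        theta_tilde = rng.multivariate_normal(theta_hat, alpha**2 * V)
        sampled[a] = x_t @ theta_tilde                 # sampled mean reward for this arm
    return max(sampled, key=sampled.get)               # arm with the highest sample
```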
LinUCB uses the estimate $\hat\theta_a$ and its variance to compute upper confidence bounds for the expected reward $\mu_a(x_t)$ of context $x_t$ associated with each arm $a \in \mathcal{A}$, and assigns the context to the arm with the highest upper confidence bound, $a_t = \arg\max_a \left\{\hat\mu_a(x_t) + \alpha\sqrt{V(\hat\mu_a(x_t))}\right\}$, where $\alpha$ is an appropriately chosen constant.

Algorithm 2: Balanced Linear UCB
1: Input: regularization parameter $\lambda > 0$, propensity score threshold $\gamma \in (0, 1)$, constant $\alpha$
2: Set $\hat\theta_a \leftarrow$ null, $B_a \leftarrow$ null, $\forall a \in \mathcal{A}$
3: Set $X_a \leftarrow$ empty matrix, $r_a \leftarrow$ empty vector, $\forall a \in \mathcal{A}$
4: for $t = 1, 2, \ldots, T$ do
5:   if $\exists a \in \mathcal{A}$ s.t. $\hat\theta_a =$ null or $B_a =$ null then
6:     Select $a \sim \mathrm{Uniform}(\mathcal{A})$
7:   else
8:     Select $a = \arg\max_{a \in \mathcal{A}} \left\{ x_t^\top \hat\theta_a + \alpha \sqrt{x_t^\top V(\hat\theta_a) x_t} \right\}$
9:   end if
10:  Observe reward $r_t(a)$
11:  Set $W_a \leftarrow$ empty matrix
12:  for $\tau \in \{1, \ldots, t\}$ s.t. $a_\tau = a$ do
13:    Estimate $\hat p_a(x_\tau)$ and set $w = 1/\max(\gamma, \hat p_a(x_\tau))$
14:    $W_a \leftarrow \mathrm{diag}(W_a, w)$
15:  end for
16:  $X_a \leftarrow [X_a : x_t^\top]$
17:  $B_a \leftarrow X_a^\top W_a X_a + \lambda I$
18:  $r_a \leftarrow [r_a : r_t(a)]$
19:  $\hat\theta_a \leftarrow B_a^{-1} X_a^\top W_a r_a$
20:  $V(\hat\theta_a) \leftarrow B_a^{-1} (r_a - X_a \hat\theta_a)^\top W_a (r_a - X_a \hat\theta_a)$
21: end for

### Linear Contextual Bandits with Balancing

In this section, we show how to integrate balancing methods from the causal inference literature in linear contextual bandits, in order to make estimation less prone to bias issues. Balanced linear Thompson sampling (BLTS) and balanced linear UCB (BLUCB) are online contextual bandit algorithms that perform balanced estimation of the model of all arms in order to obtain a Gaussian distribution and an upper confidence bound respectively for the reward associated with each arm conditional on the context.

We focus on the method of inverse propensity weighting (Imbens and Rubin 2015). The idea is that at every time $t$, the linear contextual bandit weighs each observation $(x_\tau, a_\tau, r_\tau(a_\tau))$, $\tau = 1, \ldots, t$ in the history up to time $t$ by the inverse probability of context $x_\tau$ being assigned to arm $a_\tau$. This probability is called the propensity score and is denoted as $p_{a_\tau}(x_\tau)$. Then, for each arm $a \in \mathcal{A}$, the linear contextual bandit weighs each observation $(x_\tau, a, r_\tau(a))$ in the history of arm $a$ by $w_a = 1/p_a(x_\tau)$ and uses weighted regression to obtain the estimate $\hat\theta_a^{\mathrm{BLTS}}$ with variance $V(\hat\theta_a^{\mathrm{BLTS}})$.

In BLTS (Algorithm 1), the propensity scores are known because Thompson sampling performs probability matching, i.e., it assigns a context to an arm with the probability that this arm is optimal. Since computing the propensity scores involves high-order integration, they can be approximated via Monte-Carlo simulation. Each iteration draws a sample from the posterior reward distribution of each arm $a$ conditional on $x$, where the posterior is the one that the algorithm considered at the end of a randomly selected prior time period. The propensity score $p_a(x_\tau)$ is the fraction of the Monte-Carlo iterations in which arm $a$ has the highest sampled reward, where the arrival time of context $x_\tau$ is treated as random. For every arm $a$, the history $(X_a, r_a, p_a)$ is used to obtain a balanced estimate $\hat\theta_a^{\mathrm{BLTS}}$ of $\theta_a$ and its variance $V(\hat\theta_a^{\mathrm{BLTS}})$, which produce a normally distributed estimate $\tilde\mu_a \sim \mathcal{N}\!\left(x_t^\top \hat\theta_a^{\mathrm{BLTS}}, \alpha^2 x_t^\top V(\hat\theta_a^{\mathrm{BLTS}}) x_t\right)$ of the reward of arm $a$ for context $x_t$, where $\alpha$ is a parameter of the algorithm.
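To make the weighted-regression step concrete, here is a minimal sketch (assumed names, not the authors' code) of the balanced estimate that lines 13–21 of Algorithms 1 and 2 compute for a single arm, given propensities from either the Monte-Carlo scheme above (BLTS) or an estimated assignment model (BLUCB).

```python
# Sketch of the balanced (clipped inverse-propensity-weighted) ridge update
# performed per arm in Algorithms 1 and 2; names here are illustrative.
import numpy as np

def balanced_ridge(X_a, r_a, p_a, lam=1.0, gamma=0.1):
    """X_a: (n_a, d) contexts assigned to this arm, r_a: (n_a,) observed rewards,
    p_a: (n_a,) propensity of this arm for each of those contexts."""
    w = 1.0 / np.maximum(p_a, gamma)                   # w = 1 / max(gamma, p_a(x))
    B = X_a.T @ (w[:, None] * X_a) + lam * np.eye(X_a.shape[1])   # B_a = X^T W X + lam I
    theta_hat = np.linalg.solve(B, X_a.T @ (w * r_a))  # theta_hat = B^{-1} X^T W r
    resid = r_a - X_a @ theta_hat
    V = np.linalg.inv(B) * float(resid @ (w * resid))  # B^{-1} times weighted residual sum of squares
    return theta_hat, V
```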
In BLUCB (Algorithm 2), the observations are weighed by the inverse of estimated propensity scores. Note that UCB-based contextual bandits have deterministic assignment rules, so conditional on the context the propensity score is either zero or one. But with the standard assumption that the arrival of contexts is random, at every time period $t$ the estimated probability $\hat p_a(x_\tau)$ is obtained from the prediction of a multinomial logistic regression model trained on $\{(x_\tau, a_\tau)\}_{\tau=1}^{t-1}$. Subsequently, $(X_a, r_a, \hat p_a)$ is used to obtain a balanced estimate $\hat\theta_a^{\mathrm{BLUCB}}$ of $\theta_a$ and its variance $V(\hat\theta_a^{\mathrm{BLUCB}})$. These are used to construct the upper confidence bound, $x_t^\top \hat\theta_a^{\mathrm{BLUCB}} + \alpha\sqrt{x_t^\top V(\hat\theta_a^{\mathrm{BLUCB}}) x_t}$, for the reward of arm $a$ for context $x_t$, where $\alpha$ is a constant. (For some results, e.g., (Auer 2002), $\alpha$ needs to be slowly increasing in $t$.) Note that $\hat\theta_a^{\mathrm{BLTS}}$, $V(\hat\theta_a^{\mathrm{BLTS}})$ and $\hat\theta_a^{\mathrm{BLUCB}}$, $V(\hat\theta_a^{\mathrm{BLUCB}})$ can be computed in closed form or via the bootstrap.

Weighting the observations by the inverse propensity scores reduces bias, but even when the propensity scores are known it increases variance, particularly when they are small. Clipping the propensity scores (Crump et al. 2009) with some threshold $\gamma$, e.g., 0.1, helps control the variance increase. This threshold $\gamma$ is an additional parameter of BLTS (Algorithm 1) and BLUCB (Algorithm 2) compared to LinTS and LinUCB.

Finally, note that one could integrate in the contextual bandit estimation other covariate balancing methods, such as the method of approximate residual balancing (Athey, Imbens, and Wager 2018) or the method of (Kallus 2017). For instance, with approximate residual balancing one would use as weights
$$w_a = \arg\min_w \left\{ (1 - \zeta)\,\|w\|_2^2 + \zeta\,\big\| \bar{x} - X_a^\top w \big\|_\infty^2 \right\} \quad \text{subject to} \quad \sum_{t:\,a_t = a} w_t = 1 \ \text{ and } \ 0 \le w_t \le n_a^{-2/3},$$
where $\zeta \in (0, 1)$ is a tuning parameter, $n_a = \sum_{t=1}^{T} \mathbf{1}\{a_t = a\}$ and $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$, and then use $w_a$ to modify the parametric and non-parametric model estimation as outlined before.
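As an illustration of how such weights could be computed, the following is a hedged sketch of the approximate-residual-balancing objective above using cvxpy as the solver library; this is our illustrative choice, not part of the paper, and the function name is hypothetical.

```python
# Hedged sketch: approximate-residual-balancing weights for one arm, solved
# with cvxpy (illustrative dependency choice, not the authors' implementation).
import cvxpy as cp
import numpy as np

def arb_weights(X_a, x_bar, zeta=0.5):
    """X_a: (n_a, d) contexts assigned to this arm; x_bar: (d,) mean context."""
    n_a = X_a.shape[0]
    w = cp.Variable(n_a)
    objective = cp.Minimize((1 - zeta) * cp.sum_squares(w)
                            + zeta * cp.square(cp.norm(x_bar - X_a.T @ w, "inf")))
    constraints = [cp.sum(w) == 1, w >= 0, w <= n_a ** (-2.0 / 3.0)]
    cp.Problem(objective, constraints).solve()
    return w.value
```

The returned weights would then take the place of the inverse-propensity weights $W_a$ in the weighted regression step, as outlined above.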
### Theoretical Guarantees for BLTS and BLUCB

In this section, we establish theoretical guarantees for BLTS and BLUCB that are comparable to those of LinTS and LinUCB. We start with a few technical assumptions that are standard in the contextual bandits literature.

Assumption 1 (Linear Realizability). There exist parameters $\{\theta_a\}_{a \in \mathcal{A}}$ such that given any context $x$, $\mathbb{E}[r_t(a) \mid x] = x^\top \theta_a$, $\forall a \in \mathcal{A}$, $\forall t \ge 0$.

We use the standard (frequentist) regret criterion and standard assumptions on the regularity of the distributions.

Definition 1. The instantaneous regret at iteration $t$ is $x_t^\top \theta_{a_t^*} - x_t^\top \theta_{a_t}$, where $a_t^*$ is the optimal arm at iteration $t$ and $a_t$ is the arm taken at iteration $t$. The cumulative regret with horizon $T$ is defined as $R(T) = \sum_{t=1}^{T} \left( x_t^\top \theta_{a_t^*} - x_t^\top \theta_{a_t} \right)$.

Definition 2. We denote the canonical filtration of the underlying contextual bandits problem by $\{\mathcal{F}_t\}_{t=1}^{\infty}$, where $\mathcal{F}_t = \sigma\!\left(\{x_s\}_{s=1}^{t}, \{a_s\}_{s=1}^{t}, \{r_s(a_s)\}_{s=1}^{t}, x_{t+1}\right)$: the sigma-algebra generated by all the random variables up to and including iteration $t$, plus $x_{t+1}$ (all the random variables $x_t, a_t, r_t$ are defined on some common underlying probability space, which we do not write out explicitly here). In other words, $\mathcal{F}_t$ contains all the information that is available before making the decision for iteration $t + 1$.

Assumption 2. For each $a \in \mathcal{A}$ and every $t \ge 1$:
1. Sub-Gaussian Noise: $r_t(a) - x_t^\top \theta_a$ is conditionally sub-Gaussian: there exists some $L_a > 0$ such that $\mathbb{E}\!\left[e^{s(r_t(a) - x_t^\top \theta_a)} \mid \mathcal{F}_{t-1}\right] \le \exp\!\left(\frac{s^2 L_a^2}{2}\right)$, $\forall s, \forall x_t$.
2. Bounded Contexts and Parameters: The contexts $x_t$ and parameters $\theta_a$ are assumed to be bounded. Consequently, without loss of generality, we can rescale them such that $\|x_t\|_2 \le 1$, $\|\theta_a\|_2 \le 1$, $\forall a, t$.

Remark 1. Note that we make no assumption on the underlying $\{x_t\}_{t=1}^{\infty}$ process: the contexts $\{x_t\}_{t=1}^{\infty}$ need not be fixed beforehand or come from some stationary process. Further, they can even be adapted to $\sigma\!\left(\{x_s\}_{s=1}^{t}, \{a_s\}_{s=1}^{t}, \{r_s(a_s)\}_{s=1}^{t}\right)$, in which case they are called adversarial contexts in the literature, as the contexts can be chosen by an adversary who chooses a context after observing the arms played and the corresponding rewards. If $\{x_t\}_{t=1}^{\infty}$ is an IID process, then the problem is known as stochastic contextual bandits. From this viewpoint, adversarial contextual bandits are more general, but the regret bounds tend to be worse. Both are studied in the literature.

Theorem 1. Under Assumption 1 and Assumption 2:
1. If BLTS is run with an appropriately chosen $\alpha$ (a function of $\delta$ and $\epsilon$) in Algorithm 1, then with probability at least $1 - \delta$, $R(T) = O\!\left(d\sqrt{KT^{1+\epsilon}}/\epsilon\right)$.
2. If BLUCB is run with an appropriately chosen $\alpha$ (a function of $\delta$) in Algorithm 2, then with probability at least $1 - \delta$, $R(T) = O\!\left(\sqrt{TdK}\right)$.

We refer the reader to Appendix A of the supplemental material of the extended version of this paper (Dimakopoulou et al. 2017) for the regret bound proofs.

Remark 2. The above bound essentially matches the existing state-of-the-art regret bounds for linear Thompson sampling with direct model estimation (e.g., (Agrawal and Goyal 2013)). Note that in (Agrawal and Goyal 2013), an infinite number of arms is also allowed, but all arms share the same parameter $\theta$; the final regret bound there is $O\!\left(\frac{d^2}{\epsilon}\sqrt{T^{1+\epsilon}}\right)$. Note that even though no explicit dependence on $K$ is present in that regret bound (and hence our regret bound appears to be a factor of $\sqrt{K}$ worse), this is to be expected, as we have $K$ parameters to estimate, one for each arm. Note that here we do not assume any structure on the $K$ arms; they are just $K$ stand-alone parameters, each of which needs to be independently estimated. Similarly, for BLUCB, our regret bound is $O\!\left(\sqrt{TdK}\right)$, which is a factor of $\sqrt{K}$ worse than that of (Chu et al. 2011), which establishes an $O\!\left(\sqrt{Td}\right)$ regret bound. Again, this is because a single true $\theta$ is assumed in (Chu et al. 2011), rather than $K$ arm-dependent parameters. Of course, we also point out that our regret bounds are not tight, nor do they achieve state-of-the-art regret bounds in contextual bandit algorithms in general. The lower bound $\Omega(\sqrt{dT})$ is established in (Chu et al. 2011) for linear contextual bandits (again in the context of a single parameter $\theta$ for all $K$ arms). In general, UCB-based algorithms ((Auer 2003; Chu et al. 2011; Bubeck and Cesa-Bianchi 2012; Abbasi-Yadkori, Pál, and Szepesvári 2011)) tend to have better (and sometimes near-optimal) theoretical regret bounds. In particular, the state-of-the-art bound of $O(\sqrt{dT \log K})$ for linear contextual bandits is given in (Bubeck and Cesa-Bianchi 2012) (optimal up to an $O(\log K)$ factor). However, as mentioned in the introduction, Thompson sampling based algorithms tend to perform much better in practice (even though their regret bounds tend not to match those of UCB based algorithms, as is also the case here). Hence, our objective here is not to provide state-of-the-art regret guarantees. Rather, we are motivated to design algorithms that have better empirical performance (compared to both the existing UCB style algorithms and Thompson sampling style algorithms), which also enjoy the baseline theoretical guarantee.

Finally, we give some quick intuition for the proof. For BLTS, we first show that the estimated means concentrate around the true means (i.e., $x_t^\top \hat\theta_a$ concentrates around $x_t^\top \theta_a$). Then, we establish that the sampled means concentrate around the estimated means (i.e., $x_t^\top \tilde\theta_a$ concentrates around $x_t^\top \hat\theta_a$).
These two steps together indicate that the sampled mean is close to the true mean. A further consequence is that we can then bound the instantaneous regret (the regret at each time step $t$) in terms of the sum of two standard deviations: one corresponds to the optimal arm at time $t$, the other corresponds to the arm actually selected at time $t$. The rest of the proof then follows by giving tight characterizations of these two standard deviations. For BLUCB, the proof again utilizes the first concentration mentioned above: the estimated means concentrate around the true means (note that there are no sampled means in BLUCB). The rest of the proof adopts a similar structure as in (Chu et al. 2011).

## Computational Results

In this section, we present computational results that compare the performance of our balanced linear contextual bandits, BLTS and BLUCB, with the direct method linear contextual bandit algorithms that have theoretical guarantees, LinTS and LinUCB. Our evaluation focuses on contextual bandits with the linear realizability assumption and strong theoretical guarantees. First, we present a simple synthetic example that simulates bias in the training data by under-representation or over-representation of certain regions of the context space and investigates the performance of the considered linear contextual bandits both when the outcome model of the arms matches the true reward generative process and when it does not. Second, we conduct an experiment leveraging 300 public, supervised cost-sensitive classification datasets to obtain contextual bandit problems, treating the features as the context, the labels as the actions and revealing only the reward for the chosen label. We show that BLTS performs better than LinTS and that BLUCB performs better than LinUCB. The randomized assignment nature of Thompson sampling facilitates the estimation of the arms' outcome models compared to UCB, and as a result LinTS outperforms LinUCB and BLTS outperforms BLUCB. Overall, BLTS has the best performance. In the supplemental material, we include experiments against the policy-based contextual bandit from (Agarwal et al. 2014), which is statistically optimal but is also outperformed by BLTS.

### A Synthetic Example

This simulated example aims to reflect in a simple way two issues that often arise in practice. The first issue is the presence of bias in the training data by under-representation or over-representation of certain regions. A personalized policy that is trained on such data and is applied to the entire context space will result in biased decisions for certain contexts. The second issue is the problem of mismatch between the true reward generative process and the functional form used for estimation of the outcome model of the arms, which is common in applications with complex generative models. Model misspecification aggravates the presence of bias in the learned policies. We use this simple example to present in an intuitive manner why balancing and a randomized assignment rule help with these issues, before moving on to a large-scale evaluation of the algorithms on real datasets in the next section.

Consider a simulation design where there is a warm-start batch of training observations, but it consists of contexts focused on one region of the context space. There are three arms, $\mathcal{A} = \{0, 1, 2\}$, and the contexts $x_t = (x_{t,0}, x_{t,1})$ are two-dimensional with $x_{t,j} \sim \mathcal{N}(0, 1)$, $j \in \{0, 1\}$.
The rewards corresponding to each arm $a \in \mathcal{A}$ are generated as follows: $r_t(0) = 0.5(x_{t,0} + 1)^2 + 0.5(x_{t,1} + 1)^2 + \epsilon_t$, $r_t(1) = 1 + \epsilon_t$, and $r_t(2) = 2 - 0.5(x_{t,0} + 1)^2 - 0.5(x_{t,1} + 1)^2 + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 0.01$. The expected values of the three arms' rewards are shown in Figure 1.

Figure 1: Expectation of each arm's reward, $\mathbb{E}[r_t(0)] = 0.5(x_{t,0} + 1)^2 + 0.5(x_{t,1} + 1)^2$ (red), $\mathbb{E}[r_t(1)] = 1$ (yellow), $\mathbb{E}[r_t(2)] = 2 - 0.5(x_{t,0} + 1)^2 - 0.5(x_{t,1} + 1)^2$ (blue).

Figure 2: Evolution of the arm assignment in the context space for well-specified (a) LinTS, (b) LinUCB, (c) BLTS, (d) BLUCB.

In the warm-start data, $x_{t,0}$ and $x_{t,1}$ are generated from a truncated normal distribution $\mathcal{N}(0, 1)$ on the interval $(-1.15, -0.85)$, while in subsequent data $x_{t,0}$ and $x_{t,1}$ are drawn from $\mathcal{N}(0, 1)$ without the truncation. Each one of the 50 warm-start contexts is assigned to one of the three arms at random with equal probability. Note that the warm-start contexts belong to a region of the context space where the reward surfaces do not change much with the context. Therefore, when training the reward model for the first time, the estimated reward of arm $a = 2$ (blue) is the highest, the one of arm $a = 1$ (yellow) is the second highest and the one of arm $a = 0$ (red) is the lowest across the context space.

We run our experiment with a learning horizon $T = 10000$. The regularization parameter $\lambda$, which is present in all algorithms, is chosen via cross-validation every time the model is updated. The constant $\alpha$, which is present in all algorithms, is optimized among the values 0.25, 0.5, 1 in the Thompson sampling bandits (the value $\alpha = 1$ corresponds to standard Thompson sampling; (Chapelle and Li 2011) suggest that smaller values may lower regret) and among the values 1, 2, 4 in the UCB bandits (Chapelle and Li 2011). The propensity threshold $\gamma$ for BLTS and BLUCB is optimized among the values 0.01, 0.05, 0.1, 0.2.

#### Well-Specified Outcome Models

In this section, we compare the behavior of LinTS, LinUCB, BLTS and BLUCB when the outcome model of the contextual bandits is well-specified, i.e., it includes both linear and quadratic terms. Note that this is still in the domain of linear contextual bandits, if we treat the quadratic terms as part of the context.

Figure 3: Evolution of the arm assignment in the context space for misspecified (a) LinTS, (b) LinUCB, (c) BLTS, (d) BLUCB.

First, we compare LinTS and LinUCB. Figure 2a shows that the uncertainty and the stochastic nature of LinTS leads to a dispersed assignment of arms $a = 1$ and $a = 2$ and to the crucial assignment of a few contexts to arm $a = 0$. This allows LinTS to start decreasing the bias in the estimation of all three arms. Within the first few learning observations, LinTS estimates the outcome models of all three arms correctly and finds the optimal assignment. On the other hand, Figure 2b shows that the deterministic nature of LinUCB assigns entire regions of the context space to the same arm. As a result, not enough contexts are assigned to $a = 0$ and LinUCB delays the correction of bias in the estimation of this arm. Another way to understand the problem is that the outcome model in the LinUCB bandit has biased coefficients combined with estimated uncertainty that is too low to incentivize the exploration of arm $a = 0$ initially. LinUCB finds the correct assignment after 240 observations.
Second, we study the performance of BLTS and BLUCB. In Figure 2d, we observe that balancing has a significant impact on the performance of UCB, since BLUCB finds the optimal assignment after 110 observations, much faster than LinUCB. This is because the few observations of arm $a = 0$ outside of the context region of the warm-start batch are weighted more heavily by BLUCB. As a result, BLUCB, despite its deterministic nature which complicates estimation, is able to reduce its bias more quickly via balancing. Figure 2c shows that BLTS is also able to find the optimal assignment a few observations earlier than LinTS.

The first column of Table 1 shows the percentage of simulations in which LinTS, LinUCB, BLTS and BLUCB find the optimal assignment within $T = 10000$ contexts for the well-specified case. BLTS outperforms all other algorithms by a large margin.

#### Mis-Specified Outcome Models

We now study the behavior of LinTS, LinUCB, BLTS and BLUCB when the outcome models include only linear terms of the context and therefore are misspecified. In real-world domains, the true data generative process is complex and very difficult to capture by the simpler outcome models assumed by the learning algorithms. Hence, model mismatch is very likely.

We first compare LinTS and LinUCB. In Figures 3a and 3b, we see that during the first time periods, both bandits assign most contexts to arm $a = 2$ and a few contexts to arm $a = 1$. LinTS finds the region in which arm $a = 2$ is suboptimal under the linear approximation faster than LinUCB. However, both LinTS and LinUCB have trouble identifying that the optimal arm is $a = 0$. Due to the low estimate of $a = 0$ from the mis-representative warm-start observations, LinUCB does not assign contexts to arm $a = 0$ for a long time and therefore delays estimating the model of $a = 0$ correctly. LinTS does assign a few contexts to arm $a = 0$, but they are not enough to quickly correct the estimation bias of arm $a = 0$ either. On the other hand, BLTS is able to harness the advantages of the stochastic assignment rule of Thompson sampling. The few contexts assigned to arm $a = 0$ are weighted more heavily by BLTS. Therefore, as shown in Figure 3c, BLTS corrects the estimation error of arm $a = 0$ and finds the (constrained) optimal assignment already after 20 observations. On the other hand, BLUCB does not handle the estimation problem created by the deterministic nature of the assignment in the misspecified case any better than LinUCB, as shown in Figure 3d.

The second column of Table 1 shows the percentage of simulations in which LinTS, LinUCB, BLTS and BLUCB find the optimal assignment within $T = 10000$ contexts for the misspecified case. Again, BLTS has a strong advantage.

This simple synthetic example allowed us to explain transparently where the benefits of balancing in linear bandits stem from. Balancing helps escape biases in the training data and can be more robust in the case of model misspecification. While, as we proved, balanced linear contextual bandits share the same strong theoretical guarantees as their direct-method counterparts, this points towards their better performance in practice compared to other contextual bandits with the linear realizability assumption. We investigate this further in the next section with an extensive evaluation on real classification datasets.
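For reference, the following is a minimal sketch of the data-generating process of this synthetic example as specified above (illustrative code, not the authors' simulation script; the shared noise draw and helper names are our assumptions).

```python
# Sketch of the synthetic example's data-generating process described above.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
SIGMA = 0.1  # noise standard deviation, so sigma^2 = 0.01

def draw_context(warm_start=False, rng=rng):
    if warm_start:
        # standard normal truncated to (-1.15, -0.85) in each coordinate
        return truncnorm.rvs(-1.15, -0.85, size=2, random_state=rng)
    return rng.normal(size=2)

def rewards(x, rng=rng):
    eps = rng.normal(scale=SIGMA)                              # epsilon_t ~ N(0, 0.01)
    return np.array([
        0.5 * (x[0] + 1) ** 2 + 0.5 * (x[1] + 1) ** 2,         # arm 0
        1.0,                                                   # arm 1
        2.0 - 0.5 * (x[0] + 1) ** 2 - 0.5 * (x[1] + 1) ** 2,   # arm 2
    ]) + eps

# Warm start: 50 truncated contexts assigned uniformly at random to the three arms.
warm_X = np.array([draw_context(warm_start=True) for _ in range(50)])
warm_a = rng.integers(0, 3, size=50)
warm_r = np.array([rewards(x)[a] for x, a in zip(warm_X, warm_a)])
```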
| | Well-Specified | Mis-Specified |
|---|---|---|
| LinTS | 84% | 39% |
| LinUCB | 51% | 29% |
| BLTS | 92% | 58% |
| BLUCB | 79% | 30% |

Table 1: Percentage of simulations in which LinTS, LinUCB, BLTS and BLUCB find the optimal assignment within a learning horizon of 10000 contexts.

### Multiclass Classification with Bandit Feedback

Adapting a classification task to a bandit problem is a common method for comparing contextual bandit algorithms (Dudík, Langford, and Li 2011), (Agarwal et al. 2014), (Bietti, Agarwal, and Langford 2018). In a classification task, we assume data are drawn IID from a fixed distribution: $(x, c) \sim D$, where $x \in \mathcal{X}$ is the context and $c \in \{1, 2, \ldots, K\}$ is the class. The goal is to find a classifier $\pi : \mathcal{X} \to \{1, 2, \ldots, K\}$ that minimizes the classification error $\mathbb{E}_{(x,c) \sim D}\,\mathbf{1}\{\pi(x) \ne c\}$. The classifier can be seen as an arm-selection policy and the classification error is the policy's expected regret. Further, if only the loss associated with the policy's chosen arm is revealed, this becomes a contextual bandit setting. So, at time $t$, context $x_t$ is sampled from the dataset, the contextual bandit selects arm $a_t \in \{1, 2, \ldots, K\}$ and observes reward $r_t(a_t) = \mathbf{1}\{a_t = c_t\}$, where $c_t$ is the unknown, true class of $x_t$. The performance of a contextual bandit algorithm on a dataset with $n$ observations is measured with respect to the normalized cumulative regret, $\frac{1}{n}\sum_{t=1}^{n} \left(1 - r_t(a_t)\right)$.

We use 300 multiclass datasets from the Open Media Library (OpenML). The datasets vary in number of observations, number of classes and number of features. Table 2 summarizes the characteristics of these benchmark datasets. Each dataset is randomly shuffled.

| Observations | Datasets |
|---|---|
| ≤ 100 | 58 |
| > 100 and ≤ 1000 | 152 |
| > 1000 and ≤ 10000 | 57 |
| > 10000 | 33 |

| Classes | Count |
|---|---|
| 2 | 243 |
| > 2 and ≤ 10 | 48 |
| > 10 | 9 |

| Features | Count |
|---|---|
| ≤ 10 | 154 |
| > 10 and ≤ 100 | 106 |
| > 100 | 40 |

Table 2: Characteristics of the 300 datasets used for the experiments of multiclass classification with bandit feedback.

We evaluate LinTS, BLTS, LinUCB and BLUCB on these 300 benchmark datasets. We run each contextual bandit on every dataset for different choices of input parameters. The regularization parameter $\lambda$, which is present in all algorithms, is chosen via cross-validation every time the model is updated. The constant $\alpha$, which is present in all algorithms, is optimized among the values 0.25, 0.5, 1 in the Thompson sampling bandits (Chapelle and Li 2011) and among the values 1, 2, 4 in the UCB bandits (Chapelle and Li 2011). The propensity threshold $\gamma$ for BLTS and BLUCB is optimized among the values 0.01, 0.05, 0.1, 0.2.

Apart from baselines that belong in the family of contextual bandits with the linear realizability assumption and strong theoretical guarantees, we also evaluate the policy-based ILOVETOCONBANDITS (ILTCB) from (Agarwal et al. 2014), which does not estimate a model but instead assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret.
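As an illustration of the classification-to-bandit conversion described above, here is a hedged sketch of the evaluation loop; the `bandit` interface with `choose` and `update` methods is a hypothetical stand-in for any of the algorithms compared here, not an API from the paper.

```python
# Sketch of running a contextual bandit on a multiclass dataset with bandit
# feedback: features are contexts, labels are arms, and only the chosen
# arm's reward (correct / incorrect) is revealed.
import numpy as np

def run_bandit_on_dataset(bandit, X, y, rng=None):
    """X: (n, d) features used as contexts; y: (n,) integer class labels used as arms.
    Returns the normalized cumulative regret (1/n) * sum_t (1 - r_t(a_t))."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(y))          # each dataset is randomly shuffled
    regret = 0.0
    for t in order:
        a_t = bandit.choose(X[t])            # select an arm (a predicted class)
        r_t = float(a_t == y[t])             # bandit feedback: 1 if correct, else 0
        bandit.update(X[t], a_t, r_t)        # only the chosen arm's reward is revealed
        regret += 1.0 - r_t
    return regret / len(y)
```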
Figure 4 shows the pairwise comparison of LinTS, BLTS, LinUCB, BLUCB and ILTCB on the 300 classification datasets.

Figure 4: Comparing LinTS, BLTS, LinUCB, BLUCB, ILTCB on 300 datasets. BLUCB outperforms LinUCB. BLTS outperforms LinTS, LinUCB, BLUCB, ILTCB.

Each point corresponds to a dataset. The x coordinate is the normalized cumulative regret of the column bandit and the y coordinate is the normalized cumulative regret of the row bandit. The point is blue when the row bandit has smaller normalized cumulative regret and wins over the column bandit. The point is red when the row bandit loses to the column bandit. The point's size grows with the significance of the win or loss.

The first important observation is that the improved model estimation achieved via balancing leads to better practical performance across a large number of contextual bandit instances. Specifically, BLTS outperforms LinTS and BLUCB outperforms LinUCB. The second important observation is that deterministic assignment rule bandits are at a disadvantage compared to randomized assignment rule bandits. The improvement in estimation via balancing is not enough to outweigh the fact that estimation is more difficult when the assignment is deterministic, and BLUCB is outperformed by LinTS. Overall, BLTS, which has both balancing and a randomized assignment rule, outperforms all other linear contextual bandits with strong theoretical guarantees. BLTS also outperforms the model-agnostic ILTCB algorithm. We refer the reader to Appendix B of the supplemental material of the extended version of this paper (Dimakopoulou et al. 2017) for details on the datasets.

## Closing Remarks

Contextual bandits are poised to play an important role in a wide range of applications: content recommendation in web services, where the learner wants to personalize recommendations (arm) to the profile of a user (context) to maximize engagement (reward); online education platforms, where the learner wants to select a teaching method (arm) based on the characteristics of a student (context) in order to maximize the student's scores (reward); and survey experiments, where the learner wants to learn what information or persuasion (arm) influences the responses (reward) of subjects as a function of their demographics, political beliefs, or other characteristics (context). In these settings, there are many potential sources of bias in the estimation of outcome models: not only the inherent adaptive data collection, but also mismatch between the true data generating process and the outcome model assumptions, and prejudice in the training data in the form of under-representation or over-representation of certain regions of the context space. To reduce bias, we have proposed new contextual bandit algorithms, BLTS and BLUCB, which build on the linear contextual bandits LinTS and LinUCB respectively and improve them with balancing methods from the causal inference literature. We derived the first regret bound analysis for linear contextual bandits with balancing and we showed that linear contextual bandits with balancing match the theoretical guarantees of the linear contextual bandits with direct model estimation; namely, BLTS matches the regret bound of LinTS and BLUCB matches the regret bound of LinUCB. A synthetic example simulating covariate shift and model misspecification and a large-scale experiment with real multiclass classification datasets demonstrated the effectiveness of balancing in contextual bandits, particularly when coupled with Thompson sampling.

## Acknowledgments

The authors would like to thank Emma Brunskill for valuable comments on the paper and John Langford, Miroslav Dudík, Akshay Krishnamurthy and Chicheng Zhang for useful discussions regarding the evaluation on classification datasets. This research is generously supported by ONR grant N00014-17-1-2131, by the Sloan Foundation, by the Arvanitidis in Memory of William K. Linvill Stanford Graduate Fellowship and by the Onassis Foundation.

## References
Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In NIPS.
Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. ICML.
Agarwal, A.; Bird, S.; Cozowicz, M.; Hoang, L.; Langford, J.; Lee, S.; Li, J.; Melamed, D.; Oshri, G.; Ribas, O.; Sen, S.; and Slivkins, A. 2016. Making contextual decisions with low technical debt. arXiv preprint arXiv:1606.03966.
Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. Journal of Machine Learning Research Workshop and Conference Proceedings.
Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. ICML.
Athey, S., and Wager, S. 2017. Efficient policy learning. arXiv preprint arXiv:1702.02896.
Athey, S.; Imbens, G. W.; and Wager, S. 2018. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning.
Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov).
Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research.
Bareinboim, E., and Pearl, J. 2015. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences.
Bareinboim, E.; Forney, A.; and Pearl, J. 2015. Bandits with unobserved confounders: A causal approach. ICML.
Bastani, H., and Bayati, M. 2015. Online decision-making with high-dimensional covariates.
Bietti, A.; Agarwal, A.; and Langford, J. 2018. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064.
Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning.
Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. NIPS.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In AISTATS.
Cortes, C.; Mansour, Y.; and Mohri, M. 2010. Learning bounds for importance weighting. NIPS.
Crump, R. K.; Hotz, V. J.; Imbens, G. W.; and Mitnik, O. A. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika 96(1):187-199.
Deshpande, Y.; Mackey, L.; Syrgkanis, V.; and Taddy, M. 2017. Accurate inference in adaptive linear models. arXiv preprint arXiv:1712.06695.
Dimakopoulou, M.; Zhou, Z.; Athey, S.; and Imbens, G. 2017. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.
Dudík, M.; Erhan, D.; Langford, J.; and Li, L. 2014. Doubly robust policy evaluation and optimization. Statistical Science.
Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. ICML.
Forney, A.; Pearl, J.; and Bareinboim, E. 2017. Counterfactual data-fusion for online reinforcement learners. ICML.
Huang, J.; Gretton, A.; Borgwardt, K. M.; Schölkopf, B.; and Smola, A. J. 2007. Correcting sample selection bias by unlabeled data. NIPS.
Imbens, G. W., and Rubin, D. B. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences.
Jiang, N., and Li, L. 2016. Doubly robust off-policy value evaluation for reinforcement learning. ICML.
Kallus, N., and Zhou, A. 2018. Policy evaluation and optimization with continuous treatments. AISTATS.
Kallus, N. 2017. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384.
Lai, T., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics.
Lattimore, F.; Lattimore, T.; and Reid, M. D. 2016. Causal bandits: Learning good interventions via causal inference. NIPS.
Lei, H.; Tewari, A.; and Murphy, S. 2017. An actor-critic contextual bandit algorithm for personalized mobile health interventions. arXiv preprint arXiv:1706.09090.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. 2010. A contextual-bandit approach to personalized news article recommendation. WWW.
Li, L.; Chu, W.; Langford, J.; Moon, T.; and Wang, X. 2012. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. Journal of Machine Learning Research Workshop and Conference Proceedings.
Li, L.; Chen, S.; Kleban, J.; and Gupta, A. 2014. Counterfactual estimation and optimization of click metrics for search engines. arXiv preprint arXiv:1403.1891.
Nie, X.; Tian, X.; Taylor, J.; and Zou, J. 2018. Why adaptively collected data have negative bias and how to correct for it. International Conference on Artificial Intelligence and Statistics.
Perchet, V., and Rigollet, P. 2013. The multi-armed bandit problem with covariates. The Annals of Statistics.
Rigollet, P., and Zeevi, R. 2010. Nonparametric bandits with covariates. COLT.
Russo, D., and Van Roy, B. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research.
Russo, D. J.; Van Roy, B.; Kazerouni, A.; Osband, I.; and Wen, Z. 2018. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning.
Scharfstein, D. O.; Rotnitzky, A.; and Robins, J. M. 1999. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448):1096-1120.
Scott, S. 2010. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry.
Slivkins, A. 2014. Contextual bandits with similarity information. Journal of Machine Learning Research.
Strehl, A.; Langford, J.; Li, L.; and Kakade, S. M. 2010. Learning from logged implicit exploration data. In NIPS.
Swaminathan, A., and Joachims, T. 2015. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research.
Thomas, P., and Brunskill, E. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. ICML.
Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika.
Villar, S.; Bowden, J.; and Wason, J. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science.
Wang, Y. X.; Agarwal, A.; and Dudík, M. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. ICML.
Zadrozny, B. 2004. Learning and evaluating classifiers under sample selection bias. ICML.
Zhou, Z.; Athey, S.; and Wager, S. 2018. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778.