Bayesian Counterfactual Risk Minimization

Ben London¹  Ted Sandler¹

Abstract

We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard L2 regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.

1. Introduction

In industrial applications of machine learning, model development is typically an iterative process, involving multiple trials of offline training and online experimentation. For example, a content streaming service might explore various recommendation strategies in a series of A/B tests. The data generated by this process (e.g., impression and interaction logs) can be used to augment training data and further refine a model. However, learning from logged interactions poses two fundamental challenges: (1) the feedback obtained from interaction is always incomplete, since one only observes responses (usually referred to as rewards) for actions that were taken; (2) the distribution of observations is inherently biased by the policy that determined which action to take in each context. This problem of learning from logged data has been studied under various names by various authors (Strehl et al., 2010; Dudík et al., 2011; Bottou et al., 2013; Swaminathan and Joachims, 2015a). We adopt the moniker counterfactual risk minimization (CRM), introduced by Swaminathan and Joachims (2015a), though it is also known as offline policy optimization in the reinforcement learning literature.

¹ Amazon, Seattle, WA, USA. Correspondence to: Ben London.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

The goal of CRM is to learn a policy from data that was logged by a previous policy so as to maximize expected reward (alternatively, minimize risk) over draws of future contexts. Using an analysis based on Bennett's inequality, Swaminathan and Joachims derived an upper bound on the risk of a stochastic policy,¹ which motivated learning with variance-based regularization.

In this work, we study CRM from a Bayesian perspective, in which one's uncertainty over actions becomes uncertainty over hypotheses. We view a stochastic policy as a distribution over hypotheses, each of which is a mapping from contexts to actions. Our work bridges the gap between CRM, which has until now been approached from the frequentist perspective, and Bayesian methods, which are often used to balance exploration and exploitation in contextual bandit problems (Chapelle and Li, 2011). Using a PAC-Bayesian analysis, we prove an upper bound on the risk of a Bayesian policy trained on logged data. We then apply this bound to a class of Bayesian policies based on the mixed logit model. This analysis suggests an intuitive regularization strategy for Bayesian CRM based on the L2 distance from the logging policy's parameters. Our logging policy regularization (LPR) is effectively similar to variance regularization, but simpler to implement and more computationally efficient.
We derive two Bayesian CRM objectives based on LPR, one of which is convex. We also consider the scenario in which the logging policy is unknown. In this case, we propose a two-step procedure: learn the logging policy, then use the learned parameters to regularize the training of a new policy. We prove a corresponding risk bound for this setting using a distribution-dependent prior. We end with an empirical study of our theoretical results. First, we show that LPR outperforms standard L2 regularization whenever the logging policy is better than a uniform distribution. Second, we show that LPR is competitive with variance regularization, and even outperforms it on certain problems. Finally, we demonstrate that it is indeed possible to learn the logging policy for LPR with negligible impact on performance. These findings establish LPR as a simple, effective method for Bayesian CRM.

¹ In a similar vein, Strehl et al. (2010) proved a lower bound on the expected reward of a deterministic policy.

Note: All proofs are deferred to the supplemental material.

2. Preliminaries

Let X denote a set of contexts, and A denote a finite set of k discrete actions. We are interested in finding a stochastic policy, $\pi : X \to \Delta_k$, which maps X to the probability simplex on k vertices, denoted $\Delta_k$; in other words, $\pi$ defines a conditional probability distribution over actions given contexts, from which we can sample actions. For a given context, $x \in X$, we denote the conditional distribution on A by $\pi(x)$, and the probability mass of a particular action, $a \in A$, by $\pi(a \mid x)$. Each action is associated with a stochastic, contextual reward, given by an unknown function, $\rho : X \times A \to [0, 1]$, which we assume is bounded. When an action is played in response to a context, we only observe the reward for said action. This type of incomplete feedback is commonly referred to as bandit feedback.

We assume a stationary distribution, D, over contexts and reward functions. Our goal will be to find a policy that maximizes the expected reward over draws of $(x, \rho) \sim D$ and $a \sim \pi(x)$; or, put differently, one that minimizes the risk,

$$R(\pi) \triangleq 1 - \mathbb{E}_{(x,\rho) \sim D}\, \mathbb{E}_{a \sim \pi(x)} [\rho(x, a)].$$

We assume that we have access to a dataset of logged observations (i.e., examples), $S \triangleq (x_i, a_i, p_i, r_i)_{i=1}^n$, where $(x_i, \rho)$ were sampled from D; action $a_i$ was sampled with probability $p_i \triangleq \pi_0(a_i \mid x_i)$ from a fixed logging policy, $\pi_0$; and reward $r_i \triangleq \rho(x_i, a_i)$ was observed. The distribution of S, which we denote by $(D \times \pi_0)^n$, is biased by the logging policy, since we only observe rewards for actions that were sampled from its distribution. Nonetheless, if $\pi_0$ has full support, we can obtain an unbiased estimate of $R(\pi)$ by scaling each reward by its inverse propensity score (IPS) (Rosenbaum and Rubin, 1983), $p_i^{-1}$, which yields the IPS estimator,

$$\hat{R}(\pi, S) \triangleq 1 - \frac{1}{n} \sum_{i=1}^n r_i \, \frac{\pi(a_i \mid x_i)}{p_i}.$$

Unfortunately, IPS can have very high variance. This issue can be mitigated by truncating (or clipping) $p_i$ to the interval $[\tau, 1]$ (as proposed by Strehl et al. (2010)), yielding

$$\hat{R}_\tau(\pi, S) \triangleq 1 - \frac{1}{n} \sum_{i=1}^n r_i \, \frac{\pi(a_i \mid x_i)}{\max\{p_i, \tau\}},$$

which we will sometimes refer to as the empirical risk. This estimator reduces variance, at the cost of adding bias. However, since $\max\{p_i, \tau\} \geq p_i$, we have that $\hat{R}_\tau(\pi, S) \geq \hat{R}(\pi, S)$, which implies

$$\mathbb{E}_{S \sim (D \times \pi_0)^n}\big[\hat{R}_\tau(\pi, S)\big] \;\geq\; \mathbb{E}_{S \sim (D \times \pi_0)^n}\big[\hat{R}(\pi, S)\big] \;=\; R(\pi).$$

Thus, if $\hat{R}_\tau(\pi, S)$ concentrates, then by minimizing it, we minimize a probabilistic upper bound on the risk.
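To make the estimators above concrete, the following minimal numpy sketch computes the truncated IPS risk estimate from logged (context, action, propensity, reward) tuples; the function name and toy numbers are purely illustrative.

```python
import numpy as np

def truncated_ips_risk(pi_probs, rewards, logged_props, tau):
    """Truncated IPS estimate of the risk: 1 - (1/n) * sum_i r_i * pi(a_i|x_i) / max(p_i, tau)."""
    weights = pi_probs / np.maximum(logged_props, tau)
    return 1.0 - float(np.mean(rewards * weights))

# Toy logged data: propensities p_i under the logging policy, observed rewards r_i,
# and the new policy's probabilities pi(a_i | x_i) at the logged actions.
p  = np.array([0.50, 0.10, 0.05, 0.80])
r  = np.array([1.0, 0.0, 1.0, 1.0])
pi = np.array([0.60, 0.20, 0.30, 0.70])

print(truncated_ips_risk(pi, r, p, tau=0.1))    # truncated estimate (at least the plain IPS estimate in expectation)
print(truncated_ips_risk(pi, r, p, tau=1e-12))  # effectively untruncated, i.e., the plain IPS estimator
```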
Remark 1. There are other estimators we can consider. For instance, we could truncate the ratio of the policy and the logging policy, $\min\{\pi(a_i \mid x_i)/p_i,\ \tau^{-1}\}$ (Ionides, 2008; Swaminathan and Joachims, 2015a). However, this form of truncation is incompatible with our subsequent analysis because the policy is inside the min operator. Avoiding truncation altogether, we could use the self-normalizing estimator (Swaminathan and Joachims, 2015b), but this is also incompatible, since the estimator does not decompose as a sum of i.i.d. random variables. Finally, we note that our theory does apply, with small modifications, to the doubly-robust estimator (Dudík et al., 2011).

2.1. Counterfactual Risk Minimization

Our work is heavily influenced by Swaminathan and Joachims (2015a), who coined the term counterfactual risk minimization (CRM) to refer to the problem of learning a policy from logged bandit feedback by minimizing an upper bound on the risk. Their bound is a function of the truncated² IPS estimator; the sample variance of the truncated IPS-weighted rewards under the policy, $\hat{V}_\tau(\pi, S)$; and a measure of the complexity, $C : \Pi \to \mathbb{R}_+$, of the class of policies being considered, $\Pi \subseteq \{\pi : X \to \Delta_k\}$:

$$R(\pi) \leq \hat{R}_\tau(\pi, S) + O\!\left(\sqrt{\frac{\hat{V}_\tau(\pi, S)\, C(\Pi)}{n}} + \frac{C(\Pi)}{n}\right). \quad (1)$$

When $\hat{V}_\tau(\pi, S)$ is sufficiently small, the bound's dominating term is $O(n^{-1})$, which is the so-called fast learning rate. This motivates a variance-regularized learning objective,

$$\arg\min_{\pi \in \Pi}\ \hat{R}_\tau(\pi, S) + \lambda \sqrt{\frac{\hat{V}_\tau(\pi, S)}{n}}, \quad (2)$$

for a regularization parameter, $\lambda > 0$. Swaminathan and Joachims propose a majorization-minimization algorithm named policy optimization for exponential models (POEM) to solve this optimization.

3. PAC-Bayesian Analysis

In this work, we view CRM from a Bayesian perspective. We consider stochastic policies whose action distributions are induced by distributions over hypotheses. Instead of sampling directly from a distribution on the action set, we sample from a distribution on a hypothesis space, $H \subseteq \{h : X \to A\}$, in which each element is a deterministic mapping from contexts to actions.³ As such, for a distribution, Q, on H, the probability of an action, $a \in A$, given a context, $x \in X$, is the probability that a random hypothesis, $h \sim Q$, maps x to a; that is,

$$\pi_Q(a \mid x) \triangleq \Pr_{h \sim Q}\{h(x) = a\} = \mathbb{E}_{h \sim Q}\big[\mathbb{1}\{h(x) = a\}\big]. \quad (3)$$

Usually, the hypothesis space consists of functions of a certain parametric form, so the distribution is actually over the parameter values. We analyze one such class in Section 4.

² Though Swaminathan and Joachims used a different form of truncation, their results nonetheless hold for our truncation.
³ This view of stochastic policies was also used by Seldin et al. (2011) to analyze contextual bandits with PAC-Bayes.

To analyze Bayesian policies, we use the PAC-Bayesian framework (also known simply as PAC-Bayes). The PAC-Bayesian learning paradigm proceeds as follows: first, we fix a hypothesis space, H, and a prior distribution, P, on H; then, we receive some data, S, drawn from a fixed distribution; given S, we learn a posterior, Q, on H, from which we can sample hypotheses to classify new instances. In our PAC-Bayesian formulation of CRM, the learned posterior becomes our stochastic policy (Equation 3). Given a context, $x \in X$, we sample an action by sampling $h \sim Q$ (independent of x) and returning h(x). (In PAC-Bayesian terminology, this procedure is called the Gibbs classifier.)
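The following sketch illustrates Equation 3 and the Gibbs classifier under an assumed hypothesis class of linear scorers with a Gaussian posterior over their weights; all names and dimensions here are hypothetical, chosen only to show how $\pi_Q(a \mid x)$ can be estimated by sampling hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: hypotheses are linear scorers, h_w(x) = argmax_a w_a . x,
# and Q is a Gaussian over the weights; the Gibbs classifier samples one w per decision.
d, k = 5, 3
mu = rng.normal(size=(k, d))          # posterior mean (one weight vector per action)
sigma = 0.5                           # posterior standard deviation

def gibbs_action(x):
    """Sample h ~ Q, then return the deterministic action h(x)."""
    w = mu + sigma * rng.normal(size=mu.shape)
    return int(np.argmax(w @ x))

def policy_prob(x, a, num_samples=10_000):
    """Monte Carlo estimate of pi_Q(a|x) = E_{h~Q}[1{h(x) = a}] (Equation 3)."""
    hits = sum(gibbs_action(x) == a for _ in range(num_samples))
    return hits / num_samples

x = rng.normal(size=d)
print([round(policy_prob(x, a), 3) for a in range(k)])  # estimated action distribution; sums to ~1
```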
Remark 2. Instead of sampling actions via a posterior over hypotheses, we could equivalently sample policies from a posterior over policies, $\{\pi : X \to \Delta_k\}$, then sample actions from said policies. The Bayesian policy would then be the expected policy, $\pi_Q(a \mid x) \triangleq \mathbb{E}_{\pi \sim Q}[\pi(a \mid x)]$. That said, it is more traditional in PAC-Bayes, and perhaps more flexible, to think in terms of the Gibbs classifier, which directly maps contexts to actions.

It is important to note that the choice of prior cannot depend on the training data; however, the prior can generate the data. Indeed, we can generate S by sampling $(x_i, \rho) \sim D$ and $h \sim P$, then logging $(x_i, h(x_i), \pi_0(h(x_i) \mid x_i), \rho(x_i, h(x_i)))$, for $i = 1, \dots, n$. Thus, in the PAC-Bayesian formulation of CRM, the prior can be the logging policy. We elaborate on this in Section 4.

3.1. Risk Bounds

The heart of our analysis is an application of the PAC-Bayesian theorem (a generalization bound for Bayesian learning) to upper-bound the risk. The particular PAC-Bayesian bound we use is by McAllester (2003). Omitting details (in the interest of space), the bound states that, for any fixed prior, P, on H, with probability at least $1 - \delta$ over draws of the data, S, all posteriors, Q, on H, satisfy

$$R(Q) \leq \hat{R}(Q, S) + O\!\left(\sqrt{\frac{\hat{R}(Q, S)\, D_{\mathrm{KL}}(Q \,\|\, P)}{n}} + \frac{D_{\mathrm{KL}}(Q \,\|\, P)}{n}\right),$$

where $R(Q)$ and $\hat{R}(Q, S)$ are the risk and empirical risk, respectively. The hallmark of a PAC-Bayesian bound is the KL divergence from the prior to the posterior. This can be interpreted as a complexity measure, similar to the VC dimension, covering number or Rademacher complexity (Mohri et al., 2012). The divergence penalizes posteriors that stray from the prior, effectively penalizing overfitting. One attractive property of McAllester's bound is that, if the empirical risk is sufficiently small, then the generalization error, $R(Q) - \hat{R}(Q, S)$, is $O(n^{-1})$. Thus, the bound captures both realizable and non-realizable learning problems.

To apply the PAC-Bayesian theorem to CRM, we design a loss function that allows us to state the risk of a policy in terms of the risk of the Gibbs classifier. This yields the following risk bound.

Theorem 1. Let $H \subseteq \{h : X \to A\}$ denote a hypothesis space mapping contexts to actions. For any $n \geq 1$, $\delta \in (0, 1)$, $\tau \in (0, 1)$ and fixed prior, P, on H, with probability at least $1 - \delta$ over draws of $S \sim (D \times \pi_0)^n$, the following holds simultaneously for all posteriors, Q, on H:

$$R(\pi_Q) \leq \hat{R}_\tau(\pi_Q, S) + \sqrt{\frac{2\big(\hat{R}_\tau(\pi_Q, S) - 1 + \tfrac{1}{\tau}\big)\big(D_{\mathrm{KL}}(Q \,\|\, P) + \ln\tfrac{n}{\delta}\big)}{\tau(n-1)}} + \frac{2\big(D_{\mathrm{KL}}(Q \,\|\, P) + \ln\tfrac{n}{\delta}\big)}{\tau(n-1)}. \quad (4)$$

It is important to note that the truncated IPS estimator, $\hat{R}_\tau$, can be negative, achieving its minimum at $1 - \tau^{-1}$. This means that when $\hat{R}_\tau$ is minimized, the middle $O(n^{-1/2})$ term disappears and the $O(n^{-1})$ term dominates the bound, yielding the fast learning rate. That said, our bound may not be as tight as Swaminathan and Joachims's (Equation 1), since the variance is sometimes smaller than the mean. To achieve a similar rate, we could perhaps use Seldin et al.'s (2012) PAC-Bayesian Bernstein bound.

Though Theorem 1 assumes that the truncation parameter, $\tau$, is fixed a priori, we can derive a risk bound (given in the supplemental material) that holds for all $\tau$ simultaneously, meaning that $\tau$ can be data-dependent, such as the 10th percentile of the logged propensities.

Theorem 1 has an intriguing interpretation when the prior is defined as the logging policy. In this case, one can minimize an upper bound on the risk by minimizing the empirical risk while keeping the learned policy close to the logging policy. We explore this idea, and its relationship to variance regularization, in the next section.
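As a rough illustration of how the bound behaves, the sketch below evaluates the right-hand side of Equation 4 as reconstructed above (so the exact constants should be taken with a grain of salt); it shows the slack growing with the KL divergence from the prior.

```python
import numpy as np

def pac_bayes_crm_bound(emp_risk, kl, n, tau, delta):
    """Right-hand side of the Theorem 1 bound (Equation 4, as reconstructed):
    empirical risk + an O(n^{-1/2}) slack term + an O(n^{-1}) slack term."""
    complexity = kl + np.log(n / delta)
    slack_sqrt = np.sqrt(2.0 * (emp_risk - 1.0 + 1.0 / tau) * complexity / (tau * (n - 1)))
    slack_lin = 2.0 * complexity / (tau * (n - 1))
    return emp_risk + slack_sqrt + slack_lin

# A posterior close to the prior (small KL) pays little for generalization.
for kl in [0.1, 10.0, 1000.0]:
    print(kl, round(pac_bayes_crm_bound(emp_risk=0.4, kl=kl, n=100_000, tau=0.01, delta=0.05), 4))
```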
4. Mixed Logit Models

We will apply our PAC-Bayesian analysis to the following class of stochastic policies. We first define a hypothesis space, $H \triangleq \{h_{w,\gamma} : w \in \mathbb{R}^d,\ \gamma \in \mathbb{R}^k\}$, of functions of the form

$$h_{w,\gamma}(x) \triangleq \arg\max_{a \in A}\ w \cdot \phi(x, a) + \gamma_a, \quad (5)$$

where $\phi(x, a) \in \mathbb{R}^d$ outputs features of the context and action, subject to $\sup_{x \in X, a \in A} \|\phi(x, a)\| \leq B$. If each $\gamma_a$ is sampled from a standard Gumbel distribution, Gum(0, 1) (location 0, scale 1), then $h_{w,\gamma}(x)$ produces a sample from a softmax model,

$$\varsigma_w(a \mid x) \triangleq \frac{\exp(w \cdot \phi(x, a))}{\sum_{a' \in A} \exp(w \cdot \phi(x, a'))}. \quad (6)$$

Further, if w is normally distributed, then $h_{w,\gamma}(x)$ has a logistic-normal distribution (Aitchison and Shen, 1980).

We define the posterior, Q, as a Gaussian over softmax parameters, $w \sim N(\mu, \sigma^2 I)$, for some learned $\mu \in \mathbb{R}^d$ and $\sigma^2 \in (0, \infty)$, with standard Gumbel perturbations, $\gamma \sim \mathrm{Gum}(0, 1)^k$. As such, we have that

$$\pi_Q(a \mid x) = \mathbb{E}_{w,\gamma}\big[\mathbb{1}\{h_{w,\gamma}(x) = a\}\big] = \mathbb{E}_{w \sim N(\mu, \sigma^2 I)}\big[\varsigma_w(a \mid x)\big]. \quad (7)$$

This model is alternately referred to as a mixed logit or random parameter logit.

We can define the prior in any way that seems reasonable (without access to training data, of course). In the absence of any prior knowledge, a logical choice of prior is the standard (zero mean, unit variance) multivariate normal distribution, with standard Gumbel perturbations. This prior corresponds to a Bayesian policy that takes uniformly random actions, and motivates standard L2 regularization of $\mu$. However, we know that the data was generated by the logging policy, and this knowledge motivates a different kind of prior (hence, regularizer). If the logging policy performs better than a uniform action distribution (which we can verify empirically, using IPS with the logs), then it makes sense to define the prior in terms of the logging policy.

Let us assume that the logging policy is known (we relax this assumption in Section 5) and has a softmax form (Equation 6), with parameters $\mu_0 \in \mathbb{R}^d$. We define the prior, P, as an isotropic Gaussian centered at the logging policy's parameters, $w \sim N(\mu_0, \sigma_0^2 I)$, for some predetermined $\sigma_0^2 \in (0, \infty)$, with standard Gumbel perturbations, $\gamma \sim \mathrm{Gum}(0, 1)^k$. This prior encodes a belief that the logging policy, while not perfect, is a good starting point. Using the logging policy to define the prior does not violate the PAC-Bayes paradigm, since the logging policy is fixed before generating the training data. The Bayesian policy induced by this prior may not correspond to the actual logging policy, but we can define the prior any way we want.

Remark 3. We used isotropic covariances for the prior and posterior in order to simplify our analysis and presentation. That said, it is possible to use more complex covariances.

4.1. Bounding the KL Divergence

The KL divergence between the above prior and posterior constructions motivates an interesting regularizer for CRM. To derive it, we upper-bound the KL divergence by a function of the model parameters.

Lemma 1. For distributions $P \triangleq N(\mu_0, \sigma_0^2 I) \otimes \mathrm{Gum}(0, 1)^k$ and $Q \triangleq N(\mu, \sigma^2 I) \otimes \mathrm{Gum}(0, 1)^k$, with $\mu_0, \mu \in \mathbb{R}^d$ and $0 < \sigma^2 \leq \sigma_0^2 < \infty$,

$$D_{\mathrm{KL}}(Q \,\|\, P) \leq \frac{\|\mu - \mu_0\|^2}{2\sigma_0^2} + \frac{d}{2} \ln\frac{\sigma_0^2}{\sigma^2}. \quad (8)$$
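A small sketch of this construction, using hypothetical features for a single context: the Gumbel-perturbed argmax of Equation 5 empirically reproduces the softmax of Equation 6, and the KL bound of Lemma 1 (as reconstructed above) is a simple function of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
w = rng.normal(size=d)
phi = rng.normal(size=(k, d))          # hypothetical features phi(x, a) for one fixed context x

def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# Equation 6: softmax probabilities for this context.
probs_softmax = softmax(phi @ w)

# Equation 5 with standard Gumbel perturbations: argmax_a w.phi(x,a) + gamma_a samples the same softmax.
samples = 200_000
gumbel = rng.gumbel(loc=0.0, scale=1.0, size=(samples, k))
actions = np.argmax(phi @ w + gumbel, axis=1)
probs_gumbel = np.bincount(actions, minlength=k) / samples

print(np.round(probs_softmax, 3), np.round(probs_gumbel, 3))   # should agree closely

# Lemma 1 (as reconstructed): KL(Q || P) <= ||mu - mu0||^2 / (2 sigma0^2) + (d/2) ln(sigma0^2 / sigma^2).
def kl_upper_bound(mu, mu0, sigma2, sigma0_2):
    return np.sum((mu - mu0) ** 2) / (2 * sigma0_2) + 0.5 * len(mu) * np.log(sigma0_2 / sigma2)

mu0 = rng.normal(size=d)               # logging policy's parameters (prior mean)
mu = mu0 + 0.1 * rng.normal(size=d)    # learned mean that stays close to the prior
print(round(kl_upper_bound(mu, mu0, sigma2=0.01, sigma0_2=0.05), 4))
```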
One implication of Lemma 1, captured by the term $\|\mu - \mu_0\|^2$, is that, to generalize, the learned policy's parameters should stay close to the logging policy's parameters. This intuition concurs with Swaminathan and Joachims's (2015a) variance regularization, since one way to reduce the variance is to not stray too far from the logging policy. It is also reminiscent of trust region policy optimization (Schulman et al., 2015), a reinforcement learning algorithm in which each update to the policy's action distribution is constrained to not diverge too much from the current one.⁴

⁴ Interestingly, via Fenchel duality and Cauchy-Schwarz, a bound like Equation 8 holds for the KL divergence between softmax action distributions: $D_{\mathrm{KL}}(\varsigma_\mu(x) \,\|\, \varsigma_{\mu_0}(x)) \leq O(\|\mu - \mu_0\|)$.

Implementing Lemma 1's guideline in practice requires a simple modification to the usual L2 regularization: instead of $\lambda \|\mu\|^2$ (where $\lambda > 0$ controls the amount of regularization), use $\lambda \|\mu - \mu_0\|^2$. Of course, this assumes that the logging policy's parameters, $\mu_0$, are known; we address the scenario in which the logging policy is unknown in Section 5.

4.2. Approximating the Action Probabilities

In practice, computing the posterior action probabilities (Equation 7) of a mixed logit model is difficult, since there is no analytical expression for the mean of the logistic-normal distribution (Aitchison and Shen, 1980). It is therefore difficult to log propensities, or to compute the IPS estimator, which is a function of the learned and logged probabilities. Since it is easy to sample from a mixed logit, we can use Monte Carlo methods to estimate the probabilities. Alternatively, we can bound the probabilities by a function of the mean parameters, $\mu$.

Lemma 2. If $\sup_{x \in X, a \in A} \|\phi(x, a)\| \leq B$, then

$$\varsigma_\mu(a \mid x)\, e^{-\sigma^2 B^2 / 2} \;\leq\; \pi_Q(a \mid x) \;\leq\; \varsigma_\mu(a \mid x)\, e^{2\sigma^2 B^2}.$$

By Lemma 2, the softmax probabilities induced by the mean parameters provide lower and upper bounds on the probabilities of the mixed logit model. The bounds tighten as the variance, $\sigma^2$, becomes smaller. For instance, if $\sigma^2 = O(n^{-1})$, then $\pi_Q(a \mid x) \to \varsigma_\mu(a \mid x)$ as $n \to \infty$.

During learning, we can use the lower bound of the learned probabilities to upper-bound the IPS estimator. We overload our previous notation to define a new estimator,

$$\hat{R}_\tau(\mu, \sigma^2, S) \triangleq 1 - \frac{e^{-\sigma^2 B^2 / 2}}{n} \sum_{i=1}^n r_i \, \frac{\varsigma_\mu(a_i \mid x_i)}{\max\{p_i, \tau\}}.$$

This estimator is biased, but the bias decreases with $\sigma^2$. Importantly, $\hat{R}_\tau(\mu, \sigma^2, S)$ is easy to compute (assuming the action set is not too large), since it avoids the logistic-normal integral. When the learned posterior is deployed, we can log the upper bound of the propensities, so that future training with the logged data has an upper bound on the IPS estimator.

4.3. Bayesian CRM for Mixed Logit Models

We now present a risk bound for the Bayesian policy, $\pi_Q$, using the softmax policy, $\varsigma_\mu$, on the mean parameters, $\mu$.

Theorem 2. Let H denote the hypothesis space defined in Equation 5, and let $\pi_Q$ denote the mixed logit policy defined in Equation 7. For any $n \geq 1$, $\delta \in (0, 1)$, $\tau \in (0, 1)$, $\mu_0 \in \mathbb{R}^d$ and $\sigma_0^2 \in (0, \infty)$, with probability at least $1 - \delta$ over draws of $S \sim (D \times \pi_0)^n$, the following holds simultaneously for all $\mu \in \mathbb{R}^d$ and $\sigma^2 \in (0, \sigma_0^2]$:

$$R(\pi_Q) \leq \hat{R}_\tau(\mu, \sigma^2, S) + \sqrt{\frac{\big(\hat{R}_\tau(\mu, \sigma^2, S) - 1 + \tfrac{1}{\tau}\big)\big(\Gamma(\mu_0, \sigma_0^2, \mu, \sigma^2) + 2 \ln\tfrac{n}{\delta}\big)}{\tau(n-1)}} + \frac{\Gamma(\mu_0, \sigma_0^2, \mu, \sigma^2) + 2 \ln\tfrac{n}{\delta}}{\tau(n-1)}, \quad (9)$$

where

$$\Gamma(\mu_0, \sigma_0^2, \mu, \sigma^2) \triangleq \frac{\|\mu - \mu_0\|^2}{\sigma_0^2} + d \ln\frac{\sigma_0^2}{\sigma^2}. \quad (10)$$

Theorem 2 provides an upper bound on the risk that can be computed with training data. Moreover, the bound is differentiable and smooth, meaning it can be optimized using gradient-based methods. This motivates a new regularized learning objective for Bayesian CRM.

Proposition 1. The following optimization minimizes an upper bound on Equation 9:

$$\arg\min_{\mu \in \mathbb{R}^d,\ \sigma^2 \in (0, \sigma_0^2]}\ \hat{R}_\tau(\mu, \sigma^2, S) + \frac{\|\mu - \mu_0\|^2}{\sigma_0^2\, \tau(n-1)} - \frac{d \ln \sigma^2}{\tau(n-1)}. \quad (11)$$
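The sketch below evaluates the Proposition 1 objective (Equation 11, as reconstructed above) using the overloaded estimator; the data layout, function names and toy values are assumptions for illustration, and B stands in for the feature-norm bound.

```python
import numpy as np

def softmax_probs(mu, phi):
    """phi: (k, d) features phi(x, a) for one context; returns softmax probabilities over k actions."""
    z = phi @ mu
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def approx_truncated_ips_risk(mu, sigma2, B, data, tau):
    """Overloaded estimator: 1 - (exp(-sigma^2 B^2 / 2) / n) * sum_i r_i * softmax_mu(a_i|x_i) / max(p_i, tau)."""
    total = 0.0
    for phi_x, a, p, r in data:              # data: list of (features (k,d), logged action, propensity, reward)
        total += r * softmax_probs(mu, phi_x)[a] / max(p, tau)
    return 1.0 - np.exp(-sigma2 * B**2 / 2) * total / len(data)

def lpr_objective(mu, sigma2, mu0, sigma0_2, B, data, tau):
    """Equation 11: approximate empirical risk + LPR penalty - d*ln(sigma^2) term."""
    n, d = len(data), len(mu)
    reg = np.sum((mu - mu0) ** 2) / (sigma0_2 * tau * (n - 1)) - d * np.log(sigma2) / (tau * (n - 1))
    return approx_truncated_ips_risk(mu, sigma2, B, data, tau) + reg

# Tiny synthetic example (hypothetical data): 2 actions, 3-dimensional features; B is a toy norm bound.
rng = np.random.default_rng(1)
data = [(rng.normal(size=(2, 3)), rng.integers(2), 0.5, float(rng.integers(2))) for _ in range(50)]
mu0 = rng.normal(size=3)                      # logging policy's parameters
print(round(lpr_objective(mu0.copy(), 0.01, mu0, 0.1, B=2.0, data=data, tau=0.05), 4))
```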
Equation 11 is unfortunately non-convex. However, we can upper-bound $\hat{R}_\tau(\mu, \sigma^2, S)$ to obtain an objective that is differentiable, smooth and convex.

Proposition 2. The following convex optimization minimizes an upper bound on Equation 9:

$$\arg\min_{\mu \in \mathbb{R}^d}\ -\frac{1}{n} \sum_{i=1}^n \frac{r_i \ln \varsigma_\mu(a_i \mid x_i)}{\max\{p_i, \tau\}} + \frac{\|\mu - \mu_0\|^2}{\sigma_0^2\, \tau(n-1)}, \quad (12)$$

with

$$\sigma^2 \triangleq \min\left\{\frac{2d}{B^2 \tau(n-1)} \left(\frac{1}{n} \sum_{i=1}^n \frac{r_i}{\max\{p_i, \tau\}}\right)^{-1},\ \sigma_0^2\right\}.$$

Conveniently, Equation 12 is equivalent to a weighted softmax regression with a modified L2 regularizer. This optimization can be solved using standard methods, with guaranteed convergence to a global optimum. Moreover, by decoupling the optimizations of $\mu$ and $\sigma^2$ in the upper bound (refer to the proof for details), we can solve for the optimal $\sigma^2$ in closed form. Equation 12 also has a connection to policy gradient methods (Sutton et al., 2000). We discuss this connection in the supplemental material.

In practice, one usually tunes the amount of regularization to optimize the empirical risk on a held-out validation dataset. By Propositions 1 and 2, this is equivalent to tuning the variance of the prior, $\sigma_0^2$. Though $\mu_0$ could in theory be any fixed vector, the case when it is the parameters of the logging policy corresponds to an interesting regularizer. This regularizer instructs the learning algorithm to keep the learned policy close to the logging policy, which effectively reduces the variance of the estimator.

Using Theorem 2, we can examine how the parameters $\sigma_0^2$ and $\sigma^2$ affect the bias-variance trade-off. Recall from Lemma 2 that higher values of $\sigma^2$ increase the bias of the estimator, $\hat{R}_\tau(\mu, \sigma^2, S)$. To reduce this bias, we want $\sigma^2$ to be small; e.g., $\sigma^2 = \Theta(n^{-1})$ results in negligible bias. However, if we also have $\sigma_0^2 = \Theta(1)$, then $\Gamma(\mu_0, \sigma_0^2, \mu, \sigma^2)$, which can be interpreted as the variance of the estimator, has a term, $d \ln\frac{\sigma_0^2}{\sigma^2} = O(d \ln n)$, that depends linearly on the number of features, d. When d is large, this term can dominate the risk bound. The dependence on d is eliminated when $\sigma_0^2 = \sigma^2$; but if $\sigma_0^2 = \Theta(n^{-1})$, then $\Gamma(\mu_0, \sigma_0^2, \mu, \sigma^2) = O(\|\mu - \mu_0\|^2\, n)$, which makes the risk bound vacuous.
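Before moving on, here is a sketch of the convex objective of Equation 12 and the closed-form $\sigma^2$ from Proposition 2, both as reconstructed above; the helper names and toy data are illustrative, and any gradient-based solver could be applied to the objective.

```python
import numpy as np

def wnll_lpr_objective(mu, mu0, sigma0_2, data, tau):
    """Equation 12 (as reconstructed): truncated-IPS-weighted negative log-likelihood + LPR penalty."""
    n = len(data)
    nll = 0.0
    for phi_x, a, p, r in data:                     # (features (k,d), logged action, propensity, reward)
        z = phi_x @ mu
        log_probs = z - np.log(np.sum(np.exp(z - z.max()))) - z.max()   # numerically stable log-softmax
        nll -= r * log_probs[a] / max(p, tau)
    return nll / n + np.sum((mu - mu0) ** 2) / (sigma0_2 * tau * (n - 1))

def closed_form_sigma2(data, tau, d, B, sigma0_2):
    """Closed-form posterior variance from Proposition 2 (our reconstruction), clipped at sigma0^2."""
    w_bar = np.mean([r / max(p, tau) for _, _, p, r in data])
    return min(2.0 * d / (B**2 * tau * (len(data) - 1) * w_bar), sigma0_2)

# Toy usage with hypothetical logged data: 3 actions, 4-dimensional features.
rng = np.random.default_rng(2)
data = [(rng.normal(size=(3, 4)), rng.integers(3), 0.4, float(rng.integers(2))) for _ in range(100)]
mu0 = rng.normal(size=4)
print(round(wnll_lpr_objective(mu0, mu0, sigma0_2=0.1, data=data, tau=0.05), 4))
print(round(closed_form_sigma2(data, tau=0.05, d=4, B=2.0, sigma0_2=0.1), 6))
```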
5. When the Logging Policy Is Unknown

In Section 4, we assumed that the logging policy was known and used it to construct a prior. However, there may be settings in which the logging policy is unknown. We can nonetheless construct a prior that approximates the logging policy by learning from its logged actions. At first, this idea may sound counterintuitive. After all, the prior is supposed to be fixed before drawing the training data. However, the expected value of a function of the data is constant with respect to any realization of the data. Thus, the expected estimator of the logging policy is independent of the data, and can serve as a valid prior. We then just need to show that the estimator concentrates around its mean. This type of analysis was introduced by Catoni (2007) and later developed by Lever et al. (2010), among others.

Overloading our previous notation, let $L : \mathbb{R}^d \times X \times A \to \mathbb{R}_+$ denote a loss function that measures the fit of parameters $w \in \mathbb{R}^d$, given context $x \in X$ and action $a \in A$. We will assume that L is both convex and $\beta$-Lipschitz with respect to w. This assumption is satisfied by, e.g., the negative log-likelihood. For a dataset, $S \sim (D \times \pi_0)^n$, containing logged contexts and actions, let

$$F(w, S) \triangleq \frac{1}{n} \sum_{i=1}^n L(w, x_i, a_i) + \lambda \|w\|^2 \quad (13)$$

denote a regularized objective; let

$$\hat{\mu}_0(S) \triangleq \arg\min_{w \in \mathbb{R}^d} F(w, S) \quad (14)$$

denote its minimizer; and let $\bar{\mu}_0 \triangleq \mathbb{E}_{S \sim (D \times \pi_0)^n}[\hat{\mu}_0(S)]$ denote the expected minimizer. Since $\bar{\mu}_0$ is a constant, it is independent of any realization of S. We can therefore construct a Gaussian prior around $\bar{\mu}_0$, which makes the KL divergence proportional to $\|\mu - \bar{\mu}_0\|^2$. Since F is strongly convex, its minimizer exhibits uniform algorithmic stability; meaning, it is robust to perturbations of the training data. Due to this property, the random variable $\hat{\mu}_0(S)$ concentrates around its mean, $\bar{\mu}_0$ (Liu et al., 2017). Thus, with high probability, $\|\hat{\mu}_0(S) - \bar{\mu}_0\|$ is small, and $\|\mu - \bar{\mu}_0\|$ is approximately $\|\mu - \hat{\mu}_0(S)\|$.

Theorem 3. Let H denote the hypothesis space defined in Equation 5, and let $\pi_Q$ denote the mixed logit policy defined in Equation 7. Let $\hat{\mu}_0(S)$ denote the minimizer defined in Equation 14, for a convex, $\beta$-Lipschitz loss function. For any $n \geq 1$, $\delta \in (0, 1)$, $\tau \in (0, 1)$ and $\sigma_0^2 \in (0, \infty)$, with probability at least $1 - \delta$ over draws of $S \sim (D \times \pi_0)^n$, the following holds for all $\mu \in \mathbb{R}^d$ and $\sigma^2 \in (0, \sigma_0^2]$:

$$R(\pi_Q) \leq \hat{R}_\tau(\mu, \sigma^2, S) + \sqrt{\frac{\big(\hat{R}_\tau(\mu, \sigma^2, S) - 1 + \tfrac{1}{\tau}\big)\big(\hat{\Gamma}(\hat{\mu}_0(S), \sigma_0^2, \mu, \sigma^2) + 2 \ln\tfrac{2n}{\delta}\big)}{\tau(n-1)}} + \frac{\hat{\Gamma}(\hat{\mu}_0(S), \sigma_0^2, \mu, \sigma^2) + 2 \ln\tfrac{2n}{\delta}}{\tau(n-1)}, \quad (15)$$

where $\hat{\Gamma}$ is defined like $\Gamma$ (Equation 10), but centered at $\hat{\mu}_0(S)$ and with the distance inflated by a stability term, $\epsilon_\beta$, proportional to the Lipschitz constant $\beta$, which accounts for the concentration of $\hat{\mu}_0(S)$ around its mean:

$$\hat{\Gamma}(\hat{\mu}_0(S), \sigma_0^2, \mu, \sigma^2) \triangleq \frac{\big(\|\mu - \hat{\mu}_0(S)\| + \epsilon_\beta\big)^2}{\sigma_0^2} + d \ln\frac{\sigma_0^2}{\sigma^2}.$$

It is straightforward to show that Propositions 1 and 2 hold for Theorem 3 with $\mu_0 \leftarrow \hat{\mu}_0(S)$, which motivates a two-step learning procedure for Bayesian CRM: (1) using logged data, S, but ignoring rewards, solve Equation 14 to estimate softmax parameters, $\hat{\mu}_0(S)$, that approximate the logging policy; (2) using S again, including the rewards, solve Equation 11 or 12, with $\mu_0 \leftarrow \hat{\mu}_0(S)$, to train a new mixed logit policy.

Remark 4. Throughout, we have assumed that the logged data includes the propensities, which enable IPS weighting. Given that we can learn to approximate the logging policy, it seems natural to use the learned propensities in the absence of the true propensities. In practice, this may work, though we do not provide any formal guarantees for it.

6. Experiments

Our Bayesian analysis of CRM suggests a new regularization technique, which we will henceforth refer to as logging policy regularization (LPR). Using the logging policy to construct a prior, we regularize by the squared distance between the (learned) logging policy's softmax parameters, $\mu_0$, and the posterior mean, $\mu$, over softmax parameters. In this section, we empirically verify the following claims: (1) LPR outperforms standard L2 regularization whenever the logging policy outperforms a uniform action distribution; (2) LPR is competitive with variance regularization (i.e., POEM), and is also faster to optimize; (3) when the logging policy is unknown, we can estimate it from logged data, then use the estimator in LPR with little deterioration in performance.

We will use the class of mixed logit models from Section 4. For simplicity, we choose to only optimize the posterior mean, $\mu$, assuming that the posterior variance, $\sigma^2$, is fixed to some small value, e.g., $n^{-1}$. This is inconsequential, since we will approximate the posterior action probabilities, $\pi_Q(a \mid x)$, with a softmax of the mean parameters, $\varsigma_\mu(a \mid x)$. By Lemma 2, with small $\sigma^2$, this is a reasonable approximation. In a small departure from our analysis, we add an unregularized bias term for each action.
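The following sketch checks this approximation numerically: it estimates $\pi_Q(a \mid x)$ (Equation 7) by Monte Carlo and compares it to the softmax of the mean parameters for small and large $\sigma^2$. The features and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 4
mu = rng.normal(size=d)
phi = rng.normal(size=(k, d))                  # hypothetical features phi(x, a) for one context x

def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def mixed_logit_probs(mu, sigma2, phi, num_samples=100_000):
    """Monte Carlo estimate of pi_Q(a|x) = E_{w ~ N(mu, sigma^2 I)}[softmax_w(a|x)] (Equation 7)."""
    w = mu + np.sqrt(sigma2) * rng.normal(size=(num_samples, len(mu)))
    scores = w @ phi.T                                           # (num_samples, k)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)

print(np.round(softmax(phi @ mu), 3))                 # softmax of the mean parameters
print(np.round(mixed_logit_probs(mu, 1e-3, phi), 3))  # small sigma^2: nearly identical
print(np.round(mixed_logit_probs(mu, 1.0, phi), 3))   # large sigma^2: noticeably different
```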
We evaluate two methods based on LPR. The first method, inspired by Proposition 1, combines LPR with the truncated IPS estimator:

$$\arg\min_{\mu \in \mathbb{R}^d}\ \hat{R}_\tau(\varsigma_\mu, S) + \lambda \|\mu - \mu_0\|^2, \quad (16)$$

where $\tau \in (0, 1)$ and $\lambda \geq 0$ are free parameters. We call this method IPS-LPR. The second method, inspired by Proposition 2, is a convex upper bound:

$$\arg\min_{\mu \in \mathbb{R}^d}\ -\frac{1}{n} \sum_{i=1}^n \frac{r_i \ln \varsigma_\mu(a_i \mid x_i)}{\max\{p_i, \tau\}} + \lambda \|\mu - \mu_0\|^2. \quad (17)$$

Since the first term is essentially a weighted negative log-likelihood, we call this method WNLL-LPR.

We compare the above methods to several baselines. The first baseline is IPS with standard L2 regularization, which essentially replaces $\|\mu - \mu_0\|^2$ with $\|\mu\|^2$ in Equation 16. We call this baseline IPS-L2. The second baseline is POEM (Swaminathan and Joachims, 2015a), which solves the variance-regularized objective in Equation 2 (with $\Pi$ as the class of softmax policies with parameters $\mu \in \mathbb{R}^d$) using a majorization-minimization algorithm. We also test a variant of POEM that adds L2 regularization, which we refer to as POEM-L2.

All methods require some form of IPS truncation. For IPS-L2, IPS-LPR and WNLL-LPR, we use $\max\{p_i, \tau\}^{-1}$; for POEM and POEM-L2, we use $\min\{\pi(a_i \mid x_i)/p_i,\ \tau^{-1}\}$, per Swaminathan and Joachims's original formulation. In all experiments, we set $\tau = 0.01$. Since all methods support stochastic first-order optimization, we use AdaGrad (Duchi et al., 2011) with minibatches of 100 examples. We set the learning rate to 0.1 and the smoothing parameter to one, which we find necessary for numerical stability. Unless otherwise stated, we run training for 500 epochs, with random shuffling of the training data at each epoch. All model parameters are initialized to zero, and all runs of training are seeded such that every method receives the same sequence of examples.

We report results on two benchmark image classification datasets: Fashion-MNIST (Xiao et al., 2017) and CIFAR-100 (Krizhevsky and Hinton, 2009). Fashion-MNIST consists of 70,000 (60,000 training; 10,000 testing) grayscale images from 10 categories of apparel and accessories. We extract features from each image by normalizing pixel intensities to [0, 1] and flattening the (28 × 28)-pixel grid to a 784-dimensional vector. CIFAR-100 consists of 60,000 (50,000 training; 10,000 testing) color images from 100 general object categories. As this data is typically modeled with deep convolutional neural networks, we use transfer learning to extract features expressive enough to yield decent performance with the class of log-linear models described in Section 4. Specifically, we use the last hidden layer of a pre-trained ResNet-50 network (He et al., 2016), which was trained on ImageNet (Deng et al., 2009), to output 2048-dimensional features for CIFAR-100.

Following prior work, we use a standard supervised-to-bandit conversion to simulate logged bandit feedback (Beygelzimer and Langford, 2009). We start by randomly sampling 1,000 training examples (without replacement) to train a softmax logging policy using supervised learning. We then use the logging policy to sample a label (i.e., action) for each remaining training example. The reward is one if the sampled label matches the true label, and zero otherwise. We repeat this procedure 10 times, using 10 random splits of the training data, to generate 10 datasets of logged contexts, actions, propensities and rewards.

We compare methods along two metrics. Our primary metric is the expected reward under the stochastic policy, $\mathbb{E}_{a \sim \pi(x)}[\rho(x, a)]$, averaged over the testing data. Our secondary metric, which is not directly supported by our analysis but is nonetheless of interest, is the reward of the deterministic argmax policy, $\rho(x, \arg\max_{a \in A} \pi(a \mid x))$.
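A minimal sketch of this supervised-to-bandit conversion, with hypothetical function names and a toy uniform logging policy standing in for the trained one:

```python
import numpy as np

rng = np.random.default_rng(4)

def supervised_to_bandit(features, labels, logging_policy_probs):
    """Simulate logged bandit feedback from a supervised dataset: for each example, sample an
    action from the logging policy, record its propensity, and give reward 1 iff it matches the label."""
    logs = []
    for x, y, probs in zip(features, labels, logging_policy_probs):
        a = rng.choice(len(probs), p=probs)
        logs.append((x, a, probs[a], float(a == y)))   # (context, action, propensity, reward)
    return logs

# Toy usage: a hypothetical 3-class problem with a uniform logging distribution per example.
n, d, k = 5, 4, 3
features = rng.normal(size=(n, d))
labels = rng.integers(k, size=n)
probs = np.full((n, k), 1.0 / k)
for x, a, p, r in supervised_to_bandit(features, labels, probs):
    print(a, round(p, 2), r)
```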
6.1. Logging Policy as Prior

We first investigate our claim that the logging policy is a better prior than a standard normal distribution, thus motivating LPR over L2 regularization. For each simulated log dataset, we train new policies using IPS-L2 and IPS-LPR, with regularization parameter values $\lambda = 10^{-6}, 10^{-5}, \dots, 1$. Figure 1a plots the expected test reward as a function of $\lambda$. The dotted black line indicates the performance of the logging policy. We find that IPS-LPR outperforms IPS-L2 for each value of $\lambda$; meaning, for any amount of regularization, IPS-LPR is always better. Further, while the performance of IPS-L2 degrades to that of a uniform action distribution as we over-regularize, the performance of IPS-LPR converges to that of the logging policy. This illustrates the natural intuition that a policy that acts better than random guessing is an informative prior.

Figure 1: L2 regularization vs. LPR. Each line is the average of 10 trials, with shading to indicate the 95% confidence interval. (a) Varying the amount of regularization: expected test reward as a function of $\lambda$. (b) Varying the quality of the logging policy: a spectrum of logging policies from the uniform action distribution ($\epsilon = 0$) to the trained distribution ($\epsilon = 1$).

An implication of this statement is that, as the logging policy's action distribution becomes more uniform, its efficacy as a prior should diminish. To verify this, we construct a sequence of logging policies that interpolate between the above logging policy and the uniform distribution, by multiplying the weights by an inverse-temperature parameter, $\epsilon = 0, 0.2, \dots, 1$. We then generate log datasets for each logging policy, and train new policies using IPS-L2 and IPS-LPR, with $\lambda = 0.001$. As expected (see Figure 1b), the performance of IPS-LPR gradually converges to that of IPS-L2 as the logging policy converges to uniform.

One could also ask what happens when the logging policy is worse than a uniform distribution. Indeed, though not shown here, we find that IPS-LPR performs worse than IPS-L2 in that scenario. However, one could reasonably argue that such a scenario is unlikely to occur in practice, since there is no point to deploying a logging policy that performs worse than a uniform distribution.

6.2. Comparison to POEM

As discussed earlier, LPR relates to variance regularization in that one way to minimize variance is to keep the new policy close to the logging policy. We are therefore prompted to investigate how LPR compares to variance regularization (i.e., POEM) in practice. In this experiment, our goal is to achieve the highest expected reward for each method on each log dataset, without looking at the testing data. Accordingly, we tune the regularization parameter, $\lambda$, using 5-fold cross-validation on each log dataset, with truncated IPS estimation of expected reward on the holdout set. For simplicity, we use grid search over powers of ten: $\lambda = 10^{-8}, \dots, 10^{-3}$ for LPR and $\lambda = 10^{-3}, \dots, 10^{2}$ for variance regularization. For POEM-L2, we tune the L2 regularization parameter (in the same range as LPR) by fixing the variance regularization parameter to its optimal value. During parameter tuning, we limit training to 100 epochs. Once the parameter values have been selected, we train a new policy on the entire log dataset for 500 (Fashion-MNIST) or 1000 (CIFAR-100) epochs and evaluate it on the testing data.
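The tuning procedure can be sketched as follows, assuming a placeholder train_fn that maps a training split and a regularization strength to a policy (here a dummy uniform policy, for illustration only):

```python
import numpy as np

def cross_validate_lambda(logs, lambdas, train_fn, tau=0.01, num_folds=5, seed=0):
    """Pick the regularization strength that maximizes the truncated IPS estimate of expected
    reward on held-out folds. train_fn(train_logs, lam) is assumed to return a policy function
    mapping a context to a probability vector over actions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(logs)), num_folds)
    scores = {}
    for lam in lambdas:
        fold_scores = []
        for f in range(num_folds):
            holdout = [logs[i] for i in folds[f]]
            train = [logs[i] for i in np.concatenate([folds[g] for g in range(num_folds) if g != f])]
            policy = train_fn(train, lam)
            # Truncated IPS estimate of expected reward on the holdout fold.
            est = np.mean([r * policy(x)[a] / max(p, tau) for x, a, p, r in holdout])
            fold_scores.append(est)
        scores[lam] = float(np.mean(fold_scores))
    return max(scores, key=scores.get), scores

# Toy usage: a dummy "training" function that ignores lambda and returns a uniform 2-action policy.
toy_logs = [(np.zeros(2), 0, 0.5, 1.0), (np.zeros(2), 1, 0.5, 0.0)] * 20
uniform = lambda train, lam: (lambda x: np.array([0.5, 0.5]))
best, all_scores = cross_validate_lambda(toy_logs, lambdas=[1e-6, 1e-3], train_fn=uniform)
print(best, all_scores)
```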
Table 1 reports the results of this experiment. For completeness, we include results for all proposed methods and baselines, including the logging policy.

Table 1: Test set rewards for Fashion-MNIST and CIFAR-100, averaged over 10 trials, with 5-fold cross-validation of regularization parameters at each trial.

| Method         | Fashion-MNIST (stoch.) | Fashion-MNIST (argmax) | CIFAR-100 (stoch.) | CIFAR-100 (argmax) |
|----------------|------------------------|------------------------|--------------------|--------------------|
| Logging Policy | 0.5123                 | 0.7099                 | 0.3770             | 0.4797             |
| IPS-L2         | 0.7778                 | 0.7890                 | 0.3475             | 0.3624             |
| POEM           | 0.8060                 | 0.8124                 | 0.3338             | 0.3392             |
| POEM-L2        | 0.8050                 | 0.8126                 | 0.3486             | 0.3641             |
| IPS-LPR        | 0.7955                 | 0.8154                 | 0.5553             | 0.6134             |
| WNLL-LPR       | 0.7978                 | 0.8305                 | 0.6143             | 0.6272             |
| IPS-LLPR       | 0.7950                 | 0.8153                 | 0.5455             | 0.6077             |
| WNLL-LLPR      | 0.7978                 | 0.8305                 | 0.6143             | 0.6272             |

On Fashion-MNIST, the variance regularization baselines (POEM and POEM-L2) achieve the highest expected reward, but the LPR methods (IPS-LPR and WNLL-LPR) are competitive. Indeed, the differences between these methods are not statistically significant according to a paired t-test with significance threshold 0.05. Meanwhile, all four significantly outperform IPS-L2 and the logging policy. Interestingly, WNLL-LPR performs best in terms of the argmax policy, perhaps owing to the fact that it is optimizing what is essentially a classification loss. Indeed, in classification problems with bandit feedback and binary rewards, the first term in Equation 17 is an unbiased estimator of the expected negative log-likelihood, which is a surrogate for the expected misclassification rate of the argmax policy.

The CIFAR-100 data presents a more challenging learning problem than Fashion-MNIST, since it has a much larger action set and several times as many features. It is perhaps due to these difficulties that the baselines are unable to match the performance of the logging policy, which, despite being trained on far less data, is given full supervision. Meanwhile, both LPR methods outperform the logging policy by wide margins. We believe this is due to the fact that LPR is designed with incremental training in mind. The new policy is encouraged to stay close to the logging policy not just to hedge against overfitting, but also because the logging policy is assumed to be a good starting point.

It is worth comparing the running times of POEM and LPR. Recall that POEM is a majorization-minimization algorithm designed to enable stochastic optimization of a variance-regularized objective. At each epoch of training, POEM constructs an upper bound to the objective by processing all examples in the training data. This additional computation effectively doubles POEM's time complexity relative to the LPR methods, which only require one pass over the data per epoch. On Fashion-MNIST, we find that POEM is on average 25% slower than IPS-LPR.

6.3. Learning the Logging Policy

Per Section 5, when the logging policy is unknown, we can estimate its softmax parameters, $\mu_0$, then use the estimate, $\hat{\mu}_0(S)$, in LPR. We now verify this claim empirically on Fashion-MNIST. Using the log datasets from the previous sections, we learn the logging policy with the regularized negative log-likelihood, $L(w, x, a) \triangleq -\ln \varsigma_w(a \mid x)$. We optimize this objective using 100 epochs of AdaGrad, with the same settings as the other experiments. We set the regularization parameter aggressively high, $\lambda = 0.01$, to ensure that the learned distribution does not become too peaked. Given $\hat{\mu}_0(S)$ for each log dataset, we then train new policies using IPS-LPR and WNLL-LPR, with the same $\lambda$ values tuned in Section 6.2. The results of this experiment are given in the bottom section of Table 1, as methods IPS-LLPR and WNLL-LLPR (for learned LPR).
The rewards are nearly identical to those when the logging policy is known, thus demonstrating that LPR does not require the actual logging policy in order to be effective.

7. Conclusion

We have presented a PAC-Bayesian analysis of counterfactual risk minimization, for learning Bayesian policies from logged bandit feedback. We applied our risk bound to a class of mixed logit policies, from which we derived two Bayesian CRM objectives based on logging policy regularization. Our empirical study indicated that LPR can achieve significant improvements over existing methods.

Acknowledgements

We thank Thorsten Joachims for thoughtful discussions and helpful feedback.

References

J. Aitchison and S. Shen. Logistic-normal distributions: Some properties and uses. Biometrika, 67(2):261-272, 1980.

A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In Knowledge Discovery and Data Mining, 2009.

L. Bottou, J. Peters, J. Quiñonero Candela, D. Charles, D. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207-3260, 2013.

O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of Institute of Mathematical Statistics Lecture Notes Monograph Series. Institute of Mathematical Statistics, 2007.

O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, 2011.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning, 2011.

P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning, 2009.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

E. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295-311, 2008.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

J. Langford and J. Shawe-Taylor. PAC-Bayes and margins. In Neural Information Processing Systems, 2002.

G. Lever, F. Laviolette, and J. Shawe-Taylor. Distribution-dependent PAC-Bayes priors. In Algorithmic Learning Theory, 2010.

T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, 2017.

D. McAllester. PAC-Bayesian model averaging. In Computational Learning Theory, 1999.

D. McAllester. Simplified PAC-Bayesian margin bounds. In Computational Learning Theory, pages 203-215, 2003.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012. ISBN 978-0-262-01825-8.

P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-55, 1983.

J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.

Y. Seldin, P. Auer, F. Laviolette, J. Shawe-Taylor, and R. Ortner. PAC-Bayesian analysis of contextual bandits. In Neural Information Processing Systems, 2011.

Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.

A. Strehl, J. Langford, L. Li, and S. Kakade. Learning from logged implicit exploration data. In Neural Information Processing Systems, 2010.

R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, 2000.

A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731-1755, 2015a.

A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Neural Information Processing Systems, 2015b.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.