# Balanced Policy Evaluation and Learning

Nathan Kallus
Cornell University and Cornell Tech
kallus@cornell.edu

We present a new approach to the problems of evaluating and learning personalized decision policies from observational data of past contexts, decisions, and outcomes. Only the outcome of the enacted decision is available and the historical policy is unknown. These problems arise in personalized medicine using electronic health records and in internet advertising. Existing approaches use inverse propensity weighting (or, doubly robust versions) to make historical outcome (or, residual) data look like it were generated by a new policy being evaluated or learned. But this relies on a plug-in approach that rejects data points with a decision that disagrees with the new policy, leading to high-variance estimates and ineffective learning. We propose a new, balance-based approach that also makes the data look like the new policy, but does so directly by finding weights that optimize for balance between the weighted data and the target policy in the given, finite sample, which is equivalent to minimizing worst-case or posterior conditional mean square error. Our policy learner proceeds as a two-level optimization problem over policies and weights. We demonstrate that this approach markedly outperforms existing ones both in evaluation and learning, which is unsurprising given the wider support of balance-based weights. We establish extensive theoretical consistency guarantees and regret bounds that support this empirical success.

## 1 Introduction

Using observational data with partially observed outcomes to develop new and effective personalized decision policies has received increased attention recently [1, 7, 8, 13, 23, 29, 41-43, 45]. The aim is to transform electronic health records into personalized treatment regimes [6], transactional records into personalized pricing strategies [5], and click- and like-streams into personalized advertising campaigns [8], all problems of great practical significance. Many of the existing methods rely on a reduction to weighted classification via a rejection and importance sampling technique related to inverse propensity weighting and to doubly robust estimation. However, inherent in this reduction are several shortcomings that lead to reduced personalization efficacy: it involves a naïve plug-in estimation of a denominator nuisance parameter, leading either to high variance or to scarcely-motivated stopgaps; it necessarily rejects a significant number of observations, leading in effect to smaller datasets; and it proceeds in a two-stage approach that is unnatural to the single learning task. In this paper, we attempt to ameliorate these issues by using a new approach that directly optimizes for the balance that is achieved only on average or asymptotically by the rejection and importance sampling approach. We demonstrate that this new approach provides improved performance and explain why. And, we provide extensive theory to characterize the behavior of the new methods. The proofs are given in the supplementary material.

### 1.1 Setting, Notation, and Problem Description

The problem we consider is how to choose the best of $m$ treatments based on an observation of covariates $x \in \mathcal{X} \subseteq \mathbb{R}^d$ (also known as a context). An instance is characterized by the random variables $X \in \mathcal{X}$ and
$Y(1), \dots, Y(m) \in \mathbb{R}$, where $X$ denotes the covariates and $Y(t)$ for $t \in [m] = \{1, \dots, m\}$ is the outcome that would be derived from applying treatment $t$. We always assume that smaller outcomes are preferable, i.e., $Y(t)$ corresponds to costs or negative rewards. A policy is a map $\pi : \mathcal{X} \to \Delta_m$ from observations of covariates to a probability vector in the $m$-simplex $\Delta_m = \{p \in \mathbb{R}^m_+ : \sum_{t=1}^m p_t = 1\}$. Given an observation of covariates $x$, the policy specifies that treatment $t$ should be applied with probability $\pi_t(x)$.

There are two key tasks of interest: policy evaluation and policy learning. In policy evaluation, we wish to evaluate the performance of a given policy based on historical data. This is also known as off-policy evaluation, highlighting the fact that the historical data was not necessarily generated by the policy in question. In policy learning, we wish to determine a policy that has good performance. We consider doing both tasks based on data consisting of $n$ passive, historical observations of covariate, treatment, and outcome: $S_n = \{(X_1, T_1, Y_1), \dots, (X_n, T_n, Y_n)\}$, where the observed outcome $Y_i = Y_i(T_i)$ corresponds only to the treatment $T_i$ historically applied. We use the notation $X_{1:n}$ to denote the data tuple $(X_1, \dots, X_n)$. The data is assumed to be iid. That is, the data is generated by drawing from a stationary population of instances $(X, T, Y(1), \dots, Y(m))$ and observing a censored form of this draw given by $(X, T, Y(T))$. (Thus, although the data is iid, the $t$-treated sample $\{i : T_i = t\}$ may differ systematically from the $t'$-treated sample $\{i : T_i = t'\}$ for $t \neq t'$, i.e., not necessarily just by chance as in a randomized controlled trial (RCT).)

From the (unknown) joint distribution of $(X, T)$ in the population, we define the (unknown) propensity function $\varphi_t(x) = P(T = t \mid X = x) = E[\delta_{Tt} \mid X = x]$, where $\delta_{st} = \mathbb{I}[s = t]$ is the Kronecker delta. And, from the (unknown) joint distribution of $(X, Y(t))$ in the population, we define the (unknown) mean-outcome function $\mu_t(x) = E[Y(t) \mid X = x]$. We use the notation $\varphi(x) = (\varphi_1(x), \dots, \varphi_m(x))$ and $\mu(x) = (\mu_1(x), \dots, \mu_m(x))$. Apart from being iid, we also assume the data satisfies unconfoundedness:

Assumption 1. For each $t \in [m]$: $Y(t)$ is independent of $T$ given $X$, i.e., $Y(t) \perp\!\!\!\perp T \mid X$.

This assumption is equivalent to there being a logging policy $\varphi$ that generated the data by prescribing treatment $t$ with probability $\varphi_t(X_i)$ to each instance $i$ and recording the outcome $Y_i = Y_i(T_i)$. Therefore, especially in the case where the logging policy $\varphi_t$ is in fact known to the user, the problem is often called learning from logged bandit feedback [41, 42].

In policy evaluation, given a policy $\pi$, we wish to estimate its sample-average policy effect (SAPE),
$$\mathrm{SAPE}(\pi) = \frac{1}{n}\sum_{i=1}^n \sum_{t=1}^m \pi_t(X_i)\,\mu_t(X_i),$$
by an estimator $\hat\tau(\pi) = \hat\tau(\pi; X_{1:n}, T_{1:n}, Y_{1:n})$ that depends only on the observed data and the policy $\pi$. The SAPE quantifies the average outcome that a policy would induce in the sample and hence measures its risk. SAPE is strongly consistent for the population-average policy effect (PAPE): $\mathrm{PAPE}(\pi) = E[\mathrm{SAPE}(\pi)] = E[\sum_{t=1}^m \pi_t(X)\mu_t(X)] = E[Y(\tau_\pi(X))]$, where $\tau_\pi(x)$ is defined as $\pi$'s random draw of treatment when $X = x$, $\tau_\pi(x) \sim \mathrm{Multinomial}(\pi(x))$. Moreover, if $\pi^*$ is such that $\pi^*_t(x) > 0 \iff t \in \arg\min_{s\in[m]} \mu_s(x)$, then $\hat R(\pi) = \mathrm{SAPE}(\pi) - \mathrm{SAPE}(\pi^*)$ is the regret of $\pi$ [10]. The policy evaluation task is closely related to causal effect estimation [19] where, for $m = 2$, one is interested in estimating the sample and population average treatment effects: $\mathrm{SATE} = \frac{1}{n}\sum_{i=1}^n(\mu_2(X_i) - \mu_1(X_i))$, $\mathrm{PATE} = E[\mathrm{SATE}] = E[Y(2) - Y(1)]$.

In policy learning, we wish to find a policy $\hat\pi$ that achieves small outcomes, i.e., small SAPE and PAPE. The optimal policy $\pi^*$ minimizes both $\mathrm{SAPE}(\pi)$ and $\mathrm{PAPE}(\pi)$ over all functions $\mathcal{X} \to \Delta_m$.
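To ground the notation, the following NumPy sketch (our own illustrative helper names, not code from the paper) evaluates $\mathrm{SAPE}(\pi)$ for a known mean-outcome function and draws treatments $\tau_\pi(X_i) \sim \mathrm{Multinomial}(\pi(X_i))$; treatments are indexed $0, \dots, m-1$.

```python
import numpy as np

def sape(policy, mu, X):
    """SAPE(pi) = (1/n) sum_i sum_t pi_t(X_i) mu_t(X_i) for a known mean-outcome mu.

    policy(X) and mu(X) both return (n, m) arrays: rows pi(X_i) and mu(X_i).
    """
    Pi, M = policy(X), mu(X)
    return float(np.mean(np.sum(Pi * M, axis=1)))

def draw_treatments(policy, X, rng):
    """Draw tau_pi(X_i) ~ Multinomial(pi(X_i)) for each unit via inverse-CDF sampling."""
    Pi = policy(X)
    cum = np.cumsum(Pi, axis=1)          # row-wise CDF over treatments
    u = rng.random((X.shape[0], 1))
    return (u > cum).sum(axis=1)         # number of CDF entries below u = sampled index
```

For a deterministic policy, the rows of `policy(X)` are one-hot vectors and `sape` reduces to averaging $\mu_{\tau_\pi(X_i)}(X_i)$ over the sample.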
### 1.2 Existing Approaches and Related Work

The so-called direct approach fits regression estimates $\hat\mu_t$ of $\mu_t$ on each dataset $\{(X_i, Y_i) : T_i = t\}$, $t \in [m]$. Given these estimates, it estimates SAPE in a plug-in fashion: $\hat\tau_{\mathrm{direct}}(\pi) = \frac{1}{n}\sum_{i=1}^n\sum_{t=1}^m \pi_t(X_i)\hat\mu_t(X_i)$. A policy is learned either by $\hat\pi_{\mathrm{direct}}(x) = \arg\min_{t\in[m]} \hat\mu_t(x)$ or by minimizing $\hat\tau_{\mathrm{direct}}(\pi)$ over $\pi \in \Pi$ [33]. However, direct approaches may not generalize as well as weighting-based approaches [7].

Weighting-based approaches seek weights based on covariate and treatment data, $W(\pi) = W(\pi; X_{1:n}, T_{1:n})$, that make the outcome data, when reweighted, look as though it were generated by the policy being evaluated or learned, giving rise to estimators of the form $\hat\tau_W(\pi) = \frac{1}{n}\sum_{i=1}^n W_i(\pi)\,Y_i$. Bottou et al. [8], e.g., propose to use inverse propensity weighting (IPW). Noting that [17, 18] $\mathrm{SAPE}(\pi) = E[\frac{1}{n}\sum_{i=1}^n Y_i\,\pi_{T_i}(X_i)/\varphi_{T_i}(X_i) \mid X_{1:n}]$, one first fits a probabilistic classification model $\hat\varphi$ to $\{(X_i, T_i) : i \in [n]\}$ and then estimates SAPE in an alternate but also plug-in fashion:
$$\hat\tau_{\mathrm{IPW}}(\pi) = \hat\tau_{W^{\mathrm{IPW}}(\pi)}, \qquad W_i^{\mathrm{IPW}}(\pi) = \pi_{T_i}(X_i)/\hat\varphi_{T_i}(X_i).$$
For a deterministic policy, $\pi_t(x) \in \{0, 1\}$, this can be interpreted as a rejection and importance sampling approach [29, 41]: reject samples where the observed treatment does not match $\pi$'s recommendation and up-weight those that do by the inverse (estimated) propensity. For deterministic policies $\pi_t(x) \in \{0, 1\}$, we have that $\pi_T(X) = \delta_{T\tau_\pi(X)}$, the complement of the 0-1 loss of $\tau_\pi(X)$ in predicting $T$. By scaling and constant shifts, one can therefore reduce minimizing $\hat\tau_{\mathrm{IPW}}(\pi)$ over policies $\pi \in \Pi$ to minimizing a weighted classification loss over the corresponding classifiers $\tau_\pi$, providing a reduction to weighted classification [7, 45].

Given both $\hat\mu(x)$ and $\hat\varphi(x)$ estimates, Dudík et al. [13] propose a weighting-based approach that combines the direct and IPW approaches by adapting the doubly robust (DR) estimator [11, 34, 35, 38]:
$$\hat\tau_{\mathrm{DR}}(\pi) = \frac{1}{n}\sum_{i=1}^n\sum_{t=1}^m \pi_t(X_i)\hat\mu_t(X_i) + \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\mu_{T_i}(X_i))\,\pi_{T_i}(X_i)/\hat\varphi_{T_i}(X_i).$$
$\hat\tau_{\mathrm{DR}}(\pi)$ can be understood either as debiasing the direct estimator via the reweighted residuals $\hat\epsilon_i = Y_i - \hat\mu_{T_i}(X_i)$ or as denoising the IPW estimator by subtracting the conditional mean from $Y_i$. As its bias is multiplicative in the biases of the regression and propensity estimates, the estimator is consistent so long as one of the estimates is consistent. For policy learning, [1, 13] minimize this estimator via weighted classification. Athey and Wager [1] provide a tight and favorable analysis of the corresponding uniform consistency (and hence regret) of the DR approach to policy learning.

Based on the fact that $1 = E[\pi_T(X)/\varphi_T(X)]$, a normalized IPW (NIPW) estimator is given by normalizing the weights so they sum to $n$, a common practice in causal effect estimation [2, 31]:
$$\hat\tau_{\mathrm{NIPW}}(\pi) = \hat\tau_{W^{\mathrm{NIPW}}(\pi)}, \qquad W_i^{\mathrm{NIPW}}(\pi) = n\,W_i^{\mathrm{IPW}}(\pi)\Big/\sum_{i'=1}^n W_{i'}^{\mathrm{IPW}}(\pi).$$
Any IPW approach is subject to considerable variance because the plugged-in propensities are in the denominator, so that small errors can have outsize effects on the total estimate. Another stopgap measure is to clip the propensities [14, 20], resulting in the clipped IPW (CIPW) estimator:
$$\hat\tau_{M\text{-}\mathrm{CIPW}}(\pi) = \hat\tau_{W^{M\text{-}\mathrm{CIPW}}(\pi)}, \qquad W_i^{M\text{-}\mathrm{CIPW}}(\pi) = \pi_{T_i}(X_i)/\max\{M, \hat\varphi_{T_i}(X_i)\}.$$
While effective in reducing variance, the practice remains ad hoc, loses the unbiasedness of IPW (with true propensities), and requires the tuning of $M$.
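For concreteness, the plug-in estimators above amount to only a few lines; the sketch below (our own helper, assuming 0-based treatment indices and fitted $\hat\varphi$, $\hat\mu$ supplied as arrays) computes the IPW, NIPW, clipped IPW, and DR estimates of SAPE.

```python
import numpy as np

def plug_in_estimators(Pi, T, Y, phi_hat, mu_hat, M=0.05):
    """Standard weighting-based estimators of SAPE(pi).

    Pi      : (n, m) policy probabilities pi_t(X_i)
    T       : (n,)   logged treatments (0-based)
    Y       : (n,)   observed outcomes Y_i = Y_i(T_i)
    phi_hat : (n, m) estimated propensities phi_t(X_i)
    mu_hat  : (n, m) estimated regressions mu_t(X_i)
    M       : clipping threshold for CIPW
    """
    n = len(Y)
    idx = np.arange(n)
    pi_T, phi_T = Pi[idx, T], phi_hat[idx, T]

    w_ipw = pi_T / phi_T                           # IPW weights
    w_nipw = n * w_ipw / w_ipw.sum()               # normalized to sum to n
    w_cipw = pi_T / np.maximum(M, phi_T)           # clipped propensities

    direct = np.mean(np.sum(Pi * mu_hat, axis=1))  # direct (regression) part
    resid = Y - mu_hat[idx, T]                     # residuals Y_i - mu_hat_{T_i}(X_i)

    return {
        "IPW": np.mean(w_ipw * Y),
        "NIPW": np.mean(w_nipw * Y),
        "CIPW": np.mean(w_cipw * Y),
        "DR": direct + np.mean(w_ipw * resid),
    }
```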
For policy learning, Swaminathan and Joachims [42] propose to minimize over $\pi \in \Pi$ the M-CIPW estimator plus a regularization term given by the sample variance of the estimator, a method they term POEM. The sample variance scales with the level of overlap between $\pi$ and $T_{1:n}$, i.e., the prevalence of $\pi_{T_i}(X_i) > 0$. Indeed, when the policy class $\Pi$ is very flexible relative to $n$ and if outcomes are nonnegative, then the anti-logging policy, with $\pi_{T_i}(X_i) = 0$ for all $i$, minimizes any of the above estimates. POEM avoids learning the anti-logging policy by regularizing overlap, reducing variance but limiting the novelty of $\pi$. A refinement, SNPOEM [43], uses a normalized and clipped IPW (NCIPW) estimator (and regularizes variance):
$$\hat\tau_{M\text{-}\mathrm{NCIPW}}(\pi) = \hat\tau_{W^{M\text{-}\mathrm{NCIPW}}(\pi)}, \qquad W_i^{M\text{-}\mathrm{NCIPW}}(\pi) = n\,W_i^{M\text{-}\mathrm{CIPW}}(\pi)\Big/\sum_{i'=1}^n W_{i'}^{M\text{-}\mathrm{CIPW}}(\pi).$$
Kallus and Zhou [26] generalize the IPW approach to a continuum of treatments. Kallus and Zhou [25] suggest a minimax approach to perturbations of the weights to account for confounding factors. Kallus [23] proposes a recursive partitioning approach to policy learning, the Personalization Tree (PT) and Personalization Forest (PF), that dynamically learns both weights and policy, but still uses within-partition IPW with dynamically estimated propensities.

### 1.3 A Balance-Based Approach

Shortcomings in existing approaches. All of the above weighting-based approaches seek to reweight the historical data so that it looks as though it were generated by the policy being evaluated or learned. Similarly, the DR approach seeks to make the historical residuals look like those that would be generated under the policy in question so as to remove bias from the estimated regression model of the direct approach. However, the way these methods achieve this, through various forms and versions of inverse propensity weighting, has three critical shortcomings: (1) By taking a simple plug-in approach for a nuisance parameter (propensities) that appears in the denominator, existing weighting-based methods are either subject to very high variance or must rely on scarcely-motivated stopgap measures such as clipping (see also [27]). (2) In the case of deterministic policies (such as an optimal policy), existing methods all have weights that are multiples of $\pi_{T_i}(X_i)$, which means that one necessarily throws away every data point $i$ whose treatment $T_i$ does not agree with the new policy's recommendation $\tau_\pi(X_i)$. This means that one is essentially using a much smaller dataset than is available, leading again to higher variance.² (3) The existing weighting-based methods all proceed in two stages: first estimate propensities and then plug these into a derived estimator (when the logging policy is unknown). On the one hand, this raises model specification concerns, and on the other, it is unsatisfactory when the task at hand is not inherently two-staged: we wish only to evaluate or learn policies, not to learn propensities.

²This problem is unique to policy evaluation and learning; in causal effect estimation, the IPW estimator for SATE has nonzero weights on all of the data points. For policy learning with $m = 2$, Athey and Wager [1] and Beygelzimer and Langford [7] minimize estimates of the form $\frac{1}{2}(\hat\tau(\pi) - \hat\tau(1 - \pi))$ with $\hat\tau = \hat\tau_{\mathrm{IPW}}$ or $\hat\tau = \hat\tau_{\mathrm{DR}}$. This evaluates $\pi$ relative to the uniformly random policy, and the resulting total weighted sums over $Y_i$ or $\hat\epsilon_i$ have nonzero weights whether $\pi_{T_i}(X_i) = 0$ or not. While a useful approach for reduction to weighted classification [7] or for invoking semi-parametric theory [1], it only works for $m = 2$, has no effect on learning as the centering correction is constant in $\pi$, and, for evaluation, is not an estimator for SAPE.

A new approach. We propose a balance-based approach that, like the existing weighting-based methods, also reweights the historical data to make it look as though it were generated by the policy being evaluated or learned, and potentially denoises outcomes in a doubly robust fashion, but rather than doing so circuitously via a plug-in approach, we do it directly by finding weights that optimize for balance between the weighted data and the target policy in the given, finite sample.
In particular, we formalize balance as a discrepancy between the reweighted historical covariate distribution and that induced by the target policy, and we prove that it is directly related to the worst-case conditional mean square error (CMSE) of any weighting-based estimator. Given a policy $\pi$, we then propose to choose (policy-dependent) weights $W^*(\pi)$ that optimize the worst-case CMSE and therefore achieve excellent balance while controlling for variance. For evaluation, we use these optimal weights to evaluate the performance of $\pi$ by the estimator $\hat\tau_{W^*(\pi)}$ as well as by a doubly robust version. For learning, we propose a bilevel optimization problem: minimize over $\pi \in \Pi$ the estimated risk $\hat\tau_{W^*(\pi)}$ (or a doubly robust version thereof, potentially plus a regularization term), given by the weights $W^*(\pi)$ that minimize the estimation error. Our empirical results show the stark benefit of this approach, while our main theoretical results (Thm. 6, Cor. 7) establish vanishing regret bounds.

## 2 Balanced Evaluation

### 2.1 CMSE and Worst-Case CMSE

We begin by presenting the approach in the context of evaluation. Given a policy $\pi$, consider any weights $W = W(\pi; X_{1:n}, T_{1:n})$ that are based on the covariate and treatment data. Given these weights, we can consider both a simple weighted estimator as well as a $W$-weighted doubly robust estimator given a regression estimate $\hat\mu$:
$$\hat\tau_W = \frac{1}{n}\sum_{i=1}^n W_i Y_i, \qquad \hat\tau_{W,\hat\mu} = \frac{1}{n}\sum_{i=1}^n\sum_{t=1}^m \pi_t(X_i)\hat\mu_t(X_i) + \frac{1}{n}\sum_{i=1}^n W_i\,(Y_i - \hat\mu_{T_i}(X_i)).$$
We can measure the risk of either such estimator as the conditional mean square error (CMSE), conditioned on all of the data upon which the chosen weights depend: $\mathrm{CMSE}(\hat\tau, \pi) = E[(\hat\tau - \mathrm{SAPE}(\pi))^2 \mid X_{1:n}, T_{1:n}]$. Minimal CMSE is the target of choosing weights for weighting-based policy evaluation. Basic manipulations under the unconfoundedness assumption decompose the CMSE of any weighting-based policy evaluation estimator into its conditional bias and variance:

Theorem 1. Let $\epsilon_i = Y_i - \mu_{T_i}(X_i)$ and $\Sigma = \mathrm{diag}(E[\epsilon_1^2 \mid X_1, T_1], \dots, E[\epsilon_n^2 \mid X_n, T_n])$. Define
$$B_t(W, \pi_t; f_t) = \frac{1}{n}\sum_{i=1}^n (W_i\delta_{T_i t} - \pi_t(X_i))\,f_t(X_i) \quad\text{and}\quad B(W, \pi; f) = \sum_{t=1}^m B_t(W, \pi_t; f_t).$$
Then we have that $\hat\tau_W - \mathrm{SAPE}(\pi) = B(W, \pi; \mu) + \frac{1}{n}\sum_{i=1}^n W_i\epsilon_i$. Moreover, under Asn. 1,
$$\mathrm{CMSE}(\hat\tau_W, \pi) = B^2(W, \pi; \mu) + \tfrac{1}{n^2}\,W^\top \Sigma W.$$
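As a sanity check on Thm. 1, the conditional bias $B(W, \pi; f)$ and variance $\frac{1}{n^2}W^\top\Sigma W$ can be computed directly for any candidate weights once one posits an $f$ and residual variances; a minimal sketch (our own helper names) follows.

```python
import numpy as np

def cmse_decomposition(W, Pi, T, f, sigma2):
    """Conditional bias and CMSE from Thm. 1 for given weights W and a posited f.

    W      : (n,)   weights
    Pi     : (n, m) policy probabilities pi_t(X_i)
    T      : (n,)   logged treatments, 0-based
    f      : (n, m) f_t(X_i), standing in for the unknown mu (or mu - mu_hat in Cor. 2)
    sigma2 : (n,)   conditional residual variances E[eps_i^2 | X_i, T_i] (diagonal of Sigma)
    """
    n, m = Pi.shape
    D = np.eye(m)[T]                                  # one-hot matrix of delta_{T_i t}
    # B_t(W, pi_t; f_t) = (1/n) sum_i (W_i delta_{T_i t} - pi_t(X_i)) f_t(X_i)
    B_t = np.sum((W[:, None] * D - Pi) * f, axis=0) / n
    bias = B_t.sum()                                  # B(W, pi; f)
    variance = np.sum(W**2 * sigma2) / n**2           # (1/n^2) W^T Sigma W, Sigma diagonal
    return bias, bias**2 + variance                   # conditional bias and CMSE
```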
Corollary 2. Let $\hat\mu$ be given such that $\hat\mu \perp\!\!\!\perp Y_{1:n} \mid X_{1:n}, T_{1:n}$ (e.g., trained on a split sample). Then we have that $\hat\tau_{W,\hat\mu} - \mathrm{SAPE}(\pi) = B(W, \pi; \mu - \hat\mu) + \frac{1}{n}\sum_{i=1}^n W_i\epsilon_i$. Moreover, under Asn. 1,
$$\mathrm{CMSE}(\hat\tau_{W,\hat\mu}, \pi) = B^2(W, \pi; \mu - \hat\mu) + \tfrac{1}{n^2}\,W^\top \Sigma W.$$

In Thm. 1 and Cor. 2, $B(W, \pi; \mu)$ and $B(W, \pi; \mu - \hat\mu)$ are precisely the conditional bias in evaluating $\pi$ for $\hat\tau_W$ and $\hat\tau_{W,\hat\mu}$, respectively, and $\frac{1}{n^2}W^\top\Sigma W$ is the conditional variance for both. In particular, $B_t(W, \pi_t; \mu_t)$ or $B_t(W, \pi_t; \mu_t - \hat\mu_t)$ is the conditional bias in evaluating the effect on the instances where $\pi$ assigns $t$. Note that for any function $f_t$, $B_t(W, \pi_t; f_t)$ corresponds to the discrepancy between the $f_t(X)$-moments of the measure $\nu_{t,\pi}(A) = \frac{1}{n}\sum_{i=1}^n \pi_t(X_i)\,\mathbb{I}[X_i \in A]$ on $\mathcal{X}$ and the measure $\nu_{t,W}(A) = \frac{1}{n}\sum_{i=1}^n W_i\delta_{T_i t}\,\mathbb{I}[X_i \in A]$. The sum $B(W, \pi; f)$ corresponds to the sum of moment discrepancies over the components of $f = (f_1, \dots, f_m)$ between these measures. The moment discrepancy of interest is that of $f = \mu$ or $f = \mu - \hat\mu$, but neither of these is known.

Balanced policy evaluation seeks weights $W$ that minimize a combination of imbalance, given by the worst-case value of $B(W, \pi; f)$ over functions $f$, and variance, given by the weighted norm $W^\top\Lambda W$ of the weights for a specified positive semidefinite (PSD) matrix $\Lambda$. This follows a general approach introduced by [22, 24] of finding optimal balancing weights that optimize a given CMSE objective directly rather than via a plug-in approach. Any choice of norm $\|\cdot\|$ gives rise to a worst-case CMSE objective for policy evaluation:
$$\mathcal{E}^2(W, \pi; \|\cdot\|, \Lambda) = \sup_{\|f\| \le 1} B^2(W, \pi; f) + \tfrac{1}{n^2}\,W^\top \Lambda W.$$
Here, we focus on $\|\cdot\|$ given by the direct product of reproducing kernel Hilbert spaces (RKHS): $\|f\|_{p,K_{1:m},\gamma_{1:m}} = \big(\sum_{t=1}^m \gamma_t^{-p}\|f_t\|_{K_t}^p\big)^{1/p}$, where $\|\cdot\|_{K_t}$ is the norm of the RKHS given by the PSD kernel $K_t(\cdot,\cdot) : \mathcal{X}^2 \to \mathbb{R}$, i.e., the unique completion of $\mathrm{span}(K_t(x,\cdot) : x \in \mathcal{X})$ endowed with $\langle K_t(x,\cdot), K_t(x',\cdot)\rangle = K_t(x, x')$ [see 39]. We say $\|f\|_{K_t} = \infty$ if $f$ is not in the RKHS. One example of a kernel is the Mahalanobis RBF kernel: $K_s(x, x') = \exp(-(x - x')^\top \hat S^{-1}(x - x')/s^2)$, where $\hat S$ is the sample covariance of $X_{1:n}$ and $s$ is a parameter. For such an RKHS product norm, we can decompose the worst-case objective into the discrepancies in each treatment as well as characterize it as a posterior (rather than worst-case) risk.

Lemma 1. Let $B_t^2(W, \pi_t; \|\cdot\|_{K_t}) = \frac{1}{n^2}\sum_{i,j=1}^n (W_i\delta_{T_i t} - \pi_t(X_i))(W_j\delta_{T_j t} - \pi_t(X_j))\,K_t(X_i, X_j)$ and $1/p + 1/q = 1$. Then
$$\mathcal{E}^2(W, \pi; \|\cdot\|_{p,K_{1:m},\gamma_{1:m}}, \Lambda) = \Big(\sum_{t=1}^m \gamma_t^q B_t^q(W, \pi_t; \|\cdot\|_{K_t})\Big)^{2/q} + \tfrac{1}{n^2}\,W^\top \Lambda W.$$
Moreover, if $p = 2$, $\Lambda = \Sigma$, and $\mu_t$ has a Gaussian process prior [44] with mean $f_t$ and covariance $\gamma_t^2 K_t$, then $\mathrm{CMSE}(\hat\tau_{W,f}, \pi) = \mathcal{E}^2(W, \pi; \|\cdot\|_{p,K_{1:m},\gamma_{1:m}}, \Lambda)$, where the CMSE marginalizes over $\mu$. This gives the CMSE of $\hat\tau_W$ for $f$ constant or of $\hat\tau_{W,\hat\mu}$ for $f = \hat\mu$.

The second statement in Lemma 1 suggests that, in practice, model selection of $\gamma_{1:m}$, $\Lambda$, and kernel hyperparameters such as $s$ or even $\hat S$ can be done by the marginal likelihood method [see 44, Ch. 5].
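For $p = 2$, Lemma 1 reduces $\mathcal{E}^2(W, \pi)$ to an explicit quadratic in $W$ built from kernel Gram matrices. The sketch below (our own helper names; the Mahalanobis RBF kernel is just one admissible choice of $K_t$) assembles this objective.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_rbf(X, s=1.0):
    """K(x, x') = exp(-(x - x')^T S^{-1} (x - x') / s^2) with S the sample covariance of X."""
    S_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    D2 = cdist(X, X, metric="mahalanobis", VI=S_inv) ** 2
    return np.exp(-D2 / s**2)

def worst_case_cmse(W, Pi, T, kernels, gammas, Lam):
    """E^2(W, pi) = sum_t gamma_t^2 B_t^2(W, pi_t; ||.||_{K_t}) + (1/n^2) W^T Lam W  (p = 2)."""
    n, m = Pi.shape
    D = np.eye(m)[T]                            # delta_{T_i t}
    obj = W @ Lam @ W / n**2                    # variance term
    for t in range(m):
        e_t = W * D[:, t] - Pi[:, t]            # imbalance vector for treatment t
        obj += gammas[t]**2 * (e_t @ kernels[t] @ e_t) / n**2
    return obj
```

Here `kernels[t]` is the $n \times n$ Gram matrix $K_t(X_i, X_j)$ and `Lam` plays the role of $\Lambda$; the balanced weights introduced next minimize exactly this quantity over $\mathcal{W}$.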
### 2.2 Evaluation Using Optimal Balancing Weights

Our policy evaluation estimates are given by either the estimator $\hat\tau_{W^*(\pi;\|\cdot\|,\Lambda)}$ or $\hat\tau_{W^*(\pi;\|\cdot\|,\Lambda),\hat\mu}$, where $W^*(\pi) = W^*(\pi; \|\cdot\|, \Lambda)$ is the minimizer of $\mathcal{E}^2(W, \pi; \|\cdot\|, \Lambda)$ over the space of all weights $W$ that sum to $n$, $\mathcal{W} = \{W \in \mathbb{R}^n_+ : \sum_{i=1}^n W_i = n\} = n\Delta_n$. Specifically,
$$W^*(\pi; \|\cdot\|, \Lambda) \in \operatorname*{argmin}_{W \in \mathcal{W}} \mathcal{E}^2(W, \pi; \|\cdot\|, \Lambda).$$
When $\|\cdot\| = \|\cdot\|_{p,K_{1:m},\gamma_{1:m}}$, this problem is a quadratic program for $p = 2$ and a second-order cone program for $p = 1, \infty$. Both are efficiently solvable [9]. In practice, we solve these using Gurobi 7.0.

In Lemma 1, $B_t(W, \pi_t; \|\cdot\|_{K_t})$ measures the imbalance between $\nu_{t,\pi}$ and $\nu_{t,W}$ as the worst-case discrepancy in means over functions in the unit ball of an RKHS. In fact, as a distributional distance metric, it is the maximum mean discrepancy (MMD) used, for example, for testing whether two samples come from the same distribution [16]. Thus, minimizing $\mathcal{E}^2(W, \pi; \|\cdot\|_{p,K_{1:m},\gamma_{1:m}}, \Lambda)$ is simply seeking the weights $W$ that balance $\nu_{t,\pi}$ and $\nu_{t,W}$ subject to variance regularization in $W$.

Example 1. We demonstrate balanced evaluation with a mixture of $m = 5$ Gaussians: $X \mid T \sim \mathcal{N}(\bar X_T, I_{2\times 2})$, $\bar X_1 = (0, 0)$, $\bar X_t = (\mathrm{Re}, \mathrm{Im})(e^{i2\pi(t-2)/(m-1)})$ for $t = 2, \dots, m$, and $T \sim \mathrm{Multinomial}(1/5, \dots, 1/5)$. Fix a draw of $X_{1:n}, T_{1:n}$ with $n = 100$ shown in Fig. 1a (numpy seed 0). Color denotes $T_i$ and size denotes $\varphi_{T_i}(X_i)$. The centers $\bar X_t$ are marked by a colored number. Next, we let $\mu_t(x) = \exp(1 - 1/\|x - \chi_t\|_2)$ where $\chi_t = (\mathrm{Re}, \mathrm{Im})(e^{-i2\pi t/m}/\sqrt{2})$ for $t \in [m]$, $\epsilon_i \sim \mathcal{N}(0, \sigma)$, and $\sigma = 1$. Fig. 1b plots $\mu_1(x)$. Fig. 1c shows the corresponding optimal policy $\pi^*$.

[Figure 1: The setting in Ex. 1. (a) $X_{1:n}, T_{1:n}$; (b) $\mu_1(x)$; (c) $\pi^*(x)$.]

Next we consider evaluating $\pi^*$. Fixing $X_{1:n}$ as in Fig. 1a, we have $\mathrm{SAPE}(\pi^*) = 0.852$. With $X_{1:n}$ fixed, we draw 1000 replications of $T_{1:n}, Y_{1:n}$ from their conditional distribution. For each replication, we fit $\hat\varphi$ by estimating the (well-specified) Gaussian mixture by maximum likelihood and fit $\hat\mu$ using $m$ separate gradient-boosted tree models (sklearn defaults). We consider evaluating $\pi^*$ either using the vanilla estimator $\hat\tau_W$ or the doubly robust estimator $\hat\tau_{W,\hat\mu}$, for $W$ either chosen in the 4 different standard ways laid out in Sec. 1.2, using either the true $\varphi$ or the estimated $\hat\varphi$, or chosen by the balanced evaluation approach using untuned parameters (rather than fit by marginal likelihood): the standard ($s = 1$) Mahalanobis RBF kernel for each $K_t$, $\|f\|^2 = \sum_{t=1}^m \|f_t\|_{K_t}^2$, and $\Lambda = I$. (Note that this misspecifies the outcome model, $\|\mu_t\|_{K_t} = \infty$.) We tabulate the results in Tab. 1.

Table 1: Policy evaluation performance in Ex. 1

| Weights $W$ | $\hat\tau_W$ RMSE | $\hat\tau_W$ Bias | $\hat\tau_W$ SD | $\hat\tau_{W,\hat\mu}$ RMSE | $\hat\tau_{W,\hat\mu}$ Bias | $\hat\tau_{W,\hat\mu}$ SD | $\|W\|_0$ |
|---|---|---|---|---|---|---|---|
| IPW, $\varphi$ | 2.209 | 0.005 | 2.209 | 4.196 | 0.435 | 4.174 | 13.6 ± 2.9 |
| IPW, $\hat\varphi$ | 0.568 | 0.514 | 0.242 | 0.428 | 0.230 | 0.361 | 13.6 ± 2.9 |
| .05-CIPW, $\varphi$ | 0.581 | 0.491 | 0.310 | 0.520 | 0.259 | 0.451 | 13.6 ± 2.9 |
| .05-CIPW, $\hat\varphi$ | 0.568 | 0.514 | 0.242 | 0.428 | 0.230 | 0.361 | 13.6 ± 2.9 |
| NIPW, $\varphi$ | 0.519 | 0.181 | 0.487 | 0.754 | 0.408 | 0.634 | 13.6 ± 2.9 |
| NIPW, $\hat\varphi$ | 0.463 | 0.251 | 0.390 | 0.692 | 0.467 | 0.511 | 13.6 ± 2.9 |
| .05-NCIPW, $\varphi$ | 0.485 | 0.250 | 0.415 | 0.724 | 0.471 | 0.550 | 13.6 ± 2.9 |
| .05-NCIPW, $\hat\varphi$ | 0.463 | 0.251 | 0.390 | 0.692 | 0.467 | 0.511 | 13.6 ± 2.9 |
| Balanced eval | 0.280 | 0.227 | 0.163 | 0.251 | 0.006 | 0.251 | 90.7 ± 3.2 |

We note a few observations on the standard approaches: vanilla IPW with the true $\varphi$ has zero bias but large SD (standard deviation) and hence RMSE (root mean square error); a DR approach improves on vanilla IPW with $\hat\varphi$ by reducing bias; clipping and normalizing IPW reduces SD. The balanced evaluation approach achieves the best RMSE by a clear margin, with the vanilla estimator beating all standard vanilla and DR estimators and the DR estimator providing a further improvement by nearly eliminating bias (but increasing SD). The marked success of the balanced approach is unsurprising when considering the support $\|W\|_0 = \sum_{i=1}^n \mathbb{I}[W_i > 0]$ of the weights. All standard approaches use weights that are multiples of $\pi_{T_i}(X_i)$, limiting support to the overlap between $\pi^*$ and $T_{1:n}$, which hovers around 10-16 over replications. The balanced approach uses weights that have significantly wider support, around 88-94. In light of this, the success of the balanced approach is expected.
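A minimal sketch of the balancing quadratic program for $p = 2$ follows, using cvxpy as an open-source stand-in for the Gurobi solver used in the paper; the helper names are ours, and `kernels`, `gammas`, and `Lam` are as in the earlier sketch.

```python
import cvxpy as cp
import numpy as np

def _psd_sqrt(M):
    """Symmetric square root of a PSD matrix (tiny negative eigenvalues clipped to zero)."""
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    return vecs * np.sqrt(np.clip(vals, 0, None)) @ vecs.T

def balanced_weights(Pi, T, kernels, gammas, Lam):
    """Solve W*(pi) in argmin_{W >= 0, sum(W) = n} E^2(W, pi) for p = 2 (Sec. 2.2)."""
    n, m = Pi.shape
    D = np.eye(m)[T]                                  # one-hot delta_{T_i t}
    W = cp.Variable(n)
    # x^T M x is written as ||sqrt(M) x||^2 so only PSD square roots enter the model;
    # the common 1/n^2 factor is dropped since it does not change the minimizer.
    obj = cp.sum_squares(_psd_sqrt(Lam) @ W)
    for t in range(m):
        e_t = cp.multiply(D[:, t], W) - Pi[:, t]      # W_i delta_{T_i t} - pi_t(X_i)
        obj = obj + gammas[t] ** 2 * cp.sum_squares(_psd_sqrt(kernels[t]) @ e_t)
    prob = cp.Problem(cp.Minimize(obj), [W >= 0, cp.sum(W) == n])
    prob.solve()
    return W.value
```

Given the returned weights, the vanilla and DR balanced evaluations are simply `np.mean(W * Y)` and the DR formula from Sec. 2.1 with these $W$.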
### 2.3 Consistent Evaluation

Next we consider the question of consistent evaluation: under what conditions can we guarantee that $\hat\tau_{W^*(\pi)} - \mathrm{SAPE}(\pi)$ and $\hat\tau_{W^*(\pi),\hat\mu} - \mathrm{SAPE}(\pi)$ converge to zero, and at what rates. One key requirement for consistent evaluation is a weak form of overlap between the historical data and the target policy to be evaluated using this data:

Assumption 2 (Weak overlap). $P(\varphi_t(X) > 0 \ \vee\ \pi_t(X) = 0) = 1$ for all $t \in [m]$, and $E[\pi_T^2(X)/\varphi_T^2(X)] < \infty$.

This ensures that if $\pi$ can assign treatment $t$ to $X$ then the data will have some examples of units with similar covariates being given treatment $t$; otherwise, we can never say what the outcome might look like. Another key requirement is specification. If the mean-outcome function is well-specified, in that it is in the RKHS product used to compute $W^*(\pi)$, then convergence at rate $1/\sqrt{n}$ is guaranteed. Otherwise, for a doubly robust estimator, if the regression estimate is well-specified then consistency is still guaranteed. In lieu of specification, consistency is also guaranteed if the RKHS product consists of $C_0$-universal kernels, defined below, such as the RBF kernel [40].

Definition 1. A PSD kernel $K$ on a Hausdorff $\mathcal{X}$ (e.g., $\mathbb{R}^d$) is $C_0$-universal if, for any continuous function $g : \mathcal{X} \to \mathbb{R}$ with compact support (i.e., for some compact $C$, $\{x : g(x) \neq 0\} \subseteq C$) and any $\epsilon > 0$, there exist $m$, $\alpha_1, x_1, \dots, \alpha_m, x_m$ such that $\sup_{x\in\mathcal{X}}\big|\sum_{j=1}^m \alpha_j K(x_j, x) - g(x)\big| \le \epsilon$.

Theorem 3. Fix $\pi$ and let $W^*_n(\pi) = W^*(\pi; \|\cdot\|_{p,K_{1:m},\gamma_{n,1:m}}, \Lambda_n)$ with $0 \prec \underline\lambda I \preceq \Lambda_n \preceq \bar\lambda I$ and $0 < \underline\gamma \le \gamma_{n,t} \le \bar\gamma$ for all $t \in [m]$, for each $n$. Suppose Asns. 1 and 2 hold, $\operatorname{Var}(Y \mid X)$ is a.s. bounded, $E[\sqrt{K_t(X, X)}] < \infty$, and $E[K_t(X, X)\,\pi_T^2(X)/\varphi_T^2(X)] < \infty$. Then the following two results hold:
(a) If $\|\mu_t\|_{K_t} < \infty$ for all $t \in [m]$: $\hat\tau_{W^*_n(\pi)} - \mathrm{SAPE}(\pi) = O_p(1/\sqrt{n})$.
(b) If $K_t$ is $C_0$-universal for all $t \in [m]$: $\hat\tau_{W^*_n(\pi)} - \mathrm{SAPE}(\pi) = o_p(1)$.

The key assumptions of Thm. 3 are unconfoundedness, overlap, and bounded variance. The other conditions simply guide the choice of method parameters. The two conditions on the kernel are trivial for bounded kernels like the RBF kernel. An analogous result for the DR estimator follows as a corollary.

Corollary 4. Suppose the assumptions of Thm. 3 hold and $\hat\mu_n \perp\!\!\!\perp Y_{1:n} \mid X_{1:n}, T_{1:n}$ (as in Cor. 2). Then:
(a) If $\|\hat\mu_{nt} - \mu_t\|_{K_t} = o_p(1)$ for all $t \in [m]$: $\hat\tau_{W^*_n(\pi),\hat\mu_n} - \mathrm{SAPE}(\pi) = O_p\big((\tfrac{1}{n^2}\sum_{i=1}^n W^{*2}_{ni}\operatorname{Var}(Y_i \mid X_i))^{1/2}\big) + o_p(1/\sqrt{n})$.
(b) If $\|\hat\mu_n(X) - \mu(X)\|_2 = O_p(r(n))$ with $r(n) = \Omega(1/\sqrt{n})$: $\hat\tau_{W^*_n(\pi),\hat\mu_n} - \mathrm{SAPE}(\pi) = O_p(r(n))$.
(c) If $\|\mu_t\|_{K_t} < \infty$ and $\|\hat\mu_{nt}\|_{K_t} = O_p(1)$ for all $t \in [m]$: $\hat\tau_{W^*_n(\pi),\hat\mu_n} - \mathrm{SAPE}(\pi) = O_p(1/\sqrt{n})$.
(d) If $K_t$ is $C_0$-universal for all $t \in [m]$: $\hat\tau_{W^*_n(\pi),\hat\mu_n} - \mathrm{SAPE}(\pi) = o_p(1)$.

Cor. 4(a) is the case where both the balancing weights and the regression function are well-specified, in which case the multiplicative bias disappears faster than $1/\sqrt{n}$, leaving only the irreducible residual variance and hence yielding efficient evaluation. The other cases concern the doubly robust nature of the balanced DR estimator: Cor. 4(b) requires only that the regression be consistent, and Cor. 4(c)-(d) require only that the balancing weights be consistent.

## 3 Balanced Learning

Next we consider a balanced approach to policy learning. Given a policy class $\Pi \subseteq [\mathcal{X} \to \Delta_m]$, we let the balanced policy learner yield the policy $\pi \in \Pi$ that minimizes the balanced policy evaluation, using either a vanilla or a DR estimator, plus a potential regularization term given by the worst-case/posterior CMSE of the evaluation.
We formulate this as a bilevel optimization problem:
$$\hat\pi_{\mathrm{bal}} \in \operatorname*{argmin}\Big\{\hat\tau_W + \lambda\,\mathcal{E}(W, \pi; \|\cdot\|, \Lambda) \;:\; \pi \in \Pi,\ W \in \operatorname*{argmin}_{W'\in\mathcal{W}} \mathcal{E}^2(W', \pi; \|\cdot\|, \Lambda)\Big\} \qquad (1)$$
$$\hat\pi_{\mathrm{bal\text{-}DR}} \in \operatorname*{argmin}\Big\{\hat\tau_{W,\hat\mu} + \lambda\,\mathcal{E}(W, \pi; \|\cdot\|, \Lambda) \;:\; \pi \in \Pi,\ W \in \operatorname*{argmin}_{W'\in\mathcal{W}} \mathcal{E}^2(W', \pi; \|\cdot\|, \Lambda)\Big\} \qquad (2)$$
The regularization term regularizes both the balance (i.e., worst-case/posterior bias) that is achievable for $\pi$ and the variance in evaluating $\pi$. We include this regularizer for completeness, motivated by the results of [42] (which regularize variance), but find that it is not necessary to include it in practice.

### 3.1 Optimizing the Balanced Policy Learner

Unlike [1, 7, 13, 41, 45], our (nonconvex) policy optimization problem does not reduce to weighted classification, precisely because our weights are not multiples of $\pi_{T_i}(X_i)$ (but therefore our weights also lead to better performance). Instead, like [42], we use gradient descent. For that, we need to be able to differentiate our bilevel optimization problem. We focus on $p = 2$ for brevity.

Theorem 5. Let $\|\cdot\| = \|\cdot\|_{2,K_{1:m},\gamma_{1:m}}$. Then there exists $W^*(\pi) \in \operatorname*{argmin}_{W\in\mathcal{W}} \mathcal{E}^2(W, \pi; \|\cdot\|, \Lambda)$ such that
$$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\hat\tau_{W^*(\pi)} = \tfrac{1}{n}\, Y_{1:n}^\top \tilde H\big(I - (A + (I - A)\tilde H)^{-1}(I - A)\tilde H\big) J_t,$$
$$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\hat\tau_{W^*(\pi),\hat\mu} = \tfrac{1}{n}\, \hat\epsilon_{1:n}^\top \tilde H\big(I - (A + (I - A)\tilde H)^{-1}(I - A)\tilde H\big) J_t + \tfrac{1}{n}\,\hat\mu_t(X_{1:n}),$$
$$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\mathcal{E}(W^*(\pi), \pi; \|\cdot\|, \Lambda) = D_t/\mathcal{E}(W^*(\pi), \pi; \|\cdot\|, \Lambda),$$
where $\tilde H = F(F^\top H F)^{-1}F^\top$, $F_{ij} = \delta_{ij} - \delta_{in}$ for $i \in [n]$, $j \in [n-1]$, $A_{ij} = \delta_{ij}\,\mathbb{I}[W^*_i(\pi) > 0]$, $D_{ti} = \gamma_t^2\sum_{j=1}^n K_t(X_i, X_j)(W_j\delta_{T_j t} - \pi_t(X_j))$, $H_{ij} = 2\sum_{t=1}^m \gamma_t^2\delta_{T_i t}\delta_{T_j t}K_t(X_i, X_j) + 2\Lambda_{ij}$, and $J_{tij} = 2\gamma_t^2\delta_{T_i t}K_t(X_i, X_j)$.

To leverage this result, we use a parameterized policy class such as $\Pi_{\mathrm{logit}} = \{\pi_t(x; \beta_t) \propto \exp(\beta_{t0} + \beta_t^\top x)\}$ (or kernelized versions thereof), apply the chain rule to differentiate the objective in the parameters $\beta$, and use BFGS [15] with random restarts. The logistic parametrization allows us to smooth the problem even while the solution ends up being deterministic (extreme $\beta$). This approach requires solving a quadratic program for each objective gradient evaluation. While this can be made faster by using the previous solution as a warm start, it is still computationally intensive, especially as the bilevel problem is nonconvex and both it and each quadratic program are solved in batch mode. This is a limitation of the current optimization algorithm that we hope to improve on in the future using specialized methods for bilevel optimization [4, 32, 37].

Example 2. We return to Ex. 1 and consider policy learning. We use the fixed draw shown in Fig. 1a and set $\sigma$ to 0. We consider a variety of policy learners and plot the policies in Fig. 2 along with their population regret $\mathrm{PAPE}(\pi) - \mathrm{PAPE}(\pi^*)$. The policy learners we consider are: minimizing the standard IPW and DR evaluations over $\Pi_{\mathrm{logit}}$ with $\hat\varphi, \hat\mu$ as in Ex. 1 (versions with combinations of normalized, clipped, and/or true $\varphi$, not shown, all have regret 0.26-0.5), the direct method with Gaussian process regression or gradient-boosted trees (both sklearn defaults), weighted SVM classification using IPW and DR weights (details in the supplement), SNPOEM [43], PF [23], and our balanced policy learner (1) with parameters as in Ex. 1, $\Pi = \Pi_{\mathrm{logit}}$, and $\lambda = 0$ (the DR version (2), not shown, has regret 0.08).

[Figure 2: Policy learning results in Ex. 2; numbers denote regret. Balanced policy learner 0.06; IPW 0.50; DR 0.26; IPW-SVM 0.34; DR-SVM 0.18; Gauss Proc 0.29; Grad Boost 0.20; SNPOEM 0.28; PF 0.23.]
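As one concrete (and deliberately simple) way to implement the learner in (1) with $\lambda = 0$, the sketch below parametrizes $\Pi_{\mathrm{logit}}$, solves the inner balancing problem with the `balanced_weights` helper from the earlier sketch, and optimizes the outer problem with BFGS and random restarts. It uses numerical gradients for brevity, whereas the paper differentiates the inner QP analytically via Thm. 5 and solves it with Gurobi.

```python
import numpy as np
from scipy.optimize import minimize

def logit_policy(beta, X):
    """pi_t(x; beta_t) proportional to exp(beta_t0 + beta_t^T x); beta has shape (m, d + 1)."""
    logits = beta[:, 0] + X @ beta[:, 1:].T          # (n, m)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def balanced_policy_learner(X, T, Y, kernels, gammas, Lam, m, n_starts=5, seed=0):
    """Vanilla balanced learner (1) with lambda = 0: minimize tau_hat_{W*(pi_beta)} over beta."""
    n, d = X.shape
    rng = np.random.default_rng(seed)

    def objective(beta_flat):
        Pi = logit_policy(beta_flat.reshape(m, d + 1), X)
        W = balanced_weights(Pi, T, kernels, gammas, Lam)   # inner problem: W*(pi_beta)
        return float(np.mean(W * Y))                        # tau_hat_W(pi_beta)

    best = None
    for _ in range(n_starts):                               # random restarts (nonconvex outer problem)
        beta0 = rng.normal(scale=0.1, size=m * (d + 1))
        res = minimize(objective, beta0, method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x.reshape(m, d + 1)
```

Each finite-difference gradient evaluation solves several inner QPs, which is why the analytic gradients of Thm. 5 and warm starts matter in practice.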
Example 3. Next, we consider two UCI multi-class classification datasets [30], Glass ($n = 214$, $d = 9$, $m = 6$) and Ecoli ($n = 336$, $d = 7$, $m = 8$), and use a supervised-to-contextual-bandit transformation [7, 13, 42] to compare different policy learning algorithms. Given a supervised multiclass dataset, we draw $T$ as per a multilogit model with random $\pm 1$ coefficients in the normalized covariates $X$. Further, we set $Y$ to 0 if $T$ matches the label and 1 otherwise. And we split the data 75-25 into training and test samples. Using 100 replications of this process, we evaluate the performance of learned linear policies, comparing the linear policy learners from Ex. 2. For IPW-based approaches, we estimate $\hat\varphi$ by a multilogit regression (well-specified by construction). For DR approaches, we estimate $\hat\mu$ using gradient-boosted trees (sklearn defaults). We compare these to our balanced policy learner in both vanilla and DR forms with all parameters fit by marginal likelihood, using the RBF kernel with an unspecified length scale after normalizing the data. We tabulate the results in Tab. 2. They first demonstrate that employing the various stopgap fixes to IPW-based policy learning, as in SNPOEM, indeed provides a critical edge. This is further improved upon by using a balanced approach to policy learning, which gives the best results. In this example, DR approaches do worse than vanilla ones, suggesting that XGBoost provided a poor outcome model and/or that the additional variance of DR was not compensated for by sufficiently less bias.

Table 2: Policy learning results in Ex. 3

| | IPW | DR | IPW-SVM | DR-SVM | POEM | SNPOEM | Balanced | Balanced-DR |
|---|---|---|---|---|---|---|---|---|
| Glass | 0.726 | 0.755 | 0.641 | 0.731 | 0.851 | 0.615 | 0.584 | 0.660 |
| Ecoli | 0.488 | 0.501 | 0.332 | 0.509 | 0.431 | 0.331 | 0.298 | 0.371 |

### 3.2 Uniform Consistency and Regret Bounds

Next, we establish consistency results uniformly over policy classes. This allows us to bound the regret of the balanced policy learner. We define the population and sample regret, respectively, as
$$R_\Pi(\hat\pi) = \mathrm{PAPE}(\hat\pi) - \min_{\pi\in\Pi}\mathrm{PAPE}(\pi), \qquad \hat R_\Pi(\hat\pi) = \mathrm{SAPE}(\hat\pi) - \min_{\pi\in\Pi}\mathrm{SAPE}(\pi).$$
A key requirement for these to converge is that the best-in-class policy is learnable. We quantify that using Rademacher complexity [3] and later extend our results to VC dimension. Let us define
$$\hat{\mathcal{R}}_n(\mathcal{F}) = \frac{1}{2^n}\sum_{\xi\in\{-1,+1\}^n}\,\sup_{f\in\mathcal{F}}\,\frac{1}{n}\sum_{i=1}^n \xi_i f(X_i), \qquad \mathcal{R}_n(\mathcal{F}) = E[\hat{\mathcal{R}}_n(\mathcal{F})].$$
E.g., for linear policies $\hat{\mathcal{R}}_n(\mathcal{F}) = O(1/\sqrt{n})$ [21]. If $\mathcal{F} \subseteq [\mathcal{X} \to \mathbb{R}^m]$, let $\mathcal{F}_t = \{(f(\cdot))_t : f \in \mathcal{F}\}$ and set $\mathcal{R}_n(\mathcal{F}) = \sum_{t=1}^m \mathcal{R}_n(\mathcal{F}_t)$, and similarly for $\hat{\mathcal{R}}_n(\mathcal{F})$. We also strengthen the overlap assumption.

Assumption 3 (Strong overlap). There exists $\alpha \ge 1$ such that $P(\varphi_t(X) \ge 1/\alpha) = 1$ for all $t \in [m]$.

Theorem 6. Fix $\Pi \subseteq [\mathcal{X} \to \Delta_m]$ and let $W^*_n(\pi) = W^*(\pi; \|\cdot\|_{p,K_{1:m},\gamma_{n,1:m}}, \Lambda_n)$ with $0 \prec \underline\lambda I \preceq \Lambda_n \preceq \bar\lambda I$ and $0 < \underline\gamma \le \gamma_{n,t} \le \bar\gamma$ for all $t \in [m]$, for each $n$ and $\pi \in \Pi$. Suppose Asns. 1 and 3 hold, $|\epsilon_i| \le B$ a.s., and $K_t(x, x) \le \Gamma$ for all $t \in [m]$, with $\Gamma \ge 1$. Then the following two results hold:
(a) If $\|\mu_t\|_{K_t} < \infty$ for all $t \in [m]$, then for $n$ sufficiently large, with probability at least $1 - \delta$,
$$\sup_{\pi\in\Pi}\,\big|\hat\tau_{W^*_n(\pi)} - \mathrm{SAPE}(\pi)\big| \;\le\; C\,\big(1 + \sqrt{\log(4m/\delta)}\big)\big(\mathcal{R}_n(\Pi) + 1/\sqrt{n}\big),$$
where the constant $C$ depends only on $\Gamma$, $\bar\gamma$, $m$, $\alpha$, $\underline\lambda$, $\bar\lambda$, $\|\mu\|$, and $B$.
(b) If $K_t$ is $C_0$-universal for all $t \in [m]$ and either $\mathcal{R}_n(\Pi) = o(1)$ or $\hat{\mathcal{R}}_n(\Pi) = o_p(1)$, then $\sup_{\pi\in\Pi}\big|\hat\tau_{W^*_n(\pi)} - \mathrm{SAPE}(\pi)\big| = o_p(1)$.

The proof crucially depends on simultaneously handling the functional complexities of both the policy class and the space of functions $\{f : \|f\| \le 1\}$ being balanced against. Again, the key assumptions of Thm. 6 are unconfoundedness, overlap, and bounded residuals. The other conditions simply guide the choice of method parameters. Regret bounds follow as a corollary.

Corollary 7. Suppose the assumptions of Thm. 6 hold. If $\hat\pi^{\mathrm{bal}}_n$ is as in (1), then:
(a) If $\|\mu_t\|_{K_t} < \infty$ for all $t \in [m]$: $R_\Pi(\hat\pi^{\mathrm{bal}}_n) = O_p(\mathcal{R}_n(\Pi) + 1/\sqrt{n})$.
(b) If $K_t$ is $C_0$-universal for all $t \in [m]$: $R_\Pi(\hat\pi^{\mathrm{bal}}_n) = o_p(1)$.
If $\hat\pi^{\mathrm{bal\text{-}DR}}_n$ is as in (2), then:
(c) If $\|\hat\mu_{nt} - \mu_t\|_{K_t} = o_p(1)$ for all $t \in [m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n) = O_p(\mathcal{R}_n(\Pi) + 1/\sqrt{n})$.
(d) If $\|\hat\mu_n(X) - \mu(X)\|_2 = O_p(r(n))$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n) = O_p(r(n) + \mathcal{R}_n(\Pi) + 1/\sqrt{n})$.
(e) If $\|\mu_t\|_{K_t} < \infty$ and $\|\hat\mu_{nt}\|_{K_t} = O_p(1)$ for all $t \in [m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n) = O_p(\mathcal{R}_n(\Pi) + 1/\sqrt{n})$.
(f) If $K_t$ is $C_0$-universal for all $t \in [m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n) = o_p(1)$.
And all the same results hold when replacing $\mathcal{R}_n(\Pi)$ with $\hat{\mathcal{R}}_n(\Pi)$ and/or replacing $R_\Pi$ with $\hat R_\Pi$.

## 4 Conclusion

Considering the policy evaluation and learning problems using observational or logged data, we presented a new method that is based on finding optimal balancing weights that make the data look like the target policy and that is aimed at ameliorating the shortcomings of existing methods, which include having to deal with near-zero propensities, using too few positive weights, and using an awkward two-stage procedure. The new approach showed promising signs of fixing these issues in some numerical examples. However, the new learning method is more computationally intensive than existing approaches, solving a QP at each gradient step. Therefore, in future work, we plan to explore faster algorithms that can implement the balanced policy learner, perhaps using alternating descent, and use these to investigate comparative numerics on much larger datasets.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1656996.

References

[1] S. Athey and S. Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
[2] P. C. Austin and E. A. Stuart. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28):3661-3679, 2015.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.
[4] K. P. Bennett, G. Kunapuli, J. Hu, and J.-S. Pang. Bilevel optimization and machine learning. In IEEE World Congress on Computational Intelligence, pages 25-47. Springer, 2008.
[5] D. Bertsimas and N. Kallus. The power and limits of predictive approaches to observational-data-driven optimization. arXiv preprint arXiv:1605.02347, 2016.
[6] D. Bertsimas, N. Kallus, A. M. Weinstein, and Y. D. Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210-217, 2017.
[7] A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129-138. ACM, 2009.
[8] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207-3260, 2013.
[9] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
[10] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[11] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, and C. Hansen. Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060, 2016.
[12] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265-292, 2001.
[13] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
[14] M. R. Elliott. Model averaging methods for weight trimming. Journal of Official Statistics, 24(4):517, 2008.
[15] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, 2013.
[16] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520, 2006.
[17] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663-685, 1952.
[18] G. W. Imbens. The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 2000.
[19] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[20] E. L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295-311, 2008.
[21] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793-800, 2009.
[22] N. Kallus. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321, 2016.
[23] N. Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning (ICML), pages 1789-1798, 2017.
[24] N. Kallus. Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):85-112, 2018.
[25] N. Kallus and A. Zhou. Confounding-robust policy improvement. 2018.
[26] N. Kallus and A. Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243-1251, 2018.
[27] J. D. Kang, J. L. Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523-539, 2007.
[28] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[29] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297-306. ACM, 2011.
[30] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/
[31] J. K. Lunceford and M. Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937-2960, 2004.
[32] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175-194, 2016.
[33] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
[34] J. M. Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, pages 6-10, 1999.
[35] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846-866, 1994.
[36] H. L. Royden. Real Analysis. Prentice Hall, 1988.
[37] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640-660, 2017.
[38] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096-1120, 1999.
[39] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[40] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. arXiv preprint arXiv:1003.0887, 2010.
[41] A. Strehl, J. Langford, L. Li, and S. M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pages 2217-2225, 2010.
[42] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pages 814-823, 2015.
[43] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231-3239, 2015.
[44] C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[45] X. Zhou, N. Mayer-Hamblett, U. Khan, and M. R. Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169-187, 2017.