# Learning Robust Decision Policies from Observational Data

Muhammad Osama (muhammad.osama@it.uu.se), Dave Zachariah (dave.zachariah@it.uu.se), Peter Stoica (peter.stoica@it.uu.se)
Division of Systems and Control, Department of Information Technology, Uppsala University, Sweden.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We address the problem of learning a decision policy from observational data of past decisions in contexts with features and associated outcomes. The past policy may be unknown, and in safety-critical applications, such as medical decision support, it is of interest to learn robust policies that reduce the risk of outcomes with high costs. In this paper, we develop a method for learning policies that reduce tails of the cost distribution at a specified level and, moreover, provide a statistically valid bound on the cost of each decision. By building on recent results in conformal prediction, these properties hold in finite samples, even in scenarios with uneven or no overlap between features for different decisions in the observed data. The performance and statistical properties of the proposed method are illustrated using both real and synthetic data.

## 1 Introduction

We consider data of discrete decisions $x \in \mathcal{X}$ taken in contexts with features $z \in \mathcal{Z}$. The outcome of each decision has an associated cost $y \in \mathcal{Y}$ (or, equivalently, a negative reward). For instance, we may obtain data from a hospital in which patients with features $z$ are given treatment $x$ to lower their blood pressure and $y$ denotes the change of pressure value. The observational data is drawn independently as follows:

$$(x_i, y_i, z_i) \sim p(x, y, z) = p(z)\,\underbrace{p(x|z)}_{\text{past policy}}\,p(y|x, z), \quad i = 1, \dots, n, \quad (1)$$

where we have used a causal factorization of the unknown data-generating process. The distribution of contexts is described by $p(z)$, and $p(x|z)$ summarizes a decision policy which is generally unknown. We assume that there are no unobserved confounders.

Using the $n$ training data points, our goal is to automatically improve upon the past policy. That is, we want to learn a new policy, i.e., a mapping from features to decisions $\pi(z) : \mathcal{Z} \rightarrow \mathcal{X}$, such that the outcome cost $y$ will tend to be lower than in the past. This policy partitions the feature space $\mathcal{Z}$ into $|\mathcal{X}|$ disjoint regions. A sample from the resulting data-generating process can then be expressed as

$$(x, y, z) \sim p_\pi(x, y, z) = p(z)\,\mathbb{1}\{x = \pi(z)\}\,p(y|x, z), \quad (2)$$

where $\mathbb{1}\{\cdot\}$ is the indicator function.

Figure 1: Example of synthetic patient data with features $z = \{\text{age}, \text{gender}\}$ and decisions $x \in \{0, 1\}$ on whether or not to assign a treatment against high blood pressure. The outcome cost $y \in [-30, 30]$ here is the change in blood pressure. (a) Training data $(x_i, z_i)$, where treatment and no treatment are shown with separate markers. The example illustrates a past policy with highly uneven feature overlap, such that $p(x = 1|z)$ approaches 0 for younger males and 1 for older women, respectively. (b) Probability that a change of blood pressure $y$ exceeds a level $\widetilde{y}$, when $x$ is assigned according to the past policy $p(x|z)$ vs. a proposed robust policy $\pi_\alpha(z)$ that targets lowering the tail costs at the $\alpha = 20\%$ level (dashed line).

In the treatment regime literature [17, 24, 25, 7, 21], the conventional aim is to minimize the expected cost, viz.
$$\min_{\pi \in \Pi} \mathbb{E}_\pi[y], \quad \text{where} \quad \mathbb{E}_\pi[y] \equiv \mathbb{E}\Big[\sum_{x \in \mathcal{X}} \mathbb{E}[y\,|\,x, z]\,\mathbb{1}\{x = \pi(z)\}\Big] = \sum_{x \in \mathcal{X}} \mathbb{E}\Big[\frac{\mathbb{1}\{x = \pi(z)\}}{p(x|z)}\, y\Big], \quad (3)$$

where the last identity follows if features overlap across decisions so that $p(x|z) > 0$ [9]. The optimal policy for this problem is $\pi(z) = \arg\min_{x \in \mathcal{X}} \mathbb{E}[y|x, z]$ and is determined by the unknown training distribution (1). Thus a policy must be learned from the $n$ training samples, where a fundamental source of uncertainty about outcomes is uneven feature overlap across decisions [4, 11] (see Fig. 1a for an illustration). Eq. (3) is equivalent to an off-policy learning problem in contextual bandit settings using logged data [13, 6, 19, 10, 18], but where the past policy is unknown.

A common approach is to learn a regression model of $\mathbb{E}[y|x, z]$, which in the case of binary decisions $\mathcal{X} = \{0, 1\}$ and linear models restricts the class of policies to the form $\Pi_\gamma = \{\pi(z) = \mathbb{1}\{\gamma^\top z + \gamma_0 > 0\}\}$. To avoid the sensitivity to regression model misspecification, an alternative approach is to learn a model of $p(x|z)$ and then approximately solve (3) by a numerical search over a restricted parametric class of policies $\Pi_\gamma$. In scenarios with highly uneven feature overlap, however, this approach leads to high-variance estimates of $\mathbb{E}_\pi[y]$; see the analysis in [6].

Reliably estimating the expected cost of a policy would yield an important performance certificate in safety-critical applications [21]. In such applications, however, reducing the prevalence of high-cost outcomes is a more robust strategy than reducing the expected cost, even when such tail events have low probability; see Figure 1b for an illustration. This is especially relevant when the conditional distribution of outcome costs $p(y|x, z)$ is skewed or has a dispersion that varies with $x$ [23].

In this paper, we develop a method for learning a robust policy that

- targets the reduction of the tail of the cost distribution $p_\pi(y)$, rather than $\mathbb{E}_\pi[y]$,
- provides a statistically valid limit $y_\alpha(z) \geq y$ for each decision,
- is operational even when there is little feature overlap.

Moreover, when the past policy is unknown, the robust policy can be learned using unsupervised techniques, which obviates the need to specify associative models $\widehat{\mathbb{E}}[y|x, z]$ and/or $\widehat{p}(x|z)$. The method is demonstrated using both real and synthetic data.

Figure 2: Synthetic patient data from Figure 1 along with the limit $y_\alpha(x, z)$ such that $P_x\{y \leq y_\alpha(x, z)\}$ is no lower than $1 - \alpha = 80\%$. The functions are learned using the method described in Sections 3.1 and 3.2. The training data provides evidence that treating (a) males aged 41-53 years and (b) females aged 22-40 years yields the lowest tail costs. A robust policy $\pi_\alpha(z)$ in (5) selects decisions in $\mathcal{X}$ that yield the minimum $y_\alpha(x, z)$ and therefore targets the reduction of the tail of $p_\pi(y)$ (Fig. 1b). For younger males, however, data on treatment is unavailable and the limit becomes uninformative, $y_\alpha(x, z) = \max(\mathcal{Y})$. Conversely, data on untreated older females is unavailable.

## 2 Problem formulation

We consider a policy $\pi(z)$ to be robust if it can reduce the tail costs at a specified level $\alpha$ as compared to the past policy, even for finite $n$ and highly uneven feature overlap. We define the $\alpha$-tail as all $y_\alpha$ for which the probability $P_\pi\{y \leq y_\alpha\}$ is greater than or equal to $1 - \alpha$. An optimal robust policy therefore minimizes the $(1-\alpha)$-quantile of the cost, viz.
a solution to

$$\min_{\pi \in \Pi}\; \inf\big\{\, y_\alpha \in \mathcal{Y} \;:\; P_\pi\{y \leq y_\alpha\} \geq 1 - \alpha \,\big\}. \quad (4)$$

Since a learned policy $\pi(z; \mathcal{D}_n)$ is a function of the training data $\mathcal{D}_n = \{(x_i, y_i, z_i)\}_{i=1}^{n}$, the probability is also defined over all $n$ i.i.d. training points. The problem we consider is to learn a policy in a class $\Pi_\alpha$ that approximately solves (4) and certifies each decision by a limit $y_\alpha(z) \geq y$ that holds with a probability of at least $1 - \alpha$ for finite $n$ and highly uneven feature overlap.

## 3 Learning Method

Since the cumulative distribution function (CDF) in (4) is unknown for a given policy, it is a challenging task to find the minimum $y_\alpha$ which satisfies the $(1-\alpha)$ constraint. We propose to restrict the policies to a class $\Pi_\alpha$, constructed as follows. Suppose there exists a feature-specific limit $y_\alpha(x, z)$ for a given decision $x \in \mathcal{X}$, such that $P_x\{y \leq y_\alpha(x, z)\}$ is no less than $1 - \alpha$. Here $P_x = P_{\pi \equiv x}$ is a short-hand notation when enforcing a decision $x$. Then we define $\Pi_\alpha$ as all policies $\pi(z)$ that select the $x$ with the minimum cost limit at the specified level $\alpha$. That is, the class of robust policies is

$$\Pi_\alpha = \Big\{\, \pi(z) = \arg\min_{x \in \mathcal{X}} y_\alpha(x, z) \;:\; P_x\big\{y \leq y_\alpha(x, z)\big\} \geq 1 - \alpha, \;\; \forall x \in \mathcal{X} \,\Big\}. \quad (5)$$

Learning a policy in $\Pi_\alpha$ therefore amounts to using $\mathcal{D}_n$ to learn a set of functions $\{y_\alpha(x, z)\}_{x \in \mathcal{X}}$ that satisfy the constraints. Figure 2 illustrates $y_\alpha(x, z)$, constructed using the method described below, for a binary decision variable $x$.

Remark: If there is a tie among $\{y_\alpha(x, z)\}_{x \in \mathcal{X}}$, the policy can randomly draw $x$ from the minimizers. If the limits are non-informative, $y_\alpha(x, z) = \max(\mathcal{Y})$, the method indicates that the data is not sufficiently informative for reliable cost-reducing decisions. See Figure 2 for regions in feature space where there is no data about outcomes for treated younger males and untreated older women; consequently $y_\alpha(x, z) = \max(\mathcal{Y})$ for such pairs of features and decisions.

### 3.1 Statistically valid limits

To construct feature-specific limits $y_\alpha(x, z)$ that satisfy the constraint in (5), we leverage recent results developed within the conformal prediction framework [22, 14, 1]. We begin by quantifying the divergence of a sample $(x, y, z)$ in (2) from those in $\mathcal{D}_n$, using the residual

$$s(x, y, z) = |y - \mu(x, y, z)| \geq 0, \quad (6)$$

where $\mu(x, y, z)$ is any predictor of the cost fitted using $\mathcal{D}_n \cup \{(x, y, z)\}$. Then $s(x, y, z)$ can be viewed as a random non-conformity score with a CDF $F(s)$ and quantile

$$s_{1-\alpha}(F) = \inf\{\, s : F(s) \geq 1 - \alpha \,\}. \quad (7)$$

Result 1 (Finite-sample validity). For a given level $\alpha$ and context $z$, construct a set of probability weights

$$p_k(x_i, z_i) \equiv \frac{w_k(x_i, z_i)}{\sum_{j=1}^{n} w_k(x_j, z_j) + w_k(x, z)}, \quad \text{where} \quad w_k(x, z) \equiv \frac{\mathbb{1}\{x = k\}\,p(z)}{p(z|x)\,p(x)}, \quad (8)$$

for $k \in \mathcal{X}$, and define an empirical CDF for the residuals

$$\widehat{F}_x(s) = \sum_{i=1}^{n} p_x(x_i, z_i)\,\mathbb{1}\{s_i \leq s\} + p_x(x, z)\,\mathbb{1}\{s(x, y, z) \leq s\}, \quad (9)$$

where $s_i = |y_i - \mu(x_i, y_i, z_i)|$. Then

$$y_\alpha(x, z) \equiv \max\Big\{\, y \in \mathcal{Y} \;:\; s(x, y, z) \leq s_{1-\alpha}(\widehat{F}_x) \,\Big\} \quad (10)$$

satisfies the probabilistic constraint $P_x\{y \leq y_\alpha(x, z)\} \geq 1 - \alpha$ in (5).

Proof. By expressing

$$w_k(x, z) \equiv \frac{q_k(x|z)\,p(z)}{p(x|z)\,p(z)}, \quad \text{where} \quad q_k(x|z) = \mathbb{1}\{x = k\},$$

it follows from [1, Corollary 1] that the set in (10) covers $y$ with a probability of at least $1 - \alpha$.

Computing $y_\alpha(x, z)$ requires a search for the maximum value in the set (10), which can be implemented efficiently using interval halving. Each evaluation point in the set, however, requires re-fitting $\mu(x, y, z)$ to $\mathcal{D}_n \cup \{(x, y, z)\}$ in (6). Therefore, for an efficient computation of (10), we consider the locally weighted average of costs, i.e.,

$$\mu(x, y, z) = \sum_{i=1}^{n} p_x(x_i, z_i)\, y_i + p_x(x, z)\, y, \quad (11)$$

which is linear in $y$.
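To make Result 1 and the predictor (11) concrete, the following is a minimal Python sketch of computing the limit $y_\alpha(x, z)$ for one candidate decision, assuming the normalized weights in (8) are already available. It searches a finite grid over $\mathcal{Y}$ rather than using the interval-halving search mentioned above, and the function and variable names are illustrative rather than taken from the paper; it is essentially the inner loop of Algorithm 1 below.

```python
import numpy as np

def weighted_quantile(scores, weights, level):
    """s_level(F) = inf{s : F(s) >= level} for F(s) = sum_i w_i 1{s_i <= s}, cf. (7)."""
    order = np.argsort(scores)
    cdf = np.cumsum(weights[order])
    idx = np.searchsorted(cdf, level, side="left")
    return np.inf if idx >= len(scores) else scores[order][idx]

def cost_limit(y_train, p_train, p_new, y_grid, alpha):
    """Limit y_alpha(x, z) in (10) for one candidate decision x and context z.

    y_train : (n,) array of observed costs y_i
    p_train : (n,) array of normalized weights p_x(x_i, z_i) from (8)
    p_new   : scalar weight p_x(x, z) of the test pair (x, z)
    y_grid  : finite grid over the outcome space Y
    """
    mu0 = np.sum(p_train * y_train)          # data-dependent part of predictor (11)
    kept = []
    for y in y_grid:
        mu = mu0 + p_new * y                 # locally weighted predictor (11), linear in y
        s_new = abs(y - mu)                  # non-conformity score (6) of the test pair
        s_train = np.abs(y_train - mu)       # scores of the training points
        # Empirical CDF (9): mass p_x(x_i, z_i) at s_i and p_x(x, z) at s(x, y, z).
        scores = np.append(s_train, s_new)
        weights = np.append(p_train, p_new)
        if s_new <= weighted_quantile(scores, weights, 1.0 - alpha):
            kept.append(y)                   # y belongs to the set in (10)
    # Fall back to the uninformative limit max(Y) if no grid point is retained.
    return max(kept) if kept else max(y_grid)
```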
The non-parametric model in (11) also makes the conformal limits less sensitive to model misspecification than a parametric choice of $\mu(x, y, z)$ would. This choice then defines a policy in $\Pi_\alpha$ and is illustrated in Figures 3a and 3b. Each decision of the policy can then be certified by a limit $y_\alpha(z) \geq y$, obtained by setting $y_\alpha(z) = y_\alpha(\pi(z), z)$ in (10), and the probability of exceeding the limit is bounded by $\alpha$. For the sake of clarity, the computation of $y_\alpha(z)$ is summarized in Algorithm 1. The computation of (7) requires sorting and therefore bounds the runtime of the algorithm by $\mathcal{O}(n \log n)$.

An important property of (10) is that it is statistically valid also for highly uneven feature overlap. As $p(z|x)$ approaches 0 for a given $x$, the probability weights in (8) concentrate so that $p_x(x, z) \rightarrow 1$ in (9). Consequently, $y_\alpha(x, z)$ converges to $\max(\mathcal{Y})$, so that the proposed robust policy avoids decisions $x$ in contexts $z$ for which there is little or no training data.

Remark: The proposed method readily extends to other known feature distributions $q(z)$ than the unknown $p(z)$ from which the training data was obtained. This only affects the evaluation of the weights: $p(z)$ in the numerator of (8) is replaced with $q(z)$.

Algorithm 1: Robust policy
1: Input: training data $\mathcal{D}_n$, level $\alpha$ and feature $z$
2: for $x \in \mathcal{X}$ do
3:   Compute $\{p_x(x_i, z_i)\}_{i=1}^{n}$ in (8)
4:   Set $\mu_0 = \sum_{i=1}^{n} p_x(x_i, z_i)\, y_i$
5:   Set $\mathcal{Y}_\alpha := \emptyset$
6:   for $y \in \mathcal{Y}$ do
7:     Predictor $\mu(x, y, z) := \mu_0 + p_x(x, z)\, y$
8:     Score $s(x, y, z) := |y - \mu(x, y, z)|$
9:     CDF $\widehat{F}_x(s)$ in (9)
10:    if $s(x, y, z) \leq s_{1-\alpha}(\widehat{F}_x)$ then
11:      $\mathcal{Y}_\alpha := \mathcal{Y}_\alpha \cup \{y\}$
12:    end if
13:   end for
14:   $y_\alpha(x, z) = \max(\mathcal{Y}_\alpha)$
15: end for
16: Output: $\pi(z) = \arg\min_x y_\alpha(x, z)$ and $y_\alpha(z) = y_\alpha(\pi(z), z)$

### 3.2 Unsupervised learning of weights

In randomized controlled trials, and other controlled experiments, the weights in (8) are given by a known past policy. In the general case, however, $w_k(x, z)$ must be learned from the training data. This is effectively an unsupervised learning problem, which circumvents the need for specifying associative models of $\mathbb{E}[y|x, z]$ (regression) or $p(x|z)$ (propensity score).

The categorical distribution of past decisions, $p(x)$, is readily modeled as $\widehat{p}(x = k) = n^{-1}\sum_{i=1}^{n} \mathbb{1}\{x_i = k\}$ using $\mathcal{D}_n$. The conditional feature distribution $p(z|x = k)$ can in turn be modelled by a flexible generative model, e.g., Gaussian mixture models or multinoulli models. The accuracy of the learned generative model $\widehat{p}(z|x = k)$ can then be assessed using model validation methods, e.g., [15]. If the training data contains high-dimensional covariates, we propose constructing features $z$ using dimension-reduction methods, such as autoencoders [2, 12, 16, 20]. The weights in (8) are learned via $\widehat{p}(x)$ and $\widehat{p}(z|x)$, and using $\widehat{p}(z) = \sum_{x \in \mathcal{X}} \widehat{p}(z|x)\,\widehat{p}(x)$. A sketch of this procedure is given below.

Remark: If a validated propensity score model already exists, one can simply use the equivalent form $w_k(x, z) = \mathbb{1}\{x = k\}/\widehat{p}(x|z)$.
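The following is a minimal sketch of this unsupervised step, assuming real-valued features and a Gaussian mixture model per decision arm; the helper names are hypothetical, and scikit-learn is used only for the mixture fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_generative_models(X, Z, n_components=2):
    """Fit p-hat(x = k) and a GMM for p-hat(z | x = k) for every decision arm k."""
    models, priors = {}, {}
    for k in np.unique(X):
        mask = X == k
        priors[k] = mask.mean()                                   # p-hat(x = k)
        models[k] = GaussianMixture(n_components=n_components,
                                    random_state=0).fit(Z[mask])  # p-hat(z | x = k)
    return models, priors

def normalized_weights(k, X, Z, z_new, models, priors):
    """Normalized weights p_k(x_i, z_i) and p_k(x, z) of (8) for decision k at z_new."""
    def p_z(Zq):
        # p-hat(z) = sum_j p-hat(z | x = j) p-hat(x = j)
        return sum(np.exp(models[j].score_samples(Zq)) * priors[j] for j in models)

    def w_k(Zq):
        # w_k evaluated at x = k: p-hat(z) / (p-hat(z | x = k) p-hat(x = k)), cf. (8)
        return p_z(Zq) / (np.exp(models[k].score_samples(Zq)) * priors[k])

    w_train = np.where(X == k, w_k(Z), 0.0)    # indicator 1{x_i = k} zeroes other arms
    w_new = w_k(np.atleast_2d(z_new))[0]       # candidate decision x = k enforced at z_new
    denom = w_train.sum() + w_new
    return w_train / denom, w_new / denom
```

The two outputs correspond to $\{p_x(x_i, z_i)\}_{i=1}^{n}$ and $p_x(x, z)$ and can be passed directly to the conformal-limit sketch given after (11).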
## 4 Numerical experiments

We study the statistical properties of policies in the robust class $\Pi_\alpha$, which we denote $\pi_\alpha(z)$. To illustrate some key differences between a mean-optimal policy (3) and a robust policy, we first consider a well-specified scenario in which the mean-optimal policy belongs to a given class $\Pi_\gamma$. Subsequently, we study a scenario with misspecified models using real training data. The code for the experiments is available online.

### 4.1 Synthetic data

We consider a scenario in which patients are assigned treatments to reduce their blood pressure. We create a synthetic dataset, drawing $n = 200$ data points from the training distribution (1), where the features $z = [z_1, z_2]$ represent age $z_1 \in \mathbb{R}$ and gender $z_2 \in \{0, 1\}$ (1 for females and 0 for males). The feature distribution for the population of patients, $p(z)$, is specified as

$$p(z_1|z_2) = z_2\, \mathcal{N}(30, 5) + (1 - z_2)\, \mathcal{N}(45, 5) \quad \text{and} \quad p(z_2) \equiv 0.5. \quad (12)$$

The treatment decision $x \in \{0, 1\}$ is assigned based on a past policy which we specify by the probability

$$p(x = 1|z) = z_2 \cdot 0.92\, f(z_1 - 20) + (1 - z_2) \cdot 0.20\, f(z_1 - 45), \quad (13)$$

where $f(a) = 1/(1 + \exp(-a))$ is the sigmoid function. See Figures 3a and 3b for an illustration. While the assignment mechanism is not necessarily realistic, we use it to illustrate the relevant case of uneven feature overlap.

Figure 3: (a) and (b) show the past policy $p(x|z)$ with uneven feature overlap, for males and females respectively. The robust policy $\pi_\alpha(z)$ and the mean-optimal and quantile-optimal linear policies $\pi_\gamma(z)$ and $\pi_q(z)$ are learned from $\mathcal{D}_n$. The robust policy targets reducing tail costs at the $\alpha = 20\%$ level. Note that the mean-optimal policy is to not treat, i.e., $x = 0$. (c) Complementary CDF of the costs of the past and learned policies along with the $\alpha$-level (dashed). (d) Accuracy of the limit $y_\alpha(z)$, i.e., the probability of the cost $y$ exceeding $y_\alpha(z)$, vs. different target levels $\alpha$, using 300 Monte Carlo runs.

Finally, the change in blood pressure $y$ is drawn randomly as

$$p(y|x, z) = x\, \mathcal{N}(e_1^\top z - 45, \sigma_1^2) + (1 - x)\, \mathcal{N}(e_1^\top z - 46, \sigma_0^2), \quad (14)$$

where $\sigma_1 = 0.2$ and $\sigma_0 = 20$. While the expected cost for the untreated group is lower than for the treated group, we consider the untreated patients to have more heterogeneous outcomes, so that the dispersion is higher. That is, $\mathbb{E}[y|0, z] < \mathbb{E}[y|1, z]$ while $\sigma_0 > \sigma_1$.
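As a reference for reproducing this setup, a minimal sketch of drawing $\mathcal{D}_n$ from (12)-(14) is given below. The exact scaling of the sigmoid arguments in (13) is an assumption recovered from the text, and $\mathcal{N}(30, 5)$ is read as mean 30 and standard deviation 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def draw_training_data(n=200, sigma1=0.2, sigma0=20.0):
    """Draw (x_i, y_i, z_i) ~ p(z) p(x|z) p(y|x,z) following (12)-(14)."""
    z2 = rng.binomial(1, 0.5, size=n)                                    # gender, (12)
    z1 = np.where(z2 == 1, rng.normal(30, 5, n), rng.normal(45, 5, n))   # age, (12)
    # Past policy (13); the sigmoid arguments are assumed from the extracted equation.
    p_treat = z2 * 0.92 * sigmoid(z1 - 20) + (1 - z2) * 0.20 * sigmoid(z1 - 45)
    x = rng.binomial(1, p_treat)                                         # past decisions
    mean = np.where(x == 1, z1 - 45.0, z1 - 46.0)                        # outcome means, (14)
    sd = np.where(x == 1, sigma1, sigma0)                                # treated vs. untreated dispersion
    y = rng.normal(mean, sd)                                             # change in blood pressure
    return x, y, np.column_stack([z1, z2])
```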
Since the past policy is unknown, we learn the weights (8) for $\pi_\alpha(z)$ in an unsupervised manner, using $\widehat{p}(z|x) = \widehat{p}(z_1|z_2, x)\,\widehat{p}(z_2|x)$, where $\widehat{p}(z_1|z_2, x)$ is a misspecified Gaussian model and $\widehat{p}(z_2|x)$ is a Bernoulli model. We let $\alpha = 20\%$.

As a baseline comparison, we consider minimizing the expected cost (3) for a linear policy class $\Pi_\gamma$. Since $\mathbb{E}[y|x, z]$ is a linear function of $z$, this is a well-specified scenario in which the mean-optimal policy belongs to $\Pi_\gamma$. We fit a correct linear model of the conditional mean and denote the resulting policy by $\pi_\gamma(z)$. We also compare against the quantile-optimal policy $\pi_q(z) \in \Pi_\gamma$ [23], which uses a consistent estimate of the $\alpha$-quantile of the cost.

Figures 3a and 3b show the decision $x$ taken by the robust and mean-optimal policies, $\pi_\alpha(z)$ and $\pi_\gamma(z)$, respectively, as a function of the features $z$. Note that (14) leads to a mean-optimal policy $\pi_\gamma(z) \equiv 0$, since the expected cost for the untreated group is lower than that of the treated group. By contrast, the robust policy $\pi_\alpha(z)$ takes into account that the dispersion of costs is much higher for untreated patients and therefore assigns $x = 1$ to male patients in the age span 41-54 years as well as to all females in the observable age span. To reduce the risk of increased blood pressure at the specified level, it therefore opts for treatments more often. This is highlighted in Figure 3c, which shows the cost distribution, using the complementary CDF $P_\pi\{y > \widetilde{y}\}$, for the different policies. We see that the robust policy safeguards against large increases in blood pressure, and its $(1-\alpha)$-quantile is smaller than that of the mean-optimal policy. The quantile-optimal policy yields a marginally smaller $(1-\alpha)$-quantile than the proposed robust policy, but also notably higher tail costs.

An important feature of the proposed methodology is that each decision of the policy $\pi_\alpha(z)$ has an associated limit $y_\alpha(z)$, such that the probability of exceeding it, $P_\pi\{y > y_\alpha(z)\}$, is bounded by $\alpha$. Figure 3d shows the estimated probability under the robust policy versus the target level $\alpha$. Despite the misspecification of the Gaussian model $\widehat{p}(z_1|z_2, x)$, the target $\alpha$ provides an accurate limit for the actual probability.
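The coverage curve in Figure 3d, and the corresponding check on the real data in the next section, amounts to a Monte Carlo estimate of $P_\pi\{y > y_\alpha(z)\}$ on fresh draws. A minimal sketch is shown below, where `policy`, `limit` and `sample_outcome` are hypothetical callables wrapping Algorithm 1 and the simulator.

```python
import numpy as np

def estimate_tail_behaviour(policy, limit, sample_outcome, Z_test, alpha, n_mc=300):
    """Monte Carlo estimates of P{y > y_alpha(z)} and of the (1 - alpha)-quantile
    of the cost under the learned policy, as reported in Figures 3c-3d."""
    exceed, costs = [], []
    for _ in range(n_mc):
        for z in Z_test:
            x = policy(z)                 # decision pi_alpha(z) from Algorithm 1
            y = sample_outcome(x, z)      # fresh draw from p(y | x, z)
            costs.append(y)
            exceed.append(y > limit(z))   # is the certified limit y_alpha(z) violated?
    violation = float(np.mean(exceed))    # should not exceed the target alpha (Result 1)
    tail_quantile = float(np.quantile(costs, 1.0 - alpha))  # quantile targeted in (4)
    return violation, tail_quantile
```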
### 4.2 Infant Health and Development Program data

Next, the properties of the proposed method are studied using real data. We use data from the Infant Health and Development Program (IHDP) [3], which investigated the effect of personalized home visits and intensive high-quality child care on the health of low birth-weight and premature infants [8]. The data for each child includes a 25-dimensional covariate vector $\widetilde{z}$, containing information on birth weight, head circumference, gender, etc., standardized to zero mean and unit standard deviation, as well as a decision $x \in \{0, 1\}$ indicating whether a child received special medical care or not. The outcome cost $y$ is a child's cognitive underdevelopment score (simply a sign change of a development score). The covariate distribution $p(\widetilde{z})$ is unknown. The past policy, which we also treat as unknown, was in fact a randomized control experiment, so that $p(x = 1|\widetilde{z})$ was a constant. This policy was found to be successful in improving the cognitive scores of the treated children as compared to those in the control group. To obtain outcome costs for either decision in $\mathcal{X}$, we generate $y$ synthetically by the nonlinear associative models following [8, 5]:

$$y\,|\,x = 0, \widetilde{z} \;\sim\; \mathcal{N}\big(-\exp[(\widetilde{z} + 0.5\cdot\mathbf{1})^{\top}\beta],\, \sigma_0\big) \quad \text{and} \quad y\,|\,x = 1, \widetilde{z} \;\sim\; \mathcal{N}\big(-\widetilde{z}^{\top}\beta - \omega,\, \sigma_1\big), \quad (15)$$

where we consider different dispersions below. Here $\omega$ is selected as described in [5] and [8] so that the effect of treatment on the treated is 1.5. The unknown parameter $\beta$ is a 25-dimensional vector of coefficients drawn randomly from $\{0, 0.1, 0.2, 0.3, 0.4\}$ with probabilities $\{0.6, 0.1, 0.1, 0.1, 0.1\}$, respectively, as specified in [8]. The IHDP data contains 747 data points and we randomly select a subset of $n = 600$ training points that form $\mathcal{D}_n$. The remaining 147 points are used to evaluate the learned policies.

To learn the weights (8) for the robust policy, we first reduce the 25-dimensional covariates $\widetilde{z}$ into 4-dimensional features $z = \text{enc}(\widetilde{z})$ using an autoencoder [2, sec. 7.1]. Then $\widehat{p}(z|x)$ is a learned Gaussian mixture model with four mixture components and $\widehat{p}(x)$ is a learned Bernoulli model. Together the models define (8), and a robust policy $\pi_\alpha(z)$ is learned for the target probability $\alpha = 20\%$. For comparison, we also consider a linear policy $\pi_\gamma(\widetilde{z})$ that aims to minimize the expected cost (3) using linear models of the conditional means. Note that such models are well-specified and misspecified for the treated and untreated outcomes in (15), respectively. In addition, we also compare against the quantile-optimal linear policy $\pi_q(z)$ [23].

Figure 4 shows the cost distribution for the past and learned policies when the dispersions in (15) are equal or different. We see that in the cases of equal dispersion in Figure 4a and higher dispersion for the untreated in Figure 4c, the robust and optimal linear policies reduce the $(1-\alpha)$-quantile of the cost $y_\alpha$ as compared to that of the past policy, where the robust policy does slightly better. Since the treated group tends to have a lower mean cost than the untreated group in the training data, the linear policy tends to assign $x = 1$ to most patients in the test data. Moreover, the misspecified linear model leads to biased estimates of the expected cost, and the resulting policy $\pi_\gamma(\widetilde{z})$ cannot fully capture the nonlinear partition of the feature space implied by the mean-optimal policy based on $\mathbb{E}[y|x, \widetilde{z}]$.

Figure 4e shows the cost distribution when the treatment outcome costs have higher dispersion. The tendency toward treatment assignment by the linear policies results in higher tail costs. By contrast, the robust policy adapts to the higher cost dispersion in the treated group and assigns fewer treatments, which results in smaller tail costs. In this case, the tail cost is more similar to that of the past policy, since its proportion of (random) treatment assignments is small in the data.

The robust methodology also provides a certificate $y_\alpha(z)$ for each decision, as illustrated in Figures 4b, 4d and 4f with respect to two standardized covariates for each child in the test set. The probability that the cost $y$ exceeds $y_\alpha(z)$ is 18.6%, estimated using 500 Monte Carlo runs, which is close to and no greater than the targeted probability $\alpha = 20\%$, despite the model misspecification of $\widehat{p}(z|x)$.

Figure 4: IHDP data and cognitive underdevelopment scores $y$. First column: complementary CDFs for the learned robust policy $\pi_\alpha$ and the linear policies, respectively, as compared to the past policy. We consider three different scenarios in (15): (a) $\sigma_1 = \sigma_0 = 1$, (b) $\sigma_1 = 1$, $\sigma_0 = 5$, and (c) $\sigma_1 = 5$, $\sigma_0 = 1$. Second column: limit $y_\alpha(z)$ (color bar) provided by the robust methodology, plotted against two standardized covariates, the neonatal health index and the mother's age, for each child in the test data. Each unit corresponds to a standard deviation from the mean.

## 5 Conclusion

We have developed a method for learning decision policies from observational data that lower the tail costs of decisions at a specified level. This is relevant in safety-critical applications. By building on recent results in conformal prediction, the method also provides a statistically valid bound on the cost of each decision. These properties are valid under finite samples, even in scenarios with highly uneven overlap between features for different decisions in the observed data. Using both real and synthetic data, we illustrated the statistical properties and performance of the proposed method.

## Broader Impact

We believe the work presented herein can provide a useful tool for decision support, especially in safety-critical applications where it is of interest to reduce the risk of incurring high costs. The methodology can leverage large and heterogeneous data on past decisions, contexts and outcomes to improve human decision making, while providing an interpretable statistical guarantee for its recommendations. It is important, however, to consider the population from which the training data is obtained and used. If the method is deployed in a setting with a different population, it may indeed fail to provide cost-reducing decisions.
Moreover, if there are categories of features that are sensitive and subject to unwarranted biases, the population may need to be split into appropriate subpopulations, or else the biases can be reproduced in the learned policies.

## Acknowledgments and Disclosure of Funding

This research was partially supported by the Swedish Research Council (contract no. 2018-05040) and by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.

## References

[1] Rina Foygel Barber, Emmanuel J. Candes, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[3] Jeanne Brooks-Gunn, Fong-ruey Liaw, and Pamela Kato Klebanov. Effects of early intervention on cognitive function of low birth weight preterm infants. The Journal of Pediatrics, 120(3):350-359, 1992.
[4] Alexander D'Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.
[5] Vincent Dorie. Non-parametrics for Causal Inference, 2016.
[6] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
[7] Sheng Fu, Qinying He, Sanguo Zhang, and Yufeng Liu. Robust outcome weighted learning for optimal individualized treatment rules. Journal of Biopharmaceutical Statistics, 29(4):606-624, 2019.
[8] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217-240, 2011.
[9] Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[10] Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. In International Conference on Learning Representations, 2018.
[11] Fredrik D. Johansson, Dennis Wei, Michael Oberst, Tian Gao, Gabriel Brat, David Sontag, and Kush R. Varshney. Characterization of overlap in observational studies. arXiv preprint arXiv:1907.04138, 2019.
[12] Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307-392, 2019.
[13] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817-824, 2008.
[14] Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094-1111, 2018.
[15] Andreas Lindholm, Dave Zachariah, Petre Stoica, and Thomas B. Schön. Data consistency approach to model validation. IEEE Access, 7:59788-59796, 2019.
[16] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[17] Min Qian and Susan A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
[18] Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly robust off-policy evaluation with shrinkage. arXiv preprint arXiv:1907.09623, 2019.
[19] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814-823, 2015.
[20] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
[21] A. A. Tsiatis, M. Davidian, S. T. Holloway, and E. B. Laber. Dynamic Treatment Regimes: Statistical Methods for Precision Medicine. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2019.
[22] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
[23] Lan Wang, Yu Zhou, Rui Song, and Ben Sherwood. Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243-1254, 2018.
[24] Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106-1118, 2012.
[25] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R. Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169-187, 2017.