# counterfactual_learning_with_multioutput_deep_kernels__14bb248b.pdf Published in Transactions on Machine Learning Research (11/2022) Counterfactual Learning with Multioutput Deep Kernels Alberto Caron alberto.caron.19@ucl.ac.uk Department of Statistical Science, University College London The Alan Turing Institute, London, UK Gianluca Baio g.baio@ucl.ac.uk Department of Statistical Science, University College London Ioanna Manolopoulou i.manolopoulou@ucl.ac.uk Department of Statistical Science, University College London Reviewed on Open Review: https: // openreview. net/ forum? id= i GREAJd ULX In this paper, we address the challenge of performing counterfactual inference with observational data via Bayesian nonparametric regression adjustment, with a focus on highdimensional settings featuring multiple actions and multiple correlated outcomes. We present a general class of counterfactual multi-task deep kernels models that estimate causal effects and learn policies proficiently thanks to their sample efficiency gains, while scaling well with high dimensions. In the first part of the work, we rely on Structural Causal Models (SCM) to formally introduce the setup and the problem of identifying counterfactual quantities under observed confounding. We then discuss the benefits of tackling the task of causal effects estimation via stacked coregionalized Gaussian Processes and Deep Kernels. Finally, we demonstrate the use of the proposed methods on simulated experiments that span individual causal effects estimation, off-policy evaluation and optimization. 1 Introduction In the recent years, there has been a surge of attention towards the use of machine learning methods for causal inference under observational data. The desire for highly personalized decision-making is indeed pervasive in many disciplines such as precision medicine, web advertising and education (Hodson, 2016). However, in these fields, exploration of policies in the real-world is usually very costly and potentially harmful. For example, a randomized clinical trial targeted at assessing the effects of a new surgical technique entails high expenses in terms of patient recruitment and design, in addition to concerns about safety and potential side effects on both treated and non-treated individuals. Thus, in the attempt to answer counterfactual questions about policy interventions, such as what would have happened if individual i undertook treatment A instead of treatment B?", one cannot typically rely on randomized experimental data. On the other hand, observational data are abundant and more readily accessible at relatively lower costs, although they suffer from sample selection bias issues arising because of confounding factors. Counterfactual quantities can still be identified and estimated in settings with observed (and in some cases unobserved) confounders, provided that some assumptions hold. This work is aimed especially at tackling the problem of carrying out counterfactual inference using observational data, with a focus on high-dimensional settings characterized by a large number of covariates, multiple discrete actions (or manipulative variables) and multiple correlated outcomes of interest. To this end, we present a general class of counterfactual multi-task Deep Kernel Learning models (Counter DKL) that efficiently adapt to multiple actions and outcomes settings by exploiting existing correlations, while inheriting the appealing scaling properties of DKL (Wilson et al., 2016) under large samples and large number of predictors, and preserving the Bayesian uncertainty quantification properties of Gaussian Processes (GPs). Published in Transactions on Machine Learning Research (11/2022) Counter DKL essentially consists of two joint, end-to-end, components: i) a deep learning architecture that learns a lower-dimensional representation of high-dimensional input space; ii) a multitask GP (kernel-based) component (Teh et al., 2005; Bonilla et al., 2008; Álvarez et al., 2012; Bohn et al., 2019) placed on the lower-dimensional representations that can learn the posterior joint distribution of the outcomes conditional on inputs and actions, by avoiding parameter proliferation and stability issues that arise from placing a multitask GP kernel directly on the high-dimensional input space. Specifically, our contributions can be listed as follows: We extend the class of causal multitask GPs (Alaa & van der Schaar, 2017; 2018) to multiple actions and multiple outcomes designs, by specifying a stacked coregionalization model, highlighting its advantages in tackling poor overlap" regions and disadvantages in terms of scalability. We introduce the class of Counter DKL for causal inference under multiple actions/outcomes, discussing their benefits over causal multitask GPs in high-dimensional settings (in terms of covariates, actions and outcomes). We demonstrate the use of Counter DKL with simulated experiments on causal effects estimation, offpolicy evaluation (OPE) and learning off-policy (OPL) problems (Dudík et al., 2011; Dudík et al., 2014; Farajtabar et al., 2018; Kallus, 2021), by providing also an Python implementation of the models, based on GPy Torch1. Studies with multiple outcomes are quite common in applied research, as policy decisions are rarely based on a single outcome, but rather on a profile of different outcomes that might exhibit positive or negative correlation. As an example, in the medical domain, prescription of a specific treatment, such as anti-coagulants to lower cholesterol level, depends on the risk of myocardial infarction (primary outcome), as well as the risk of bleeding (unwanted, correlated, side effect). 1.1 Related Work Many significant contributions in the literature on non-linear regression techniques for counterfactual/causal learning have been made in the recent years. These include works on heterogeneous (or individual) treatment effects estimation (Shalit et al., 2017; Künzel et al., 2017; Alaa & van der Schaar, 2017; 2018; Yao et al., 2018; Wager & Athey, 2018; Nie et al., 2020), and on the related field of off-policy evaluation and learning (Dudík et al., 2014; Farajtabar et al., 2018; Kallus, 2018; Athey & Wager, 2021), that focus more on the prescriptive, rather than predictive, goal. Among the several contributions, we particularly draw attention to the early work of Hill (2011) that first proposed Bayesian nonparametric methods (specifically Bayesian Additive Regression Trees) for regression adjustment in estimating causal effects, followed by the extension of Hahn et al. (2020); Caron et al. (2022b), and the seminal work of Alaa & van der Schaar (2017; 2018) who first proposed multitask Gaussian Processes for causal effects estimation tasks, but limited to contexts with binary actions and a single (continuous) outcome. All these works highlight the advantages of Bayesian nonparametric models in terms of flexible non-linear function approximation and uncertainty quantification. More specifically, the advantage of multitask GPs and DKL for causal learning lies in their sample efficiency gains. As observational data often feature imbalanced action arms with relatively few instances, sample splitting can result in under or over-fitting of the surfaces of interest. Multitask GPs and DKL both allow for information sharing when learning actions and outcomes correlated tasks, while only DKL guarantees better scalability and also makes the choice of a GP kernel less challenging, as the deep learning structure can itself learn arbitrarily complex non-linear functions (Wilson et al., 2016). 2 Problem Framework In order to introduce the problems of causal effects identification and estimation, we borrow the notation from do-calculus and Structural Causal Models (SCM) (Pearl, 2009), and start by defining the latter: Definition 2.1 (SCM). A Structural Causal Model is a 4-tuple E, V, F, p(ε) consisting of (subscript j indicates a random element in the set): 1 Full code at: https://github.com/albicaron/Counter DKL Published in Transactions on Machine Learning Research (11/2022) 1. E: denoting a set of exogenous variables, defined as variables determined outside of the model. 2. V: denoting a set of endogenous variables, defined as variables determined inside the model. 3. F: a set of functions fj F mapping each element εj E and every parent variables of Vj V which we denote by pa(Vj) to the endogenous variables Vj V, fj : εj pa(Vj) 7 Vj. 4. p(εj): a probability distribution over εj U. A SCM implies a causal Directed Acyclic Graph (DAG), where nodes in the graph depict variables, i.e. εj and Vj, while edges denote the functional causal relationships fj F. In the following paragraphs, we examine the problem of identifying and estimating these causal relationships fj F, under observational data, relying on SCMs and causal DAGs. Figure 1: Causal DAG depicting the setting of interest, where arrows represent causal relationships. Set of observable covariates X X satisfies the backdoor criterion, in that conditioning on them allows for identification of the causal effects of action A A on the outcomes Y Y. Suppose we have access to an observational dataset Di = {Xi, Ai, Yi} p( ), with i {1, ..., N}, where Xi X is a set of covariates, Ai A a set of discrete actions2 (or manipulative variables), and Yi RM a set of M different outcomes. Here we consider continuous type of outcomes for simplicity, but we extend the methods also to discrete outcomes (Milios et al., 2018), as in the experimental Section 5.2. The overarching goal is to identify and estimate the (average) effects of intervening on the manipulative variable Ai, by setting it equal to some value a, on the outcomes Yi. We denote the main quantity of interest, the joint interventional probability distribution of all the outcomes conditional on Ai = a by using the do-operator as p(Y |do(A = a)). We assume that the SCM is fully described by the following set of equations: Xi = f X(εi,X), Ai = f A pa(Ai), εi,A = f A(Xi, εi,A) (1) Yi = f Y pa(Yi), εi,Y = f Y (Xi, Ai, εi,Y ) , where: f Y ( ) is a vector-valued function if multiple outcomes are considered; pa(Ai) denotes parent variables (or causes) of Ai; εi,j are error terms with a distribution p(εi,j). The equivalent causal DAG is represented in Figure 1, although we note that the methods presented in this work extend to other types of causal DAGs. We make two standard assumption for identification of the causal effect A Y in this scenario. The first is unconfoundedness, stating that there are no unobserved confounders, or equivalently that the set of observed covariates Xi X is causally sufficient, in the sense that conditioning on Xi allows to identify the causal association between Ai and Yi. Using Pearl s terminology, we equivalently say that X satisfies the backdoor criterion (see A.1 in the appendix for a formal definition), in that it blocks all backdoor paths" from A to all the outcomes Y (Pearl, 2009). The second assumption is overlap (also known as positivity or common support assumption). Overlap requires that 0 < p(Ai = a|Xi = x) < 1, a A - i.e. that the observed actions allocation given Xi = x is never deterministic, so we could theoretically observe data points for which Xi = x in each of the discrete arms of A (A.2 in the appendix). This ensures that we have comparable units in terms of Xi in each action arm, so we can approximate f Y (Xi, Ai = a, εi,Y ) well enough. Violation of overlap for portions of X undermines generalization and extrapolation of model s prediction in those regions; thus, one must be careful as to which subpopulation, defined by a common support Xover X, to target to estimate causal effects. Under these two assumptions, the multivariate interventional distribution p(Y | do(A = a), X = x) can be recovered via backdoor adjustment as p(Y | do(A = a), X = x) = p(Y | A = a, X = x) (Pearl, 2009) (see Appendix A). Hence, provided that the covariates Xi, direct common causes of Ai and Yi, satisfy the backdoor criterion, we can estimate unbiased causal effects with the observed quantities in Di = {Xi, Ai, Yi}. 2Extensions to settings with continuous action space A R (entailing a dose-response curve estimation) are non-trivial (Imai & van Dyk, 2004), and the multitask learning paradigm is not specifically suitable for them. Published in Transactions on Machine Learning Research (11/2022) 3 Counterfactual Learning with Multitask GPs For ease of exposition, let us consider the simple case depicted by the causal DAG in Figure 1a, with a single continuous outcome Yi R, with i {1, ..., N}. We tackle the problem of estimating p(Y |do(A = a)) via non-linear regression-adjustment (Johansson et al., 2016; Shalit et al., 2017; Künzel et al., 2017; Nie & Wager, 2020; Caron et al., 2022a). In particular, we assume, in line with most of the previous works, additive noise structure3, such that the outcome functional can be written as: Yi = f Y (Xi, Ai) + εi,Y , E(εi,Y ) = 0 . (2) There are different ways in which one can derive an estimator for p(Y |do(A = a)) and its moments, e.g. E(Y |do(A = a)), from (2) (Künzel et al., 2017; Caron et al., 2022a). Alaa & van der Schaar (2017; 2018) first proposed the use of multitask learning via Gaussian Process regression, in the specific context of conditional average treatment effects estimation, which is defined, assuming binary Ai {0, 1}, as the quantity τ(xi) = E[Yi|do(Ai = 1), xi] E[Yi|do(Ai = 0), xi]. The idea behind causal multitask GPs is to view the D = |A| interventional quantities Ya, where D is the number of discrete action arms, as the output from a vector-valued function f Y ( ) : X 7 RD (plus noise), modelled with a GP prior: f Y ( ) GP m( ), K( , ) , (3) with mean m(xi) RD and covariance/kernel function K(xi, xj) RD RD, given two P-dimensional input points xi, xj X for units i and j. Given the likelihood function as a multivariate Gaussian p(yi|f Y , xi, Σ) N(f Y (xi), Σ), where Σ RD RD is the error covariance diagonal matrix with {σ2 a}D a=1 on the diagonal and yi RD an output point, the posterior predictive distribution for a train set covariate realization xi X, train set outcome realization yi R and a test set covariate realization x j X is obtained as, assuming zero prior mean m( ) = 0 for simplicity: p f Y (x j) | (xi, yi), f Y , ϕ N f Y (x j), K (x j, x j) , f Y (x j) = K(x j, xi)Hy , K (x j, x j) = K(x j, x j) K(x j, xi)HK (x j, xi) , where H = h K(xi, xi) + Σ i 1 , and where ϕ denotes the model parameters and f Y (x j) the function evaluated at a test point x j. Under zero prior mean m( ) = 0, the multitask GP in (3) is fully parametrized by its kernel function K( , ). The structure of the kernel function in a multitask GP is what induces task-relatedness when fitting the multi-valued surface f Y ( ). 3.1 The multitask kernel The simplest specification for the multitask kernel matrix is given by the separable kernels structure, which assumes single entries in K( , ) to be of the form ka,a (xi, xj) = k(xi, xj) k A(a, a ) = k(xi, xj) ba,a , with action a A = {1, ..., D}. Here, k(xi, xj) represents a base kernel (e.g. linear, squared exponential, Matérn, etc.) while ba,a = k A(a, a ) is the generic entry of the D D coregionalization matrix B, which contains the parameters governing task-relatedness over the actions A. In the trivial case where ba,a = 0 we have that tasks a and a are uncorrelated, i.e. actions a and a are unrelated in the way they affect the outcome Y . A slightly more general framework, which we adopt in this work, is given by the sum of separable kernels structure (Álvarez et al., 2012). This assumes that the single entry of K( , ) reads ka,a (xi, xj) = PQ q=1 Bq kq(xi, xj), i.e. the sum of Q different coregionalization matrices Bq with associated base kernel kq( , ). In compact matrix notation this translates into K(X, X ) = PQ q=1 Bq Kq(X, X ) for two different input matrices X, X X, where is the Kronecker product. Imposing a sum of separable kernels structure is equivalent to assuming that the collection of action-specific functions {f Ya( )}D a=1 generates from Q D 3Technically this makes the setup semi-parametric, rather than fully nonparametric. Published in Transactions on Machine Learning Research (11/2022) common underlying independent latent GP functions {uq( )}Q q=1, parametrized by their base kernel kq( , ), that is cov(uq(xi), uq (xj)) = kq(xi, xj) (Álvarez et al., 2012). In terms of the form of the coregionalization matrices Bq, with q {1, ..., Q}, we will follow the linear model of coregionalization (LMC), which assumes that each Bq is equal to Bq = Lq L q, with single entries bq; (a,a ) = PRq r=1 αr a,qαr a ,q. Rq represents the number of GP samples obtained from the same latent GP function q, uq( ). Thus, adopting the LMC for causal learning is equivalent to assuming that correlation in the {f Ya( )}D a=1 action-specific functions, modelled through the multitask kernel K( , ), arises from Q different samples of Rq GP functions with the same kernel kq( , ), drawn from Q D different independent latent GP processes {uq( )}Q q=1. To express this more compactly, a causal multitask GP model under the LMC reads f Y ( ) GP m( ), K( , ) , with single entries of K( , ) being ka,a (xi, xj) = q=1 Bq kq(xi, xj) = q=1 Aq A q kq(xi, xj) , implying cov fa(xi), fa (xj) = r,r αr a,qαr a ,q cov ur q(xi), ur q (xj) . In our specific case, as in Alaa & van der Schaar (2018), we employ a special case of LMC, named intrinsic coregionalization model (ICM) (Bonilla et al., 2008), where the underlying latent GP function is unique (Q = 1), so that ka,a (xi, xj) = B K(xi, x i), with unique base kernel K( , ). The ICM specification attempts to avoid severe parameter proliferation in high-dimensional settings with multiple correlated actions D = |A|, while still being capable of capturing task-relatedness through the relatively simple structure of B. However, beside the issue of parameter proliferation when A features multiple discrete actions, exact GP regression is also known to scale poorly with sample size and cardinality of input space |X|, and direct likelihood maximization methods face issues in over-parametrized models, although some solutions, such as variational methods (Titsias, 2009; Hensman et al., 2013), might be adopted for better scalability. 3.2 Why multitask counterfactual learning? We know that asymptotically, under no sample selection bias, the best approach to estimate the causal quantities f Y (xi) would be a T-Learner" (Künzel et al., 2017; Caron et al., 2022a), which implies splitting the sample and fitting separate models for each arm A = a, ˆf(x)a. However, in finite samples flawed by selection bias, which is the typical case of observational studies, this is often not the best strategy (Hahn et al., 2020; Caron et al., 2022a). By simply extending the result in Alaa & van der Schaar (2018), we hereby show how increasing action and covariates spaces make the problem of estimating causal effects harder in terms of minimax risk, i.e. in terms of optimal convergence rates for any nonparametric estimator for a given space of functions. We particularly consider the problem of estimating Conditional Average Treatment Effects (CATE) between action a and b, defined under the SCM specified by (2) as τa,b(xi) = E[Yi|do(Ai = a), Xi = x] E[Yi|do(Ai = b), Xi = x] = fa(xi) fb(xi). We assume that all the fa, a A belong to the Hölder ball class of αa-smooth functions H(αa), with αa 1 bounded derivatives and bounded in sup-norm by a constant C > 0. Considering an L2-norm loss function on τa,b(xi), namely E ˆτa,b(xi) τa,b(xi) 2 L2 , the difficulty of CATE estimation with a nonparametric model ψ Ψ can be specified by the optimal minimax rate of convergence, that we define as follows (proof in the Appendix A section). Corollary 3.1 (Minimax rate). Assume covariate space is X = [0, 1]P , and fa depends on Pa covariates s.t. Pa P, a A. Define na,b < N as the subsample identified by action a and b. If both fa H(αa) and fb H(αb), then CATE optimal minimax rate for a L2-norm loss function is: inf ˆτψ sup fa,fb E ˆτa,b(xi) τa,b(xi) 2 L2 n a,b log P Pa+Pb P Pa a P Pb b 1 na,b , (6) where x y = max{x, y} and is asymptotical equivalence. The first terms on the RHS of 6 relates to the problem of CATE function approximation, while the second term to the degree of sparsity of the CATE. Thus, Published in Transactions on Machine Learning Research (11/2022) the optimal minimax rate, which minimizes the loss in the worst case scenario permitted, is asymptotically as complex as the hardest of these two tasks. The tightness" of this rate depends on the cardinality of the predictor space P and of the subsample defined by action a and b. This implies that, in the presence of multiple discrete actions (where some arms are likely to be very imbalanced), the rate will inevitably grows larger. For these reasons, a multitask GP prior over the actions is well suited to tackle selection bias and particularly estimation in regions with poor overlap, i.e. regions in X where we mainly observe data points with specific action Ai = a and very few others. In addition to this, as shown by Alaa & van der Schaar (2018), a multitask GP prior can achieve the optimal minimax rate of Corollary 3.1 in its posterior consistency. Hence, splitting the sample into na subgroups and fitting independent models can be very sample inefficient in these settings (6). Multitask GPs can aid extrapolation in such cases of strong sample selection bias, by learning the correlated functions {f Ya( )}D a=1 jointly as f Y ( ). The first row plots in Figure 2 provide a very simple one-covariate example of how multitask learning addresses the issue of extrapolation and prediction in poor overlap regions. Fitting the two surfaces f Y 1(xi) and f Y 0(xi) (dashed lines) through separate GP regressions results in a bad fit out of overlap regions (top-left plot) in this specific case. Multitask coregionalized GP attempts to fix this problem by embedding the assumption that the two surfaces share similar patterns via joint estimation of f Ya(xi) and their task-relatedness parameters, increasing sample efficiency (right panel). When the two surfaces share minor patterns instead, such as in the second row plots of Figure 2, the sample efficiency gains are less significant; and in some more extreme cases where the surfaces do not share any similar pattern at all, assuming a multitask GP prior might also introduce bias. The issue of partial overlap might be less severe in scenarios with larger sample size, but not always; however, in settings with strong sample selection bias, or settings with multiple discrete actions or action spaces that grow with the sample size, the issue remains relevant. This is because the sampling mechanism is inherently flawed, so that even with infinite samples, poor overlap regions do not fade away, independently of the modelling assumptions one makes. Modelling assumptions on how to estimate {f Ya( )}D a=1 hence ultimately depends on domain expert knowledge about the setting Hahn et al. (2020); Caron et al. (2022c). If one possesses prior knowledge that the causal effects might be fairly homogeneous across units with different covariate realizations Xi = xi, so that the CATE function is likely to display simple patterns (first row, Figure 2), then using a multitask approach would actually help incorporate this assumption in the model. Conversely, if one believes instead that CATE is likely to be a rather complex function itself (second row, Figure 2), estimating {f Ya( )}D a=1 independently would be a better choice. 3.3 Multiple Output Designs Reverting back to setups with multiple correlated outcomes, we introduce a simple extension to the class of counterfactual GPs presented above that involves an extra multitask learning layer over the M outcomes Yi, in addition to the one over Ai. The way we formulate this is through an additional coregionalization matrix, in the same fashion as we did for the actions case. This extended version featuring stacked coregionalization, has a GP prior of the following form: Yi = f Ya(xi) + εi , E(εi) = 0 f Ya( ) GP 0, KY ( , ) , KY ( , ) = BY BA K( , ) , (7) where BY is the M M coregionalization matrix over the outcomes, BA the D D coregionalization matrix over the actions and K( , ) is the base kernel. The vector-valued function f Ya( ) in this case includes all the single-valued functionals {fa,m( )}D,M a,m . The extra multitask learning layer over the outcomes Yi is aimed at increasing sample efficiency by borrowing information among correlated outcomes, as opposed to fitting M separate counterfactual GPs with a single coregionalization layer over A, but it is also conceptually sound, as the quantity of interest is indeed the joint interventional distribution p(Y |do(A = a), X = x), which accounts for and explicitly models correlation between the outcomes, rather than the collection of marginal distributions {p(Ym|do(A = a))}M m=1, which leaves correlation unspecified. As we will address in the later Published in Transactions on Machine Learning Research (11/2022) 3 2 1 0 1 2 3 X Fit Treated Control 3 2 1 0 1 2 3 X Multitask GP Fit Treated Control 3 2 1 0 1 2 3 X Fit Treated Control 3 2 1 0 1 2 3 X Fit Treated Control Figure 2: Simple one covariate example, with A = {0, 1}. Overlap is guaranteed to hold over the whole support X in the data generating process, i.e. every unit has non-zero probability of being assigned to either Ai = 1 or Ai = 0, but p(Ai = 1|Xi) is generated as an increasing function of Xi (selection bias). In the top row simulation, the two underlying counterfactual surfaces f Yd(xi) (dashed lines) are generated with very similar patterns, thus GP (left panel) is unable to borrow information from the other arm in poor overlap regions contrary to multitask GP (right panel). In the bottom row simulation instead we generate less similar surfaces, so borrowing of information through multitask GP does not lead to any improvement. section, although the extra layer defined by BY allows for higher sample efficiency, it also poses some issues due to parameter proliferation and stability of the optimization problem in high dimensions. 4 Counterfactual Multitask Deep Kernel Learning Gaussian Processes regressions are known to scale poorly with high dimensions. Their typical computational cost amounts to O(n3) for training points and O(n2) for test points. Similarly, coregionalized GPs suffer from over-parametrization and instability in the optimization procedure as the number of inputs P and the number of discrete actions D increase. Deep Kernel Learning (DKL) was firstly introduced by Wilson et al. (2016) with the aim of combining the scalability of Deep Learning methods and the nonparametric Bayesian uncertainty quantification of GPs in tackling prediction tasks in high-dimensional settings. Given a base kernel k( , ) (e.g. linear, squared exponential, etc.), a DKL structure learns a functional f Ya( ) by passing the P inputs Xi X through a deep architecture (a fully-connected feedforward neural network in our case), which maps them to a lower dimensional representation space via non-linear activation functions. The base kernel k( , ) is then applied in this lower dimensional representation space, k(h(l), h(l) ), where h(l) is a neural network s hidden layer, constituting a final Gaussian Process layer (or an infinite basis functions representation layer). The resulting mathematical object can be described as a kernel being applied to a concatenation of linear and non-linear functions of the inputs, namely K(x, x ) = K(g1 ... gl(x), g1 ... gl(x )) (Bohn et al., 2019). Thus, the DKL architecture is end-to-end, fully-connected and learnt jointly: the P inputs are passed on to ℓhidden neural nets layers where the last hidden layer before the GP layer typically maps them to a lower dimensional representation space (with e.g. two hidden units). This is what generates benefits in terms of scalability compared to a classic GP, as the base kernel k( , ) is applied to a lower dimensional representation space, rather than the higher dimensional inputs space directly. Another intrinsic advantage of DKL is that the deep architecture preceeding the GP layer can itself learn arbitrarily complex function, so the choice of a specific GP kernel becomes less cumbersome. For example, Wilson et al. (2016) show that DKL is more robust in recovering step functions, due to weaker smoothness assumptions compared to standard GP kernels. Published in Transactions on Machine Learning Research (11/2022) ... ... ... Input layer Hidden layers Output layer . . . h(1) = g1(W1X + b1) , . . . h(l) = gl(Wl h(l 1) + bl) , Y = f(h(l)) + Ξ f( ) GP 0, K( , ) , K( , ) = BY BA Kq(h(l), h(l) ) Figure 3: Counterfactual multitask DKL architecture. The P raw inputs are passed through a deep learning structure with ℓhidden layers. Multioutput separable kernels (inducing coregionalization over actions A and outcomes Y ) are then applied to the last Gaussian Process hidden layer, before the M action-specific output layer. Parameters are estimated jointly by minimizing the negative log likelihood. DKL naturally presents some limitations concerning the more burdensome parameter tuning (e.g., hidden layers and units selection) and the fact that they more easily tend to overfit when overly-parametrized (we refer to Ober et al. (2021) for a more detailed discussion of the issue). The kernel k( , ) in the last GP layer of a DKL architecture can easily incorporate the separable kernel structure for multitask learning, in the same fashion as the class of causal GPs presented earlier. Thus, we propose a multitask modelling framework to induce correlation across the action-specific functions {f Ya( )}D a=1, under the name of Counterfactual DKL (Counter DKL), where a similar Intrinsic Coregionalization Model (ICM) in the same spirit of (5) is placed on the last hidden layer of the neural network h(l) = gl(Wl h(l 1) + bl), where (Wl, bl) are the last layer s weights and bias, such that f Ya(h(l)) GP 0, K( , ) , where K(h(l), h(l) ) = KA(a, a ) K(h(l), h(l) ) = B K(h(l), h(l) ) (8) In this case, the Kronecker product of the coregionalization matrix occurs in the last hidden layer, and features lower dimensional representations instead of the potentially large number of raw inputs. Similarly, we can induce coregionalization over the M outcomes by adding another level of coregionalization, with the kernel reading K(h(l), h(l) ) = KY (y, y ) KA(a, a ) Kq(h(l), h(l) ) = BY BA Kq h(l), h(l) . Figure 3 graphically depicts a counterfactual multitask Deep Learning architecture, with fully-connected hidden layers, a final (infinite) GP layer and the M action-specific outcomes. The parameter space in Counter DKL comprises: i) the set of deep neural network s weight matrices {Wi}l i=1 and biases {bi}l i=1; ii) the base kernel K hyperparameters Φ, e.g. in the case of squared exponential kernel Φ includes lengthscales and variances parameters Φ = {ℓ, σ2}; iii) the coregionalization matrix B entries. Hence, the parameter space is the collection Θ = ({Wi}l i=1, {bi}l i=1, Φ, B). These parameters are estimated jointly via maximization of the log-marginal likelihood. More details are provided in the Appendix Section D and in Wilson et al. (2016); Gardner et al. (2018). In the next section, we will investigate properties of counterfactual multitask GPs and DKL on a variety of experiments. 5 Experiments The fundamental problem of causal inference is that the interventional quantity p(Y |do(A = a), Xi = xi) is never observable, so we have to resort to simulation to fully evaluate the methods on individual causal effects (ICE) estimation. We evaluate the performance of counterfactual GPs and counterfactual DKL on a data generating process with three different tasks, and on a real-world example combining experimental and observational data. For the first simulated experiment, we construct the DGP such that the backdoor Published in Transactions on Machine Learning Research (11/2022) 500 1000 1500 2000 2500 N 500 1000 1500 2000 2500 N 500 1000 1500 2000 2500 N GP Counter GP MOGP DKL Counter DKL MODKL 10 15 20 25 P 10 15 20 25 P Varying P, N=1500 10 15 20 25 P Varying N, P=10 Figure 4: Results on performance of the methods compared, in terms of RMSE or Optimal Allocation Rate (OAR), averaged across B = 100 replications for each N {500, 1000, 1500, 2000, 2500} (first row) and each P {10, 15, 20, 25} (second row). First column: RMSE evaluated on the individual causal effect (ICE) estimation task (on the test set). Second column: RMSE evaluated on the OPE task. Third column: OAR on the OPL task, defined as percentage of units correctly allocated to the best action among the D ones. criterion holds for Xi X. The GPs and DKL implementations in the simulated examples all make use of the KISS-GP approximation to compute the base kernel covariance matrix as Kq = MKdeep U,U M in the GP layer for better scalability (Wilson & Nickisch, 2015; Wilson et al., 2016)4. 5.1 Simulated Example We consider a simulated setting with D = 4 possible actions A = {0, 1, 2, 3} and M = 2 correlated outcomes Y = (Y1, Y2) R2. Actions and outcomes are generated according to a policy πb(xi) = p(Ai = a|Xi = xi) and an outcome function f Y (xi), both dependent on the covariates Xi X. The probabilistic DGP is fully described in the Appendix B of supplementary materials. The models we compare are the following: i) separate standard GP regressions, employed to fit f Yd( ) distinctly for each outcome and for each action (GP); ii) counterfactual multitask GP regression (Alaa & van der Schaar, 2017; 2018), with coregionalization over Ai only, meaning that we fit two separate models for each outcome, but a unique multi-valued function model for Ai (Counter GP); iii) counterfactual multioutput GP regression, a unique model with coregionalization both over Ai and Yi (MOGP); iv) separate DKL regressions with 3 hidden layers of [50, 50, 2] units, the equivalent of i) but with deep kernel implementation (DKL); v) counterfactual multitask DKL regression with 3 hidden layers of [50, 50, 2] units, the DKL equivalent of ii) (Counter DKL); vi) counterfactual multioutput DKL, the DKL equivalent of iii) (MODKL). In particular, we consider two slightly different versions of this setup. In the first version we fix the number of covariates to P = 10 (only 7 of them being relevant for the estimation) and study the behaviour of the estimators with increasing sample size N {500, 1000, 1500, 2000, 2500}. In the second version we fix sample size to N = 1500 and study the behaviour of the estimators with increasing number of covariates P {10, 15, 20, 25}. Performance of the models is evaluated on the following three related tasks: ICE: The first is the prediction of Individual Causal Effects (ICE). This tackles the estimation of the average causal effect of playing action Ai = a on outcome Yi, given a certain realization of the covariates 4 All experiments were run on a Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 8Gb RAM CPU. Published in Transactions on Machine Learning Research (11/2022) 1 2 3 4 5 : Degree of Confounding GP Counter GP DKL Counter DKL MODKL 1 2 3 4 5 : Degree of Confounding 95% Coverage Figure 5: Models performance in terms of RMSE (left plot) and 95% Coverage (right plot), in estimating Individual Causal Effects (ICE) on a 20% left-out test set, given an increasing level of confounding, represented by the γ parameter: higher values of γ corresponds to higher probability of being assigned to one of the two action Ai = {3, 4}, thus generating more arms imbalance. space, Xi = x, i.e. the estimation of ICE: E(Yi|do(Ai = a), Xi = xi). This is carried out using a 80% training set, and evaluated via RMSE on a 20% left-out test set. OPE: The second is Off-Policy Evaluation, which is concerned with quantifying how good a given alternative policy πe is, compared to the action allocation policy that generated the data (behavior policy, πβ). This is done by estimating the policy value, defined as the cumulative reward V(πe) = EX,A,Y P i πe(ai|xi) Yi|do(Ai = ai) originating from πe. In our case we pick the alternative policy to be a uniformly-at-random action allocation, i.e. πe Multinom(.25, .25, .25, .25). Performance is evaluated through RMSE on the entire sample. OPL: The last is Off-Policy Learning, which is concerned with finding the optimal policy π : X A, that is the policy that generates the highest policy value: π p arg maxπp Π V(πp). This last task is evaluated through an Accuracy metric on the whole sample, which we label Optimal Allocation Rate (OAR), indicating the percentage of units correctly assigned to their specific optimal action π (xi) = a, i.e. the action that generates the best outcome for them. Since we are dealing with M = 2 outcomes, we produce performance measurements on RMSE and optimal allocation rate for both outcomes and then average them, assuming both outcomes are given equal policy importance and live on the same scale. For both versions of the setup, namely increasing N and increasing P, we replicate the experiment B = 100 times to obtain Monte Carlo averages and 95% confidence intervals for the metrics. Results are depicted in Figure 4. RMSE performance in all models for increasing N and P behaves accordingly to Corollary 3.1. Counter DKL and MODKL perform consistently better than the GP models, as they scale better with an increasing sample size N and increasing number of predictors P. Particularly MOGP s performance deteriorates for issues related to stability of the marginal likelihood maximization and over-parametrization, as we had to omit it from the study of increasing predictors due to failed convergence for P > 10. The advantages over standard DKL regression instead are entirely attributable to sample efficiency gains from multitask coregionalization in Counter DKL and MODKL, both in the increasing N and increasing P studies. In the case of increasing P, we emphasize that as predictor space grows larger the causal DGP becomes relatively sparser (only 7 predictors out of P remain relevant for the estimation), especially in the case of P = {20, 25}. So in these two cases the batch of DKL models would perhaps achieve better performance from increasing the number of hidden units or hidden layers and adding regularization (dropout, ℓ1 or ℓ2 regularizers) in the deep architecture part. In addition, we run a slightly different version of the ICE experiment above, to further investigate properties of the models in terms of uncertainty quantification, that we measure through the 95% coverage of each ICE Published in Transactions on Machine Learning Research (11/2022) Model Train MAE Test MAE Train Rpol Test Rpol Runtime (s) GP 0.033 0.006 0.036 0.008 0.22 0.02 0.27 0.02 171.3 16.1 Counter GP 0.033 0.006 0.035 0.007 0.24 0.01 0.27 0.02 248.6 6.4 PCA + GP 0.073 0.002 0.074 0.003 0.22 0.01 0.27 0.02 66.3 2.4 PCA + Counter GP 0.074 0.001 0.074 0.001 0.23 0.01 0.26 0.02 126.1 3.9 Auto Enc + GP 0.075 0.004 0.075 0.003 0.21 0.03 0.27 0.02 76.0 3.0 Auto Enc + Counter GP 0.076 0.003 0.076 0.003 0.24 0.02 0.30 0.03 138.7 9.2 DKL 0.029 0.011 0.042 0.015 0.20 0.01 0.21 0.02 44.8 3.3 Counter DKL 0.011 0.003 0.015 0.005 0.22 0.01 0.25 0.02 122.7 7.4 Table 1: Train and test set performance on the Jobs data experiment in terms of Mean Absolute Error (MAE) in estimating ATT, Policy Risk (Rpol) and overall runtime (s), with 10-fold cross-validated 95% intervals. Bold indicates best performance. estimates (then averaged over actions and outcomes). This is defined as Coverage95% = 1 i=i I ˆfa(xi)low fa(xi) ˆfa(xi)upp , where ˆfa(xi)low and ˆfa(xi)upp are the lower and upper bands of the 95% credible interval output by the model on ˆfa(xi), while fa(xi) is the true individual counterfactual outcome corresponding to action Ai = a. Given fixed N = 2000 and P = 20 and a similar data generating process as before, we introduce the parameter γ, which governs the level of confounding, or the degree of groups imbalance in terms of action allocation. Particularly, for increasing values of γ, we assign higher probability of choosing one of the two action Ai = {3, 4}. This generates action arms imbalance as it leaves gradually less units in arms {1, 2}. Results are gathered in Figure 5, where MODKL display higher performance both in terms of error and uncertainty quantification. 5.2 Real-World Example: Job Training Programs and Unemployment We demonstrate the efficiency of Counter DKL also on a second experiment taken from Shalit et al. (2017), involving a popular real-world study on a job training program, dating back to La Londe (1986). The distinctive feature of this dataset is that it combines a randomized and an observational subgroups, where the aim is to estimate the effects of participation on a job training program on earnings and employment. The randomized experiment features 297 treated and 425 control units; The observational subsample is instead made of 2490 control units only. The binary treatment Ai {0, 1} denotes participation to the job training program. The original outcome Yi is earnings after the program, which is censored continuous (Yi = 0 for unemployed units). However, following Shalit et al. (2017), we construct a binary indicator Yi {0, 1} denoting employment status at the end of the job training program as outcome. This gives us the opportunity to demonstrate the use of the methods presented in this paper also on binary/categorical type of outcomes. To this end we use the classification method for GPs proposed in Milios et al. (2018), where class labels are interpreted as coefficients of a degenerate Dirichlet distribution, which makes the GP classification task efficiently faster and more scalable. The 7 covariates Xi X in the study are the following: age, years of schooling, african american ethnicity, hispanic ethnicity, marital status, high school diploma. Given the presence of a randomized subsample, we can exploit it to compute unbiased estimates of the quantities of interest and treat them as ground truth. The two quantities of interest in this case are: i) the Average Treatment Effect on the Treated group (ATT), defined as ATT = T 1 PTe i=1 yi C 1 PC i=1 yi, where T and C are the number of treated and control units in the experimental data; ii) the Policy Risk (Shalit et al., 2017), defined as the average error in allocating the treatment according to the ICE estimates policy rule namely π(xi) = 1 if ICE = E(Yi|do(Ai = 1), xi) E(Yi|do(Ai = 0), xi) > 0 or Rpol = 1 E Y |do(Ai = 1), π(xi) = 1 p(π(xi) = 1) + E Y |do(Ai = 0), π(xi) = 0 p(π(xi) = 0) . Notice that we cannot measure performance on ICE directly as this is always unobservable in real-world scenarios; Published in Transactions on Machine Learning Research (11/2022) also, we restrict analysis of average causal/treatment effects on the treated group since we are sure that overlap holds there, as all the treated units were part of the randomized experiment subgroup, while the observational subgroup is made only of control units. More details about this experiment can be found in the Appendix C of supplementary materials and in Shalit et al. (2017). We compare the following models: i) GP and Counter GP, as in Alaa & van der Schaar (2017); ii) vanilla PCA plus either GP or Counter GP; iii) vanilla deep Auto Encoder plus either GP or Counter GP; iv) DKL and Counter DKL (ours). Results on performance are gathered in Table 1, in terms of 70%-30% train and test set Mean Absolute Error (MAE) on ATT, Policy Risk Rpol and average runtime, accompanied by 10-fold cross-validated 95% error intervals. In this example multitasking is induced only over the binary treatment, as we deal with just a single outcome of interest. As the results depict, by operating jointly via a unique loss function, Counter DKL is significantly more efficient than naively applying dimensionality reduction and fitting a multitask GP on a lower dimensional space as two separate steps. It also displays gains over Counter GP, thanks to its deep component that guarantees better computational time (in terms of runtime) and scalability, and is able learn arbitrarily complex functions while imposing weaker smoothness assumptions than standard GP kernels, even on a low-dimensional covariate space example such as the one presented here (7 covariates). 5.3 The Infant Health Development Program data Model Train RMSE Test RMSE RF 1.85 0.13 2.39 0.17 X-RF 3.29 0.23 3.37 0.24 CF 3.10 0.21 3.07 0.20 BART 0.97 0.04 1.44 0.09 X-BART 2.07 0.14 2.13 0.14 BCF 0.87 0.05 1.34 0.10 Counter GP 0.61 0.02 0.70 0.04 Counter DKL 0.62 0.02 0.67 0.04 Table 2: RMSE of compared models on the semisimulated IHDP setup, evaluated for CATE estimation on 80%-20% train-test sets. Finally we compare Counter DKL with few other recent methods for causal effects estimation, on a popular simulated experiment utilizing the Infant Health Development Program (IHDP) data, originally found in Hill (2011), and more recently in several contributions on Conditional Average Treatment Effects (CATE) estimation, such as Shalit et al. (2017); Alaa & van der Schaar (2017; 2018); Caron et al. (2022a) among others. The experiment is a semi-simulated setup. It makes use of real-world data from the IHDP study, a randomized clinical trial aimed at improving the health status of premature infants with low weight at birth through pediatric follow-ups and parent support groups, and recreates an observational type of study by removing a non-random portion of treated units, namely those with non-white mothers". This leaves a total of 139 observations in the treated group and 608 in the control. In addition, the semi-simulated setup uses the real-world binary treatment Ai {0, 1} and the 25 available covariates Xi X, but simulates the two continuous potential outcomes (Y1, Y0) R2, as described in the non-linear Response Surface B setting in Hill (2011). As anticipated above, the estimand of interest in this case is the Conditional Average Treatment Effects (CATE), defined as E[Yi|do(Ai = 1), Xi = x] E[Yi|do(Ai = 0), Xi = x]. The models we compare include: i) vanilla Random Forest (RF), as a T-Learner (Künzel et al., 2017; Caron et al., 2022a); ii) X-Learner version of Random Forests (X-RF) as in (Künzel et al., 2017); iii) Causal Forest (CF), or Random Forests as an R-Learner, developed by Wager & Athey (2018); iv) vanilla Bayesian Additive Regression Trees (BART), as a T-Learner; v) X-Learner version of BART (X-BART); vi) Bayesian Causal Forests (BCF) by Hahn et al. (2020); vii) Counterfactual GP (Counter GP) as in Alaa & van der Schaar (2018); viii) our Counterfactual DKL (Counter DKL) with [100, 100, 2] hidden layers. Results reported in Table 2 refers to 1000 replication of the experiment on 80%-20% train-test split as in Alaa & van der Schaar (2017). 6 Conclusions Throughout this work, we considered the problem of counterfactual effects learning using observational data, which is of interest in domains where exploration of policies is costly (healthcare, socio-economic sciences, etc.). We reviewed the class of counterfactual GP regression models, extending it to adjust to multiple actions and outcomes settings, and discussed how multitask learning over multiple actions helps addressing finite sample selection bias. We then introduced a new class of counterfactual models based on Published in Transactions on Machine Learning Research (11/2022) Deep Kernel Learning, whose main advantages lie in their more flexible function approximation capabilities and better scalability. While counterfactual GPs struggle to scale up with sample size, number of predictors and number of actions/outcomes to coregionalize over, DKL capitalizes on these components by learning lower dimensional representations. We stress that the class of DKL methods proposed can be easily expanded to carry out counterfactual learning in other more complex scenarios such as: i) unobserved confounding where identification is still possible (instrumental variables); ii) dynamic settings such as dynamic treatment regimes or RL; iii) non-standard data like images (as DKL can incorporate any type of deep architecture). Ahmed Alaa and Mihaela van der Schaar. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In Proceedings of the 35th International Conference on Machine Learning, pp. 129 138, 2018. Ahmed M. Alaa and Mihaela van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian Processes. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 3427 3435, 2017. Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444 455, 1996. Susan Athey and Stefan Wager. Policy Learning With Observational Data. Econometrica, 89(1):133 161, January 2021. Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962 973, 2005. Elias Bareinboim, Andrew Forney, and Judea Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems 29, volume 28, 2015. Bastian Bohn, Christian Rieger, and Michael Griebel. A representer theorem for deep kernel learning. J. Mach. Learn. Res., 20(1):2302 2333, 2019. Edwin V Bonilla, Kian Chai, and Christopher Williams. Multi-task Gaussian Process prediction. In Advances in Neural Information Processing Systems, volume 20, 2008. Alberto Caron, Gianluca Baio, and Ioanna Manolopoulou. Estimating individual treatment effects using non-parametric regression models: A review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2022a. doi: https://doi.org/10.1111/rssa.12824. Alberto Caron, Gianluca Baio, and Ioanna Manolopoulou. Shrinkage Bayesian Causal Forests for heterogeneous treatment effects estimation. Journal of Computational and Graphical Statistics, pp. 1 13, 2022b. doi: 10.1080/10618600.2022.2067549. Alberto Caron, Gianluca Baio, and Ioanna Manolopoulou. Interpretable deep causal learning for moderation effects. In ICML 2022, 2nd Interpretable Machine Learning for Healthcare Workshop, 2022c. URL https://arxiv.org/abs/2206.10261. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1 C68, 2018. Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pp. 1097 1104, 2011. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485 511, 2014. Published in Transactions on Machine Learning Research (11/2022) Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1447 1456, 2018. Jacob R. Gardner, Geoff Pleiss, David Bindel, Kilian Q. Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7587 7597, 2018. P. Richard Hahn, Jared S. Murray, and Carlos M. Carvalho. Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects. Bayesian Analysis, 15(3):965 1056, 2020. Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1414 1423, 2017. James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 13, pp. 282 290, 2013. Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217 240, 2011. R. Hodson. Precision medicine. Nature, 547(7619), 2016. D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663 685, 1952. Kosuke Imai and David A van Dyk. Causal inference with general treatment regimes. Journal of the American Statistical Association, 99(467):854 866, 2004. Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015. Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 652 661, 2016. Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 3020 3029, 2016. Nathan Kallus. Balanced policy evaluation and learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8909 8920, 2018. Nathan Kallus. More efficient policy learning via optimal retargeting. Journal of the American Statistical Association, 116(534):646 658, 2021. Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591 616, 2018. Sören Künzel, Jasjeet Sekhon, Peter Bickel, and Bin Yu. Meta-learners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116, 06 2017. Robert J. La Londe. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76:604 620, 1986. Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, and Maurizio Filippone. Dirichletbased gaussian processes for large-scale calibrated classification. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6008 6018, 2018. Whitney K. Newey and James L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565 1578, 2003. Published in Transactions on Machine Learning Research (11/2022) X Nie and S Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299 319, 09 2020. Xinkun Nie, Emma Brunskill, and Stefan Wager. Learning when-to-treat policies. Journal of the American Statistical Association, 0(ja):1 58, 2020. Sebastian W. Ober, Carl E. Rasmussen, and Mark van der Wilk. The promises and pitfalls of deep kernel learning. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161, pp. 1206 1216, 2021. Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X. Min Qian and Susan A. Murphy. Performance guarantees for individualized treatment rules. Ann. Statist., 39(2):1180 1210, 04 2011. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005. Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41 55, 04 1983. Donald B. Rubin. Bayesian inference for causal effects: The role of randomization. Ann. Statist., 6(1):34 58, 01 1978. Phillip J. Schulte, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian. Q-and a-learning methods for estimating optimal dynamic treatment regimes. Statistical Science, 29(4):640 661, 2014. Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, volume 70, pp. 3076 3085, 2017. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018. Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, volume R5, pp. 333 340, 2005. Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 2139 2148, 2016. Jin Tian and Judea Pearl. A general identification condition for causal effects. In Eighteenth National Conference on Artificial Intelligence, pp. 567 573, 2002. Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5, pp. 567 574, 2009. Masatoshi Uehara, Masahiro Kato, and Shota Yasui. Off-policy evaluation and learning for external validity under a covariate shift. In Advances in Neural Information Processing Systems 33, 2020. Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228 1242, 2018. Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on International Conference on Machine Learning, pp. 1775 1784, 2015. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Arthur Gretton and Christian C. Robert (eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pp. 370 378. PMLR, 2016. Published in Transactions on Machine Learning Research (11/2022) Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. Representation learning for treatment effect estimation from observational data. In Advances in Neural Information Processing Systems 31, pp. 2633 2643, 2018. Baqun Zhang, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian. A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010 1018, 2012. Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106 1118, 2012. Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R. Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517): 169 187, 2017. Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195 266, 2012. Published in Transactions on Machine Learning Research (11/2022) A Proofs and Discussion In this first section of supplementary material we provide assumptions, proofs and brief discussion of the two theoretical results in the main paper (Section 2 theorem and Section 3.2 corollary). A.1 Backdoor Adjustment The backdoor adjustment theorem (Pearl, 2009) is a well known and understood results in causal inference. Its extension to multi-action and multi-outcome settings is trivial. We briefly reformulate the necessary assumptions as follows. Assumption A.1 (Backdoor Criterion). Using the causal graph terminology, we denote by pa(Vj) the parents of a specific random variable Vj and by de(Vj) its descendants. We assume that pa(A, Y ) X, de(A, Y ) X. Namely X contains (but does not necessarily coincide with) all the common parent variables of A A and Y = {Ym}M m=1 Y, for all m {1, ..., M}, and does not contain any common descendant of them. Assumption A.2 (Overlap). Defining the propensity score as πa(xi) = p(Ai = a|Xi = xi), we require that πa(xi) (α, 1 α) for all i {1, ..., N}, where α (0, 1 2). The scalar α represents how tight we require the overlap between units in each arm in terms of the covariates to be. Proof of Backdoor Adjustment Theorem Under Assumption 1.1 and 1.2, we are able to identify the effect A Y = {Ym}M m=1, in the form of the joint interventional distribution p Y |do(A = a) , as: p Y | do(A = a) = Z p Y | do(A = a), x p x | do(A = a) dx p Y | a, x p x | do(A = a) dx X p(Y | a, x)p(x) dx, The above derivation refers to the marginal interventional distribution, with respect to X. Often we are interest in the conditional interventional distribution p Y |do(A = a), X instead (e.g., conditional on patient s characteristics), such as when estimating CATE: τ(x) = f1(x) f0(x). The only additional requirement compared to the original version of backdoor adjustment (Pearl, 2009) is that Assumption 1.1 holds for all the collection of outcomes Y . A.2 Corollary 3.1 Corollary 3.1 is a trivial extension of the result on CATE optimal minimax rate derived in Theorem 1 by Alaa & van der Schaar (2018) to discrete multi-action set domains, where specifically {0, 1} A. Thus the proof follows straightforwardly from Alaa & van der Schaar (2018). The main difference is the following. Proof of Corollary 3.1 Optimal minimax rate in Alaa & van der Schaar (2018) define the hardness of CATE estimation between binary action a, b A: τa,b(xi) = fa(xi) fb(xi). If we have multi-actions space, it means that, given the whole sample size is N, each pair (strictly more than one pair) of discrete actions a, b defines a subsample (and thus a subpopulation) of na,b < N units. This implies that the hardness of approximating a function in the Hölder ball class and of performing variable selection is proportional to the smaller subsample na,b, not N, which makes the CATE estimation problem harder the smaller na,b. This is likely to happens when multi-action scenarios feature infrequently explored action arms. Thus, technically speaking, result in Theorem 1 in Alaa & van der Schaar (2018) is a special case of Corollary 3.1 where A = {0, 1}. Published in Transactions on Machine Learning Research (11/2022) B Data Generating Processes We hereby describe the causal data generating processes in the simulated examples of the paper (Section 3.2 and Section 5.1). B.1 Section 3.2 one covariate example For the simple one-covariate example in Section 3.2 (Figure 2), where we discuss the benefits of multitask counterfactual learning, we generated N = 300 data points from one, uniformly distributed covariate, Xi Uniform( 3, 3). Then we generated a binary action variable Ai Bernoulli p(Ai = 1|xi) , where p(Ai = 1|xi) = Φ 0.2 + Xi and Φ( ) is the standard normal cdf. Finally, the two counterfactual outcome surfaces were generated as f0(xi) = 2 + 0.3 exp Xi and f1(xi) = 3 + f0(xi), with the final outcome being Y = f0(xi) + τ(xi)Ai + εi where τ(xi) = f1(xi) f0(xi) is the CATE function and εi N(0, 0.752). B.2 Section 5.1 experiment The causal data generating process for the simulated experiment of Section 5.1 is described as follows. The P covariates are generated from a uniform distribution Xi,j Unif( 3, 3) for j {1, ..., P} and i {1, ..., N}. The action allocation policy is simulated according to a multinomial distribution where the probabilities of being assigned to action Ai = a are generated as a softmax function of the covariates p(Ai = a|Xi = xi) = exp{Xiβa} / P a A exp{Xiβa}, where βa is an action-specific P-dimensional sparse vector of action-specific coefficients defined as follow: β1 = 1 0.8 0.1 0.1 0 ... 0 , β2 = 0 0 1 0.8 0.2 0 ... 0 , β3 = 1.5 0.8 0.1 0.1 0 ... 0 , β4 = 1 0.8 0.1 0.1 0 ... 0 . Thus Ai is drawn from a multinomial with vector probabilities parameter p(Ai = a|Xi = xi). The M = 2 action-specific correlated counterfactual outcomes Yi | do(Ai = a) instead are generated as Yi | do(Ai = a) = f Y a(Xi) + εi , εi N(0, Σεi), where: f Y 11 = 3 + 0.4X0X1 0.3X2 2 + 0.2 exp(X3) + 0.6 sin(X4) f Y 12 = 1 + f Y 11 + 0.1X5 f Y 13 = 1 + f Y 11 + 0.3X5 f Y 14 = 0.5 + f Y 11 + 0.5X6 f Y 21 = 1 + 0.2X0X1 0.2X2 2 + 0.1 exp(X3) f Y 22 = 2 + f Y 21 + 0.2X5 f Y 23 = 2 + f Y 21 + 0.4X5 f Y 24 = 1 + f Y 21 + 0.5X6 and where diag(Σε) = [σ1, ..., σ4], with σ1 = ... = σ4 = 0.5, and off-diagonal elements are 0. Finally, we briefly describe the main specifications of the methods compared. The GP models (GP, Counter GP and MOGP) all employed a RBF base kernel, while the DKL models employed a three [50, 50, 2] hidden layers feedforward neural network before the GP -layer, which itself employs a RBF base kernel. The multitask and multioutput models (both GPs and DKLs) all make use of the Intrinsic Coregionalization Model (ICM), such that K(xi, x i) = BY BA Kq(xi, x i). All model were optimized through the Adam solver. More details and fully reproducible code on this experiment can be found in the Github repository: https://github.com/albicaron/Counter DKL. Published in Transactions on Machine Learning Research (11/2022) Data N P # actions indian 573 10 2 heart 270 13 2 yeast 1484 8 10 contracept 1473 9 3 Table 3: UCI datasets characteristics. GP Counter GP DKL Counter DKL indian 0.390 0.392 0.376 0.347 heart 2.553 1.076 0.433 0.410 yeast 0.534 0.657 1.3144 0.081 contracep 0.339 0.003 0.008 0.007 Table 4: OPE absolute regret on UCI datasets. Bold denotes best performance. C The Job Training Data The Job Training data (La Londe, 1986) are a popular case study in the causal inference literature. They comprise a portion of data pertaining to a randomized experiment and a portion of observational data, with the randomized experiment featuring 297 treated and 425 control units, while the observational data being of 2490 control units only. Given the randomized subsample of the data, we can obtain an unbiased estimate (computed on the randomized units only) for the Average Treatment Effect on the Treated group (ATT) as ATT = T 1 PTe i=1 yi C 1 PC i=1 yi, where T and C are the number of treated and control units in the experimental data, and treat this as the ground truth for estimating performance of the methods; and also for the policy risk measure Rpol = 1 E Y |do(Ai = 1), π(xi) = 1 p(π(xi) = 1) + E Y |do(Ai = 0), π(xi) = 0 p(π(xi) = 0) . A brief overview on the specifications of the models employed follows. All GPs employ RBF base kernel (also DKL s specifications in the last hidden layer). DKL and Counter DKL deep NN structure features three [10, 5, 2] hidden layers. The Auto Encoder deep structure employed for the Auto Enc + GP" and Auto Enc + Counter GP" models similarly learns a 2-dimensional encoded lower-dimensional representation, where the encoder has two [10, 5] hidden layers before the 2-dim representation and the decoder has [5, 10] hidden layers before the reconstruction loss. D Marginal Likelihood Maximization in Multioutput Deep Kernels In the multitask deep kernel learning class of models, the parameter space Θ = (W, ϕ, B) is made of the deep neural network s weights W, the base kernel s hyperparameters ϕ (variance, lengthscales, etc.) and the coregionalization matrix B entries. These parameters are learnt jointly by maximizing the log-marginal likelihood L at the end of the GP layer. Using the chain rule, the derivatives are: Kϕ g(x, W) g(x, W) where g(x, W) is the function mapping the inputs to the lower representation space parametrized by W, Kϕ is the base kernel and K( ) = B Kϕ( ) is the coregionalized kernel. E Additional Simulated Experiments Finally, we describe and present results on a few additional simulated examples that we conducted to assess Counter DKL performance compared to some other specifications seen in Section 5.2, on datasets with varying sample size, predictor space and action space dimensions. In particular, following Dudík et al. (2011) and Farajtabar et al. (2018), we make use of some of the popular datasets for classification in the open-source UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php), by transforming the Published in Transactions on Machine Learning Research (11/2022) classification task in a causal Off-Policy Evaluation task in the following way. Each dataset is equipped with a pair of covariates Xi and classification labels Li. We view the classification labels Li as our discrete actions Li = Ai, and consequently generate the action-specific outcome Yai as function of the covariates as follows: Yai = exp{Xiβa} + εi, where N(0, 0.5) and βa is a P-dimensional vector of action-specific coefficients, where entries are {0.4, 0.2, 0.0} sampled from a Multinomial(0.6, 0.25, 0.15), with replacement. The datasets utilized are summarized in Table 3 in terms of sample size N, number of covariates P and number of actions. We compare GP, Counter GP, DKL and Counter DKL models on an Off-Policy Evaluation task, where we evaluate the uniformly at random generated policy, via the absoulte regret or risk measure, defined as E | V(πe) ˆV(πe) | . All models employ a RBF base kernel, either directly on the inputs or on the lower dimensional layer. Results averaged over B = 20 replications of the experiments for each dataset are gathered in Table 4.