# Optimal Policy Adaptation Under Covariate Shift

Xueqing Liu¹, Qinwei Yang¹, Zhaoqing Tian¹, Ruocheng Guo², Peng Wu¹

¹Beijing Technology and Business University  ²ByteDance Research

pengwu@btbu.edu.cn

**Abstract.** Transfer learning of prediction models under covariate shift has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, in the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Furthermore, in the presence of both covariate and concept shifts, we propose a novel sensitivity analysis method to evaluate the robustness of the proposed policy learning approach. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.

## 1 Introduction

In many real-world scenarios, labeled data is often scarce due to budget constraints and time-consuming collection processes [Zhuang et al., 2020; Imbens et al., 2024], significantly limiting the generalizability of the resulting models.
For example, in medical research, collecting labeled data involves extensive clinical trials and follow-up periods, making it costly and time-consuming [Dahabreh et al., 2020; Hu et al., 2023]. In autonomous driving, obtaining labeled data requires manual annotation of large amounts of sensor data, which is laborious and expensive [Sun et al., 2020]. To address this problem and enhance a model's performance in a target domain without labels, an active area of research is transfer learning, which aims to improve the performance of learners in the target domain by transferring the knowledge contained in a different but related source domain.

While transfer learning has been extensively studied in the context of prediction models [Wang et al., 2018; Wang et al., 2020; Pesciullesi et al., 2020], how to transfer a policy is still underdeveloped. Policy learning refers to identifying, based on their characteristics, which individuals should receive a treatment or intervention so as to maximize rewards [Murphy, 2003]. It has broad applications in recommender systems [Chen and Sun, 2021; Wu et al., 2022], precision medicine [Bertsimas et al., 2017] and reinforcement learning [Liu et al., 2021; Kwan et al., 2023]. Unlike transfer learning for prediction models, policy transfer faces identification challenges due to its counterfactual nature [Athey and Wager, 2021; Li et al., 2023b; Wu et al., 2024c; Yang et al., 2024]: instead of predicting outcomes from observed data, policy transfer requires reasoning about what would happen under different actions, making the process more complex.

We aim to learn the optimal policies in the target and entire domains using a dataset from the source domain (source dataset) and a dataset from the target domain (target dataset). The source dataset includes the covariates, treatment, and outcome for each individual, whereas the target dataset contains only the covariates.
We assume that the source dataset satisfies the unconfoundedness and overlap assumptions, while imposing fewer restrictions on the target dataset. We allow for substantial differences in the covariate distributions between the source and target datasets (referred to as covariate shift), while assuming that the conditional distributions of potential outcomes given covariates are the same.

In this article, we first propose a principled policy learning approach under covariate shift. Specifically, we define the reward and the optimal policy in the target domain using the potential outcome framework in causal inference. Under the widely used assumptions of unconfoundedness and transportability, we establish the identifiability of the reward in the target domain and then derive its efficient influence function and semiparametric efficiency bound. Building on this, we develop a novel estimator for the reward. Theoretical analysis shows that the proposed estimator is doubly robust and achieves the semiparametric efficiency bound; that is, it is the optimal regular estimator in terms of asymptotic variance [Newey, 1990]. Then we propose to learn the optimal policy by maximizing the estimated reward. We analyze the bias of the estimated reward and the generalization error bound of the learned policy. In addition, we extend the proposed method to learn the optimal policy in the entire domain, consisting of the source and target domains, by leveraging data from both domains to address distributional discrepancies and ensure robust generalization across heterogeneous environments.

*Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)*
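For intuition only, a textbook-style doubly robust (augmented) construction for such a target-domain reward combines an outcome-regression term over the target sample with a weighted residual correction over the source sample. The sketch below uses illustrative notation not defined in the paper — $\hat{\mu}_a(x)$ for the source outcome regressions, $\hat{e}(x)$ for the source propensity score, and $\hat{w}(x)$ for the target-to-source density ratio — and need not coincide with the estimator the paper derives from the efficient influence function:

```latex
\widehat{R}(\pi)
= \frac{1}{n_0}\sum_{i:\,G_i=0}
   \bigl\{\pi(X_i)\hat{\mu}_1(X_i) + (1-\pi(X_i))\hat{\mu}_0(X_i)\bigr\}
+ \frac{1}{n_0}\sum_{i:\,G_i=1}
   \hat{w}(X_i)\left\{\frac{\pi(X_i)A_i}{\hat{e}(X_i)}
   + \frac{(1-\pi(X_i))(1-A_i)}{1-\hat{e}(X_i)}\right\}
   \bigl(Y_i - \hat{\mu}_{A_i}(X_i)\bigr)
```

The first term plugs in regression predictions for the target individuals; the second re-weights source residuals so that errors in the outcome model can be corrected by the propensity and density-ratio models, and vice versa — the usual source of double robustness.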
The main contributions are summarized as follows: (1) We propose a principled approach for learning the optimal policy under covariate shift from a perspective of causality, by introducing plausible identifiability assumptions and efficient estimation methods; (2) We provide a comprehensive theoretical analysis of the proposed approach, including the consistency, asymptotic normality, and semiparametric efficiency of the reward estimator; additionally, we derive the bias and generalization error bound for the learned policy; (3) We conduct extensive experiments to demonstrate the effectiveness of the proposed policy learning approach.

## 2 Problem Formulation

### 2.1 Notation and Setup

Let $A \in \mathcal{A} = \{0, 1\}$ denote the binary indicator for treatment, where $A = 1$ indicates receiving treatment and $A = 0$ indicates not receiving treatment. The random vector $X \in \mathcal{X} \subseteq \mathbb{R}^p$ represents the $p$-dimensional covariates measured before treatment, and $Y \in \mathcal{Y} \subseteq \mathbb{R}$ denotes the outcome of interest. Assume that a larger outcome is preferable. Under the potential outcome framework [Rubin, 1974; Splawa-Neyman, 1990], let $Y(a)$ denote the potential outcome that would be observed if $A$ were set to $a$ for $a \in \mathcal{A}$. By consistency [Hernán and Robins, 2020], the observed outcome $Y$ satisfies $Y = Y(A) = AY(1) + (1 - A)Y(0)$.

Without loss of generality, we consider a typical scenario involving two datasets: a source dataset and a target dataset, which are representative samples of the source domain and target domain, respectively. Let $G \in \{0, 1\}$ be the indicator for the data source, where $G = 1$ denotes the source domain and $G = 0$ denotes the target domain. The observed data are represented as follows:

$$D_1 = \{(X_i, A_i, Y_i, G_i = 1) : i = 1, \ldots, n_1\},$$
$$D_0 = \{(X_i, G_i = 0) : i = n_1 + 1, \ldots, n_1 + n_0\},$$

where the source dataset $D_1$ consists of $n_1$ individuals, with observed covariates, treatment, and outcome for each individual. The target dataset $D_0$ contains $n_0$ individuals, with only covariates observed for each individual.
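The two-dataset setup above is easy to mimic in a small simulation. The sketch below is illustrative: the shifted covariate means, the randomized treatment, and the CATE $\tau(x) = x_1 - 0.5$ are all hypothetical choices, but the structure — $D_1$ with $(X, A, Y)$, $D_0$ with $X$ only, and $Y$ obeying consistency — matches the setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation of the two-dataset setup (all numbers illustrative):
# covariate shift -- source X ~ N(0, I), target X ~ N(1, I) -- while the
# outcome model is shared across domains.
n1, n0, p = 500, 300, 2
X_src = rng.normal(0.0, 1.0, size=(n1, p))
X_tgt = rng.normal(1.0, 1.0, size=(n0, p))

# Source domain only: randomized treatment and observed outcome.
A_src = rng.binomial(1, 0.5, size=n1)
tau = X_src[:, 0] - 0.5                       # hypothetical CATE tau(x) = x1 - 0.5
y0 = X_src.sum(axis=1) + rng.normal(size=n1)  # potential outcome Y(0)
y1 = y0 + tau                                 # potential outcome Y(1)
Y_src = A_src * y1 + (1 - A_src) * y0         # consistency: Y = AY(1) + (1-A)Y(0)

# D1 carries (X, A, Y); D0 carries covariates only, as in the setup above.
D1 = {"X": X_src, "A": A_src, "Y": Y_src}
D0 = {"X": X_tgt}
```

Only one of the two potential outcomes enters `Y_src` for each individual, which is exactly why the reward of a policy is a counterfactual quantity rather than a simple prediction target.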
This setting is common in real life due to the scarcity of outcome data. For example, in medical research, patient features are observed, but obtaining outcomes requires long-term follow-up [Hu et al., 2023; Imbens et al., 2024]. Let $P(\cdot \mid G = 1)$ and $P(\cdot \mid G = 0)$ be the distributions of the two datasets, respectively. Let $n = n_0 + n_1$ denote the total sample size; then $q = n_1/n$ represents the probability of an individual belonging to the source population.

### 2.2 Formulation

We formulate the goal of learning the optimal policy in the target domain. Specifically, let $\pi : \mathcal{X} \to \mathcal{A}$ denote a policy that maps individual covariates $X = x$ to the treatment space $\mathcal{A}$. A policy $\pi(X)$ is a treatment rule that determines whether an individual receives treatment ($A = 1$) or not ($A = 0$). For a given policy $\pi$ applied to the target domain, the average reward is defined as

$$R(\pi) = E[\pi(X)Y(1) + (1 - \pi(X))Y(0) \mid G = 0]. \tag{1}$$

We aim to learn the optimal policy $\pi^*$ defined by $\pi^* = \arg\max_{\pi \in \Pi} R(\pi)$, where $\Pi$ is a pre-specified policy class. For example, $\pi(X)$ can be modeled with a parameter $\theta$ using methods such as logistic regression or a multilayer perceptron, with each value of $\theta$ corresponding to a different policy. In addition, for a policy $\pi(x)$ applied across the whole domain, the corresponding average reward is defined as

$$V(\pi) = E[\pi(X)Y(1) + (1 - \pi(X))Y(0)]. \tag{2}$$

There is a subtle difference between $R(\pi)$ and $V(\pi)$: for $R(\pi)$, our focus is on transferring the policy from the source domain to the target domain, while for $V(\pi)$, we aim to generalize the policy from the source domain to the entire domain. In the main text, we focus on learning the policy maximizing $R(\pi)$ to avoid redundancy. We also develop a similar approach to learn the policy maximizing $V(\pi)$ and briefly present it in Section 4.3.

## 3 Oracle Policy and Identifiability

### 3.1 Oracle Policy

The optimal policy that maximizes Eq. (1) has an explicit form.
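Before deriving that form, note that in a simulation where both potential outcomes are known, the reward in Eq. (1) reduces to a sample average over target-domain individuals. The sketch below uses a hypothetical outcome model with CATE $\tau(x) = x_1 - 0.5$ and compares three policies; the policy that treats exactly when $\tau(x) \ge 0$ dominates both constant policies pointwise, so its average reward is at least as large:

```python
import numpy as np

def reward(pi, X, y1, y0):
    """Monte Carlo analogue of Eq. (1): average of pi(X)Y(1) + (1 - pi(X))Y(0)."""
    a = pi(X)                                # binary decisions on target covariates
    return np.mean(a * y1 + (1 - a) * y0)

rng = np.random.default_rng(1)
X = rng.normal(1.0, 1.0, size=(1000, 2))     # target-domain covariates (G = 0)
y0 = X.sum(axis=1)                           # hypothetical Y(0)
y1 = y0 + (X[:, 0] - 0.5)                    # hypothetical Y(1): tau(x) = x1 - 0.5

treat_all = lambda X: np.ones(len(X), dtype=int)
treat_none = lambda X: np.zeros(len(X), dtype=int)
threshold = lambda X: (X[:, 0] >= 0.5).astype(int)   # treat iff tau(x) >= 0

# Thresholding the CATE at zero dominates both constant policies pointwise,
# so its average reward can be no smaller than either.
assert reward(threshold, X, y1, y0) >= max(reward(treat_all, X, y1, y0),
                                           reward(treat_none, X, y1, y0))
```

In practice `y1` and `y0` are never jointly observed, which is why the paper must identify and estimate $R(\pi)$ from the source dataset instead of averaging directly.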
Let $\tau(X) = E[Y(1) - Y(0) \mid X, G = 0]$ be the conditional average treatment effect (CATE) in the target domain. Then

$$R(\pi) = E[\pi(X)\{Y(1) - Y(0)\} + Y(0) \mid G = 0] = E[\pi(X)\tau(X) \mid G = 0] + E[Y(0) \mid G = 0],$$

where the last equality follows from the law of iterated expectations. Then we have the following conclusion.

**Lemma 1.** The oracle policy is

$$\pi^*_0(x) = \arg\max_{\pi} R(\pi) = \begin{cases} 1, & \text{if } \tau(x) \geq 0, \\ 0, & \text{if } \tau(x) < 0, \end{cases}$$

where $\max_{\pi}$ is taken over all possible policies without constraints, rather than being restricted to $\Pi$.

For an individual characterized by $X = x$ in the target domain, Lemma 1 asserts that the decision to accept treatment ($A = 1$) should be based on the sign of $\tau(x)$. The oracle policy $\pi^*_0$ recommends treatment for individuals expected to experience a positive benefit, thereby optimizing the overall reward within the target domain. The target policy $\pi^*$ equals the oracle policy $\pi^*_0$ in Lemma 1 if $\pi^*_0 \in \Pi$; otherwise, they may not be equal, and their difference is the systematic error induced by the limited hypothesis space of $\Pi$.

### 3.2 Identifiability of the Reward

To learn the optimal policy $\pi^*$, we first need to address the identifiability of $R(\pi)$, as this forms the foundation for policy evaluation. Since the target dataset only contains covariates $X$, $R(\pi)$ cannot be identified from the target data alone due to the absence of treatment and outcome. To identify $R(\pi)$, it is necessary to borrow information from the source dataset by imposing several assumptions.

**Assumption 1.** For all $X$ in the source domain, (i) Unconfoundedness: $(Y(1), Y(0)) \perp\!\!\!\perp A \mid X, G = 1$; (ii) Overlap: 0