# Distributionally Robust Policy Learning under Concept Drifts

Jingyuan Wang (1), Zhimei Ren (2), Ruohan Zhan (3), Zhengyuan Zhou (1,4)

Abstract. Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. This joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem: robust policy learning under concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated at a slower-than-root-n rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class Π, and show that the sub-optimality gap of the proposed algorithm is of the order κ(Π)n^{-1/2}, where κ(Π) is the entropy integral of Π under the Hamming distance and n is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

(1) Stern School of Business, New York University. (2) Department of Statistics and Data Science, University of Pennsylvania. (3) IEDA, Hong Kong University of Science and Technology. (4) Arena Technologies. Correspondence to: Zhengyuan Zhou.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

In a wide range of fields, the abundance of user-specific historical data provides opportunities for learning efficient individualized policies. Examples include learning the optimal personalized treatment from electronic health record data (Murphy, 2003; Kim et al., 2011; Chan et al., 2012), or obtaining an individualized advertising strategy using past customer behavior data (Bottou et al., 2013; Kallus & Udell, 2016). Driven by such a practical need, a line of work has been devoted to developing efficient policy learning algorithms using historical data, a task often known as offline policy learning (Dudík et al., 2011; Zhang et al., 2012; Swaminathan & Joachims, 2015a;b;c; Kitagawa & Tetenov, 2018; Athey & Wager, 2021; Zhou et al., 2023; Zhan et al., 2023; Bibaut et al., 2021; Jin et al., 2021; 2022a).

Most existing methods for offline policy learning deliver performance guarantees under the premise that the target environment remains the same as that from which the historical data is collected. It has been widely observed, however, that such a condition is hardly met in practice (see e.g., Recht et al. (2019); Namkoong et al. (2023); Liu et al. (2023); Jin et al. (2023) and the references therein). Under distribution shift, a policy learned in one environment often shows degraded performance when deployed in another environment. To address this issue, there is an emerging body of research on robust policy learning, which aims at finding a policy that still performs well when the target distribution is perturbed.
Pioneering works in this area consider the case where the joint distribution of the covariates and the outcome is shifted from the training distribution, and researchers devise algorithms that output a policy achieving reliable worst-case performance under the aforementioned shifts (Si et al., 2023; Kallus et al., 2022). The joint modeling approach, however, ignores the type of distributional shift, and the resulting worst-case value can be unnecessarily conservative in practice. Indeed, distributional shifts can be categorized into two classes by their sources: (1) the shift in the covariate X, and/or (2) the shift in the conditional relationship between the outcome Y and the covariate X. The two types of distributional shifts are different in nature, have different implications on the objectives, and call for distinct treatment (Namkoong et al., 2023; Liu et al., 2023; Jin et al., 2023; Ai & Ren, 2024). To be concrete, imagine that the distribution of covariates changes while that of Y | X remains invariant; in this case, the distribution shift is identifiable/estimable, since the covariates are often accessible in the target environment. As a result, it is often unnecessary to account for the worst-case covariate shift rather than directly correcting for it. Alternatively, when the Y | X distribution changes but the X distribution remains invariant, the distribution shift is no longer identifiable, and we can instead apply the worst-case consideration to guarantee performance. This latter setting, known as concept drift, occurs due to sudden external shocks (Widmer & Kubat, 1996; Lu et al., 2018; Gama et al., 2014). For example, in advertising, customer behavior can evolve over time as the environment changes, while the population remains largely the same. In personalized product recommendation, similar population segments in developed and emerging markets may prefer different product features. In these applications, with the one extra bit of information that the shift is only in the conditional reward distribution, can we obtain a more efficient policy learning algorithm?

|                        | Distribution shift | Unknown π0 | General X | Upper bound | Lower bound |
|------------------------|--------------------|------------|-----------|-------------|-------------|
| Athey & Wager (2021)   | None               |            |           |             |             |
| Zhou et al. (2023)     | None               |            |           |             |             |
| Si et al. (2023)       | Joint              |            |           |             |             |
| Kallus et al. (2022)   | Joint              |            |           |             |             |
| Mu et al. (2022)       | Separate           |            |           | O(√(log n · log(\|X\|\|A\|) / n)) | |
| This work              | Separate           |            |           | O(κ(Π)/√n)  | Ω(√(Ndim(Π)/n)) |

Table 1. Comparison of results in the offline policy learning literature. Unknown π0 refers to whether an algorithm assumes knowledge of the behavior policy π0. General X refers to whether an algorithm allows for general types of covariates. Athey & Wager (2021); Zhou et al. (2023); Si et al. (2023); Kallus et al. (2022) have regret upper and lower bounds for the specific problems they consider that are not directly comparable to ours, so we do not include them in the table. |X| refers to the cardinality of the covariate support (if finite) and |A| to that of the action set. κ(Π) and Ndim(Π) are the entropy integral under the Hamming distance and the Natarajan dimension of a policy class Π, with the relation κ(Π) = O(√(log(d) · Ndim(Π))), where d is the dimension of the covariate space.

Motivated by the above situations, we study robust policy learning under concept drift. Most existing methods for robust policy learning (Si et al., 2023; Kallus et al., 2022) model the distributional shift jointly without distinguishing the sources, and the corresponding algorithms turn out to be suboptimal.
The reason behind their suboptimality is that the worst-case distributions under the two models (the joint-shift model and the concept-drift model) can be substantially different, so it would be a waste of the robustness budget to consider adversarial distributions that are not feasible under concept drift. It is worth mentioning that a recent paper by Mu et al. (2022) accounts for the sources of distributional shifts in policy learning; their approach, however, applies only when the covariates take a finite number of values, and is therefore limited in its applicability. When the covariate space is infinite, it remains unclear how to efficiently learn a robust policy under concept drift. The current work aims to fill this gap by answering the following question: how can we efficiently learn a policy with optimal worst-case average performance under concept drift, with minimal assumptions?

We provide a rigorous answer to the above question. Specifically, we assume the covariate distribution remains the same in the training and target environments (otherwise, the covariate shift can be easily adjusted by the covariate matching discussed earlier), while the Y | X distribution shift is bounded in KL divergence by a pre-specified constant δ. Our goal is to find a policy that maximizes the worst-case average outcome over all possible target distributions satisfying the previous condition.

1.1. Our Contributions

Towards robust policy learning under concept drift, we make the following contributions.

Policy Evaluation. Given a policy, we present a doubly-robust estimator for the worst-case policy value under concept drift. We prove that the estimator is asymptotically normal under mild conditions on the estimation rate of the nuisance parameters. Our approach involves solving the dual form of a distributionally robust optimization problem and taking a de-biasing step to deal with the slow convergence of the optimizer, thereby obtaining an estimator with a root-n convergence rate.

Policy Learning. We propose a robust policy learning algorithm that outputs a policy maximizing the estimated policy value over a policy class Π. Compared with the oracle optimal policy, the policy produced by our algorithm has, with high probability, a regret (suboptimality gap) of the order κ(Π)/√n, where κ(Π) is a measure quantifying the policy class complexity (to be formalized shortly) and n is the number of samples. Compared with Mu et al. (2022), our algorithm and theory apply to general covariate spaces and potentially infinite policy classes, while their method is restricted to finite covariate spaces and policy classes. We complement the upper bound with a matching lower bound, thus establishing the minimax optimality of our proposed algorithm. We summarize the comparison between our results and prior work in Table 1.

Implementation and Empirics. We provide an efficient implementation of our robust policy learning algorithm, and compare its empirical performance with existing benchmarks in numerical studies. Our proposed method exhibits substantial improvement.

1.2. Related Works

Offline Policy Learning. There is a long list of works devoted to offline policy learning. Most of them assume no distributional shifts (e.g., Dudík et al. (2011); Zhang et al. (2012); Swaminathan & Joachims (2015a;b;c); Kitagawa & Tetenov (2018); Athey & Wager (2021); Zhou et al. (2023)). Zhan et al. (2023); Jin et al.
(2021; 2022a) allow the data to be adaptively collected, but the distribution over the covariates and the (potential) outcomes remains invariant in the training and target environments. As mentioned earlier, Si et al. (2023); Kallus et al. (2022) study robust policy learning when the joint distribution of (X, Y) ranges in a neighborhood of the training distribution. Mu et al. (2022) consider the case when the covariate shift and the Y | X shift are specified separately; their method, however, is restricted to finite covariate spaces, and their suboptimality gap is slower than the parametric rate by logarithmic factors. More recently, Guo et al. (2024) considers a pure covariate shift with a focus on policy evaluation, where the setup and the goal are different from ours.

Distributionally Robust Optimization. More broadly, our work is closely related to DRO, where the goal is to learn a model that has good performance under the worst-case distribution (e.g., Bertsimas & Sim (2004); Delage & Ye (2010); Hu & Hong (2013); Duchi et al. (2019); Dudík et al. (2011); Zhang et al. (2023)). The major focus of the aforementioned works involves parameter estimation and prediction in supervised settings; we, however, take a decision-making perspective and aim at learning an individualized policy with optimal worst-case performance guarantees.

2. Preliminaries

Consider a set of M actions denoted by [M] and let X ⊆ R^d. Throughout the paper, we follow the potential outcome framework (Imbens & Rubin, 2015), where Y(a) ∈ Y_a ⊆ R denotes the potential outcome had action a been taken, for any a ∈ [M]. We posit an underlying data-generating distribution P on the joint covariate-outcome random vector (X, Y(1), ..., Y(M)) ∈ X × ∏_{a=1}^{M} Y_a. Consider a data set D = {(Xi, Ai, Yi)}_{i ∈ [n]} consisting of n i.i.d. draws of (X, A, Y), where Xi ∈ X is the observed contextual vector, Ai ∈ [M] the action, and Yi = Yi(Ai) the realized reward. The actions are selected by the behavior policy π0, where π0(a | x) := P(A = a | X = x) is the propensity score, for any a ∈ [M], x ∈ X. We make the following assumptions on π0 and P.

Assumption 2.1. The behavior policy π0 and the joint distribution P satisfy the following.
(1) Unconfoundedness: (Y(1), ..., Y(M)) ⫫ A | X.
(2) Overlap: for some ε > 0, π0(a | x) ≥ ε for all (a, x) ∈ [M] × X.
(3) Bounded reward support: there exists ȳ > 0 such that 0 ≤ Y(a) ≤ ȳ for all a ∈ [M].

The above assumptions are standard in the literature (see e.g., Athey & Wager, 2021; Zhou et al., 2023; Si et al., 2023; Kallus et al., 2022). In particular, the unconfoundedness assumption guarantees identifiability, and the overlap assumption ensures sufficient exploration when collecting the training dataset. The bounded reward support is assumed for ease of exposition, and can be relaxed straightforwardly to sub-Gaussian rewards.

2.1. The KL-distributionally Robust Formulation

Given the training set D = {(Xi, Ai, Yi)}_{i ∈ [n]} and a policy class Π, we aim to learn a policy π ∈ Π that achieves high expected reward in a target environment that may deviate from the data-collection environment where D is collected. While distribution shift can take place in various forms, we focus primarily on concept drift, where only the conditional reward distribution Y(a) | X differs between the training and target environments. The distance between distributions is quantified by the KL divergence.

Definition 2.2 (KL divergence).
The KL divergence between two distributions Q and P is defined as D_KL(Q ‖ P) = E_Q[log(dQ/dP)], where dQ/dP is the Radon–Nikodym derivative of Q with respect to P.

We define an uncertainty set of neighboring distributions around P whose conditional outcome distribution is bounded in KL divergence from that of P. Given a radius δ > 0, the uncertainty set of the conditional distribution is defined as

    P(P_{Y|X}, δ) := { Q_{Y|X} : D_KL(Q_{Y|X} ‖ P_{Y|X}) ≤ δ },

where P_{Y|X} and Q_{Y|X} refer to the distribution of (Y(1), ..., Y(M)) | X under P and Q, respectively. The distributionally robust policy value for any policy π at level δ is defined as

    V_δ(π) := E_{P_X}[ inf_{Q_{Y|X} ∈ P(P_{Y|X}, δ)} E_{Q_{Y|X}}[ Y(π(X)) | X ] ].   (1)

The optimal policy in Π is the one that maximizes V_δ(π), i.e., π*_δ := argmax_{π ∈ Π} V_δ(π). (When the supremum cannot be attained, we can always construct a sequence of policies whose policy values converge to the supremum, and all the arguments go through with a limiting argument.) Under this formulation, our goal is to learn a robust policy with a high value of V_δ(π) using a dataset drawn from P. The task here is two-fold: we need to (i) estimate the policy value V_δ(π) for a given policy π, and (ii) find a near-optimal robust policy π̂ ∈ Π whose policy value is close to that of the optimal policy π*_δ. Here, the performance of a learned policy π̂ is measured by the sub-optimality gap (regret):

    R_δ(π̂) := V_δ(π*_δ) − V_δ(π̂).   (2)

In the following sections, we tackle each task sequentially.

2.2. Strong Duality

In order to estimate V_δ(π), we first rewrite the inner optimization problem in Equation (1) in its dual form using standard results in convex optimization (see e.g., Luenberger (1997)). The transformation is formalized in the following lemma, with its proof provided in Appendix D.1.

Lemma 2.3 (Strong Duality). Given any π ∈ Π and any x ∈ X, the optimal value of the inner optimization problem in Equation (1) equals

    min_{α ≥ 0, η ∈ R} E_P[ V_δ(α, η; π) | X = x ],   (3)

where V_δ(α, η; π) = α exp(−(Y(π(X)) + η)/α − 1) + η + αδ.

We note that the optimization problem in (3) depends on x and π; to manifest this dependence, we use (α*_π(x), η*_π(x)) to denote its optimizer, i.e., α*_π and η*_π are functions of x and

    (α*_π(x), η*_π(x)) ∈ argmin_{α ≥ 0, η ∈ R} E_P[ V_δ(α, η; π) | X = x ].

With this notation and Lemma 2.3, the robust policy value can be written as

    V_δ(π) = E_P[ V_δ(α*_π(X), η*_π(X); π) ].   (4)

The above formulation has thus translated the original distributionally robust optimization problem into an empirical risk minimization (ERM) problem. We note that, unlike in the well-studied joint distributional shift formulation, the above representation admits an optimizer pair (α*_π(x), η*_π(x)) that depends on the context x (i.e., α*_π, η*_π are functions of x) and the policy π. As we shall see shortly, our proposed policy value estimation procedure employs ERM tools to estimate (α*_π, η*_π), and then computes an estimate of V_δ(π) by plugging (α*_π, η*_π) into Equation (4). The remaining challenge in this proposal is the slow estimation rate of the optimizers: if we naïvely plug in the estimated optimizers, the resulting policy value estimator typically has a convergence rate slower than root-n. To overcome this, we incorporate a novel adjustment method to debias the estimator, which allows us to obtain a doubly-robust estimator that achieves the root-n convergence rate even when the nuisance parameters (e.g., (α*_π, η*_π)) converge more slowly than the root-n rate.
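To make the dual formulation concrete, the following minimal sketch solves an empirical analogue of the inner problem (3) at a single covariate value with SciPy. This is an illustration only: the function names, the Nelder-Mead choice, and the toy data are our own assumptions rather than the paper's implementation, and the sign convention follows the display of Lemma 2.3 as written above.

```python
import numpy as np
from scipy.optimize import minimize

def dual_objective(theta, y_samples, delta):
    # Empirical analogue of E_P[V_delta(alpha, eta; pi) | X = x] in Equation (3),
    # evaluated on draws of Y(pi(x)) collected near a fixed covariate value x.
    alpha, eta = theta
    if alpha <= 0:
        return np.inf  # the dual variable alpha is restricted to be nonnegative
    return np.mean(alpha * np.exp(-(y_samples + eta) / alpha - 1.0)) + eta + alpha * delta

def solve_dual(y_samples, delta, theta0=(1.0, 0.0)):
    # Derivative-free minimization over (alpha, eta); any smooth optimizer would do.
    res = minimize(dual_objective, np.asarray(theta0), args=(y_samples, delta),
                   method="Nelder-Mead")
    alpha_star, eta_star = res.x
    return max(alpha_star, 1e-3), eta_star, res.fun  # keep alpha bounded away from zero

# Toy usage with synthetic rewards at a single x (hypothetical data).
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=0.5, size=2000)
alpha_s, eta_s, robust_value_at_x = solve_dual(y, delta=0.1)
```

The minimized objective plays the role of the conditional robust value in Equation (3); in the actual procedure, (α, η) are fitted as functions of x rather than pointwise, as described next.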
We end this section by providing a sufficient condition to ensure α*_π(x) > 0, which we assume throughout the paper.

Assumption 2.4. For a ∈ [M] and x ∈ X, define y(x; a) = sup{ t : P(Y(a) < t | X = x, A = a) = 0 } and p(x; a) = P(Y(a) = y(x; a) | X = x, A = a). It holds that log(1/p(x; a)) > δ for P_{X|A=a}-almost all x.

The above assumption is mild and is satisfied by many commonly used distributions, e.g., all continuous distributions; it requires that P_{Y|X,A} does not place a large point mass at its essential infimum. The following result from Jin et al. (2022b, Proposition 4) shows that α* > 0 when Assumption 2.4 holds, ensuring that the gradient of the risk function in the ERM has zero mean.

Proposition 2.5 (Jin et al. (2022b)). Under Assumption 2.4, the optimizer α* of (3) satisfies α* > 0.

3. Policy Value Estimation under Concept Drift

3.1. The Estimation Procedure

Fixing a policy π, we aim to estimate the policy value V_δ(π) using the training dataset D. We first split D into K equally sized disjoint folds D^(k), k ∈ [K] (we assume without loss of generality that n is divisible by K; in practice, we only need a minimum of K = 3 folds), where we slightly abuse notation and use D^(k) to denote the data points or the corresponding indices interchangeably. For each k ∈ [K], we first use the data points in D^(k+1) to obtain the propensity score estimator π̂0^(k) and the optimizers (α̂_π^(k), η̂_π^(k)); here we use the convention that D^(k) = D^(k mod K) for any k. We then define

    Ĝ_π^(k)(x, y) := α̂_π^(k)(x) exp( −(y + η̂_π^(k)(x))/α̂_π^(k)(x) − 1 ) + η̂_π^(k)(x) + α̂_π^(k)(x) δ,
    g_π^(k)(x) := E_P[ Ĝ_π^(k)(X, Y(π(X))) | X = x ].

We next use D^(k+2) to obtain ĝ_π^(k) as an estimator of g_π^(k).

Algorithm 1 Policy estimation under concept drift
  Input: Dataset D; policy π; uncertainty set parameter δ; propensity score estimation algorithm C; ERM algorithm E for obtaining (α*_π, η*_π); regression algorithm R for estimating g_π.
  Randomly split D into K non-overlapping equally-sized folds D^(k), k ∈ [K];
  for k = 1, ..., K do
    π̂0^(k) ← C(D^(k+1)); (α̂_π^(k), η̂_π^(k)) ← E(D^(k+1));
    ĝ_π^(k) ← R({ (Xi, Ai, Ĝ_π^(k)(Xi, Yi)) : i ∈ D^(k+2) });
    Compute V̂_δ^(k)(π) according to Equation (5);
  end for
  Return: V̂_δ(π) ← (1/K) Σ_{k=1}^K V̂_δ^(k)(π).

The policy value estimator V̂_δ^(k)(π) for the k-th fold is

    V̂_δ^(k)(π) = (1/|D^(k)|) Σ_{i ∈ D^(k)} [ 1{π(Xi) = Ai} / π̂0^(k)(Ai | Xi) · ( Ĝ_π^(k)(Xi, Yi) − ĝ_π^(k)(Xi) ) + ĝ_π^(k)(Xi) ].   (5)

The policy value estimator is V̂_δ(π) := (1/K) Σ_{k=1}^K V̂_δ^(k)(π). The complete procedure is summarized in Algorithm 1.
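A minimal sketch of how the doubly-robust fold estimate in Equation (5) can be assembled is given below, assuming the three nuisance quantities have already been fitted on the off-folds by the C, E, and R steps of Algorithm 1. The array names are illustrative and not part of the paper's code.

```python
import numpy as np

def fold_value_estimate(pi_match, pihat0, G_hat, g_hat):
    """Doubly-robust fold estimate in Equation (5), computed on one evaluation fold.

    pi_match : float array, 1{pi(X_i) = A_i} for points in the fold
    pihat0   : estimated propensities pihat0(A_i | X_i), fitted on a different fold
    G_hat    : Ghat(X_i, Y_i), the dual loss evaluated at the cross-fitted (alpha, eta)
    g_hat    : regression estimate of E[Ghat(X, Y(pi(X))) | X = X_i]
    """
    return float(np.mean(pi_match / pihat0 * (G_hat - g_hat) + g_hat))

def cross_fitted_value(fold_estimates):
    """Final estimator of Algorithm 1: average the K per-fold estimates."""
    return float(np.mean(fold_estimates))
```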
Remark 3.1. The estimation procedure involves three model-fitting steps, corresponding to π0, (α*_π, η*_π), and g_π, respectively. The propensity score function π0 can be estimated with off-the-shelf algorithms (e.g., logistic regression, random forest); the conditional mean g_π^(k) can be obtained by regressing Ĝ_π^(k)(Xi, Yi) onto Xi for the points such that Ai = π(Xi), with standard regression algorithms, e.g., kernel regression (Nadaraya, 1964; Watson, 1964), local polynomial regression (Cleveland, 1979; Cleveland & Devlin, 1988), smoothing splines (Green & Silverman, 1993), regression trees (Loh, 2011), and random forests (Ho et al., 1995). The ERM step is more complex and will be discussed in detail shortly.

Remark 3.2. The construction of the estimator V̂_δ(π) employs two major techniques: cross-fitting and de-biasing. The cross-fitting technique splits the training dataset D into K equal folds and fits the nuisance parameters on the off-fold data. This crucially provides the convenient property of independence between the fitted nuisance parameters and the policy value estimators. In contrast to the naïve plug-in estimator, whose convergence rate is largely affected by the slow estimation rate of the nuisance parameters, the de-biasing technique addresses this limitation, thereby leading to the doubly-robust property of the proposed estimator.

The ERM Step. For notational simplicity, we denote θ = (α, η) and write the loss function from Lemma 2.3 as

    ℓ(x, y; θ) = α exp( −(y + η)/α − 1 ) + η + αδ.   (6)

With this notation, θ*_π(x) = (α*_π(x), η*_π(x)) is the optimizer of E_P[ℓ(x, Y(π(x)); θ) | X = x] with respect to θ. Throughout, we make the following assumption on θ*_π.

Assumption 3.3. For any policy π, there exist constants α̲, ᾱ, η̄ such that 0 < α̲ ≤ α*_π(x) ≤ ᾱ and |η*_π(x)| ≤ η̄ for all x ∈ X.

The above assumption is mild and can be satisfied, for example, when θ*_π(x) is continuous in x and X is compact. We refer the readers to Jin et al. (2022b) for a more detailed discussion. Under the unconfoundedness assumption, θ*_π is also a minimizer of E_P[ ℓ(X, Y; θ(X)) 1{A = π(X)} ]. We can thus estimate θ*_π by minimizing the empirical risk:

    θ̂_π^(k) ∈ argmin_{θ ∈ Θ} { Ê_{D^(k+1)}[ ℓ(X, Y; θ(X)) 1{A = π(X)} ] },   (7)

where Θ ⊆ {(α, η) | α : X → R_{≥0}, η : X → R} is to be determined. In our implementation, we follow Yadlowsky et al. (2022); Jin et al. (2022b); Sahoo et al. (2022) and adopt the method of sieves (Geman & Hwang, 1982) to solve (7). Specifically, we consider an increasing sequence Θ1 ⊆ Θ2 ⊆ ... of spaces of smooth functions, and let Θ = Θn in Equation (7). For example, Θn can be a class of polynomials, splines, or wavelets. It has been shown in Jin et al. (2022b, Section 3.4) that under mild regularity conditions, θ̂_π^(k) converges to θ*_π at a reasonably fast rate. For example, if X = ∏_{j=1}^d X_j ⊆ R^d for some compact intervals X_j and θ*_π belongs to the Hölder class of p-smooth functions, then under some other mild regularity conditions,

    ‖θ̂_π^(k) − θ*_π‖_{L2(P_{X|A=π(X)})} = O_P( (log n / n)^{p/(2p+d)} ) and ‖θ̂_π^(k) − θ*_π‖_{L∞} = O_P( (log n / n)^{2p²/(2p+d)²} ).

The solution details are given in Appendix B, and we also refer the readers to Yadlowsky et al. (2018) and Jin et al. (2022b).
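The following minimal sketch illustrates one way this sieve ERM step could be carried out. A raw polynomial basis stands in for the cubic splines used in the paper's implementation (Section 5), the Nelder-Mead call mirrors the optimizer named there, and the α floor corresponds to Proposition 2.5; all function names, the basis choice, and the optimizer options are illustrative assumptions. The inputs are assumed to be restricted to points with A_i = π(X_i).

```python
import numpy as np
from scipy.optimize import minimize

def poly_basis(X, degree=3):
    # Coordinate-wise polynomial features; a stand-in for the spline sieve Theta_n.
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    feats = [np.ones((X.shape[0], 1))]
    for d in range(1, degree + 1):
        feats.append(X ** d)
    return np.hstack(feats)

def empirical_risk(coef, Phi, y, delta, alpha_floor=1e-3):
    # Empirical version of (7): average of ell(x, y; theta(x)) over the points
    # with A_i = pi(X_i); alpha(x) and eta(x) are linear in the basis features.
    p = Phi.shape[1]
    alpha = np.maximum(Phi @ coef[:p], alpha_floor)  # keep alpha(x) > 0 (Proposition 2.5)
    eta = Phi @ coef[p:]
    return np.mean(alpha * np.exp(-(y + eta) / alpha - 1.0) + eta + alpha * delta)

def fit_theta(X_match, y_match, delta, degree=3):
    # Fit (alpha(.), eta(.)) by derivative-free minimization of the empirical risk.
    Phi = poly_basis(X_match, degree)
    init = np.zeros(2 * Phi.shape[1])
    init[0] = 1.0  # start with alpha(x) roughly constant at 1
    res = minimize(empirical_risk, init, args=(Phi, y_match, delta),
                   method="Nelder-Mead", options={"maxiter": 20000, "fatol": 1e-8})
    return res.x, Phi
```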
3.2. Theoretical Guarantees

We are now ready to present the theoretical guarantees for the policy value estimator V̂_δ(π). To start, we assume the following for the convergence rates of the nuisance parameter estimators.

Assumption 3.4 (Asymptotic estimation rate). For any policy π, assume that for each k ∈ [K], the estimators π̂0^(k), ĝ_π^(k) and the empirical risk optimizer θ̂_π^(k) satisfy
(1) ‖π̂0^(k) − π0‖_{L2(P_{X|A=π(X)})} = o_P(n^{−γ1}) and ‖ĝ_π^(k) − g_π^(k)‖_{L2(P_{X|A=π(X)})} = o_P(n^{−γ2}) for some γ1, γ2 ≥ 0 with γ1 + γ2 ≥ 1/2;
(2) ‖θ̂_π^(k) − θ*_π‖_{L2(P_{X|A=π(X)})} = o_P(n^{−1/4}) and ‖θ̂_π^(k) − θ*_π‖_{L∞} = o_P(1).

Assumption 3.4 (1) requires that either the propensity score π0 or the conditional mean of Ĝ_π^(k)(X, Y) is well estimated. This is a standard assumption in the double machine learning literature (Chernozhukov et al., 2018; Athey & Wager, 2021; Zhou et al., 2023; Kallus et al., 2019; 2022; Jin et al., 2022b) and can be achieved by various commonly-used machine learning methods discussed in Section 3.1. Assumption 3.4 (2) requires the optimizer θ̂_π^(k) to be estimated at a rate faster than n^{−1/4}, and can be achieved, for example, by the estimators discussed in Section 3.1 under mild conditions. The empirical sensitivity analysis in Jin et al. (2022b) also provides some justification for Assumption 3.4.

The following theorem states that our estimated policy value V̂_δ(π) is consistent for estimating V_δ(π) and is asymptotically normal. Its proof is provided in Appendix D.2.

Theorem 3.5 (Asymptotic normality). Suppose Assumptions 2.1, 2.4, 3.3, and 3.4 hold. For any policy π : X → [M], we have

    √n ( V̂_δ(π) − V_δ(π) ) →d N(0, σ²_π),

where

    σ²_π = Var( 1{A = π(X)} / π0(A | X) · ( G_π(X, Y) − g_π(X) ) + g_π(X) );
    G_π(x, y) = ℓ(x, y; θ*_π(x)); g_π(x) := E[ G_π(X, Y(π(X))) | X = x ].

4. Policy Learning under Concept Drift

Building on the results and methodology in Section 3, we turn to the problem of policy learning under concept drift. Given a policy class Π and an estimated policy value V̂_δ(π) for each π ∈ Π, it is natural to consider optimizing the estimated policy value over Π to find the best policy. The biggest challenge here is that the nuisance parameter θ̂_π^(k) in the definition of V̂_δ(π) is not only a function of x ∈ X, but also a function of π ∈ Π. The above strategy therefore requires carrying out the ERM step in Section 3.1 for all possible policies π ∈ Π, posing major computational difficulties. Instead of solving for θ̂_π^(k) for each π ∈ Π, we propose a computational shortcut that solves a similar ERM problem for each action a ∈ [M]. To see why this is sufficient, note that for any π ∈ Π,

    E[ ℓ(X, Y(π(X)); θ) | X = x ] = Σ_{a=1}^{M} 1{π(x) = a} E[ ℓ(x, Y(a); θ) | X = x ].   (8)

Letting θ*_a(x) ∈ argmin_θ E[ ℓ(x, Y(a); θ) | X = x ], we can see that θ*_{π(x)}(x) is a minimizer of (8). The policy learning problem then reduces to finding π ∈ Π that maximizes E[ ℓ(X, Y(π(X)); θ*_{π(X)}(X)) ].

Remark 4.1 (Computational efficiency of the proposed shortcut). Constructing θ̂_π(x) via θ̂_{π(x)}(x) substantially reduces the computational complexity of the policy learning task. It is virtually infeasible to estimate θ̂_π(x) for each π in a policy class Π with an infinite number of policies. Alternatively, solving for θ̂_{π(x)}(x) transforms the infeasible task of computing an infinite class of nuisance parameters {θ_π(x) : π ∈ Π} into the feasible task of computing a finite one, {θ_a(x) : a ∈ [M]}. It remains an interesting future direction to extend this trick to continuous action spaces.

4.1. The Learning Algorithm

The policy learning algorithm consists of two main steps: (1) solving for θ*_a for each a ∈ [M] and constructing the policy value estimator V̂_δ(π); (2) learning the optimal policy π*_δ by maximizing V̂_δ(π). As before, we randomly split the original data set D into K folds. For each fold k ∈ [K], we use samples in the (k+1)-th data fold D^(k+1) to obtain the propensity estimator π̂0^(k)(a | ·) (by regression) and the optimizer θ̂_a^(k) (by ERM) for each a ∈ [M]. Next, for each a ∈ [M], define

    G_a(x, y) := ℓ(x, y; θ*_a(x)),  Ĝ_a^(k)(x, y) := ℓ(x, y; θ̂_a^(k)(x)),  g_a^(k)(x) := E[ Ĝ_a^(k)(X, Y(a)) | X = x ].

We then obtain an estimator ĝ_a^(k) of g_a^(k) by regressing Ĝ_a^(k)(Xi, Yi) onto Xi with i ∈ D^(k+2). Finally, we obtain the learned policy by maximizing the estimated policy value:

    π̂_LN = argmax_{π ∈ Π} V̂_δ^LN(π) := (1/K) Σ_{k=1}^K V̂_δ^{LN,(k)}(π);   (9)
    V̂_δ^{LN,(k)}(π) = (1/|D^(k)|) Σ_{i ∈ D^(k)} [ 1{Ai = π(Xi)} / π̂0^(k)(Ai | Xi) · ( Ĝ_{π(Xi)}^(k)(Xi, Yi) − ĝ_{π(Xi)}^(k)(Xi) ) + ĝ_{π(Xi)}^(k)(Xi) ].

The above optimization problem can be solved efficiently by first-order optimization methods or policy tree search as in Zhou et al. (2023); we elaborate on the implementation in Section 5. The complete policy learning procedure is summarized in Algorithm 2.

Algorithm 2 Policy learning under concept drift
  Input: Dataset D; policy class Π; uncertainty set parameter δ; propensity score estimation algorithm C; ERM algorithm E for obtaining θ*_a; regression algorithm R for estimating g_a.
  Randomly split D into K equal-sized folds;
  for k = 1, ..., K do
    π̂0^(k) ← C(D^(k+1));
    for a = 1, ..., M do
      θ̂_a^(k) ← E(D^(k+1));
      ĝ_a^(k) ← R({ (Xi, Ai, Ĝ_a^(k)(Xi, Yi)) : i ∈ D^(k+2) });
    end for
  end for
  Return: π̂_LN that maximizes V̂_δ^LN(π) as in Equation (9).
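The minimal sketch below illustrates how the per-action scores behind Equation (9) can be assembled and searched over once the per-action nuisances are cross-fitted. The array layout, the helper names, and the toy one-dimensional threshold class are illustrative assumptions; the actual implementation in Section 5 uses policytree for the policy search.

```python
import numpy as np

def dr_score_matrix(A, pihat0, G_hat, g_hat):
    """Per-observation, per-action doubly-robust scores underlying Equation (9).

    A      : (n,) observed actions coded 0, ..., M-1
    pihat0 : (n, M) cross-fitted propensity estimates pihat0(a | X_i)
    G_hat  : (n, M) Ghat_a(X_i, Y_i), dual loss at the fitted per-action optimizer theta_a
    g_hat  : (n, M) regression estimates of E[Ghat_a(X, Y(a)) | X = X_i]
    """
    n, M = pihat0.shape
    match = (A[:, None] == np.arange(M)[None, :]).astype(float)
    return match / pihat0 * (G_hat - g_hat) + g_hat

def estimated_value(scores, assignments):
    # V-hat_delta^LN(pi) for a policy given by its action assignments pi(X_i).
    return scores[np.arange(len(assignments)), assignments].mean()

def best_threshold_policy(x1, scores, grid):
    # Toy policy search over one-dimensional threshold rules pi(x) = 1{x_1 > t},
    # standing in for the tree search used in the actual implementation.
    best = (None, -np.inf)
    for t in grid:
        v = estimated_value(scores, (x1 > t).astype(int))
        if v > best[1]:
            best = (t, v)
    return best
```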
4.2. Regret Upper Bound

In this section, we present the regret analysis of π̂_LN obtained by Algorithm 2 (recall the definition of regret given in Equation (2)). Before we embark on the formal analysis, we introduce the Hamming entropy integral κ(Π), which measures the complexity of Π.

Definition 4.2 (Hamming entropy integral). Given a policy class Π and n data points {x1, ..., xn} ⊆ X:
(1) The Hamming distance between two policies π, π′ ∈ Π is d_H(π, π′) := (1/n) Σ_{i=1}^n 1{π(xi) ≠ π′(xi)}.
(2) The ϵ-covering number of Π with respect to {x1, ..., xn}, denoted C(ϵ, Π; {x1, ..., xn}), is the smallest number L of policies {π1, ..., πL} ⊆ Π such that for every π ∈ Π there exists πℓ with d_H(π, πℓ) ≤ ϵ.
(3) The Hamming entropy integral of Π is defined as κ(Π) := ∫_0^1 √(log N_H(ϵ², Π)) dϵ, where N_H(ϵ, Π) := sup_{n ≥ 1} sup_{x1,...,xn} C(ϵ, Π; {x1, ..., xn}).

The following theorem provides a regret upper bound for the policy learned by Algorithm 2.

Theorem 4.3. Suppose Assumptions 2.1, 2.4, 3.3, and 3.4 hold. For any β ∈ (0, 1), there exists N ∈ N+ such that when n ≥ N, we have with probability at least 1 − β that

    R_δ(π̂_LN) ≤ ( C0(ᾱ, α̲, η̄, δ, ε) / √n ) · ( 65 + 8κ(Π) + √(log(1/β)) ),

where C0(ᾱ, α̲, η̄, δ, ε) := 6( ᾱ exp(η̄/α̲ − 1) + η̄ + ᾱδ ) / ε.

The proof of Theorem 4.3 is deferred to Appendix D. At a high level, we decompose the regret and upper bound it by the supremum of the estimation error of the policy values:

    R_δ(π̂_LN) = V_δ(π*) − V_δ(π̂_LN)
    = [ V_δ(π*) − V̂_δ^LN(π*) ] + [ V̂_δ^LN(π*) − V̂_δ^LN(π̂_LN) ] + [ V̂_δ^LN(π̂_LN) − V_δ(π̂_LN) ]
    ≤ 2 sup_{π ∈ Π} | V̂_δ^LN(π) − V_δ(π) |,

where the last step uses the choice of π̂_LN. We bound the above quantity by establishing uniform convergence results for the policy value estimators. Through a careful chaining argument, we show that the dependence of R_δ(π̂_LN) on n is of the order O(n^{−1/2}), which is sharper than the O(n^{−1/2} log n) dependence of Mu et al. (2022) by a logarithmic factor. We also note that both regret bounds are asymptotic in n and hold for sufficiently large n.

4.3. Regret Lower Bound

In this section, we complement the regret upper bound in Theorem 4.3 with a minimax lower bound that characterizes the fundamental difficulty of policy learning under concept drift. Our lower bound is stated in terms of the Natarajan dimension (Natarajan, 1989), defined as follows.

Definition 4.4 (Natarajan dimension). Given an M-action policy class Π, we say a set of m points {x1, ..., xm} is shattered by Π if there exist two functions f_{−1}, f_{1} : {x1, ..., xm} → [M] such that f_{−1}(xj) ≠ f_{1}(xj) for all j ∈ [m], and for any σ ∈ {±1}^m there exists a policy π ∈ Π such that π(xj) = f_{σj}(xj) for all j ∈ [m]. The Natarajan dimension Ndim(Π) of Π is the size of the largest set shattered by Π.

Remark 4.5 (Connection to other complexity measures). As can be seen from the definition, the Natarajan dimension generalizes the Vapnik–Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 2015) to the multi-class classification setting. The Natarajan dimension is also closely related to the Hamming entropy integral κ(Π) in our upper bound, as κ(Π) = O(√(log(d) · Ndim(Π))) (Cai et al., 2020).

Theorem 4.6 (Regret lower bound).
Let P denote the set of all distributions of (X, A, Y(1), ..., Y(M)) that satisfy Assumptions 2.1, 2.4, 3.3, and 3.4 (when we say a distribution P satisfies Assumption 3.4, we mean that under P there exist θ̂, π̂0, and ĝ that satisfy the convergence rates in Assumption 3.4). Suppose that δ ≤ 0.2, n ≥ Ndim(Π)², and Ndim(Π) ≤ 4/(9ε). For any policy learning algorithm that outputs π̂ as a function of {(Xi, Ai, Yi)}_{i=1}^n, we have

    sup_{P ∈ P} E_{P^n}[ R_δ(π̂) ] ≥ (1/120) √(Ndim(Π)/n).

The proof of Theorem 4.6 is provided in Appendix D.4. Theorem 4.6 implies that for any learning algorithm, there exists a problem instance such that the regret scales as Ω(√(Ndim(Π)/n)).

Remark 4.7 (Optimality of Algorithm 2). Recalling the relationship between the Natarajan dimension and the Hamming entropy integral in the remark above, we see that our proposed algorithm achieves the minimax rate in both the sample size and the policy class complexity.

5. Numerical Results

We evaluate our learning algorithm in two settings, a simulated dataset and a real-world dataset, against the benchmark algorithm SNLN in Si et al. (2023, Algorithm 2).

Simulated Dataset. Our data generating process follows that of the linear boundary example in Si et al. (2023). We let the context set X be the closed unit ball of R^5 and the action set be {1, 2, 3}; the rewards Y(a) are mutually independent conditioned on X, with Y(a) | X being Gaussian, for a ∈ [3]. The training datasets Dtrain are generated with a fixed behavior policy π0 (unknown to the learner) over 50 random seeds. We similarly generate 100 testing datasets Dtest of size 10,000. The details are given in Appendix C.

| Metric | δ    | Policy  | n = 7500        | n = 13500       | n = 16500       | n = 19500       |
|--------|------|---------|-----------------|-----------------|-----------------|-----------------|
| Vδ     | 0.05 | π̂LN     | 0.2272 ± 0.002  | 0.2299 ± 0.001  | 0.2303 ± 0.001  | 0.2310 ± 0.001  |
| Vδ     | 0.05 | π̂SNLN   | 0.0554 ± 0.005  | 0.0589 ± 0.004  | 0.0617 ± 0.004  | 0.0664 ± 0.003  |
| Vδ     | 0.1  | π̂LN     | 0.1579 ± 0.007  | 0.1662 ± 0.002  | 0.1663 ± 0.002  | 0.1678 ± 0.002  |
| Vδ     | 0.1  | π̂SNLN   | 0.0548 ± 0.004  | 0.0580 ± 0.004  | 0.0583 ± 0.003  | 0.0616 ± 0.004  |
| Vδ     | 0.2  | π̂LN     | 0.0781 ± 0.003  | 0.0802 ± 0.002  | 0.0804 ± 0.002  | 0.0831 ± 0.002  |
| Vδ     | 0.2  | π̂SNLN   | 0.0182 ± 0.003  | 0.0183 ± 0.003  | 0.0200 ± 0.003  | 0.0219 ± 0.003  |

Table 2. Empirical robust policy value Vδ of policies π̂LN and π̂SNLN on the simulated dataset and the corresponding 95% confidence intervals, over 50 seeds.

Figure 1. Empirical robust policy value Vδ of policies π̂LN, π̂SNLN on the real-world dataset, over 50 seeds. Shading corresponds to 95% confidence intervals.

Real-world Dataset. We consider the dataset of a large-scale randomized experiment comparing assistance programs offered to French unemployed individuals, provided in Behaghel et al. (2014). The decision maker is trying to learn a personalized policy that decides whether to provide (i) an intensive counseling program run by a public agency (A = 0), or (ii) a similar program run by private agencies (A = 1), to an unemployed individual. The reward Y is binary and indicates reemployment within six months. The processed dataset is provided in Kallus (2023).

Implementation. In our implementation, the number of splits is taken to be K = 3. We use the Random Forest regressor from the scikit-learn Python library to estimate π̂0 and ĝ. For estimating θ*, we adopt the cubic spline method and employ the Nelder–Mead optimization method in the SciPy Python library (Virtanen et al., 2020) to optimize the coefficients in the spline approximation, where the obtained estimator of α is thresholded at 0.001 to respect Proposition 2.5. Finally, we optimize and find π̂LN with policytree (Sverdrup et al., 2020). A working example on the real-world dataset is given at https://github.com/off-policy-learning/concept-drift-robust-learning.
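As a small illustration of the nuisance-fitting step, the sketch below estimates the propensity scores by regressing action indicators on the covariates with scikit-learn's Random Forest regressor, as described above. The hyperparameters, the clipping level, and the function name are illustrative assumptions; the paper does not report these details.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_propensity(X_train, A_train, M, clip=0.01):
    # Estimate pihat0(a | x) by regressing the indicator 1{A = a} on X for each action;
    # `clip` is an illustrative floor on the predicted probabilities.
    models = []
    for a in range(M):
        rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=a)
        rf.fit(X_train, (A_train == a).astype(float))
        models.append(rf)

    def predict(X):
        P = np.column_stack([m.predict(X) for m in models])
        P = np.clip(P, clip, None)
        return P / P.sum(axis=1, keepdims=True)  # renormalize across actions
    return predict
```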
The benchmark algorithm SNLN is adapted from Si et al. (2023, Algorithm 2) as in Kallus et al. (2022); the implemented code of the SNLN benchmark is provided by Kallus et al. (2022). Since Si et al. (2023, Algorithm 2) is designed for the joint distribution shift formulation, we revised the original algorithm to fit our concept drift setting. The well-known KL-divergence chain rule (Cover, 1999) gives

    D_KL(Q_{X,Y} ‖ P_{X,Y}) = D_KL(Q_X ‖ P_X) + D_KL(Q_{Y|X} ‖ P_{Y|X}).   (10)

Therefore, given any uncertainty set radius δ and a known covariate shift (in this experiment, we assume no covariate shift), Si et al. (2023, Algorithm 2) can be used to implement policy learning under concept drift. Note that SNLN assumes known propensity scores. As we only consider the case where the propensity scores are unknown, we complement Si et al. (2023, Algorithm 2) with propensity scores estimated by the Random Forest regressor in scikit-learn, in the same way as in the implementation of Algorithm 2.

| Metric   | δ   | Policy | n = 7500        | n = 13500       | n = 16500       | n = 19500       |
|----------|-----|--------|-----------------|-----------------|-----------------|-----------------|
| Vmin_δ   | 0.1 | π̂LN    | 0.2075 ± 0.015  | 0.2139 ± 0.005  | 0.2149 ± 0.007  | 0.2167 ± 0.003  |
| Vmin_δ   | 0.1 | π̂SNLN  | 0.1884 ± 0.007  | 0.2009 ± 0.008  | 0.2017 ± 0.006  | 0.2020 ± 0.004  |

Table 3. Empirical worst-case policy reward on the KL-sphere, Vmin_δ, of policies π̂LN and π̂SNLN, over 20 seeds.

Evaluation. For a learned policy π̂, we evaluate its performance by the following metrics.
(i) We use the testing dataset Dtest to estimate the robust policy value V_δ(π̂) by the empirical robust policy value

    Vδ(π̂) = (1/|Dtest|) Σ_{i ∈ Dtest} ℓ(Xi, Yi(π̂(Xi)); θ_π̂(Xi)),

where the nuisance parameters θ_π̂(X) are fitted via the cubic spline method with Nelder–Mead optimization on the testing dataset induced by the policy π̂: Dtest,π̂ = {(Xi, Yi(π̂(Xi)))}.
(ii) We also perturb the simulated dataset to mimic possible real-world distributional shift. For each of the 100 testing datasets, indexed by j and containing 10,000 data points, we generate a new testing dataset in which each reward vector (Ỹ_i^(j)(1), Ỹ_i^(j)(2), Ỹ_i^(j)(3)) is randomly sampled on the KL-sphere of radius δ centered at the reward distribution of (Y_i^(j)(1), Y_i^(j)(2), Y_i^(j)(3)) at that testing data point. Then we evaluate π̂ using

    Vmin_δ(π̂) := min_{j ∈ [100]} (1/10000) Σ_{i=1}^{10000} Ỹ_i^(j)( π̂(X_i^(j)) ).

This simulates a more realistic scenario where the policy performance is measured on test datasets with concept drift.
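For completeness, a minimal sketch of how metric (i) can be computed once α(x) and η(x) have been refit on the test data (for example with the sieve routine sketched in Section 3.1) is given below; the array names are illustrative and the refitting step itself is not shown.

```python
import numpy as np

def empirical_robust_value(y_under_pi, alpha, eta, delta):
    # Metric (i): average of the dual loss ell(X_i, Y_i(pi(X_i)); theta(X_i)) over the
    # test set, where alpha = alpha(X_i) and eta = eta(X_i) are refit on the test data
    # induced by the evaluated policy.
    losses = alpha * np.exp(-(y_under_pi + eta) / alpha - 1.0) + eta + alpha * delta
    return float(np.mean(losses))
```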
Results. Table 2 and Figure 1 report the values Vδ of the learned policies π̂LN and π̂SNLN on the simulated and the real-world dataset, respectively. Table 3 provides the results for Vmin_δ. All results are reported with 95% confidence intervals. Table 2 shows that π̂LN outperforms the benchmark π̂SNLN consistently, with higher policy values and similar 95% confidence intervals. In Figure 1, π̂LN continues to show this advantage over π̂SNLN on the real-world dataset. With a larger δ, the policy values of π̂LN and π̂SNLN are smaller, due to a larger uncertainty set. Table 3 shows that π̂LN achieves higher worst-case rewards than π̂SNLN in the more realistic setting with concept-drifted testing datasets. Together, we see that π̂LN succeeds in finding a better policy under concept drift, while the performance of π̂SNLN is compromised by its conservative policy learning process, in which it considers joint distributional shifts even though it is given the information that no covariate shift took place.

The results align with the intuition that Algorithm 2 considers an uncertainty set that is a subset of the one SNLN considers, as explained in Equation (10). Consequently, Vδ(π̂SNLN) is a lower bound of Vδ(π̂LN) in theory, and, by the results in Table 2, in practice. In real-world applications, knowing the source of the distribution shift effectively shrinks the uncertainty set, thereby yielding less conservative results. Since it is fairly easy to identify covariate shifts (compared with detecting concept drift), when the decision maker observes little or no covariate shift and would like to hedge against the risk of concept drift, it is suitable to apply our method, which outperforms existing methods designed for learning under joint distributional shifts.

One limitation of our methodology (shared with other DRO works) is the choice of δ. The parameter δ controls the size of the uncertainty set considered and thus controls the degree of robustness in our model: the larger δ, the more robust the output. The empirical performance of the algorithm substantially depends on the selection of δ. A small δ leads to a negligible robustification effect, and the algorithm would learn an over-aggressive policy; a large δ tends to yield more conservative results. A more detailed discussion can be found in Si et al. (2023). In Appendix C, we also provide simulation results for Algorithm 1 with a fixed target policy, which show that Algorithm 1 can estimate the distributionally robust policy value under concept drift efficiently.

6. Discussion

In this paper, we study the policy learning problem under concept drift. We propose a doubly-robust policy value estimator that is consistent and asymptotically normal, and then develop a minimax optimal policy learning algorithm whose regret is O_P(κ(Π)n^{−1/2}), with a matching lower bound. Numerical results show that our learning algorithm outperforms the benchmark algorithm under concept drift. We also note that our results on pure concept drift could be extended to a more general setting where the concept drift takes the same form as ours, but there is in addition an identifiable covariate shift as in Jin et al. (2022b). The details can be found in Appendix E.

Acknowledgments

This work is generously supported by the ONR grant 13983263 and the NSF grant CCF-2312205. The authors would like to thank Miao Lu and Wenhao Yang for pointing out an error in an earlier draft of this manuscript and their helpful comments. Z. R. is supported by Wharton Analytics. R. Z. is supported by RGC grant ECS-26210324.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ai, J. and Ren, Z. Not all distributional shifts are equal: Fine-grained robust conformal inference. arXiv preprint arXiv:2402.13042, 2024.

Athey, S. and Wager, S. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.

Behaghel, L., Crépon, B., and Gurgand, M. Private and public provision of counseling to job seekers: Evidence from a large controlled experiment. American Economic Journal: Applied Economics, 6(4):142–174, 2014.

Bertsimas, D. and Sim, M. The price of robustness. Operations Research, 52(1):35–53, 2004.

Bibaut, A., Kallus, N., Dimakopoulou, M., Chambaz, A., and van der Laan, M. Risk minimization from adaptively collected data: Guarantees for supervised and policy learning.
Advances in Neural Information Processing Systems, 34:19261–19273, 2021.

Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.

Cai, F., Qu, Z., Xia, L., and Zhou, Z. Multi-action offline policy learning with Bayesian optimization. arXiv preprint arXiv:2003.07545v1, 2020.

Cauchois, M., Gupta, S., Ali, A., and Duchi, J. C. Robust validation: Confident predictions even when distributions shift. Journal of the American Statistical Association, pp. 1–66, 2024.

Chan, C. W., Farias, V. F., Bambos, N., and Escobar, G. J. Optimizing intensive care unit discharge decisions with patient readmissions. Operations Research, 60(6):1323–1341, 2012.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters, 2018.

Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

Cleveland, W. S. and Devlin, S. J. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.

Cover, T. M. Elements of Information Theory. John Wiley & Sons, 1999.

Delage, E. and Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. Under review, 2(1), 2019.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

Geman, S. and Hwang, C.-R. Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics, pp. 401–414, 1982.

Green, P. J. and Silverman, B. W. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. CRC Press, 1993.

Guo, Y., Liu, H., Yue, Y., and Liu, A. Distributionally robust policy evaluation under general covariate shift in contextual bandits. arXiv preprint arXiv:2401.11353, 2024.

Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, 1995.

Hu, Z. and Hong, L. J. Kullback–Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 1(2):9, 2013.

Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, pp. 5084–5096. PMLR, 2021.

Jin, Y., Ren, Z., Yang, Z., and Wang, Z. Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality. arXiv preprint arXiv:2212.09900, 2022a.

Jin, Y., Ren, Z., and Zhou, Z. Sensitivity analysis under the f-sensitivity models: a distributional robustness perspective. arXiv preprint arXiv:2203.04373, 2022b.
Jin, Y., Guo, K., and Rothenhäusler, D. Diagnosing the role of observable distribution shift in scientific replications. arXiv preprint arXiv:2309.01056, 2023.

Kallus, N. Treatment effect risk: Bounds and inference. Management Science, 69(8):4579–4590, 2023.

Kallus, N. and Udell, M. Dynamic assortment personalization in high dimensions. arXiv preprint arXiv:1610.05604, 2016.

Kallus, N., Mao, X., and Uehara, M. Localized debiased machine learning: Efficient inference on quantile treatment effects and beyond. arXiv preprint arXiv:1912.12945, 2019.

Kallus, N., Mao, X., Wang, K., and Zhou, Z. Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, pp. 10598–10632. PMLR, 2022.

Kim, E. S., Herbst, R. S., Wistuba, I. I., Lee, J. J., Blumenschein Jr, G. R., Tsao, A., Stewart, D. J., Hicks, M. E., Erasmus Jr, J., Gupta, S., et al. The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery, 1(1):44–53, 2011.

Kitagawa, T. and Tetenov, A. Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.

Liu, J., Wang, T., Cui, P., and Namkoong, H. On the need for a language describing distribution shifts: Illustrations on tabular datasets. arXiv preprint arXiv:2307.05284, 2023.

Loh, W.-Y. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14–23, 2011.

Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018.

Luenberger, D. G. Optimization by Vector Space Methods. John Wiley & Sons, 1997.

Mu, T., Chandak, Y., Hashimoto, T. B., and Brunskill, E. Factored DRO: Factored distributionally robust policies for contextual bandits. Advances in Neural Information Processing Systems, 35:8318–8331, 2022.

Murphy, S. A. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(2):331–355, 2003.

Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

Namkoong, H., Yadlowsky, S., et al. Diagnosing model performance under distribution shift. arXiv preprint arXiv:2303.02011, 2023.

Natarajan, B. K. On learning sets and functions. Machine Learning, 4:67–97, 1989.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.

Sahoo, R., Lei, L., and Wager, S. Learning from a biased sample. arXiv preprint arXiv:2209.01754, 2022.

Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributionally robust batch contextual bandits. Management Science, 2023.

Sverdrup, E., Kanodia, A., Zhou, Z., Athey, S., and Wager, S. policytree: Policy learning via doubly robust empirical welfare maximization over trees. Journal of Open Source Software, 5(50):2232, 2020.

Swaminathan, A. and Joachims, T. Batch learning from logged bandit feedback through counterfactual risk minimization. The Journal of Machine Learning Research, 16(1):1731–1755, 2015a.

Swaminathan, A. and Joachims, T. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pp. 814–823. PMLR, 2015b.

Swaminathan, A. and Joachims, T. The self-normalized estimator for counterfactual learning. Advances in Neural Information Processing Systems, 28, 2015c.

Van der Vaart, A. W.
Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis, pp. 11–30. Springer, 2015.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Watson, G. S. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pp. 359–372, 1964.

Widmer, G. and Kubat, M. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996.

Yadlowsky, S., Namkoong, H., Basu, S., Duchi, J., and Tian, L. Bounds on the conditional and average treatment effect with unobserved confounding factors. arXiv preprint arXiv:1808.09521, 2018.

Yadlowsky, S., Namkoong, H., Basu, S., Duchi, J., and Tian, L. Bounds on the conditional and average treatment effect with unobserved confounding factors. Annals of Statistics, 50(5):2587, 2022.

Yang, W., Zhang, L., and Zhang, Z. Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248, 2022.

Zhan, R., Ren, Z., Athey, S., and Zhou, Z. Policy learning with adaptively collected data. Management Science, 2023.

Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.

Zhang, Z., Zhan, W., Chen, Y., Du, S. S., and Lee, J. D. Optimal multi-distribution learning. arXiv preprint arXiv:2312.05134, 2023.

Zhou, Z., Athey, S., and Wager, S. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148–183, 2023.

A. Notation

We use [n] to denote the discrete set {1, 2, ..., n} for any positive integer n. We use argmin and argmax to denote the minimizers and maximizers; if the minimizer or the maximizer cannot be attained, we project it back to the feasible set. We denote the usual p-norm by ‖·‖_p. We let P be any probability measure defined on the probability space (Ω, σ(Ω), P) and P̂ be the corresponding empirical distribution. For any function f, we denote the L2(P)-norm of f conventionally by ‖f‖_{L2(P)} = ( ∫ |f(x)|² dP(x) )^{1/2} and ‖f‖_{L∞} = sup_{x∈X} |f(x)|. For any random variables X, Y, we write X ⫫ Y to denote that X is independent of Y. For a random variable/vector X, we use E_X[·] to indicate the expectation taken over the distribution of X.

B. The ERM Solution

To solve the ERM problem, we follow Geman & Hwang (1982); Yadlowsky et al. (2022); Jin et al. (2022b) and adopt the method of sieves: consider an increasing sequence Θ1 ⊆ Θ2 ⊆ ... of spaces of smooth functions, and let

    (α̂_π^(k), η̂_π^(k)) = argmin_{θ ∈ Θn} Ên[ ℓ(X, Y(π(X)); θ(X)) ].

In our case, we consider the following classes of sufficiently smooth functions.
For q1 = ⌈q⌉ − 1 and q2 = q − q1 (where q is the smoothness parameter), define the following Hölder-type function class for η,

    Θ^q_c(X) := { h ∈ C^{q1}(X) : sup_{x∈X} Σ_{Σ_l γ_l ≤ q1} |D^γ h(x)| + Σ_{Σ_l γ_l = q1} sup_{x ≠ x′} |D^γ h(x) − D^γ h(x′)| / ‖x − x′‖^{q2} ≤ c },

where γ ranges over multi-indices, and an analogous class for α; the sieve Θn is obtained by approximating these classes with splines (or polynomials, wavelets) whose dimension grows with n.

D.2. Proof of Theorem 3.5

The proof relies on the following properties of the loss ℓ in Equation (6), collected as Lemma D.1; they hold under Assumptions 2.1, 2.4, and 3.3.

Lemma D.1. For any policy π, the following hold.
(1) First-order condition: E[ ∇_θ ℓ(x, Y(π(x)); θ*_π(x)) | X = x ] = 0 for P_X-almost all x.
(2) There exists a constant ξ > 0 such that for any x and any θ satisfying ‖θ − θ*_π(x)‖2 ≤ ξ,

    | ℓ(x, y; θ) − ℓ(x, y; θ*_π(x)) − ∇_θ ℓ(x, y; θ*_π(x))ᵀ (θ − θ*_π(x)) | ≤ ℓ̄(x, y) ‖θ − θ*_π(x)‖²₂,

for some function ℓ̄(x, y) such that sup_{x∈X} E[ ℓ̄(x, Y(π(x))) | X = x ] < L for some L > 0.
(3) There exists a constant ξ1 > 0 such that for any θ satisfying ‖θ − θ*_π‖_{L∞} ≤ ξ1,

    ‖ℓ(X, Y(π(X)); θ(X)) − ℓ(X, Y(π(X)); θ*_π(X))‖_{L2(P_{X,Y(π(X)) | A=π(X)})} ≤ C_ℓ ‖θ − θ*_π‖_{L2(P_{X | A=π(X)})},

for some constant C_ℓ > 0.

We proceed to show the asymptotic normality of V̂_δ(π). For each k ∈ [K], we first define the following oracle quantity:

    V̄_δ^(k)(π) = (1/|D^(k)|) Σ_{i∈D^(k)} [ 1{π(Xi) = Ai}/π0(Ai | Xi) · ( G_π(Xi, Yi) − g_π(Xi) ) + g_π(Xi) ].

In the sequel, we shall show that V̂_δ^(k)(π) = V̄_δ^(k)(π) + o_P(n^{−1/2}). We begin by decomposing the difference between V̂_δ^(k)(π) and V̄_δ^(k)(π):

    V̂_δ^(k)(π) − V̄_δ^(k)(π)
    = (1/|D^(k)|) Σ_{i∈D^(k)} [ 1{π(Xi)=Ai}/π̂0^(k)(Ai|Xi) · ( Ĝ_π^(k)(Xi,Yi) − ĝ_π^(k)(Xi) ) − 1{π(Xi)=Ai}/π0(Ai|Xi) · ( G_π(Xi,Yi) − g_π(Xi) ) ] + (1/|D^(k)|) Σ_{i∈D^(k)} ( ĝ_π^(k)(Xi) − g_π(Xi) )
    = (I) + (II) + (III) + (IV),

where

    (I) = (1/|D^(k)|) Σ_{i∈D^(k)} 1{Ai=π(Xi)}/π0(Ai|Xi) · ( Ĝ_π^(k)(Xi,Yi) − G_π(Xi,Yi) ),
    (II) = −(1/|D^(k)|) Σ_{i∈D^(k)} ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( ĝ_π^(k)(Xi) − g_π^(k)(Xi) ),
    (III) = (1/|D^(k)|) Σ_{i∈D^(k)} ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( Ĝ_π^(k)(Xi,Yi) − g_π^(k)(Xi) ),
    (IV) = (1/|D^(k)|) Σ_{i∈D^(k)} ( 1 − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( ĝ_π^(k)(Xi) − g_π(Xi) ).

Bounding Term (I). Recall that θ*_π(x) is the minimizer of E[ ℓ(x, Y(π(x)); θ) | X = x ]. By the first-order condition established in part (1) of Lemma D.1, we have

    E[ ∇_θ ℓ(x, Y(π(x)); θ*_π(x)) | X = x ] = 0.   (13)

For any i ∈ D^(k), by the unconfoundedness condition in Assumption 2.1, we have

    E[ 1{Ai=π(Xi)}/π0(Ai|Xi) · ( Ĝ_π^(k)(Xi, Yi) − G_π(Xi, Yi) ) | D^(−k) ]
    = E[ Ĝ_π^(k)(Xi, Yi(π(Xi))) − G_π(Xi, Yi(π(Xi))) | D^(−k) ]
    = E[ ℓ(Xi, Yi(π(Xi)); θ̂_π^(k)(Xi)) − ℓ(Xi, Yi(π(Xi)); θ*_π(Xi)) − ∇_θ ℓ(Xi, Yi(π(Xi)); θ*_π(Xi))ᵀ ( θ̂_π^(k)(Xi) − θ*_π(Xi) ) | D^(−k) ],

where the last step is due to Equation (13). By Assumption 3.4, ‖θ̂_π^(k) − θ*_π‖_{L∞} = o_P(1). Therefore, for any β ∈ (0,1), there exists N ∈ N+ such that for n ≥ N, ‖θ̂_π^(k) − θ*_π‖_{L∞} ≤ min(ξ, ξ1) with probability at least 1 − β. On the event { ‖θ̂_π^(k) − θ*_π‖_{L∞} ≤ min(ξ, ξ1) }, by part (2) of Lemma D.1 and Jensen's inequality, we have

    | E[ 1{Ai=π(Xi)}/π0(Ai|Xi) ( Ĝ_π^(k)(Xi,Yi) − G_π(Xi,Yi) ) | D^(−k) ] |
    ≤ E[ | ℓ(Xi, Yi(π(Xi)); θ̂_π^(k)(Xi)) − ℓ(Xi, Yi(π(Xi)); θ*_π(Xi)) − ∇_θ ℓ(Xi, Yi(π(Xi)); θ*_π(Xi))ᵀ (θ̂_π^(k)(Xi) − θ*_π(Xi)) | | D^(−k) ]
    ≤ E[ ℓ̄(Xi, Yi) ‖θ̂_π^(k)(Xi) − θ*_π(Xi)‖²₂ | D^(−k) ]
    ≤ L ‖θ̂_π^(k) − θ*_π‖²_{L2(P_X)}.

By Chebyshev's inequality, we have for any t > 0 that

    P( | (1/|D^(k)|) Σ_{i∈D^(k)} 1{Ai=π(Xi)}/π0(Ai|Xi) ( Ĝ_π^(k)(Xi,Yi) − G_π(Xi,Yi) ) − E[ 1{A=π(X)}/π0(A|X) ( Ĝ_π^(k)(X,Y) − G_π(X,Y) ) | D^(−k) ] | ≥ t | D^(−k) )
    ≤ (1/(|D^(k)| t²)) Var( 1{A=π(X)}/π0(A|X) [ Ĝ_π^(k)(X,Y) − G_π(X,Y) ] | D^(−k) )
    ≤ ‖Ĝ_π^(k) − G_π‖²_{L2(P_{X,Y | A=π(X)})} / ( ε² |D^(k)| t² )
    ≤ C_ℓ² ‖θ̂_π^(k) − θ*_π‖²_{L2(P_{X | A=π(X)})} / ( ε² |D^(k)| t² ),

where the last step is due to part (3) of Lemma D.1. Combining the above results, we have that

    term (I) = O_P( n^{−1/2} ‖θ̂_π^(k) − θ*_π‖_{L2(P_X)} + ‖θ̂_π^(k) − θ*_π‖²_{L2(P_X)} ) = o_P(n^{−1/2}),

where the last step is due to Assumption 3.4.

Bounding Term (II).
Applying the Cauchy–Schwarz inequality to term (II), we have

    | (1/|D^(k)|) Σ_{i∈D^(k)} ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( ĝ_π^(k)(Xi) − g_π^(k)(Xi) ) |
    ≤ √( (1/|D^(k)|) Σ_{i∈D^(k)} 1{Ai=π(Xi)} ( 1/π̂0^(k)(Ai|Xi) − 1/π0(Ai|Xi) )² ) · √( (1/|D^(k)|) Σ_{i∈D^(k)} 1{Ai=π(Xi)} ( ĝ_π^(k)(Xi) − g_π^(k)(Xi) )² )
    = O_P( ε^{−2} ‖π̂0^(k) − π0‖_{L2(P_{X|A=π(X)})} ‖ĝ_π^(k) − g_π^(k)‖_{L2(P_{X|A=π(X)})} ) = o_P(n^{−1/2}),

where the next-to-last step is due to the lower bound on π0 and π̂0^(k), and the last equality is due to the convergence rate of the product of the estimation errors in Assumption 3.4 (γ1 + γ2 ≥ 1/2).

Bounding Term (III). By Assumption 3.4, for any β ∈ (0, 1), there exists N1 ∈ N+ such that for n ≥ N1, P( ‖θ̂_π^(k) − θ*_π‖_{L∞} ≤ min(α̲, η̄)/2 ) ≥ 1 − β. On the event { ‖θ̂_π^(k) − θ*_π‖_{L∞} ≤ min(α̲, η̄)/2 },

    Ĝ_π^(k)(x, y) = ℓ(x, y; θ̂_π^(k)(x)) ≤ 2ᾱ exp( (2ȳ + 4η̄)/α̲ − 1 ) + 2η̄ + 2ᾱδ =: Lg.

Next, for any i ∈ D^(k),

    E[ ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( Ĝ_π^(k)(Xi, Yi) − g_π^(k)(Xi) ) | D^(−k) ]
    = E[ ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) · E[ Ĝ_π^(k)(Xi, Yi(π(Xi))) − g_π^(k)(Xi) | Xi, D^(−k) ] | D^(−k) ] = 0,

where the first step is by the unconfoundedness assumption and the second step is due to the fact that g_π^(k) is the conditional expectation of Ĝ_π^(k). On the event { ‖θ̂_π^(k) − θ*_π‖_{L∞} ≤ min(α̲, η̄)/2 }, by Chebyshev's inequality, for any t > 0,

    P( | (1/|D^(k)|) Σ_{i∈D^(k)} ( 1{Ai=π(Xi)}/π̂0^(k)(Ai|Xi) − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( Ĝ_π^(k)(Xi,Yi) − g_π^(k)(Xi) ) | ≥ t | D^(−k) )
    ≤ (1/(|D^(k)| t²)) Var( ( 1{A=π(X)}/π̂0^(k)(A|X) − 1{A=π(X)}/π0(A|X) ) ( Ĝ_π^(k)(X,Y) − g_π^(k)(X) ) | D^(−k) )
    ≤ (1/(|D^(k)| t²)) E[ ( 1{A=π(X)}/π̂0^(k)(A|X) − 1{A=π(X)}/π0(A|X) )² ( Ĝ_π^(k)(X,Y) − g_π^(k)(X) )² | D^(−k) ]
    ≤ 4L²g / ( |D^(k)| ε⁴ t² ) · ‖π̂0^(k) − π0‖²_{L2(P_{X|A=π(X)})}.

The above inequality, along with a union bound, implies that term (III) = O_P( ‖π̂0^(k) − π0‖_{L2(P_{X|A=π(X)})} / √|D^(k)| ) = o_P(n^{−1/2}), where the last step is by the consistency of π̂0^(k) assumed in Assumption 3.4.

Bounding Term (IV). We first show that term (IV) has zero mean:

    E[ ( 1 − 1{Ai=π(Xi)}/π0(Ai|Xi) ) ( ĝ_π^(k)(Xi) − g_π(Xi) ) | D^(−k) ]
    = E[ ĝ_π^(k)(Xi) − g_π(Xi) | D^(−k) ] − E[ 1{Ai=π(Xi)}/π0(Ai|Xi) ( ĝ_π^(k)(Xi) − g_π(Xi) ) | D^(−k) ] = 0,

since E[ 1{Ai=π(Xi)}/π0(Ai|Xi) | Xi ] = 1. By Chebyshev's inequality, for any t > 0,

    P( | (1/|D^(k)|) Σ_{i∈D^(k)} ( 1 − 1{π(Xi)=Ai}/π0(Ai|Xi) ) ( ĝ_π^(k)(Xi) − g_π(Xi) ) | ≥ t | D^(−k) )
    ≤ (1/(|D^(k)| t²)) Var( 1{Ai=π(Xi)}/π0(Ai|Xi) ( ĝ_π^(k)(Xi) − g_π(Xi) ) − ( ĝ_π^(k)(Xi) − g_π(Xi) ) | D^(−k) )
    = (1/(|D^(k)| t²)) E[ ( 1 − π0(π(Xi)|Xi) )/π0(π(Xi)|Xi) · ( ĝ_π^(k)(Xi) − g_π(Xi) )² | D^(−k) ].

As a result, term (IV) = O_P( ‖ĝ_π^(k) − g_π‖_{L2(P_X)} / √n ). Note that

    ‖ĝ_π^(k) − g_π‖_{L2(P_X)} = O( ‖ĝ_π^(k) − g_π‖_{L2(P_{X|A=π(X)})} ) ≤ O( ‖ĝ_π^(k) − g_π^(k)‖_{L2(P_{X|A=π(X)})} + ‖g_π^(k) − g_π‖_{L2(P_{X|A=π(X)})} ),

where the first equality follows from the overlap condition. By Assumption 3.4, ‖ĝ_π^(k) − g_π^(k)‖_{L2(P_{X|A=π(X)})} = o_P(1). Meanwhile,

    ‖g_π^(k) − g_π‖²_{L2(P_{X|A=π(X)})}
    = E[ ( g_π^(k)(X) − g_π(X) )² | A = π(X), D^(−k) ]
    = E[ ( E[ ℓ(X, Y(π(X)); θ̂_π^(k)(X)) − ℓ(X, Y(π(X)); θ*_π(X)) | X ] )² | A = π(X), D^(−k) ]
    ≤(i) E[ ( ℓ(X, Y(π(X)); θ̂_π^(k)(X)) − ℓ(X, Y(π(X)); θ*_π(X)) )² | A = π(X), D^(−k) ]
    =(ii) O( ‖θ̂_π^(k) − θ*_π‖²_{L2(P_{X|A=π(X)})} ) = o_P(1).

Above, step (i) follows from Jensen's inequality and step (ii) from part (3) of Lemma D.1. Combining everything, we have that term (IV) is of rate o_P(n^{−1/2}).

Putting Everything Together. So far we have shown that for each fold k ∈ [K], V̂_δ^(k)(π) − V̄_δ^(k)(π) = o_P(n^{−1/2}). Averaging over all K folds, we have

    √n ( V̂_δ(π) − V_δ(π) ) = √n ( (1/n) Σ_{i=1}^n [ 1{Ai=π(Xi)}/π0(Ai|Xi) ( G_π(Xi,Yi) − g_π(Xi) ) + g_π(Xi) ] − V_δ(π) ) + o_P(1).

By the central limit theorem and Slutsky's theorem,
N(0, σ2), σ2 = Var 1{A = π(X)} π0(A | X) Gπ(X, Y ) gπ(X) + gπ(X) . D.3. Proof of Theorem 4.3 By Assumption 3.3, taking π(x) a for any a [M], there exist constants αa, αa, ηa such that 0 < αa α a(x) αa, |ηa(x)| ηa, x X. Letting α = mina [M] αa, α = maxa [M] αa, η = maxa [M] ηa, it follows that 0 < α α a(x) α, |ηa(x)| η, x X, a [M]. (14) For any a [M], if we take π(x) a, then by (1) of Lemma D.1, E θℓ(x, Y (a); θ a(x)) | X = x = 0. By (2) of Lemma D.1, for any a [M], there exists a constant ξa > 0 such that for any θ θ a(x) 2 ξa |ℓ(x, y; θ) ℓ(x, y; θ a(x)) θℓ(x, y; θ a(x)) (θ θ a(x))| ℓa(x, y) θ θ a(x) 2 2, for some function ℓa(x, y) La for some constant La. Similarly, we shall take ξ = mina [M] ξa, ℓ(x, y) = maxa ℓa(x, y), and L = P By (3) of Lemma D.1, for any a [M], there exists a constant ξ1,a > 0 such that for any θ θ a L ξ1,a, ℓ(X, Y (a); θ(X)) ℓ(X, Y (a); θ a(X)) L2(PX,Y (a) | A=a) Cℓ,a θ θ a L2(PX | A=a). Taking ξ1 = mina [M] ξ1,a and Cℓ= P a [M] Cℓ,a, the above inequality holds for any a [M] and any θ θ a L ξ1. Distributionally Robust Policy Learning under Concept Drifts D.3.1. REGRET DECOMPOSITION The regret bound of Algorithm 2 builds on the following regret decomposition: Rδ(bπLN) = Vδ(π ) Vδ(bπLN) = Vδ(π ) b VLN δ (π ) + b VLN δ (π ) b VLN δ (bπLN) + b VLN δ (bπLN) Vδ(bπLN) Vδ(π ) b VLN δ (π ) + b VLN δ (bπLN) Vδ(bπLN) b VLN δ (π) Vδ(π) , (15) where the second-to-last step is by the choice of bπLN. For any π Π and any fold k [K], we define an intermediate quantity V(k) δ := 1 |D(k)| 1{Ai = π(Xi)} π0(Ai | Xi) Gπ(Xi)(Xi, Yi) gπ(Xi)(Xi) + gπ(Xi)(Xi). Letting Vδ = 1 K PK k=1 V(k) δ , we have b VLN δ (π) Vδ(π) = 1 k=1 b VLN,(k) δ (π) Vδ(π) k=1 b VLN,(k) δ (π) Vδ(π) + Vδ(π) Vδ(π) b VLN,(k) δ (π) V(k) δ (π) + sup π Π Vδ(π) Vδ(π) . Taking the supremum over all π Π, we have that b VLN δ (π) Vδ(π) sup π Π Vδ(π) Vδ(π) + sup π Π b VLN,(k) δ (π) V(k) δ (π) . We shall show that the first term above is OP (n 1/2) and the second term is o P (n 1/2). In the following, we refer to the two terms as the effective term and the negligible term, respectively. The following lemma is essential for establishing the uniform convergence results. Lemma D.2. Suppose h is a function of (x, a, y, π(x)). Given a set of data {zi = (xi, ai, yi)}n i=1, suppose that |h(zi, π(xi))| ci(zi). Then the Rademacher complexity i=1 ϵih xi, ai, yi, π(xi) p Pn i=1 ci(zi)2 n 32 + 4κ(Π) , where ϵi i.i.d. Unif{ 1} are i.i.d. Rademacher random variables and Eϵ means the expectation over ϵ. D.3.2. THE EFFECTIVE TERM Denote Zi = (Xi, Ai, Yi) and take h(Zi, π(Xi)) = 1{Ai = π(Xi)} π0(Ai | Xi) Gπ(Xi)(Xi, Yi) gπ(Xi)(Xi) gπ(Xi)(Xi) Vδ(π). Under the unconfoundedness assumption in Assumption 2.1, E[h(Zi, π(Xi))] = 0. By Equation (14), we have |h(Zi, π(Xi))| 6 α 1 + η + αδ =: C0( α, α, η, δ, ε). Distributionally Robust Policy Learning under Concept Drifts Meanwhile, we have write k=1 V(k) δ (π) Vδ(π) = sup π Π i D(k) h Zi; π(Xi) = sup π Π i=1 h Zi; π(Xi) . Next, we define f(z1, . . . , zn; π) = 1 i=1 h(zi, π(xi)). Consider two arbitrary data sets {zi}n i=1 and {z i}n i=1. We can check that for any π Π and any j [n], f(z1, . . . , zj, . . . , zn; π) sup π Π f(z1, . . . , z j, . . . , zn; π ) f(z1, . . . , zj, . . . , zn; π) f(z1, . . . , z j, . . . , zn; π) f(z1, . . . , zj, . . . , zn; π) f(z1, . . . , z j, . . . , zn; π) h(zj; π) h(z j; π) C0( α, α, η, δ, ε)/n. (16) Above, the first inequality is because of the definition of sup and the second is due to the triangle inequality; the last step is due to the boundedness of h. 
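Lemma D.2 controls the conditional Rademacher average of the class {h(·, π(·)) : π ∈ Π} in terms of the entropy integral κ(Π). As a quick illustration of the quantity being bounded (not of the proof itself), the sketch below estimates E_ε sup_π |n^{-1} Σ_i ε_i h(z_i, π(x_i))| by Monte Carlo for a small, hypothetical finite policy class where the supremum can be enumerated; the data, class size, and bounds are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_policies = 200, 8     # hypothetical sample size and finite policy class |Pi|
n_mc = 2000                  # Monte Carlo draws of the Rademacher signs

# h_vals[i, j] plays the role of h(z_i, pi_j(x_i)); bounded synthetic scores here.
h_vals = rng.uniform(-1.0, 1.0, size=(n, num_policies))

def empirical_rademacher(h_vals, n_mc, rng):
    """Monte Carlo estimate of E_eps sup_pi |(1/n) sum_i eps_i h(z_i, pi(x_i))|."""
    n = h_vals.shape[0]
    sups = np.empty(n_mc)
    for b in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)      # i.i.d. Rademacher signs
        sups[b] = np.abs(eps @ h_vals / n).max()   # sup over the finite policy class
    return sups.mean()

rad = empirical_rademacher(h_vals, n_mc, rng)
print(f"empirical Rademacher average ~ {rad:.4f}")  # decays at the n^(-1/2) rate
```

For a bounded score |h| ≤ C0 and a finite class, this quantity is of order C0 √(log |Π|) / √n, which is the behavior Lemma D.2 captures for general Π through κ(Π).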
Taking the supremum over all π Π in (16), we have that f(z1, . . . , zj, . . . , zn; π) sup π Π f(z1, . . . , z j, . . . , zn; π) C0( α, α, η, δ, ε)/n. By the bounded difference inequality (Wainwright, 2019, Corollary 2.21), for any t > 0, nh Zi, π(Xi) E sup π Π nh Zi, π(Xi) t f {Zi}i [n]; π E sup π Π f {Zi}i [n]; π # C0( α,α, η,δ,ε)2 . Take t = C0( α, α, η, δ, ε) q β . Then with probability at least 1 β, nh Zi, π(Xi) < E sup π Π nh Zi, π(Xi) + C0( α, α, η, δ, ε) r 1 It remains to bound the expectation term. Let Z 1, . . . , Z n be an i.i.d. copy of Z1, . . . , Zn, and let ϵi i.i.d. Unif({ 1}). Then i [n] h(Zi, π(Xi)) E h h(Zi, π(Xi)) i i [n] h(Zi, π(Xi)) EZ h 1 i [n] h(Z i, π(X i)) i i [n] h(Zi, π(Xi)) 1 i [n] h(Z i, π(X i)) i [n] ϵi h(Zi, π(Xi)) h(Z i, π(X i)) i [n] ϵih(Zi, π(Xi)) i [n] ϵih(Zi, π(Xi)) Distributionally Robust Policy Learning under Concept Drifts step (i) is by Jensen s inequality and step (ii) is because of the symmetry of (Zi, Z i). Applying Lemma D.2, i [n] ϵih(Zi, π(Xi)) 2C0( α, α, η, δ, ε) n (32 + 4κ(Π)). Combining the above, for any β (0, 1), we have with probability at least 1 β, Vδ(π) Vδ(π) C0( α, α, η, δ, ε) n 64 + 8κ(Π) + p log(1/β) . (18) D.3.3. BOUNDING THE NEGLIGIBLE TERM We now proceed to the negligible term. For any π Π and any k [K], consider the following decomposition: b VLN,(k) δ (π) V(k) δ (π) 1{Ai = π(Xi)} bπ0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) bg(k) π(Xi)(Xi) + bg(k) π(Xi)(Xi) 1{Ai = π(Xi)} π0(Ai | Xi) Gπ(Xi)(Xi, Yi) gπ(Xi)(Xi) gπ(Xi)(Xi) 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) g(k) π(Xi)(Xi) 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) g(k) π(Xi)(Xi) bg(k) π(Xi)(Xi) 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) Gπ(Xi)(Xi, Yi) 1{Ai = π(Xi)} π0(Ai | Xi) bg(k) π(Xi)(Xi) gπ(Xi)(Xi) + 1 |D(k)| bg(k) π(Xi)(Xi) gπ(Xi)(Xi) . For notational simplicity, we denote K1(π) := 1 |D(k)| 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) g(k) π(Xi)(Xi) , K2(π) := 1 |D(k)| 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) g(k) π(Xi)(Xi) bg(k) π(Xi)(Xi) , K3(π) := 1 |D(k)| 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) Gπ(Xi)(Xi, Yi) , K4(π) := 1 |D(k)| 1{Ai = π(Xi)} π0(Ai | Xi) bg(k) π(Xi)(Xi) gπ(Xi)(Xi) + 1 |D(k)| bg(k) π(Xi)(Xi) gπ(Xi)(Xi) . We proceed to bound each term separately. To ease the presentation, we shall write Ek and Pk as the expectation and probability conditioned on D( k), respectively. Bounding K1(π). Here, we take h1(Zi; π(Xi)) := 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) π(Xi)(Xi, Yi) g(k) π(Xi)(Xi) . Distributionally Robust Policy Learning under Concept Drifts Since g(k) a (X) is the conditional expectation of b G(k) a (X, Y (a)), we have Ek h1(Zi, π(Xi)) | Xi = Ek " 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) b G(k) Ai (Xi, Yi) g(k) Ai (Xi) Xi = π0(π(Xi)) bπ0(π(Xi) | Xi) 1 Ek h b G(k) π(Xi)(Xi, Yi) g(k) π(Xi)(Xi) Xi i By Assumption 3.4, there exists N1 N+, such that when n N1, w. p. at least 1 β, max a [M] bθ(k) a θ a L max( α, α, η)/2. On the event {maxa [M] bθ(k) a θ a L max( α, α, η)/2}, we have for any a [M] |ℓ(x, y, ; bθa(x))| 2 α exp 2 y + 4 η α 1 + 2 η + 2 αδ. Letting C1( α, α, η, δ, ε) = 4 α exp 2 y+4 η α 1 + 4 η + 4 αδ)/ε2, We can then check that |h1(Zi; π(Xi))| 2C1( α, α, η, δ, ε) bπ0(π(Xi) | Xi) π0(π(Xi) | Xi) 2C1( α, α, η, δ, ε) max a [M] bπ0(a | Xi) π0(a | Xi) =: c1(Xi). The upper bound is a constant conditional on Xi s and D( k). 
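The terms K1(π)–K4(π) above are exactly the errors incurred by plugging the estimated nuisances (π̂0, Ĝ, ĝ) into the fold-wise doubly-robust value estimator whose oracle counterpart defines the effective term. For orientation, here is a minimal sketch of one fold's computation; the callables stand in for the nuisance estimates fit on D^(−k) via cross-fitting, and the function name and interface are illustrative rather than the paper's implementation.

```python
import numpy as np

def dr_value_fold(X, A, Y, pi, pi0_hat, G_hat, g_hat):
    """Doubly-robust robust-value estimate on one held-out fold D^(k).

    X, A, Y  : covariates, actions, rewards in the held-out fold
    pi       : policy, maps a covariate row to an action
    pi0_hat  : pi0_hat(a, x) ~ P(A = a | X = x), fit on the other folds
    G_hat    : plug-in worst-case score G_hat(x, y) for the action pi(x)
    g_hat    : estimate of E[G_hat(x, Y(pi(x))) | X = x], fit on the other folds
    """
    vals = []
    for x, a, y in zip(X, A, Y):
        ax = pi(x)
        ipw = float(a == ax) / pi0_hat(ax, x)              # inverse propensity weight
        vals.append(ipw * (G_hat(x, y) - g_hat(x)) + g_hat(x))
    return float(np.mean(vals))
```

Averaging this quantity over the K folds gives the estimator analyzed in the theorem; the bounds on K1–K4 show that replacing the true nuisances by their estimates changes the value only by o_P(n^{-1/2}) uniformly over Π.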
We now apply the bounded difference inequality conditional on X = {Xi}i [n]: i D(k) h1(Zi, π(Xi)) Ek i D(k) h1(Zi, π(Xi)) X t X 2|D(k)|2t2 P i D(k) c1(Xi)2 Taking t = p P i D(k) c1(Xi)2 log(1/β)/|D(k)|, we have with probability at least 1 β, i D(k) h1(Zi, π(Xi)) Ek i D(k) h1(Zi, π(Xi)) X i D(k) c1(Xi)2 For each i D(k), we take A i and Y i as i.i.d. copies of Ai and Yi conditional on Xi, respectively. By a similar symmetrization argument as in the proof for the effective term, we have i D(k) h1 Zi, π(Xi) i D(k) h1 Xi, Ai, Yi, π(Xi) EA ,Z h 1 |D(k)| i D(k) h Xi, A i, Y i , π(Xi) i i D(k) h1(Xi, Ai, Yi, π(Xi)) 1 |D( k)| i D(k) h(Xi, A i, Y i , π(Xi)) i D(k) ϵi h1(Xi, Ai, Yi, π(Xi)) h(Xi, A i, Y i , π(Xi)) 2Ek h sup π Π i D(k) ϵih1(Xi, Ai, Yi, π(Xi)) X i . Distributionally Robust Policy Learning under Concept Drifts Applying Lemma D.2 with ci = c1(Xi), we have that Eϵ h sup π Π i D(k) ϵih1(Xi, Ai, Yi, π(Xi)) X i 2 p P i D(k) c1(Xi)2 |D(k)| (32 + 4κ(Π)). Combining the above, on the event {maxa [M] bθ(k) a θ a L max( α, α, η)/2}, i D(k) c1(Xi)2 64 + 8κ(Π) + p log(1/β) X β. Since bπ0(a | X) π0(a | X) 2 1, i D(k) max a [M] bπ0(a | X) π0(a | X) 2 X a [M] E (bπ0(a | X) π0(a | X))2 t bπ0(a | X) π0(a | X) 2 X a [M] E (bπ0(a | X) π0(a | X))2 t bπ0(a | X) π0(a | X) 2 E (bπ0(a | X) π0(a | X))2 t M exp 2|D(k)|t2 . Taking a union bound, with probability at least 1 3β, we have that K1(π) 2C1( α, α, η, δ, ε) p 20 + 4κ(Π) + p a [M] bπ0 π0 L2(PX | A=a) + 1 2n log(M/β) 1/4 . a [M] bπ0 π0 L2(PX | A=a) = o P (1), there exists N 1 N1 such that when n N 1, with probability at least 1 β/(4K), K1(π) C0( α, α, η, δ, ε) Bounding K2(π). We first note that by Cauchy-Schwarz inequality, 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) g(k) Ai (Xi) bg(k) Ai (Xi) bπ(k) 0 (π(Xi) | Xi) π0(π(Xi) | Xi) 2s X g(k) π(Xi)(Xi) bg(k) π(Xi)(Xi) 2 bπ(k) 0 (a | Xi) π0(a | Xi) 2 v u u t X g(k) a (Xi) bg(k) a (Xi) 2. Then for any t > 0, let tε2 max a [M] n bπ(k) a π(k) 0,a L2(PX) o max a [M] n g(k) a bg(k) a L2(PX) o . Distributionally Robust Policy Learning under Concept Drifts 1{Ai = π(Xi)} bπ0(Ai | Xi) 1{Ai = π(Xi)} π0(Ai | Xi) bg(k) Ai (Xi) g(k) Ai (Xi) s bπ(k) 0 (a | Xi) π0(a | Xi) 2 v u u t X bg(k) a (Xi) g(k) a (Xi) 2 s v u u t 1 |D(k)| bπ(k) 0 (a | Xi) π0(a | Xi) 2 tε max a [M] n bπ(k) a π(k) 0,a L2(PX) o! v u u t 1 |D(k)| bg(k) a (Xi) g(k) a (Xi) 2 tε max a [M] n bg(k) a bg(k) a L2(PX) o! where the last inequality is due to Chebyshev s inequality. Marginalizing over the randomness of D( k), for any β (0, 1), we have with probability at least 1 β that max π Π |K2(π)| < 2M βε2 max a [M] n bπ(k) a π(k) 0,a L2(PX) o max a [M] n bg(k) a g(k) a L2(PX) o . By Assumption 3.4, there exists N 2 N+ such that when n N 2, with probability at least 1 β/(4K), sup π Π |K2(π)| C0( α, α, η, δ, ε) Bounding K3(π). We start by taking h3(Zi, π(Xi)) = 1{Ai = π(Xi)} π0(Ai | Xi) h b G(k) π(Xi) Xi, Yi(π(Xi)) Gπ(Xi) Xi, Yi(π(Xi)) i . For any π Π, Ek h3(Zi, π(Xi)) | Xi = Ek h1{Ai = π(Xi)} π0(Ai | Xi) b G(k) Ai (Xi, Yi(π(Xi))) GAi(Xi, Yi(π(Xi))) Xi i = Ek h b G(k) π(Xi)(Xi, Yi(π(Xi))) Gπ(Xi)(Xi, Yi(π(Xi))) | Xi i = Ek h ℓ Xi, Yi(π(Xi)); θ(k) π(Xi)(Xi) ℓ(Xi, Yi; θ π(Xi)(Xi)) ℓ(Xi, Yi(π(Xi)); θ π(Xi)(Xi)) bθ(k) π(Xi)(Xi) θ π(Xi)(Xi) Xi i , where the last step follows from part (1) of Lemma D.1. By Assumption 3.4, for any β (0, 1), there exists N3 N+ such that when n N3, P max a [M] bθ(k) a θ a L > min ξ, α, α, η /2 β. 
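The terms K3(π) and K4(π) are driven by the error of the empirical risk minimizer θ̂^(k)_a = (α̂_a, η̂_a) for the dual parameters, whose rate is only posited in Assumption 3.4. For concreteness, the sketch below shows one such estimator with constant-in-x parameters, using the two-parameter dual loss from Lemma D.1 written in the standard KL-dual form; the constant-parameter simplification, the function names, and the optimizer choice are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize

def dual_loss(theta, y, delta):
    """Empirical dual loss (1/n) sum_i [alpha*exp(-(y_i+eta)/alpha - 1) + eta + alpha*delta].

    Minimizing over (alpha, eta) gives, up to sign, the KL worst-case mean of y
    at radius delta (the loss form underlying Lemma D.1, up to sign conventions)."""
    alpha, eta = theta
    return np.mean(alpha * np.exp(-(y + eta) / alpha - 1.0) + eta + alpha * delta)

def fit_dual_params(y, delta, alpha0=1.0, eta0=0.0):
    """Constant-in-x ERM (alpha_hat, eta_hat); a toy stand-in for theta_hat^(k)_a."""
    res = minimize(dual_loss, x0=np.array([alpha0, eta0]), args=(y, delta),
                   method="L-BFGS-B", bounds=[(1e-3, None), (None, None)])
    return res.x

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.6, size=5000).astype(float)  # hypothetical rewards for one action
alpha_hat, eta_hat = fit_dual_params(y, delta=0.1)
worst_case_value = -dual_loss((alpha_hat, eta_hat), y, 0.1)
print(alpha_hat, eta_hat, worst_case_value)        # worst-case mean falls below 0.6
```

In the paper θ*_a(x) varies with x, so in practice the minimization would be carried out with a flexible model of (α_a(x), η_a(x)); the proof only requires the L2 and sup-norm rates in Assumption 3.4.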
On the event maxa [M] bθ(k) a θ a L min(ξ, α, α, η)/2 , we have ℓ Xi, Yi; θ(k) π(Xi)(Xi) ℓ(Xi, Yi; θ π(Xi)(Xi)) ℓ(Xi, Yi; θ π(Xi)(Xi)) bθ(k) π(Xi) θ π(Xi) ℓ(Xi, Yi) X bθa(Xi) θ a(Xi) 2 2, Distributionally Robust Policy Learning under Concept Drifts As a result, Ek[K3(π) | X] sup π Π i D(k) Ek h3(Zi, π(Xi)) | Xi bθa(Xi) θ a(Xi) 2 2. On the same event, h3(Zi, π(Xi)) = 1{Ai = π(Xi)} π0(Ai | Xi) n ℓ Xi, Yi(π(Xi)); θ(k) π(Xi)(Xi) ℓ(Xi, Yi; θ π(Xi)(Xi)) o ℓ Xi, Yi(π(Xi)); θπ(Xi)(Xi) (θ(k) π(Xi)(Xi) θ π(Xi)(Xi)) ℓ Xi, Yi(π(Xi)); θπ(Xi)(Xi) 2 bθ(k) π(Xi)(Xi) θ π(Xi)(Xi) 2 C2( α, α, η, δ, ε) max a [M] bθ(k) a (Xi) θ a(Xi) 2, where C2( α, α, η, δ, ε) = (1 + ( y + η)/α)e( y+ η)/α 1 + δ + 1 is a constant. Let h3(Zi, π(Xi)) = h3(Zi, π(Xi)) Ek[h3(Zi, π(Xi)) | Xi], and we have that | h3(Zi, π(Xi))| 2C2( α, α, η, δ, ε) max a [M] bθ(k) a (Xi) θ a(Xi) 2. Next, we apply the bounded difference theorem conditional on Xi s: i D(k) h3(Zi, π(Xi)) Ek i D(k) h3(Zi, π(Xi)) 2C2( α, α, η, δ, ε)2 P i D(k) maxa [M] bθ(k) a (Xi) θ a(Xi) 2 2 for any t > 0. Taking t = C2( α, α, η, δ, ε) q i D(k) maxa [M] bθ(k) a (Xi) θ a(Xi) 2 2/|D(k)|, we have with probability at least 1 β that i D(k) h3(Zi, π(Xi)) Ek i D(k) h3(Zi, π(Xi)) +C2( α, α, η, δ, ε) bθ(k) a (Xi) θ a(Xi) 2 2. For the expectation term,the same symmetrization argument as in the proof for K1(π) leads to i D(k) h3(Zi, π(Xi)) X 2E sup π Π | 1 |D(k)| ϵi h3(Zi, π(Xi)) X . Then by Lemma D.2, we have ϵi h3(Zi, π(Xi)) 2C2( α, α, η, δ, ε) |D(k)| (32 + κ(Π)) s X a [M] bθ(k) a (Xi) θ a(Xi) 2 2. Distributionally Robust Policy Learning under Concept Drifts By Hoeffding s inequality, we have that i D(k) bθ(k) a (Xi) θ a(Xi) 2 2 θ(k) a θa 2 L2(PX) θ(k) a θa 2 L Taking a union bound, with probability at least 1 3β, we have that C2( α, α, η, δ, ε)(130 + 4κ(Π)) p a [M] bθ(k) a θa L2(PX) + p M( α + η) 1 |D(k)| log M a [M] bθ(k) a θa 2 L2(PX) + M bθa θa L p By Assumption 3.4, bθ(k) a θa L2(PX) = o P (n 1/4) and bθ(k) a θa L = o P (1), so there exists N 3 N3 such that when n N 3, with probability at least 1 β/(4K), K3(π) C0( α, α, η, δ, ε) Bounding K4(π). For K4(π), we take h4(Zi, π(Xi)) = 1{Ai = π(Xi)} π0(Ai | Xi) bg(k) π(Xi)(Xi) gπ(Xi)(Xi) + bg(k) π(Xi)(Xi) gπ(Xi)(Xi) . and therefore K4(π) = 1 |D| P i D(k) h4(Zi, π(Xi)). Again by the unconfoundedness assumption, Ek h4(Zi, π(Xi)) = 0. Due to the overlap condition, we further have that |h4(Zi, π(Xi))| 2 bg(k) π(Xi)(Xi) gπ(Xi)(Xi) 2 ε max a [M] bg(k) a (Xi) ga(Xi) . As before, we apply the bounded difference theorem conditional on Xi s and the symmetrization argument to obtain K4(π) 2Ek h sup π Π i D(k) ϵih4(Zi, π(Xi)) X i t X K4(π) Ek h sup π Π K4(π) X i t | X) ε2|D(k)|2t2 i D(k) maxa [M](bg(k) a (Xi) ga(Xi))2 We now apply Lemma D.2: i D(k) ϵih4(Zi, π(Xi)) i D(k) maxa [M] bga(Xi) ga(Xi) 2 |D(k)|ε (64 + 8κ(Π)). By Assumption 3.4, there exists N4 N+, such that when n N4, with probability at least 1 β, max a [M] bθ(k) a θ a L max(ξ, α, α, η)/2. Distributionally Robust Policy Learning under Concept Drifts On the event {maxa [M] bθ(k) a θ a L max(ξ, α, α, η)/2}, we have for any a [M] that ℓ(x, y; bθa(x)) 2C1( α, α, η, δ, ε). On the same event, by Hoeffding s inequality, we have that i D(k) (bga(Xi) ga(Xi))2 bg(k) a ga 2 L2(PX) t exp t2|D(k)| 8C1( α, α, η, δ, ε)2 Taking a union bound, we have with probability at least 1 2β that max π Π K4(π) 1 128 + 16κ(Π) + p a [M] bg(k) a g a L2(PX) + 2M p C1( α, α, η, δ, ε)(log(M/β)/n)1/4 . 
By Assumption 3.4, P a [M] bg(k) a g a L2(PX) = o P (1), so there exists N 4 N4 such that when n N 4, with probability at least 1 β/(4K), K4(π) C0( α, α, η, δ, ε) Combining (18)-(22) and taking a union bound over k [K], when n max(N1, N2, N3, N4) we have that with probability at least 1 β, b VLN δ (π) Vδ(π) C0( α, α, η, δ, ε) n . We have thus completed the proof of Theorem 4.3. D.4. Proof of Theorem 4.6 We first state somes results from (Si et al., 2023) that will be used in the proof. For any p, q [0, 1], define D(p q) = p log p + (1 p) log 1 p , and gδ(q) = inf p:DKL(p q) δ p, Lemma D.3 (Adapted from Lemma A17 of Si et al. (2023)). For δ 0.2, gδ(q) is differentiable and g δ(q) 1/2 for q [0.4, 0.6]. Note that our definition of gδ(q) is slightly different from that in (Si et al., 2023), so we include the proof of Lemma D.3 in Appendix F.4 for completeness. For notational simplicity, we use d to denote the Natarajan dimension of the policy class Π. By the definition of Natarajan dimension, there exists a set of d data points {x1, . . . , xd} X shattered by Π: there exist two functions f 1, f1 : {x1, . . . , xd} 7 [M] such that f 1(xj) = f1(xj) for any j [d] and for any σ { 1.1}d, there exists π Π, such that π(xj) = fσj(xj) for all j [d]. Next, we construct a class of distributions indexed by σ { 1}d that are hard instances for the learning problem. Fix any σ { 1}d, we construct distribution Pσ as follows. First, the covariate are drawn uniformly from {x1, . . . , xd}, i.e., Xi i.i.d. Unif {x1, . . . , xd} . Given Xi, the action Ai is chosen according to the behavior policy π0, where for any j [d], π0(f1(xj) | xj) = π0(f 1(xj) | xj) = ε 2, and π0(a | xj) = 1 ε K 2 for all a = f1(xj), f 1(xj). Distributionally Robust Policy Learning under Concept Drifts The potential outcomes are generated as follows: Yi(f1(xj)) | Xi = xj y Bern 1 + σj , Yi(f 1(xj)) | Xi = xj y Bern 1 σj and Yi(a) = y Bern(1/4) for all a = f1(xj), f 1(xj), where (0, 0.1) is some constant to be determined later. Note that the distribution of (Xi, Ai) does not depend on σ. By construction, it is clear that the data-generating process satisfies Assumption 2.1. For any p {(1 + )/2, (1 )/2, 1/4}, log(1/(1 p)) > δ. Therefore, the data-generating process also satisfies Assumption 2.4. As for Assumption 3.3, it suffices to check the Bernoulli distributions with parameters (1 + )/2, (1 )/2, 1/4, and can be verified. Since n d2, we can obtain bθ, bg, and bπ0 that converges at rate OP (n 1/4) (by stratifying on X), thereby satisfying Assumption 3.4. We now proceed to establish the lower bound. For any policy learning algorithm that returns bπ, the worst-case regret is lower bounded by the average regret over the class of hard instances we have constructed above: sup P P EP n[R(bπ)] 1 σ { 1}d EP n σ [Rδ(bπ)]. We now focus on the right-hand side above. Fix σ { 1}d. Recall that Rδ(bπ) = Vδ(π ) Vδ(bπ). For the optimal policy value, there is Vδ(π ) = max π Π EPσ,X inf QY | X P(Pσ,Y | X,δ) EQY | X Y (π(X)) | X (i)= max π Π max α,η EPσ α(X) exp Y (π(X)) + η(X) α(X) 1 η(X) α(X)δ = max α,η max π Π EPσ α(X) exp Y (π(X)) + η(X) α(X) 1 η(X) α(X)δ , (23) where the step (i) follows from the duality result in Proposition 2.3. We now take a closer look at the expectation above: by the construction of Pσ, α(X) exp Y (π(X)) + η(X) α(X) 1 η(X) α(X)δ α(xj) exp Y (π(xj)) + η(xj) α(xj) 1 η(xj) α(xj)δ X = xj j=1 α(xj) exp η(xj) exp Y (π(xj)) η(xj) α(xj)δ. 
Letting pj = P(Y (π(xj)) = 1 | X = xj), we have exp Y (π(xj)) = pj exp( 1/α(xj)) + 1 pj, which is decreasing in pj and is minimized when π(xj) = fσj(xj). By construction, such a policy π is in Π. As a result, (23) = max α,η 1 d α(xj) exp Y (fσj(xj)) + η(xj) α(xj) 1 η(xj) α(xj)δ X = xj j=1 max α,η EPσ α exp Y (fσj(xj)) + η α 1 η αδ X = xj j=1 inf QY | X P(Pσ,Y | X=xj ,δ) EQY | X Y (fσj(xj)) | X = xj = g 1 + Distributionally Robust Policy Learning under Concept Drifts The last step is because Y (fσj(xj)) | X = xj Bern((1 + )/2). Similarly, for V(bπ), we have Vδ(bπ) = EPσ,X inf QY | X P(Pσ,Y | X,δ) EQY | X Y bπ(X) | X j=1 inf QY | X P(Pσ,Y | X=xj ,δ) EQY | X Y (bπ(xj)) | X = xj j=1 1 bπ(xj) = fσj(xj) g 1 + + 1 bπ(xj) = f σj(xj) g 1 + 1 bπ(xj) = fσj(xj), f σj(xj) g(1/4). Combining the calculation above, we have j=1 1{bπ(xj) = f σj(xj)} g 1 + + 1{bπ(xj) = fσj(xj), f σj(xj)} g 1 + j=1 +1{bπ(xj) = fσj(xj)} g 1 + j=1 1{bπ(xj) = fσj(xj)} g (ξ) (iii) j=1 1{bπ(xj) = fσj(xj)}, where step (i) uses that g is non-decreasing (c.f. Cauchois et al. (2024, Proposition 1)); in step (ii), ξ ((1 )/2, (1+ )/2), and step (iii) follows from Lemma D.3. Next, we denote σ[j] to be the vector σ with the j-th element flipped. Then, we have σ { 1}d EP n σ [R(bπ)] 1 σ { 1}d EP n σ j=1 1{bπ(xj) = fσj(xj)} PP n σ bπ(xj) = f1(xj) + PP n σ[j] bπ(xj) = f 1(xj) σ:σj=1 PP n σ bπ(xj) = f1(xj) + PP n σ[j] bπ(xj) = f1(xj) 1 DTV(P n σ , P n σ[j]) , (24) where the last step follows from the definition of the TV distance. By Pinsker s inequality, there is D2 TV P n σ , P n σ[j] 1 2DKL P n σ P n σ[j] d Pσ[j] (Xi, Ai, Yi) 1 Xi = xj, Ai = f 1(xj) log 1 + Distributionally Robust Policy Learning under Concept Drifts where the last step follows from x log( 1+x 1 x) 3x2, for x (0, 1/3). Take = 1 15 d nε this is possible since n d2 and d 4/(9ε) and then 15d2d+2 = 1 120 E. Generalization to Identifiable Covariate Distribution Shift We now extend our methodology to handle situations where shift in X and Y | X distributions are both present. We note that, in most practical cases, the decision maker has access to the covariates in the target environment, making the shift covariate distribution identifiable and estimable it is therefore unnecessary to guard against the worst-case shift. Method. Formally, suppose we have access to a training dataset D collected in an environment P, and aims at learning a policy that behaves well in the environment Q. Here, we assume that QY | X P(PY | X, δ) for some given radius δ > 0 but do not impose any constraints on X distribution shift except that QX is absolutely continuous with respect to PX and that the likelihood ratio is bounded. Letting r(x) = d QX d PX (x), we can estimate r with X from P and Q using standard tools (by directly estimating the density ratio or by means of classification algorithms). For any policy π, the distributional robust policy value can be written as Vδ(π) := EQX h inf QY | X P(PY | X,δ) EQY | X Y (π(X)) | X i . (25) Note that the inner expectation of (25) is the same as in (1). By Lemma 2.3, there is (25) = EQX PY | X α π(X) exp Y (π(X)) + η π(X) α π(X) + η π(X) + α π(X)δ = EQX PY | X " 1{A = π(X)} α π(X) exp Y (A) + η π(X) α π(X) + η π(X) + α π(X)δ # r(X)1{A = π(X)} α π(X) exp Y (A) + η π(X) α π(X) + η π(X) + α π(X)δ # The above expression ensures that the robust policy value is identifiable with the data accessible to the decision maker. 
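Because the covariate shift is identifiable, the likelihood ratio r(x) = dQ_X/dP_X(x) can be estimated from the two covariate samples via a probabilistic classifier, as mentioned above. The following is a minimal sketch of this standard construction, assuming access to covariate arrays from both environments; the function name and the logistic-regression choice are illustrative, not prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(X_source, X_target):
    """Estimate r(x) = dQ_X/dP_X by classifying source (label 0) vs. target (label 1).

    By Bayes' rule on the pooled sample, r(x) = [p(1|x)/p(0|x)] * [n_source/n_target]."""
    X = np.vstack([X_source, X_target])
    z = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def r_hat(x):
        p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        p = np.clip(p, 1e-3, 1 - 1e-3)        # keep the estimated ratio bounded
        return (p / (1 - p)) * (len(X_source) / len(X_target))

    return r_hat
```

The estimated ratio r̂^(k) then multiplies the inverse-propensity term in the fold-wise estimator defined next, while the covariate average of ĝ_π is taken over the target-environment sample.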
With this representation, subsequent policy evaluation and learning are similar to the pure concept drift case, and we provide the adaptation below. In addition to D, we let D = {Xi}m i=1 denote the covariates from environment Q, i.e., Xi i.i.d. QX. Assume that limm,n m/n = γ. As before, we adopt a K-fold cross-fitting scheme, where we split both D and D into K nonoverlapping equally-sized folds. For k [K], we use D(k+1) and D(k+1) to obtain bπ(k) 0 , br(k), and (bα(k) π , bη(k) π ) as estimates of π0, r, and (α π, η π), respectively; we then use D(k+2) and D(k+2) to obtain bg(k) π as an estimate for g(k) π . The estimator of the k-th fold is b V(k) δ (π) = 1 D(k) X br(k)(Xi)1{Ai = π(Xi)} bπ(k) 0 (Ai | Xi) b G(k) π (Xi, Yi) bg(k) π (Xi) + 1 | D(k)| i D(k) bg(k) π (Xi), and the final robust policy value estimator is b Vδ(π) = 1 K PK k=1 b V(k) δ (π). We then obtain the learned policy via bπLN = argmax π Π b V(π), where we apply the same computational trick as in the pure concept shift case. Theoretical Guarantees. We now extend the theoretical guarantees to the general case. Since the proof is similar to the pure concept drift case, the proof sketch is provided. As a prerequisite, we modify Assumption 3.4 to be: Distributionally Robust Policy Learning under Concept Drifts Assumption E.1. For any policy π, assume that for each k [K], the estimators bπ(k) 0 , br(k), bg(k) π , and the empirical risk optimizer bθ(k) π satisfy br(k)/bπ(k) 0 r/π0 L2(PX | A=π(X)) = o P (n γ1), bg(k) π g(k) π L2(PX | A=π(X)) = o P (n γ2), bθ(k) π θ π L2(PX | A=π(X)) = o P (n 1 4 ), bθ(k) π θ π L = o P (1), for some γ1, γ2 0 and γ1 + γ2 1 Theorems E.2 and E.3 establish the asymptotic normality of the policy value estimator and the regret upper bound. Theorem E.2. Suppose Assumptions 2.1, 2.4, 3.3, and E.1 hold. Additionally assume that d QX/d PX C, a.s., for some constants C > 0, and that limm,n m/n = γ. For any policy π : X 7 [M], we have n b Vδ(π) Vδ(π) d N(0, σ2 π), where σ2 π = Var r(X)1{A = π(X)} π0(A | X) Gπ(X, Y ) gπ(X) + γ Var(gπ(X)). Theorem E.3. Suppose Assumptions 2.1, 2.4, 3.3, E.1 hold. Additionally assume that d QX/d PX C, a.s., for some constants C > 0, and that limm,n m/n = γ. For any β (0, 1), there exists N N+ such that when n N, we have with probability at least 1 β that Rδ(bπLN) C(κ(Π) + p log(1/β)), n where C > 0 is a constant independent of n and Π. Proof Sketch. Recall that Gπ(x, y) = α π(x) exp y + η π(x) α π(x) + η π(x) + α π(x)δ and gπ(x) = E Gπ(X, Y (π(X))) | X = x It can be checked that (25) = EP hr(X)1{A = π(X)} π0(A | X) Gπ(X, Y ) i = EP hr(X)1{A = π(X)} π0(A | X) Gπ(X, Y ) gπ(X) i + EQX gπ(X) . So if we define Vδ(π) = 1 K PK k=1 V(k) δ (π), with V(k) δ (π) = 1 |D(k)| r(Xi)1{Ai = π(Xi)} π0(Ai | Xi) Gπ(Xi, Yi) gπ(Xi) + 1 | D(k)| i D(k) gπ(Xi), we have by the central limit theorem that n Vδ(π) Vδ(π) d N(0, σ2 π). As in the proof of Theorem 3.5, we can decompose the difference between V(k) δ (π) and b V(k) δ (π) as follows: b V(k) δ (π) V(k) δ (π) = 1 |D(k)| r(Xi)1{Ai = π(Xi)} π0(Xi) b G(k) π (Xi, Yi) Gπ(Xi, Yi) i D(k) 1{Ai = π(Xi)} br(Xi) bπ0(Xi) r(Xi) b G(k) π (Xi, Yi) g(k) π (Xi) i D(k) 1{Ai = π(Xi)} br(Xi) bπ0(Xi) r(Xi) g(k) π (Xi) bgπ(Xi) r(Xi)1{Ai = π(Xi)} π0(Xi) (bg(k) π (Xi) gπ(Xi)) + 1 | D(k)| i D(k) (bg(k) π (Xi) gπ(Xi)). 
Distributionally Robust Policy Learning under Concept Drifts Following almost the same steps in the proof of Theorem 3.5, we can show that (26) =OP bθ(k) π θ π 2 L2(PX | A=π(X)) + OP L2(PX | A=π(X)) n 1/2 L2(PX | A=π(X)) bg(k) π gπ L2(PX | A=π(X)) + OP bg(k) π gπ L2(PX | A=π(X)) n 1/2 =o P (n 1/2). Taking a union bound over k [K], we can conclude b Vδ(π) = Vδ(π) + o P (n 1/2), and therefore b Vδ(π) is asymptotically normal. The proof of Theorem E.3 follows similarly. F. Proof of technical lemmas F.1. Proof of Lemma D.1 Proof of (1). Given θ, recall that our loss function is ℓ(x, y; θ) = α exp y + η α 1 + η + αδ. By the strong duality, E[ℓ(X, Y (π(X)); θ) | X] is convex in θ; by Proposition 2.5, the first-order condition of convex optimization problem implies θE h ℓ x, Y (π(x)); θ π(x) X = x i = 0. Meanwhile, we can compute the gradient of ℓ(x, y; θ) as αℓ(x, y; θ) = 1 + y + η η ℓ(x, y; θ) = 1 exp y + η For any a such that |a α π(x)| α π(x), we have αℓ(x, y; (a, η π(x))) 1 + 2( y + η) exp 2( y + η) α 1 + δ < . By the mean value theorem and the dominated convergence theorem, we can change the order of expectation and taking limits and therefore αℓ x, Y (π(x)); θ π(x) X = x = αE h ℓ x, Y (π(x)); θ π(x) X = x i = 0. Similarly, since η ℓ(x, y; (α π(x), η)) is non-decreasing in η, for |η η π(x)| 1, η ℓ(x, y; (α π(x), η)) max ( η ℓ(x, y; (α π(x), η π(x) + 1)) , η ℓ(x, y; (α π(x), η π(x) 1)) with the right-hand side being integrable under PY | X. Agian by the mean-value theorem and the dominated convergence theorem, η ℓ(x, Y (π(x)); θ π(x)) X = x = η E ℓ(x, Y (π(x)); θ π(x)) X = x = 0. We have thus completed the proof part (1) of Lemma D.1. Distributionally Robust Policy Learning under Concept Drifts Proof of (2). We now compute the Hessian of ℓ(x, y; θ): α2 ℓ(x, y; θ) = (y + η)2 α3 exp y + η α η ℓ(x, y; θ) = y + η α2 exp y + η η2 ℓ(x, y; θ) = 1 α exp y + η By the Taylor expansion, ℓ(x, y; θ) ℓ(x, y; θ π(x)) = ℓ(x, y; θ π(x)) (θ θ π(x)) + 1 2(θ θ π(x)) 2ℓ(x, y; θ)(θ θ π(x)), ℓ(x, y; θ) ℓ(x, y; θ π(x)) ℓ(x, y; θ π(x)) (θ θ π(x)) α 1 θ θ π(x) 2 2, where θ = tθ + (1 t)θ π(x) for some t [0, 1] and the last step is because 2ℓ(x, y; θ) op (y + η)2 Let ξ = min(α, η)/2. For any θ such that θ θ π(x) 2 ξ, we also have | α α π(x)| ξ and | η η π(x)| ξ. Then α 1 8 y2 + 8 η2 exp 2 y + 4 η Letting the right-hand side be ℓ(x, y), we have thus completed the proof of (2). Proof of (3). By the Taylor expansion, ℓ(x, y; θ) ℓ(x, y; θ π(x)) = ℓ(x, y; θ) θ(x) θ π(x) , where θ = tθ(x)+(1 t)θ π for some t [0, 1]. Let ξ1 = min(α, η)/2. When θ θ π L ξ1, we have | α α π(x)| ξ1 and | η η π(x)| ξ1. Plugging the expressions of the gradient in Equation (27), we have ℓ(x, y; θ(x)) ℓ(x, y; θ π(x)) 2 = ℓ(x, y; θ(x)) (θ(x) θ π(x)) 2 ( 1 + y + η(x) exp y + η(x) α(x) 1 + δ 2 + 1 exp y + η(x) θ(x) θ π(x) 2 2 C( y, α, α, η, δ) θ(x) θ π(x) 2 2, where C( y, α, α, η, δ) is a function of ( y, α, α, η, δ). Taking the expectation over PX,Y | A=π(X), we have ℓ(X, Y ; θ(X)) ℓ(X, Y ; θ π(X)) L2(PX,Y | A=π(X)) C( y, α, α, η, δ) θ θ L2(PX | A=π(X)), completing the proof of (3). F.2. Proof of Lemma D.2 We first introduce the ℓ2 distance on the policy space Π, as well as the corresponding covering number. Definition F.1. Given a function h and a set of realized data z1, . . . , zn, (1) the ℓ2 distance between two policies π1, π2 Π with respect to {z1, . . . , zn} is defined as ℓ2 π1, π2; {z1, . . . , zn} = s Pn i=1 h(zi, π1(xi) h(zi; π2(xi)) 2 4 Pn i=1 ci(zi)2 . Distributionally Robust Policy Learning under Concept Drifts (2) N2(γ, Π; {z1, . . . 
, zn}) is the minimum number of policies needed to γ-cover Π under ℓ2 with respect {z1, . . . , zn}. Under the ℓ2 distance, we define a sequence of approximation operators Aj : Π 7 Π for j [J], where J = log2 n . Specifically, for any j = 0, 1, . . . , J, let Sj be the set of policies that 2 j-covers Π and satisfies |Sj| = N2(2 j, Π; {Z1, . . . , Zn}). Specially, S0 = { π}, with π is an arbitrary policy in Π this is a valid choice since for any π Π, ℓ2(π, π; {z1, . . . , zn}) = s Pn i=1 h(zi, π(xi)) h(zi, π(xi)) 2 4 Pn i=1 ci(zi)2 1. We shall let Λ = 2 p Pn i=1 ci(zi)2 to denote the normalization factor. The approximation operators are defined in a backward manner: for any π Π, (1) define AJ[π] = argmin π SJ ℓ2 π, π ; {z1, . . . , zn} ; (2) for j = J 1, . . . , 0, define Aj[π] = argmin π Sj ℓ2 Aj+1[π], π ; {z1, . . . , zn} . Using the sequential approximation operators, we decompose the inner expectation term in (17) (Rademacher complexity) as i [n] ϵih(Zi, π(Xi)) i [n] ϵi h(Zi, π(Xi)) h(Zi, AJ[π](Xi)) i [n] ϵi h(Zi, Aj[π](Xi)) h(Zi, Aj 1[π](Xi)) i [n] ϵih(Zi, A0[π](Xi)) =: Ξ1 + Ξ2 + Ξ3. For any π Π, by the Cauchy-Schwarz inequality, i [n] ϵi h(zi, π(xi)) h(zi, AJ[π](xi)) h zi, π(xi)) h(zi, AJ[π](xi)) 2 = Λ n ℓ2(π, AJ(π); {z1, . . . , zn}) Λ n2 J Λ n3/2 , where the second-to-last step is because AJ(π) is 2 J-close to π and the last step is by the choice of J. As a result the above derivation, Ξ1 Λ/n3/2. Distributionally Robust Policy Learning under Concept Drifts Next, for any j = 1, . . . , J we use Pj to denote the projection of projecting a policy to Sj, i.e., Aj 1[π] = Pj 1[Aj[π]]. Once Aj(π) is determined, Aj 1(π) is also determined. For any s > 0, i [n] ϵi h(zi, Aj[π](xi)) h(zi; Aj 1[π](xi)) s i [n] ϵi h h(zi, π (xi)) h(zi, Pj 1[π ](xi)) i s 2n2s2 Pn i=1 h(zi, π (xi)) h(zi, Pj 1[π ](xi)) 2 Λ2ℓ2(π , Pj 1(π ); z)2 2N2(2 j, Π; Z) exp n2s2 we z is a shorthand for {z1, . . . , zn}. For any j = 1, . . . , J and m N, take sj,m = Λ n2j 1/2 log N2(2 j, Π; Z) 2m+1j2). For a fixed m, with a union bound over j = 1, . . . , J we have that i [n] ϵi h(zi, Aj[π](xi)) h(zi, Aj 1[π](xi)) i [n] ϵi h(zi, Aj[π](xi)) h(zi, Aj 1[π](xi)) sj,m 1 j22m 1 2m 1 . To proceed, we shall use the following lemma, whose proof is deferred to Appendix F.3. Lemma F.2. For any realization z1, . . . , zn and γ > 0, there is N2(γ, Π; z1, . . . , zn) NH(γ2, Π). By Lemma F.2, for any m N+, log N2(2 j, Π; Z) 2m+1j2 log(NH(2 2j, Π)) + (m + 1) log 2 + 2 log(j) log(NH 2 2j, Π) + m + 1 + 1 =: um, Distributionally Robust Policy Learning under Concept Drifts where step (i) uses a + b + c a + b + c for a, b, c 0; step (ii) uses the definition of κ(Π). Then i [n] ϵi h h(zi, Aj[π](xi)) h(zi, Aj 1[π](xi)) i i [n] ϵi h h(zi, Aj[π](xi)) h(zi, Aj 1[π](xi)) i > s k=1 (uk+1 uk) 2 k+1 k + 1) 2 k+1 4Λ n κ(Π) + 7 . Finally, we consider Ξ3. Recall that S0 = { π}, and therefore i [n] ϵih(zi, π(xi)) i [n] ϵih(zi, π(xi)) > s 0 2 exp n2s2 Putting everything together, i=1 ϵih xi, ai, yi, π(xi) Λ n (4κ(Π) + 32) = 2 p Pn i=1 ci(zi)2 n (4κ(Π) + 32). F.3. Proof of Lemma F.2 Fix γ > 0. If NH(γ2, Π) = , the lemma is trivially true. Otherwise, let N0 = NH(γ2; Π). For any realization z1, . . . , zn, define (π i,1, π i,2) = argmax π1,π2 |h(zi, π1(xi)) h(zi, π2(xi))| . Implicitly, (π i,1, π i,2) depends on zi. For an arbitrary positive integer m and i [n], we define Λ2n h(zi, π i,1(xi)) h(zi, π i,2(xi)) 2m , where we recall that Λ2 = 4 Pn i=1 ci(zi)2. We then construct a new set of data { z1, . . . , z N} = {z1, . . . , z1, z2, . . . , z2, . . . , zn, . . . 
, z_n\}$, where $z_i$ appears $n_i$ times; by the choice of $n_i$ and the fact that $\sum_{i=1}^n \big(h(z_i, \pi^*_{i,1}(x_i)) - h(z_i, \pi^*_{i,2}(x_i))\big)^2 \le \Lambda^2$, we have $\tilde N := \sum_{i=1}^n n_i \le m + n$.

By definition, there exists a policy set $S_0$ that is a $\gamma^2$-cover of $\Pi$ under the Hamming distance with respect to $\tilde x := \{\tilde x_1, \ldots, \tilde x_{\tilde N}\}$ such that $|S_0| = N_0$. As a result, for any $\pi \in \Pi$, there exists $\pi' \in S_0$ such that $H(\pi, \pi'; \tilde x) \le \gamma^2$. On the other hand,
$$
H(\pi, \pi'; \tilde x) = \frac{1}{\tilde N} \sum_{i=1}^{\tilde N} 1\{\pi(\tilde x_i) \neq \pi'(\tilde x_i)\}
= \frac{1}{\tilde N} \sum_{i=1}^{n} n_i \, 1\{\pi(x_i) \neq \pi'(x_i)\}
\overset{(i)}{\ge} \frac{1}{\tilde N} \sum_{i=1}^{n} \frac{m}{\Lambda^2} \big(h(z_i, \pi^*_{i,1}(x_i)) - h(z_i, \pi^*_{i,2}(x_i))\big)^2 1\{\pi(x_i) \neq \pi'(x_i)\}
\overset{(ii)}{\ge} \frac{m}{\tilde N \Lambda^2} \sum_{i=1}^{n} \big(h(z_i, \pi(x_i)) - h(z_i, \pi'(x_i))\big)^2 1\{\pi(x_i) \neq \pi'(x_i)\}
\overset{(iii)}{=} \frac{m}{\tilde N \Lambda^2} \sum_{i=1}^{n} \big(h(z_i, \pi(x_i)) - h(z_i, \pi'(x_i))\big)^2.
$$
Above, steps (i) and (ii) follow from the choice of $\tilde z$ and of $(\pi^*_{i,1}, \pi^*_{i,2})$, respectively; step (iii) is because $h(z_i, \pi(x_i)) = h(z_i, \pi'(x_i))$ whenever $\pi(x_i) = \pi'(x_i)$. By the definition of the $\ell_2$ distance and the fact that $\tilde N \le m + n$, we further have
$$
\gamma^2 \ge H(\pi, \pi'; \tilde x) \ge \frac{m}{m+n}\, \ell_2(\pi, \pi'; z)^2.
$$
Since $m$ is arbitrary, taking $m \to \infty$ yields $\ell_2(\pi, \pi'; z) \le \gamma$. By definition, $S_0$ is a $\gamma$-cover of $\Pi$ under $\ell_2$ with respect to $z_1, \ldots, z_n$, and therefore $N_2(\gamma, \Pi; z_1, \ldots, z_n) \le N_H(\gamma^2, \Pi)$.

F.4. Proof of Lemma D.3

By Yang et al. (2022, Lemma B12), $g_\delta(q)$ is differentiable in $q$, and
$$
g_\delta'(q) = -\frac{\partial_q D_{\mathrm{KL}}\big(g_\delta(q) \,\|\, q\big)}{\partial_p D_{\mathrm{KL}}\big(g_\delta(q) \,\|\, q\big)}
= \frac{g_\delta(q)/q - (1 - g_\delta(q))/(1 - q)}{\log\big(g_\delta(q)/(1 - g_\delta(q))\big) - \log\big(q/(1 - q)\big)}.
$$
Also by Yang et al. (2022, Lemma B12), $g_\delta(q)$ is convex in $q$, so $g_\delta'(q)$ is increasing in $q$. Since $q \in [0.4, 0.6]$, $g_\delta'(q) \ge g_\delta'(0.4)$. From the dual form, we can check that $g_\delta(0.4) \ge 0.1$. Plugging in $q = 0.4$, we have
$$
g_\delta'(0.4) = \frac{g_\delta(0.4)/0.4 - (1 - g_\delta(0.4))/0.6}{\log\big(g_\delta(0.4)/(1 - g_\delta(0.4))\big) - \log(2/3)}
= \frac{g_\delta(0.4)/0.24 - 5/3}{\log\big(g_\delta(0.4)/(1 - g_\delta(0.4))\big) - \log(2/3)}.
$$
Since the function $f(x) = \dfrac{x/0.24 - 5/3}{\log(x/(1-x)) - \log(2/3)}$ is increasing in $x$ for $x \in (0, 0.4)$, we conclude that
$$
g_\delta'(0.4) \ge \frac{1/2.4 - 5/3}{\log(1/9) - \log(2/3)} \ge \frac{1}{2},
$$
completing the proof.
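As a sanity check, the conclusion of Lemma D.3 can also be verified numerically: $g_\delta(q)$ is the smallest $p$ with $D_{\mathrm{KL}}(\mathrm{Bern}(p) \,\|\, \mathrm{Bern}(q)) \le \delta$, which can be computed by root-finding, and its slope on $[0.4, 0.6]$ stays above $1/2$ for $\delta \le 0.2$. The sketch below (scipy root-finding; not part of the proof) illustrates this.

```python
import numpy as np
from scipy.optimize import brentq

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def g_delta(q, delta):
    """Worst-case mean: smallest p with KL(Bern(p) || Bern(q)) <= delta."""
    f = lambda p: kl_bern(p, q) - delta
    if f(1e-12) <= 0:            # the KL ball already reaches p = 0
        return 0.0
    return brentq(f, 1e-12, q)   # KL(p||q) decreases in p on (0, q], so the root is unique

delta = 0.2
qs = np.linspace(0.4, 0.6, 201)
vals = np.array([g_delta(q, delta) for q in qs])
slopes = np.gradient(vals, qs)
print(slopes.min())              # numerically above 1/2, consistent with Lemma D.3
```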