# Adversarial Counterfactual Environment Model Learning

Xiong-Hui Chen1,2,4, Yang Yu1,2,4, Zheng-Mao Zhu1,2, Zhihua Yu1,2,3, Zhenjun Chen1,2,3, Chenghe Wang1,2, Yinan Wu3, Hongqiu Wu1,2, Rong-Jun Qin1,2,4, Ruijin Ding5, Fangsheng Huang3

1 National Key Laboratory for Novel Software Technology, Nanjing University; 2 School of Artificial Intelligence, Nanjing University; 3 Meituan; 4 Polixir.ai; 5 Tsinghua University

{chenxh, yuy, zhuzm}@lamda.nju.edu.cn, {yuzh,chenzj}@smail.nju.edu.cn, wangch@lamda.nju.edu.cn, wuyinan02@meituan.com, {wuhq,qinrj}@lamda.nju.edu.cn, drj17@mails.tsinghua.edu.cn, huangfangsheng@meituan.com

The authors contribute equally. Yang Yu is the corresponding author. 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

An accurate environment dynamics model is crucial for various downstream tasks in sequential decision-making, such as counterfactual prediction, off-policy evaluation, and offline reinforcement learning. Currently, these models are learned through empirical risk minimization (ERM) by step-wise fitting of historical transition data. This approach was previously believed to be unreliable over long-horizon rollouts because of compounding errors, which can lead to uncontrollable inaccuracies in predictions. In this paper, we find that the challenge extends beyond long-term prediction errors: we reveal that even when planning with one step, learned dynamics models can also perform poorly due to the selection bias of behavior policies during data collection. This issue significantly misleads the policy optimization process even in identifying single-step optimal actions, further leading to a greater risk in sequential decision-making scenarios. To tackle this problem, we introduce a novel model-learning objective called adversarial weighted empirical risk minimization (AWRM). AWRM incorporates an adversarial policy that exploits the model to generate a data distribution that weakens the model's prediction accuracy, and subsequently, the model is learned under this adversarial data distribution. We implement a practical algorithm, GALILEO, for AWRM and evaluate it on two synthetic tasks, three continuous-control tasks, and a real-world application. The experiments demonstrate that GALILEO can accurately predict the outcomes of counterfactual actions and improve various downstream tasks, including offline policy evaluation and improvement, as well as online decision-making.

1 Introduction

A good environment dynamics model for action-effect prediction is essential for many downstream tasks. For example, humans or agents can leverage this model to conduct simulations to understand future outcomes, evaluate other policies' performance, and discover better policies. With environment models, costly real-world trial-and-error processes can be avoided. These tasks are vital research problems in counterfactual prediction [52, 1], off-policy evaluation (OPE) [32, 35], and offline reinforcement learning (offline RL) [31, 59, 12, 13, 10]. In these problems, the core role of the models is to answer queries on counterfactual data unbiasedly, that is, given states, correctly answer what might happen if we were to carry out actions unseen in the training data. However, addressing counterfactual queries differentiates environment model learning from standard supervised learning (SL), which directly fits the offline dataset for empirical risk minimization (ERM).
In essence, the problem involves training the model on one dataset and testing it on another with a shifted distribution, specifically, the dataset generated by counterfactual queries. This challenge surpasses SL's capability as it violates the independent and identically distributed (i.i.d.) assumption [31].

(Figure 1 panels: (a) a selection bias example; (b) responses to actions in the training data; (c) responses to actions in the action space.)

Figure 1: An example of selection bias and predictions under counterfactual queries. Suppose a ball lies in a 2D plane with position $s_t = (x_t, y_t)$ at time $t$. The ball moves to $s_{t+1} = (x_{t+1}, y_{t+1})$ according to $x_{t+1} = x_t + 1$ and $y_{t+1} \sim \mathcal{N}(y_t + a_t, 2)$. Here, $a_t$ is chosen by a control policy $a_t \sim \pi_\phi(a|s_t) = \mathcal{N}((\phi - y_t)/15, 0.05)$ parameterized by $\phi$, which tries to keep the ball near the line $y = \phi$. In Fig. 1(a), the behavior policy $\mu$ is $\pi_{62.5}$. Fig. 1(b) shows the collected training data and the learned models' predictions of the next position of $y$. The dataset superficially presents the relation that the next $y$ will be smaller when the action is larger. However, the truth is not that a larger $a_t$ causes a smaller $y_{t+1}$, but that the policy selects a small $a_t$ when $y_t$ is close to the target line. Mistakenly exploiting this association leads to local optima with serious factual errors, e.g., believing that $y_{t+1} \approx \pi_\phi^{-1}(y_t|a) + a_t \approx \phi - 14 a_t$, where $\pi_\phi^{-1}$ is the inverse function of $\pi_\phi$. When we estimate the response curves by fixing $y_t$ and reassigning the action $a_t$ to other actions $a_t + \Delta a$, where $\Delta a \in [-1, 1]$ is a variation of the action value, we find that the SL model indeed exploits the association and gives opposite responses, while for AWRM and its practical implementation GALILEO, the predictions are closer to the ground truth ($y_{t+1} \approx y_t + a_t$). The result is in Fig. 1(c), where the darker a region is, the more samples fall in it. AWRM injects data collected by adversarial policies for model learning to eliminate the unidentifiability between $y_{t+1} \approx \pi_\phi^{-1}(y_t|a) + a_t$ and $y_{t+1} \approx y_t + a_t$ in the offline data.

(Figure 2: prediction error (MSE) on counterfactual datasets indexed by the adversarial policy's parameter $\phi \in \{0.0, 12.5, \ldots, 125.0\}$, comparing AWRM-oracle, GALILEO, and SL.)

Figure 2: An illustration of the prediction error in counterfactual datasets. The error of SL is small only on the training data ($\phi = 62.5$) but becomes much larger on datasets far away from the training data. AWRM-oracle selects the oracle worst counterfactual dataset for training in each iteration (pseudocode is in Alg. 2); it reaches small MSE on all datasets and gives correct response curves (Fig. 1(c)). GALILEO approximates the optimal adversarial counterfactual data distribution based on the training data and the model. Although the MSE of GALILEO is a bit larger than SL on the training data, on the counterfactual datasets its MSE is on the same scale as AWRM-oracle.
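The selection-bias effect in Fig. 1 can be reproduced in a few lines. The sketch below simulates the toy system (assuming the dynamics noise term 2 is a variance and the policy noise 0.05 is a standard deviation; the caption is ambiguous on this) and shows that regressing $y_{t+1}$ on the action alone recovers the spurious slope of roughly $-14$, while also conditioning on $y_t$ recovers coefficients close to the true $(1, 1)$.

```python
# Minimal reproduction of the Fig. 1 toy system; parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
phi, T, y = 62.5, 50_000, 57.0                      # target line, number of steps, initial y
ys, acts, next_ys = [], [], []
for _ in range(T):
    a = (phi - y) / 15.0 + rng.normal(0.0, 0.05)    # behavior policy with selection bias
    y_next = y + a + rng.normal(0.0, np.sqrt(2.0))  # true dynamics: y' = y + a + noise
    ys.append(y); acts.append(a); next_ys.append(y_next)
    y = y_next

add_bias = lambda X: np.column_stack([X, np.ones(len(X))])
coef_full, *_ = np.linalg.lstsq(add_bias(np.column_stack([ys, acts])), next_ys, rcond=None)
coef_act, *_ = np.linalg.lstsq(add_bias(np.column_stack([acts])), next_ys, rcond=None)
print("coefficients on (y_t, a_t):", coef_full[:2])  # roughly (1, 1): the true causal effect
print("coefficient on a_t alone:  ", coef_act[0])    # roughly -14: the spurious association
```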
In this paper, we concentrate on faithful dynamics model learning in sequential decision-making settings like RL. Specifically, we first highlight a distinct situation of distribution shift that can easily lead to catastrophic failures in the model's predictions: in many real-world applications, offline data is often gathered using a single policy that exhibits selection bias, meaning that, for each state, actions are chosen unfairly. As illustrated in Fig. 1(a), to maintain the ball's trajectory along a target line, a behavior policy applies a smaller force when the ball's location is nearer to the target line. When a dataset is collected with such selection bias, the association between the states (location) and actions (force) makes it hard for SL to identify the correct causal relationship of the states and actions to the next states (see Fig. 1(c)). Then, when we query the model with counterfactual data, the predictions might be catastrophic failures. Finally, offline policy optimization based on this SL model, even for just seeking one-step optimal actions, will select a totally opposite direction of policy improvement, making the offline policy learning system fail.

The selection bias can be regarded as an instance of the distributional-shift problem in offline model-based RL, which has also received great attention [31, 59, 14, 34]. However, previous methods employing naive supervised learning for environment model learning tend to overlook this issue during the learning process, addressing it instead by limiting policy exploration and learning in high-risk regions. So far, how to learn a faithful environment model that alleviates the problem directly has rarely been discussed.

In this work, we focus on faithful environment model learning techniques. The work is first inspired by weighted empirical risk minimization (WERM), which is a typical solution to the selection bias problem in causal inference for individual treatment effect (ITE) estimation in many scenarios like patients' treatment selection [26, 1, 41]. ITEs measure the effects of treatments on individuals by administering treatments uniformly and evaluating the differences in outcomes. To estimate ITEs from offline datasets with selection bias, these methods estimate an inverse propensity score (IPS) to reweight the training data, approximating the data distribution under a uniform policy, and train the model under this reweighted distribution. Compared with ITE estimation, the extra challenges of faithful model learning in sequential decision-making settings include: (1) the model needs to answer queries from numerous different policies, resulting in various and unknown target data distributions for reweighting, and (2) the IPS should account for the cumulative effects of behavior policies on the state distribution rather than solely focusing on the bias of actions. To address these issues, we propose an objective called adversarial weighted empirical risk minimization (AWRM). In each iteration, AWRM employs adversarial policies to construct an adversarial counterfactual dataset that maximizes the model's prediction error, and drives the model to reduce the prediction risks under the adversarial counterfactual data distribution. However, obtaining the adversarial counterfactual data distribution is infeasible in the offline setting. Therefore, we derive an approximation of the counterfactual data distribution queried by the optimal adversarial policy and provide a tractable solution to learn a model from the approximated data distribution.
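As background for the reweighting perspective above, here is a minimal sketch of the ITE-style WERM-through-IPS recipe. The Gaussian propensity model and linear outcome model are illustrative assumptions, and the truncation of extreme weights is a common stabilization trick rather than part of any specific cited method.

```python
# Sketch of WERM through IPS in the single-step (ITE-style) setting: fit a propensity
# model mu_hat(a|x), then minimize a loss reweighted by 1/mu_hat so that the data look
# as if actions had been assigned uniformly.
import numpy as np

def fit_gaussian_propensity(X, A):
    """Fit a|x ~ N(w^T [x, 1], sigma^2) by least squares and return a density function."""
    Xb = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(Xb, A, rcond=None)
    sigma = (A - Xb @ w).std() + 1e-6
    return lambda x, a: np.exp(-0.5 * ((a - np.append(x, 1.0) @ w) / sigma) ** 2) / (
        np.sqrt(2 * np.pi) * sigma)

def weighted_erm(X, A, Y, mu_hat, clip=20.0):
    """Weighted least squares on features (x, a) with IPS weights 1/mu_hat(a|x)."""
    w = np.array([1.0 / mu_hat(x, a) for x, a in zip(X, A)])
    w = np.clip(w, 0.0, clip)                      # truncate extreme weights for stability
    F = np.column_stack([X, A, np.ones(len(X))])
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(F * sw[:, None], Y * sw, rcond=None)
    return theta                                   # coefficients of the reweighted outcome model
```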
As a result, we propose a practical approach named Generative Adversarial offLIne counterfactuaL Environment mOdel learning (GALILEO) for AWRM. Fig. 2 illustrates the difference in the prediction errors of the models learned by these algorithms. Experiments are conducted on two synthetic tasks, three continuous-control tasks, and a real-world application. We first verify that GALILEO makes accurate predictions on counterfactual data queried by other policies compared with baselines. We then demonstrate that the model learned by GALILEO is helpful to several downstream tasks, including offline policy evaluation and improvement, and online decision-making in a large-scale production environment.

2 Preliminaries

We first introduce weighted empirical risk minimization (WERM) through inverse propensity scoring (IPS), which is commonly used in individualized treatment effect (ITE) estimation [43]. It can be regarded as a scenario of "single-step" model learning. We define $M^*(y|x, a)$ as the one-step environment, where $x$ denotes the state vector containing pre-treatment covariates (such as age and weight), $a$ denotes the treatment variable, which is the action intervening on the state $x$, and $y$ is the feedback of the environment. When the offline dataset is collected with a behavior policy $\mu(a|x)$ that has selection bias, a classical solution to this problem is WERM through IPS $\omega$ [48, 3, 29]:

Definition 2.1. The learning objective of WERM through IPS is formulated as
$$\min_{M \in \mathcal{M}} \mathbb{E}_{x,a,y \sim p^{\mu}_{M^*}} \big[\omega(x, a)\, \ell(M(y|x, a), y)\big], \qquad (1)$$
where $\omega(x, a) := \frac{\beta(a|x)}{\mu(a|x)}$, $\beta$ and $\mu$ denote the policies in the testing and training domains, and the joint probability $p^{\mu}_{M^*}(x, a, y) := \rho_0(x)\mu(a|x)M^*(y|x, a)$, in which $\rho_0(x)$ is the distribution of the state. $\mathcal{M}$ is the model space and $\ell$ is a loss function.

The weight $\omega$ is also known as the importance sampling (IS) weight, which corrects the sampling bias by aligning the training data distribution with the testing one. By selecting different $\hat{\omega}$ to approximate $\omega$ when learning the model $M$, current environment model learning algorithms employing reweighting fall into this framework. For standard ITE estimation, $\omega = \frac{1}{\hat{\mu}}$ (i.e., $\beta$ is a uniform policy) for balancing treatments, where $\hat{\mu}$ is an approximation of the behavior policy $\mu$. Note that this is a reasonable weight in ITE estimation: ITEs are defined to evaluate the differences of effects on each state under a uniform policy.

In the sequential decision-making setting, decision-making processes in a sequential environment are often formulated as a Markov Decision Process (MDP) [51]. An MDP depicts an agent interacting with the environment through actions. In the first step, states are sampled from an initial state distribution $x_0 \sim \rho_0(x)$. Then, at each time-step $t \in \{0, 1, 2, \ldots\}$, the agent takes an action $a_t \in \mathcal{A}$ through a policy $\pi(a_t|x_t) \in \Pi$ based on the state $x_t \in \mathcal{X}$; the agent then receives a reward $r_t$ from a reward function $r(x_t, a_t) \in \mathbb{R}$ and transits to the next state $x_{t+1}$ given by a transition function $M^*(x_{t+1}|x_t, a_t)$ built into the environment. $\Pi$, $\mathcal{X}$, and $\mathcal{A}$ denote the policy, state, and action spaces.

3 Related Work

We review related adversarial algorithms for model learning in the following and leave other related work to Appx. F. In particular, ITEs in the Rubin causal model [48] and causal effect estimation in the structural causal model [38] have attracted widespread attention in recent years [56, 55, 58, 7]. GANITE [58] uses a generator to fill in counterfactual outcomes for each data pair and a discriminator to judge the source (treatment group or control group) of the filled data pair.
The generator is trained to minimize the output of the discriminator. [7] propose SCIGAN, which extends GANITE to continuous ITE estimation via a hierarchical discriminator architecture. In real-world applications, environment model learning based on Generative Adversarial Imitation Learning (GAIL) has also been adopted for sequential decision-making problems [25]. GAIL was first proposed for policy imitation [25]; it uses the imitated policy to generate trajectories by interacting with the environment, and the policy is learned from these trajectories through RL by maximizing the cumulative rewards given by the discriminator. [47, 11] use GAIL for environment model learning by regarding the environment model as the generator and the behavior policy as the "environment" of standard GAIL. [16] further inject the technique into a unified objective for model-based RL, which jointly optimizes the model and the policy. Our study reveals the connection between adversarial model learning and WERM through IPS, where previous adversarial model learning methods can be regarded as partial implementations of GALILEO, explaining the effectiveness of the former.

4 Adversarial Counterfactual Environment Model Learning

In this section, we first propose a new offline model-learning objective for the sequential decision-making setting in Sec. 4.1; in Sec. 4.2, we give a surrogate objective to the proposed objective; finally, we give a practical solution in Sec. 4.3.

4.1 Problem Formulation

In scenarios like offline policy evaluation and improvement, it is crucial for the environment model to have generalization ability on counterfactual queries, as we need to query accurate feedback from $M$ for numerous different policies. Referring to the formulation of WERM through IPS in Def. 2.1, these requirements necessitate minimizing counterfactual-query risks for $M$ under multiple unknown policies, rather than focusing on a specific target policy $\beta$. More specifically, the question is: if $\beta$ is unknown and can be varied, how can we generally reduce the risks in counterfactual queries? In this article, we call the model learning problem in this setting counterfactual environment model learning and propose a new objective to address the issue. To be compatible with multi-step environment model learning, we first define a generalized WERM through IPS based on Def. 2.1:

Definition 4.1. In an MDP, given a transition function $M^*$ that satisfies $M^*(x'|x, a) > 0$ for all $x \in \mathcal{X}, a \in \mathcal{A}, x' \in \mathcal{X}$ and a policy $\mu$ that satisfies $\mu(a|x) > 0$ for all $a \in \mathcal{A}, x \in \mathcal{X}$, the learning objective of generalized WERM through IPS is:
$$\min_{M \in \mathcal{M}} \mathbb{E}_{x,a,x' \sim \rho^{\mu}_{M^*}} \big[\omega(x, a, x')\, \ell_M(x, a, x')\big], \qquad (2)$$
where $\omega(x, a, x') = \frac{\rho^{\beta}_{M^*}(x,a,x')}{\rho^{\mu}_{M^*}(x,a,x')}$, and $\rho^{\mu}_{M^*}$ and $\rho^{\beta}_{M^*}$ are the training and testing data distributions collected by policies $\mu$ and $\beta$ respectively. We define $\ell_M(x, a, x') := \ell(M(x'|x, a), x')$ for brevity.

In an MDP, for any given policy $\pi$, we have $\rho^{\pi}_{M^*}(x, a, x') = \rho^{\pi}_{M^*}(x)\,\pi(a|x)\,M^*(x'|x, a)$, where $\rho^{\pi}_{M^*}(x)$ denotes the occupancy measure of $x$ for policy $\pi$. This measure can be defined as $(1-\gamma)\,\mathbb{E}_{x_0 \sim \rho_0}\big[\sum_{t=0}^{\infty} \gamma^t \Pr^{\pi}(x_t = x \mid x_0, M^*)\big]$ [51, 25], where $\Pr^{\pi}[x_t = x \mid x_0, M^*]$ is the probability that the policy $\pi$, executing in the environment $M^*$ and starting at the state $x_0$, visits $x$ at time-step $t$. Here $\gamma \in [0, 1]$ is the discount factor. We also define $\rho^{\pi}_{M^*}(x, a) := \rho^{\pi}_{M^*}(x)\,\pi(a|x)$ for simplicity. With this definition, $\omega$ can be rewritten as $\omega = \frac{\rho^{\beta}_{M^*}(x)\beta(a|x)M^*(x'|x,a)}{\rho^{\mu}_{M^*}(x)\mu(a|x)M^*(x'|x,a)} = \frac{\rho^{\beta}_{M^*}(x,a)}{\rho^{\mu}_{M^*}(x,a)}$. In single-step environments, for any policy $\pi$, $\rho^{\pi}_{M^*}(x) = \rho_0(x)$.
Consequently, we obtain $\omega = \frac{\rho_0(x)\beta(a|x)}{\rho_0(x)\mu(a|x)} = \frac{\beta(a|x)}{\mu(a|x)}$, and the objective degenerates to Eq. (1). Therefore, Def. 2.1 is a special case of this generalized form.

Remark 4.2. $\omega$ is referred to as the density ratio and is commonly used to correct the weighting of rewards in off-policy datasets to estimate the value of a specific target policy in off-policy evaluation [35, 60]. Recent studies in offline RL also provide similar evidence through upper-bound analysis, suggesting that offline model learning should be corrected toward a specific target policy's distribution using $\omega$ [57]. We derive the objective from the perspective of selection-bias correction, further demonstrating the necessity and effects of this term.

In contrast to previous studies, in this article we propose an objective for faithful model learning that can generally reduce the risks in counterfactual queries in scenarios where $\beta$ is unknown and can be varied. To address the problem, we introduce adversarial policies that iteratively induce the worst prediction performance of the current model, and propose to optimize WERM under the adversarial policies. In particular, we propose Adversarial Weighted empirical Risk Minimization (AWRM) based on Def. 4.1 to handle this problem.

Definition 4.3. Given the MDP transition function $M^*$, the learning objective of adversarial weighted empirical risk minimization through IPS is formulated as
$$\min_{M \in \mathcal{M}} \max_{\beta \in \Pi} L(\rho^{\beta}_{M^*}, M) = \min_{M \in \mathcal{M}} \max_{\beta \in \Pi} \mathbb{E}_{x,a,x' \sim \rho^{\mu}_{M^*}} \big[\omega(x, a \mid \rho^{\beta}_{M^*})\, \ell_M(x, a, x')\big], \qquad (3)$$
where $\omega(x, a \mid \rho^{\beta}_{M^*}) = \frac{\rho^{\beta}_{M^*}(x,a)}{\rho^{\mu}_{M^*}(x,a)}$, and the re-weighting term $\omega(x, a \mid \rho^{\beta}_{M^*})$ is conditioned on the distribution $\rho^{\beta}_{M^*}$ of the adversarial policy $\beta$. In the following, we omit $\rho^{\beta}_{M^*}$ and write $\omega(x, a)$ for brevity.

In a nutshell, Eq. (3) minimizes the maximum model loss under all counterfactual data distributions $\rho^{\beta}_{M^*}, \beta \in \Pi$, to guarantee generalization for counterfactual queries made by policies in $\Pi$.

4.2 Surrogate AWRM through Optimal Adversarial Data Distribution Approximation

The main challenge of solving AWRM is constructing the data distribution $\rho^{\beta^*}_{M^*}$ of the best-response policy $\beta^*$ in $M^*$, since obtaining additional data from $M^*$ can be expensive in real-world applications. In this paper, instead of deriving the optimal $\beta^*$, our solution is to estimate offline the optimal adversarial distribution $\rho^{\beta^*}_{M^*}(x, a, x')$ with respect to $M$, enabling the construction of a surrogate objective to optimize $M$ without directly querying the real environment $M^*$. In the following, we select $\ell_M$ as the negative log-likelihood loss for our full derivation, instantiating $L(\rho^{\beta}_{M^*}, M)$ in Eq. (3) as $\mathbb{E}_{x,a \sim \rho^{\mu}_{M^*}}\big[\omega(x, a \mid \rho^{\beta}_{M^*})\,\mathbb{E}_{M^*}(-\log M(x'|x, a))\big]$, where $\mathbb{E}_{M^*}[\cdot]$ denotes $\mathbb{E}_{x' \sim M^*(x'|x,a)}[\cdot]$.

Ideally, for any given $M$, the optimal $\beta$ is obviously the one that makes $\rho^{\beta}_{M^*}(x, a)$ assign all density to the point with the largest negative log-likelihood. However, this maximization process is impractical, particularly in continuous spaces. To provide a tractable yet relaxed solution, we introduce an L2 regularizer to the original objective in Eq. (3):
$$\min_{M \in \mathcal{M}} \max_{\beta \in \Pi} \tilde{L}(\rho^{\beta}_{M^*}, M) = \min_{M \in \mathcal{M}} \max_{\beta \in \Pi} L(\rho^{\beta}_{M^*}, M) - \frac{\alpha}{2}\big\|\rho^{\beta}_{M^*}(\cdot, \cdot)\big\|_2^2, \qquad (4)$$
where $\alpha$ denotes the regularization coefficient of $\rho^{\beta}_{M^*}$ and $\|\rho^{\beta}_{M^*}(\cdot, \cdot)\|_2^2 = \int_{\mathcal{X},\mathcal{A}} (\rho^{\beta}_{M^*}(x, a))^2\, da\, dx$.
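Before turning to the offline derivation, note that the AWRM-oracle baseline in Fig. 2 (pseudocode in Alg. 2) solves the inner maximization directly by querying the real environment. A hedged sketch is given below; every interface (collect_dataset, fit, eval_loss) is a hypothetical placeholder, and the candidate-policy set stands in for the policy class $\Pi$.

```python
# Sketch of the AWRM-oracle procedure: iteratively pick the counterfactual dataset the
# current model fits worst (inner max) and retrain the model on it (outer min). This
# requires real-environment access, which is exactly what is unavailable offline.
def awrm_oracle(model, candidate_policies, collect_dataset, fit, eval_loss, n_iters=50):
    """collect_dataset(beta): roll beta out in the real environment, return (x, a, x') data.
    fit(model, data): one round of (weighted) ERM; eval_loss(model, data): prediction loss."""
    datasets = [collect_dataset(beta) for beta in candidate_policies]
    for _ in range(n_iters):
        worst = max(range(len(datasets)), key=lambda i: eval_loss(model, datasets[i]))
        model = fit(model, datasets[worst])
    return model
```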
We now present the final results and the intuitions behind them, and provide the full derivation in Appx. A. Supposing we have $\tilde{\rho}^{\beta^*}_{M^*}$, the approximated data distribution of the approximate best-response policy $\tilde{\beta}^*$ against the model $M_\theta$ parameterized by $\theta$, we can find the optimal $\theta$ of $\min_\theta \max_{\beta \in \Pi} \tilde{L}(\rho^{\beta}_{M^*}, M_\theta)$ (Eq. (4)) through iterative optimization of the objective $\theta_{t+1} = \arg\min_\theta L(\tilde{\rho}^{\beta^*}_{M^*}, M_\theta)$. To this end, we approximate $\tilde{\rho}^{\beta^*}_{M^*}$ via the last-iteration model $M_{\theta_t}$ and derive an upper-bound objective for $\min_\theta L(\tilde{\rho}^{\beta^*}_{M^*}, M_\theta)$:
$$\theta_{t+1} = \arg\min_{\theta}\; \mathbb{E}_{\rho^{\mu}_{M^*}}\Bigg[\frac{1}{\alpha_0(x, a)}\big(-\log M_\theta(x'|x, a)\big)\underbrace{\Bigg(\underbrace{f\Big(\tfrac{\rho^{\mu}_{M_{\theta_t}}(x, a, x')}{\rho^{\mu}_{M^*}(x, a, x')}\Big)}_{\text{discrepancy}} - \underbrace{f\Big(\tfrac{\rho^{\mu}_{M_{\theta_t}}(x, a)}{\rho^{\mu}_{M^*}(x, a)}\Big)}_{\text{density ratio baseline}} + \underbrace{H_{M^*}(x, a)}_{\text{stochasticity}}\Bigg)}_{W(x, a, x')}\Bigg], \qquad (5)$$
where $\mathbb{E}_{\rho^{\mu}_{M^*}}[\cdot]$ denotes $\mathbb{E}_{x,a,x' \sim \rho^{\mu}_{M^*}}[\cdot]$, $f$ is a convex and lower semi-continuous (l.s.c.) function satisfying $f(x) \geq 0$ for all $x$, which is also called the $f$ function in f-divergence [2], $\alpha_0(x, a) = \alpha_{M_{\theta_t}}\rho^{\mu}_{M^*}(x, a)$, and $H_{M^*}(x, a)$ denotes the entropy of $M^*(\cdot|x, a)$.

Remark 4.4 (Intuition of W). After the derivation, we find that the optimal adversarial data distribution can be approximated by $\tilde{\rho}^{\beta^*}_{M^*}(x, a) = \int_{\mathcal{X}} \rho^{\mu}_{M^*}(x, a, x')\, W(x, a, x')\, dx'$ (see Appx. A), leading to the upper-bound objective Eq. (5), which is a WERM dynamically weighted by $W$. Intuitively, $W$ assigns more learning propensity to data points with (1) a larger discrepancy between $\rho^{\mu}_{M_{\theta_t}}$ (generated by the model) and $\rho^{\mu}_{M^*}$ (the real-data distribution), or (2) larger stochasticity of the real model $M^*$. The latter is contributed by the entropy $H_{M^*}$, while the former is contributed by the first two terms combined. In particular, through the definition of f-divergence, we know that the discrepancy between two distributions $P$ and $Q$ can be measured by $\int_{\mathcal{X}} Q(x) f(P(x)/Q(x))\, dx$; thus the term $f(\rho^{\mu}_{M_{\theta_t}}(x,a,x')/\rho^{\mu}_{M^*}(x,a,x'))$ can be interpreted as the unit of discrepancy between $\rho^{\mu}_{M_{\theta_t}}(x,a,x')$ and $\rho^{\mu}_{M^*}(x,a,x')$, while $f(\rho^{\mu}_{M_{\theta_t}}(x,a)/\rho^{\mu}_{M^*}(x,a))$ serves as a baseline on $x$ and $a$ measured by $f$ to balance the discrepancy contributed by $x$ and $a$, making $M$ focus on errors in $x'$. In summary, by adjusting the weights $W$, the learning process iteratively exploits subtle errors of the current model at any data point, regardless of how large a proportion it contributes to the original data distribution, to eliminate potential unidentifiability on counterfactual data caused by selection bias.

4.3 Tractable Solution

In Eq. (5), the terms $f(\rho^{\mu}_{M_{\theta_t}}(x,a,x')/\rho^{\mu}_{M^*}(x,a,x')) - f(\rho^{\mu}_{M_{\theta_t}}(x,a)/\rho^{\mu}_{M^*}(x,a))$ are still intractable. Thanks to previous successful practices in GAN [19] and GAIL [25], we achieve the objective via a generator-discriminator-paradigm objective through a similar derivation. We show the results as follows and leave the complete derivation to Appx. A.4. In particular, by introducing two discriminators $D_{\phi_0}(x, a, x')$ and $D_{\phi_1}(x, a)$, we can optimize the surrogate objective Eq. (5) via:
$$\theta_{t+1} = \arg\max_{\theta}\; \mathbb{E}_{\rho^{\hat{\mu}}_{M_{\theta_t}}}\big[A_{\phi^*_0,\phi^*_1}(x, a, x') \log M_\theta(x'|x, a)\big] + \mathbb{E}_{\rho^{\mu}_{M^*}}\big[(H_{M^*}(x, a) - A_{\phi^*_0,\phi^*_1}(x, a, x')) \log M_\theta(x'|x, a)\big]$$
$$\text{s.t. } \phi^*_0 = \arg\max_{\phi_0}\; \mathbb{E}_{\rho^{\mu}_{M^*}}\big[\log D_{\phi_0}(x, a, x')\big] + \mathbb{E}_{\rho^{\hat{\mu}}_{M_{\theta_t}}}\big[\log(1 - D_{\phi_0}(x, a, x'))\big],$$
$$\phi^*_1 = \arg\max_{\phi_1}\; \mathbb{E}_{\rho^{\mu}_{M^*}}\big[\log D_{\phi_1}(x, a)\big] + \mathbb{E}_{\rho^{\hat{\mu}}_{M_{\theta_t}}}\big[\log(1 - D_{\phi_1}(x, a))\big], \qquad (6)$$
where $\mathbb{E}_{\rho^{\mu}_{M^*}}[\cdot]$ is a simplification of $\mathbb{E}_{x,a,x' \sim \rho^{\mu}_{M^*}}[\cdot]$, $A_{\phi^*_0,\phi^*_1}(x, a, x') = \log D_{\phi^*_0}(x, a, x') - \log D_{\phi^*_1}(x, a)$, and $\phi_0$ and $\phi_1$ are the parameters of $D_{\phi_0}$ and $D_{\phi_1}$ respectively. We learn a policy $\hat{\mu} \approx \mu$ via imitation learning based on the offline dataset $D_{\text{real}}$ [40, 25]. Note that in this process, we ignore the term $\alpha_0(x, a)$ to simplify the objective; the discussion of the impacts of removing $\alpha_0(x, a)$ is left to Appx. B.
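To make Eq. (6) concrete, the following is a compressed single-iteration sketch in PyTorch. It is illustrative only and not the released implementation: the policy-gradient term uses a plain REINFORCE surrogate instead of TRPO/PPO (cf. Alg. 1), the entropy term $H_{M^*}$ is dropped, and all network sizes, interfaces, and hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn

def mlp(n_in, n_out):
    # small two-layer MLP used for the model and both discriminators
    return nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_out))

class GaussianModel(nn.Module):
    """Dynamics model M_theta(x'|x, a) as a diagonal Gaussian."""
    def __init__(self, dim_x, dim_a):
        super().__init__()
        self.mean_net = mlp(dim_x + dim_a, dim_x)
        self.log_std = nn.Parameter(torch.zeros(dim_x))
    def dist(self, x, a):
        return torch.distributions.Normal(self.mean_net(torch.cat([x, a], dim=-1)),
                                          self.log_std.exp())

def galileo_step(M, D0, D1, mu_hat, batch, opt_model, opt_disc, horizon=5):
    """One toy iteration. M: GaussianModel; D0, D1: nn.Modules returning one logit;
    mu_hat: cloned behavior policy mapping states to actions; batch: (x, a, x_next)
    tensors from D_real; opt_disc optimizes D0 and D1, opt_model optimizes M."""
    x, a, x_next = batch
    # (1) roll mu_hat inside the current model to build D_gen (no gradients needed here)
    with torch.no_grad():
        xs, acts, xns, xg = [], [], [], x
        for _ in range(horizon):
            ag = mu_hat(xg)
            xn = M.dist(xg, ag).sample()
            xs.append(xg); acts.append(ag); xns.append(xn)
            xg = xn
        xg, ag, xn = torch.cat(xs), torch.cat(acts), torch.cat(xns)
    # (2) update the two discriminators with the standard GAN loss (constraints in Eq. (6))
    bce = nn.BCEWithLogitsLoss()
    real0, fake0 = D0(torch.cat([x, a, x_next], -1)), D0(torch.cat([xg, ag, xn], -1))
    real1, fake1 = D1(torch.cat([x, a], -1)), D1(torch.cat([xg, ag], -1))
    d_loss = (bce(real0, torch.ones_like(real0)) + bce(fake0, torch.zeros_like(fake0))
              + bce(real1, torch.ones_like(real1)) + bce(fake1, torch.zeros_like(fake1)))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
    # (3) advantage A = log D0 - log D1 on generated and real transitions
    def advantage(s, act, s_next):
        d0 = torch.sigmoid(D0(torch.cat([s, act, s_next], -1))).clamp_min(1e-6)
        d1 = torch.sigmoid(D1(torch.cat([s, act], -1))).clamp_min(1e-6)
        return (d0.log() - d1.log()).squeeze(-1)
    with torch.no_grad():
        adv_gen, adv_real = advantage(xg, ag, xn), advantage(x, a, x_next)
    # (4) g_pg: REINFORCE surrogate on D_gen; g_sl: (-A)-weighted likelihood on D_real
    logp_gen = M.dist(xg, ag).log_prob(xn).sum(-1)
    logp_real = M.dist(x, a).log_prob(x_next).sum(-1)
    model_loss = -(adv_gen * logp_gen).mean() - ((-adv_real) * logp_real).mean()
    opt_model.zero_grad(); model_loss.backward(); opt_model.step()
```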
(Figure 3 depicts the GALILEO workflow: (1) data generation by rolling out the behavior policy in the dynamics model, (2) updating the discriminators with the offline and generated data, and (3) updating the dynamics model with adversarial reweighting.)

Figure 3: Illustration of the GALILEO workflow.

The overall optimization pipeline is illustrated in Fig. 3. In Eq. (6), the reweighting term $W$ from Eq. (5) is split into two terms on the RHS of the equation: the first term is a GAIL-style objective [25], treating $M_\theta$ as the policy generator, $\hat{\mu}$ as the "environment", and $A$ as the advantage function, while the second term is WERM weighted by $H_{M^*} - A_{\phi^*_0,\phi^*_1}$. The first term resembles previous adversarial model learning objectives [46, 47, 11]. These two terms have intuitive explanations: the first term assigns learning weights to data generated by the model $M_{\theta_t}$; if the predictions of the model appear realistic, as mainly assessed by $D_{\phi^*_0}$, the propensity weights are increased, encouraging the model to generate more such data. Conversely, the second term assigns weights to real data generated by $M^*$; if the model's predictions seem unrealistic (mainly assessed by $D_{\phi^*_0}$) or stochastic (evaluated by $H_{M^*}$), the propensity weights are increased, encouraging the model to pay more attention to these real data points when improving the likelihoods.

(Figure 4 shows bar charts of performance on (a) the GNFC tasks, e.g., e0.05_p0.05, and (b) the TCGA tasks, comparing GALILEO, SL, IPW, and SCIGAN.)

Figure 4: Illustration of the performance in GNFC and TCGA. The grey bar denotes the standard error (×0.3 for brevity) of 3 random seeds.

Based on the above implementation, we propose Generative Adversarial offLIne counterfactuaL Environment mOdel learning (GALILEO) for environment model learning. We give a brief description of GALILEO in Alg. 1 and the details in Appx. E.

Algorithm 1 Pseudocode for GALILEO
Input: $D_{\text{real}}$: offline dataset sampled from $\rho^{\mu}_{M^*}$, where $\mu$ is the behavior policy; $N$: total iterations.
Process:
1: Approximate a behavior policy $\hat{\mu}$ via behavior cloning on the offline dataset $D_{\text{real}}$
2: Initialize an environment model $M_{\theta_1}$
3: for $t = 1 : N$ do
4:   Use $\hat{\mu}$ to generate a dataset $D_{\text{gen}}$ with the model $M_{\theta_t}$
5:   Update the discriminators $D_{\phi_0}$ and $D_{\phi_1}$ through the second and third equations in Eq. (6), where $\rho^{\hat{\mu}}_{M_{\theta_t}}$ is estimated by $D_{\text{gen}}$ and $\rho^{\mu}_{M^*}$ is estimated by $D_{\text{real}}$
6:   Perform generative adversarial training for $M_{\theta_t}$ by regarding $A_{\phi^*_0,\phi^*_1}$ as the advantage function and computing the gradient for $M_{\theta_t}$, named $g_{\text{pg}}$, with a standard policy gradient method like TRPO [44] or PPO [45] based on $D_{\text{gen}}$
7:   Regard $H_{M^*} - A_{\phi^*_0,\phi^*_1}$ as the reweighting term for WERM and compute the gradient for $M_{\theta_t}$ based on $D_{\text{real}}$; record it as $g_{\text{sl}}$
8:   Update the model: $\theta_{t+1} \leftarrow \theta_t + g_{\text{pg}} + g_{\text{sl}}$
9: end for

5 Experiments

In this section, we first conduct experiments in two synthetic environments to quantify the performance of GALILEO on counterfactual queries (code: https://github.com/xionghuichen/galileo). Then we deploy GALILEO in two complex environments, MuJoCo in Gym [53] and a real-world food-delivery platform, to test the performance of GALILEO in difficult tasks. The results are in Sec. 5.2. Finally, to further verify the ability of GALILEO, in Sec. 5.3 we apply models learned by GALILEO to several downstream tasks including off-policy evaluation, offline policy improvement, and online decision-making in a production environment. The algorithms compared are: (1) SL: standard empirical risk minimization for model learning; (2) IPW [50]: a standard implementation of WERM based on IPS; (3) SCIGAN [7]: an adversarial algorithm for model learning used for causal effect estimation, which can be roughly regarded as a partial implementation of GALILEO (refer to Appx. E.2). We give a detailed description in Appx. G.2.
5.1 Environment Settings

Synthetic Environments. Previous experiments on counterfactual environment model learning are based on single-step semi-synthetic data simulation [7]. As GALILEO is compatible with single-step environment model learning, we first benchmark GALILEO on the same task, named TCGA, as previous studies do [7]. Based on the three synthetic response functions, we construct 9 tasks by choosing different parameters of selection bias on $\mu$, which is constructed with a beta distribution, and design a coefficient $c$ to control the selection bias. We name the tasks in the format "t?_bias?". For example, t1_bias2 is the task with the first response function and $c = 2$. The details of TCGA are in Appx. G.1.2. Besides, for sequential environment model learning under selection bias, we construct a new synthetic environment, general negative feedback control (GNFC), which represents a classic type of task with policies having selection bias; Fig. 1(a) is also an instance of GNFC. We construct 9 tasks on GNFC by adding to the behavior policies $\mu$ different scales of uniform noise $U(-e, e)$ with probabilities $p$. Similarly, we name them in the format "e?_p?".

Continuous-control Environments. We select 3 MuJoCo environments from D4RL [17] to construct our model learning tasks. We compare with a standard transition model learning algorithm used in previous offline model-based RL algorithms [59, 30], which is standard supervised learning; we name this method OFF-SL. Besides, we also implement IPW and SCIGAN as baselines. In the D4RL benchmark, only the medium tasks are collected with a fixed policy (i.e., the behavior policy has roughly 1/3 of the performance of the expert policy), which best matches our proposed problem. So we train models on the datasets HalfCheetah-medium, Walker2d-medium, and Hopper-medium.

A Real-world Large-scale Food-delivery Platform. We finally deploy GALILEO in a real-world large-scale food-delivery platform. We focus on a Budget Allocation task over Time periods (BAT) in the platform (see Appx. G.1.3 for details). The goal of the BAT task is to handle the imbalance between the orders demanded by customers and the supply of delivery clerks in different time periods by allocating reasonable allowances to those time periods. The challenge of environment model learning in BAT tasks is similar to the challenge in Fig. 1: the behavior policy is a human-expert policy, which tends to increase the budget of allowances in time periods with a lower supply of delivery clerks, and otherwise tends to decrease the budget (we give a real-data instance in Appx. G.1.3).

5.2 Prediction Accuracy on Shifted Data Distributions

Test in Synthetic Environments. For all of the tasks, we select the mean integrated square error, $\mathrm{MISE} = \mathbb{E}\big[\int_{\mathcal{A}} (M^*(x'|x, a) - M(x'|x, a))^2\, da\big]$, as the metric, which measures the accuracy of counterfactual queries by considering the prediction errors over the whole action space.
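A hedged sketch of how this metric can be estimated numerically is given below (one-dimensional actions for clarity; the interfaces true_response and model_response are assumptions, and the models are represented by their mean response predictions, as in the TCGA-style tasks).

```python
import numpy as np

def mise(true_response, model_response, states, action_grid):
    """Grid estimate of E_x[ integral over A of (M*(x'|x,a) - M(x'|x,a))^2 da ]."""
    width = float(action_grid[-1] - action_grid[0])
    errs = []
    for x in states:
        sq = [np.sum((true_response(x, a) - model_response(x, a)) ** 2) for a in action_grid]
        errs.append(np.mean(sq) * width)   # Riemann approximation of the integral over A
    return float(np.mean(errs))
```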
The results are summarized in Fig. 4, and the detailed results can be found in Appx. H. They show that the properties of the behavior policy (i.e., $e$ and $p$) dominate the generalization ability of the baseline algorithms. When $e = 0.05$, almost all of the baselines fail and give a completely opposite response curve, while GALILEO gives the correct response (see Fig. 5). IPW still performs well when $0.2 \le e \le 1.0$ but fails when $e = 0.05, p \le 0.2$. We also found that SCIGAN reaches better performance than the other baselines when $e = 0.05, p \le 0.2$, but its results in the other tasks are unstable. GALILEO is the only algorithm that is robust to the selection bias and outputs correct response curves in all of the tasks. Based on this experiment, we also point out that the commonly used overlap assumption is unreasonable to a certain extent, especially in real-world applications, since it is impractical to inject noise into the whole action space.

(Figure 5 plots the averaged response over the normalized action for the real environment, GALILEO, SL, IPW, and SCIGAN.)

Figure 5: Illustration of the averaged response curves in task e0.05_p0.2.

The problem of the overlap assumption being violated, e.g., $e < 1$ in our setting, should be taken into consideration; otherwise, an algorithm will be hard to use in practice if it is sensitive to the noise range. On the other hand, we found that the phenomenon in the TCGA experiment is similar to the one in GNFC, which demonstrates the compatibility of GALILEO with single-step environments. We also found that the results of IPW are unstable in the TCGA experiment. This might be because the behavior policy is modeled with a beta distribution while the propensity score $\hat{\mu}$ is modeled with a Gaussian distribution. Since IPW directly reweights the loss with $\frac{1}{\hat{\mu}}$, the results are sensitive to the error of $\hat{\mu}$. Finally, we plot the averaged response curves, which are constructed by equidistantly sampling actions from the action space and averaging the feedback over the states in the dataset as the averaged response. One result is in Fig. 5 (all curves can be seen in Appx. H). For those tasks where the baselines fail to reconstruct the response curves, GALILEO not only reaches a better MISE score but reconstructs almost exact responses, while the baselines might give completely opposite responses.

Test in MuJoCo Benchmarks. We test the prediction error of the learned models on the corresponding unseen expert and medium-replay datasets. Fig. 6 illustrates the results on halfcheetah. We can see that all algorithms perform well on the training datasets; OFF-SL even reaches a slightly lower error. However, when we verify the models on the expert and medium-replay datasets, which are collected by other policies, the performance of GALILEO is more stable and better than that of all other algorithms. As training continues, the baseline algorithms even get worse and worse. The phenomena are similar among the three datasets, and we leave the full results to Appx. H.5.

(Figure 6 shows learning curves of GALILEO, OFF-SL, SCIGAN, and IPW on halfcheetah for three datasets: (a) medium (train), (b) medium-replay (test), (c) expert (test), with model-update steps on the X-axis and prediction error on the Y-axis.)

Figure 6: Illustration of the learning curves of the halfcheetah tasks (full results are in Appx. H.5). Figures with titles ending in "(train)" use the dataset used for training, while titles ending in "(test)" use datasets used only for testing. The X-axis records the steps of the environment model update, and the Y-axis is the prediction error at the corresponding step evaluated on the datasets. The solid curves are the mean and the shadow is the standard error over three seeds.
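The held-out evaluation behind Fig. 6 can be sketched as follows. Dataset loading uses the public d4rl API; the model's prediction interface is an assumption, and the listed dataset names are examples.

```python
import gym
import d4rl  # noqa: F401  (registers the D4RL datasets with gym)
import numpy as np

def one_step_mse(model_predict, dataset_name):
    """Mean squared one-step prediction error of a learned model on a D4RL dataset.
    `model_predict(states, actions) -> predicted next states` is an assumed interface."""
    data = d4rl.qlearning_dataset(gym.make(dataset_name))
    pred = model_predict(data["observations"], data["actions"])
    return float(np.mean((pred - data["next_observations"]) ** 2))

# e.g., compare the training set with held-out sets collected by other policies:
# for name in ["halfcheetah-medium-v2", "halfcheetah-medium-replay-v2", "halfcheetah-expert-v2"]:
#     print(name, one_step_mse(my_model.predict, name))   # my_model is hypothetical
```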
(Figure 7 panels: (a) response curves in City A — rescaled averaged response vs. normalized allocated allowances; (b) AUUC curves — rescaled cumulative treatment effects vs. cumulative samples (rescaled to 1) for random, GALILEO, and SL; (c) A/B test in City A — rescaled five-minute order-taken rate over days for the treatment group (GALILEO) and the control group, with the day starting the A/B test marked.)

Figure 7: Parts of the results on the BAT tasks. Fig. 7(a) demonstrates the averaged response curves of the SL and GALILEO models in City A; the curve is expected to be monotonically increasing according to our prior knowledge. Fig. 7(b) shows the AUUC curves, where a model with a larger area above the random line makes better predictions on randomized-controlled-trial data [61].

Test in a Real-world Dataset. We first learn a model to predict the supply of delivery clerks (measured by fulfilled order amount) given the allowances. Although the SL model can efficiently fit the offline data, the tendency of its response curve is easily incorrect. As can be seen in Fig. 7(a), with a larger budget of allowances, the predicted supply decreases under SL, which obviously goes against our prior knowledge. This is because, in the offline dataset, the corresponding supply is smaller when the allowance is larger. It is conceivable that if we learn a policy through the SL model, the optimal solution is to cancel all of the allowances, which is obviously incorrect in practice. On the other hand, the tendency of GALILEO's response is correct. In Appx. H.7, we plot all the results in 6 cities. We further collect some randomized-controlled-trial data, and the Area Under the Uplift Curve (AUUC) [6] curves in Fig. 7(b) verify that GALILEO gives a reasonable ordering of the supply predictions while the standard SL technique fails to achieve this.

5.3 Apply GALILEO to Downstream Tasks

Table 1: Results of OPE on the DOPE benchmark. We list the averaged performance on three tasks. The detailed results are in Appx. H.6. ± is the standard deviation among the tasks. We bold the best scores for each metric.

| Algorithm | Norm. value gap | Rank corr. | Regret@1 |
|---|---|---|---|
| GALILEO | **0.37 ± 0.24** | **0.44 ± 0.10** | **0.09 ± 0.02** |
| Best DICE | 0.48 ± 0.19 | 0.15 ± 0.04 | 0.42 ± 0.28 |
| VPM | 0.71 ± 0.04 | 0.29 ± 0.15 | 0.17 ± 0.11 |
| FQE (L2) | 0.54 ± 0.09 | -0.19 ± 0.10 | 0.34 ± 0.03 |
| IS | 0.67 ± 0.01 | -0.40 ± 0.15 | 0.36 ± 0.27 |
| Doubly Robust | 0.57 ± 0.07 | -0.14 ± 0.17 | 0.33 ± 0.06 |

Off-policy Evaluation (OPE). We first verify the ability of the models in the MuJoCo environments by adopting them for off-policy evaluation tasks. We use 10 unseen policies constructed by the DOPE benchmark [18] to conduct our experiments. We select three commonly used metrics: value gap, regret@1, and rank correlation, and average the results among three tasks in Tab. 1. The baselines and the corresponding results we used are the same as the ones proposed by [18]. As seen in Tab. 1, compared with all the baselines, OPE with GALILEO always reaches better performance by a large margin (at least 23%, 193%, and 47%, respectively), which verifies that GALILEO can eliminate the effect of selection bias and give correct evaluations of unseen policies.
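A hedged sketch of model-based OPE as used here: estimate each candidate policy's value by rolling it out inside the learned model, then score the estimates against the true values. The model and policy interfaces are assumptions, and the metric formulas below follow the DOPE benchmark only informally (e.g., the normalization of the value gap is an illustrative choice).

```python
import numpy as np
from scipy.stats import spearmanr

def model_value(model_step, reward_fn, policy, init_states, horizon=1000, gamma=0.995):
    """Estimate a policy's discounted return by rolling it out in the learned model."""
    x, ret = np.asarray(init_states, dtype=np.float64), 0.0
    for t in range(horizon):
        a = policy(x)
        ret += (gamma ** t) * float(np.mean(reward_fn(x, a)))
        x = model_step(x, a)                                   # batched next-state prediction
    return ret

def ope_metrics(estimated_values, true_values):
    est, true = np.asarray(estimated_values), np.asarray(true_values)
    rank_corr, _ = spearmanr(est, true)
    regret_at_1 = float(true.max() - true[est.argmax()])       # value lost by trusting the model's top pick
    value_gap = float(np.mean(np.abs(est - true)) / (true.max() - true.min() + 1e-8))
    return {"rank corr.": rank_corr, "regret@1": regret_at_1, "norm. value gap": value_gap}
```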
Table 2: Results of policy performance directly optimized through SAC [20] using the learned dynamics models and deployed in the MuJoCo environments. MAX-RETURN is the policy performance of SAC in the MuJoCo environments, and "avg. norm." is the averaged normalized return of the policies over the 9 tasks, where returns are normalized to lie between 0 and 100, with a score of 0 corresponding to the worst policy and 100 corresponding to MAX-RETURN.

| Algorithm | Hopper H=10 | Hopper H=20 | Hopper H=40 | Walker2d H=10 | Walker2d H=20 | Walker2d H=40 | HalfCheetah H=10 | HalfCheetah H=20 | HalfCheetah H=40 | avg. norm. |
|---|---|---|---|---|---|---|---|---|---|---|
| GALILEO | 13.0 ± 0.1 | 33.2 ± 0.1 | 53.5 ± 1.2 | 11.7 ± 0.2 | 29.9 ± 0.3 | 61.2 ± 3.4 | 0.7 ± 0.2 | -1.1 ± 0.2 | -14.2 ± 1.4 | 51.1 |
| OFF-SL | 4.8 ± 0.5 | 3.0 ± 0.2 | 4.6 ± 0.2 | 10.7 ± 0.2 | 20.1 ± 0.8 | 37.5 ± 6.7 | 0.4 ± 0.5 | -1.1 ± 0.6 | -13.2 ± 0.3 | 21.1 |
| IPW | 5.9 ± 0.7 | 4.1 ± 0.6 | 5.9 ± 0.2 | 4.7 ± 1.1 | 2.8 ± 3.9 | 14.5 ± 1.4 | 1.6 ± 0.2 | 0.5 ± 0.8 | -11.3 ± 0.9 | 19.7 |
| SCIGAN | 12.7 ± 0.1 | 29.2 ± 0.6 | 46.2 ± 5.2 | 8.4 ± 0.5 | 9.1 ± 1.7 | 1.0 ± 5.8 | 1.2 ± 0.3 | -0.3 ± 1.0 | -11.4 ± 0.3 | 41.8 |
| MAX-RETURN | 13.2 ± 0.0 | 33.3 ± 0.2 | 71.0 ± 0.5 | 14.9 ± 1.3 | 60.7 ± 11.1 | 221.1 ± 8.9 | 2.6 ± 0.1 | 13.3 ± 1.1 | 49.1 ± 2.3 | 100.0 |

Offline Policy Improvement. We then verify the generalization ability of the models in the MuJoCo environments by adopting them for offline model-based RL. To strictly verify the ability of the models, we abandon all tricks that suppress policy exploration and learning in risky regions, as current offline model-based RL algorithms [59] do, and just use the standard SAC algorithm [20] to fully exploit the models to search for an optimal policy. Unfortunately, we found that the compounding error is still inevitably large over the 1,000-step rollout, which is the standard horizon in MuJoCo tasks, leading all models to fail to derive a reasonable policy. To better verify the effects of the models on policy improvement, we learn and evaluate the policies with three smaller horizons: $H \in \{10, 20, 40\}$. The results are listed in Tab. 2. We first average the normalized return (see "avg. norm.") over the tasks, and we can see that the policies obtained with GALILEO score significantly higher than those of the other models (the improvements are 24% to 161%). In HalfCheetah, however, IPW works slightly better. Compared with MAX-RETURN, it can be found that all methods fail to derive reasonable policies on this task, as their policies' performances are far from the optimal policy. By further checking the trajectories, we found that all the learned policies just keep the cheetah standing in the same place or even going backward. This phenomenon is also similar to the results in MOPO [59]: in MOPO's experiments on the medium datasets, the truncated-rollout horizon used in Walker2d and Hopper for policy training is set to 5, while HalfCheetah has to be set to the minimal value, 1. These phenomena indicate that HalfCheetah may still have unknown problems, resulting in a generalization bottleneck for the models.

Online Decision-making in a Production Environment. Finally, we search for the optimal policy via model-predictive control (MPC) with a cross-entropy planner [21] based on the learned model and deploy the policy on a real-world platform. The results of the A/B test in City A are shown in Fig. 7(c). It can be seen that, after the starting day of the A/B test, the treatment group (deploying our policy) significantly improves the five-minute order-taken rate over the baseline policy (the same as the behavior policy). In summary, the policy improves the supply by 0.14 to 1.63 percentage points over the behavior policies in the 6 cities. The details of these results are in Appx. H.7.

6 Discussion and Future Work

In this work, we propose AWRM, which handles the generalization challenges of counterfactual environment model learning.
By theoretical modeling, we give a tractable solution to handle AWRM and propose GALILEO. GALILEO is verified in synthetic environments, complex robot control tasks, and a real-world platform, and shows great generalization ability on counterfactual queries. Giving correct answers to counterfactual queries is important for policy learning. We hope the work can inspire researchers to develop more powerful tools for counterfactual environment model learning. The current limitation lies in: There are several simplifications in the theoretical modeling process (further discussion is in Appx. B), which can be modeled more elaborately. Besides, experiments on Mu Jo Co indicate that these tasks are still challenging to give correct predictions on counterfactual data. These should also be further investigated in future work. Acknowledgements and Disclosure of Funding This work is supported by the National Key Research and Development Program of China (2020AAA0107200) and the National Science Foundation of China (61921006). [1] A. M. Alaa and M. van der Schaar. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 129 138, Stockholm, Sweden, 2018. [2] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131 142, 1966. [3] S. Assaad, S. Zeng, C. Tao, S. Datta, N. Mehta, R. Henao, F. Li, and L. Carin. Counterfactual representation learning with balancing weights. In The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1972 1980, Virtual Event, 2021. [4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Mach. Learn., 79(1-2):151 175, 2010. [5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137 144, British Columbia, Canada, 2006. MIT Press. [6] A. Betlei, E. Diemert, and M. Amini. Treatment targeting by AUUC maximization with generalization guarantees. Co RR, abs/2012.09897, 2020. [7] I. Bica, J. Jordon, and M. van der Schaar. Estimating the effects of continuous-valued interventions using generative adversarial networks. In Advances in Neural Information Processing Systems 33, virtual event, 2020. [8] L. Buesing, T. Weber, Y. Zwols, N. Heess, S. Racanière, A. Guez, and J. Lespiau. Woulda, coulda, shoulda: Counterfactually-guided policy search. In 7th International Conference on Learning Representations, New Orleans, LA, 2019. [9] J. Byrd and Z. C. Lipton. What is the effect of importance weighting in deep learning? In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 872 881, Long Beach, CA, 2019. [10] X. Chen, B. He, Y. Yu, Q. Li, Z. T. Qin, W. Shang, J. Ye, and C. Ma. Sim2rec: A simulator-based decision-making approach to optimize real-world long-term user engagement in sequential recommender systems. In 39th IEEE International Conference on Data Engineering, pages 3389 3402, Anaheim, CA, 2023. IEEE. [11] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song. Generative adversarial user model for reinforcement learning based recommendation system. 
In Proceedings of the 36th International Conference on Machine Learning, pages 1052 1061, Long Beach, CA, 2019. [12] X. Chen, Y. Yu, Q. Li, F. Luo, Z. T. Qin, W. Shang, and J. Ye. Offline model-based adaptable policy learning. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34, pages 8432 8443, virtual, 2021. [13] X.-H. Chen, F.-M. Luo, Y. Yu, Q. Li, Z. Qin, W. Shang, and J. Ye. Offline model-based adaptable policy learning for decision-making in out-of-support regions. IEEE Transactions on Pattern Analysis Machine Intelligence, (01):1 16, 5555. [14] X.-H. Chen, Y. Yu, Q. Li, F.-M. Luo, Z. T. Qin, S. Wenjie, and J. Ye. Offline model-based adaptable policy learning. In Advances in Neural Information Processing Systems 34, Virtual Conference, 2021. [15] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, pages 442 450, British Columbia, Canada, 2010. Curran Associates, Inc. [16] B. Eysenbach, A. Khazatsky, S. Levine, and R. Salakhutdinov. Mismatched no more: Joint model-policy optimization for model-based RL. Co RR, abs/2110.02758, 2021. [17] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: datasets for deep data-driven reinforcement learning. Co RR, abs/2004.07219, 2020. [18] J. Fu, M. Norouzi, O. Nachum, G. Tucker, Z. Wang, A. Novikov, M. Yang, M. R. Zhang, Y. Chen, A. Kumar, C. Paduraru, S. Levine, and T. Paine. Benchmarks for deep off-policy evaluation. In 9th International Conference on Learning Representations, Virtual Event, Austria, 2021. [19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672 2680, Montréal, Canada, 2014. [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft Actor-Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1856 1865, Stockholmsmässan, Sweden, 2018. [21] D. Hafner, T. P. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2555 2565, Long Beach, CA, 2019. [22] N. Hassanpour and R. Greiner. Counterfactual regression with importance sampling weights. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5880 5887, Macao, China, 2019. [23] J. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. 2001. [24] T. Hishinuma and K. Senda. Weighted model estimation for offline model-based reinforcement learning. In Advances in Neural Information Processing Systems 34, pages 17789 17800, virtual, 2021. [25] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pages 4565 4573, Barcelona, Spain, 2016. [26] G. Imbens. The role of the propensity score in estimating dose-response functions. Econometrics e Journal, 1999. [27] E. L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295 311, 2008. [28] F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. ar Xiv preprint ar Xiv:1802.08598, 2018. 
[29] Y. Jung, J. Tian, and E. Bareinboim. Learning causal effects via weighted empirical risk minimization. In Advances in Neural Information Processing Systems 33, Virtual Event, 2020. [30] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. MORe L : Model-based offline reinforcement learning. Co RR, abs/2005.05951, 2020. [31] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. Co RR, abs/2005.01643, 2020. [32] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31, pages 5361 5371, Montréal, Canada, 2018. [33] Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill. Provably good batch off-policy reinforcement learning without great exploration. In Advances in Neural Information Processing Systems 33, virtual event, 2020. [34] C. Lu, P. J. Ball, J. Parker-Holder, M. A. Osborne, and S. J. Roberts. Revisiting design choices in offline model based reinforcement learning. In The 10th International Conference on Learning Representations, Virtual Event, 2022. [35] O. Nachum, Y. Chow, B. Dai, and L. Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32, pages 2315 2325, BC, Canada, 2019. [36] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory, 56(11):5847 5861, 2010. [37] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, pages 271 279, Barcelona, Spain, 2016. [38] J. Pearl. Causality. Cambridge University Press, 2009. [39] S. Pitis, E. Creager, and A. Garg. Counterfactual data augmentation using locally factored dynamics. In Advances in Neural Information Processing Systems 33, virtual, 2020. [40] D. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88 97, 1991. [41] T. Qin, T. Wang, and Z. Zhou. Budgeted heterogeneous treatment effect estimation. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8693 8702, 2021. [42] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. Mit Press, 2008. [43] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41 55, 1983. [44] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1889 1897, Lille, France, 2015. [45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. Co RR, abs/1707.06347, 2017. [46] W. Shang, Y. Yu, Q. Li, Z. T. Qin, Y. Meng, and J. Ye. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 566 576, Anchorage, AK, 2019. [47] J. Shi, Y. Yu, Q. Da, S. Chen, and A. Zeng. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. 
In The 33th AAAI Conference on Artificial Intelligence, pages 4902 4909, Honolulu, Hawaii, 2019. [48] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of statistical planning and inference, 90(2):227 244, 2000. [49] S. A. Sontakke, A. Mehrjou, L. Itti, and B. Schölkopf. Causal curiosity: RL agents discovering self-supervised experiments for causal representation learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9848 9858, Virtual Event, 2021. [50] P. Spirtes. Introduction to causal inference. J. Mach. Learn. Res., 11:1643 1662, aug 2010. [51] R. S. Sutton and A. G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998. [52] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 16:1731 1755, 2015. [53] E. Todorov, T. Erez, and Y. Tassa. Mu Jo Co: A physics engine for model-based control. In Proceedings of the 24th IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026 5033, Vilamoura, Portugal, 2012. [54] C. Voloshin, N. Jiang, and Y. Yue. Minimax model learning. In A. Banerjee and K. Fukumizu, editors, The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1612 1620, Virtual Event, 2021. [55] T. Wang, T. Qin, and Z. Zhou. Estimating possible causal effects with latent variables via adjustment. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202, pages 36308 36335, 2023. [56] T.-Z. Wang, X.-Z. Wu, S.-J. Huang, and Z.-H. Zhou. Cost-effectively identifying causal effects when only response variable is observable. In Proceedings of the 37th International Conference on Machine Learning, pages 10060 10069, 2020. [57] S. Yang, S. Zhang, Y. Feng, and M. Zhou. A unified framework for alternating offline model training and policy learning. Co RR, abs/2210.05922, 2022. [58] J. Yoon, J. Jordon, and M. van der Schaar. GANITE: estimation of individualized treatment effects using generative adversarial nets. In 6th International Conference on Learning Representations, 2018. [59] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. MOPO: modelbased offline policy optimization. In Advances in Neural Information Processing Systems 33, virtual event, 2020. [60] R. Zhang, B. Dai, L. Li, and D. Schuurmans. Gendice: Generalized offline estimation of stationary values. In 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020. [61] W. Y. Zou, S. Du, J. Lee, and J. O. Pedersen. Heterogeneous causal learning for effectiveness optimization in user marketing. Co RR, abs/2004.09702, 2020. Table of Contents A Proof of Theoretical Results 16 A.1 Proof of Lemma A.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.2 Proof of Eq. (9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.3 Proof of Thm. A.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.4 Proof of the Tractable Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B Discussion and Limitations of the Theoretical Results 25 C Societal Impact 25 D AWRM-oracle Pseudocode 25 E Implementation 25 E.1 Details of the GALILEO Implementation . . . . . . . . . . . . . . . . . . . . 
E.2 Connection with Previous Adversarial Algorithms 28
F Additional Related Work 28
G Experiment Details 30
G.1 Settings 30
G.2 Baseline Algorithms 33
G.3 Hyper-parameters 33
G.4 Computation Resources 33
H Additional Results 34
H.1 Test in Single-step Environments 34
H.2 All of the Result Tables 35
H.3 Ablation Studies 35
H.4 Worst-Case Prediction Error 35
H.5 Detailed Results in the MuJoCo Tasks 36
H.6 Off-policy Evaluation (OPE) in the MuJoCo Tasks 37
H.7 Detailed Results in the BAT Task 42

A Proof of Theoretical Results

In the proof section, we replace the notation of $\mathbb{E}$ with an integral for brevity. We now rewrite the original objective $\tilde{L}(\rho^{\beta}_{M^*}, M)$ as:
$$\min_{M \in \mathcal{M}} \max_{\beta \in \Pi} \int_{\mathcal{X},\mathcal{A}} \rho^{\mu}_{M^*}(x, a)\,\omega(x, a) \int_{\mathcal{X}} M^*(x'|x, a)\big(-\log M(x'|x, a)\big)\, dx'\, da\, dx - \frac{\alpha}{2}\big\|\rho^{\beta}_{M^*}(\cdot, \cdot)\big\|_2^2, \qquad (7)$$
where $\omega(x, a) = \frac{\rho^{\beta}_{M^*}(x,a)}{\rho^{\mu}_{M^*}(x,a)}$ and $\|\rho^{\beta}_{M^*}(\cdot, \cdot)\|_2^2 = \int_{\mathcal{X},\mathcal{A}} \rho^{\beta}_{M^*}(x, a)^2\, da\, dx$ is the squared $\ell_2$-norm. In an MDP, given any policy $\pi$, $\rho^{\pi}_{M^*}(x, a, x') = \rho^{\pi}_{M^*}(x)\pi(a|x)M^*(x'|x, a)$, where $\rho^{\pi}_{M^*}(x)$ denotes the occupancy measure of $x$ for policy $\pi$, which can be defined [51, 25] as $\rho^{\pi}_{M^*}(x) := (1-\gamma)\mathbb{E}_{x_0 \sim \rho_0}[\sum_{t=0}^{\infty} \gamma^t \Pr^{\pi}(x_t = x|x_0, M^*)]$, where $\Pr^{\pi}[x_t = x|x_0, M^*]$ is the probability that $\pi$, starting at state $x_0$ in model $M^*$, visits $x$ at timestep $t$, and $\gamma \in [0, 1]$ is the discount factor.

(Figure 8 summarizes the modeling pipeline: starting from $\min_{M\in\mathcal{M}}\max_{\beta\in\Pi} \tilde{L}(\rho^{\beta}_{M^*}, M)$, we estimate the optimal adversarial distribution $\tilde{\rho}^{\beta^*}_{M^*}$ given $M$ (Sec. A.1); we approximate $\tilde{\rho}^{\beta^*}_{M^*}$ with a variational representation via an intermediary policy $\kappa$ and a generator function $f$, yielding an easy-to-estimate generalized representation of the distribution of the best-response policy $\tilde{\beta}^*$ (Sec. A.2); taking $\tilde{\rho}^{\beta^*}_{M^*}$ as the distribution of the best-response policy $\arg\max_{\beta\in\Pi} \tilde{L}(\rho^{\beta}_{M^*}, M)$ gives the surrogate objective $\min_{M\in\mathcal{M}} L(\tilde{\rho}^{\beta^*}_{M^*}, M)$ (Sec. A.3); finally, we obtain the tractable solution (Sec. A.4).)

Figure 8: The overall pipeline to model the tractable solution to AWRM. $f$ is a generator function defined by f-divergence [37]. $\kappa$ is an intermediary policy introduced in the estimation.

The overall pipeline to model the tractable solution to AWRM is given in Fig. 8. In the following, we summarize the modeling process based on Fig. 8. We first approximate the optimal distribution $\tilde{\rho}^{\beta^*}_{M^*}$ via Lemma A.1.

Lemma A.1. Given any $M$ in $\tilde{L}(\rho^{\beta}_{M^*}, M)$, the distribution $\tilde{\rho}^{\beta^*}_{M^*}(x, a)$ of the ideal best-response policy $\tilde{\beta}^*$ satisfies:
$$\tilde{\rho}^{\beta^*}_{M^*}(x, a) = \frac{1}{\alpha_M}\big(D_{\mathrm{KL}}(M^*(\cdot|x, a), M(\cdot|x, a)) + H_{M^*}(x, a)\big), \qquad (8)$$
where $D_{\mathrm{KL}}(M^*(\cdot|x, a), M(\cdot|x, a))$ is the Kullback-Leibler (KL) divergence between $M^*(\cdot|x, a)$ and $M(\cdot|x, a)$, $H_{M^*}(x, a)$ denotes the entropy of $M^*(\cdot|x, a)$, and $\alpha_M$ is the regularization coefficient $\alpha$ in Eq. (4), which also serves as a normalizer of Eq. (8).

Note that the ideal best-response policy $\tilde{\beta}^*$ is not the real best-response policy $\beta^*$. The distribution $\tilde{\rho}^{\beta^*}_{M^*}$ is an approximation of the real optimal adversarial distribution.
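For intuition, Lemma A.1 can be evaluated in closed form when both $M^*$ and $M$ are Gaussian. A minimal one-dimensional sketch (illustrative only; the function names are not from the paper):

```python
# Unnormalized adversarial density of Lemma A.1 for 1-D Gaussian dynamics:
# with M*(.|x,a) = N(m_star, s_star^2) and M(.|x,a) = N(m, s^2), the weight at (x, a)
# is KL(M*||M) + H(M*), both available in closed form.
import numpy as np

def adversarial_density_unnorm(m_star, s_star, m, s):
    kl = np.log(s / s_star) + (s_star ** 2 + (m_star - m) ** 2) / (2 * s ** 2) - 0.5
    entropy = 0.5 * np.log(2 * np.pi * np.e * s_star ** 2)
    return kl + entropy   # normalize over (x, a) to obtain rho^{beta~*}_{M*}(x, a)
```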
We give a discussion of the rationality of the ideal best-response policy β as a replacement of the real best-response policy β in Remark A.4. Intuitively, ρ β M has larger densities on the data where the divergence between the approximation model and the real model (i.e., DKL(M ( |x, a), M( |x, a))) is larger or the stochasticity of the real model (i.e., HM ) is larger. However, the integral process of DKL in Eq. (8) is intractable in the offline setting as it explicitly requires the conditional probability function of M . Our solution to solve the problem is utilizing the offline dataset Dreal as the empirical joint distribution ρµ M (x, a, x ) and adopting practical techniques for distance estimation on two joint distributions, like GAN [19, 37], to approximate Eq. (8). To adopt that solution, we should first transform Eq. (8) into a form under joint distributions. Without loss of generality, we introduce an intermediary policy κ, of which µ can be regarded as a specific instance. Then we have M(x |x, a) = ρκ M(x, a, x )/ρκ M(x, a) for any M if ρκ M(x, a) > 0. Assuming x X, a A, ρκ M (x, a) > 0 if ρ β M (x, a) > 0, which will hold when κ overlaps with µ, then Eq. (8) can transform to: X ρκ M (x, a, x ) log ρκ M (x, a, x ) ρκ M(x, a, x ) dx ρκ M (x, a) log ρκ M (x, a) ρκ M(x, a) HM (x, a) ! where α0(x, a) = αMρκ M (x, a). We notice that the form ρκ M log ρκ M ρκ M is the integrated function in reverse KL divergence, which is an instance of f function in f-divergence [2]. Replacing that form with f function, we obtain a generalized representation of ρ β M := 1 α0(x, a) X ρκ M (x, a, x )f ρκ M(x, a, x ) ρκ M (x, a, x ) dx ρκ M (x, a) f ρκ M(x, a) ρκ M (x, a) HM (x, a) ! where f : R+ R is a convex and lower semi-continuous (l.s.c.) function. ρ β M gives a generalized representation of the optimal adversarial distribution to maximize the error of the model. Based on Eq. (9), we have a surrogate objective of AWRM which can avoid querying M to construct ρβ Theorem A.2. Let ρ β M as the data distribution of the best-response policy β in Eq. (4) under model Mθ parameterized by θ, then we can find the optimal θ of minθ maxβ Π L(ρβ M , Mθ) (Eq. (4)) via iteratively optimizing the objective θt+1 = minθ L( ρ β M , Mθ), where ρ β M is approximated via the last-iteration model Mθt. Based on Corollary A.9, we derive an upper bound objective for minθ L( ρ β θt+1 = min θ Eρκ M " 1 α0(x, a) log Mθ(x |x, a) ρκ Mθt (x, a, x ) ρκ M (x, a, x ) ρκ Mθt (x, a) ρκ M (x, a) + HM (x, a) | {z } W (x,a,x ) where Eρκ M [ ] denotes Ex,a,x ρκ M [ ], f is a l.s.c function satisfying f (x) 0, x X, and α0(x, a) = αMθtρκ M (x, a). Thm. A.2 approximately achieve AWRM by using κ and a pseudo-reweighting module W. W assigns learning propensities for data points with larger differences between distributions ρκ Mθt and ρκ M . By adjusting the weights, the learning process will exploit subtle errors in any data point, whatever how many proportions it contributes, to correct potential generalization errors on counterfactual data. Remark A.3. In practice, we need to use real-world data to construct the distribution ρκ M . In the offline model-learning setting, we only have a real-world dataset Dreal collected by the behavior policy µ, which is the empirical distribution of ρµ M . Let κ = µ, we have θt+1 = min θ Eρµ M " 1 α0(x, a) log Mθ(x |x, a) ρµ Mθt (x, a, x ) ρµ M (x, a, x ) ρµ Mθt (x, a) ρµ M (x, a) + HM (x, a) | {z } W (x,a,x ) which is Eq. (5) in the main body. In Thm. 
A.2, the terms f(ρκ Mθt(x, a, x )/ρκ M (x, a, x )) f(ρκ Mθt(x, a)/ρκ M (x, a)) are still intractable. Thanks to previous successful practices in GAN [19] and GAIL [25], we achieve the objective via a generator-discriminator-paradigm objective through similar derivation. We show the results as follows and leave the complete derivation in Appx. A.4. In particular, by introducing two discriminators Dφ 0(x, a, x ) and Dφ 0(x, a), letting κ = µ, we can optimize the surrogate objective Eq. (5) via: θt+1 = max θ Aφ 0,φ 1(x, a, x ) log Mθ(x |x, a) + Eρµ M (HM (x, a) Aφ 0,φ 1(x, a, x )) log Mθ(x |x, a) s.t. φ 0 = arg max φ0 Eρµ M log Dφ0(x, a, x ) + Eρˆ µ Mθt log(1 Dφ0(x, a, x )) φ 1 = arg max φ1 Eρµ M [log Dφ1(x, a)] + Eρˆ µ Mθt [log(1 Dφ1(x, a))] , where Eρµ M [ ] is a simplification of Ex,a,x ρµ M [ ], Aφ 0,φ 1(x, a, x ) = log Dφ 0(x, a, x ) log Dφ 1(x, a), and φ0 and φ1 are the parameters of Dφ0 and Dφ1 respectively. We learn a policy ˆµ µ via imitation learning based on the offline dataset Dreal [40, 25]. Note that in the process, we ignore the term α0(x, a) for simplifying the objective. The discussion on the impacts of removing α0(x, a) is left in App. B. A.1 Proof of Lemma A.1 Proof. Given a transition function M of an MDP, the distribution of the best-response policy β satisfies: M = arg max ρβ M X,A ρµ M (x, a)ω(x, a) Z X M (x |x, a) ( log M(x |x, a)) dx dadx α 2 ρβ M ( , ) 2 2 = arg max ρβ M X,A ρβ M (x, a) Z X M (x |x, a) ( log M(x |x, a)) dx | {z } g(x,a) 2 ρβ M ( , ) 2 2 = arg max ρβ M X,A ρβ M (x, a)g(x, a)dadx ρβ M ( , ) 2 2 = arg max ρβ M X,A ρβ M (x, a)g(x, a)dadx ρβ M ( , ) 2 2 g( , ) 2 2 α2 = arg max ρβ M 2 Z X,A ρβ M (x, a)g(x, a) α dadx + ρβ M ( , ) 2 2 + g( , ) 2 2 α2 = arg max ρβ M ρβ M ( , ) g( , ) We know that the occupancy measure ρβ M is a density function with a constraint R A ρβ M (x, a)dadx = 1. Assuming the occupancy measure ρβ M has an upper bound c, that is 0 ρβ M (x, a) c, a A, x X, constructing a regularization coefficient αM = R A(DKL(M ( |x, a), M( |x, a)) + HM (x, a))dxda as a constant value given any M, then we have M (x, a) = g(x, a) X M (x |x, a) log M (x |x,a) M(x |x,a) dx R X M (x |x, a) log M (x |x, a)dx = DKL(M ( |x, a), M( |x, a)) + HM (x, a) DKL(M ( |x, a), M( |x, a)) + HM (x, a) , which is the optimal density function of Eq. (7) with α = αM. Note that in some particular M , we still cannot construct a β that can generate an occupancy specified by g(x, a)/αM for any M. We can only claim the distribution of the ideal best-response policy β satisfies: M (x, a) = 1 αM (DKL(M ( |x, a), M( |x, a)) + HM (x, a)), (10) where αM is a normalizer that αM = R A(DKL(M ( |x, a), M( |x, a)) + HM (x, a))dxda. We give a discussion of the rationality of the ideal best-response policy β as a replacement of the real best-response policy β in Remark A.4. Remark A.4. The optimal solution Eq. (10) relies on g(x, a). In some particular M , it is intractable to derive a β that can generate an occupancy specified by g(x, a)/αM. Consider the following case: a state x1 in M might be harder to reach than another state x2, e.g., M (x1|x, a) < M (x2|x, a), x X, a A, then it is impossible to find a β that the occupancy satisfies ρβ M (x1, a) > ρβ M (x2, a). In this case, Eq. (10) can be a sub-optimal solution. Since this work focuses on task-agnostic solution derivation while the solution to the above problem should rely on the specific description of M , we leave it as future work. However, we point out that Eq. 
(10) is a reasonable re-weighting term even as a sub-optimum: ρ β M gives larger densities on the data where the distribution distance between the approximation model and the real model (i.e., DKL(M , M)) is larger or the stochasticity of the real model (i.e., HM ) is larger. A.2 Proof of Eq. (9) The integral process of DKL in Eq. (8) is intractable in the offline setting as it explicitly requires the conditional probability function of M . Our motivation for the tractable solution is utilizing the offline dataset Dreal as the empirical joint distribution ρµ M (x, a, x ) and adopting practical techniques for distance estimation on two joint distributions, like GAN [19, 37], to approximate Eq. (8). To adopt that solution, we should first transform Eq. (8) into a form under joint distributions. Without loss of generality, we introduce an intermediary policy κ, of which µ can be regarded as a specific instance. Then we have M(x |x, a) = ρκ M(x, a, x )/ρκ M(x, a) for any M if ρκ M(x, a) > 0. Assuming x X, a A, ρκ M (x, a) > 0 if ρ β M (x, a) > 0, which will hold when κ overlaps with µ, then Eq. (8) can transform to: M (x, a) =DKL(M ( |x, a), M( |x, a)) + HM (x, a) X M (x |x, a) log M (x |x, a) M(x |x, a) log M (x |x, a) dx = 1 αMρκ M (x, a) X ρκ M (x, a)M (x |x, a) log M (x |x, a) M(x |x, a) log M (x |x, a) dx = 1 αMρκ M (x, a) X ρκ M (x, a, x ) log ρκ M (x, a, x ) ρκ M(x, a, x ) + log ρκ M(x, a) ρκ M (x, a) log M (x |x, a) dx = 1 αMρκ M (x, a) X ρκ M (x, a, x ) log ρκ M (x, a, x ) ρκ M(x, a, x ) dx ρκ M (x, a) log ρκ M (x, a) ρκ M(x, a) X M (x |x, a)dx ρκ M (x, a) Z X M (x |x, a) log M (x |x, a)dx ! = 1 α0(x, a) X ρκ M (x, a, x ) log ρκ M (x, a, x ) ρκ M(x, a, x ) dx ρκ M (x, a) log ρκ M (x, a) ρκ M(x, a) + ρκ M (x, a)HM (x, a) where α0(x, a) = αMρκ M (x, a). Definition A.5 (f-divergence). Given two distributions P and Q, two absolutely continuous density functions p and q with respect to a base measure dx defined on the domain X, we define the f-divergence [37], Df(P Q) = Z X q(x)f p(x) where the generator function f : R+ R is a convex, lower-semicontinuous function. We notice that the terms ρκ M (x, a, x ) log ρκ M (x,a,x ) ρκ M(x,a,x ) and ρκ M (x, a) log ρκ M (x,a) ρκ M(x,a) are the integrated functions in reverse KL divergence, which is an instance of f function in f-divergence (See Reverse- KL divergence of Tab.1 in [37] for more details). Replacing that form q log q p with qf( p q ), we obtain a generalized representation of ρ β M := 1 α0(x, a) X ρκ M (x, a, x )f ρκ M(x, a, x ) ρκ M (x, a, x ) dx ρκ M (x, a) f ρκ M(x, a) ρκ M (x, a) HM (x, a) ! A.3 Proof of Thm. A.2 We first introduce several useful lemmas for the proof. Lemma A.6. Rearrangement inequality The rearrangement inequality states that, for two sequences a1 a2 . . . an and b1 b2 . . . bn, the inequalities a1b1 + a2b2 + + anbn a1bπ(1) + a2bπ(2) + + anbπ(n) a1bn + a2bn 1 + + anb1 hold, where π(1), π(2), . . . , π(n) is any permutation of 1, 2, . . . , n. Lemma A.7. For two sequences a1 a2 . . . an and b1 b2 . . . bn, the inequalities Proof. By rearrangement inequality, we have i=1 aibi a1b1 + a2b2 + + anbn i=1 aibi a1b2 + a2b3 + + anb1 i=1 aibi a1b3 + a2b4 + + anb2 i=1 aibi a1bn + a2b1 + + anbn 1 Then we have Now we extend Lemma A.7 into the continuous integral scenario: Lemma A.8. Given X R, for two functions f : X R and g : X R that f(x) f(y) if and only if g(x) g(y), x, y X, the inequality Z X p(x)f(x)g(x)dx Z X p(x)f(x)dx Z X p(x)g(x)dx holds, where p : X R and p(x) > 0, x X and R X p(x)dx = 1. Proof. 
Since (f(x) f(y))(g(x) g(y)) 0, x, y X, we have y X p(x)p(y)(f(x) f(y))(g(x) g(y))dydx 0 Z y X p(x)p(y)f(x)g(x) + p(x)p(y)f(y)g(y) p(x)p(y)f(x)g(y) p(x)p(y)f(y)g(x)dydx 0 Z y X p(x)p(y)f(x)g(x) + p(x)p(y)f(y)g(y)dydx Z y X p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)dydx y X p(x)p(y)f(x)g(x)dy + Z y X p(x)p(y)f(y)g(y)dy dx Z y X p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)dydx p(x)f(x)g(x) + Z y X p(x)p(y)f(y)g(y)dy dx Z y X p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)dydx Z x X p(x)f(x)g(x)dx + Z y X p(x)p(y)f(y)g(y)dydx Z y X p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)dydx Z x X p(x)f(x)g(x)dx + Z y X p(y)f(y)g(y)dy Z y X p(x)p(y)f(x)g(y) + p(x)p(y)f(y)g(x)dydx x X p(x)f(x)g(x)dx 2 Z x X p(x)p(y)f(x)g(y)dydx x X p(x)f(x)g(x)dx 2 Z x X p(x)f(x)dx Z x X p(x)g(x)dx Z x X p(x)f(x)g(x)dx Z x X p(x)f(x)dx Z x X p(x)g(x)dx Corollary A.9. Let g( p(x) q(x)) = log p(x) q(x) where p(x) > 0, x X and q(x) > 0, x X, for υ > 0, the inequality Z X q(x)f(υ p(x) q(x))g(p(x) X q(x)f(υ p(x) X q(x)g(p(x) holds if f (x) 0, x X. It is not always satisfied for f functions of f-divergence. We list a comparison of f on that condition in Tab. 3. Proof. g (x) = log x = 1 x < 0, x X. Suppose f (x) 0, x X, we have f(x) f(y) if and only if g(x) g(y), x, y X holds. Thus f(υ p(x) q(x)) f(υ p(y) q(y)) if and only if g( p(x) q(y)), x, y X holds for all υ > 0. By defining F(x) = f(υ p(x) q(x))) and G(x) = g( p(x) q(x)) and using Lemma A.8, we have: Z X q(x)F(x)G(x)dx Z X q(x)F(x)dx Z X q(x)G(x)dx. Then we know Z X q(x)f(υ p(x) q(x))g(p(x) X q(x)f(υ p(x) X q(x)g(p(x) Now, we prove Thm. A.2. For better readability, we first rewrite Thm. A.2 as follows: Theorem A.10. Let ρ β M as the data distribution of the best-response policy β in Eq. (4) under model Mθ parameterized by θ, then we can find the optimal θ of minθ maxβ Π L(ρβ M , Mθ) (Eq. (4)) via iteratively optimizing the objective θt+1 = minθ L( ρ β M , Mθ), where ρ β M is approximated via the last-iteration model Mθt. Based on Corollary A.9, we have an upper bound objective for Table 3: Properties of f (x) 0, x X for f-divergences. Name Generator function f(x) If f (x) 0, x X Kullback-Leibler x log x False Reverse KL log x True Pearson χ2 (x 1)2 False Squared Hellinger ( x 1)2 False Jensen-Shannon (x + 1) log 1+x 2 + x log x False GAN x log x (x + 1) log(x + 1) True minθ L( ρ β M , Mθ) and derive the following objective θt+1 = arg max θ Eρκ M " 1 α0(x, a) log Mθ(x |x, a) ρκ Mθt(x, a, x ) ρκ M (x, a, x ) ρκ Mθt(x, a) ρκ M (x, a) + HM (x, a) | {z } W (x,a,x ) where α0(x, a) = αMθtρκ M (x, a), Eρκ M [ ] denotes Ex,a,x ρκ M [ ], f is the generator function in f-divergence which satisfies f (x) 0, x X, and θ is the parameters of M. Mθt denotes a probability function with the same parameters as the learned model (i.e., θ = θ) but the parameter is fixed and only used for sampling. Proof. Let ρ β M as the data distribution of the best-response policy β in Eq. (4) under model Mθ parameterized by θ, then we can find the optimal θt+1 of minθ maxβ Π L(ρβ M , Mθ) (Eq. (4)) via iteratively optimizing the objective θt+1 = minθ L( ρ β M , Mθ), where ρ β M is approximated via the last-iteration model Mθt: θt+1 = min θ X M (x |x, a) log Mθ(x |x, a) dx dadx (15) X ρκ M (x, a, x )f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) X M (x |x, a)( log Mθ(x |x, a))dx ρκ M (x, a) ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a)( log Mθ(x |x, a))dx ! X ρκ M (x, a, x )f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) X M (x |x, a)( log Mθ(x |x, a) M (x |x, a))dx + HM (x, a) ρκ M (x, a) ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a)( log Mθ(x |x, a))dx ! 
ρκ M (x, a) Z X M (x |x, a)f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) ( log Mθ(x |x, a) M (x |x, a))dx | {z } based on Corollary A.9 ρκ M (x, a) ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a)( log Mθ(x |x, a))dx ! ρκ M (x, a) Z M (x |x, a)f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) ( log Mθ(x |x, a)) ρκ M (x, a) ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a)( log Mθ(x |x, a))dx ! 1 α0(x, a)ρκ M (x, a, x ) log Mθ(x |x, a) ρκ Mθt (x, a, x ) ρκ M (x, a, x ) ρκ Mθt (x, a) ρκ M (x, a) + HM (x, a) where Mθt is introduced to approximate the term ρ β M and fixed when optimizing θ. In Eq. (15), ρβ M ( , ) 2 2 for Eq. (7) is eliminated as it does not contribute to the gradient of θ. Assume f (x) 0, x X, let υ(x, a) := ρκ Mθt (x,a) ρκ M (x,a) > 0, p(x |x, a) = Mθ(x |x, a), and q(x |x, a) = M (x |x, a), the first inequality can be derived by adopting Corollary A.9 and eliminating the first HM since it does not contribute to the gradient of θ. A.4 Proof of the Tractable Solution Now we are ready to prove the tractable solution: Proof. The core challenge is that the term f( ρκ Mθt (x,a,x ) ρκ M (x,a,x ) ) f( ρκ Mθt (x,a) ρκ M (x,a) ) is still intractable. In the following, we give a tractable solution to Thm. A.2. First, we resort to the first-order approximation. Given some u (1 ξ, 1 + ξ), ξ > 0, we have f(u) f(1) + f (u)(u 1), (17) where f is the first-order derivative of f. By Taylor s formula and the fact that f (u) of the generator function f is bounded in (1 ξ, 1 + ξ), the approximation error is no more than O(ξ2). Substituting u with p(x) q(x) in Eq. (17), the pattern f( p(x) q(x)) in Eq. (16) can be converted to p(x) q(x)f ( p(x) q(x)) f ( p(x) q(x)) + f(1), then we have: θt+1 = arg max θ 1 α0(x, a) ρκ M (x, a) Z X M (x |x, a)f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) log Mθ(x |x, a)dx ρκ M (x, a)f ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a) log Mθ(x |x, a)dx + ρκ M (x, a)HM (x, a) Z X M (x |x, a) log Mθ(x |x, a)dx ! ρκ Mθt (x, a) Z X Mθt(x |x, a)f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) log Mθ(x |x, a)dx ρκ M (x, a) Z X M (x |x, a) f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) log Mθ(x |x, a)dx ρκ Mθt (x, a)f ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a) log Mθ(x |x, a)dx + ρκ M (x, a) f ρκ Mθt (x, a) ρκ M (x, a) X M (x |x, a) log Mθ(x |x, a)dx + ρκ M (x, a)HM (x, a) Z X M (x |x, a) log Mθ(x |x, a)dx ! = arg max θ 1 α0(x, a)ρκ Mθt (x, a, x) f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) f ρκ Mθt (x, a) ρκ M (x, a) log Mθ(x |x, a)dx dadx+ 1 α0(x, a)ρκ M (x, a, x ) f ρκ Mθt (x, a) ρκ M (x, a) f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) + HM (x, a) log Mθ(x |x, a)dx dadx. Note that the part ρκ M (x, a) in ρκ M (x, a, x ) can be canceled because of α0(x, a) = αMθtρκ M (x, a), but we choose to keep it and ignore α0(x, a). The benefit is that we can estimate ρκ M (x, a, x ) from an empirical data distribution through data collected by κ in M directly, rather than from a uniform distribution which is harder to be generated. Although keeping ρκ M (x, a) incurs extra bias in theory, the results in our experiments show that it has not made significant negative effects in practice. We leave this part of modeling in future work. In particular, by ignoring α0(x, a), we have: θt+1 = arg max θ X,A,X ρκ Mθt (x, a, x) f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) f ρκ Mθt (x, a) ρκ M (x, a) log Mθ(x |x, a)dx dadx+ X,A,X ρκ M (x, a, x ) f ρκ Mθt (x, a) ρκ M (x, a) f ρκ Mθt (x, a, x ) ρκ M (x, a, x ) + HM (x, a) log Mθ(x |x, a)dx dadx. We can estimate f ρκ Mθt (x,a) and f ρκ Mθt (x,a,x ) ρκ M (x,a,x ) through Lemma A.11. Lemma A.11 (f ( p q ) estimation [36]). 
Given a function Tφ : X R parameterized by φ Φ, if f is convex and lower semi-continuous, by finding the maximum point of φ in the following objective: φ = arg max φ Ex p(x) [Tφ(x)] Ex q(x) [f (Tφ(x))] , we have f ( p(x) q(x)) = Tφ (x). f is Fenchel conjugate of f [23]. In particular, φ 0 = arg max φ0 Ex,a,x ρκ M [Tφ0(x, a, x )] Ex,a,x ρκ Mθt [f (Tφ0(x, a, x ))] φ 1 = arg max φ1 Ex,a ρκ M [Tφ1(x, a)] Ex,a ρκ Mθt [f (Tφ1(x, a))] , then we have f ρκ Mθt (x,a,x ) ρκ M (x,a,x ) Tφ 0(x, a, x ) and f ρκ Mθt (x,a) Tφ 1(x, a). Given φ 0 and φ 1, let Aφ 0,φ 1(x, a, x ) = Tφ 0(x, a, x ) Tφ 1(x, a), then we can optimize θ via: θt+1 = arg max θ X,A,X ρκ Mθt(x, a, x) Tφ 0(x, a, x ) Tφ 1(x, a) log Mθ(x |x, a)dx dadx+ Z X,A,X ρκ M (x, a, x ) Tφ 1(x, a) Tφ 0(x, a, x ) + HM (x, a) log Mθ(x |x, a)dx dadx = arg max θ X,A,X ρκ Mθt(x, a, x)Aφ 0,φ 1(x, a, x ) log Mθ(x |x, a)dx dadx+ Z X,A,X ρκ M (x, a, x )( Aφ 0,φ 1(x, a, x ) + HM (x, a)) log Mθ(x |x, a)dx dadx. Based on the specific f-divergence, we can represent T and f (T) with a discriminator Dφ. It can be verified that f(u) = u log u (u + 1) log(u + 1), Tφ(u) = log Dφ(u), and f (Tφ(u)) = log(1 Dφ(u)) proposed in [37] satisfies the condition f (x) 0, x X (see Tab. 3). We select the former in the implementation and convert the tractable solution to: θt+1 = arg max θ Eρκ Mθt Aφ 0,φ 1(x, a, x ) log Mθ(x |x, a) + Eρκ M (HM (x, a) Aφ 0,φ 1(x, a, x )) log Mθ(x |x, a) s.t. φ 0 = arg max φ0 Eρκ M log Dφ0(x, a, x ) + Eρκ Mθt log(1 Dφ0(x, a, x )) φ 1 = arg max φ1 Eρκ M [log Dφ1(x, a)] + Eρκ Mθt [log(1 Dφ1(x, a))] , where Aφ 0,φ 1(x, a, x ) = log Dφ 0(x, a, x ) log Dφ 1(x, a), Eρκ Mθt [ ] is a simplification of Ex,a,x ρκ Mθt [ ]. Remark A.12. In practice, we need to use the real-world data to construct the distribution ρκ M and the generative data to construct ρκ Mθt. In the offline model-learning setting, we only have a real-world dataset Dreal collected by the behavior policy µ. We can learn a policy ˆµ µ via imitation learning based on Dreal [40, 25] and let ˆµ be the policy κ. Then we can regard Dreal as the empirical data distribution of ρκ M and the trajectories collected by ˆµ in the model Mθt as the empirical data distribution of ρκ Mθt. Based on the above specializations, we have: θt+1 = max θ Aφ 0,φ 1(x, a, x ) log Mθ(x |x, a) + Eρµ M (HM (x, a) Aφ 0,φ 1(x, a, x )) log Mθ(x |x, a) s.t. φ 0 = arg max φ0 Eρµ M log Dφ0(x, a, x ) + Eρˆ µ Mθt log(1 Dφ0(x, a, x )) φ 1 = arg max φ1 Eρµ M [log Dφ1(x, a)] + Eρˆ µ Mθt [log(1 Dφ1(x, a))] , which is Eq. (6) in the main body. B Discussion and Limitations of the Theoretical Results We summarize the limitations of current theoretical results and future work as follows: 1. As discussed in Remark A.4, the solution Eq. (10) relies on ρβ M (x, a) [0, c], a A, x X. In some particular M , it is intractable to derive a β that can generate an occupancy specified by g(x, a)/αM. If more knowledge of M or β is provided or some mild assumptions can be made on the properties of M or β , we may model ρ in a more sophisticated approach to alleviating the above problem. 2. In the tractable solution derivation, we ignore the term α0(x, a) = αMθtρκ M (x, a) (See Eq. (19)). The benefit is that ρκ M (x, a, x ) in the tractable solution can be estimated through offline datasets directly. Although the results in our experiments show that it does not produce significant negative effects in these tasks, ignoring ρκ M (x, a) indeed incurs extra bias in theory. 
In future work, techniques for estimating ρκ M (x, a) [33] can be incorporated to correct the bias. On the other hand, αMθt is also ignored in the process. αMθt can be regarded as a global rescaling term of the final objective Eq. (19). Intuitively, it constructs an adaptive learning rate for Eq. (19), which increases the step size when the model is better fitted and decreases the step size otherwise. It can be considered to further improve the learning process in future work, e.g., cooperating with empirical risk minimization by balancing the weights of the two objectives through αMθt. C Societal Impact This work studies a method toward counterfactual environment model learning. Reconstructing an accurate environment of the real world will promote the wide adoption of decision-making policy optimization methods in real life, enhancing our daily experience. We are aware that decision-making policy in some domains like recommendation systems that interact with customers may have risks of causing price discrimination and misleading customers if inappropriately used. A promising way to reduce the risk is to introduce fairness into policy optimization and rules to constrain the actions (Also see our policy design in Sec. G.1.3). We are involved in and advocating research in such directions. We believe that business organizations would like to embrace fair systems that can ultimately bring long-term financial benefits by providing a better user experience. D AWRM-oracle Pseudocode We list the pseudocode of AWRM-oracle in Alg. 2. E Implementation E.1 Details of the GALILEO Implementation The approximation of Eq. (17) holds only when p(x)/q(x) is close to 1, which might not be satisfied. To handle the problem, we inject a standard supervised learning loss arg max θ Eρκ M [log Mθ(x |x, a)] (21) Algorithm 2 AWRM with Oracle Counterfactual Datasets Input: Φ: policy space; N: total iterations Process: 1: Generate counterfactual datasets {Dπϕ} for all adversarial policies πϕ, ϕ Φ 2: Initialize an environment model Mθ 3: for i = 1:N do 4: Select Dπϕ with worst prediction errors through Mθ from {Dπϕ} 5: Optimize Mθ with standard supervised learning based on Dπϕ 6: end for to replace the second term of the above objective when the output probability of D is far away from 0.5 (f (1) = log 0.5). In the offline model-learning setting, we only have a real-world dataset D collected by the behavior policy µ. We learn a policy ˆµ µ via behavior cloning with D [40, 25] and let ˆµ be the policy κ. We regard D as the empirical data distribution of ρκ M and the trajectories collected by ˆµ in the model Mθt as the empirical data distribution of ρκ Mθt. But the assumption x X, a A, µ(a|x) > 0 might not be satisfied. In behavior cloning, we model ˆµ with a Gaussian distribution and constrain the lower bound of the variance with a small value ϵµ > 0 to keep the assumption holding. Besides, we add small Gaussian noises u N(0, ϵD) to the inputs of Dφ to handle the mismatch between ρµ M and ρˆµ M due to ϵµ. In particular, for φ0 and φ1 learning, we have: φ 0 = arg max φ0 Eρκ M ,u log Dφ0(x + ux, a + ua, x + ux ) + Eρκ Mθt ,u log(1 Dφ0(x + ux, a + ua, x + ux )) φ 1 = arg max φ1 Eρκ M ,u [log Dφ1(x + ux, a + ua)] + Eρκ Mθt ,u [log(1 Dφ1(x + ux, a + ua))] , where Eρκ Mθt ,u[ ] is a simplification of Ex,a,x ρκ Mθt ,u N(0,ϵD)[ ] and u = [ux, ua, ux ]. On the other hand, we notice that the first term in Eq. 
(20) is similar to the objective of GAIL [25] by regarding Mθ as the policy to learn and κ as the environment to generate data. For better capability in sequential environment model learning, here we introduce some practical tricks inspired by GAIL for model learning [47, 46]: we introduce an MDP for κ and Mθ, where the reward is defined by the discriminator D, i.e., r(x, a, x ) = log D(x, a, x ). Mθ is learned to maximize the cumulative rewards. With advanced policy gradient methods [44, 45], the objective is converted to maxθ Aφ 0,φ 1(x, a, x ) log Mθ(x, a, x ) , where A = Qκ Mθt V κ Mθt, Qκ M θ(x, a, x ) = E [P t=0 γtr(xt, at, xt+1) | (xt, at, xt+1) = (x, a, x ), κ, Mθt], and V κ M θ(x, a) = EM θ h Qκ M θ(x, a, x ) i . A in Eq. (20) can also be constructed similarly. Although it looks unnecessary in theory since the one-step optimal model Mθ is the global optimal model in this setting, the technique is helpful in practice as it makes A more sensitive to the compounding effect of one-step prediction errors: we would consider the cumulative effects of prediction errors induced by multi-step transitions in environments. In particular, to consider the cumulative effects of prediction errors induced by multi-step of transitions in environments, we overwrite function Aφ 0,φ 1 as Aφ 0,φ 1 = Qκ Mθt V κ Mθt , where Qκ Mθt(x, a, x ) = E P t γt log Dφ 0(xt, at, xt+1)|(xt, at, xt+1) = (x, a, x ), κ, Mθt and V κ Mθt(x, a) = E P t γt log Dφ 1(xt, at)|(xt, at) = (x, a), κ, Mθt . To give an algorithm for singlestep environment model learning, we can just set γ in Q and V to 0. Algorithm 3 GALILEO pseudocode Input: Dreal: offline dataset sampled from ρµ M where µ is the behavior policy; N: total iterations; Process: 1: Approximate a behavior policy ˆµ via behavior cloning 2: Initialize an environment model Mθ1 3: for t = 1 : N do 4: Use ˆµ to generate a dataset Dgen with the model Mθt 5: Update the discriminators Dφ0 and Dφ1 via Eq. (25) and Eq. (26) respectively, where ρˆµ Mθt is estimated by Dgen and ρµ M is estimated by Dreal 6: Update Q and V via Eq. (23) and Eq. (24) through Dgen, Dφ0, and Dφ1 7: Update the model Mθt via the first term of Eq. (22), which is implemented with a standard policy gradient method like TRPO [44] or PPO [45]. Record the policy gradient gpg 8: if p0 < EDgen [Dφ0(xt, at, xt+1)] < p1 then 9: Compute the gradient of Mθt via the second term of Eq. (22) and record it as gsl 10: else 11: Compute the gradient of Mθt via Eq. (21) and record it as gsl 12: end if 13: Rescale gsl via Eq. (27) 14: Update the model Mθt via the gradient gsl and obtain Mθt+1 15: end for offline dataset behavior policy dynamics model pretrain: approximated the behavior policy (line 1) (1) generate a dataset (line 4) fake dataset discriminators and value functions 𝐷("(𝑠, 𝑎, 𝑠 ), 𝐷(# 𝑠, 𝑎, (2) update the discriminators and the value functions (line 5 to 6) (3) approximate the AWRM through Eq. 23 to update the dynamics model (line 7 to 14) neural networks fake dataset offline dataset Figure 9: Illustration of the workflow of the GALILEO algorithm. By adopting the above implementation techniques, we convert the objective into the following formulation θt+1 = arg max θ Eρκ Mθt Aφ 0,φ 1(x, a, x ) log Mθ(x |x, a) + Eρκ M (HM (x, a) Aφ 0,φ 1(x, a, x )) log Mθ(x |x, a) s.t. 
Qκ Mθt (x, a, x ) = E t γt log Dφ 0(xt, at, xt+1)|(xt, at, xt+1) = (x, a, x ), κ, Mθt V κ Mθt (x, a) = E t γt log Dφ 1(xt, at)|(xt, at) = (x, a), κ, Mθt φ 0 = arg max φ0 Eρκ M ,u log Dφ0(x + ux, a + ua, x + ux ) + Eρκ Mθt ,u log(1 Dφ0(x + ux, a + ua, x + ux )) φ 1 = arg max φ1 Eρκ M ,u [log Dφ1(x + ux, a + ua)] + Eρκ Mθt ,u [log(1 Dφ1(x + ux, a + ua))] , where Aφ 0,φ 1(x, a, x ) = Qκ Mθ(x, a, x ) V κ Mθ(x, a). In practice, GALILEO optimizes the first term of Eq. (22) with conservative policy gradient algorithms (e.g., PPO [45] or TRPO [44]) to avoid unreliable gradients for model improvements. Eq. (25) and Eq. (26) are optimized with supervised learning. The second term of Eq. (22) is optimized with supervised learning with a re-weighting term Aφ 0,φ 1 + HM . Since HM is unknown, we use HMθ to estimate it. When the mean output probability of a batch of data is larger than 0.6 or small than 0.4, we replace the second term of Eq. (22) with a standard supervised learning in Eq. (21). Besides, unreliable gradients also exist in the process of optimizing the second term of Eq. (22). In our implementation, we use the scale of policy gradients to constrain the gradients of the second term of Eq. (22). In particular, we first compute the l2-norm of the gradient of the first term of Eq. (22) via conservative policy gradient algorithms, named ||gpg||2. Then we compute the l2-norm of the gradient of the second term of Eq. (22), name ||gsl||2. Finally, we rescale the gradients of the second term gsl by gsl gsl ||gpg||2 max{||gpg||2, ||gsl||2}. (27) For each iteration, Eq. (22), Eq. (25), and Eq. (26) are trained with certain steps (See Tab. 6) following the same framework as GAIL. Based on the above techniques, we summarize the pseudocode of GALILEO in Alg. 3, where p0 and p1 are set to 0.4 and 0.6 in all of our experiments. The overall architecture is shown in Fig. 9. E.2 Connection with Previous Adversarial Algorithms Standard GAN [19] can be regarded as a partial implementation including the first term of Eq. (22) and Eq. (25) by degrading them into the single-step scenario. In the context of GALILEO, the objective of GAN is θt+1 = arg max θ Eρκ Mθt [Aφ (x, a, x ) log Mθ(x |x, a)] s.t. φ = arg max φ Eρκ M [log Dφ(x, a, x )] + Eρκ Mθt [log(1 Dφ(x, a, x ))] , where Aφ (x, a, x ) = log Dφ (x, a, x ). In the single-step scenario, ρκ Mθt(x, a, x ) = ρ0(x)κ(a|x)Mθt(x |a, x). The term Eρκ Mθt [Aφ (x, a, x ) log Mθ(x |x, a)] can convert to Eρκ Mθ [log Dφ (x, a, x )] by replacing the gradient of Mθt(x |x, a) θ log Mθ(x |x, a) with θMθ(x |x, a) [51]. Previous algorithms like GANITE [58] and SCIGAN [7] can be regarded as variants of the above training framework. The first term of Eq. (22) and Eq. (25) are similar to the objective of GAIL by regarding Mθ as the policy to imitate and ˆµ as the environment to collect data. In the context of GALILEO, the objective of GAIL is: θt+1 = arg max θ Eρκ Mθt Aφ (x, a, x ) log Mθ(x |x, a) s.t. Qκ Mθt (x, a, x ) = E t γt log Dφ (xt, at, xt+1)|(xt, at, xt+1) = (x, a, x ), κ, Mθt φ = arg max φ Eρκ M log Dφ(x, a, x ) + Eρκ Mθt log(1 Dφ(x, a, x )) , where Aφ (x, a, x ) = Qκ Mθ(x, a, x ) V κ Mθ(x, a) and V κ Mθt(x, a) = EMθt(x,a) [Qκ(x, a, x )]. F Additional Related Work Our primitive objective is inspired by weighted empirical risk minimization (WERM) based on inverse propensity score (IPS). WERM is originally proposed to solve the generalization problem of domain adaptation in machine learning literature. 
For instance, we would like to train a predictor M(y|x) in a domain with distribution Ptrain(x) to minimize the prediction risks in a domain with distribution Ptest(x), where Ptest ≠ Ptrain. To solve the problem, we can train a weighted objective max_M E_{x∼Ptrain}[(Ptest(x)/Ptrain(x)) log M(y|x)], which is called the weighted empirical risk minimization method [5, 4, 15, 9, 42]. These results have been extended and applied to causal inference, where the predictor is required to generalize from the data distribution of observational studies (the source domain) to the data distribution of randomized controlled trials (the target domain) [48, 3, 22, 29, 28]. In this case, the input features include a state x (a.k.a. covariates) and an action a (a.k.a. treatment variable), which is sampled from a policy. We often assume that the distribution of x, P(x), is consistent between the source domain and the target domain; then we have Ptest(x)/Ptrain(x) = (P(x)β(a|x)) / (P(x)µ(a|x)) = β(a|x)/µ(a|x), where µ and β are the policies in the source and target domains respectively. In [48, 3, 22], the policy in randomized controlled trials is modeled as a uniform policy, so Ptest(x)/Ptrain(x) = (P(x)β(a|x)) / (P(x)µ(a|x)) ∝ 1/µ(a|x), where 1/µ(a|x) is also known as the inverse propensity score (IPS). In [28], the policy in the target domain is assumed to be predefined as β(a|x) before environment model learning, and β/µ is used as the IPS.

The differences between AWRM and previous works fall into two aspects: (1) We consider the distribution-shift problem in the sequential decision-making scenario. In this scenario, we consider not only the action-distribution mismatch between the behavior policy µ and the policy β to evaluate, but also the follow-up effects of policies on the state distribution; (2) For faithful offline policy optimization, we require the environment model to generalize over numerous different policies. The objective of AWRM is proposed to guarantee the generalization ability of M over numerous different policies instead of a specific policy.

On a different thread, there are also studies that bring counterfactual inference techniques from causal inference into model-based RL [8, 39, 49]. These works consider that the transition function depends on some hidden noise variables and use Pearl-style structural causal models (SCMs), which are directed acyclic graphs defining the causality among the nodes of an environment, to handle the problem. SCMs can help RL in different ways: [8] approximates the posterior of the noise variables based on the observed data, and environment models are learned based on the inferred noises; the generalization ability is improved if the correct values of the noise variables can be inferred. [39] discovers several local causal structures of a global environment model and then designs data augmentation strategies that leverage these local structures to generate counterfactual experiences. [49] proposes a representation learning technique for causal factors, which are an instance of the hidden noise variables, in partially observable Markov decision processes (POMDPs). With the learned representation of causal factors, the performance of policy learning and transfer in downstream tasks is improved.
Instead of considering hidden noise variables in the environments, our study considers the environment model learning problem in the fully observed setting and focuses on unbiased causal-effect estimation from offline datasets collected by behavior policies with selection bias. In offline model-based RL, the problem is called distribution shift [59, 31, 14] and has received great attention. However, previous algorithms do not handle the model-learning challenge directly but propose techniques to suppress policy sampling and learning in risky regions [59, 30]. Although these algorithms have made great progress in offline policy optimization on many tasks, so far, how to learn a better environment model in this scenario has rarely been discussed.

We are not the first to use the concept of density ratios for weighting. In off-policy estimation, [32, 35, 60] use density ratios to evaluate the value of a given target policy β. These methods attempt to obtain an accurate approximation of ω(s, a|ρ^β). The objective of our work, AWRM, is for the model to provide faithful feedback for different policies, formalized as minimizing the model error under ρ^β for any β in the policy space Π. The core of this problem is how to obtain an approximation of the density function of the best-response policy β* corresponding to the current model, and then approximate the corresponding ω(s, a|ρ^{β*}). It should be emphasized that, since β* is unknown in advance and changes as the inductive bias of M changes, the solutions proposed in [32, 35, 60] cannot be applied to AWRM. Recently, [24, 57] use density-ratio weighting to learn a model. The purpose of the weighting is to make the model adapt to the current policy learning and to adjust model learning according to the current policy, which is the same as the WERM objective proposed in Def. 4.1. [54] also utilizes density-ratio weighting to learn a model. Instead of estimating the ratios based on the offline dataset and the target policy as previous works do, it designs an adversarial model-learning objective by constructing two function classes V and W that cover the target policy's value V_β and density ratio ω_β, i.e., V_β ∈ V and ω_β ∈ W. Different from previous articles, our approach uses adversarial weighting to learn a universal model that provides good feedback for any target policy β ∈ Π, i.e., AWRM, instead of learning a model suited to a specific target policy.

G Experiment Details

G.1 Settings

G.1.1 General Negative Feedback Control (GNFC)

The design of GNFC is inspired by a classic type of scenario in which behavior policies µ have selection bias and easily lead to counterfactual risks: on some internet platforms, we would like to allocate budgets to a set of targets (e.g., customers or cities) to increase the engagement of the targets on the platforms. Our task is to train a model to predict the targets' feedback on engagement given the targets' features and the allocated budgets. In these tasks, for better benefits, the online working policy (i.e., the behavior policy) tends to cut down the budgets if targets have better engagement; otherwise, the budgets might be increased.
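Before giving the formal construction below, the following minimal sketch (our own toy example with hypothetical constants, e.g., the partial-correction gain 0.2 and the noise scale; it is not the data-collection code used in the paper) illustrates how such a negative-feedback behavior policy produces logged data in which budgets and next-step engagement are negatively correlated even though the true effect of a larger budget on engagement is positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_biased_dataset(n_targets=200, horizon=50, e=0.2, p=0.05):
    """Toy negative-feedback data collection: the behavior policy cuts the
    budget when engagement is high and raises it when engagement is low,
    so budgets and engagement are negatively correlated in the logged data."""
    rows = []
    for _ in range(n_targets):
        engagement = rng.uniform(40.0, 80.0)      # hypothetical initial state
        for _ in range(horizon):
            # negative-feedback behavior policy (0.2 is a hypothetical gain),
            # occasionally perturbed by exploration noise U(-e, e) with prob. p
            action = 0.2 * (62.5 - engagement)
            if rng.random() < p:
                action += rng.uniform(-e, e)
            # hypothetical ground truth: a larger budget RAISES engagement
            next_engagement = engagement + action + rng.normal(0.0, 2.0)
            rows.append((engagement, action, next_engagement))
            engagement = next_engagement
    return np.array(rows)

data = collect_biased_dataset()
a, y_next = data[:, 1], data[:, 2]
# The logged correlation is negative, so a model fit by plain ERM is tempted
# to conclude that larger budgets reduce engagement.
print("corr(action, next engagement):", np.corrcoef(a, y_next)[0, 1])
```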
The risk of counterfactual environment model learning in this task is that targets with better historical engagement receive smaller budgets because of the selection bias of the behavior policies; a model might then exploit this correlation during learning and conclude that increasing budgets reduces the targets' engagement, which violates the real causality. We construct an environment and a behavior policy to mimic the above process. In particular, the behavior policy µ_GNFC is µ_GNFC(x) = (62.5 − mean(x)), perturbed by a sample noise ϵ, which will be discussed later. The environment includes two parts: (1) a response function M1(y|x, a) = N(mean(x) + a, 2); (2) a mapping function M2(x'|x, a, y) = y − mean(x) + x. The transition function M* is the composite M*(x'|x, a) = M2(x'|x, a, M1(y|x, a)). The behavior policies have selection bias: the actions taken are negatively correlated with the states, as illustrated in Fig. 10(a) and Fig. 10(b).

We control the difficulty of distinguishing the correct causality among x, a, and y by designing different noise-sampling strategies for ϵ. In principle, with more frequent or more pronounced disturbances, more samples violate the correlation between x and a, and these samples can be used to find the correct causality. Therefore, we can control the difficulty of counterfactual environment model learning by controlling the strength of the disturbance. In particular, we sample ϵ from a uniform distribution U(−e, e) with probability p; that is, ϵ = 0 with probability 1 − p and ϵ ∼ U(−e, e) with probability p. With larger p, more samples in the dataset violate the negative correlation (i.e., µ_GNFC), and with larger e, the differences in the feedback become more obvious. By selecting different e and p, we can construct different tasks to verify the effectiveness and ability of counterfactual environment model learning algorithms.

Figure 10: Illustration of the collected dataset in GNFC: (a) y in the collected trajectories; (b) a in the collected trajectories. Each colored line denotes one collected trajectory, and the X-axis denotes the timestep within a trajectory.

G.1.2 The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) is a project that has profiled and analyzed large numbers of human tumors to discover molecular aberrations at the DNA, RNA, protein, and epigenetic levels. The resulting rich data provide a significant opportunity to accelerate our understanding of the molecular basis of cancer. We obtain features x from the TCGA dataset and consider three continuous treatments, as done in SCIGAN [7]. Each treatment a is associated with a set of parameters v1, v2, v3, each sampled randomly from a standard normal distribution and scaled by its norm. We assign interventions by sampling a treatment a from a beta distribution, a | x ∼ Beta(α, β). Here α ≥ 1 controls the sampling bias and β = (α − 1)/a* + 2 − α, where a* is the optimal treatment. This setting of β ensures that the mode of Beta(α, β) is a*. The calculation of the treatment response and the optimal treatment is shown in Table 4.

Table 4: Treatment response used to generate semi-synthetic outcomes for patient features x. In the experiments, we set C = 10.
Treatment Treatment Response Optimal treatment 1 f1(x, a1) = C v1 1 T x + 12 v1 2 T xa1 12 v1 3 T xa2 1 a 1 = (v1 2) T x 2(v1 3) T x 2 f2(x, a2) = C v2 1 T x + sin π v2T 2 x v2T 3 x a2 a 2 = (v2 3) T x 2(v2 2) T x 3 f3(x, a3) = C v3 1 T x + 12a3(a3 b)2 , where b = 0.75(v3 2) T x (v3 3) T x 3 b if b 0.75, 1 if b < 0.75 We conduct experiments on three different treatments separately and change the value of bias α to assess the robustness of different methods to treatment bias. When the bias of treatment is large, which means α is large, the training set contains data with a strong bias on treatment so it would be difficult for models to appropriately predict the treatment responses out of the distribution of training data. G.1.3 Budget Allocation task to the Time period (BAT) 1. customers send take-out food orders remotely based on the platform 3. delivery clerks take food from the stores 2. stores make food 4. delivery clerks send the food to the customers 5. The platform will pay the delivery clerks for fulfilling the order Figure 11: Illustration of the workflow of the food-delivery platform. We deploy GALILEO in a real-world large-scale food-delivery platform. The platform contains various food stores, and food delivery clerks. The overall workflow is as follows: the platform presents the nearby food stores to the customers and the customers make orders, i.e., purchase take-out foods from some stores on the platform. The food delivery clerks can select orders from the platform to fulfill. After an order is selected to fulfill, the delivery clerks will take the ordered take-out foods from the stores and then send the food to the customers. The platform will pay the delivery clerks (mainly in proportion to the distance between the store and the customers location) once the orders are fulfilled. An illustration of the workflow can be found in Fig. 11. 0.0 0.2 0.4 0.6 0.8 1.0 normalized actions normalized fulfilled order amount Figure 12: Illustration of relationship between user feedback and the actions of the offline dataset in the real-world food-delivery platform. However, there is an imbalance problem between the demanded orders from customers and the supply of delivery clerks to fulfill these orders. For example, at peak times like lunchtime, there will be many more demanded orders than at other times, and the existed delivery clerks might not be able to fulfill all of these orders timely. The goal of the Budget Allocation task to the Time period (BAT) is to handle the imbalance problem in time periods by sending reasonable allowances to different time periods. More precisely, the goal of BAT is to make all orders (i.e., the demand) sent in different time periods can be fulfilled (i.e., the supply) timely. The core challenge of the environment model learning in BAT tasks is similar to the challenge in Fig. 1. Specifically, the behavior policy in BAT tasks is a human-expert policy, which will tend to increase the budget of allowance in the time periods with a lower supply of delivery clerks, otherwise will decrease the budget (Fig. 12 gives an instance of this phenomenon in the real data). To handle the imbalance problem in different time periods, in the platform, the orders in different time periods t [0, 1, 2..., 23] will be allocated with different allowances c N +. For example, at 10 A.M. (i.e., t = 10), we add 0.5$ (i.e., c = 0.5) allowances to all of the demanded orders. From 10 A.M. 
to 11 A.M., the delivery clerks who take orders and send food to customers will receive extra allowances. Specifically, if the platform pays the delivery clerks 2$ for fulfilling the order, now he/she will receive 2.5$. For each day, the budget of allowance C is fixed. We should find the best budget allocation policy π (c|t) of the limited budget C to make as many orders as possible can be taken timely. To find the policy, we first learn a model to reconstruct the response of allowance for each delivery clerk ˆ M(yt+1|st, pt, ct), where yt+1 is the taken orders of the delivery clerks in state st, ct is the allowances, pt denotes static features of the time period t. In particular, the state st includes historical order-taken information of the delivery clerks, current orders information, the feature of weather, city information, and so on. Then we use a rule-based mapping function f to fill the complete next time-period states, i.e., st+1 = f(st, pt, ct, yt+1). Here we define the composition of the above functions ˆ M and f as ˆ Mf. Finally, we learn a budget allocation policy based on the learned model. For each day, the policy we would like to find is: max π Es0 S t=0 yt| ˆ Mf, π t,s S ctyt C In our experiment, we evaluate the degree of balancing between demand and supply by computing the averaged five-minute order-taken rate, that is the percentage of orders picked up within five minutes. Note that the behavior policy is fixed for the long term in this application. So we directly use the data replay with a small scale of noise (See Tab. 6) to reconstruct the behavior policy for model learning in GALILEO. Also note that although we model the response for each delivery clerk, for fairness, the budget allocation policy is just determining the allowance of each time period t and keeps the allowance to each delivery clerk s the same. G.2 Baseline Algorithms The algorithm we compared are: (1) Supervised Learning (SL): training a environment model to minimize the expectation of prediction error, without considering the counterfactual risks; (2) inverse propensity weighting (IPW) [50]: a practical way to balance the selection bias by re-weighting. It can be regarded as ω = 1 ˆµ, where ˆµ is another model learned to approximate the behavior policy; (3) SCIGAN: a recent proposed adversarial algorithm for model learning for continuous-valued interventions [7]. All of the baselines algorithms are implemented with the same capacity of neural networks (See Tab. 6). G.2.1 Supervised Learning (SL) As a baseline, we train a multilayer perceptron model to directly predict the response of different treatments, without considering the counterfactual risks. We use mean square error to estimate the performance of our model so that the loss function can be expressed as MSE = 1 n Pn i=1 (yi ˆyi)2, where n is the number of samples, y is the true value of response and ˆy is the predicted response. In practice, we train our SL models using Adam optimizer and the initial learning rate 3e 4 on both datasets TCGA and GNFC. The architecture of the neural networks is listed in Tab. 6. G.2.2 Inverse Propensity Weighting (IPW) Inverse propensity weighting [50] is an approach where the treatment outcome model uses sample weights to balance the selection bias by re-weighting. 
The weights are defined as the inverse propensity of actually getting the treatment, which can be expressed as 1 ˆµ(a|x), where x stands for the feature vectors in a dataset, a is the corresponding action and ˆµ(a|x) indicates the action taken probability of a given the features x within the dataset. ˆµ is learned with standard supervised learning. Standard IPW leads to large weights for the points with small sampling probabilities and finally makes the learning process unstable. We solve the problem by clipping the propensity score: ˆµ min(ˆµ, 0.05), which is common used in existing studies [27]. The loss function can thus be expressed as 1 n Pn i=1 1 ˆµ(ai|xi) (yi ˆyi)2. The architecture of the neural networks is listed in Tab. 6. G.2.3 SCIGAN SCIGAN [7] is a model that uses generative adversarial networks to learn the data distribution of the counterfactual outcomes and thus generate individualized response curves. SCIGAN does not place any restrictions on the form of the treatment-does response functions and is capable of estimating patient outcomes for multiple treatments, each with an associated parameter. SCIGAN first trains a generator to generate response curves for each sample within the training dataset. The learned generator can then be used to train an inference network using standard supervised methods. For fair comparison, we increase the number of parameters for the open-source version of SCIGAN so that the SCIGAN model can have same order of magnitude of network parameters as GALILEO. In addition, we also finetune the hyperparameters (Tab. 5) of the enlarged SCIGAN to realize its full strength. We set num_dosage_samples 9 and λ = 10. Table 5: Table of hyper-parameters for SCIGAN. Parameter Values Number of samples 3, 5, 7, 9, 11 λ 0.1, 1, 10, 20 G.3 Hyper-parameters We list the hyper-parameter of GALILEO in Tab. 6. G.4 Computation Resources We use one Tesla V100 PCIe 32GB GPU and a 32-core Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz to train all of our models. Table 6: Table of hyper-parameters for all of the tasks. Parameter GNFC TAGC Mu Jo Co BAT hidden layers of all neural networks 4 4 5 5 hidden units of all neural networks 256 256 512 512 collect samples for each time of model update 5000 5000 40000 96000 batch size of discriminators 5000 5000 40000 80000 horizon 50 1 100 48 (half-hours) ϵµ (also ϵD) 0.005 0.01 0.05 (0.1 for walker2d) 0.05 times for discriminator update 2 2 1 5 times for model update 1 1 2 20 times for supervised learning update 1 1 4 20 learning rate for supervised learning 1e-5 1e-5 3e-4 1e-5 γ 0.99 0.0 0.99 0.99 clip-ratio NAN NAN NAN 0.1 max DKL 0.001 0.001 0.001 NAN optimization algorithm (the first term of Eq. (22)) TRPO TRPO TRPO PPO e0.05_p0.05 GALILEO SL IPW SCIGAN GALILEO SL IPW SCIGAN (b) TCGA Figure 13: Illustration of the performance in GNFC and TCGA. The grey bar denotes the standard error ( 0.3 for brevity) of 3 random seeds. H Additional Results H.1 Test in Single-step Environments The results of GNFC tasks are summarized in Fig. 13(a) and the detailed results can be found in Tab. 11. The results show that the property of the behavior policy (i.e., e and p) dominates the generalization ability of the baseline algorithms. When e = 0.05, almost all of the baselines fail and give a completely opposite response curve. IPW still perform well when 0.2 e 1.0 but fails when e = 0.05, p <= 0.2. We also found that SCIGAN can reach a better performance than other baselines when e = 0.05, p <= 0.2, but the results in other tasks are unstable. 
GALILEO is the only algorithm that is robust to the selection bias and outputs correct response curves in all of the tasks. Based on this experiment, we also point out that the commonly used overlap assumption is unreasonable to a certain extent, especially in real-world applications, since it is impractical to inject noise into the whole action space. The violation of the overlap assumption should be taken into consideration; otherwise, an algorithm that is sensitive to the noise range will be hard to use in practice.

The results of the TCGA tasks are summarized in Fig. 13(b) and the detailed results can be found in Tab. 12. We found that the phenomenon in this experiment is similar to the one in GNFC, which demonstrates the compatibility of GALILEO with single-step environments. We also found that the results of IPW are unstable in this experiment. This might be because the behavior policy is modeled with a beta distribution while the propensity score ˆµ is modeled with a Gaussian distribution. Since IPW directly re-weights the loss function with 1/ˆµ, its results are sensitive to the error of ˆµ. GALILEO also models ˆµ with a Gaussian distribution, but its results are more stable since GALILEO does not re-weight through ˆµ explicitly.

We give the averaged responses for all of the tasks and algorithms in Fig. 21 to Fig. 28. We randomly select 20% of the states in the dataset, equidistantly sample actions from the action space for each sampled state, and plot the averaged predicted feedback for each action. The real response differs slightly among the figures because the randomly selected states for testing are different. We sample 9 actions in the GNFC tasks and 33 actions in the TCGA tasks for plotting.

H.2 All of the Result Table

We give the results of GNFC in Tab. 11, TCGA in Tab. 12, BAT in Tab. 9, and MuJoCo in Tab. 10.

H.3 Ablation Studies

Figure 14: Illustration of the ablation studies. The error bars are the standard error.

In Appx. E.1, we introduce several techniques to develop a practical GALILEO algorithm. Based on the task e0.2_p0.05 in GNFC, we conduct ablation studies to investigate the effects of these techniques. We first compare two variants that do not handle the assumption-violation problems: (1) NO_INJECT_NOISE: set ϵµ and ϵD to zero, which makes the overlap assumption unsatisfied; (2) SINGLE_SL: do not replace the second term in Eq. (6) with standard supervised learning even when the output probability of D is far away from 0.5. Besides, we introduce several tricks inspired by GAIL and give a comparison of these tricks with GAIL: (3) ONE_STEP: use the one-step reward instead of cumulative rewards (i.e., Q and V; see Eq. (23) and Eq. (24)) for re-weighting, which is implemented by setting γ to 0; (4) SINGLE_DIS: remove T_{φ'1}(x, a) and replace it with E_{Mθ}[T_{φ'0}(x, a, x')], which is inspired by GAIL using a value function as a baseline instead of another discriminator; (5) PURE_GAIL: remove the second term in Eq. (6), which can be regarded as a naive adoption of GAIL and a partial implementation of GALILEO. We summarize the results in Fig. 14. Based on the results of NO_INJECT_NOISE and SINGLE_SL, we can see that handling the assumption-violation problems is important and improves performance on counterfactual queries.
The results of PURE_GAIL tell us that the partial implementation of GALILEO is not enough to give stable predictions on counterfactual data. On the other hand, the result of ONE_STEP demonstrates that embedding the cumulative effect of one-step prediction errors is helpful for GALILEO training. Finally, we found that SINGLE_DIS has almost no effect on the results. This suggests that, empirically, we can use E_{Mθ}[T_{φ'0}(x, a, x')] as a replacement for T_{φ'1}(x, a), which reduces the computation cost of training the extra discriminator.

H.4 Worst-Case Prediction Error

In theory, GALILEO improves generalization by focusing training on worst-case samples to achieve AWRM. To demonstrate this property, we propose a new metric named Mean-Max Square Error (MMSE): E[max_{a∈A} (M*(x'|x, a) − M(x'|x, a))²], and give the MMSE results for GNFC in Tab. 13 and for TCGA in Tab. 14.

H.5 Detailed Results in the MuJoCo Tasks

We select 3 environments from D4RL [17] to construct our model-learning tasks. We compare GALILEO with a typical transition-model learning algorithm used in previous offline model-based RL algorithms [59, 30], which is a variant of standard supervised learning; we name this method OFF-SL. Besides, we also implement IPW and SCIGAN as baselines. We train models on the HalfCheetah-medium, Walker2d-medium, and Hopper-medium datasets, which are collected by a behavior policy with about 1/3 of the expert policy's performance, and then test the models on the corresponding expert datasets. For better training efficiency, the trajectories in the training and testing datasets are truncated, retaining the first 100 steps. We report the converged results and learning curves of the three MuJoCo tasks in Tab. 10 and Fig. 15 respectively.

Figure 15: Learning curves on the MuJoCo tasks: (a) halfcheetah-medium (train), (b) hopper-medium (train), (c) walker2d-medium (train), (d) halfcheetah-medium-replay (test), (e) hopper-medium-replay (test), (f) walker2d-medium-replay (test), (g) halfcheetah-expert (test), (h) hopper-expert (test), (i) walker2d-expert (test). The X-axis records the steps of the environment-model update and the Y-axis is the corresponding prediction error. Titles ending in (train) denote datasets used for training, while titles ending in (test) denote datasets used only for testing. The solid curves are the means and the shaded regions are the standard errors over three seeds.

In Fig. 15, we can see that all algorithms perform well on the training datasets; OFF-SL and SCIGAN even reach slightly lower errors in halfcheetah and walker2d. However, when we evaluate the models on the expert and medium-replay datasets, which are collected by other policies, GALILEO is significantly more stable and accurate than all other algorithms. As training continues, the baseline algorithms even get worse and worse.
However, for GALILEO as well as the other baselines, the prediction error on the testing datasets is at least 2x larger than on the training dataset, and the error is especially large on halfcheetah. This phenomenon indicates that although GALILEO gives better predictions for counterfactual queries, the risk of using the learned models is still large and remains a challenge to be further addressed.
Table 7: Performance of policies directly optimized through SAC [20] in the learned dynamics models and then deployed in the MuJoCo environments. MAX-RETURN is the performance of SAC trained in the real MuJoCo environments, and avg. norm. is the average normalized return of the policies over the 9 tasks, where returns are normalized to lie between 0 and 100, with 0 corresponding to the worst policy and 100 to MAX-RETURN.

Task          Hopper                                   Walker2d                                    HalfCheetah                                avg. norm.
Horizon       H=10        H=20        H=40             H=10        H=20         H=40               H=10       H=20        H=40
GALILEO       13.0 ± 0.1  33.2 ± 0.1  53.5 ± 1.2       11.7 ± 0.2  29.9 ± 0.3   61.2 ± 3.4         0.7 ± 0.2  -1.1 ± 0.2  -14.2 ± 1.4        51.1
OFF-SL        4.8 ± 0.5   3.0 ± 0.2   4.6 ± 0.2        10.7 ± 0.2  20.1 ± 0.8   37.5 ± 6.7         0.4 ± 0.5  -1.1 ± 0.6  -13.2 ± 0.3        21.1
IPW           5.9 ± 0.7   4.1 ± 0.6   5.9 ± 0.2        4.7 ± 1.1   2.8 ± 3.9    14.5 ± 1.4         1.6 ± 0.2  0.5 ± 0.8   -11.3 ± 0.9        19.7
SCIGAN        12.7 ± 0.1  29.2 ± 0.6  46.2 ± 5.2       8.4 ± 0.5   9.1 ± 1.7    1.0 ± 5.8          1.2 ± 0.3  -0.3 ± 1.0  -11.4 ± 0.3        41.8
MAX-RETURN    13.2 ± 0.0  33.3 ± 0.2  71.0 ± 0.5       14.9 ± 1.3  60.7 ± 11.1  221.1 ± 8.9        2.6 ± 0.1  13.3 ± 1.1  49.1 ± 2.3         100.0

We then verify the generalization ability of the learned models by adopting them in offline model-based RL. Instead of designing sophisticated tricks to suppress policy exploration and learning in risky regions, as current offline model-based RL algorithms [59, 30] do, we simply use the standard SAC algorithm [20] to exploit the models for policy learning, in order to strictly verify the ability of the models. Unfortunately, we found that the compounding error is still inevitably large over 1,000-step rollouts, the standard horizon in MuJoCo tasks, leading all models to fail to derive a reasonable policy. To better verify the effect of the models on policy optimization, we learn and evaluate the policies with three smaller horizons: H ∈ {10, 20, 40}. The results are listed in Tab. 7, and the learning curves in the dynamics models and in the real environments are shown in Fig. 16 and Fig. 17, respectively. We first average the normalized return over the tasks (see avg. norm.): the policies obtained with GALILEO are significantly better than those obtained with the other models (the improvements are 24% to 161%). At the same time, we found that SCIGAN performs better in policy learning, while IPW performs similarly to SL. This is in line with our expectations: IPW only considers the uniform policy as the target policy for debiasing, while policy optimization requires querying a wide variety of policies, and minimizing the prediction risk only under a uniform policy cannot yield a good environment model for policy optimization. On the other hand, SCIGAN, as a partial implementation of GALILEO (refer to Appx. E.2), also roughly achieves AWRM and considers the cumulative effects of the policy on the state distribution, so its overall performance is better. In addition, we find that GALILEO achieves significant improvements in 6 of the 9 tasks, whereas in HalfCheetah IPW works slightly better. However, compared with MAX-RETURN, all methods fail to derive reasonable policies, as their performance is far from that of the optimal policy.
By further checking the trajectories, we found that all the learned policies just keep the cheetah standing in the same place, or even moving backward and falling down (the videos can be found at https://github.com/xionghuichen/GALILEO).
H.6 Off-policy Evaluation (OPE) in the MuJoCo Tasks
H.6.1 Training and Evaluation Settings
We select 3 environments from D4RL [17] to construct our model-learning tasks as in Appx. H.5. To match the experimental setting of DOPE [18], here we use the whole datasets, instead of the truncated datasets in Appx. H.5, to train the GALILEO models. OPE via a learned dynamics model is straightforward: we only need to compute the return of simulated trajectories generated by the evaluated policy under the learned dynamics model. Due to the stochasticity of the model and the policy, we estimate the return of a policy with Monte-Carlo sampling. See Alg. 4 for pseudocode, where we use γ = 0.995, N = 10, and H = 1000 for all of the tasks.
H.6.2 Metrics
The metrics we use in our paper are defined as follows:
Figure 16: Illustration of the offline policy learning curves on the MuJoCo tasks; panels (a)–(i) plot the returns in the learned dynamics models against the number of environment-model update steps (×10^5) for GALILEO, GANITE, IPW, OFF-SL, and the ground-truth environment, with horizons H = 10, 20, and 40. The solid curves are the mean return and the shaded regions the standard error over three seeds.
Absolute Error. The absolute error is defined as the difference between the true value and the estimated value of a policy:
AbsErr = |V^π − V̂^π|,   (28)
where V^π is the true value of the policy and V̂^π is the estimated value of the policy.
Rank correlation. Rank correlation measures the correlation between the ordinal rankings of the value estimates and the true values, which can be written as:
RankCorr = Cov(V^π_{1:N}, V̂^π_{1:N}) / ( σ(V^π_{1:N}) σ(V̂^π_{1:N}) ),   (29)
where 1:N denotes the indices of the evaluated policies.
Figure 17: Illustration of the offline policy learning curves on the MuJoCo tasks; panels (a)–(i) plot the returns in the ground-truth environments against the number of environment-model update steps (×10^5) for GALILEO, GANITE, IPW, OFF-SL, and the ground-truth environment, with horizons H = 10, 20, and 40. The solid curves are the mean return and the shaded regions the standard error over three seeds.
Regret@k. Regret@k is the difference between the value of the best policy in the entire set and the value of the best policy in the top-k set, where the top-k set is chosen according to the estimated values. It can be defined as:
Regret@k = max_{i ∈ 1:N} V^π_i − max_{j ∈ topk(1:N)} V^π_j,   (30)
where topk(1:N) denotes the indices of the top-k policies as measured by the estimated values V̂^π.
Algorithm 4: Off-policy Evaluation with the GALILEO model
Require: GALILEO model M_θ, evaluated policy π, number of rollouts N, set of initial states S_0, discount factor γ, horizon length H.
for i = 1 to N do
    R_i ← 0
    Sample an initial state s_0 ∼ S_0
    for t = 0 to H − 1 do
        a_t ∼ π(·|s_t)
        s_{t+1}, r_t ∼ M_θ(·|s_t, a_t)
        R_i ← R_i + γ^t r_t
    end for
end for
return (1/N) Σ_{i=1}^{N} R_i
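For concreteness, the following is a minimal Python sketch of the Monte-Carlo OPE estimator of Alg. 4 together with the three metrics above. It is an illustrative sketch rather than the exact implementation: model.step and policy.sample are hypothetical interfaces standing in for the learned GALILEO dynamics model and the evaluated policy, and the default arguments follow the values reported in Appx. H.6.1 (γ = 0.995, N = 10, H = 1000).

```python
import numpy as np

def mc_policy_value(model, policy, init_states, n_rollouts=10, horizon=1000, gamma=0.995):
    """Monte-Carlo estimate of a policy's return under a learned dynamics model (Alg. 4)."""
    returns = []
    for _ in range(n_rollouts):
        s = init_states[np.random.randint(len(init_states))]  # sample s_0 from S_0
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy.sample(s)       # a_t ~ pi(.|s_t)  (hypothetical interface)
            s, r = model.step(s, a)    # s_{t+1}, r_t ~ M_theta(.|s_t, a_t)  (hypothetical interface)
            ret += discount * r
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))

def abs_err(v_true, v_hat):
    """Absolute error |V^pi - hat{V}^pi| of a single policy (Eq. 28)."""
    return abs(v_true - v_hat)

def rank_corr(v_true, v_hat):
    """Correlation between true and estimated values across the N evaluated policies (Eq. 29)."""
    return float(np.corrcoef(np.asarray(v_true), np.asarray(v_hat))[0, 1])

def regret_at_k(v_true, v_hat, k=1):
    """Gap between the best true value and the best true value within the estimated top-k (Eq. 30)."""
    v_true, v_hat = np.asarray(v_true), np.asarray(v_hat)
    topk = np.argsort(v_hat)[-k:]      # indices of the top-k policies by estimated value
    return float(v_true.max() - v_true[topk].max())
```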
H.6.3 Detailed Results
The results of OPE on the three tasks are given in Tab. 8. First, for all recorded rank-correlation scores, GALILEO is significantly better than all of the baseline methods, with at least a 25% improvement. As for regret@1, although GALILEO does not reach the best performance in every task, it is always one of the top-2 methods among the three tasks, which demonstrates the stability of GALILEO. Finally, GALILEO has the smallest value gaps except for HalfCheetah; however, for all of the top-4 methods, the value gaps on HalfCheetah are around 1,200, so we believe their performance on value gaps is roughly at the same level.
Table 8: Results of OPE on the DOPE benchmark. ± is the standard deviation. We bold the top-2 scores for each metric. The results are tested on 3 random seeds. The results of the baselines are from [18]. Note that the rank correlation is NAN for HalfCheetah because the scores are not given in [18].

TASK: HalfCheetah   Absolute value gap   Rank correlation   Regret@1
GALILEO             1280 ± 83            0.313 ± 0.09       0.12 ± 0.02
Best DICE           1382 ± 130           NAN                0.82 ± 0.29
VPM                 1374 ± 153           NAN                0.33 ± 0.19
FQE (L2)            1211 ± 130           NAN                0.38 ± 0.13
IS                  1217 ± 123           NAN                0.05 ± 0.05
Doubly Robust       1222 ± 134           NAN                0.37 ± 0.13

TASK: Walker2d      Absolute value gap   Rank correlation   Regret@1
GALILEO             176 ± 52             0.57 ± 0.08        0.08 ± 0.06
Best DICE           273 ± 31             0.12 ± 0.38        0.27 ± 0.43
VPM                 426 ± 60             0.44 ± 0.21        0.08 ± 0.06
FQE (L2)            350 ± 79             -0.09 ± 0.36       0.31 ± 0.10
IS                  428 ± 60             -0.25 ± 0.35       0.70 ± 0.39
Doubly Robust       368 ± 74             0.02 ± 0.37        0.25 ± 0.09

TASK: Hopper        Absolute value gap   Rank correlation   Regret@1
GALILEO             156 ± 23             0.45 ± 0.1         0.08 ± 0.08
Best DICE           215 ± 41             0.19 ± 0.33        0.18 ± 0.19
VPM                 433 ± 44             0.13 ± 0.37        0.10 ± 0.14
FQE (L2)            283 ± 73             -0.29 ± 0.33       0.32 ± 0.32
IS                  405 ± 48             -0.55 ± 0.26       0.32 ± 0.32
Doubly Robust       307 ± 73             -0.31 ± 0.34       0.38 ± 0.28

H.7 Detailed Results in the BAT Task
The core challenge of environment model learning in the BAT task is similar to the challenge in Fig. 1. Specifically, the behavior policy in the BAT task is a human-expert policy, which tends to increase the allowance budget in time periods with a lower supply of delivery clerks and to decrease it otherwise (Fig. 12 gives an instance of this phenomenon in the real data). Since there is no oracle environment model to query, we have to describe the results with other metrics.
Figure 18: Illustration of the response curves in the 6 cities (rescaled averaged response vs. normalized allocated allowances). Although the ground-truth curves are unknown, human expert knowledge tells us that they are expected to be monotonically increasing.
Figure 19: Illustration of the AUUC result for BAT (rescaled cumulative treatment effects vs. cumulative samples, rescaled to 1, for a random ordering, GALILEO, and SL). A model with a larger area above the random line makes better predictions on randomized-controlled-trial data [61].
First, we review whether the tendency of the response curves is consistent. In this application, a larger allowance budget will not decrease the supply. As can be seen in Fig. 18, the tendency of GALILEO's responses is valid in all 6 cities, while almost all of the SL models give responses in the opposite direction. If we were to learn a policy through the SL model, the optimal solution would be to cancel all of the allowances, which is obviously incorrect in practice. Second, we conduct randomized controlled trials (RCTs) in one of the testing cities. Using the RCT samples, we can evaluate the correctness of the sort order of the model predictions via the Area Under the Uplift Curve (AUUC) [6]. To plot the uplift curve, we first sort the RCT samples by their predicted treatment effects; the cumulative treatment effects are then computed by scanning the sorted sample list. If the sort order given by the model predictions is good, the samples with larger treatment effects are accumulated early, and the resulting area will be larger than that of a random sorting strategy. The AUUC results show that GALILEO gives a reasonable ordering of the RCT samples (see Fig. 19).
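To make the evaluation procedure concrete, below is a minimal sketch of how an uplift curve and its area can be computed from RCT samples. It mirrors the sort-and-accumulate procedure described above rather than the exact implementation behind Fig. 19; the arrays of predicted uplifts, treatment indicators, and outcomes are hypothetical inputs, and conventions for uplift curves vary across the literature.

```python
import numpy as np

def uplift_curve(pred_uplift, treated, outcome):
    """Cumulative treatment effects after sorting RCT samples by predicted uplift (descending).
    `treated` is a 0/1 indicator of the treatment group."""
    order = np.argsort(-np.asarray(pred_uplift, dtype=float))
    t = np.asarray(treated, dtype=float)[order]
    y = np.asarray(outcome, dtype=float)[order]
    n_t, n_c = np.cumsum(t), np.cumsum(1.0 - t)            # treated / control counts in each top-k prefix
    y_t, y_c = np.cumsum(y * t), np.cumsum(y * (1.0 - t))  # outcome sums in each top-k prefix
    k = np.arange(1, len(y) + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        effect = np.nan_to_num(y_t / n_t) - np.nan_to_num(y_c / n_c)
    return effect * k                                      # cumulative treatment effect at each k

def auuc(pred_uplift, treated, outcome):
    """Area under the uplift curve, normalized by the number of samples.
    A larger value than that of a random ordering indicates a better sort order."""
    curve = uplift_curve(pred_uplift, treated, outcome)
    return float(np.trapz(curve)) / len(curve)
```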
Finally, we search for the optimal policy with the cross-entropy-method planner [21] based on the learned model and test the online supply improvement in 6 cities. The compared algorithm is a human-expert policy, which is also the behavior policy that generated the offline datasets. We conduct an online A/B test in each of the cities: we randomly split a city into two partitions, one for deploying the optimal policy learned from the GALILEO model and the other as a control group, which keeps the human-expert policy as before. Before the intervention, we collect 10 days of observation data and compute the averaged five-minute order-taken rates as the baselines of the treatment and control groups, denoted b^t and b^c respectively. Then we start the intervention and observe the five-minute order-taken rates of the two groups for the following 14 days. The results of the treatment and control groups are y^t_i and y^c_i respectively, where i denotes the i-th day of the deployment. The supply improvement in percentage points is computed via difference-in-difference (DID):
( Σ_{i=1}^{T} [ (y^t_i − b^t) − (y^c_i − b^c) ] / T ) × 100,
where T is the total number of intervention days and T = 14 in our experiments.
Table 9: Results on BAT. We use City-X to denote the experiments on different cities. pp is an abbreviation of percentage points of supply improvement.

target city           City-A     City-B     City-C     City-D     City-E     City-F
supply improvement    +1.63pp    +0.79pp    +0.27pp    +0.2pp     +0.14pp    +0.41pp

The results are summarized in Tab. 9. The online experiment is conducted over 14 days, and the results show that the policy learned with the GALILEO model makes better budget allocations than the behavior policy in all of the testing cities, with supply improvements from 0.14 to 1.63 percentage points. Detailed results recording the daily supply difference between the treatment group and the control group are given in Fig. 20.
Figure 20: Illustration of the daily responses in the A/B tests in the 6 cities (rescaled five-minute order-taken rate vs. day number for the treatment group (GALILEO) and the control group, with the day the A/B test starts marked).
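As a reference for the DID computation described above, here is a minimal sketch; the names are hypothetical, with y_treat and y_ctrl holding the daily five-minute order-taken rates of the two groups over the T = 14 intervention days and b_treat, b_ctrl the 10-day pre-intervention baselines.

```python
import numpy as np

def did_supply_improvement(y_treat, y_ctrl, b_treat, b_ctrl):
    """Difference-in-difference estimate of the supply improvement in percentage points:
    (sum_i [(y_i^t - b^t) - (y_i^c - b^c)] / T) * 100, with T = len(y_treat)."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    daily_effect = (y_treat - b_treat) - (y_ctrl - b_ctrl)  # per-day DID effect
    return float(daily_effect.mean() * 100.0)
```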
Table 10: The root mean square errors on the MuJoCo tasks. We bold the lowest error for each task. The medium dataset is used for training, while the expert and medium-replay datasets are only used for testing. ± is the standard deviation over three seeds.

TASK: HalfCheetah   medium (train)   expert (test)    medium-replay (test)
GALILEO             0.378 ± 0.003    2.287 ± 0.005    1.411 ± 0.037
OFF-SL              0.404 ± 0.001    3.311 ± 0.055    2.246 ± 0.016
IPW                 0.513 ± 0.033    2.892 ± 0.050    2.058 ± 0.021
SCIGAN              0.309 ± 0.002    3.813 ± 0.133    2.484 ± 0.040

TASK: Walker2d      medium (train)   expert (test)    medium-replay (test)
GALILEO             0.49 ± 0.001     1.514 ± 0.002    0.968 ± 0.004
OFF-SL              0.467 ± 0.004    1.825 ± 0.061    1.239 ± 0.004
IPW                 0.564 ± 0.001    1.826 ± 0.025    1.282 ± 0.007
SCIGAN              0.438 ± 0.001    1.825 ± 0.031    1.196 ± 0.005

TASK: Hopper        medium (train)   expert (test)    medium-replay (test)
GALILEO             0.037 ± 0.002    0.322 ± 0.036    0.408 ± 0.003
OFF-SL              0.034 ± 0.001    0.464 ± 0.021    0.574 ± 0.008
IPW                 0.039 ± 0.001    0.533 ± 0.00     0.671 ± 0.001
SCIGAN              0.039 ± 0.002    0.628 ± 0.050    0.742 ± 0.019

Table 11: MISE results on GNFC. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

           e1_p1           e0.2_p1         e0.05_p1
GALILEO    5.17 ± 0.06     4.73 ± 0.13     4.70 ± 0.02
SL         5.15 ± 0.23     4.73 ± 0.31     23.64 ± 4.86
IPW        5.22 ± 0.09     5.50 ± 0.01     5.02 ± 0.07
SCIGAN     7.05 ± 0.52     6.58 ± 0.58     18.55 ± 3.50

           e1_p0.2         e0.2_p0.2       e0.05_p0.2
GALILEO    5.03 ± 0.09     4.72 ± 0.05     4.87 ± 0.15
SL         5.21 ± 0.63     6.74 ± 0.15     33.52 ± 1.32
IPW        5.27 ± 0.05     5.69 ± 0.00     20.23 ± 0.45
SCIGAN     16.07 ± 0.27    12.07 ± 1.93    19.27 ± 10.72

           e1_p0.05        e0.2_p0.05      e0.05_p0.05
GALILEO    5.23 ± 0.41     5.01 ± 0.08     6.17 ± 0.33
SL         5.89 ± 0.88     14.25 ± 3.48    37.50 ± 2.29
IPW        5.21 ± 0.01     5.52 ± 0.44     31.95 ± 0.05
SCIGAN     11.50 ± 7.76    13.05 ± 4.19    25.74 ± 8.30

Table 12: MISE results on TCGA. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

           t0_bias_2.0     t0_bias_20.0    t0_bias_50.0
GALILEO    0.34 ± 0.05     0.67 ± 0.13     2.04 ± 0.12
SL         0.38 ± 0.13     1.50 ± 0.31     3.06 ± 0.65
IPW        6.57 ± 1.16     6.88 ± 0.30     5.84 ± 0.71
SCIGAN     0.74 ± 0.05     2.74 ± 0.35     3.19 ± 0.09

           t1_bias_2.0     t1_bias_6.0     t1_bias_8.0
GALILEO    0.43 ± 0.05     0.25 ± 0.02     0.21 ± 0.04
SL         0.47 ± 0.05     1.33 ± 0.97     1.18 ± 0.73
IPW        3.67 ± 2.37     0.54 ± 0.13     2.69 ± 1.17
SCIGAN     0.45 ± 0.25     1.08 ± 1.04     1.01 ± 0.77

           t2_bias_2.0     t2_bias_6.0     t2_bias_8.0
GALILEO    1.46 ± 0.09     0.85 ± 0.04     0.46 ± 0.01
SL         0.81 ± 0.14     3.74 ± 2.04     3.59 ± 0.14
IPW        2.94 ± 1.59     1.24 ± 0.01     0.99 ± 0.06
SCIGAN     0.73 ± 0.15     1.20 ± 0.53     2.13 ± 1.75

Table 13: MMSE results on GNFC. We bold the lowest error for each task. ± is the standard deviation over three random seeds.

           e1_p1           e0.2_p1         e0.05_p1
GALILEO    3.86 ± 0.03     3.99 ± 0.01     4.07 ± 0.03
SL         5.73 ± 0.33     5.80 ± 0.28     18.78 ± 3.13
IPW        4.02 ± 0.05     4.15 ± 0.12     22.66 ± 0.33
SCIGAN     8.84 ± 0.54     12.62 ± 2.17    24.21 ± 5.20

           e1_p0.2         e0.2_p0.2       e0.05_p0.2
GALILEO    4.13 ± 0.10     4.11 ± 0.15     4.21 ± 0.15
SL         5.87 ± 0.43     7.44 ± 1.13     29.13 ± 3.44
IPW        4.12 ± 0.02     6.12 ± 0.48     30.96 ± 0.17
SCIGAN     12.87 ± 3.02    14.59 ± 2.13    24.57 ± 3.00

           e1_p0.05        e0.2_p0.05      e0.05_p0.05
GALILEO    4.39 ± 0.20     4.34 ± 0.20     5.26 ± 0.29
SL         6.12 ± 0.43     14.88 ± 4.41    30.81 ± 1.69
IPW        13.60 ± 7.83    26.27 ± 2.67    32.55 ± 0.12
SCIGAN     9.19 ± 1.04     15.08 ± 1.26    17.52 ± 0.02

Table 14: MMSE results on TCGA. We bold the lowest error for each task. ± is the standard deviation over three random seeds.
           t0_bias_2.0     t0_bias_20.0    t0_bias_50.0
GALILEO    1.56 ± 0.04     1.96 ± 0.53     3.16 ± 0.13
SL         1.92 ± 0.67     2.31 ± 0.19     5.11 ± 0.66
IPW        7.42 ± 0.46     5.36 ± 0.96     5.38 ± 1.24
SCIGAN     2.11 ± 0.47     5.23 ± 0.27     5.59 ± 1.02

           t1_bias_2.0     t1_bias_6.0     t1_bias_8.0
GALILEO    1.43 ± 0.06     1.09 ± 0.05     1.36 ± 0.36
SL         1.12 ± 0.15     3.65 ± 1.91     3.96 ± 1.81
IPW        1.14 ± 0.11     0.90 ± 0.09     2.04 ± 0.99
SCIGAN     3.32 ± 0.88     4.74 ± 2.12     5.17 ± 2.42

           t2_bias_2.0     t2_bias_6.0     t2_bias_8.0
GALILEO    3.77 ± 0.35     3.99 ± 0.40     2.08 ± 0.60
SL         2.70 ± 0.67     8.33 ± 5.05     9.70 ± 3.12
IPW        2.92 ± 0.15     3.90 ± 0.17     4.47 ± 2.16
SCIGAN     3.82 ± 2.12     1.83 ± 1.49     3.62 ± 4.9

Figure 21: Illustration of the averaged response curves of Supervised Learning (SL) in TCGA; panels (a)–(i) show the averaged response vs. action for t0_bias2/20/50, t1_bias2/6/8, and t2_bias2/6/8.
Figure 22: Illustration of the averaged response curves of Supervised Learning (SL) in GNFC; panels show the averaged response vs. action for the nine GNFC tasks e{1,0.2,0.05}_p{1,0.2,0.05}.
Figure 23: Illustration of the averaged response curves of Inverse Propensity Weighting (IPW) in TCGA; panels as in Fig. 21.
Figure 24: Illustration of the averaged response curves of Inverse Propensity Weighting (IPW) in GNFC; panels as in Fig. 22.
Figure 25: Illustration of the averaged response curves of SCIGAN in TCGA; panels as in Fig. 21.
Figure 26: Illustration of the averaged response curves of SCIGAN in GNFC; panels as in Fig. 22.
Figure 27: Illustration of the averaged response curves of GALILEO in TCGA; panels as in Fig. 21.
Figure 28: Illustration of the averaged response curves of GALILEO in GNFC; panels as in Fig. 22.