# Rehearsal Learning for Avoiding Undesired Future

Tian Qin, Tian-Zuo Wang, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
{qint, wangtz, zhouzh}@lamda.nju.edu.cn

**Abstract.** Machine learning (ML) models have been widely used to make predictions. Instead of a predictive statement about future outcomes, in many situations we want to pursue a decision: what can we do to avoid the undesired future if an ML model predicts so? In this paper, we present a rehearsal learning framework, in which decisions that can persuasively avoid the happening of undesired outcomes can be found and recommended. Based on the influence relation, we characterize the generative process of variables with structural rehearsal models, consisting of a probabilistic graphical model called rehearsal graphs and structural equations, and find actionable decisions that can alter the outcome by reasoning under a Bayesian framework. Moreover, we present a probably approximately correct bound to quantify the associated risk of a decision. Experiments validate the effectiveness of the proposed rehearsal learning framework and the informativeness of the bound.

## 1 Introduction

Machine learning (ML) has achieved great success in various applications, including computer vision [1], natural language processing [2], recommender systems [3, 4], etc. In addition to perception tasks such as classifying an image, ML models have been widely used to make predictions, or forecasts, about future quantities of interest [5, 6]. In many scenarios, however, what we ultimately pursue is not merely a predictive statement about what is likely to happen. Instead, as a potential next direction [7], we may seek a way, possibly through decision-making, to avoid the happening of the upcoming undesired future, if the ML model predicts so.

Suppose we use variables $X$ to predict a future outcome $Y$ with an ML model $h$. Given an instance $x$, $h$ outputs a warning signal $\hat{y} = h(x)$ that is outside our desired range. The problem is: what can we do to make the future $Y$ fall into the desired range? We assume that there is an intermediate stage where one can make actionable decisions to influence $Y$; e.g., after a sales prediction made at the beginning of a month, there is an intermediate time window before the month-end for the sales manager to take action. We consider decisions in the form of altering variables with fixed values, e.g., setting the discount to 10%. We henceforth call such decisions alterations. Let $Z$ denote the variables in the intermediate stage. The avoiding undesired future (AUF) problem is then how to alter variables in $Z$ so that the future $Y$ could be shifted to fall into the desired range.

Common decision-making methods under the umbrella of reinforcement learning (RL) [8] can be applied to the AUF problem but may not be the most appropriate.
The reasons include that (a) RL mainly focuses on sequential decision-making tasks, while AUF desires a direct decision to change the upcoming future outcome, for which considering decision sequences could be unnecessary or even unrealistic; (b) the success of RL in playing games [9] and autonomous control [10] relies on sufficient interactions between the agent and the environment, but interactions in real AUF problems can be extremely sparse, e.g., the sales manager in the above example can only interact with the environment and make decisions once a month; (c) the Markov decision process (MDP) formalism in RL abstracts decision-making into states, actions, and rewards, which may overlook useful fine-grained structural information in AUF, e.g., the connections between variables that can be altered may help identify useful actions without any exploration. Explicitly incorporating such connections in modeling the AUF problem would be preferable, since otherwise approximating them with MDPs may require unnecessarily large state or action spaces.

Figure 1: Relationship between correlation, influence, and causation [7, 11].

Therefore, we need a specific method for solving AUF that takes account of the structural information and avoids requiring a large number of interactions. An immediate thought would be seeking cause-effect relations between variables [12], which have been leveraged for some decision-related problems (see Section 5 for a discussion of related work). But as indicated by Zhou [7], causal relations can help but should not be taken as a prerequisite for decision-making, for reasons that (a) humans can make good decisions without a thorough or faithful causal understanding of the surrounding environment; sometimes correct decisions can even be made based on an incorrect causal understanding; (b) guiding decisions with causal relations is not always sensible, as causation reveals truths that always hold, whereas decision-making copes with real environments that are often open and dynamic; (c) causal modeling could be helpless for decision-making if the identified causal factors are unactionable; needless to say, it is very difficult to discover true cause-effect relations from data [13]. Recognizing that correlation is helpful for prediction but insufficient for decision-making, and that causation is needed for scientific discovery but too luxurious to be relied on for decision-making, Zhou [7] claimed that what is required for decision-making is a kind of intermediate relation that is stronger than correlation but less demanding than causation; this relation was later called the influence relation [11]. Moreover, Zhou [7] attributed a decision to a series of hypothesized rehearsals of possible actions, which is an intuitive way of leveraging or even discovering the influence among variables involved in a decision task. Fig. 1 depicts an intuitive demonstration of the relationship among correlation, influence, and causation.

Based on the concept of rehearsal, we present the structural rehearsal model (SRM) to model the AUF problem. The SRM consists of a new probabilistic graphical model called rehearsal graphs and a collection of generative structural equations.
In contrast to traditional structural causal models (SCMs) [12], the SRM allows dynamic modeling and accommodates the influence relation among variables, effectively capturing the interactions between interrelated but not necessarily causally linked variables. Given the true SRM, the effect of alterations can be calculated or estimated by conducting rehearsals. Consequently, the AUF problem can be addressed by searching for optimal alterations that yield the desired outcomes, though the search process could be complex and difficult.

Note that structural models such as SRMs are generally not available in advance, and learning them from limited observational data is difficult as well. In solving the AUF problem, one needs to tackle challenges posed by limited data, the uncertainty of structural models, and the possibly huge search space for finding an optimal alteration. Moreover, a system outputting decisions should provide some kind of guarantee on the probability of successfully avoiding the undesired future, especially in high-stake applications such as economic and safety tasks. The AUF problem can also come in an online fashion and induce an exploration-exploitation trade-off: a decision-maker can choose alterations that will reveal more structural information to benefit future decision-making, or alterations that will maximize the success probability in the current decision round.

To address the aforementioned challenges and tackle the AUF problem, we introduce a rehearsal learning framework that integrates structural rehearsal models and Bayesian inference. This framework effectively captures the inherent uncertainty in decision-making processes aimed at averting undesired outcomes by unifying structural modeling, alteration finding, and success probability bounding. Our contributions are as follows:

1. **Structural Rehearsal Model:** We propose SRM, a novel modeling approach distinct from SCM, designed to model the influence relation, which is more essential than causal relations for decision-making problems.
2. **Constrained Optimization for AUF:** We formulate a constrained optimization problem to solve the AUF problem, striking a balance between averting undesired future scenarios and accurately learning rehearsal models.
3. **Rehearsal Learning Framework:** We develop a rehearsal learning framework tailored for solving AUF, showing that building decisions based on the influence relation is practical and feasible. We propose specific learning methods for the basic linear case and show that optimal solutions are not attainable in polynomial time in the framework if joint alterations are allowed, assuming $P \neq NP$. This insight underscores the significance of developing approximate solutions over exact ones.
4. **Risk Quantification:** We introduce a probably approximately correct bound to quantify the associated risk inherent in a given decision. Our experiments provide empirical evidence of the informativeness of this bound.

## 2 Structural Rehearsal Models

Conventionally, SCMs [12] are used to describe causal relations among variables, based on which actions that affect the outcome could be found. However, in some real-world problems, especially those involving decisions, causal relations are not adequate. For example, let $P$ and $Q$ denote the prices of a product in two stores, respectively. If the first store decreases $P$ to attract customers, the second store will decrease $Q$ accordingly, because the second store will lose consumers otherwise. From a causal view, it seems that $P$ causes $Q$.
But obviously, $Q$ also causes $P$ by symmetry. So we have a bi-directional causal relation, which is invalid in causal modeling. Mistakes will occur if one applies SCMs to this example, as only a one-way causal relation is allowed in SCMs. The issue of such interrelated but not necessarily causally linked variables, as well as several other issues such as the dynamic and time-dependent nature of decision-making, which will be discussed later, are taken into account in the following new structural model called the structural rehearsal model (SRM).

An SRM consists of a set of rehearsal graphs and corresponding structural equations. A rehearsal graph qualitatively describes the relations between variables, and the structural equations characterize the generating process of variables in detail. In contrast to static causal modeling, SRM allows dynamic modeling by defining both the graphs and equations over a time index $t$, which accounts for possible evolutions of the environment. We represent an SRM with $\{\langle G_t, \theta_t \rangle\}_t$, where $G_t$ denotes the rehearsal graph at time $t$ and $\theta_t$ parameterizes the corresponding structural equations. Rehearsal graphs allow both directional and bi-directional edges. The directional edges depict the generation ordering of variables, and the bi-directional ones indicate interrelated variables that are mutually influenced. Fig. 2a depicts an example. The definition of rehearsal graphs is given below:

**Definition 1** (Mixed graph). Let $G = (V, E)$ be a graph, where $V$ denotes the vertices and $E$ the edges. $G$ is a mixed graph if for any distinct vertices $u, v \in V$, there is at most one edge connecting them, and the edge is either directional ($u \to v$ or $u \leftarrow v$) or bi-directional ($u \leftrightarrow v$).

**Definition 2** (Bi-directional clique). A bi-directional clique $C = (V_c, E_c)$ of a mixed graph $G = (V, E)$ is a complete subgraph induced by $V_c \subseteq V$ such that every edge $e \in E_c$ is bi-directional. $C$ is maximal if adding any other vertex does not induce a bi-directional clique.

**Definition 3** (Rehearsal graph). Let $G = (V, E)$ be a mixed graph. Let $\{C_i\}_{i=1}^{l}$ denote all maximal bi-directional cliques of $G$, where $C_i = (V^c_i, E^c_i)$. $G$ is a rehearsal graph if and only if:

1. $V^c_i \cap V^c_j = \emptyset$ for any $i \neq j$.
2. $\forall i \in [l]$ and $\forall u \in V \setminus V^c_i$, if there is any edge pointing from $u$ to $V^c_i$, then $\forall v \in V^c_i$, $u \to v$.
3. The directional edges permit a topological ordering for $\{C_i\}_{i=1}^{l}$.

Each vertex in a rehearsal graph corresponds to a variable, and the variables are generated following the ordering depicted by the directional edges. The variables in a common maximal bi-directional clique, where only bi-directional edges exist, are mutually influenced instead of having a fixed generating order. Given a rehearsal graph $G$, the structural equations accompanying the rehearsal graph are defined over the cliques $\{(V^c_i, E^c_i)\}_{i=1}^{l}$. Let $\mathrm{PA}^G_i \equiv \{u \mid \exists v \in V^c_i, u \to v \text{ in } G\}$ denote the parents of the $i$-th maximal bi-directional clique. Suppose that the unobserved noise variables are Gaussian; then the structural equation describing the generation process of $V^c_i$ is parameterized by $\{\beta_i, \Sigma_i\}_{i=1}^{l} \subseteq \theta$:

$$V^c_i := f_i(\mathrm{PA}^G_i; \beta_i) + \varepsilon_i, \tag{1}$$

where $f_i$ is a multi-valued function parameterized by $\beta_i$ and $\varepsilon_i \sim \mathcal{N}(0, \Sigma_i)$ denotes the noise.

We denote the operation that alters variable $X$ with a fixed value $x$ by $\mathrm{Rh}(X = x)$, meaning that we perform a rehearsal of setting $X$ to $x$ through realistic or hypothetical means. When $\mathrm{Rh}(X = x)$ is applied, we remove the incoming arrows of $X$ in the original rehearsal graph $G$ to get an alteration rehearsal graph $G_{X=x}$, and the accompanying structural equations are modified according to the newly introduced parental relations in $G_{X=x}$, where a new set of parameters in $\theta$ could be introduced for the unseen parental relations to offer more flexible modeling abilities. The distribution $P(V \mid \mathrm{Rh}(X = x))$ of all variables after applying $\mathrm{Rh}(X = x)$ can be derived from Eq. (1). Example illustrations are in Figs. 2b-2c. The invalid bi-directional causal relations are valid in rehearsal graphs: $P \leftrightarrow Q$ becomes $P \to Q$ when applying $\mathrm{Rh}(P)$ and becomes $P \leftarrow Q$ when applying $\mathrm{Rh}(Q)$, capturing the interrelated influence between the two variables.

Figure 2: (a) displays an example rehearsal graph. (b) and (c) display its alteration graphs with rehearsal operations on $\{F\}$ and $\{B, E, F\}$, respectively. We use plate notation to simplify the edges: if an endpoint of an edge falls exactly on the boundary of a rectangle, then there are edges for every vertex in that rectangle. (d) shows a full SRM $\{\langle G_t, \theta_t \rangle\}_t$, where $t$ is the time index and the rehearsal graph $G_t$ and parameters $\theta_t$ can evolve over time. Recall the example of two stores and let the variables follow the depicted SRM. Suppose that at $t = 1$, the second store takes a new marketing strategy that no longer follows the price of the first store; then $P$ will not directly affect $Q$: the generation process evolves to $P \leftarrow Q$ in $G_1$ instead of $P \leftrightarrow Q$ in $G_0$. Then for the first store, a previously correct decision at $t = 0$, which aims to affect $C$ by altering $P$, is ineffective at $t = 1$.
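To make Eq. (1) and the $\mathrm{Rh}(\cdot)$ operation concrete, here is a minimal Python sketch of our own, not code from the paper: it generates variables clique by clique in topological order and clamps altered variables, which is how removing their incoming edges manifests in sampling. The clique structure, variable names, and all parameter values below are hypothetical, and for simplicity the unaltered members of an altered clique keep their original parameters, whereas a full alteration graph would introduce new parameter sets.

```python
# Illustrative sketch (not the paper's code): a linear SRM over maximal
# bi-directional cliques and the Rh(.) operation. Clique structure, variable
# names, and all parameter values below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Cliques in a topological ordering; each is generated as
#   V_c := beta^T PA + eps,  eps ~ N(0, Sigma)   (cf. Eq. (1))
cliques = [
    {"vars": ["X"], "parents": [], "beta": np.zeros((0, 1)),
     "Sigma": np.eye(1)},
    {"vars": ["P", "Q"], "parents": ["X"], "beta": np.array([[0.5, 0.8]]),
     "Sigma": np.array([[1.0, 0.6], [0.6, 1.0]])},  # correlated clique noise
    {"vars": ["Y"], "parents": ["P", "Q"], "beta": np.array([[1.0], [-0.4]]),
     "Sigma": 0.1 * np.eye(1)},
]

def rehearse(cliques, alteration):
    """Draw one joint sample after Rh(alteration); alteration maps var -> value."""
    values = {}
    for c in cliques:
        pa = np.array([values[p] for p in c["parents"]])
        eps = rng.multivariate_normal(np.zeros(len(c["vars"])), c["Sigma"])
        gen = (c["beta"].T @ pa if pa.size else np.zeros(len(c["vars"]))) + eps
        for j, v in enumerate(c["vars"]):
            # Clamping an altered variable realizes the removal of its
            # incoming edges in the alteration rehearsal graph.
            values[v] = alteration.get(v, gen[j])
    return values

print(rehearse(cliques, alteration={"P": 1.0}))  # rehearsal Rh(P = 1.0)
```

Repeatedly calling `rehearse` with the same alteration yields samples from $P(V \mid \mathrm{Rh}(P = 1.0))$ under this hypothetical model.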
An important feature of SRM that distinguishes it from SCM is the capability of modeling dynamic and time-dependent real-world decision-making environments, where variable relations may evolve and correct decisions can vary dramatically at different times even for the same problem [14]. It is worth noting that these dynamic environments should not be confused with the conventional notion of dynamic systems, where the relationships among variables simply repeat over time. As causal modeling seeks to identify enduring cause-and-effect relationships, SCMs can be helpful in tasks such as "AI for science" [15, 16, 17], where the objective is to uncover persistent scientific truths. However, SCMs can be restrictive in describing the dynamic nature of decision problems. In contrast, the dynamic and time-dependent feature of environments is captured by SRM using evolving rehearsal graphs and corresponding parameters: at time $t$, the generating process characterized by $\langle G_t, \theta_t \rangle$ can differ from previous ones. Fig. 2d illustrates an example where correct decisions vary across time and an SRM $\{\langle G_t, \theta_t \rangle\}_t$ is used to model the dynamic decision environment. Although $\theta_t$ can be learned from data with general supervised learning methods given a fixed graphical structure, rehearsal graphs generally cannot be uniquely identified from data. We present basic properties of rehearsal graphs and a preliminary graph class learning procedure in Appendix A.

## 3 The AUF Problem

We treat AUF as a multi-round online decision-making problem, where in the $t$-th decision round, an agent makes decisions to affect the outcome if it receives an undesired prediction, serving as a warning signal, from an ML model. In real problems, variables involved in each decision round generally do not appear simultaneously; e.g., if a single decision round spans a full month, then a decision-maker may encounter new variables every day. Ignoring the variable generation order can be problematic, but modeling the decision process with refined time granularity is too demanding.
Therefore, we make a midway proposal. We identify two important time points in AUF, the time the ML prediction is made and the time just before the generation of the concerned outcome, so the variables appearing in the $t$-th decision round fall into three consecutive time segments separated by the two time points: $X_t$, variables that appeared before the prediction is made; $Z_t$, variables that appeared after the prediction and before the generation of the outcome; and $Y_t$, the outcome variable. After the prediction, if the agent decides to do something to change the future outcome, the only variables it can alter are those in $Z_t$, since $X_t$ has already happened and cannot be altered. For example, suppose that in the $t$-th month, a sales manager predicts the month's sales $Y_t$ on the first day of that month and designs promotions in that month accordingly. Then $X_t$ can be marketing variables that appeared before the first day, and $Z_t$ can be variables in the remaining days of that month, such as the price next week, which can be altered to affect the month's sales.

Figure 3: Rehearsal graph $G_t$ in the $t$-th round. The timeline shows the order in which the three sets of variables are generated and the events that happen to an agent at different times in one round: observe $X_t$, make a prediction, conduct alterations, and observe $Z_t$.

More specifically, the decision process in the $t$-th round is: an agent first observes $X_t$, with which an ML model gives a prediction $\hat{Y}_t$ for the outcome; if $\hat{Y}_t \notin S$, where $S$ denotes the set of desired values, the agent should perform alterations on $Z_t$ to prevent $Y_t \notin S$; after the alteration, the agent can observe the full $Z_t$ and the outcome $Y_t$. We assume that the generation mechanism of the encountered variables can be described by an SRM $\{\langle G_t, \theta_t \rangle\}_t$, where $\langle G_t, \theta_t \rangle$ characterizes the generation process in the $t$-th round. Fig. 3 depicts an example. For simplicity, we assume that the environment is stable, i.e., $\langle G_t, \theta_t \rangle = \langle G, \theta \rangle$ for all $t$. Note that the agent does not have access to the true SRM; instead, it can access historical data $D = \{(x_i, z_i, y_i)\}_{i=1}^{m}$, from which some information about the SRM can be learned.

In real scenarios, some variables cannot be manually altered or cannot be set to certain values. We denote the set of feasible alteration values for an actionable variable $Z_i \in Z$ by $\mathcal{A}(Z_i)$, which we assume to be a closed interval. An alteration is denoted by $\xi_t = \{(Z_{a_i}, z_{a_i})\}_{i=1}^{k}$, where $Z_{a_i} \in Z$ is the variable to be altered, $z_{a_i} \in \mathcal{A}(Z_{a_i})$ is the alteration value, and $k$ is the alteration size. Applying $\xi_t$ is treated as performing a rehearsal $\mathrm{Rh}(\xi_t) \equiv \mathrm{Rh}(\{Z_{a_i} = z_{a_i}\}_{i=1}^{k})$ on $G$. The overall AUF problem is then: given $D$ and $x_t$, $t = 1, \ldots, T$, that arrive sequentially, find alterations $\xi_t$ in each round $t$ to successfully avoid the undesired future as many times as possible. The problem can be formulated as

$$\max_{\xi_{1:T}} \; \mathbb{E}_{\varepsilon_{1:T}} \left[ \sum_{t=1}^{T} \mathbb{I}\left(Y_t \in S \mid x_t, \mathrm{Rh}(\xi_t)\right) \right], \tag{2}$$

where $\varepsilon$ is the noise in Eq. (1) and $\mathbb{I}(\cdot)$ is the indicator function. If the underlying SRM is known, one can obtain $P(Y_t \mid x_t, \mathrm{Rh}(\xi_t))$, and the only issue is searching for a $\xi_t$ that maximizes $P(Y_t \in S \mid x_t, \mathrm{Rh}(\xi_t))$; the success probability can then be estimated by repeated rehearsal, as in the sketch below. The main obstacle to solving AUF, however, is the uncertainty of the SRM. The uncertainty leads to an exploration-exploitation trade-off: in each round, if the agent chooses alterations that help reveal the true SRM (exploration), a more accurate environment model may be obtained and benefit decisions in future rounds, but it may fail in the current round. On the other hand, if the agent chooses a short-sighted decision that successfully avoids the upcoming undesired future (exploitation), it may learn little about SRMs, making future optimizations difficult. In addition, the effect of alterations can propagate to downstream variables through the directional edges in a rehearsal graph; thus the value of unactionable variables may be affected by altering the actionable ones. A feasible approach should take account of the uncertainty, the exploration-exploitation trade-off, and the propagation of effects in the online decision process, and yield reasonable decisions.
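For intuition about the objective in Eq. (2), the following sketch (again ours, reusing the hypothetical `rehearse` helper and `cliques` from the SRM sketch above) estimates $P(Y_t \in S \mid x_t, \mathrm{Rh}(\xi_t))$ by repeated rehearsal when the SRM is known; clamping the root variable `X` plays the role of conditioning on the observed $x_t$.

```python
# Illustrative sketch: Monte-Carlo estimate of P(Y_t in S | x_t, Rh(xi))
# under a known SRM, reusing the hypothetical `rehearse` helper above.
# Clamping "X" encodes conditioning on the observed x_t (X is a root here).
def estimate_success_prob(cliques, x_t, xi, in_S, n=1000):
    clamp = dict(xi)
    clamp["X"] = x_t                     # condition on the observed context
    hits = sum(in_S(rehearse(cliques, clamp)["Y"]) for _ in range(n))
    return hits / n

# e.g., probability that Y lands in [0.8, 1.0] after Rh(P = 1.0) given X = 0.3
p = estimate_success_prob(cliques, 0.3, {"P": 1.0}, lambda y: 0.8 <= y <= 1.0)
```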
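As a small worked example of the polytope form of $S$ in Eq. (3): a one-dimensional desired region such as $[0.8, 1.0]$ (the range used for the ride-hailing experiment in Section 6) can be written as two half-space constraints. The sketch below is illustrative, not the paper's code.

```python
# Illustrative sketch: a desired region [0.8, 1.0] as the convex polytope
# S = {y : M y <= d} of Eq. (3) via two half-space constraints:
#   y <= 1.0  and  -y <= -0.8.
import numpy as np

M = np.array([[1.0], [-1.0]])
d = np.array([1.0, -0.8])

def in_S(y):
    """Membership test for a (possibly vector-valued) outcome y."""
    return bool(np.all(M @ np.atleast_1d(y) <= d))

print(in_S(0.9), in_S(0.5))  # True False
```

Higher-dimensional outcomes work the same way: each row of $M$ and entry of $d$ contributes one linear constraint, so boxes and general convex polytopes are covered alike.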
## 4 Rehearsal Learning

In this section, we present a rehearsal learning framework to address the AUF problem in Eq. (2) and present specific methods for the linear case in Eq. (3), in which we use sampling and rehearsal of actions as a basic building block to find and evaluate decisions. To account for the uncertainty introduced by limited data, we resort to Bayesian inference. The state of knowledge of the SRM is described by the posterior distribution

$$P(G, \theta \mid D_t) = P(G \mid D_t)\, P(\theta \mid G, D_t), \tag{4}$$

where $D_t$ is the evidence collected until the beginning of round $t$, consisting of the initial observational data $D$ and $\{(x_i, \xi_i, z_i, y_i)\}_{i=1}^{t-1}$. To handle the exploration-exploitation trade-off between choosing alterations that help recover the true SRM and alterations that can avoid the undesired future in the current round, we leave the choice to the agent by introducing a hyper-parameter $\tau$ that balances these two kinds of alterations. In the $t$-th round, we propose to solve

$$\max_{\xi_t} \; I\left(G, \Theta; Z_t, Y_t \mid D_t, x_t, \mathrm{Rh}(\xi_t)\right) \tag{5}$$

$$\text{s.t.} \quad \mathbb{E}_{G, \theta \mid D_t}\left[\mathbb{I}\left(Y_t \in S \mid G, \theta, x_t, \mathrm{Rh}(\xi_t)\right)\right] \geq \tau. \tag{6}$$

The constraint in Eq. (6) reflects the required performance: performing the alteration should make the belief that the desired future can be successfully achieved greater than $\tau$, given the knowledge of the SRM so far. The objective function in Eq. (5) denotes the mutual information between the SRM and the observations revealed after the alteration. Overall, the solution to the optimization problem should avoid the undesired future and meanwhile be most informative about the true SRM. The proposed optimization problem introduces computational challenges, e.g., one has to sum over a super-exponential number of rehearsal graphs to compute the mutual information or the success probability.

We divide rehearsal learning into three approximately tractable components: Bayesian update of SRMs, candidate alteration selection, and mutual information maximization. To better guide decision-making, we also present a sampling-based probably approximately correct (PAC) bound to quantify the risk associated with the selected alteration.

### 4.1 Bayesian Update of SRMs

Recall that the posterior of SRMs is divided into the posterior of graphs $P(G \mid D_t)$ and that of structural equation parameters $P(\theta \mid G, D_t)$ in Eq. (4). For the former term, we adopt a bootstrapping-based approximation [18], which learns multiple rehearsal graphs on bootstrapped data to build a set of graphs $\mathcal{G}$ and weighs each graph $G' \in \mathcal{G}$ equally. The approximated posterior of graphs is

$$P(G \mid D_t) \approx \frac{1}{|\mathcal{G}|} \sum_{G' \in \mathcal{G}} \mathbb{I}(G = G'). \tag{7}$$

Consequently, for the second term $P(\Theta \mid G, D_t)$, we only need to estimate parameters corresponding to graphs that appear in $\mathcal{G}$, which greatly reduces computation. The parameter posterior factorizes as

$$P(\theta \mid G, D_t) = \prod_{G' \in A(G),\; V^c \in G'} P\left(\beta_{V^c}, \Sigma_{V^c} \mid G', D_t\right),$$

where $A(G)$ is a set of graphs that includes $G$ and the related alteration graphs, and $V^c$ denotes a maximal bi-directional clique. $P(\beta_{V^c}, \Sigma_{V^c} \mid G', D_t)$ is the posterior of a set of regression parameters and noise variance given data of $V^c$ and its parents in $G'$, so various Bayesian learning methods [19] are applicable for its estimation. We adopt Bayesian ridge regression [20] to learn the parameters. In addition, as graph learning is time-consuming, we use an incremental implementation: we build an initial graph posterior from purely observational data for $t = 0$ using Eq. (7), then update the posterior with

$$P(G \mid D_t) \propto P(G \mid D_{t-1}) \cdot \frac{1}{n} \sum_{i=1}^{n} P\left(z_t, y_t \mid x_t, \xi_t, G, \theta_i\right),$$

where $\theta_i$ is sampled from $P(\theta \mid G, D_{t-1})$ and the averaged summation term is an empirical estimate of the likelihood. With the above approximations and learning, we effectively obtain an approximate posterior and an efficient sampler over SRMs, which forms the basis of the following two steps.

### 4.2 Candidate Alteration Selection

In the second part, we find candidate alterations that satisfy the performance constraint in Eq. (6). The constraint cannot be computed exactly, as it takes an expectation over discrete $G$ and continuous $\theta$. To reduce the computational burden, we estimate the constraint with posterior samples. Note that the value of the variables $V$ is uniquely determined by $G$, $\theta$, and the noise $\varepsilon$ in Eq. (1). We can evaluate all alterations by generating $Y_t$ from a common set of SRM and noise samples. Suppose we have an i.i.d. sample $S = \{\langle G_i, \theta_i, \varepsilon_i \rangle\}_{i=1}^{n}$; the empirical estimate of Eq. (6) is $\hat{p}_{\xi_t}(S) \equiv |\{i \mid y^i_{t,\xi_t} \in S\}| / n$, where $y^i_{t,\xi_t}$ is the concerned outcome generated following the SRM specified by $\langle G_i, \theta_i, \varepsilon_i \rangle$ and $\mathrm{Rh}(\xi_t)$. A general method for candidate alteration selection has three stages (a minimal sketch follows this list):

1. Sample two sets of i.i.d. parameters, the training set $S_{tr} = \{\langle G_i, \theta_i, \varepsilon_i \rangle\}_{i=1}^{n}$ and the validation set $S_{val} = \{\langle G_i, \theta_i, \varepsilon_i \rangle\}_{i=n+1}^{2n}$, from $P(G, \theta \mid D_t)$ and $P(\varepsilon \mid \theta)$.
2. Build a candidate alteration set $C$ by conducting rehearsals on $S_{tr}$ and finding alterations $\xi_t$ that achieve the required performance $\tau$, i.e., $\forall \xi_t \in C$, $\hat{p}_{\xi_t}(S_{tr}) \geq \tau$.
3. Validate the alterations in $C$ by rehearsals on $S_{val}$ and remove those with $\hat{p}_{\xi_t}(S_{val}) < \tau$ from $C$.

The validation procedure alleviates overfitting, as in standard machine learning pipelines. If stage 2 or 3 outputs an empty set, the system should refuse to make any recommendations. The agent can lower $\tau$ and restart if a smaller $\tau$ is still adequate for the task at hand.
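The following minimal sketch (ours) illustrates the three stages; `sample_srm` and `success` are hypothetical stand-ins for, respectively, the posterior sampler obtained in Section 4.1 and a routine that rehearses $\mathrm{Rh}(\xi)$ on a sampled SRM and tests $Y_t \in S$. It searches size-1 alterations over a coarse value grid, one of the discretization heuristics discussed next.

```python
# A minimal sketch of the three-stage candidate selection (not the paper's
# code). `sample_srm` and `success` are hypothetical stand-ins: the former
# draws a fully specified <G, theta, eps> triple from the posterior, the
# latter rehearses Rh(xi) on that triple and tests whether Y lands in S.
import numpy as np

def p_hat(samples, xi, success):
    """Empirical estimate of the constraint in Eq. (6) over sampled SRMs."""
    return np.mean([success(srm, xi) for srm in samples])

def select_candidates(sample_srm, success, targets, grid, tau, n=200):
    S_tr = [sample_srm() for _ in range(n)]       # stage 1: training samples
    S_val = [sample_srm() for _ in range(n)]      #          validation samples
    C = [{Z: z} for Z in targets for z in grid
         if p_hat(S_tr, {Z: z}, success) >= tau]              # stage 2
    C = [xi for xi in C if p_hat(S_val, xi, success) >= tau]  # stage 3
    return C  # an empty C means the system should refuse to recommend
```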
For a given alteration, validation is relatively easy to perform. The only issue to address is to find a set of alterations that pass the test on $S_{tr}$. An immediate approach would be enumerating possible alterations and conducting rehearsals on training samples to check the constraint. But the enumeration can be computationally prohibitive, as the alteration space can be combinatorial and continuous. A compromise between the optimality of the recommended alterations and computational feasibility is to work on some finite subset of alterations using discretization tricks or heuristics. Fortunately, due to the linear structure of the SRMs considered here, Prop. 5 shows that the outcome variables are linear in the alteration values, implying the existence of more efficient and effective solutions.

**Proposition 5.** Let $\xi = \{(Z_{a_i}, z_{a_i})\}_{i=1}^{k}$ be a valid alteration, where $Z_{a_i} \in Z_t$ and $z_{a_i} \in \mathcal{A}(Z_{a_i})$. Given $G$, $\theta$, $\varepsilon$, and $x_t$, the outcome variables are linear in the alteration values. We have

$$Y_t = A x_t + B z_\xi + C \varepsilon, \tag{8}$$

where $z_\xi = (z_{a_1}, \ldots, z_{a_k})^{\mathsf{T}}$. $A$, $B$, and $C$ are constant matrices of appropriate shapes and are uniquely determined by $G$, $\theta$, and the altered variables $Z_\xi = \{Z_{a_i}\}_{i=1}^{k}$.

**Algorithm 1** Finding candidate alterations of size 1
Input: $S_{tr} = \{\langle G_i, \theta_i, \varepsilon_i \rangle\}_{i=1}^{n}$, $S_{val} = \{\langle G_j, \theta_j, \varepsilon_j \rangle\}_{j=n+1}^{2n}$
1: $C \leftarrow \emptyset$ ▷ Store candidate alterations
2: for $Z_i \in Z$ do
3: $\quad int_1 \leftarrow \mathrm{FIND}(S_{tr}, Z_i, \tau)$ ▷ Training
4: $\quad int_2 \leftarrow \mathrm{FIND}(S_{val}, Z_i, \tau)$ ▷ Validation
5: $\quad C \leftarrow C \cup \{\langle Z_i, int_1 \cap int_2 \rangle\}$
6: function $\mathrm{FIND}(S, Z, \tau)$ ▷ Find alteration values for $Z$
7: $\quad int \leftarrow \emptyset$ ▷ Store valid intervals
8: $\quad$ for $\langle G, \theta, \varepsilon \rangle \in S$ do
9: $\quad\quad [u, v] \leftarrow$ solution of $M(Ax + Bz_\xi + C\varepsilon) \leq d$ ▷ $A$, $B$, $C$ are determined by $G$, $\theta$, $Z$
10: $\quad\quad int \leftarrow int \cup \{[u, v] \cap \mathcal{A}(Z)\}$
11: $\quad$ return $\{z \mid n\tau \leq \#\text{intervals in } int \text{ containing } z\}$
Output: the found alterations $C$

**Size-1 Alteration.** We consider cases where an agent can alter only one variable in each round, i.e., $k = 1$. Suppose the altered variable in round $t$ is $Z_i$. Provided with a fixed SRM, along with fixed noise, Eq. (8) reduces to a line in $\mathbb{R}^{|Y|}$ with direction $B \in \mathbb{R}^{|Y|}$. Due to the convexity of the desired region $S$, if altering $Z_i$ can make $Y \in S$, then the set of feasible alteration values can be represented by an interval $[u, v]$ (the endpoints can be infinite). Finding candidate alterations on the training data is then equivalent to finding the values that intersect at least $n\tau$ of the intervals obtained from samples in $S_{tr}$, which is easy to compute (see the sketch at the end of this subsection). Alg. 1 includes the basic training and validation stages of finding candidate alterations. The overall time complexity is $O(|Z_t| \cdot n \log n)$. A more detailed algorithm and running time analysis are given in Appendix B.

**Size-k Alteration.** By relating to the NP-hard maximum feasible subsystem problem [21], we prove Thm. 6, showing that finding a single $\xi$ that satisfies the constraint, which is much simpler than finding all valid alterations, is difficult if $P \neq NP$. This result suggests that we should focus on approximate solutions rather than exact ones. We also give a mixed integer linear programming approach in Appendix B.

**Theorem 6.** Unless $P = NP$, finding a $k$-dimensional $z_\xi \in \mathcal{A}(Z_\xi)$ that satisfies $\sum_{i=1}^{n} \mathbb{I}\left(M(A_i x + B_i z_\xi + C_i \varepsilon_i) \leq d\right) \geq n\tau$ (if there is a valid solution) is not solvable with any algorithm of running time polynomial in $k$ and $n$.
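The last step of $\mathrm{FIND}$ (line 11 of Alg. 1) asks for the values covered by at least $n\tau$ of the collected intervals; a standard endpoint sweep computes this in $O(n \log n)$, which is where the overall $O(|Z_t| \cdot n \log n)$ complexity comes from. The sketch below is our illustration, ignoring ties between closed-interval endpoints for brevity.

```python
# Illustrative sketch (ours) of line 11 of Alg. 1: given the feasible
# intervals collected from the n training samples, return the region covered
# by at least ceil(n * tau) of them via an O(n log n) endpoint sweep.
import math

def covered_regions(intervals, n, tau):
    """intervals: list of (u, v); returns sub-intervals with enough coverage."""
    need = math.ceil(n * tau)
    events = sorted([(u, +1) for u, _ in intervals] +
                    [(v, -1) for _, v in intervals])
    out, count, start = [], 0, None
    for x, delta in events:
        count += delta
        if count >= need and start is None:
            start = x                    # coverage reaches the threshold
        elif count < need and start is not None:
            out.append((start, x))       # coverage falls below the threshold
            start = None
    return out

print(covered_regions([(0, 2), (1, 3), (1.5, 4)], n=3, tau=0.6))  # [(1, 3)]
```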
### 4.3 Mutual Information Maximization

The remaining task is to optimize the mutual information (5) over the found candidate alteration set $C$. We factorize the objective function into two information entropy terms over $Z_t$ and $Y_t$:

$$I\left(G, \Theta; Z_t, Y_t \mid D_t, x_t, \xi_t\right) = H\left(Z_t, Y_t \mid D_t, x_t, \xi_t\right) - H\left(Z_t, Y_t \mid G, \Theta, D_t, x_t, \xi_t\right). \tag{9}$$

Eq. (9) can be estimated with samples from $P(G, \theta \mid D_t)$ and $P(z_t, y_t \mid G, \theta, x_t, \xi_t)$. Due to the sampling involved in the estimation procedure, Eq. (9) is expensive to evaluate. The continuous input space makes it difficult to find an optimal alteration value for a fixed alteration target $Z_\xi$. Thus, we find $z_\xi$ with Bayesian optimization [22], which seeks to optimize a black-box function $g(z)$ with a small number of evaluations by iteratively querying function values at some data points. We run the optimization procedure for every alteration target in $C$ and find the corresponding best alteration values. Finally, the alteration that gives the maximum mutual information is selected as the recommended output. Details about the estimation procedure and Bayesian optimization are given in Appendix C.

Given a decision, it is important to know how likely the decision is to successfully avoid the undesired future. Though the candidate selection procedure already rules out potentially terrible alterations using $S_{tr}$ and $S_{val}$, a more refined measure is preferable. Based on the scenario approach in the robust control literature [23, 24], we derive a PAC-style bound on the posterior success probability in Thm. 7. The bound is easy to compute, as it only requires posterior samples. The practical implication of the bound is that it quantifies the uncertainty, or risk, associated with the alteration and thus can help make better decisions. Notably, the bound holds for any SRM with arbitrary additive noise, which could be helpful for future extensions with non-Gaussian noise and nonlinear SRMs.

**Theorem 7.** Given an observed evidence $x_t$, an alteration $\xi$, and a desired set $S = \{y \in \mathbb{R}^{|Y|} \mid My \leq d\}$, let $S_{eval} = \{\langle G_i, \theta_i, \varepsilon_i \rangle\}_{i=1}^{n}$ be $n$ i.i.d. samples from $P(G, \theta, \varepsilon \mid D_t)$. Let $\{y_i\}_{i=1}^{n}$ be generated from $\mathrm{Rh}(\xi)$ and $S_{eval}$ following the structural equations in Eq. (1). Define $n_o \equiv |\{i \mid y_i \notin S\}|$. Let $\hat{p} \equiv 1 - n_o/n$; let $\underline{p} \equiv 1 - F^{-1}(1 - \delta/(2n);\, n_o + 1,\, n - n_o)$ if $n_o < n$ and $\underline{p} \equiv 0$ otherwise; and let $\overline{p} \equiv 1 - F^{-1}(\delta/(2n);\, n_o,\, n - n_o + 1)$ if $n_o > 0$ and $\overline{p} \equiv 1$ otherwise, where $F^{-1}(\cdot\,; \alpha, \beta)$ is the inverse cumulative distribution function of the beta distribution with parameters $\alpha$ and $\beta$. For any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have

$$\max\left\{\underline{p},\; \hat{p} - \sqrt{\ln(2/\delta)/2n}\right\} \;\leq\; P\left(Y_t \in S \mid D_t, x_t, \mathrm{Rh}(\xi)\right) \;\leq\; \min\left\{\overline{p},\; \hat{p} + \sqrt{\ln(2/\delta)/2n}\right\}.$$
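Computing the interval in Thm. 7 only requires counting and beta quantiles. Here is a minimal sketch of ours, assuming SciPy for the inverse beta CDF; $n$ posterior rehearsals were generated, of which $n_o$ fell outside $S$.

```python
# Illustrative computation of the interval in Thm. 7 (ours, assuming SciPy).
import math
from scipy.stats import beta

def pac_bounds(n, n_o, delta):
    p_hat = 1 - n_o / n                      # empirical success probability
    p_lo = 1 - beta.ppf(1 - delta / (2 * n), n_o + 1, n - n_o) if n_o < n else 0.0
    p_hi = 1 - beta.ppf(delta / (2 * n), n_o, n - n_o + 1) if n_o > 0 else 1.0
    slack = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(p_lo, p_hat - slack), min(p_hi, p_hat + slack)

print(pac_bounds(n=1000, n_o=300, delta=0.05))  # interval for P(Y_t in S | ...)
```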
### 4.4 Discussion

The proposed rehearsal learning framework is general and flexible, with any of the three steps adjustable without interfering with the others. For scenarios more complicated than linear ones, one can replace the methods in each step accordingly: the Bayesian update of SRMs can use any advanced Bayesian (structure) learning methods [25, 26, 27, 28] as long as they allow sampling from the posterior; the mutual information maximization step can leverage any black-box optimization methods [22, 29]; and for the most important and difficult candidate alteration finding step, which is hard even for the linear case when joint alterations are allowed, efficient heuristics or linearization tricks can be used. Given the great uncertainty of real-world problems, exactly solving the AUF problem is extremely difficult if not impossible, while with the rehearsal learning framework we provide a practical and systematic approximate solution that tackles the problem with certain guarantees, which shows that developing more efficient approximate solutions is a promising future direction.

For the SRM, there are some practical considerations. After a rehearsal operation, the change of the parental relations results in new sets of parameters. In the absence of further constraints, as is the case in this work, these new parameters may bear no direct resemblance to the previous ones. However, it is conceivable to impose specific assumptions on these parameters to achieve more efficient modeling. For instance, one may assume the new parameters are connected with the original ones through operations like marginalization. In practical implementations, it is also reasonable to limit the size of a clique and maintain the number of parameter sets at an acceptable level. Moreover, it is unlikely that we would need to, or be able to, alter every possible subset of a clique, so one can reduce unnecessary computational burden by ignoring the parameters that would not be used.

## 5 Related Work

There have been some efforts to combine structural models with decision-making, most of them focusing on causal structures [30]. Some work explores the causal bandit problem, where the causal structure underlying rewards and actions is considered [31, 32, 33, 34, 35, 36]. In a different vein, Aglietti et al. [37] investigated a Bayesian optimization problem with a known causal DAG, and Aglietti et al. [38] extended this work to a dynamic setting. However, these studies typically assume that the true causal graph is known, a condition that is challenging to meet in real-world applications, although causal discovery methods could help to some extent [39, 40, 41, 42, 43, 44]. To remove this assumption, de Kroon et al. [45] leveraged separating sets, while Lu et al. [46] focused on the Markov equivalence class of the true causal graphs. However, even when causal relations are known, applying causal bandits or causal Bayesian optimization to AUF is not suitable, because they often seek a single universally optimal action. In contrast, optimal decisions in AUF can vary depending on the context $X$.

Some research approaches decision problems from the perspective of causal structures and causal effect estimation. For example, Wang et al. [47] developed a method for estimating a set of possible causal effects to aid decision-making. Additionally, there has been active research on estimating causal structures or effects in interactive environments [48, 49, 50, 51, 52]. The aforementioned efforts yielded effective methods for certain decision-related problems, but all of them rely on causal modeling, which could be too luxurious and restrictive for decision problems [7]. On the other hand, correlation, which is the basis of most ML models, is insufficient for decision-making. Therefore, we turn to the influence relation, which lies between correlation and causation and forms a basis for decisions [7]. Building on the influence relation, we propose SRM, which is capable of modeling interrelated but not necessarily causally linked variables and dynamically evolving decision systems. Moreover, the proposed rehearsal learning framework demonstrates the feasibility of building decision-making upon influence and rehearsal.

## 6 Experiments

We evaluate the proposed approach on two datasets. We are mainly interested in whether the proposed approach can successfully avoid the undesired future with high probability, the exploration-exploitation trade-off, and the informativeness of the PAC-style guarantees. For each dataset, we alter one variable in each round and repeat the 100-round experiments 20 times.
The graph prior is initialized with samples from the learnable graph equivalence class instead of being learned from data, as well-established rehearsal graph learning methods are still in the process of development. The success probability is estimated with 1,000 samples from the true SRM. We compute the PAC bound with 1,000 samples from the posterior with $\delta = 0.05$. The observational dataset size is set to 10. We also compare with several reinforcement learning methods: DDPG [53], PPO [54], SAC [55], and CATS [56].

**Ride-Hailing Data.** We abstract an SRM from a ride-hailing scenario, where a ride-hailing app needs to make decisions to promote the user rating (RAT). We consider relevant variables including the weather condition, the number of users, traffic congestion (TRA), etc. The app can alter two variables, the discount level (DIS) and the recommendation level (REC) of a specific route. There exist interrelated variables in the scenario: if REC of a specific route is high, then TRA on that route will be high, since more users will choose that route; and if there is high TRA on that route, the REC should be low. The sizes of $X_t$, $Z_t$, and $Y_t$ are 2, 4, and 1, respectively. The true structural equations are set according to domain knowledge. The feasible alteration values are $[-2, 2]$ for DIS and REC. The range of RAT is $[0, 1]$, so we set the desired region to $S = [0.8, 1]$ to avoid RAT below 0.8.

**Bermuda Data.** We take an example from ecology, where environment variables in Bermuda are recorded [57] and the variable generation order is available [58]. The sizes of $X_t$, $Z_t$, and $Y_t$ are 3, 7, and 1, respectively. The true structural equations are obtained by performing linear regression on the normalized data. We assume that 5 variables are actionable [37], with a feasible alteration region of $[-1, 1]$. $Y_t$ represents the net coral ecosystem calcification (NEC). We want to maintain a high NEC, so we set the desired region to $S = [0.5, 2]$, which is above the 75th percentile of the dataset.

Figure 4: Results on the Ride-hailing and the Bermuda data over 100 rounds: the number of success rounds, the probability of success, and the estimated mutual information for $\tau \in \{0.3, 0.5, 0.7, 0.9\}$, and the PAC bounds (lower bound, upper bound, and true probability) for $\tau = 0.7$. The bands depict standard deviations.

| Dataset | DDPG | PPO | SAC | CATS | Ours ($\tau$ = 0.7) |
| --- | --- | --- | --- | --- | --- |
| Ride-hailing | 0.173 | 0.154 | 0.177 | 0.104 | 0.714 |
| Bermuda | 0.230 | 0.190 | 0.205 | 0.215 | 0.679 |

Table 1: Average success probability.

Table 1 shows the average probabilities of successfully avoiding the undesired future for several RL methods and the proposed rehearsal learning method with $\tau = 0.7$. Our method achieves the goal of AUF with a probability around 0.7, while the others fail to do so with merely 100 interactions with the environment, which further underscores the importance of considering structural information in the interaction-limited AUF problem. On the other hand, if we increase $T$, RL methods can achieve satisfying performance, e.g., DDPG achieves a 0.688 average success probability when $T = 10{,}000$ on the Bermuda data. For the ride-hailing data, 95.8% of the alterations recommended by the proposed method alter REC, which affects RAT through the interactions between REC and TRA and the direct link between TRA and RAT: REC $\leftrightarrow$ TRA $\to$ RAT. Note that REC cannot be identified as a valid cause, so causal modeling is not suitable for this problem. Fig. 4 shows the full results.
Given $\tau$, the found alterations satisfy the probability requirement in most cases. Exceptions are the settings where $\tau = 0.9$: alterations with such a high success probability rarely exist, so the method can fail. In our implementation, when the recommendation is not attainable, the constraint is dropped and the method only optimizes the mutual information; therefore the corresponding number of success rounds, as well as the success probability, is low, but the mutual information is high. The third column shows a trade-off between avoiding the undesired future and learning the SRM: a smaller $\tau$ is likely to give higher mutual information. An interesting phenomenon is that the true success probability often fluctuates around $\tau$ instead of achieving higher values. An explanation is that if an alteration has a high success probability, then the effect of the alteration is less uncertain, so less new information can be revealed, and the method will instead choose other alterations with lower success probabilities. Finally, the last column shows that the true probabilities are bounded by the PAC-style bound with a relatively small gap, so it can be informative for guiding real decisions.

## 7 Conclusion

Realizing that correlation is inadequate and that causation is not always suitable, we advocate relying on the influence relation for decision-making. In this paper, we propose the first rehearsal learning framework that tackles the AUF (avoiding undesired future) problem with influence modeling. The key to the proposed framework is SRM, which considers the interrelations of variables and the dynamic, time-dependent nature of decision environments. By unifying rehearsals on SRMs and Bayesian inference, the framework recommends decisions that can persuasively avoid the undesired future. We also provide a PAC-style bound to quantify the associated risk of recommended decisions and further show the hardness of a basic linear instance of the framework, revealing that future work should focus on developing approximate solutions rather than exact ones.

## Acknowledgments

This research was supported by the National Key R&D Program of China (2022ZD0114800) and NSFC (61921006). The authors thank the reviewers for their valuable comments.

## References

[1] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios D. Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience, pages 1–13, 2018.

[2] Hang Li. Deep learning for natural language processing: Advantages and challenges. National Science Review, 5(1):24–26, 2017.

[3] Zeynep Batmaz, Ali Yürekli, Alper Bilge, and Cihan Kaleli. A review on deep learning for recommender systems: Challenges and remedies. Artificial Intelligence Review, 52(1):1–37, 2019.

[4] Xiong-Hui Chen, Bowei He, Yang Yu, Qingyang Li, Zhiwei Tony Qin, Wenjie Shang, Jieping Ye, and Chen Ma. Sim2Rec: A simulator-based decision-making approach to optimize real-world long-term user engagement in sequential recommender systems. In Proceedings of the Thirty-Ninth IEEE International Conference on Data Engineering, pages 3389–3402, 2023.

[5] Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, and Dimitrios I. Fotiadis. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13:8–17, 2015.

[6] Harsha Chamara Hewage and H. Niles Perera. Comparing statistical and machine learning methods for sales forecasting during the post-promotional period.
In IEEE International Conference on Industrial Engineering and Engineering Management, pages 462–466, 2021.

[7] Zhi-Hua Zhou. Rehearsal: Learning from prediction to decision. Frontiers of Computer Science, 16(4):1–3, 2022.

[8] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[9] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

[10] Bahare Kiumarsi, Kyriakos G. Vamvoudakis, Hamidreza Modares, and Frank L. Lewis. Optimal and autonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 29(6):2042–2062, 2018.

[11] Zhi-Hua Zhou. Rehearsal: Learning from prediction to decision. Keynote at the CCF Conference on AI, Urumqi, China, 2023.

[12] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.

[13] David M. Chickering. Learning Bayesian networks is NP-complete. Learning from Data. Lecture Notes in Statistics, 112:121–130, 1995.

[14] Xiong-Hui Chen, Junyin Ye, Hang Zhao, Yi-Chen Li, Haoran Shi, Yu-Yan Xu, Zhihao Ye, Si-Hang Yang, Anqi Huang, Kai Xu, Zongzhang Zhang, and Yang Yu. Imitator learning: Achieve out-of-the-box imitation ability in variable environments. CoRR, abs/2310.05712, 2023.

[15] Michael E. Sobel. Causal Inference in the Social and Behavioral Sciences, pages 1–38. Springer, 1995.

[16] Kenneth J. Rothman and Sander Greenland. Causation and causal inference in epidemiology. American Journal of Public Health, 95(S1):S144–S150, 2005.

[17] Adam Massmann, Pierre Gentine, and Jakob Runge. Causal inference for process understanding in earth sciences. CoRR, abs/2105.00912, 2021.

[18] Nir Friedman, Moisés Goldszmidt, and Abraham J. Wyner. Data analysis with Bayesian networks: A bootstrap approach. CoRR, abs/1301.6695, 2013.

[19] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[20] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[21] Edoardo Amaldi and Viggo Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1):181–210, 1995.

[22] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[23] Giuseppe C. Calafiore and Marco C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control, 51(5):742–753, 2006.

[24] Thom S. Badings, Alessandro Abate, Nils Jansen, David Parker, Hasan A. Poonawala, and Mariëlle Stoelinga. Sampling-based robust control of autonomous systems with non-Gaussian noise. In Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 9669–9678, 2022.

[25] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.

[26] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.

[27] Raj Agrawal, Chandler Squires, Karren D. Yang, Karthikeyan Shanmugam, and Caroline Uhler. ABCD-strategy: Budgeted experimental design for targeted causal structure discovery.
In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 3400–3409, 2019.

[28] Lars Lorch, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. DiBS: Differentiable Bayesian structure learning. In Advances in Neural Information Processing Systems, pages 24111–24123, 2021.

[29] Sourabh Katoch, Sumit Singh Chauhan, and Vijay Kumar. A review on genetic algorithm: Past, present, and future. Multimedia Tools and Applications, 80(5):8091–8126, 2021.

[30] Zheng-Mao Zhu, Xiong-Hui Chen, Hong-Long Tian, Kun Zhang, and Yang Yu. Offline reinforcement learning with causal structured world models. CoRR, abs/2206.01474, 2022.

[31] Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, pages 1181–1189, 2016.

[32] Akihiro Yabe, Daisuke Hatano, Hanna Sumita, Shinji Ito, Naonori Kakimura, Takuro Fukunaga, and Ken-ichi Kawarabayashi. Causal bandits with propagating inference. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pages 5508–5516, 2018.

[33] Sanghack Lee and Elias Bareinboim. Structural causal bandits: Where to intervene? In Advances in Neural Information Processing Systems, pages 2573–2583, 2018.

[34] Sanghack Lee and Elias Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, pages 4164–4172, 2019.

[35] Yangyi Lu, Amirhossein Meisami, Ambuj Tewari, and William Yan. Regret analysis of bandit problems with causal background knowledge. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 141–150, 2020.

[36] Neil Dhir. Chronological causal bandits. CoRR, abs/2112.01819, 2021.

[37] Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, and Javier González. Causal Bayesian optimization. In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, pages 3155–3164, 2020.

[38] Virginia Aglietti, Neil Dhir, Javier González, and Theodoros Damoulas. Dynamic causal Bayesian optimization. In Advances in Neural Information Processing Systems, pages 10549–10560, 2021.

[39] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

[40] David M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

[41] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

[42] Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696, 2008.

[43] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647–655, 2009.

[44] Tian Qin, Tian-Zuo Wang, and Zhi-Hua Zhou. Learning causal structure on mixed data with tree-structured functional models. In Proceedings of the Twenty-Third SIAM International Conference on Data Mining, pages 613–621, 2023.

[45] Arnoud A. W. M. de Kroon, Danielle Belgrave, and Joris M. Mooij. Causal discovery for causal bandits utilizing separating sets. CoRR, abs/2009.07916, 2020.
[46] Yangyi Lu, Amirhossein Meisami, and Ambuj Tewari. Causal bandits with unknown graph structure. In Advances in Neural Information Processing Systems, pages 24817–24828, 2021.

[47] Tian-Zuo Wang, Tian Qin, and Zhi-Hua Zhou. Estimating possible causal effects with latent variables via adjustment. In Proceedings of the Fortieth International Conference on Machine Learning, pages 36308–36335, 2023.

[48] Tian-Zuo Wang, Xi-Zhu Wu, Sheng-Jun Huang, and Zhi-Hua Zhou. Cost-effectively identifying causal effects when only response variable is observable. In Proceedings of the Thirty-Seventh International Conference on Machine Learning, pages 10060–10069, 2020.

[49] Tian-Zuo Wang and Zhi-Hua Zhou. Actively identifying causal effects with latent variables given only response variable observable. In Advances in Neural Information Processing Systems, pages 15007–15018, 2021.

[50] Tian Qin, Tian-Zuo Wang, and Zhi-Hua Zhou. Budgeted heterogeneous treatment effect estimation. In Proceedings of the Thirty-Eighth International Conference on Machine Learning, pages 8693–8702, 2021.

[51] Tian-Zuo Wang, Tian Qin, and Zhi-Hua Zhou. Sound and complete causal identification with latent variables given local background knowledge. In Advances in Neural Information Processing Systems, pages 10325–10338, 2022.

[52] Tian-Zuo Wang, Tian Qin, and Zhi-Hua Zhou. Sound and complete causal identification with latent variables given local background knowledge. Artificial Intelligence, 322:103964, 2023.

[53] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the Fourth International Conference on Learning Representations, 2016.

[54] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[55] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pages 1856–1865, 2018.

[56] Maryam Majzoubi, Chicheng Zhang, Rajan Chari, Akshay Krishnamurthy, John Langford, and Aleksandrs Slivkins. Efficient contextual bandits with continuous actions. In Advances in Neural Information Processing Systems, pages 349–360, 2020.

[57] Travis A. Courtney, Mario Lebrato, Nicholas R. Bates, Andrew Collins, Samantha J. de Putron, Rebecca Garley, Rod Johnson, Juan-Carlos Molinero, Timothy J. Noyes, Christopher L. Sabine, and Andreas J. Andersson. Environmental controls on modern scleractinian coral and reef-scale calcification. Science Advances, 3(11):e1701356, 2017.

[58] Andreas Andersson and Nicholas Bates. In situ measurements used for coral and reef-scale calcification structural equation modeling including environmental and chemical measurements, and coral calcification rates in Bermuda from 2010 to 2012 (BEACON project), 2018. http://lod.bco-dmo.org/id/dataset/720788.

[59] Kumar Sricharan, Dennis L. Wei, and Alfred O. Hero III. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374–4388, 2013.

[60] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 1015–1022, 2010.
[61] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[62] Licio Romao, Kostas Margellos, and Antonis Papachristodoulou. Tight generalization guarantees for the sampling and discarding approach to scenario optimization. In Proceedings of the Fifty-Ninth IEEE Conference on Decision and Control, pages 2228–2233, 2020.

## A Rehearsal Graphs: Characterization and Learning

In this section, we present the full motivations, basic properties, and preliminary learning strategies for rehearsal graphs. The ability to describe and identify causal relations is at the heart of both human and artificial intelligence. The work of Pearl [12] on causality lays the foundation for accurately describing causal relations and making sound inferences. Despite its elegance and expressiveness in abstracting causality from human cognition, one may encounter difficulties when applying Pearl's causal framework to real-world observations. We identify three kinds of misspecifications that could make causal modeling fail: (a) model misspecification, (b) data-collection misspecification, and (c) adversary misspecification.

Model misspecification means that the variables used in the causal modeling process are inaccurate, so there are indeterminate causal relations among the variables. For example, demand ($D$) and price ($P$) are two variables that are causally related. But there is not a determinate causal direction between $D$ and $P$, since intervening on $D$ causes $P$ to change, but meanwhile intervening on $P$ causes $D$ to change. A workaround is modeling with finer granularity by adding a time index to $D$ and $P$; then we have $D_t$ causes $P_{t+1}$ and $P_t$ causes $D_{t+1}$.

Data-collection misspecification means that the collected data are not perfectly aligned with the specified variables. For example, we may have variables $D_1, D_2, P_1, P_2$ in the causal model that denote two measures $D$ and $P$ at different times. But the data collected may not rigorously follow this order, since there can be time delays and mismatches between the modeled time intervals and the real ones. This misspecification results in a misalignment between the model and the data; hence no sound causal conclusions can be drawn from the data.

Adversary misspecification refers to ignoring the existence of adversaries in causal modeling. For example, let $P$ and $Q$ denote the prices of a product in two stores. If $P$ is decreased, then $Q$ will also be decreased; otherwise, the second store will lose some consumers. From a causal view, it seems that $P$ causes $Q$. But obviously, $Q$ also causes $P$ by symmetry. So we have a bi-directional causal relation, which cannot be described in causal modeling. The reason for this phenomenon is that each store is an adversary that reacts to interventions from other agents.

The preceding three misspecifications are not necessarily disjoint in real problems and are hard to identify in advance. The consequence is that one cannot identify perfect causal knowledge even with an infinite amount of data, since the modeling and the data do not exhibit consistent determinate causal relations. In order to model both the non-determinate and determinate causal relations in data, we introduce a new probabilistic graphical model: rehearsal graphs.

### A.1 Rehearsal Graphs as a Probabilistic Graphical Model

**Definition 8** (Mixed graph). Let $G = (V, E)$ be a graph, where $V$ denotes the vertices and $E$ the edges.
Definition 9 (Bi-directional clique). A bi-directional clique C = (V_C, E_C) of a mixed graph G = (V, E) is a complete induced subgraph such that every edge e ∈ E_C is bi-directional. A single vertex also constitutes a valid bi-directional clique. C is maximal if adding any other vertex does not induce a bi-directional clique.

Definition 10 (Rehearsal graph). Let G = (V, E) be a mixed graph, and let $\{C_i\}_{i=1}^{l}$ denote all maximal bi-directional cliques of G, where $C_i = (V^c_i, E^c_i)$. G is a rehearsal graph if and only if:
1. $V^c_i \cap V^c_j = \emptyset$ for any $i \neq j$;
2. for every $i \in [l]$ and every $u \in V \setminus V^c_i$, if there is any edge pointing from $u$ into $V^c_i$, then $u \to v$ for every $v \in V^c_i$;
3. the directional edges permit a topological ordering of $\{C_i\}_{i=1}^{l}$.

Definition 11 (Factorization). Let G = (V, E) be a rehearsal graph and $C_i = (V^c_i, E^c_i)$, $i \in [l]$, be the maximal bi-directional cliques of G. We say that a joint probability P over V factorizes according to G if P can be expressed as the product
$$P(V) = \prod_{i=1}^{l} P(V^c_i \mid \mathrm{PA}_i), \qquad (10)$$
where $\mathrm{PA}_i = \{u \mid u \to v \text{ in } G \text{ for all } v \in V^c_i\}$.
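To make Definition 11 concrete, consider a small hypothetical example (ours, not from the original text). Take V = {A, B, C, D} with edges $A \to B$, $A \to C$, $B \leftrightarrow C$, $B \to D$, and $C \to D$. The maximal bi-directional cliques are $\{A\}$, $\{B, C\}$, and $\{D\}$, with parent sets $\emptyset$, $\{A\}$, and $\{B, C\}$, respectively, so Eq. (10) reads
$$P(A, B, C, D) = P(A)\, P(B, C \mid A)\, P(D \mid B, C).$$
Unlike a Bayesian network factorization, the bi-directionally connected pair (B, C) is kept as a single joint conditional factor rather than split into per-variable factors.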
A.2 c-Separation

As a probabilistic graphical model, a rehearsal graph, like a Bayesian network or a Markov network, admits a graphical characterization of independence relations. Mimicking d-separation for Bayesian networks, we present c-separation for rehearsal graphs, where the "c" stands for "cliques".

Definition 12 (c-separation). A path p in a rehearsal graph G is said to be c-separated (or blocked) by a set of vertices Z if and only if
1. p contains a chain $i \to m_1 \leftrightarrow m_2 \leftrightarrow \cdots \leftrightarrow m_l \to j$ or a fork $i \leftarrow m_1 \leftrightarrow m_2 \leftrightarrow \cdots \leftrightarrow m_l \to j$ such that the bi-directionally connected vertices satisfy $\{m_i\}_{i=1}^{l} \cap Z \neq \emptyset$, or
2. p contains an inverted fork (or collider) $i \to m_1 \leftrightarrow m_2 \leftrightarrow \cdots \leftrightarrow m_l \leftarrow j$ such that no vertex in $\{m_i\}_{i=1}^{l}$ is in Z and no descendant of $\{m_i\}_{i=1}^{l}$ is in Z.

A set Z is said to c-separate X from Y if and only if Z blocks every path from a vertex in X to a vertex in Y. We write $X \perp_G Y \mid Z$ when Z c-separates X from Y in G.

Lemma 13. Let G = (V, E) be a rehearsal graph, x, y ∈ V two distinct vertices, and Z a set of vertices. Let C(v) denote the vertex set of the maximal bi-directional clique to which v belongs. If $x \perp_G y \mid Z$, then $C(x) \perp_G C(y) \mid Z$.

Proof. We prove this by contradiction. Assume that $C(x) \perp_G C(y) \mid Z$ does not hold while $x \perp_G y \mid Z$, i.e., there exists a path p, unblocked by Z, between some u ∈ C(x) and some v ∈ C(y). Then the path $x \leftrightarrow u - p - v \leftrightarrow y$ is an unblocked path between x and y, which is a contradiction since Z is supposed to block every path between x and y.

Lemma 14. Let G = (V, E) be a rehearsal graph, $C_a, C_b \subseteq V$ the vertex sets of two distinct maximal bi-directional cliques, and Z a set of vertices. If $C_a \perp_G C_b \mid Z$, then for every probability function P that factorizes according to G, $C_a \perp_P C_b \mid Z$.

Proof. We construct a DAG $\tilde{G} = (\tilde{V}, \tilde{E})$ from G with a one-to-one mapping f: every maximal clique $C_i$ of G is represented by a vertex $\tilde{v}_i = f(C_i) \in \tilde{G}$ that collapses the clique vertices $(v^1_i, \ldots, v^{|C_i|}_i)$, and the edge directions between the $C_i$'s are kept. For any path $\tilde{p} \equiv \tilde{v}_{\alpha_1} - \tilde{v}_{\alpha_2} - \cdots - \tilde{v}_{\alpha_l}$ in $\tilde{G}$ (edge directions omitted), where $\{\alpha_i\}_i$ is an index set, there is a counterpart q in the original G that reads $v^1_{\alpha_1} \leftrightarrow \cdots \leftrightarrow v^{|C_{\alpha_1}|}_{\alpha_1} - v^1_{\alpha_2} \leftrightarrow \cdots \leftrightarrow v^{|C_{\alpha_2}|}_{\alpha_2} - \cdots - v^1_{\alpha_l} \leftrightarrow \cdots \leftrightarrow v^{|C_{\alpha_l}|}_{\alpha_l}$, where $v^j_{\alpha_k}$ is the j-th vertex in $C_{\alpha_k}$. For any $Z \subseteq V$, we map it to a set of vertices in $\tilde{G}$ with $h(Z) = \{\tilde{v} \in \tilde{V} \mid f^{-1}(\tilde{v}) \cap Z \neq \emptyset\}$. From the definitions of c-separation for rehearsal graphs and d-separation for Bayesian networks, we see that if q is c-separated by Z in G, then $\tilde{p}$ is d-separated by h(Z) in $\tilde{G}$. Therefore, if $C_a \perp_G C_b \mid Z$, then every path between $\tilde{v}_a$ and $\tilde{v}_b$ is d-separated by h(Z). By the Markov property of Bayesian networks, every P that factorizes according to G (and hence, by definition, also factorizes according to $\tilde{G}$) satisfies $\tilde{v}_a \perp_P \tilde{v}_b \mid h(Z)$, i.e., $C_a \perp_P C_b \mid f^{-1}(h(Z))$. Since $f^{-1}(h(Z)) \supseteq Z$, we have $C_a \perp_P C_b \mid Z$.

Theorem 15. Let G = (V, E) be a rehearsal graph. For every probability function P that factorizes according to G and any three disjoint subsets of vertices X, Y, Z ⊆ V, if $X \perp_G Y \mid Z$, then $X \perp_P Y \mid Z$.

Proof.
$$X \perp_G Y \mid Z \implies \forall x \in X, y \in Y,\; x \perp_G y \mid Z \quad (11)$$
$$\overset{\text{Lem. 13}}{\implies} \forall x \in X, y \in Y,\; C(x) \perp_G C(y) \mid Z \quad (12)$$
$$\overset{\text{Lem. 14}}{\implies} \forall x \in X, y \in Y,\; C(x) \perp_P C(y) \mid Z \quad (13)$$
$$\implies \forall x \in X, y \in Y,\; x \perp_P y \mid Z \quad (14)$$
$$\implies X \perp_P Y \mid Z. \quad (15)$$

Figure 5: An example rehearsal graph G (panel (a)) and its Bayesian network counterparts that encode the same conditional independence relations. Plate notation is used to simplify the edges; for example, in (a), vertex A having an edge pointing to the rectangle containing three vertices implies that A → E, A → F, and A → G.

Theorem 16. Let G = (V, E) be a rehearsal graph, and let G′ = (V, E′) be any DAG that has an identical skeleton to G (see Fig. 5 for an example). If all directional edges of G appear in G′, then for any disjoint vertex sets X, Y, and Z, we have $X \perp_G Y \mid Z$ if and only if $X \perp_{G'} Y \mid Z$, where $\perp_G$ denotes c-separation for rehearsal graphs and $\perp_{G'}$ denotes d-separation for DAGs.

Proof. We first prove $X \perp_G Y \mid Z \implies X \perp_{G'} Y \mid Z$. Every path p′ in G′ has a unique counterpart p in G that shares the same vertex sequence; p and p′ differ only in some edges, a bi-directional edge u ↔ v of G being a directed edge u → v or u ← v in G′. The mapping between p and p′ is clearly one-to-one. We only need to show that for any path p′ in G′, if its counterpart p in G is c-separated by a set Z, then p′ is d-separated by Z in G′. We prove this by contradiction. Assume that p′ is not d-separated by Z. Then, by the definition of d-separation, (a) for every chain $i \to m \to j$ and every fork $i \leftarrow m \to j$ in p′, $m \notin Z$; and (b) for every inverted fork $i \to m \leftarrow j$ in p′, $(\{m\} \cup \mathrm{des}(m)) \cap Z \neq \emptyset$. For every chain $i \to m_1 \leftrightarrow \cdots \leftrightarrow m_l \to j$ and every fork $i \leftarrow m_1 \leftrightarrow \cdots \leftrightarrow m_l \to j$ in p, we have $\{m_i\}_{i=1}^{l} \cap Z = \emptyset$, since otherwise a vertex of $\{m_i\}_{i=1}^{l} \cap Z$ would be the middle vertex of a chain or a fork in p′, which violates requirement (a). For every inverted fork $i \to m_1 \leftrightarrow \cdots \leftrightarrow m_l \leftarrow j$ in p, we have $(\{m_i\}_{i=1}^{l} \cup \mathrm{des}(\{m_i\}_{i=1}^{l})) \cap Z \neq \emptyset$, since otherwise there would exist an inverted fork $\to m_t \leftarrow$, $t \in [l]$, in p′ with $(\{m_t\} \cup \mathrm{des}(m_t)) \cap Z = \emptyset$, which violates requirement (b). Combining the above results, p is not c-separated by Z, which is a contradiction. Therefore p′ is d-separated by Z, which concludes $X \perp_G Y \mid Z \implies X \perp_{G'} Y \mid Z$.

We next prove $X \perp_{G'} Y \mid Z \implies X \perp_G Y \mid Z$, again by contradiction. Assume that there exists a path p from x ∈ X to y ∈ Y in G that is not c-separated by Z. Let p be $x - m^1_1 \leftrightarrow m^2_1 \leftrightarrow \cdots \leftrightarrow m^{l_1}_1 - m^1_2 \leftrightarrow \cdots \leftrightarrow m^{l_2}_2 - \cdots - m^1_t \leftrightarrow \cdots \leftrightarrow m^{l_t}_t - y$ (we omit the directions of the directional edges). We construct a path q from x to y by picking one vertex from each bi-directionally connected subpath $m^1_i \leftrightarrow \cdots \leftrightarrow m^{l_i}_i$ in order, so that q contains only directional edges; q is then also a path in G′. For each bi-directionally connected subpath $m^1_i \leftrightarrow \cdots \leftrightarrow m^{l_i}_i$ of p, if it is part of a chain $\to m^1_i \leftrightarrow \cdots \leftrightarrow m^{l_i}_i \to$ or a fork $\leftarrow m^1_i \leftrightarrow \cdots \leftrightarrow m^{l_i}_i \to$, then by the definition of c-separation $\{m^j_i\}_{j=1}^{l_i} \cap Z = \emptyset$, so we randomly select a vertex $v_i \in \{m^j_i\}_{j=1}^{l_i} \setminus Z$ to join q. If the bi-directionally connected subpath is part of an inverted fork $\to m^1_i \leftrightarrow \cdots \leftrightarrow m^{l_i}_i \leftarrow$, then by the definition of c-separation either some $m^j_i \in Z$ or some descendant of $\{m^j_i\}_{j=1}^{l_i}$ is in Z. In the former case we select $v_i = m^j_i$ to join q, and in the latter we randomly select any $v_i \in \{m^j_i\}_{j=1}^{l_i}$. By construction, $q = x - v_1 - \cdots - v_t - y$ is a path from X to Y in G′ that is not d-separated by Z, which is a contradiction. Therefore, every path p from X to Y in G is c-separated by Z, which concludes $X \perp_{G'} Y \mid Z \implies X \perp_G Y \mid Z$.
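The DAG counterpart used in Thm. 16 can be constructed mechanically. The following sketch is our own illustration (the edge-set representation and names are ours, not the paper's): it keeps every directional edge and orients each bi-directional edge along a fixed total order of the vertices of its clique, which is acyclic within each clique and, together with the topological ordering of cliques from Def. 10, yields a DAG.

# A minimal sketch (ours): build one DAG counterpart G' of a rehearsal
# graph G, as used in Theorem 16. G is given as two edge sets: directed
# edges (u, v) meaning u -> v, and bi-directed pairs {u, v} meaning
# u <-> v. Bi-directed edges only occur inside cliques, so orienting
# them along any fixed total order of the clique's vertices is acyclic.

def dag_counterpart(directed, bidirected, clique_order):
    """directed: set of (u, v) pairs for u -> v in G.
    bidirected: set of frozenset({u, v}) pairs for u <-> v in G.
    clique_order: dict mapping each vertex to its rank in a total
        order of its maximal bi-directional clique (an assumption of
        this sketch; by Theorem 16 any such order works).
    Returns the edge set of a DAG with the same skeleton as G."""
    edges = set(directed)  # all directional edges of G are kept
    for pair in bidirected:
        u, v = sorted(pair, key=lambda w: clique_order[w])
        edges.add((u, v))  # orient u <-> v along the clique order
    return edges

# Example: B <-> C inside one clique, plus directed edges around it.
directed = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
bidirected = {frozenset({"B", "C"})}
order = {"A": 0, "B": 0, "C": 1, "D": 0}  # B before C in their clique
print(sorted(dag_counterpart(directed, bidirected, order)))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D')]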
Theorem 17. Let G = (V, E) be a rehearsal graph. For any three disjoint subsets of vertices X, Y, Z ⊆ V, if $X \perp_P Y \mid Z$ for every probability function P that factorizes according to G, then $X \perp_G Y \mid Z$.

Proof. Let G′ = (V, E′) be any DAG that has an identical skeleton to G and contains all directional edges of G. Let $V_i = \{v_j\}_{j=1}^{l}$ denote the vertex set of a maximal bi-directional clique of G and $\mathrm{PA}_i$ its parents. Then $V_i$ is also a clique in G′. Without loss of generality, assume the topological ordering within $V_i$ in G′ is $v_1 \to v_2 \to \cdots \to v_l$. By the chain rule, we have $P(V_i \mid \mathrm{PA}_i) = \prod_{k=1}^{l} P(v_k \mid v_{k-1}, v_{k-2}, \ldots)$, where the left-hand side is the conditional probability defined in the factorization of G and the right-hand side is the factorization defined by G′. So if P is a probability function that factorizes according to G, then P also factorizes according to G′, and vice versa. Let $\mathcal{P}(\cdot)$ denote the family of probability functions that factorize according to the argument graph; we have $\mathcal{P}(G) = \mathcal{P}(G')$. Then
$$\forall P \in \mathcal{P}(G),\; X \perp_P Y \mid Z \overset{\mathcal{P}(G) = \mathcal{P}(G')}{\iff} \forall P \in \mathcal{P}(G'),\; X \perp_P Y \mid Z \quad (16)$$
$$\overset{\text{completeness of d-separation}}{\implies} X \perp_{G'} Y \mid Z \quad (17)$$
$$\overset{\text{Thm. 16}}{\iff} X \perp_G Y \mid Z. \quad (18)$$

A.3 Structure Learning

If we drop the bi-directional edges of a rehearsal graph, it becomes a valid Bayesian network; in other words, Bayesian networks are special cases of rehearsal graphs, so learning rehearsal graphs can only be harder than learning Bayesian networks. In this subsection we present a preliminary learning method; more efficient learning methods are left for future research.

Thm. 16 shows that a rehearsal graph has several Bayesian network counterparts that share its skeleton and encode the same conditional independence relations. So by learning a Bayesian network, or a Markov equivalence class of Bayesian networks, from data, we can determine the skeleton of the rehearsal graph. We can then obtain a rehearsal graph, or a set of rehearsal graphs that cannot be distinguished from observational data, by adding bi-directional edges in place of the edges whose directions are not determined. Assuming that the observational distribution factorizes according to a rehearsal graph and that all conditional independence relations of the distribution are encoded by the rehearsal graph, Algorithm 2 lists all valid rehearsal graphs. The first three steps of Algorithm 2 are simply the IC algorithm for Bayesian network learning [12]. We then list all Bayesian networks in the equivalence class and map them back to rehearsal graphs by replacing directional edges with bi-directional ones. A sketch of the validity check in step 6 is given after the algorithm.

Algorithm 2 Learning rehearsal graphs from observational data
Input: Observational distribution P(V).
1: For each pair of variables a and b in V, search for a set S_ab such that a ⫫ b | S_ab holds. Construct an undirected graph G such that vertices a and b are connected with an edge if and only if no such set S_ab can be found.
2: For each pair of nonadjacent variables a and b with a common neighbor c, check whether c ∈ S_ab. If it is, continue. If it is not, add arrowheads pointing at c, i.e., a → c ← b.
3: In the partially directed graph that results, orient as many of the undirected edges as possible, subject to two conditions: (i) any alternative orientation would yield a new v-structure; (ii) any alternative orientation would yield a directed cycle.
4: List all Markov-equivalent directed graphs consistent with the partially directed graph. Denote this set of graphs by G1.
5: For each listed graph g ∈ G1, build a set of mixed graphs G^g_2 by enumerating all possible combinations of cliques and replacing the directional edges therein with bi-directional edges.
6: Check whether the graphs in each G^g_2 are valid rehearsal graphs; remove those that are not from the corresponding G^g_2.
Output: Markov-equivalent rehearsal graphs G1 ∪ (⋃_{g∈G1} G^g_2).
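Step 6 of Algorithm 2 amounts to testing the three conditions of Def. 10 on a candidate mixed graph. A minimal sketch of such a check (our illustration; the graph representation is an assumption, not prescribed by the paper):

# A minimal sketch (ours) of the validity check in step 6 of
# Algorithm 2: test the three conditions of Definition 10 on a mixed
# graph given by directed edges u -> v and bi-directed pairs {u, v}.
from itertools import combinations

def is_rehearsal_graph(vertices, directed, bidirected):
    bidirected = {frozenset(p) for p in bidirected}
    # Group vertices into bi-directionally connected components; for a
    # valid rehearsal graph these are exactly the maximal cliques.
    comp = {v: v for v in vertices}
    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v
    for u, v in (tuple(p) for p in bidirected):
        comp[find(u)] = find(v)
    cliques = {}
    for v in vertices:
        cliques.setdefault(find(v), set()).add(v)
    cliques = list(cliques.values())
    # Condition 1 (with Def. 9): each component must be a complete
    # bi-directional clique; disjointness holds by construction.
    for c in cliques:
        if any(frozenset({u, v}) not in bidirected
               for u, v in combinations(c, 2)):
            return False
    # Condition 2: an edge from u into a clique implies u -> v for
    # every member v of that clique.
    for c in cliques:
        parents = {u for (u, v) in directed if v in c}
        if any((u, v) not in directed for u in parents for v in c):
            return False
    # Condition 3: the directional edges must induce an acyclic, hence
    # topologically orderable, relation over the cliques.
    idx = {v: i for i, c in enumerate(cliques) for v in c}
    succ = {i: set() for i in range(len(cliques))}
    for u, v in directed:
        if idx[u] != idx[v]:
            succ[idx[u]].add(idx[v])
    seen, stack = set(), set()
    def acyclic(i):
        if i in stack:
            return False
        if i in seen:
            return True
        seen.add(i); stack.add(i)
        ok = all(acyclic(j) for j in succ[i])
        stack.discard(i)
        return ok
    return all(acyclic(i) for i in range(len(cliques)))

Under this representation, the function returns True exactly when the three conditions of Def. 10 hold.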
B Candidate Alteration Selection

B.1 Size-1 Alteration

Algorithm 1 in the main text shows the main steps of finding candidate alterations, where the details of finding alteration values that satisfy the performance requirement are omitted. We give those details in Algorithm 3. The main idea is to sort the endpoints of the found intervals in increasing order and keep the values covered by at least nτ intervals. Sorting the array in Line 17 costs O(n log n) time on average when using quicksort, and the other operations cost O(n) time, so the overall average running time of Algorithm 3 is O(|Z_t| · n log n). A compact implementation of the endpoint sweep is sketched after the pseudocode.

Algorithm 3 Find candidate single alterations
Input: $S_{tr} = \{\langle G_i, \theta_i, \varepsilon_i\rangle\}_{i=1}^{n}$, $S_{val} = \{\langle G_i, \theta_i, \varepsilon_i\rangle\}_{i=n+1}^{2n}$, τ
1: C ← ∅ ▷ Store candidate alterations
2: for Z_i ∈ Z do
3: int1 ← FIND(S_tr, Z_i, τ) ▷ Training
4: int2 ← FIND(S_val, Z_i, τ) ▷ Validation
5: C ← C ∪ {⟨Z_i, int1 ∩ int2⟩}
6: function FIND(S, Z, τ) ▷ Find alteration values
7: int ← ∅ ▷ Store valid intervals
8: for ⟨G, θ, ε⟩ ∈ S do
9: [u, v] ← solution of M(Ax + Bz_ξ + Cε) ≤ d ▷ A, B, C are determined by G, θ, Z
10: int ← int ∪ {[u, v] ∩ 𝒞(Z)}
11: return FILTERINTERVALS(int, nτ)
12: function FILTERINTERVALS(I, k) ▷ Find values covered by at least k intervals of I
13: arr ← ∅
14: for [u, v] ∈ I do
15: arr ← arr ∪ {(u, 0)}
16: arr ← arr ∪ {(v, 1)} ▷ 0 or 1 indicates whether the value is a left or a right endpoint of an interval
17: arr ← sorted(arr) ▷ Sorted in increasing order with the first element as the key
18: cnt ← 0
19: for 0 ≤ i < |arr| do
20: if arr[i][1] = 0 then ▷ The value is a left endpoint
21: cnt ← cnt + 1
22: else
23: cnt ← cnt − 1
24: arr[i][1] ← cnt ▷ Record the number of intervals that intersect at arr[i][0]
25: rtn ← ∅
26: left ← NULL, right ← NULL
27: for 0 ≤ i < |arr| do ▷ Scan the array and record valid intervals
28: right ← arr[i][0]
29: if arr[i][1] ≥ k then
30: if left is NULL then
31: left ← arr[i][0]
32: else
33: if left is not NULL then
34: rtn ← rtn ∪ {[left, right]}
35: left ← NULL
36: return rtn
Output: The found alterations C
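As promised above, the endpoint sweep of FILTERINTERVALS (Lines 12–36) can be written compactly. The following sketch is ours and assumes closed intervals and a coverage threshold of at least k intervals:

# A minimal sketch (ours) of FILTERINTERVALS in Algorithm 3: sweep the
# sorted interval endpoints and keep the regions covered by at least k
# intervals. Closed intervals [u, v] with u <= v are assumed.

def filter_intervals(intervals, k):
    # Tag left endpoints with 0 and right endpoints with 1; sorting by
    # (value, tag) processes a left endpoint before a coinciding right
    # endpoint, matching Line 17 of Algorithm 3.
    arr = []
    for u, v in intervals:
        arr.append((u, 0))
        arr.append((v, 1))
    arr.sort()
    result, cnt, left = [], 0, None
    for value, tag in arr:
        cnt += 1 if tag == 0 else -1      # running coverage count
        if cnt >= k and left is None:
            left = value                  # a covered region starts
        elif cnt < k and left is not None:
            result.append((left, value))  # the covered region ends
            left = None
    return result

# Example: with k = 2, the common coverage of [0, 3], [1, 5], [4, 6]
# is [1, 3] and [4, 5].
print(filter_intervals([(0, 3), (1, 5), (4, 6)], 2))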
B.2 Size-k Alteration

We propose to build a finite candidate set by solving a mixed-integer linear program:
$$\max_{e \in \{0,1\}^{n'},\, z_\xi} \sum_{i=1}^{n'} e_i \quad \text{s.t.} \quad M(A_i x + B_i z_\xi + C_i \varepsilon_i) \le d + (1 - e_i)\alpha, \; i \in [n'],$$
where $e_i$ is a binary decision variable that equals 1 if the i-th inequality is satisfied and 0 otherwise, and α is a constant vector that upper-bounds the left-hand side. If the found solution satisfies $\sum_i e_i \ge n'\tau$, the corresponding $z_\xi$ joins the candidate set. The above program is run multiple times for every combination of at most k alteration targets to obtain multiple alterations. The hard constraints satisfied by previous programs for a common $Z_\xi$ are randomly dropped in the following iterations to encourage diverse solutions, hence the $n' \le n$ constraints instead of n.

C Mutual Information Maximization

Here we detail the estimation of the mutual information objective in Eq. (5). We have
$$I(G, \Theta; Z_t, Y_t \mid D_t, x_t, \xi_t) = H(Z_t, Y_t \mid D_t, x_t, \xi_t) - H(Z_t, Y_t \mid G, \Theta, D_t, x_t, \xi_t).$$
In order to estimate the mutual information, we only need to estimate $H(Z_t, Y_t \mid D_t, x_t, \xi_t)$ and $H(Z_t, Y_t \mid G, \Theta, D_t, x_t, \xi_t)$. A detailed estimation procedure is given in Algorithm 4. The entropy estimator ENTROPYESTIMATE, which estimates information entropy from samples, is used as a black box, since various methods are available, e.g., Sricharan et al. [59].

Algorithm 4 Mutual information estimation
Input: The posterior estimate p(G, θ | D_t), the candidate graph set 𝒢, the observed x, the alteration ξ, the number of samples n for estimating an entropy term, the number of samples n′ for structural parameters.
1: srms ← ∅
2: for G ∈ 𝒢 do
3: for i ∈ [⌈n · p(G | D_t)⌉] do
4: Sample θ_i from p(θ | G, D_t)
5: srms ← srms ∪ {⟨G, θ_i⟩}
6: samples ← ∅
7: for ⟨G, θ⟩ ∈ srms do
8: Sample z and y from p(z, y | G, θ, x, ξ)
9: samples ← samples ∪ {(z, y)}
10: Ĥ1 ← ENTROPYESTIMATE(samples)
11: Ĥ2 ← 0
12: for G ∈ 𝒢 do
13: entropies ← ∅
14: for i ∈ [n′] do
15: Sample θ from p(θ | G, D_t)
16: samples ← ∅
17: for j ∈ [n] do
18: Sample z and y from p(z, y | G, θ, x, ξ)
19: samples ← samples ∪ {(z, y)}
20: entropies ← entropies ∪ {ENTROPYESTIMATE(samples)}
21: cond_entropy ← average of entropies
22: Ĥ2 ← Ĥ2 + cond_entropy · p(G | D_t)
23: Î ← Ĥ1 − Ĥ2
Output: The mutual information estimate Î

Let $g \triangleq I(G, \Theta; Z_t, Y_t \mid D_t, x_t, \xi_t)$ denote the target function to be optimized with Bayesian optimization [22]. A Gaussian process is used to attain the posterior of g. Denote the posterior mean and variance at an unseen $z_{l+1}$ by $\mu_g(z_{l+1})$ and $\sigma^2_g(z_{l+1})$, respectively. Given the queried points $\{z_1, \ldots, z_l\}$, we use the upper confidence bound [60] as the querying criterion to select the next point:
$$z_{l+1} = \arg\max_{z \in \mathcal{C}(Z_\xi)} \; \mu_g(z) + \gamma\, \sigma_g(z),$$
where γ is a constant and $\mathcal{C}(Z_\xi)$ is the feasible region of $Z_\xi$ defined by C. We run this optimization process for every alteration target in C to find the corresponding best alteration values. Finally, the alteration, including both the target variables and their values, that attains the maximum mutual information is selected as the recommended output.
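The UCB querying step can be sketched with an off-the-shelf Gaussian-process regressor. The code below is our illustration only: scikit-learn is our choice of library (the paper does not prescribe one), mi_estimate stands in for the estimator of Algorithm 4, and candidates for a discretization of the feasible region 𝒞(Z_ξ).

# A minimal GP-UCB sketch (ours). `mi_estimate` stands in for the
# mutual-information estimator of Algorithm 4, and `candidates` for a
# discretization of the feasible region C(Z_xi).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_ucb(mi_estimate, candidates, n_rounds, gamma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    z0 = candidates[rng.integers(len(candidates))]
    queried, values = [z0], [mi_estimate(z0)]
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_rounds - 1):
        gp.fit(np.array(queried).reshape(-1, 1), np.array(values))
        mu, sigma = gp.predict(candidates.reshape(-1, 1),
                               return_std=True)
        z_next = candidates[np.argmax(mu + gamma * sigma)]  # UCB rule
        queried.append(z_next)
        values.append(mi_estimate(z_next))
    best = int(np.argmax(values))
    return queried[best], values[best]

# Toy stand-in for the mutual-information objective.
candidates = np.linspace(-2.0, 2.0, 101)
z_best, mi_best = gp_ucb(lambda z: -(z - 0.5) ** 2, candidates, 15)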
D Proofs

In this section, we provide proofs for the claims in the main text.

Proposition 4. For any ⟨G₁, θ₁⟩ and ⟨G₂, θ₂⟩ on $X_t \cup Z_t \cup Y_t$, if they differ only in describing the generation of $X_t$, then for any $X_t = x_t$ and any alteration ξ on $Z_t$, we have
$$P(y_t \mid x_t, Rh(\xi); G_1, \theta_1) = P(y_t \mid x_t, Rh(\xi); G_2, \theta_2).$$

Proof. We have
$$P(y \mid x, Rh(\xi); G_i, \theta_i) = \sum_{z} P(z \mid x, Rh(\xi); G_i, \theta_i)\, P(y \mid z, x, Rh(\xi); G_i, \theta_i), \quad i = 1, 2.$$
Since ⟨G₁, θ₁⟩ and ⟨G₂, θ₂⟩ differ only in the parameters controlling P(x), we have $P(z \mid x, Rh(\xi); G_1, \theta_1) = P(z \mid x, Rh(\xi); G_2, \theta_2)$ and $P(y \mid z, x, Rh(\xi); G_1, \theta_1) = P(y \mid z, x, Rh(\xi); G_2, \theta_2)$. Combining the above equations gives $P(y \mid x, Rh(\xi); G_1, \theta_1) = P(y \mid x, Rh(\xi); G_2, \theta_2)$.

Proposition 5. Let $\xi = \{(Z_{a_i}, z_{a_i})\}_{i=1}^{k}$ be a valid alteration, where $Z_{a_i} \in Z_t$ and $z_{a_i} \in \mathcal{C}(Z_{a_i})$. Given ⟨G, θ, ε⟩ and $x_t$, the outcome variables are linear in the alteration values:
$$Y_t = A x_t + B z_\xi + C \varepsilon, \qquad (8)$$
where $z_\xi = (z_{a_1}, \ldots, z_{a_k})^T$, and A, B, and C are constant matrices of appropriate shapes that are uniquely determined by G, θ, and the altered variables $Z_\xi = \{Z_{a_i}\}_{i=1}^{k}$.

Proof. The proposition is immediate by noting that if the parents of a variable V can be described by $\mathrm{PA} = A' x + B' z_\xi + C' \varepsilon$, then $V = \beta^T \mathrm{PA} + \varepsilon_V = \beta^T A' x + \beta^T B' z_\xi + \beta^T C' \varepsilon + \varepsilon_V$ can be described in the same linear form.
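For linear structural equations, the matrices A, B, and C in Eq. (8) can be computed by exactly this forward substitution, visiting the variables in topological order. The following numerical sketch is ours; the three-variable layout, dimensions, and coefficients are assumptions made for illustration.

# A minimal sketch (ours) of the forward substitution behind Prop. 5:
# propagate, for every variable, its coefficient rows with respect to
# (x, z_xi, eps) through linear structural equations in topological
# order. The tiny layout below is assumed for illustration only.
import numpy as np

dx, dz_xi, deps = 2, 1, 3   # dims of x, altered values, noises
coef = {}                   # variable -> (a, b, c): rows of A, B, C

def set_altered(name, idx):
    # An altered variable is pinned to the decision value z_xi[idx].
    a, b, c = np.zeros(dx), np.zeros(dz_xi), np.zeros(deps)
    b[idx] = 1.0
    coef[name] = (a, b, c)

def set_equation(name, parents, beta, eps_idx, x_weights=None):
    # V = beta^T PA + (optional direct x terms) + eps_V, so V inherits
    # a linear form in (x, z_xi, eps) from its parents.
    a = np.zeros(dx) if x_weights is None else np.array(x_weights)
    b, c = np.zeros(dz_xi), np.zeros(deps)
    for w, p in zip(beta, parents):
        pa, pb, pc = coef[p]
        a, b, c = a + w * pa, b + w * pb, c + w * pc
    c[eps_idx] += 1.0
    coef[name] = (a, b, c)

# Z1 is altered; Z2 and Y follow their structural equations.
set_altered("Z1", 0)
set_equation("Z2", ["Z1"], beta=[0.5], eps_idx=0,
             x_weights=[1.0, -1.0])
set_equation("Y", ["Z1", "Z2"], beta=[2.0, 1.0], eps_idx=1)

A, B, C = coef["Y"]         # Y = A x + B z_xi + C eps, as in Eq. (8)
x, z_xi, eps = np.array([1.0, 2.0]), np.array([3.0]), np.zeros(3)
print(A @ x + B @ z_xi + C @ eps)   # 6.5 for this toy setup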
Theorem 6. Unless P = NP, finding a k-dimensional $z_\xi \in \mathcal{C}(Z_\xi)$ that satisfies $\sum_{i=1}^{n} \mathbb{I}(M(A_i x + B_i z_\xi + C_i \varepsilon_i) \le d) \ge n\tau$ (if a valid solution exists) is not solvable by any algorithm whose running time is polynomial in k and n.

Proof. We prove this by contradiction. Assume that the problem is solvable by a subroutine f that runs in polynomial time. Then, given any infeasible linear system $\Sigma: \{Ax \le b\}$, we can find a feasible subsystem containing as many inequalities as possible by finding the maximum τ for which the problem of satisfying at least nτ inequalities has a valid solution. This maximum τ can be found by running a binary search on τ and calling f at most O(log n) times, so the overall running time is still polynomial. However, finding a maximum feasible subsystem is known to be NP-hard [21] and thus cannot be solved in polynomial time if P ≠ NP. We therefore have a contradiction and conclude that the problem admits no polynomial-time algorithm.

Theorem 7. Given the observed evidence $x_t$, an alteration ξ, and a desired set $S = \{y \in \mathbb{R}^{|Y|} \mid My \le d\}$, let $S_{eval} = \{\langle G_i, \theta_i, \varepsilon_i\rangle\}_{i=1}^{n}$ be n i.i.d. samples from $P(G, \theta, \varepsilon \mid D_t)$, and let $\{y_i\}_{i=1}^{n}$ be generated from $Rh(\xi)$ and $S_{eval}$ following the structural equations in Eq. (1). Define $n_o \triangleq |\{i \mid y_i \notin S\}|$. Let $\hat{p} \triangleq 1 - n_o/n$; let $\underline{p} \triangleq 1 - F^{-1}(1 - \delta/(2n);\, n_o + 1,\, n - n_o)$ if $n_o < n$ and $\underline{p} \triangleq 0$ otherwise; and let $\bar{p} \triangleq 1 - F^{-1}(\delta/(2n);\, n_o,\, n - n_o + 1)$ if $n_o > 0$ and $\bar{p} \triangleq 1$ otherwise, where $F^{-1}(\cdot\,; \alpha, \beta)$ is the inverse cumulative distribution function of the beta distribution with parameters α and β. For any δ ∈ (0, 1), with probability at least 1 − δ, we have
$$\max\Big\{\underline{p},\; \hat{p} - \sqrt{\ln(2/\delta)/(2n)}\Big\} \le P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \le \min\Big\{\bar{p},\; \hat{p} + \sqrt{\ln(2/\delta)/(2n)}\Big\}.$$

Proof. The proof is mainly based on the scenario approach [23] and adapted from Badings et al. [24]. The desired region for Y is $S = \{y \in \mathbb{R}^{|Y|} \mid My \le d\}$. We define a scaled version of S by
$$R(\lambda) = \big\{y \in \mathbb{R}^{|Y|} \mid My \le \lambda d + (1 - \lambda) Mh\big\}, \qquad (19)$$
where h is a Chebyshev center of S [61]. We have R(1) = S and $R(\lambda_1) \subseteq R(\lambda_2)$ for any $0 \le \lambda_1 < \lambda_2$. Using Prop. 5, we express each $y_i$ as
$$y_i = A_i x + B_i z_\xi + C_i \varepsilon_i. \qquad (20)$$
We use the scenario approach to estimate the probability of an undesired Y via the following convex scenario optimization problem with discarded samples:
$$\min_{\lambda \ge 0} \lambda \quad \text{s.t.} \quad y_i \in R(\lambda), \; i \in \{1, \ldots, n\} \setminus Q, \qquad (21)$$
which has a single scalar decision variable λ. We denote the optimal solution of the above program by $\lambda^*_{|Q|}$, where Q is a subset of samples whose constraints have been discarded according to the following rule [24, Lem. 1]: the sample removal set $Q \subseteq \{1, \ldots, n\}$ is obtained by iteratively removing the active constraints from (21). Thus, given n samples and any two removal sets with cardinalities $|Q_1| < |Q_2|$, it holds that $Q_1 \subset Q_2$; moreover, any discarded sample $i \in Q$ violates the solution $\lambda^*_{|Q|}$ of Eq. (21), i.e., $y_i \notin R(\lambda^*_{|Q|})$, with probability one. When discarding $|Q| = n_o$ samples, it holds that $R(\lambda^*_{n_o}) \subseteq S$; when discarding $|Q| = n_o - 1$ samples ($n_o > 0$), $R(\lambda^*_{n_o - 1}) \supseteq S$. We further assume that, given a sample ⟨G_i, θ_i, ε_i⟩ from $P(G, \theta, \varepsilon \mid D_t)$, the probability that the generated $y_i$ lies on the boundary of a polytope R(λ) for any λ ≥ 0 is zero. Based on the results of Romao et al. [62, Thm. 5] and Badings et al. [24, Thm. 1], we have
$$P\big\{P(y \notin R(\lambda^*_{|Q|})) \le \epsilon\big\} = F(\epsilon;\, |Q| + 1,\, n - |Q|), \qquad (22)$$
where $F(\cdot\,; \alpha, \beta)$ is the cumulative distribution function of the beta distribution with parameters α and β, and |Q| < n. Equivalently,
$$P\big\{P(y \notin R(\lambda^*_{|Q|})) > \epsilon\big\} = 1 - F(\epsilon;\, |Q| + 1,\, n - |Q|). \qquad (23)$$
Solving $1 - F(\epsilon;\, |Q| + 1,\, n - |Q|) = \delta/(2n)$ gives
$$\epsilon = F^{-1}(1 - \delta/(2n);\, |Q| + 1,\, n - |Q|). \qquad (24)$$
For notational convenience, let $F_i(\cdot) = F(\cdot\,;\, i + 1,\, n - i)$. We have
$$P\Big\{P(y \notin R(\lambda^*_{|Q|})) > F^{-1}_{|Q|}(1 - \delta/(2n))\Big\} = \delta/(2n). \qquad (25)$$
Using Boole's inequality,
$$P\Big\{\bigcup_{i=0}^{n-1} \big[P(y \notin R(\lambda^*_i)) > F^{-1}_i(1 - \delta/(2n))\big]\Big\} \le \sum_{i=0}^{n-1} P\Big\{P(y \notin R(\lambda^*_i)) > F^{-1}_i(1 - \delta/(2n))\Big\} = \delta/2. \qquad (26)$$
Therefore,
$$P\Big\{\bigcap_{i=0}^{n-1} \big[P(y \notin R(\lambda^*_i)) \le F^{-1}_i(1 - \delta/(2n))\big]\Big\} \ge 1 - \delta/2. \qquad (27)$$
After observing the $y_i$'s at hand, we replace |Q| by $n_o$. Since $R(\lambda^*_{n_o}) \subseteq S$, and $\bigcap_{i=0}^{n-1} [P(y \notin R(\lambda^*_i)) \le F^{-1}_i(1 - \delta/(2n))]$ implies $P(y \notin R(\lambda^*_{n_o})) \le F^{-1}_{n_o}(1 - \delta/(2n))$, we have
$$P\Big\{P(y \notin S) \le F^{-1}_{n_o}(1 - \delta/(2n))\Big\} \ge P\Big\{P(y \notin R(\lambda^*_{n_o})) \le F^{-1}_{n_o}(1 - \delta/(2n))\Big\} \ge 1 - \delta/2, \qquad (28)$$
which further gives
$$P\Big\{P(y \in S) \ge 1 - F^{-1}_{n_o}(1 - \delta/(2n))\Big\} \ge 1 - \delta/2. \qquad (29)$$
When $n_o = n$, $P(y \in S) \ge 0$ trivially holds. Thus $\underline{p} \le P(Y_t \in S \mid D_t, x_t, Rh(\xi))$ holds with probability at least 1 − δ/2. For the other half of the lower bound, we apply Hoeffding's inequality to the binary variables $D_i \triangleq \mathbb{I}(y_i \notin S)$. Noting that $\sum_{i=1}^{n} D_i = n_o$ and $E(D_i) = P(Y_t \notin S \mid D_t, x_t, Rh(\xi))$, we have
$$P\Big\{n_o/n - P(Y_t \notin S \mid D_t, x_t, Rh(\xi)) \ge \sqrt{\ln(2/\delta)/(2n)}\Big\} \le \delta/2, \qquad (30)$$
which implies that, with probability at least 1 − δ/2,
$$P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \ge \hat{p} - \sqrt{\ln(2/\delta)/(2n)}. \qquad (31)$$
Combining (29) and (31), with probability at least 1 − δ we have
$$P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \ge \max\Big\{\underline{p},\; \hat{p} - \sqrt{\ln(2/\delta)/(2n)}\Big\}. \qquad (32)$$
For the upper bound, we adopt a similar approach. From Eq. (22), noting that $P(y \notin R(\lambda^*_{|Q|})) + P(y \in R(\lambda^*_{|Q|})) = 1$, we get
$$P\big\{P(y \in R(\lambda^*_{|Q|})) \ge 1 - \epsilon\big\} = F_{|Q|}(\epsilon). \qquad (33)$$
Setting $F_{|Q|}(\epsilon) = \delta/(2n)$, we have
$$P\Big\{P(y \in R(\lambda^*_{|Q|})) \ge 1 - F^{-1}_{|Q|}(\delta/(2n))\Big\} = \delta/(2n). \qquad (34)$$
Again via Boole's inequality,
$$P\Big\{\bigcap_{i=0}^{n-1} \big[P(y \in R(\lambda^*_i)) < 1 - F^{-1}_i(\delta/(2n))\big]\Big\} \ge 1 - \delta/2. \qquad (35)$$
When $n_o > 0$, since $R(\lambda^*_{n_o-1}) \supseteq S$, and $\bigcap_{i=0}^{n-1} [P(y \in R(\lambda^*_i)) < 1 - F^{-1}_i(\delta/(2n))]$ implies $P(y \in R(\lambda^*_{n_o-1})) < 1 - F^{-1}_{n_o-1}(\delta/(2n))$, we have
$$P\Big\{P(y \in S) < 1 - F^{-1}_{n_o-1}(\delta/(2n))\Big\} \ge P\Big\{P(y \in R(\lambda^*_{n_o-1})) < 1 - F^{-1}_{n_o-1}(\delta/(2n))\Big\} \ge 1 - \delta/2. \qquad (36)$$
When $n_o = 0$, $P(y \in S) \le 1$ trivially holds. Thus
$$P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \le \bar{p} \qquad (37)$$
holds with probability at least 1 − δ/2. Applying Hoeffding's inequality in the other direction of Eq. (30), we get that, with probability at least 1 − δ/2,
$$P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \le \hat{p} + \sqrt{\ln(2/\delta)/(2n)}. \qquad (38)$$
Combining the above results gives the upper bound, which holds with probability at least 1 − δ:
$$P(Y_t \in S \mid D_t, x_t, Rh(\xi)) \le \min\Big\{\bar{p},\; \hat{p} + \sqrt{\ln(2/\delta)/(2n)}\Big\}. \qquad (39)$$
Combining the lower bound (32) and the upper bound (39) gives the overall bound.
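Once $n_o$ has been counted, the bound of Thm. 7 is cheap to evaluate. A minimal sketch (ours) using SciPy's beta quantile function:

# A minimal sketch (ours) evaluating the bound of Theorem 7 from the
# sample count n and the number of undesired outcomes n_o.
import numpy as np
from scipy.stats import beta

def auf_bound(n, n_o, delta):
    p_hat = 1.0 - n_o / n
    # Scenario-approach part: beta quantiles as in the theorem.
    p_lo = 0.0 if n_o == n else \
        1.0 - beta.ppf(1.0 - delta / (2 * n), n_o + 1, n - n_o)
    p_hi = 1.0 if n_o == 0 else \
        1.0 - beta.ppf(delta / (2 * n), n_o, n - n_o + 1)
    # Hoeffding part.
    slack = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return max(p_lo, p_hat - slack), min(p_hi, p_hat + slack)

# E.g., 30 undesired outcomes among n = 1000 rehearsal samples:
lo, hi = auf_bound(1000, 30, delta=0.05)  # brackets P(Y_t in S | ...)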
E Data Details

E.1 Ride-Hailing Data

We give details of the ride-hailing data in this section. The variables included in the generation process are weather: the weather condition; #user: the number of users in the neighborhood; recommend: the recommendation level of the ride-hailing app for a specific route; congestion: the traffic congestion on the route; time: the time spent on the ride; discount: the discount provided by the app; and rating: the user rating for the ride. The rehearsal graph for the variables is given in Fig. 6. The presumed actionable variables that can be altered by the app are recommend and discount.

Figure 6: The rehearsal graph for the ride-hailing data, with the variables grouped into Xt, Zt, and Yt.

E.2 Bermuda Data

We give details of the Bermuda data in this section. The Bermuda data involve a set of environmental variables [57, 58, 37]. The variables included in the generation process are Chla: sea surface chlorophyll a; Sal: sea surface salinity; TA: seawater total alkalinity; DIC: seawater dissolved inorganic carbon; CO2: seawater pCO2; Temp: bottom temperature; NEC: net ecosystem calcification; Light: bottom light levels; Nut: the first principal component of NH4, NO2 + NO3, and SiO4; pHsw: seawater pH; and ΩA: seawater saturation state with respect to aragonite. The rehearsal graph for the variables is given in Fig. 7. The presumed actionable variables that can be altered are DIC, TA, ΩA, Chla, and Nut.

Figure 7: The rehearsal graph for the Bermuda data, with the variables grouped into Xt, Zt, and Yt.