# counterfactual_analysis_in_dynamic_latent_state_models__d97e2002.pdf Counterfactual Analysis in Dynamic Latent-State Models Martin Haugh 1 Raghav Singal 2 We provide an optimization-based framework to perform counterfactual analysis in a dynamic model with hidden states. Our framework is grounded in the abduction, action, and prediction approach to answer counterfactual queries and handles two key challenges where (1) the states are hidden and (2) the model is dynamic. Recognizing the lack of knowledge on the underlying causal mechanism and the possibility of infinitely many such mechanisms, we optimize over this space and compute upper and lower bounds on the counterfactual quantity of interest. Our work brings together ideas from causality, statespace models, simulation, and optimization, and we apply it on a breast cancer case study. To the best of our knowledge, we are the first to compute lower and upper bounds on a counterfactual query in a dynamic latent-state model. 1. Introduction Counterfactual analysis, falling on the third rung of Pearl s ladder of causation (Pearl & Mackenzie, 2018), is a fundamental problem in causality. It requires us to imagine a world where a certain policy was enacted with a corresponding outcome given that a different policy and outcome were actually observed. It is performed via the 3-step framework of abduction (conditioning on the observed data), action (changing the policy), and prediction (computing the counterfactual quantity of interest (CQI)), and has wide-ranging applications (Pearl, 2009a;b). As a concrete application in healthcare and legal reasoning, consider someone who recently died from breast cancer. The exact progression of her disease is unknown. What is known, however, is that over a period of time prior to her diagnosis, her insurance company adopted a strategy of denying her regular scans (e.g., mammograms) even though 1Imperial College 2Dartmouth College. Correspondence to: MH , RS . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). these scans should have been covered by her policy. Had these scans gone ahead, the cancer may have been found earlier and the patient s life saved. Now a court wants to know the probability that her life would have been saved had the routine scans been permitted. On top of the challenges posed by standard counterfactual analysis, there are two that are particular to such a setting. First, it s possible the underlying state of the patient (e.g., stage of cancer) is hidden / latent and we only observe a noisy signal depending on the accuracy of the scan (e.g., sensitivity and specificity of a mammogram). Second, the underlying model is dynamic as the patient s state evolves over time. As such, our goal in this work is to perform counterfactual analysis in dynamic latent-state models.1 Two streams of work are closely related to ours. The first relates to works on constructing bounds on CQIs (Balke & Pearl, 1994; Tian & Pearl, 2000; Kaufman et al., 2005; Cai et al., 2008; Pearl, 2009b; Mueller et al., 2021; Zhang et al., 2021). These papers focus on static models. We note that despite some similarities of our work with Zhang et al. (2021), the two approaches are quite different. In particular, while both papers recognize the relevance of polynomial optimization for bounding CQIs, Zhang et al. (2021) do not solve polynomial optimization problems but instead propose Monte-Carlo algorithms as a work-around. 
In contrast, we actually solve polynomial optimization problems via sample average approximations (SAAs), which we generate via Monte-Carlo. As such, Monte-Carlo serves as an input to our polynomial programs whereas Zhang et al. (2021) use it as a substitute for polynomial programs. As mentioned above, another difference is our focus on dynamic models whereas Zhang et al. (2021) focus on static models. The second stream is more recent and concerns counterfactual analysis in dynamic models (Buesing et al., 2019; Oberst & Sontag, 2019; Lorberbom et al., 2021; Tsirtsis et al., 2021). Except Buesing et al. (2019), none of these works allows for latent states. In addition, these works perform counterfactual analysis by embedding assumptions that are strong enough to restrict the underlying set of causal mechanisms to a singleton. In particular, Buesing et al. (2019) explicitly fix 1The key feature distinguishing a static model from a dynamic model with T periods say, is that the single-period structure is repeated T times. As we shall see, our framework takes advantage of this repeated structure in several ways. Counterfactual Analysis in Dynamic Latent-State Models a single causal mechanism whereas Oberst & Sontag (2019) and Tsirtsis et al. (2021) invoke counterfactual stability and implicitly fix the causal mechanism (via the Gumbel-max distribution). Lorberbom et al. (2021) extend the Gumbelmax approach but their choice of causal mechanism is the one that minimizes variance when estimating the CQI. In summary, none of these approaches explicitly account for all possible causal mechanisms, and therefore, they do not consider the construction of lower and upper bounds on the CQI, which is our focus. Our key contribution is to provide a principled framework for counterfactual analysis in dynamic latent-state models. We define our problem in 2 and discuss counterfactual stability in 3. In 4, we present our solution approach and we describe our numerics in 5. We conclude in 6. 2. Problem Definition We first define the underlying dynamic latent-state model ( 2.1) and then describe the counterfactual analysis problem ( 2.2). We will use a breast cancer application as a vehicle for explaining ideas throughout but it should be clear our framework is quite general. 2.1. A Dynamic Latent-State Model The model, visualized in Figure 1, has T discrete periods. In each period t, the system is in a hidden state Ht H (finite). As a stochastic function of Ht and the policy Xt X (finite), we observe an emission Ot O (finite). The emission probability is denoted by ehxi := P(Ot = i | Ht = h, Xt = x) for all (h, x, i). This is followed by the state Ht transitioning to Ht+1 with transition probability qhih := P(Ht+1 = h | Ht = h, Ot = i). The model M comprises three primitives: M (p, E, Q), where p := [ph]h denotes the initial state distribution with ph := P(H1 = h) for all h, E := [ehxi]h,x,i, and Q := [qhih ]h,i,h . H1 H2 . . . HT Figure 1. A dynamic latent-state model. States H1:T are hidden (red). Emissions O1:T are observed (blue). X1:T represents the policy (observed). (We use the notation X1:T := (Xt)T t=1.) In the breast cancer application, the time periods map to the frequency of mammograms (e.g., 6 months) and the hidden state Ht {1, . . . , 7} denotes the patient s condition. State 1 equates to the patient being healthy whereas states 2 and 3 correspond to undiagnosed in-situ and invasive breast cancer, respectively. 
States 4 and 5 correspond to diagnosed in-situ and invasive breast cancer respectively, with the understanding that the cancer treatment has begun (since it has been diagnosed). States 6 and 7 are absorbing and denote recovery from cancer (due to treatment) and death from cancer, respectively. The observation Ot {1, . . . , 7} captures the mammogram result. A value of 1 means no screening took place, whereas 2 denotes a negative screening result (possibly a false negative). A value of 3 corresponds to a positive mammogram result, but followed by a negative biopsy (i.e., the patient is healthy and the mammogram produced a false positive). Observations 4 and 5 map to correctly diagnosed in-situ and invasive cancer respectively, i.e., a positive mammogram followed by a positive biopsy. Observations 6 and 7 are used to denote patient recovery and death from breast cancer, respectively. The variable Xt {0, 1} models the insurance company s coverage policy for the mammograms, with 0 denoting the company covers it and 1 denoting the company (incorrectly) denies the coverage. If the coverage is denied, then the mammogram is not performed and hence, the observation cannot be 2, 3, 4, or 5. (In this application, the Xt s are deterministic but in general, they could be the result of a randomized policy.) Our model is therefore a generalization of a hidden Markov model (HMM) since Ht+1 depends not only on Ht but also on Ot. The dependence on Ot is needed to capture the fact that if cancer was detected during period t and treatment began at that point of time, i.e., Ot {4, 5}, then Ht+1 depends on the fact that the treatment began in period t. For example, if Ht = 2 (in-situ cancer) and Ot = 4 (in-situ diagnosed and hence, treatment began), then Ht+1 would be different compared to when Ht = 2 and Ot = 2 (false negative and hence, treatment did not begin). Remark 1. Ayer et al. (2012) employed a similar model for determining an optimal screening strategy for breast cancer but as their goal was to optimize over screening strategies, their model was a partially observable Markov decision process (POMDP). In contrast, our goal is not to find an optimal strategy but to evaluate CQIs. As such, our model is not a POMDP although it is easily related to a POMDP setting. For example, we can view the insurance company s observed coverage strategy and the counterfactual strategy where coverage is always provided, as being feasible strategies from a POMDP. Finally, we also note that a practical justification for our model comes from the simulation model used by the National Cancer Institute (UWBCS, 2013). 2.2. The Counterfactual Analysis Problem We now use our dynamic model to state the counterfactual analysis problem. Observed data. Suppose we observe emissions o1:T with the underlying policy being x1:T . The true hidden states h1:T are not observed. In the context of breast cancer, the Counterfactual Analysis in Dynamic Latent-State Models observations for a particular patient might be as follows: o1, . . . , oτs 1 | {z } {2,3} , oτs, . . . , oτe | {z } =1 , oτe+1, . . . , oτd 1 | {z } {4,5} , oτd:T | {z } =7 That is, the patient was screened (x1:τs 1 = 0) and appeared healthy (o1:τs 1 {2, 3}) up to and including time τs 1. Coverage was denied during periods τs to τe, i.e. xτs:τe = 1; (see red font in (1). Hence, screening was not performed during those periods (oτs:τe = 1). 
As soon as the coverage for screening was re-approved (period τe + 1 and hence, xτe+1 = 0), the patient was found to have cancer (either in-situ or invasive) and the corresponding treatment began; thus, oτe+1 {4, 5}. Unfortunately, the patient died at τd. CQI. We focus on the well-known probability of necessity (PN) (Pearl, 2009a) as our CQI. It is the probability the patient would have not died (counterfactual state e HT = 7) had the screening been covered in every period (intervention policy ex1:T = 0) given the observed data (o1:T , x1:T ). ( Tilde notation denotes quantities in the counterfactual world.) The interpretation of ex1:T is straightforward as it is fixed exogenously. The counterfactual state e HT is obtained via the 3 steps of abduction, action, and prediction (Pearl, 2009b). Step 1 (abduction) involves conditioning on the observed data (o1:T , x1:T ) to form a posterior belief over the hidden states. Step 2 (action) changes the policy from x1:T to ex1:T and brings us to the counterfactual world f M. Step 3 (prediction) computes PN in the counterfactual model: PN = P( e HT = 7), (2) with the understanding that the event { e HT = 7} is conditional on (o1:T , x1:T ). Though we focus on PN, it is easy to extend our framework to a broad class of CQIs as the abduction and action steps do not depend on the CQI. Given our focus on counterfactual analysis, we will assume the primitives (p, E, Q) are known. We discuss their calibration to real-world data in 5 and emphasize that even with known (p, E, Q), counterfactual analysis is challenging. This is because we are interested in counterfactuals at an individual level (i.e., conditioning on the patient-level data (o1:T , x1:T ) via abduction), as opposed to the population level. A population-level counterfactual analysis would ignore the first step of abduction but simply change the policy to ex1:T to predict the CQI (by simulating the resulting model and obtaining a Monte-Carlo estimate of PN or doing it in closed-form if analytically tractable). However, this is very different from the task at hand, which falls on the highest rung of Pearl s ladder of causation (Pearl & Mackenzie, 2018). For instance, consider a patient who dies immediately after the coverage was denied versus a patient who dies a couple of years after the coverage was denied. Clearly, the first patient had a more aggressive cancer and hence we expect that her PN would be lower. By conditioning on individual-level data (o1:T , x1:T ), we are able to account for such differences. However, it makes the problem considerably more challenging. In our dynamic latent-state model, each of the three steps of abduction, action, and prediction presents its own set of challenges2, which we discuss when presenting our methodology in 4. Before doing so, we discuss the notion of counterfactual stability (CS), which has become a popular approach in some settings (Oberst & Sontag, 2019). 3. Limitations of Counterfactual Stability Instead of discussing CS in our dynamic latent-state model, we do so using the following simple model: X Y . Suppose we observe an outcome Y = y under policy X = x. With Yx := Y | (X = x), CS requires that the counterfactual outcome under an interventional policy ex (denoted by e Y := Yex | Yx = y) cannot be y (for y = y) if P(Yex = y)/P(Yx = y) P(Yex = y )/P(Yx = y ). In words, CS states that if y was observed and this outcome becomes relatively more likely than y under the intervention, then the counterfactual outcome e Y can not be y . 
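To make the CS condition concrete, the short Python sketch below is our own illustration (the helper name `cs_excluded` and the dictionary representation of the PMFs are ours, not the paper's): given the outcome distributions under the observed policy and under the intervention, together with the factually observed outcome, it returns the set of outcomes that CS rules out as counterfactuals. Running it on the distributions of Example 1 below shows that nothing is excluded when "better" is observed without treatment, which is precisely the issue Example 1 raises.

```python
def cs_excluded(p_obs, p_int, y_obs):
    """Outcomes y' != y_obs that counterfactual stability (CS) rules out.

    p_obs[y] = P(Y_x = y) under the observed policy x.
    p_int[y] = P(Y_xtilde = y) under the intervention xtilde.
    CS forbids y' whenever P(Y_xtilde = y_obs)/P(Y_x = y_obs) >= P(Y_xtilde = y')/P(Y_x = y').
    """
    ratio_obs = p_int[y_obs] / p_obs[y_obs]
    return {y for y in p_obs if y != y_obs and ratio_obs >= p_int[y] / p_obs[y]}

# Distributions of Example 1 below: Y in {bad, better, best}.
p0 = {"bad": 0.2, "better": 0.3, "best": 0.5}   # no treatment (x = 0)
p1 = {"bad": 0.2, "better": 0.2, "best": 0.6}   # treatment (x = 1)
print(cs_excluded(p0, p1, "better"))  # set(): neither 'bad' nor 'best' is ruled out
print(cs_excluded(p0, p1, "best"))    # {'bad', 'better'}: here CS pins the counterfactual to 'best'
```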
Though somewhat appealing, the appropriateness of CS depends on the application and should be justified by domain specific knowledge. Moreover, we show in Example 1 that CS can permit counterfactuals that it was seemingly designed to exclude. Example 1. Consider the X Y model and suppose X {0, 1} denotes a medical treatment and Y {bad, better, best} the patient outcome. For illustration, suppose the outcome Yx obeys the following distribution: Y0 {bad, better, best} w.p. {0.2, 0.3, 0.5} and Y1 {bad, better, best} w.p. {0.2, 0.2, 0.6}. That is, under treatment (x = 1), the best outcome becomes more likely but the likelihood of the bad outcome does not change. Consider a patient whose outcome Y was better under no treatment (x = 0). Suppose also that domain specific knowledge tell us that even at the individual level, the counterfactual outcome e Y should not be worse under treatment (ex = 1) than under no treatment (x = 0). However, since P(Y1 = better) P(Y0 = better) = 0.2 0.2 = P(Y1 = bad) P(Y0 = bad), bad is a feasible counterfactual outcome under CS. Even if CS is appropriate, its current operationalization has a key limitation. In particular, instead of considering all possible structural causal models (SCMs) that obey CS, both Oberst & Sontag (2019) and Tsirtsis et al. (2021) pick 2Instead of using the twin networks approach (Pearl, 2009b), we perform the counterfactual analysis directly by leveraging the structure in our model. Counterfactual Analysis in Dynamic Latent-State Models one SCM via the Gumbel-max distribution. Ideally, one should characterize the space of all SCMs obeying CS, and map that space into appropriate bounds on the CQI. We present our optimization-based framework to perform counterfactual analysis next. Our framework does not rely on CS. However, if CS is deemed appropriate for one or more components of the SCM (see 4), our approach allows us to encode CS via linear constraints in the optimization and characterize the entire space of solutions that obey CS. We do this in 5 to negatively answer the open question of Oberst & Sontag (2019) regarding whether Gumbel-max obeys CS uniquely. Further, if enforcing the so-called pathwise monotonicity (PM) is desirable, i.e., ensuring the counterfactual outcome does not worsen under a better intervention (as we assumed in Example 1), then we can embed it in our optimization via linear constraints as well. 4. Counterfactual Analysis via Optimization We now present our solution methodology for the counterfactual analysis problem introduced in 2. We first discuss the underlying SCM ( 4.1), which is a precursor to defining the counterfactual model f M ( 4.2), which feeds into our optimization framework for counterfactual analysis ( 4.3). 4.1. The Structural Causal Model (SCM) Figure 2. The SCM underlying the dynamic latent-state model. The only difference between the SCM here and Figure 1 is the addition of (grey) exogenous noise nodes [Ut, Vt]t. To understand the SCM (Figure 2), consider Ot for any t, which is a stochastic function of (Ht, Xt). The stochasticity is driven by the exogenous noise vector Vt := [Vthx]h,x, which comprises of |H||X| noise variables. We model the exogenous node as a vector (as opposed to a scalar) to capture the fact that each Othx := Ot | (Ht = h, Xt = x) defines a distinct random variable for all (h, x). Moreover, these random variables might be independent or not. One way to handle this is to associate each Othx with a noise variable Vthx. 
The dependence structure among these noise variables [Vthx]h,x is then what determines the dependence structure among [Othx]h,x. The structural equation obeys Ot = f(Ht, Xt, Vt) = P h,x fhx(Vthx)I{Ht=h,Xt=x} (3a) where fhx( ) is defined using the emission distribution [ehxi]i and Vthx Unif[0, 1] wlog. Similarly, for t > 1, recognizing that each Hthi := Ht | (Ht 1 = h, Ot 1 = i) is a distinct random variable for all (h, i), we associate each Hthi with its own noise variable Uthi: Ht = g(Ht 1, Ot 1, Ut) = P h,i ghi(Uthi)I{Ht 1=h,Ot 1=i} (3b) where ghi( ) is defined using the transition distribution [qhih ]h and Uthi Unif[0, 1] wlog. The representation in (3a) allows us to model [Othx]h,x and capture any dependence structure among these random variables by specifying the joint multivariate distribution of Vt. Since the univariate marginals of Vt are known (Unif[0, 1]), specifying the multivariate distribution amounts to specifying the dependence structure or copula. (Of course, the same comment applies to (3b) and Ut as well.) For example, if the Vthx s are mutually independent (the independence copula) and we have (Ht = h , Xt = x ), then inferring the conditional distribution of Vth x will tell us nothing about the Vthx s for (h, x) = (h , x ). Alternatively, if Vthx = Vth x for all pairs (h, x) and (h , x ), then this models perfect positive dependency (the comonotonic copula) and inferring the conditional distribution of Vth x amounts to simultaneously inferring the conditional distribution of all the Vthx s. We emphasize that we must work with the exogenous vectors (Ut, Vt) when doing a counterfactual analysis since different joint distributions of (Ut, Vt) will lead to (possibly very) different values of PN. If we are not doing a counterfactual analysis and only care about the joint distribution of a (subset of) (O1:T , H1:T ) then our analysis will only depend on the joint distribution of the (Ut, Vt) s via their known univariate marginals. We note the Ut s and Vt s must be mutually independent in order for the SCM to be consistent with the dependence / independence relationships implied by the model of Figure 1. In our model, the emissions and the state transitions are time-independent. Thus, it is natural to assume the copulas underlying Vt and Ut are time-independent (time invariance). As such, we define the notation Ohx := Ot | (Ht = h, Xt = x) and Hhi := Ht+1 | (Ht = h, Ot = i).3 Then, ehxi = P(Ohx = i) and qhih = P(Hhi = h ). While the copula view is useful from a conceptual point of view (since specifying copulas for Ut and Vt amounts to specifying an SCM), it is more convenient to work with an alternative construction of the SCM. This is because in discrete-state space models, there will be infinitely many joint distributions of V (and U) that all lead to the same joint distribution of [Ohx]h,x (and [Hhi]h,i). In other words, the joint distribution of [Vhx]h,x does not uniquely identify 3Ot | (Ht = h, Xt = x) is time-independent and hence, we use the notation Ohx instead of Othx. Same logic holds for Hhi. Counterfactual Analysis in Dynamic Latent-State Models the joint distribution of [Ohx]h,x. This is a consequence of Sklar s Theorem from the theory of copulas and is discussed4 further in C. We will therefore take a more direct approach by modeling the unknown joint distribution of [Ohx]h,x (and [Hhi]h,i). 
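The inverse-transform view of the structural equations (3) can be made concrete with a minimal sketch (our own; the emission values in `E` are hypothetical and chosen only for illustration): each emission random variable is obtained by pushing a Unif[0,1] noise variable through the inverse CDF of its emission distribution, and the choice of copula is encoded purely in how the uniforms are coupled. Independent uniforms give the independence copula, while a single shared uniform gives the comonotonic copula; in both cases the marginals are untouched, and only the joint distribution of the emission variables changes.

```python
import numpy as np

def inverse_cdf(pmf, u):
    """Inverse-transform sampling: map u in [0, 1] to an outcome index via the CDF of pmf."""
    return int(np.searchsorted(np.cumsum(pmf), u, side="left"))

rng = np.random.default_rng(0)

# Hypothetical emission distributions E[(h, x)] over |O| = 3 outcomes (illustration only).
E = {(0, 0): [0.7, 0.2, 0.1],
     (1, 0): [0.1, 0.3, 0.6]}

# Independence copula: one independent uniform per (h, x) pair.
V_indep = {hx: rng.uniform() for hx in E}
O_indep = {hx: inverse_cdf(E[hx], V_indep[hx]) for hx in E}

# Comonotonic copula: a single shared uniform drives every (h, x) pair.
v = rng.uniform()
O_comon = {hx: inverse_cdf(E[hx], v) for hx in E}

print(O_indep, O_comon)
```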
As such, we define θehex,hx(ei, i) := P(Oehex = ei, Ohx = i) (4a) πehei,hi(eh , h ) := P(Hehei = eh , Hhi = h ) (4b) and observe that θhx,hx(i, i) = ehxi (h, x, i) (5a) πhi,hi(h , h ) = qhih (h, i, h ). (5b) This holds because πhi,hi(h , h ) = P(Hhi = h , Hhi = h ) = P(Hhi = h ) = qhih . We also have symmetry, i.e., θehex,hx(ei, i) = θhx,ehex(i,ei) (eh, ex,ei) (h, x, i) (6a) πehei,hi(eh , h ) = πhi,ehei(h ,eh ) (eh,ei,eh ) (h, i, h ). (6b) This is because πehei,hi(eh , h ) = P(Hehei = eh , Hhi = h ) = P(Hhi = h , Hehei = eh ) = πhi,ehei(h ,eh ). We only defined the pairwise marginals in (4) but define the joint PMFs in (9). We are now ready to discuss the counterfactual model. 4.2. The Counterfactual Model f M Recall from 2 that f M is obtained after the two steps of abduction (conditioning on the observed data (o1:T , x1:T )) and action (changing the policy from x1:T to ex1:T ). Understanding the dynamics underlying f M are non-trivial, primarily due to the abduction step where the goal is to obtain the posterior distribution of the hidden path H1:T . It is not possible to provide a closed-form expression for this distribution but we can use filtering / smoothing methods to describe the posterior dynamics of H1:T . (See B for details.) We can therefore use these dynamics to generate B Monte Carlo samples [h1:T (b)]B b=1 from the posterior, i.e., from the distribution of H1:T | (o1:T , x1:T ). Then, by conditioning on each sample b, it is possible to characterize f M. In particular, denote by f M(b) (ep(b), [e E(t)(b)]t, [ e Q(t)(b)]t) the counterfactual model corresponding to posterior sample h1:T (b). Similar to the primitives (p, E, Q) in 2, the counterfactual primitives (ep(b), [e E(t)(b)]t, [ e Q(t)(b)]t) correspond to initial state, emission, and transition distributions. As H1 in Figure 1 has no parents, ep(b) is such that the counterfactual hidden state in period 1 equals the posterior sample h1(b), i.e., eh1(b) = h1(b). In contrast with E and Q, both e E(t)(b) and e Q(t)(b) are time-dependent (note 4In C, we also discuss specific copulas (e.g., independence and comonotonic copulas) that can be used to provide benchmark values of PN. the super-script (t) ). This is because the period t counterfactual emission e E(t)(b) := [ee(t) ehexei(b)]eh,ex,ei and transition e Q(t)(b) := [eq(t) eheieh (b)]eh,ei,eh probabilities are as follows: ee(t) ehexei(b) = P(Oehex = ei | Oht(b)xt = ot) (7a) eq(t) eheieh (b) = P(Hehei = eh | Hht(b)ot = ht+1(b)). (7b) (The Ohx and Hhi notation is defined above (4).) The dependence on t is through the observed data (ot, xt) and the posterior samples (ht(b), ht+1(b)). As such, for each posterior path b, f M(b) is a time-dependent dynamic latent-state model. If we knew e E(t)(b) and e Q(t)(b), then we could simulate f M(b) to obtain a Monte-Carlo estimate of our CQI by averaging the CQI over the B posterior sample paths. However, e E(t)(b) and e Q(t)(b) are unknown as they depend on the joint distributions of Ut and Vt. Towards this end, we can combine (7) with (4) to obtain ee(t) ehexei(b) = θehex,ht(b)xt(ei, ot) eq(t) eheieh (b) = πehei,ht(b)ot(eh , ht+1(b)) qht(b)otht+1(b) , which express the unknown and time-dependent emission and transition distributions in terms of the unknown θ and π that are time-independent. 4.3. Polynomial Optimization We now propose an optimization model where we treat the unknowns (θ, π) as decisions and maximize (minimize) the CQI to obtain an upper bound (lower bound). 
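Before presenting the objective and constraints, it may help to see how a candidate (θ, π) pins down the counterfactual primitives of 4.2 for a given posterior sample: by (7) combined with (4), the period-t counterfactual emission distribution is the pairwise joint θ evaluated at the factual emission event and divided by the known marginal e_{h_t(b) x_t o_t}, and analogously for the transitions via π and q. The sketch below is ours, with hypothetical dense array shapes for clarity; in practice one would exploit the sparsity discussed later.

```python
def counterfactual_primitives(theta, pi, E, Q, h, o, x, t):
    """Period-t counterfactual emission/transition distributions for one posterior sample b.

    Inputs are NumPy arrays with hypothetical dense shapes:
      theta[h1, x1, i1, h2, x2, i2] = P(O_{h1 x1} = i1, O_{h2 x2} = i2)  (pairwise joint)
      pi[h1, i1, g1, h2, i2, g2]    = P(H_{h1 i1} = g1, H_{h2 i2} = g2)  (pairwise joint)
      E[h, x, i] = e_{hxi},  Q[h, i, g] = q_{hig}                        (known primitives)
      h, o, x: the posterior hidden path h_{1:T}(b), observations o_{1:T}, policy x_{1:T}.
    Returns e_tilde[h~, x~, i~] and q_tilde[h~, i~, h~'] as in equations (7).
    """
    ht, ot, xt, ht1 = h[t], o[t], x[t], h[t + 1]
    # (7a): condition each O_{h~ x~} on the factual emission event O_{h_t x_t} = o_t.
    e_tilde = theta[:, :, :, ht, xt, ot] / E[ht, xt, ot]
    # (7b): condition each H_{h~ i~} on the factual transition event H_{h_t o_t} = h_{t+1}.
    q_tilde = pi[:, :, :, ht, ot, ht1] / Q[ht, ot, ht1]
    return e_tilde, q_tilde
```

These per-period primitives are exactly what enter the polynomial objective constructed next.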
We present our optimization model in terms of the objective and constraints, followed by a discussion on how we can enforce CS and PM (if indeed they were deemed appropriate). Objective. As in (2), we wish to understand the PN, which equals P( e HT = 7), where e HT is the hidden state at time T under f M. The randomness in e HT depends on the randomness in (i) the true hidden path H1:T (captured by [h1:T (b)]b) and (ii) the counterfactual model f M | H1:T after conditioning on H1:T (captured by f M(b)). Lemma 1 decomposes PN using these two uncertainties. (All proofs are in A.) Lemma 1. We have PN = 1 lim B 1 B b=1 Pf M(b)( e HT = 7). We next express Pf M(b)( e HT ) in terms of (θ, π) from (4). Counterfactual Analysis in Dynamic Latent-State Models Lemma 2. For t {T, T 1, . . . , 2}, Pf M(b)(eht) := Pf M(b)( e Ht = eht) obeys the following recursion (over t): Pf M(b)(eht) = X eht 1,eot 1 πeht 1eot 1,ht 1(b)ot 1(eht, ht(b)) qht 1(b)ot 1ht(b) θeht 1ext 1,ht 1(b)xt 1(eot 1, ot 1) eht 1(b)xt 1ot 1 Pf M(b)(eht 1). The recursion breaks at t = 1: Pf M(b)(eh1) = ( 1 if eh1 = h1(b) 0 otherwise. Putting together Lemmas 1 and 2 allows us to express PN in terms of the various primitives, all of which except (θ, π) are known (or can be sampled). Thus, we use the notation PN(θ, π | [h1:T (b)]b). As soon as we fix (θ, π), we can estimate PN. However, it is unclear apriori what we should fix (θ, π) at. We might have some information on the structure of (θ, π) that can help us shrink their feasibility space but in general, there can be many (θ, π)s that are valid . To overcome this lack of knowledge, we take an agnostic view and compute bounds over PN. The upper (lower) bound is computed by maximizing (minimizing) PN(θ, π | [h1:T (b)]b) over the set of (θ, π) that are valid . Denoting by F the set of valid (θ, π) (discussed below), we define PNub(B) := max (θ,π) F PN(θ, π | [h1:T (b)]b) (8a) PNlb(B) := min (θ,π) F PN(θ, π | [h1:T (b)]b). (8b) Both optimizations in (8) are sample average approximations (SAA) due to the use of the Monte-Carlo samples [h1:T (b)]b. Thus, PNub(B) and PNlb(B) are estimates of the true PNub and PNlb. However, given [h1:T (b)]b are iid samples, the following consistency result is immediate (cf. Proposition 5.2 in Shapiro et al. (2021)). Proposition 1. (PNlb(B), PNub(B)) converges to (PNlb, PNub) w.p. 1 as B . In addition, we can characterize the asymptotics of (PNlb(B), PNub(B)) via results in the SAA theory and we refer the reader to 5.1.2 of Shapiro et al. (2021). Constraints (feasibility set F). We now discuss the feasibility set F. Recall from (4) that we used θ and π to denote the pairwise marginal distributions over [Ohx]h,x and [Hhi]h,i respectively. We will now also use them to represent the full joint distributions of [Ohx]h,x and [Hhi]h,i respectively. To simplify notation, let k (h, x) and m (h, i). Hence, Ok Ohx, eki ehxi Hm Hhi, qmh qhih . We have k [K] and m [M], where K := |H||X| and M := |H||O|. The K and M dimensional joint PMFs for all i1, . . . , i K O and h1, . . . , h M H are defined as θ1,...,K(i1, . . . , i K) := P(O1 = i1, . . . , OK = i K) (9a) π1,...,M(h1, . . . , h M) := P(H1 = h1, . . . , HM = h M). (9b) Note that we only have one joint θ1,...,K among K random variables in contrast to multiple pairwise marginals [θkℓ](k,ℓ). Each of these joint PMFs are decision variables in the optimization (in addition to the pairwise decision variables) and must obey the following set of constraints. 
First, the 1-dimensional marginals of θ and π must equal the given 1-dimensional marginals [eki](k,i) and [qmh](m,h): X {i1,...,i K}\{ik} θ1,...,K(i1, . . . , i K) = e1ik ik, k {h1,...,h M}\{hm} π1,...,M(h1, . . . , h M) = q1hm hm, m. Recall that (Q, E), i.e., the right-hand-sides of (10), are known. Moreover, since Q and E themselves define 1dimensional probability distributions and therefore sum to 1, (10) ensures the same will be true of both the joint PMFs, i.e., they will also sum to 1. Second, we must link the pairwise marginals to the joints: θkℓ(ik, iℓ) = X {i1,...,i K}\{ik,iℓ} θ1,...,K(i1, . . . , i K) πmn(hm, hn) = X {h1,...,h M}\{hm,hn} π1,...,M(h1, . . . , h M). (11a) holds for all ik, iℓ O and k, ℓ [K] s.t. k < ℓ whereas (11b) holds for all hm, hn H and m, n [M] s.t. m < n. The k < ℓ and m < n conditions avoid unnecessary duplication (recall (5) and (6)).5 Finally, we need to ensure non-negativity: θ, π 0 (12) 5In fact, given (5) and (6), we do not need to define all pairwise marginals as decision variables but only for k < ℓ and m < n . This is because if an optimization has two decision variables x and y and the constraint x = y, we can eliminate y and the constraint x = y by replacing y with x everywhere in the optimization. Counterfactual Analysis in Dynamic Latent-State Models where we now use (θ, π) to denote all of the corresponding, i.e., joint and pairwise, decision variables. Let F be the feasible region over (θ, π) defined by the constraints (10), (11), and (12). Observe that PN(θ, π) is a polynomial in (θ, π) (cf. Lemmas 1 and 2) and the constraints in F are linear. Thus, each of the problems in (8) fall within the class of polynomial optimization (Anjos & Lasserre, 2011). Denoting by PN the PN under the true (unknown) (θ, π), we obtain the following inequalities. Proposition 2. PNlb PN PNub. Enforcing CS and PM via linear constraints. Suppose that at some time, the patient was in state h, the emission was i, followed by a transition to state h . This maps to the realization Hhi = h . For eh = h , CS requires that if P(Hehei = h )/P(Hhi = h ) P(Hehei = eh )/P(Hhi = eh ), then P(Hehei = eh | Hhi = h ) = 0. Observe that the if condition is equivalent to qeheih /qhih qeheieh /qhieh and the LHS of then equals πehei,hi(eh , h )/qhih . Hence, for the state transitions, CS is equivalent to adding the following linear constraints for all (h, i, h ,eh,ei,eh ): πehei,hi(eh , h ) = 0 if qeheih qhih qeheieh qhieh . (13a) Similarly, for emissions, CS can be modeled by adding the following linear constraints for all (h, x, i,eh, ex,ei): θehex,hx(ei, i) = 0 if eehexi ehxi eehexei ehxei . (13b) Hence, we can characterize the space of all SCMs that obey CS, which is in contrast to picking just one such SCM (Oberst & Sontag, 2019). Enforcing CS naturally leads to tighter bounds, but the bounds may not be legitimate if the true (θ, π) does not satisfy CS. Denoting by PNub cs and PNlb cs the bounds obtained by adding CS constraints (13) to the optimizations in (8), we have the following result. Proposition 3. PNlb PNlb cs PNub cs PNub. PM can also be enforced via linear constraints. To see this, suppose the patient has in-situ cancer in period t which is not detected but the patient s state remains at in-situ in period t + 1. Then, in the counterfactual world, if the cancer is detected in period t, then PM would require that the cancer can not be worse than in-situ in period t + 1, i.e., P(Hehei = eh | Hhi = h ) = 0 for h = 2, i {1, 2}, h = 2, eh {2, 4}, ei = 4, eh {5, 7}. 
There can be multiple such cases to consider and we can enforce all the PM constraints by setting the corresponding πehei,hi(eh , h ) variables equal to 0 as πehei,hi(eh , h ) = P(Hhi = h )P(Hehei = eh | Hhi = h ). As with CS (Proposition 3), PM will result in bounds PNub pm and PNlb pm tighter than PNub and PNlb. Algorithm 1 Counterfactual analysis via optimization Require: (E, Q), (o1:T , x1:T ), B, ex1:T 1: h1:T (b) H1:T | (o1:T , x1:T ) b = 1, . . . , B 2: PNub(B) = max(θ,π) F PN(θ, π | [h1:T (b)]b) 3: PNlb(B) = min(θ,π) F PN(θ, π | [h1:T (b)]b) 4: return (PNlb(B), PNub(B)) We summarize our developments in Algorithm 1, which outputs the bounds (PNlb(B), PNub(B))6. Line 1 (sampling) can be executed efficiently (cf. B), and we discuss three computational considerations behind solving the polynomial optimizations (lines 2 and 3). First, though the constraints are linear, the objective is polynomial, making it a non-trivial non-convex optimization problem. To solve it, we leverage state-of-the-art developments in optimization. In particular, we use the BARON solver (Sahinidis, 2023), which relies on a polyhedral branch-and-cut approach, allowing it to achieve global optima (Tawarmalani & Sahinidis, 2005). We found it to work well in our numeric experiments ( 5). Second, in terms of the problem size, it follows from (4) and (9) that we have at most |H|4|O|2 + |H|2|O|2|X|2 pairwise variables and |O||H||X| + |H||H||O| joint variables. Similarly, it follows from (10) and (11) that the feasible region F is defined by at most |O||H||X| + |H|2|O| + |O|2|H|2|X|2 + |H|4|O|2 constraints. However, these are merely upper bounds and we can exploit the sparsity inherent in the underlying application (along with the variable and constraint elimination discussed in Footnote 5) to drastically reduce the problem size. For instance, in our breast cancer application, we have (|H|, |O|, |X|) = (7, 7, 2), with the above formulae giving over 1041 variables and 105 constraints. After we exploit sparsity (discussed in 5), they are reduced to 16,124 and 610, respectively. Further, as CS and PM can be modeled by setting appropriate variables to 0, they allow for further sparsity as we can delete those variables. Third, observe that a naive expansion of the recursion in Lemma 2 results in a number of terms that is exponential in T, which would result in memory issues for moderate to large values of T. Nonetheless, as we elaborate in D.1, it is possible to remove this exponential dependence on T by a reformulation of the optimization problem. This comes at the cost of introducing polynomial constraints. Nonetheless, this reformulation allowed us to obtain highquality solutions in the breast cancer setting with as many as 6We can output (PNlbcs(B), PNub cs (B)) and (PNlbpm(B), PNub pm(B)) as well by solving the same optimization problems but with additional linear constraints. Counterfactual Analysis in Dynamic Latent-State Models T = 100 periods ( D.3). In contrast, we run into memory issues for T as small as 11 with the original formulation. In fact, we discuss an alternative approach at the end of D.1. This approach allows us to compute the objective function efficiently without having to add any additional constraints. Unfortunately, the BARON solver does not allow us to use this approach and so we leave this issue for future research. 5. Numerical Experiments We now apply our approach to the breast cancer application we described in 1. Setup. 
We described the elements of the underlying dynamic latent-state model M (p, E, Q) in 2. It has a total of 7 states, 7 emissions, and 2 actions. Given patient-level data (o1:T , x1:T ), we wish to estimate the PN as defined in (2). The primitives (p, E, Q) are calibrated to real-data using a mix of sources, which we discuss in E.1. We consider two paths with the first path defined as: o1 |{z} =2 , o2, . . . , o T 2, o T 1 | {z } =1 , o T |{z} =7 x1 |{z} =0 , x2, . . . , x T 2, x T 1 | {z } =1 , x T |{z} =0 . That is, we observe a negative test result in period 1 after which screening was not performed for T 2 periods (red font). The patient died from breast cancer in period T. Note that under this path, given the calibrated primitives in E.1, it has to be the case that h T 1 = 3 (undiagnosed invasive) since a transition from state 2 (undiagnosed in-situ) to 7 is impossible. Further, the transition from 3 to 7 is not unlikely (q317 0.15). The second path is similar but with one difference: screening was performed in period T 1 and invasive cancer was detected: o1 |{z} =2 , o2, . . . , o T 2 | {z } =1 , o T 1 | {z } =5 , o T |{z} =7 x1 |{z} =0 , x2, . . . , x T 2 | {z } =1 , x T 1, x T | {z } =0 Hence, in contrast with path 1, the final transition from invasive to death was under treatment with probability q357 0.01 ( E.1), which is much smaller than q317 from above. Given that this low probability transition did occur, this suggests the patient had an aggressive cancer in path 2. As such, regardless of what the optimal θ and π are, the chances of survival on the counterfactual path would be low because of this aggressive nature of the cancer. This doesn t hold on path 1 as the cancer was less aggressive . We vary T {4, . . . , 10}, with a larger value of T suggesting the cancer may have progressed more slowly. We compute PN bounds using our framework (Algorithm 1), which we implemented in MATLAB (MATLAB, 2021). The feasibility set F over (θ, π) corresponds to (10), (11), and (12). To solve the polynomial optimizations, we use the MATLAB-BARON interface (Sahinidis, 2023) with CPLEX (IBM, 2017) as the LP / MIP solver . It solved each of our problem instances to global optimality within minutes / hours (depending on T), with an absolute termination tolerance of 0.01 (on an Intel Xeon E5 processor with 16 GB RAM). Optimizations for T = 10 took the longest time on average ( 2 hours). We generated B = 100 samples using our sampling method in B. It took less than a second and we found B = 100 was large enough to produce stable results for our SAA. We ensured this stability by computing our results for 20 seeds (for each (path, T) pair) and verifying the standard deviations to be small. Though stability over the seeds is important, our PN estimates may still be biased for a finite B (recall Proposition 1 only holds asymptotically). As a check, we also generated results for B = 500 and observed them to be very similar to the ones for B = 100. As noted below Algorithm 1, the sparse structure of E and Q drastically reduces the size of the problem. For example, when considering the πehei,hi(eh , h ) variables, we rule out the ones that map to impossible (h, i, h ) or (eh,ei,eh ) combinations (refer to E.1.2). The same observation also applies to all the joint variables (details in E.2). Results. The results for path 1 are displayed in Figure 3 (and for path 2 in Figure 10 ( E.5)), where we show the PN bounds as we vary T. 
In addition to the bounds (PNlb, PNub) computed via our baseline optimization (UB and LB), we show the bounds obtained when we encode CS (UB(CS) and LB(CS)) and PM (UB(PM) and LB(PM))7. We also show the PN estimate when we perform counterfactual simulations using the two copulas discussed in C (independence and comonotonic8). Finally, the naive estimate completely ignores the information in the observations, i.e., it does not execute the abduction step and is therefore an invalid estimate of PN. To simplify matters, we adopt an all-or-nothing approach whereby either CS is imposed for both hidden-state transitions and observations or not at all. We do the same for PM. Of course, it is possible to consider various combinations, e.g., imposing PM for hidden-state transitions only or imposing CS only for the observations, etc. This is also true of our copulas when we estimate PN for a particular SCM. In Figure 3, for example, the independence (comonotonic) curve corresponds to assuming the independence (comonotonic) copula for both hidden-state transitions and observations. But we could of course have assumed one copula for 7Details on the PM constraints for breast cancer are in E.3. 8Further details on the comonotonic copula specific to the breast cancer model are in E.4. Counterfactual Analysis in Dynamic Latent-State Models UB LB Naive Independence Comonotonic (a) UB / LB UB (CS) LB (CS) Naive Independence Comonotonic (b) UB / LB with CS UB (PM) LB (PM) Naive Independence Comonotonic (c) UB / LB with PM Figure 3. PN results for path 1 as we vary T {4, . . . , 10}. Observe that the LB, LB(CS), and LB(PM) curves coincide (the lowest curve in each figure). We show the average over 20 seeds and note that the standard deviation (s.d.) for every data point is smaller than 0.01. Given this small magnitude, we omit the 1 s.d. bars to reduce the clutter in the sub-figures. Note that to simulate the naive estimate and the two copulas (independence and comonotonic), we use 104 Monte-Carlo samples. These simulations were fast (a couple of minutes). Using 104 samples is in contrast to the 100 samples we use for SAA and we did so to ensure a low s.d. the hidden-state transitions and an entirely different one for the observations. Each such combination of copulas would yield a different SCM and therefore a feasible value of PN. The naive estimate is independent of the observed path and can fall outside the bounds. This makes sense as it does not perform abduction but simply simulates the original model M under the intervention policy ex1:T = 0. The naive estimates are very close to 1 as dying of breast cancer in any 5-year period9 is highly unlikely. For path 1, we obtain relatively tight bounds, with PN always above 0.85. This means that in the counterfactual world, the patient would have not died with high probability, consistent with our discussion around q317 above. Even in the absence of any additional structure such as CS or PM, the gap between the lower and upper bounds is within 10 percentage points. The gap gets tighter with CS (within 5 percentage points) and PM (within 1 percentage point!). The fact that the LB and UB under CS do not coincide resolves the open question of Oberst & Sontag (2019) regarding the uniqueness of the Gumbel-max mechanism w.r.t. CS it is not unique. It is not surprising that the comonotonic estimate falls close to the PM bounds. Interestingly, the estimated PN for the two copulas roughly cover the range of possibilities in terms of the bounds (Figure 3(a)). 
For path 2, the lower bounds are close to 0. This aligns with the fact that despite being diagnosed in period T 1 (and hence, provided treatment), the patient eventually died (which suggests that the patient had an aggressive cancer). The bounds without CS and PM are relatively loose, simply reflecting the lack of knowledge to reason in a counterfactual world. As soon as we inject knowledge via CS or PM, the bounds become much tighter. The experiments discussed so far are for up to T = 10 and we run into memory issues for T > 10 (recall the discussion 9Each period maps to 6 months so T = 10 maps to 5 years. at the end of 4.3). Nonetheless, as we show in D, we can enhance the scalability of the polynomial optimizations in (8) via a reformulation and an approximation. In fact, as we demonstrate via numerics, these ideas allow us to obtain high-quality solutions for T as large as 100 in just a few hours of compute time. 6. Concluding Remarks We have provided a framework for performing counterfactual analysis in dynamic latent-state models and in particular, computing lower and upper bounds on CQIs. There are several interesting directions for future research. First, we would like to handle the objective function in the optimization more efficiently as discussed at the end of 4.3. Specifically, BARON s solver appears to explicitly expand the objective function which results in a number of terms that is exponential in T. We were able to finesse this issue in D via a reformulation but we suspect the approach outlined at the end of 4.3 might provide a better solution. All told, it may therefore be worthwhile developing an optimization algorithm specifically tailored to the problem (a polynomial objective with linear constraints) rather than using an offthe-shelf solver. Another possible direction is exploring the use of variance reduction methods and other Monte-Carlo techniques to improve our basic Monte Carlo approach for generating posterior sample paths. Finally, on the practical front, it would be of interest to apply our framework to real-world medical applications and use domain-specific knowledge to obtain (via the imposition of additional constraints) tighter bounds on the CQIs. Acknowledgements We thank the ICML review team, Madhumitha Shridharan, and Jim Smith for taking the time to read the paper and providing very useful feedback. We also thank Nick Sahinidis for his support with BARON-related issues. Counterfactual Analysis in Dynamic Latent-State Models Anjos, M. F. and Lasserre, J. B. Handbook on semidefinite, conic and polynomial optimization, volume 166. Springer Science & Business Media, 2011. Ayer, T., Alagoz, O., and Stout, N. K. A POMDP approach to personalize mammography screening decisions. Operations Research, 60(5):1019 1034, 2012. Balke, A. and Pearl, J. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings, pp. 46 54. San Francisco (CA), 1994. Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012. Buesing, L., Weber, T., Zwols, Y., Heess, N., Racaniere, S., Guez, A., and Lespiau, J.-B. Woulda, coulda, shoulda: Counterfactually-guided policy search. In International Conference on Learning Representations, 2019. Cai, Z., Kuroki, M., Pearl, J., and Tian, J. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64(3):695 701, 2008. Haugh, M. B. and Lacedelli, O. R. Information relaxation bounds for partially observed Markov decision processes. 
IEEE Transactions on Automatic Control, 65(8):3256–3271, 2019.
IBM. ILOG CPLEX Optimizer Version 12.8. 2017.
Johnstone, P. A., Norton, M. S., and Riffenburgh, R. H. Survival of patients with untreated breast cancer. Journal of Surgical Oncology, 73(4):273–277, 2000.
Kaufman, S., Kaufman, J., MacLehose, R., Greenland, S., and Poole, C. Improved estimation of controlled direct effects in the presence of unmeasured confounding of intermediate variables. Statistics in Medicine, 25:1683–1702, 2005.
Lorberbom, G., Johnson, D. D., Maddison, C. J., Tarlow, D., and Hazan, T. Learning generalized Gumbel-max causal mechanisms. In Advances in Neural Information Processing Systems, volume 34, pp. 26792–26803, 2021.
MATLAB. Version 9.10.0 (R2021b). The MathWorks Inc., Natick, Massachusetts, 2021.
McNeil, A. J., Frey, R., and Embrechts, P. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press, 2nd edition, 2015.
Mueller, S., Li, A., and Pearl, J. Causes of effects: Learning individual responses from population data. arXiv, 2021.
Nelsen, R. An Introduction to Copulas. Springer, 2nd edition, 2006.
NIH. SEER Cancer Statistics Review (CSR) 1975–2017. 2020. URL https://seer.cancer.gov/archive/csr/1975_2017/results_merged/sect_04_breast.pdf.
Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4881–4890. PMLR, 09–15 Jun 2019.
Pearl, J. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009a.
Pearl, J. Causality. Cambridge University Press, 2nd edition, 2009b.
Pearl, J. and Mackenzie, D. The Book of Why. Penguin Books, 2018.
Sahinidis, N. V. BARON 2023.1.5: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2023.
Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2021.
Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
Sprague, B. L. and Trentham-Dietz, A. Prevalence of breast carcinoma in situ in the United States. JAMA: The Journal of the American Medical Association, 302(8):846, 2009.
Tawarmalani, M. and Sahinidis, N. V. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2):225–249, 2005.
Tian, J. and Pearl, J. Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 8:287–313, 2000.
Tsirtsis, S., De, A., and Rodriguez, M. Counterfactual explanations in sequential decision making under uncertainty. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 30127–30139. Curran Associates, Inc., 2021.
UWBCS. University of Wisconsin Breast Cancer Simulation Model. 2013. URL https://resources.cisnet.cancer.gov/registry/packages/uwbcs-wisconsin/.
Zhang, J., Tian, J., and Bareinboim, E. Partial counterfactual identification from observational and experimental data, 2021.
Lemma 1. We have $\mathrm{PN} = 1 - \lim_{B \to \infty} \frac{1}{B} \sum_{b=1}^{B} P_{\widetilde{\mathcal{M}}(b)}(\widetilde{H}_T = 7)$.
Proof.
Observe that PN = P e HT ( e HT = 7) [by definition] = 1 P e HT ( e HT = 7) [P(Y = y) = 1 P(Y = y)] = 1 E e HT [I{ e HT = 7}] [P(Y = y) = E[I{Y = y}]] = 1 EH1:T [Ef M|H1:T [I{ e HT = 7}]] [law of total expectation] b=1 Pf M(b)( e HT = 7) as B . [law of large numbers] The proof is now complete. Lemma 2. For t {T, T 1, . . . , 2}, Pf M(b)(eht) := Pf M(b)( e Ht = eht) obeys the following recursion (over t): Pf M(b)(eht) = X eht 1,eot 1 πeht 1eot 1,ht 1(b)ot 1(eht, ht(b)) qht 1(b)ot 1ht(b) θeht 1ext 1,ht 1(b)xt 1(eot 1, ot 1) eht 1(b)xt 1ot 1 Pf M(b)(eht 1). The recursion breaks at t = 1: Pf M(b)(eh1) = ( 1 if eh1 = h1(b) 0 otherwise. Proof. For t {T, T 1, . . . , 2}, observe that Pf M(b)(eht) = X eot 1 O Pf M(b)(eht,eht 1, eot 1) eot 1 O Pf M(b)(eht | eht 1, eot 1)Pf M(b)(eot 1 | eht 1)Pf M(b)(eht 1) eot 1 O eq(t 1) eht 1eot 1eht(b) ee(t 1) eht 1ext 1eot 1(b) Pf M(b)(eht 1) πeht 1eot 1,ht 1(b)ot 1(eht, ht(b)) qht 1(b)ot 1ht(b) θeht 1ext 1,ht 1(b)xt 1(eot 1, ot 1) eht 1(b)xt 1ot 1 Pf M(b)(eht 1). The base case (t = 1) holds since the counterfactual hidden state in period 1 equals the posterior sample h1(b) (recall from 4.2). The proof is now complete. B. Sampling Hidden Paths from the Posterior Distribution In this section, we show how one can efficiently perform filtering, smoothing, and sampling for the dynamic latent-state model in Figure 1. As our model is a generalization of an HMM, these algorithms are simple generalizations of the standard variants corresponding to an HMM (Barber, 2012). Counterfactual Analysis in Dynamic Latent-State Models Filtering. We first compute α(ht) := P(ht, o1:t, x1:t) which will yield the un-normalized filtered posterior distribution. We can then easily normalize it to compute P (ht | o1:t, x1:t) α(ht). We begin with α(h1) := P(o1 | h1, x1)P(h1 | x1)P(x1) = P(o1 | h1, x1)P(h1)P(x1). For t > 1, note that ht 1 P (ht, ht 1, o1:t 1, ot, x1:t) ht 1 P (ot | ht, ht 1, o1:t 1, x1:t) P (ht | ht 1, o1:t 1, x1:t) P(xt | ht 1, o1:t 1, x1:t 1)P (ht 1, o1:t 1, x1:t 1) ht 1 P (ot | ht, xt) P (ht | ht 1, ot 1) P(xt)P (ht 1, o1:t 1, x1:t 1) = P(xt)P (ot | ht, xt) X ht 1 P (ht | ht 1, ot 1) α(ht 1). Smoothing. We now compute β(ht) := P(ot+1:T , xt+1:T | ht, ot) with the understanding that β(h T ) = 1. For t < T, we have ht+1 P(ot+1, xt+1, ot+2:T , xt+2:T , ht+1 | ht, ot) ht+1 P(ot+2:T , xt+2:T | ht, ot, ht+1, ot+1, xt+1)P(ht+1, ot+1, xt+1 | ht, ot) ht+1 P(ot+2:T , xt+2:T | ht+1, ot+1)P(ot+1 | ht+1, xt+1, ht, ot)P(ht+1, xt+1 | ht, ot) ht+1 β(ht+1)P(ot+1 | ht+1, xt+1)P(xt+1 | ht+1, ht, ot)P(ht+1 | ht, ot) = P(xt+1) X ht+1 β(ht+1)P(ot+1 | ht+1, xt+1)P(ht+1 | ht, ot). Now, note that P (ht, o1:T , x1:T ) = P (ht, o1:t, x1:t) P (ot+1:T , xt+1:T | ht, o1:t, x1:t) = P (ht, o1:t, x1:t) P (ot+1:T , xt+1:T | ht, ot) = α(ht)β(ht). We therefore obtain the hidden state marginal P (ht | o1:T , x1:T ) = α(ht)β(ht) P ht α(ht)β(ht), which solves the smoothing problem. Pairwise marginal. We can compute P (ht, ht+1 | o1:T , x1:T ) by noting that P (ht, ht+1 | o1:T , x1:T ) is proportional to P (o1:t, ot+1, ot+2:T , x1:t, xt+1, xt+2:T , ht+1, ht) = P (ot+2:T , xt+2:T | o1:t, ot+1, x1:t, xt+1, ht+1, ht) P (o1:t, ot+1, x1:t, xt+1, ht+1, ht) = P (ot+2:T , xt+2:T | ht+1, ot+1) P (ot+1 | o1:t, ht+1, ht, x1:t, xt+1) P (o1:t, ht+1, ht, x1:t, xt+1) = P (ot+2:T , xt+2:T | ht+1, ot+1) P (ot+1 | ht+1, xt+1) P (ht+1, xt+1 | o1:t, x1:t, ht) P (o1:t, x1:t, ht) = P (ot+2:T , xt+2:T | ht+1, ot+1) P (ot+1 | ht+1, xt+1) P (ht+1, xt+1 | ht, ot) P (o1:t, x1:t, ht) . 
(16) We can rearrange (16) to obtain P (ht, ht+1 | o1:T , x1:T ) α(ht)P (ot+1 | ht+1, xt+1) P (ht+1, xt+1 | ht, ot) β(ht+1). (17) Therefore, P (ht, ht+1 | o1:T , x1:T ) is easy to compute once the forward-backward, i.e. the filtering and smoothing, recursions have been completed. Counterfactual Analysis in Dynamic Latent-State Models Sampling. We would like to sample from the posterior P (h1:T | o1:T , x1:T ). We can do this by first noting that P (h1:T | o1:T , x1:T ) = P (h1 | h2:T , o1:T , x1:T ) . . . P (h T 1 | h T , o1:T , x1:T ) P (h T | o1:T , x1:T ) = P (h1 | h2, o1:T , x1:T ) . . . P (h T 1 | h T , o1:T , x1:T ) P (h T | o1:T , x1:T ) . We can therefore sample sequentially via the following two steps: First, draw h T from P (h T | o1:T , x1:T ), which we know from the smoothed distribution of h T . Second, observe that for any t < T, we have P (ht | ht+1, o1:T , x1:T ) P (ht, ht+1 | o1:T , x1:T ) α(ht)P (ht+1, xt+1 | ht, ot) [by (17)] = α(ht)P (ht+1 | ht, ot) P (xt+1) , from which it is easy to sample. Hence, we can efficiently generate samples [h1:T (b)]B b=1 from the posterior P (h1:T | o1:T , x1:T ). C. A Brief Introduction to Copulas and Counterfactual Simulations Copulas are functions that enable us to separate the marginal distributions from the dependency structure of a given multivariate distribution. They are particularly useful in applications where the marginal distributions are known (either from domain specific knowledge or because there is sufficient marginal data) but a joint distribution with these known marginals is required. In our application in this paper, we know the marginal distribution of each random variable in [Ohx]h,x and [Hhi]h,i, which is dictated by the model primitives (E, Q) as follows: ehxi = P(Ohx = i) and qhih = P(Hhi = h ). Indeed, these marginal distributions can be estimated from data, but the joint distribution must be specified in order to compute counterfactuals. In each of these cases, one needs to work with a joint distribution with fixed or pre-specified marginal distributions. Copulas and Sklar s Theorem (see below) can be very helpful in these situations. We only briefly review some of the main results from the theory of copulas here but Nelsen (2006) can be consulted for an introduction to the topic. Mc Neil et al. (2015) also contains a nice introduction but in the context of financial risk management. Definition 1. A d-dimensional copula, C : [0, 1]d : [0, 1] is a cumulative distribution function with uniform marginals. We write C(u) = C(u1, . . . , ud) for a generic copula. It follows immediately from Definition 1 that C(u1, . . . , ud) is non-decreasing in each argument and that C(1, . . . , 1, ui, 1, . . . , 1) = ui. It is also easy to confirm that C(1, u1, . . . , ud 1) is a (d 1)-dimensional copula and, more generally, that all k-dimensional marginals with 2 k d are copulas. The most important result from the theory of copulas is Sklar s Theorem (Sklar, 1959). Theorem 1 (Sklar 1959). Consider a d-dimensional CDF Π with marginals Π1, ..., Πd. Then, there exists a copula C such that Π(x1, . . . , xd) = C (Π1(x1), . . . , Πd(xd)) (18) for all xi [ , ] and i = 1, . . . , d. If Πi is continuous for all i = 1, . . . , d, then C is unique; otherwise C is uniquely determined only on Ran(Π1) Ran(Πd), where Ran(Πi) denotes the range of the CDF Πi. Conversely, consider a copula C and univariate CDF s Π1, . . . , Πd. Then, Π as defined in (18) is a multivariate CDF with marginals Π1, . . . , Πd. 
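To illustrate Sklar's Theorem in the discrete setting of this paper, the following sketch (our own; the function name `joint_from_copula` and the example marginals are not from the paper) builds a bivariate joint PMF with two prescribed marginals by taking rectangle differences of a copula C on the grid of marginal CDF values, and evaluates it under the independence and comonotonic copulas defined below. The two joints share the same marginals but differ, which is exactly the degree of freedom our optimization ranges over.

```python
import numpy as np

def joint_from_copula(p1, p2, C):
    """Bivariate joint PMF with marginals p1, p2 induced by copula C via Sklar's theorem.

    P(X1 = i, X2 = j) is the rectangle difference of C at the marginal CDF grid:
      C(F1(i), F2(j)) - C(F1(i-1), F2(j)) - C(F1(i), F2(j-1)) + C(F1(i-1), F2(j-1)).
    """
    F1 = np.concatenate(([0.0], np.cumsum(p1)))
    F2 = np.concatenate(([0.0], np.cumsum(p2)))
    G = np.array([[C(u, v) for v in F2] for u in F1])
    return np.diff(np.diff(G, axis=0), axis=1)   # rectangle differences

indep = lambda u, v: u * v         # independence copula
comon = lambda u, v: min(u, v)     # comonotonic copula (Frechet-Hoeffding upper bound)

p1 = np.array([0.2, 0.3, 0.5])
p2 = np.array([0.2, 0.2, 0.6])
print(joint_from_copula(p1, p2, indep))   # outer product of the marginals
print(joint_from_copula(p1, p2, comon))   # mass concentrated on matched quantiles
```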
A particularly important aspect of Sklar s Theorem in the context of this paper is that C is only uniquely determined on Ran(Π1) Ran(Πd). Because we are interested in applications with discrete state-spaces, this implies that there will be many copulas that lead to the same joint distribution Π. It is for this reason that we prefer to work directly with the joint distribution of [Ohx]h,x and [Hhi]h,i (recall (4)). That said, we emphasize that specifying copulas for the exogenous vectors Ut and Vt is equivalent to specifying a particular structural causal model (SCM) in which any CQI can be computed. Counterfactual Analysis in Dynamic Latent-State Models The following important result was derived independently by Fréchet and Hoeffding and provides lower and upper bounds on copulas. Theorem 2 (The Fréchet-Hoeffding Bounds). Consider a copula C(u) = C(u1, . . . , ud). Then, C(u) min{u1, . . . , ud}. Three important copulas are the comonotonic, countermonotonic (only when d = 2) and independence copulas which model extreme positive dependency, extreme negative dependency and (not surprisingly) independence. They are defined as follows. Comonotonic Copula. The comonotonic copula is given by CP(u) := min{u1, . . . , ud}, (19) which coincides with the Fréchet-Hoeffding upper bound. It corresponds to the case of extreme positive dependence. For example, let U = (U1, . . . , Ud) with U1 = U2 = = Ud Unif[0, 1]. Then, clearly min{u1, . . . , ud} = Π(u1, . . . , ud) but by Sklar s Theorem F(u1, . . . , ud) = C(u1, . . . , ud) and so, C(u1, . . . , ud) = min{u1, . . . , ud}. Countermonotonic Copula. The countermonotonic copula is a 2-dimensional copula given by CN(u) := max{u1 + u2 1, 0}, (20) which coincides with the Fréchet-Hoeffding lower bound when d = 2. It corresponds to the case of extreme negative dependence. It is easy to check that (20) is the joint distribution of (U, 1 U) where U Unif[0, 1]. (The Fréchet-Hoeffding lower bound is only tight when d = 2. This is analogous to the fact that while a pairwise correlation can lie anywhere in [ 1, 1], the average pairwise correlation of d random variables is bounded below by 1/(d 1).) Independence Copula. The independence copula satisfies and it is easy to confirm using Sklar s Theorem that random variables are independent if and only if their copula is the independence copula. A well known and important result regarding copulas is that they are invariant under monotonic transformations. Proposition 4 (Invariance Under Monotonic Transformations). Suppose the random variables X1, . . . , Xd have continuous marginals and copula CX. Let Ti : R R, for i = 1, . . . , d be strictly increasing functions. Then, the dependence structure of the random variables Y1 := T1(X1), . . . , Yd := Td(Xd) is also given by the copula CX. This leads immediately to the following result. Proposition 5. Let X1, . . . , Xd be random variables with continuous marginals and suppose Xi = Ti(X1) for i = 2, . . . , d where T2, . . . , Td are strictly increasing transformations. Then, X1, . . . , Xd have the comonotonic copula. Proof. Apply the invariance under monotonic transformations proposition and observe that the copula of (X1, X1, . . . , X1) is the comonotonic copula. Our optimization framework implicitly optimizes over the space of copulas by solving polynomial programs with possibly a large number of variables and constraints. 
(We saw in 4.3 that the number of variables and constraints is polynomial in |H|, Counterfactual Analysis in Dynamic Latent-State Models |O| and |X| when calculating the probability of necessity (PN).) It may also be worthwhile, however, working explicitly with copulas. For example, the independence and comonotonic copulas are well understood and using these copulas to define SCMs may provide interesting benchmarks. Indeed, we estimate the PN for these benchmarks in our numerical results of 5. Towards this end, in C.1 and C.2, we explain how we can simulate our dynamic latent-state model to estimate the CQI under the independence ( C.1) and comonotonic ( C.2) copulas. Specifically, we assume each of the copulas for Ut and Vt are the independence copulas in C.1, whereas in C.2, we assume their copulas are the comonotonic copula. There is no reason, however, why we couldn t combine them and assume, for example, that the copula for Ut was the independence copula and the copula for Vt was the comonotonic copula. More generally, we could use domain-specific knowledge to identify or narrow down sub-components of the copulas and leave the remaining components to be identified via the optimization problems. Since convex combinations of copulas are copulas, we could also optimize over such combinations. For example, suppose domain specific knowledge10 tells us that the copula of Vt is λCN + (1 λ)CI, i.e., a convex combination of the comonotonic and independence copulas, with λ [0, 1] unknown. Then, the optimization over Vt would reduce to a single-variable (λ) optimization with a linear constraint. Of course, the optimization over the copula of Ut must also be included but domain-specific knowledge may also help to simplify and constrain that component of the optimization. Properties such as pathwise monotonicity (PM) and counterfactual stability (CS) can also be expressed in copula terms. Indeed, PM can be expressed via the comonotonic copula, as we discuss in C.2. C.1. Counterfactual Simulations Under the Independence Copula For convenience, we copy Figure 2 from the main text, which is now labelled as Figure 4. Furthermore, recall that (o1:T , x1:T ) is the observed data and ex1:T is the intervention policy that was applied. Figure 4. The SCM underlying the dynamic latent-state model. As in 4, we start with the posterior samples [h1:T (b)]B b=1 corresponding to the random path H1:T | (o1:T , x1:T ). These samples can be generated efficiently (cf. B). For each sample b, our goal is to convert the sampled path h1:T (b) into a counterfactual path eh1:T (b). As noted in 4.2, irrespective of the copula choice, the counterfactual hidden state in period 1 equals the posterior sample, i.e., eh1(b) = h1(b). We next need to sample eh2(b), but that first requires us to sample the counterfactual emission eo1(b) (cf. Figure 4). With the copula underlying V1 being the independence copula, it follows that ( o1 if x1 = ex1 and h1(b) = eh1(b) sample from the emission distribution [eeh1(b)ex1i]i otherwise. The counterfactual emission eo1(b) allows us to sample the counterfactual state eh2(b), which again leverages the fact that the copula underlying U2 is the independence copula: ( h2(b) if h1(b) = eh1(b) and o1 = eo1(b) sample from the transition distribution [qeh1(b)eo1(b)h ]h otherwise. We then generate period 2 counterfactual emission eo2(b) in a similar manner and the process repeats until we hit the end of horizon. We summarize the procedure in Algorithm 2. 
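The procedure just described (and summarized in Algorithm 2 below) translates directly into code. The following NumPy sketch is our own illustrative implementation under assumed conventions (E of shape |H|x|X|x|O|, Q of shape |H|x|O|x|H|, states and emissions 0-indexed); it is not the authors' code.

```python
import numpy as np

def simulate_independence_copula(E, Q, obs, pol, paths, pol_cf, rng=None):
    """Counterfactual paths under the independence copula (cf. Algorithm 2).

    E, Q     : emission (H, X, O) and transition (H, O, H) probabilities
    obs, pol : observed emissions o_{1:T} and observed policy x_{1:T}
    paths    : (B, T) posterior samples of h_{1:T}
    pol_cf   : counterfactual (intervention) policy
    Returns a (B, T) array of counterfactual states.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, T = paths.shape
    cf = np.zeros_like(paths)
    for b in range(B):
        cf[b, 0] = paths[b, 0]                      # period-1 state equals the posterior sample
        for t in range(T - 1):
            # Counterfactual emission for period t.
            if pol_cf[t] == pol[t] and cf[b, t] == paths[b, t]:
                o_cf = obs[t]                       # nothing has diverged yet
            else:
                o_cf = rng.choice(E.shape[2], p=E[cf[b, t], pol_cf[t]])
            # Counterfactual state for period t + 1.
            if cf[b, t] == paths[b, t] and o_cf == obs[t]:
                cf[b, t + 1] = paths[b, t + 1]
            else:
                cf[b, t + 1] = rng.choice(Q.shape[2], p=Q[cf[b, t], o_cf])
    return cf
```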
10It may be more likely that we only have domain specific knowledge over sub-components of the copulas (which are themselves copulas). Counterfactual Analysis in Dynamic Latent-State Models Algorithm 2 Counterfactual simulations under the independence copula Require: (E, Q), (o1:T , x1:T ), [h1:T (b)]B b=1, ex1:T 1: for b = 1 to B do 2: eh1(b) = h1(b) 3: for t = 1 to T 1 do 4: if xt = ext and ht(b) = eht(b) then 5: eot(b) = ot 6: else 7: eot(b) Categorical([eeht(b)exti]i) 8: end if 9: if ht(b) = eht(b) and ot = eot(b) then 10: eht+1(b) = ht+1(b) 11: else 12: eht+1(b) Categorical([qeht(b)eot(b)h ]h ) 13: end if 14: end for 15: end for 16: return [eh1:T (b)]b C.2. Counterfactual Simulations Under the Comonotonic Copula Before the formal description (which involves non-trivial notation), we provide the intuition (which is relatively straightforward). We do so by revisiting Example 1, where we have the causal graph X Y with X {0, 1} (medical treatment) and Y {bad, better, best} (patient outcome). The outcome Yx := Y | (X = x) obeys the following distribution: Y0 {bad, better, best} w.p. {0.2, 0.3, 0.5} and Y1 {bad, better, best} w.p. {0.2, 0.2, 0.6}. The underlying SCM is shown again in Figure 5. Figure 5. SCM for Example 1 with the comonotonic copula and hence, the noise node is a scalar U Unif[0, 1], as opposed to a vector U. The structural equation is Y = f(X, U), which we denote by f X(U), the inverse transform function corresponding to the random variable YX. That is, f0(u) = bad, better, and best if u [0, 0.2], u [0.2, 0.5], and u [0.5, 1], respectively. Similarly, f1(u) = bad, better, and best if u [0, 0.2], u [0.2, 0.4], and u [0.4, 1], respectively. Consider a patient whose outcome Y was better under no treatment (x = 0). Given the prior U Unif[0, 1], we get the posterior U | (Y0 = better) Unif[0.2, 0.5]. Now, suppose we are interested in the understanding the counterfactual outcome under the intervention ex = 1, i.e., the random variable e Y := Y1 | (Y0 = better). Then, given the Unif[0.2, 0.5] belief over U and the functional form of f1( ) (as defined in the caption of Figure 5), we get that the [0.2, 0.4] region of U will map to better and the [0.4, 0.5] to best . Hence, e Y equals better w.p. 2/3 and best w.p. 1/3. This clearly obeys the pathwise monotonicity (PM) intuition we alluded to towards the end of Example 1 ( the counterfactual outcome e Y should not be worse under treatment (ex = 1) than under no treatment (x = 0) ). We now formalize this intuition to our dynamic latent-state model. As a prerequisite to discussing the notion of PM, one needs to define an ordering of the states (set H) and the emissions (set O), e.g., from best to worst . Denote by r H(h) the rank of state h with respect to this ordering and by r O(i) the rank of emission i. Furthermore, let r 1 H (r) and r 1 O (r) denote the inverse functions corresponding to r H(h) and r O(i), respectively. That is, r 1 H (r) returns the state with rank r and r 1 O (r) returns the emission with rank r. Also, for each (h, i) pair, observe that [qhih ]h denotes the transition distribution (which maps to the random variable Hhi). Corresponding to this distribution, define the rank-ordered CDF as follows: h :r H(h ) r H(h ) qhih h . (21a) Counterfactual Analysis in Dynamic Latent-State Models Similarly, for each (h, x) pair, observe that [ehxi]i denotes the emission distribution (which maps to the random variable Ohx). Corresponding to this distribution, define the rank-ordered CDF as follows: j:r O(j) r O(i) ehxj i. 
(21b) Also, define Qhi0 = Ehx0 = 0 for all (h, i) and (h, x). We discuss these orderings for the breast cancer application in E.4. As in C.1, we start with the posterior samples [h1:T (b)]B b=1 corresponding to the random path H1:T | (o1:T , x1:T ). For each sample b, our goal is to convert the sampled path h1:T (b) into a counterfactual path eh1:T (b). As noted in 4.2, irrespective of the copula choice, the counterfactual hidden state in period 1 equals the posterior sample, i.e., eh1(b) = h1(b). To generate eo1(b), we revisit the SCM in Figure 6, which now has the noise nodes as scalars (as opposed to vectors). This is a direct implication of the comonotonic copula - see the statement immediately below (19). Figure 6. The SCM with the comonotonic copula. The key change is that the noise nodes are now scalars as opposed to vectors, i.e., (Ut, Vt) as opposed to (Ut, Vt). By the structural equation (3a), eo1(b) equals eo1(b) = f(eh1(b), ex1, V1) = feh1(b)ex1(V1), (22a) where fhx( ) is the inverse transform function corresponding to the rank-ordered CDF [Ehxi]i (recall (21b)). Hence, all we need to sample eo1(b) is the posterior distribution of V1, where the posterior corresponds to conditioning on O1h1(b)x1 = o1 (recall the notation Othx from 4). Given the prior V1 Unif[0, 1], we can compute the posterior in closed-form. In particular, V1 | (O1h1(b)x1 = o1) Unif[Eh1(b)x1o 1 , Eh1(b)x1o1], (22b) where o := r 1 O (r O(o) 1) is the emission ranked just below o. Hence, we can efficiently sample V1 from its posterior, and this V1 sample can be used to generate eo1(b) (via (22a)). Given we encoded rank orderings in the CDF Ehxi, such sampling will naturally enforce pathwise monotonicity. We can sample eh2(b) similarly. By the structural equation (3b), eh2(b) equals eh2(b) = g(eh1(b), eo1, U2) = geh1(b)eo1(U2), (23a) where ghi( ) is the inverse transform function corresponding to the rank-ordered CDF [Qhih ]h (recall (21a)). Hence, all we need to sample eh2(b) is the posterior distribution of U2, where the posterior corresponds to conditioning on H2h1(b)o1 = h2(b) (recall the notation Hthi from 4). Given the prior U2 Unif[0, 1], we can compute the posterior in closed-form. In particular, U2 | (H2h1(b)o1 = h2(b)) Unif[Qh1(b)o1h2(b) , Qh1(b)o1h2(b)], (23b) where h := r 1 H (r H(h) 1) is the state ranked just below h. Hence, we can efficiently sample U2 from its posterior, and this U2 sample can be used to generate eh2(b) (via (23a)). Given we encoded rank orderings in the CDF Qhih , such sampling will naturally enforce pathwise monotonicity. Counterfactual Analysis in Dynamic Latent-State Models We then generate period 2 counterfactual emission eo2(b) in a similar manner and the process repeats until we hit the end of horizon. We summarize the procedure in Algorithm 3. Algorithm 3 Counterfactual simulations under the comonotonic copula Require: (E, Q), (o1:T , x1:T ), [h1:T (b)]B b=1, ex1:T , r H( ), r O( ) 1: for b = 1 to B do 2: eh1(b) = h1(b) 3: for t = 1 to T 1 do 4: vt Unif[Eht(b)xto t , Eht(b)xtot] % posterior sample of Vt (see (22b)) 5: eot(b) = feht(b)ext(vt) % counterfactual emission (see (22a)) 6: ut+1 Unif[Qht(b)otht+1(b) , Qht(b)otht+1(b)] % posterior sample of Ut+1 (see (23b)) 7: eht+1(b) = geht(b)eot(ut+1) % counterfactual state (see (23a)) 8: end for 9: end for 10: return [eh1:T (b)]b D. Enhancing the Scalability of the Polynomial Optimization In this section, we discuss ways to enhance the scalability of the polynomial optimizations in (8). 
First, in D.1, we show how the optimization can be reformulated to avoid the exponential dependence on T (recall the discussion towards the end of 4.3). Second, in D.2, we discuss an approximate way to optimize our problem that drastically reduces the underlying dimensionality of the problem. Third, in D.3, we combine our ideas from D.1 and D.2 and demonstrate (via numerics) that we can obtain high-quality solutions for T as large as 100 in just a few hours of compute time. Related to scalability, we mention in passing that in each of our optimization problems, we added the constraint that the objective value (which is a probability) must lie in [0, 1]. Of course, this constraint is redundant but we found it helped speed up the solver convergence in a few instances, possibly because it shrunk the search space as the solver does not know a priori that the objective is a probability. D.1. Reformulating the Polynomial Optimization to Avoid the Exponential Dependence on T Recall Lemmas 1 and 2, which characterize the objective function of our polynomial optimization problem. We repeat them here for the sake of convenience. Lemma 1. We have PN = 1 lim B 1 B b=1 Pf M(b)( e HT = 7). Lemma 2. For t {T, T 1, . . . , 2}, Pf M(b)(eht) := Pf M(b)( e Ht = eht) obeys the following recursion (over t): Pf M(b)(eht) = X eht 1,eot 1 πeht 1eot 1,ht 1(b)ot 1(eht, ht(b)) qht 1(b)ot 1ht(b) θeht 1ext 1,ht 1(b)xt 1(eot 1, ot 1) eht 1(b)xt 1ot 1 Pf M(b)(eht 1). The recursion breaks at t = 1: Pf M(b)(eh1) = ( 1 if eh1 = h1(b) 0 otherwise. It is easy to see that a naive expansion of PN (as per Lemmas 1 and 2) results in a number of terms that is exponential in T. This is clearly undesirable since we end up running into memory issues for even a moderate value of T. For example, Counterfactual Analysis in Dynamic Latent-State Models such issues arise for T > 10 in the breast cancer numerics of 5. It is possible to remove this exponential dependence, however, by a reformulation of the optimization, which we now discuss. (Note that the objective function remains the same irrespective of whether we optimize over the pairwise marginals (as discussed in D.2) or the joint distribution (as presented in 4.3) and hence, the reformulation here is universal .) The reformulation steps are as follows: 1. Define Pf M(b)(eht) from Lemma 2 as a decision variable for all (t,eht, b) [T] H [B]. 2. Add the Lemma 2 equations as constraints in the optimization (for each (t,eht, b) [T] H [B]). Note that these are non-linear but polynomial constraints and hence, we remain within the class of polynomial programs. Furthermore, none of the constraints have an exponential number of terms since Pf M(b)(eht) are decision variables now. 3. The objective now is simply the expression in Lemma 1. These steps result in the following11 optimization, where we use the decision variable γt ehtb to denote the probability term12 Pf M(b)(eht) in the LHS of Lemma 2 for all (t,eht, b) [T] H [B], with γ := [γt hb](t,h,b): max (θ,π) F,γ s.t. γt hb = X πh o ,ht 1(b)ot 1(h, ht(b)) qht 1(b)ot 1ht(b) θh ext 1,ht 1(b)xt 1(o , ot 1) eht 1(b)xt 1ot 1 γt 1 h b t > 1 h H b [B] γ1 hb = 1 h = h1(b) b [B] (24c) γ1 hb = 0 h = h1(b) b [B]. (24d) As before (refer to 4), the feasibility set F over (θ, π) can correspond to (10), (11), and (12). It can also include additional constraints such as CS and PM, or correspond to the lower-dimensional space over the pairwise marginals (as discussed in D.2). 
Clearly, (24) has a linear objective and polynomial constraints, and is therefore also a polynomial program. The number of terms in the objective is no longer exponential in T but this has come at the cost of having to add a total of |H|TB decision variables and (polynomial) constraints to the original formulation in (8). Though the size of our reformulation (number of variables and constraints) scales with both T and B, we found it to scale much more gracefully (with respect to T) than the original formulation, as we discuss in D.3 below. Note that we do not necessarily need to add these |H|TB variables and constraints to the optimization but for that, we need the ability to modify the source code of the optimization solver (BARON in our case). This is because even in the original formulation (8), we can actually evaluate the objective function in polynomial time and space rather than naively expanding it into exponentially many terms. To see this, consider a given sample number b [B]. We need to evaluate Pf M(b)( e HT = 7) from Lemma 2. To do so, we start from period 1 and store Pf M(b)(eh1) for all eh1 H (see Lemma 2 s base case). We then move to period 2 and store Pf M(b)(eh2) for all eh2 H (see Lemma 2 s recursion). The key here is that when computing Pf M(b)(eh2), we make use of the stored values of Pf M(b)(eh1). We then move to period 3 evaluations, where we make use of the stored values of Pf M(b)(eh2). We repeat this procedure until we hit period T. Clearly, this procedure requires polynomial time and space. Furthermore, we can evaluate the gradient (and the Hessian) of Pf M(b)( e HT = 7) in a similar manner (if needed by the optimization solver). We can therefore evaluate the objective and its gradient information at a given point in 11Note that we focus on the maximization problem from (8) but the same holds for the minimization counterpart. All we need to do is simply change the max to a min in the objective function (24a). 12To be pedantic, we could have added a t super-script in Pf M(b)(eht) and used the notation Pt f M(b)(eht) instead. However, we did not do so earlier since this dependence on t was implicitly understood to exist, and adding this extra super-script felt unnecessary. Counterfactual Analysis in Dynamic Latent-State Models polynomial time and space. These can then be used by the optimization solver. However, we are unable to modify the solver we use (BARON), and BARON by default does not exploit this structure but naively expands the objective into exp(T) terms. As such, we use the reformulation presented in (24) instead. D.2. Approximating the Joint Optimization by the Pairwise Optimization The problem (8) discussed in 4.3 optimizes over the joint PMFs ( joint optimization ). The challenge here lies in the dimensionality of the underlying joint distribution. As discussed towards the end of 4.3, the problem size (number of decision variables in particular) can grow exponentially in the primitives (e.g., |H|, |X|, and |O|). This is because the decision variables capture the entire joint distribution. Though we might be able to exploit application-specific sparsity to manage this blow-up (as we in fact do for the breast cancer application), it is worth exploring if there is a more tractable alternative in general (i.e., not specific to any application). We now show that this is possible. 
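Although BARON forces us to use the reformulation (24), the polynomial-time forward evaluation described at the end of D.1 is straightforward to implement outside the solver. The sketch below is our own illustration under assumed conventions: `theta(hq, xq, h, x)` and `pi(hq, iq, h, i)` are hypothetical callables returning the pairwise joint PMFs of (O~_{hq,xq}, O_{h,x}) and (H~_{hq,iq}, H_{h,i}) as arrays; dividing by the known marginals e and q turns them into the conditional terms of Lemma 2. It evaluates the objective contribution of one posterior sample by storing one |H|-vector per period rather than expanding exponentially many terms.

```python
import numpy as np

def prob_cf_state_T(theta, pi, E, Q, obs, pol, pol_cf, path, target):
    """Forward evaluation of P_{M~(b)}(H~_T = target) for one posterior path.

    theta(hq, xq, h, x) -> (O, O) pairwise joint of (O~_{hq,xq}, O_{h,x})
    pi(hq, iq, h, i)    -> (H, H) pairwise joint of (H~_{hq,iq}, H_{h,i})
    E, Q                : emission (H, X, O) and transition (H, O, H) primitives
    obs, pol, pol_cf    : observed emissions/policy and counterfactual policy
    path                : one posterior sample h_{1:T}(b)
    """
    H, _, O = E.shape
    T = len(obs)
    prob = np.zeros(H)
    prob[path[0]] = 1.0                     # base case: h~_1 = h_1(b)
    for t in range(1, T):
        h_prev, h_cur = path[t - 1], path[t]
        o_prev, x_prev, xcf_prev = obs[t - 1], pol[t - 1], pol_cf[t - 1]
        new = np.zeros(H)
        for hq in range(H):                 # candidate h~_{t-1}
            if prob[hq] == 0.0:
                continue
            for oq in range(O):             # candidate o~_{t-1}
                # e_{h_{t-1}(b) x_{t-1} o_{t-1}} > 0 since o_{t-1} was observed.
                w_o = theta(hq, xcf_prev, h_prev, x_prev)[oq, o_prev] / E[h_prev, x_prev, o_prev]
                if w_o == 0.0:
                    continue
                # q_{h_{t-1}(b) o_{t-1} h_t(b)} > 0 along any posterior path.
                w_h = pi(hq, oq, h_prev, o_prev)[:, h_cur] / Q[h_prev, o_prev, h_cur]
                new += prob[hq] * w_o * w_h
        prob = new
    return prob[target]
```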
Recall from 4.3 that we are interested in the following optimizations (repeating (8) for convenience):

$\mathrm{PN}^{\mathrm{ub}}(B) := \max_{(\theta,\pi) \in \mathcal{F}} \mathrm{PN}(\theta, \pi \mid [h_{1:T}(b)]_b)$   (8a)
$\mathrm{PN}^{\mathrm{lb}}(B) := \min_{(\theta,\pi) \in \mathcal{F}} \mathrm{PN}(\theta, \pi \mid [h_{1:T}(b)]_b).$   (8b)

The key observation here is that the objective function does not depend on the joint PMF of $(\theta, \pi)$ but only on the corresponding pairwise marginals (recall Lemmas 1 and 2). We introduced the joint PMF decision variables to ensure the feasibility set $\mathcal{F}$ is such that the pairwise marginals are valid. However, as an alternative, we can choose to not introduce the joint variables in the optimization and instead approximate $\mathcal{F}$ by expressing it in terms of the pairwise variables. For example, since the pairwise variables correspond to 2-dimensional PMFs, they must obey basic probability axioms. In particular, they must be non-negative and agree with their known 1-dimensional marginals, so that

$\sum_{h'} \pi_{\tilde h \tilde i, hi}(\tilde h', h') = q_{\tilde h \tilde i \tilde h'} \quad \forall (\tilde h, \tilde i, \tilde h'),\ (h, i)$   (25a)
$\sum_{\tilde h'} \pi_{\tilde h \tilde i, hi}(\tilde h', h') = q_{hih'} \quad \forall (\tilde h, \tilde i),\ (h, i, h')$   (25b)
$\sum_{i} \theta_{\tilde h \tilde x, hx}(\tilde i, i) = e_{\tilde h \tilde x \tilde i} \quad \forall (\tilde h, \tilde x, \tilde i),\ (h, x)$   (25c)
$\sum_{\tilde i} \theta_{\tilde h \tilde x, hx}(\tilde i, i) = e_{hxi} \quad \forall (\tilde h, \tilde x),\ (h, x, i).$   (25d)

These constraints are analogous to (10) in 4.3. It is easy to see that if (10) is obeyed, then so is (25). However, the reverse implication does not hold, meaning the feasibility space defined by (25) and non-negativity (say $\mathcal{F}'$) is a super-set of the feasibility space $\mathcal{F}$ in 4.3. In other words, though the constraints in $\mathcal{F}'$ are necessary, they are not sufficient to ensure the pairwise marginals correspond to a valid joint distribution. Hence, optimizing over $\mathcal{F}'$ ("pairwise optimization")13 is a relaxation of the problem of optimizing over $\mathcal{F}$. In fact, as we show via a simple example next, this relaxation can be strict. (We thank an anonymous reviewer for this example.)

Example 2. Consider three random variables X, Y and Z with the following pairwise marginals:

$(X, Y) = \begin{cases} (0, 0) & \text{w.p. } 1/2 \\ (1, 1) & \text{w.p. } 1/2 \end{cases} \qquad (Y, Z) = \begin{cases} (0, 0) & \text{w.p. } 1/2 \\ (1, 1) & \text{w.p. } 1/2 \end{cases} \qquad (X, Z) = \begin{cases} (1, 0) & \text{w.p. } 1/2 \\ (0, 1) & \text{w.p. } 1/2. \end{cases}$

It is easy to verify these pairwise marginals obey (25) along with non-negativity. However, they do not correspond to any valid joint distribution over (X, Y, Z). To see this, suppose (X, Y) realizes a value of (0, 0). Then, the pairwise marginal of (Y, Z) implies (Y, Z) has to be (0, 0), which in turn implies (X, Z) must be (1, 0) and hence X = 1, contradicting X = 0. Therefore, the bivariate marginals are not consistent with any valid 3-dimensional joint distribution.

Despite the relaxation being strict14, we found it to produce high-quality solutions and be highly scalable (discussed in D.3).

13Note that the pairwise optimization is identical to the joint optimization (8) but with the following two differences: (a) $\mathcal{F}$ replaced by $\mathcal{F}'$ and (b) joint decision variables not defined.
14Since the pairwise optimization is a relaxation of the joint optimization, it follows that Proposition 2 still holds for the bounds produced by the pairwise optimization.

The high scalability is primarily driven by the lower dimensionality of the decision variables. In particular, the pairwise optimization has at most $|H|^4|O|^2 + |H|^2|O|^2|X|^2$ decision variables (recall that the joint optimization has an additional $|O|^{|H||X|} + |H|^{|H||O|}$ decision variables). In fact, after exploiting the sparsity in the breast cancer application (along with the variable elimination discussed in Footnote 5), the pairwise optimization has only 1082 decision variables.
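Example 2 can also be verified mechanically. The following SciPy-based linear-programming feasibility check is our own construction (not from the paper): it searches for a joint PMF over (X, Y, Z) that matches the three pairwise marginals and reports that no such joint exists.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Pairwise marginals of Example 2 on {0, 1}; unlisted pairs have probability 0.
pxy = {(0, 0): 0.5, (1, 1): 0.5}
pyz = {(0, 0): 0.5, (1, 1): 0.5}
pxz = {(1, 0): 0.5, (0, 1): 0.5}

atoms = list(product([0, 1], repeat=3))          # all joint outcomes (x, y, z)
A_eq, b_eq = [], []
for pm, idx in [(pxy, (0, 1)), (pyz, (1, 2)), (pxz, (0, 2))]:
    for pair in product([0, 1], repeat=2):
        A_eq.append([1.0 if (a[idx[0]], a[idx[1]]) == pair else 0.0 for a in atoms])
        b_eq.append(pm.get(pair, 0.0))
A_eq.append([1.0] * len(atoms))                  # probabilities sum to one
b_eq.append(1.0)

res = linprog(c=np.zeros(len(atoms)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * len(atoms), method="highs")
print("a consistent joint distribution exists:", res.success)   # prints False
```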
This is in contrast to the joint optimization which has 16,124 decision variables. Though the pairwise optimization has more constraints than the joint optimization, the difference is not that stark (2085 vs. 610). (The numbers reported here correspond to the objective formulation presented in 4.3 as opposed to the reformulation in D.1. The reformulation adds a total of |H|TB decision variables and |H|TB constraints to both the pairwise and the joint optimizations.)

D.3. Computational Performance

We now compute upper and lower bounds on PN by (a) using the reformulation discussed in D.1 and (b) optimizing over the relaxed constraint set defined by the pairwise marginals as discussed in D.2.15 We focus on path 1 from 5 for brevity and note that the results for path 2 are similar. The implementation details remain the same as in 5 (i.e., we code in MATLAB-BARON with CPLEX as the LP / MIP solver, set the absolute termination tolerance at 0.01, generate B = 100 samples for SAA, and average over 20 seeds). To test the scalability of our approach, we now experiment with T in {5, 10, 15, 20, 25, 50, 75, 100}. Note that T = 100 is an order of magnitude larger than the longest horizon we have in 5, i.e., T = 10. We next discuss the results that are shown in Figure 7, with Figure 7(a) showcasing scalability and Figure 7(b) quality.

Figure 7. Evaluating the computational performance of our ideas in D.1 and D.2 on path 1 from 5. Panel (a) shows the compute time in hours (mean +/- s.d.) as a function of T; panel (b) shows the PN bounds as a function of T (UB and LB for the joint and pairwise optimizations, together with the independence and comonotonic benchmarks). All results are averaged over the 20 seeds we use. In sub-figure (b), the UB and LB curves for the joint optimization (black color) are the same as the ones in Figure 3(a), and only go as far as T = 10 (since we run into memory issues for T > 10). Furthermore, to avoid clutter, we do not show the standard deviation bars in sub-figure (b) and note that the maximum standard deviation value is less than 0.01. To be clear, the compute times in sub-figure (a) correspond to the blue curves in sub-figure (b) ("pairwise"), which are the focus of this section.

In Figure 7(a), we display the compute time as a function of T. Compute time refers to the total time taken to compute LB and UB. Note that we let the solver run until convergence to global optimality (on just one core with at most 16 GB RAM). We are able to solve for T = 100 in 3 hours on average (over 20 seeds), with the minimum time being 1.2 hours and the maximum time being 8.4 hours. This demonstrates the scalability of our approach. (It is worth mentioning that after eliminating redundant variables and constraints, exploiting sparsity, using the reformulation of D.1, and the pairwise approximation of D.2, the T = 100 and B = 100 optimization has 71,082 decision variables, 2085 linear constraints, and 70,000 polynomial constraints.)

15We also experimented with other variations of these two approaches. If we use neither of them (as is the case in 5), then we run into memory issues for T > 10. In fact, even if we only use the second approach (optimizing over the relaxed constraint set), then we run into memory issues for T > 10 since the objective still scales exponentially in T. Finally, if we only use the first approach, i.e. the reformulation of D.1, and optimize over the joint, the BARON solver does not converge even for T as small as 5 in 24 hours of compute time.
This is because keeping the joint variables while doing the reformulation results in an optimization with a very large number of variables and constraints (even after we exploit sparsity).

In Figure 7(b), we display the PN values as a function of T. The joint UB and LB curves are the same as the ones in Figure 3(a), and only go as far as T = 10 because of the aforementioned memory issues for T > 10. The pairwise UB and LB curves (blue) are the focus of this section, and we evaluate the quality of the pairwise bounds in two ways. First, as we are able to solve the joint optimization for T <= 10, we can use the joint bounds as benchmarks for the pairwise bounds. As may be seen from the figure, the joint and pairwise bounds are very close to each other (for values of T <= 10). In particular, the joint and pairwise lower bounds coincide and equal 0.8725 and 0.8884 for T = 5 and T = 10, respectively. The pairwise and joint upper bounds also coincide and equal 0.9885 for T = 5. The only difference between the two is the upper bound for T = 10, with values of 0.9904 and 0.9899, respectively. We therefore conclude that the pairwise bounds provide a very good approximation to the joint bounds, at least when T <= 10. Second, for T > 10, we use the fact that we can simulate the independence and comonotonic copulas, which by definition are feasible solutions to the joint optimization. We therefore know that maximizing (minimizing) over the joint distribution will yield an upper (lower) bound that is no lower (higher) than the independence (comonotonic) curves in the figure. As an example, the gap between the independence and the pairwise UB curves (for T > 10) is never greater than 0.01, which means we lose at most 0.01 by restricting ourselves to the pairwise marginals. Similarly, the maximum gap between the comonotonic and the pairwise LB curves is 0.02. Thus, the pairwise bounds provide a high-quality approximation to the joint bounds even when T > 10.

We can also embed CS and PM constraints in the pairwise optimization (recall from 4.3 that both these constraints are over the pairwise variables) and we show the corresponding bounds in Figure 8. Naturally, the bounds we obtain are tighter than the pairwise bounds in Figure 7(b). In particular, the UB gets much tighter while the LB does not change much.

Figure 8. PN bounds obtained when we embed CS (panel (a)) and PM (panel (b)) constraints in the pairwise optimization, plotted as a function of T. All results are computed as the average of bounds obtained from 20 different seeds with each seed being used to generate B = 100 paths. Furthermore, to avoid clutter, we do not show the standard deviation bars and note that the maximum standard deviation value is less than 0.01.

E. Further Details on the Breast Cancer Case Study

We discuss the breast cancer model primitives and their calibration in E.1, followed by showing how we exploit sparsity to reduce the number of decision variables (E.2). We then provide details on the PM constraints and the comonotonic copula in E.3 and E.4, respectively. Finally, in E.5, we show the results for path 2.

E.1. Model Primitives and Their Calibration

As discussed in 2, the breast cancer application has |H| = 7 states, |O| = 7 emissions, and |X| = 2 actions.
To be consistent with the literature (Ayer et al., 2012), we treat each period as corresponding to 6 months. The model comprises three primitives: p, Q, and E. We discuss their (sparse) structure and the calibration to real data in E.1.1, E.1.2, and E.1.3, respectively.

E.1.1. INITIAL STATE DISTRIBUTION p

We have $p := (p_1, \ldots, p_7)$ where $p_h := P(H_1 = h)$ for all h. Usually, breast cancer screening starts around the age of 40 and the prevalence among females aged 40-49 is 1.0183% (Table 4.24 of NIH (2020), all races, females): $p_2 + p_3 = 0.010183$. Since in-situ cancer comprises 20% of new breast cancer diagnoses (Sprague & Trentham-Dietz, 2009), we get $p_2 = 0.2 \times 0.010183$ and $p_3 = 0.8 \times 0.010183$. It is natural to set $p_4 = p_5 = p_6 = p_7 = 0$ and hence, $p_1 = 1 - p_2 - p_3 = 1 - 0.010183$.

E.1.2. TRANSITION DISTRIBUTION Q

We have $Q := [q_{hih'}]_{h,i,h'}$ with $q_{hih'} := P(H_{t+1} = h' \mid H_t = h, O_t = i)$. Before discussing the calibration, we discuss the sparse structure of Q. To do so, we define the transition matrix $Q(i) := [q_{hih'}]_{hh'}$ for each emission i (so that each row sums to 1) and observe the following sparse structure, listing only the rows corresponding to valid (h, i) combinations (all omitted entries equal 0):

$Q(1)$ and $Q(2)$: row 1 $= (q_{11}, q_{12}, q_{13})$, row 2 $= (q_{22}, q_{23}, q_{27})$, row 3 $= (q_{33}, q_{37})$;
$Q(3)$: row 1 $= (q_{11}, q_{12}, q_{13})$;
$Q(4)$: row 2 $= (q_{24}, q_{25}, q_{26}, \bar q_{27})$, row 4 $= (q_{44}, q_{45}, q_{46}, q_{47})$;
$Q(5)$: row 3 $= (q_{35}, q_{36}, \bar q_{37})$, row 5 $= (q_{55}, q_{56}, q_{57})$;
$Q(6)$: row 6 has the single entry $q_{66} = 1$;
$Q(7)$: row 7 has the single entry $q_{77} = 1$.

A few comments are in order. First, an empty row means it is an impossible (h, i) combination. For example, if we observe an emission i = 3 (i.e., a negative biopsy), then the underlying patient state has to be healthy, i.e., $h \notin \{2, 3, 4, 5, 6, 7\}$. Thus, rows 2 to 7 are empty in $Q(3)$. Second, observe that there is a decent amount of overlap across $[Q(i)]_i$ in terms of the underlying parameters. For example, $q_{11}$ corresponds to the probability a healthy patient stays healthy, which is independent of the emission being 1 (no test), 2 (negative test), or 3 (positive test but negative biopsy). Hence, $q_{11}$ appears in all three matrices $Q(1)$, $Q(2)$, and $Q(3)$. Of course, if the emission is 4, 5, 6, or 7, then the patient cannot be healthy and hence, the corresponding entry in matrices $Q(4)$, $Q(5)$, $Q(6)$, and $Q(7)$ is absent (in fact, the entire first row is empty, which means it is an impossible (h, i) combination as discussed above). Third, some rows have only a partial set of entries, which means that the other entries equal 0. For example, if a patient is healthy (state 1), then her state cannot transition to 4 (diagnosed in-situ with treatment started), 5 (diagnosed invasive with treatment started), 6 (recovered), or 7 (death) and hence, $q_{14} = q_{15} = q_{16} = q_{17} = 0$. Hence, we do not show $q_{14}, q_{15}, q_{16}, q_{17}$ in $Q(1)$, $Q(2)$, or $Q(3)$. Fourth, observe that we have a bar over $q_{27}$ (in $Q(4)$) and $q_{37}$ (in $Q(5)$). This is done to recognize that they differ from $q_{27}$ (in $Q(1)$ and $Q(2)$) and $q_{37}$ (in $Q(1)$ and $Q(2)$), respectively. To see the difference, consider $q_{37}$ versus $\bar q_{37}$: $q_{37}$ corresponds to the patient state transitioning from invasive cancer to death when the cancer was not detected (and hence, no treatment), whereas $\bar q_{37}$ corresponds to the same transition when the cancer was detected (and hence, treatment was provided). Naturally, we expect $\bar q_{37} \le q_{37}$. Finally, since states 6 (recovery) and 7 (death) are absorbing, we have $q_{66} = q_{77} = 1$.

Having discussed the structure of Q, we now calibrate it to real data. We iterate over each state $h \in \{1, \ldots, 7\}$ in a sequential manner.

State 1 (healthy).
For state 1, we are interested in $(q_{11}, q_{12}, q_{13})$. These probabilities can depend on a woman's age but we ignore that and work with averages. Let us focus on $(q_{12}, q_{13})$ since $q_{11} = 1 - q_{12} - q_{13}$. For $q_{12}$, we use the in-situ incidence rates from Table 4.12 of NIH (2020) (all races, females). For $q_{13}$, we use the invasive incidence rates from Table 4.11 of NIH (2020) (all races, females). The reported rates are per year, so we divide by 2 to convert them to our 6-month periods (e.g., an annual incidence of 33.0 per 100,000 becomes $\frac{33.0}{2 \times 100000}$ per period). Note that consistent with the 20-80 split in $(p_2, p_3)$, we have $q_{13} \approx 4\, q_{12}$.

State 2 (undiagnosed in-situ cancer). We are interested in $(q_{22}, q_{23}, q_{27})$ (if cancer is not detected) and $(q_{24}, q_{25}, q_{26}, \bar q_{27})$ (if cancer is detected). First, consider $(q_{22}, q_{23}, q_{27})$. Table 4.13 of NIH (2020) and Page 26 of UWBCS (2013) imply there is no death from in-situ cancer: $q_{27} = 0$. Haugh & Lacedelli (2019) assumed $q_{23}$ to equal the invasive incidence rate $q_{13}$ and so do we: $q_{23} = q_{13}$ and $q_{22} = 1 - q_{23} - q_{27} = 1 - q_{13}$. Second, consider $(q_{24}, q_{25}, q_{26}, \bar q_{27})$. As $q_{27} = 0$ and we expect $\bar q_{27} \le q_{27}$ (recall comment #4 above), we set $\bar q_{27} = 0$. As all in-situ cancer patients survive (if treated), no one transitions to invasive (if in-situ is detected): $q_{25} = 0$. Finally, we have $q_{24} + q_{26} = 1$. The split between $q_{24}$ and $q_{26}$ is irrelevant in terms of the patient dying or not (all will survive as there is no positive-probability path from state 4 to death; this will become clear when we discuss state 4 below).

State 3 (undiagnosed invasive cancer). We are interested in $(q_{33}, q_{37})$ (if cancer is not detected) and $(q_{35}, q_{36}, \bar q_{37})$ (if cancer is detected). First, consider $(q_{33}, q_{37})$. $q_{37}$ is the probability of dying from invasive breast cancer (under no treatment). According to Johnstone et al. (2000), the 5-year and 10-year survival rates for invasive breast cancer patients (under no treatment) are 18.4% and 3.6%, respectively. Calibrating to the 5-year rate gives $(1 - q_{37})^{10} = 0.184$, which implies $q_{37} \approx 15.6\%$ (note that we use 10 in the exponent since our time periods correspond to 6 months and hence 5 years correspond to 10 periods). Similarly, calibrating to the 10-year rate gives $(1 - q_{37})^{20} = 0.036$, which implies $q_{37} \approx 15.3\%$. The two calibrations are consistent with each other (lending evidence to time-invariance). Minimizing the sum of squared errors over the two data points, i.e., $\min_{q_{37} \in [0,1]} \{((1 - q_{37})^{10} - 0.184)^2 + ((1 - q_{37})^{20} - 0.036)^2\}$, gives the estimate $q_{37} = 0.1554$. Naturally, we have $q_{33} = 1 - q_{37}$. Second, consider $(q_{35}, q_{36}, \bar q_{37})$. $q_{36}$ and $\bar q_{37}$ are the probabilities of recovering and dying from invasive breast cancer (under treatment). Table 4.14 of NIH (2020) has various survival rates we can use to calibrate. We calibrate using the 10 data points corresponding to the year 2007 (see Figure 9): $q_{36} = 0.0459$ and $\bar q_{37} = 0.0113$. As a sanity check, note that $\bar q_{37} < q_{37}$. Finally, $q_{35} = 1 - q_{36} - \bar q_{37}$.

Figure 9. Calibration of $(q_{36}, \bar q_{37})$. The plot shows the survival rate against the survival time in years, with the raw data as points and our fitted prediction as a curve. Under our (time-invariant) Markov model, with a starting state of invasive breast cancer (under treatment), the survival rate after x years equals $1 - \bar q_{37} \sum_{i=0}^{2x-1} (1 - q_{36} - \bar q_{37})^i$. Minimizing the sum of squared errors (on the 10 data points in the plot) over $(q_{36}, \bar q_{37})$ gives us an estimate of (0.0459, 0.0113).

State 4 (diagnosed in-situ cancer). We are interested in $(q_{44}, q_{45}, q_{46}, q_{47})$.
Under our Markov model (which by definition is memoryless ), it seems reasonable to set (q44, q45, q46, q47) = (q24, q25, q26, q27). State 5 (diagnoised invasive cancer). We are interested in (q55, q56, q57). Under our Markov model, it seems reasonable to set (q55, q56, q57) = (q35, q36, q37). States 6 (recovery) and 7 (death). These two states are absorbing and hence, q66 = q77 = 1. Counterfactual Analysis in Dynamic Latent-State Models E.1.3. EMISSION DISTRIBUTION E We have E := [ehxi]h,x,i with ehxi := P(Ot = i | Ht = h, Xt = x). Before discussing the calibration, we discuss the sparse structure of E. To do so, we define the matrix E(x) := [ehxi]hi for each action x (so that each row sums to 1) and observe that we have the following structure: 0 e12 1 e12 0 1 e24 e24 0 1 e35 e35 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 A few comments are in order. First, for x = 1 (no mammogram screening), the emission matrix E(1) is extremely sparse with entries in {0, 1}. For instance, when hidden state equals 1 (healthy), 2 (undiagnosed in-situ), or 3 (undiagnosed invasive), we observe no signal (emission equals 1) w.p. 1. When hidden state equals 4, 5, 6, or 7, we naturally observe the same emission w.p. 1. Second, for x = 0 (screening), the emission matrix E(0) is quite sparse as well. If the patient is healthy (row 1), then the test result is negative (true negative) w.p. e12 and positive (false positive) w.p. 1 e12. When the patient has undiagnosed in-situ cancer (row 1), it is detected (true positive) w.p. e24 and missed (false negative) w.p. 1 e24. The parameter e35 has the same interpretation as e24 but for invasive cancer. As for x = 1, when hidden state equals 4, 5, 6, or 7, we observe the same emission w.p. 1. Having discussed the structure of E, we now calibrate it to real-data. There are 3 parameters: e12, e24, and e35. All of them can be age specific but we ignore that. e12 is the specificity of the mammogram screening (i.e., probability of a true negative) and we calibrate it using Table 3 of Ayer et al. (2012): e24 is the in-situ sensitivity (i.e., probability of a true positive) and we calibrate it using Table 3 of Ayer et al. (2012): Finally, e35 is the invasive sensitivity and following Ayer et al. (2012), we set E.2. Reducing the Number of Joint Decision Variables by Exploiting Sparsity Recall from 4.3 the following setup, which we repeat for convenience. Let k (h, x) and m (h, i) so that Ok Ohx, eki ehxi Hm Hhi, qmh qhih . We have k [K] and m [M], where K := |H||X| and M := |H||O|. The K and M dimensional joint PMFs for all i1, . . . , i K O and h1, . . . , h M H are defined as θ1,...,K(i1, . . . , i K) := P(O1 = i1, . . . , OK = i K) (9a) π1,...,M(h1, . . . , h M) := P(H1 = h1, . . . , HM = h M). (9b) As discussed towards the end of 4.3, it follows from (9) that we have at most |O||H||X| + |H||H||O| joint variables. We now show that these are merely upper bounds and we can exploit the sparsity inherent in the underlying application to drastically reduce these numbers. Consider the θ1,...,K decision variables for now. Since θ1,...,K represents the joint PMF of the random variables [Ok]k where k (h, x), we first understand which (h, x) pairs are valid (as opposed to naively considering all (h, x) H X). Recall that the state h has the following encoding: Counterfactual Analysis in Dynamic Latent-State Models 2. undiagnosed in-situ cancer 3. undiagnosed invasive cancer 4. diagnosed in-situ cancer 5. diagnosed invasive cancer 6. 
recovery
7. death

Furthermore, x equals 0 maps to a mammogram being performed and 1 to it not being performed. It is easy to see that all 14 combinations of $(h, x) \in \mathcal{H} \times \mathcal{X}$ are valid, so none of the corresponding decision variables can be set to zero (and therefore removed). Turning now to the observations, we recall that they are encoded as follows:

1. no screening took place
2. negative screening result (possibly a false negative)
3. positive mammogram result, but followed by a negative biopsy
4. diagnosed in-situ cancer
5. diagnosed invasive cancer
6. recovery
7. death

Given this, Table 1 documents the range of all $7 \times 2 = 14$ random variables $[O_{hx}]_{h,x}$.

Table 1. Range of the 14 random variables $[O_{hx}]_{h,x}$ corresponding to $\theta_{1,\ldots,K}$.

State h | Policy x | Range of O_hx | Range cardinality
   1    |    0     |    {2, 3}     |        2
   1    |    1     |    {1}        |        1
   2    |    0     |    {2, 4}     |        2
   2    |    1     |    {1}        |        1
   3    |    0     |    {2, 5}     |        2
   3    |    1     |    {1}        |        1
   4    |    0     |    {4}        |        1
   4    |    1     |    {4}        |        1
   5    |    0     |    {5}        |        1
   5    |    1     |    {5}        |        1
   6    |    0     |    {6}        |        1
   6    |    1     |    {6}        |        1
   7    |    0     |    {7}        |        1
   7    |    1     |    {7}        |        1

Multiplying all of the 14 cardinalities (last column in Table 1) implies that there are only eight $\theta_{1,\ldots,K}$ decision variables that need to be considered. This is in contrast to the upper bound of $|O|^{|H||X|} = 7^{14}$.

The same logic applies to the $\pi_{1,\ldots,M}$ decision variables. In fact, for the $\pi_{1,\ldots,M}$ decision variables, even the first step proves useful since not all (h, i) pairs are valid. For instance, if h = 1, then $i \notin \{4, 5, 6, 7\}$. In particular, the first step allows us to trim down the number of $[H_{hi}]_{h,i}$ random variables from $|H||O| = 49$ to 13. The second step trims down the range of each of the 13 random variables. We document this in Table 2 and are able to reduce the number of $\pi_{1,\ldots,M}$ decision variables from $|H|^{|H||O|} = 7^{49}$ to 15,552 (which equals the product of the cardinalities presented in the last column).

Table 2. Range of the 13 random variables $[H_{hi}]_{h,i}$ corresponding to $\pi_{1,\ldots,M}$. Only 13 (h, i) pairs are shown as the other 36 are not valid.

State h | Observation i | Range of H_hi | Range cardinality
   1    |      1        |  {1, 2, 3}    |        3
   1    |      2        |  {1, 2, 3}    |        3
   1    |      3        |  {1, 2, 3}    |        3
   2    |      1        |  {2, 3}       |        2
   2    |      2        |  {2, 3}       |        2
   2    |      4        |  {4, 6}       |        2
   3    |      1        |  {3, 7}       |        2
   3    |      2        |  {3, 7}       |        2
   3    |      5        |  {5, 6, 7}    |        3
   4    |      4        |  {4, 6}       |        2
   5    |      5        |  {5, 6, 7}    |        3
   6    |      6        |  {6}          |        1
   7    |      7        |  {7}          |        1

E.3. Details on the Pathwise Monotonicity (PM) Constraints

PM can be enforced via linear constraints. We briefly discussed this in 4.3 and now discuss all underlying PM constraints we embedded in our breast cancer numerics. Recalling our 4.3 discussion for convenience, suppose the patient has in-situ cancer in period t which is not detected but the patient's state remains at in-situ in period t + 1. Then, in the counterfactual world, if the cancer is detected in period t, PM would require that the cancer cannot be worse than in-situ in period t + 1, i.e.,

$P(H_{\tilde h \tilde i} = \tilde h' \mid H_{hi} = h') = 0$ for $h = 2$, $i \in \{1, 2\}$, $h' = 2$, $\tilde h \in \{2, 4\}$, $\tilde i = 4$, $\tilde h' \in \{5, 7\}$.

There can be multiple such cases to consider and we can enforce all the PM constraints by setting the corresponding $\pi_{\tilde h \tilde i, hi}(\tilde h', h')$ variables equal to 0, since $\pi_{\tilde h \tilde i, hi}(\tilde h', h') = P(H_{hi} = h')\, P(H_{\tilde h \tilde i} = \tilde h' \mid H_{hi} = h')$. Hence, to provide details on which PM constraints we enforce, it suffices to enumerate the $(h, i, h', \tilde h, \tilde i, \tilde h')$ combinations for which we set the $\pi_{\tilde h \tilde i, hi}(\tilde h', h')$ variables equal to 0. To do so, we iterate over each state $h \in \{1, \ldots, 7\}$. (Note that for PM, there are no $(h, x, i, \tilde h, \tilde x, \tilde i)$ combinations for which we set the $\theta_{\tilde h \tilde x, hx}(\tilde i, i)$ variables equal to 0.)

State h = 1 (healthy).
We enforce PM for the following combinations: If (h, i, h ) equals (healthy, whatever emission, healthy), then the counterfactual state eh can not be in-situ, invasive, or death if eh is healthy. That is, h = 1, i O, h = 1, eh = 1,ei O, and eh {2, 3, 4, 5, 7}. If (h, i, h ) equals (healthy, whatever emission, in-situ), then the counterfactual state eh can not be healthy, invasive, or death if eh is healthy. That is, h = 1, i O, h = 2, eh = 1,ei O, and eh {1, 3, 5, 7}. If (h, i, h ) equals (healthy, whatever emission, invasive), then the counterfactual state eh can not be healthy, in-situ, or death if eh is healthy. That is, h = 1, i O, h = 3, eh = 1,ei O, and eh {1, 2, 4, 7}. If (h, i, h ) equals (healthy, whatever emission, death), then the counterfactual state eh can not be healthy, in-situ, or invasive if eh is healthy. That is, h = 1, i O, h = 7, eh = 1,ei O, and eh {1, 2, 3, 4, 5}. State h = 2 (undiagnosed in-situ). We enforce PM for the following combinations: If (h, i, h ) equals (in-situ, undetected, in-situ), then the counterfactual state eh can not be invasive or death if eh is healthy or in-situ. That is, h = 2, i {1, 2, 3}, h = 2, eh {1, 2, 4},ei O, and eh {3, 5, 7}. Counterfactual Analysis in Dynamic Latent-State Models If (h, i, h ) equals (in-situ, detected, in-situ), then the counterfactual state eh can not be invasive or death if eh is in-situ and detected. That is, h = 2, i = 4, h = 4, eh {2, 4},ei = 4, and eh {3, 5, 7}. If (h, i, h ) equals (in-situ, undetected, invasive), then the counterfactual state eh can not be death if eh is healthy or in-situ. That is, h = 2, i {1, 2, 3}, h = 3, eh {1, 2, 4},ei O, and eh = 7. If (h, i, h ) equals (in-situ, detected, invasive), then the counterfactual state eh can not be death if eh is in-situ and detected. That is, h = 2, i = 4, h = 5, eh {2, 4},ei = 4, and eh = 7. If (h, i, h ) equals (in-situ, detected, recovered), then the counterfactual state eh can not be in-situ, invasive, or death if eh is in-situ and detected. That is, h = 2, i = 4, h = 6, eh {2, 4},ei = 4, and eh {2, 3, 4, 5, 7}. If (h, i, h ) equals (in-situ, undetected, death), then the counterfactual state eh can not be in-situ, invasive, or recovered if eh is in-situ and undetected. That is, h = 2, i {1, 2, 3}, h = 7, eh = 2,ei {1, 2, 3}, and eh {2, 3, 4, 5, 6}. If (h, i, h ) equals (in-situ, detected, death), then the counterfactual state eh can not be in-situ, invasive, or recovered if eh is in-situ and detected. That is, h = 2, i = 4, h = 7, eh {2, 4},ei = 4, and eh {2, 3, 4, 5, 6}. State h = 3 (undiagnosed invasive). We enforce PM for the following combinations: If (h, i, h ) equals (invasive, undetected, invasive), then the counterfactual state eh can not be death if eh is healthy, in-situ, or invasive. That is, h = 3, i {1, 2, 3}, h = 3, eh {1, 2, 3, 4, 5},ei O, and eh = 7. If (h, i, h ) equals (invasive, detected, invasive), then the counterfactual state eh can not be death if eh is invasive and detected. That is, h = 3, i = 5, h = 5, eh {3, 5},ei = 5, and eh = 7. If (h, i, h ) equals (invasive, detected, recovered), then the counterfactual state eh can not be invasive or death if eh is invasive and detected. That is, h = 3, i = 5, h = 6, eh {3, 5},ei = 5, and eh {3, 5, 7}. If (h, i, h ) equals (invasive, undetected, death), then the counterfactual state eh can not be invasive or recovered if eh is invasive and undetected. That is, h = 3, i {1, 2, 3}, h = 7, eh {3, 5},ei {1, 2, 3}, and eh {3, 5, 6}. 
If (h, i, h ) equals (invasive, detected, death), then the counterfactual state eh can not be invasive or recovered if eh is invasive and detected. That is, h = 3, i = 5, h = 7, eh {3, 5},ei = 5, and eh {3, 5, 6}. State h = 4 (diagnosed in-situ). We enforce PM for the following combinations: If (h, i, h ) equals (in-situ, detected, in-situ), then the counterfactual state eh can not be invasive, recovered, or death if eh is in-situ and detected. That is, h = 4, i = 4, h = 4, eh {2, 4},ei = 4, and eh {5, 6, 7}. If (h, i, h ) equals (in-situ, detected, invasive), then the counterfactual state eh can not be in-situ, recovered, or death if eh is in-situ and detected. That is, h = 4, i = 4, h = 5, eh {2, 4},ei = 4, and eh {4, 6, 7}. If (h, i, h ) equals (in-situ, detected, recovery), then the counterfactual state eh can not be in-situ, invasive, or death if eh is in-situ and detected. That is, h = 4, i = 4, h = 6, eh {2, 4},ei = 4, and eh {4, 5, 7}. If (h, i, h ) equals (in-situ, detected, death), then the counterfactual state eh can not be in-situ, invasive, or recovered if eh is in-situ and detected. That is, h = 4, i = 4, h = 7, eh {2, 4},ei = 4, and eh {4, 5, 6}. State h = 5 (diagnosed invasive). We enforce PM for the following combinations: If (h, i, h ) equals (invasive, detected, invasive), then the counterfactual state eh can not be recovered or death if eh is invasive and detected. That is, h = 5, i = 5, h = 5, eh {3, 5},ei = 5, and eh {6, 7}. Counterfactual Analysis in Dynamic Latent-State Models If (h, i, h ) equals (invasive, detected, recovery), then the counterfactual state eh can not be invasive or death if eh is invasive and detected. That is, h = 5, i = 5, h = 6, eh {3, 5},ei = 5, and eh {5, 7}. If (h, i, h ) equals (invasive, detected, death), then the counterfactual state eh can not be invasive or recovery if eh is invasive and detected. That is, h = 5, i = 5, h = 7, eh {3, 5},ei = 5, and eh {5, 6}. State h = 6 (recovery). No combination for which we enforce PM. State h = 7 (death). No combination for which we enforce PM. E.4. Details on the Comonotonic Copula We discussed the counterfactual simulation under the comonotonic copula for a general dynamic latent-state model in C.2. In this section, we connect that discussion to the breast cancer application. To do so, it suffices to define the rank functions r H( ) (for states) and r O( ) (for emissions). For states, there are two possible orderings that seem natural (from best to worst ): (1, 6, 4, 2, 5, 3, 7) (1, 6, 4, 5, 2, 3, 7). Recalling the Q(i) notation from E.1, observe that columns 2 and 5 are never active simultaneously in any row of Q(i) (for any i). Hence, the choice of ordering (between the two orderings above) will not matter and we can pick any one. Suppose we pick the first one. Then, this ordering defines the rank function. For example, r H(6) = 2, i.e., rank of state 6 equals 2. For the inverse function, r 1 H (2) = 6. It is unclear how to define r O( ) for the breast cancer application but as it turns out, it does not matter. To see why, consider the generic path of interest (from (1)): o1, . . . , oτs 1 | {z } {2,3} , oτs, . . . , oτe | {z } =1 , oτe+1, . . . , oτd 1 | {z } {4,5} , oτd:T | {z } =7 For the first τs 1 periods, observe that the counterfactual emission eo1:τs 1(b) equals the observed emission o1:τs 1 for each b. This is because the intervention policy ex1:τs 1 equals the observed policy x1:τs 1. Now, consider periods τs to τe, during which the screening was not done, i.e., xτs:τe = 1. 
Hence, the emissions $o_{\tau_s:\tau_e}$ equal 1 w.p. 1 (see the matrix $E(1)$ in E.1.3). This means that these emissions do not contain any information regarding the underlying noise variables $V_{\tau_s:\tau_e}$ (see Figure 6) and hence, their posterior equals their prior, which is Unif[0, 1]. As such, for $t \in \{\tau_s, \ldots, \tau_e\}$, we can sample $\tilde o_t(b)$ using the categorical distribution over the probability vector $[e_{\tilde h_t(b) \tilde x_t i}]_i$. Note that we can use $\tilde o_{\tau_s - 1}(b)$ to sample $\tilde h_{\tau_s}(b)$, which we can use to sample $\tilde o_{\tau_s}(b)$, and so on (until we have sampled $\tilde h_{\tau_e + 1}(b)$).

Now, consider $t = \tau_e + 1$. We know $o_t \in \{4, 5\}$:

If $o_t = 4$, then $h_t(b) = 2$ and $\tilde h_t(b) \in \{2, 4, 6\}$ (cf. pathwise monotonicity). If $\tilde h_t(b) \in \{4, 6\}$, then $\tilde o_t(b) = \tilde h_t(b)$ (since rows 4 and 6 of $E(0)$ have 1 on the diagonal). Else, if $\tilde h_t(b) = 2$ $(= h_t(b))$, then $\tilde o_t(b) = o_t = 4$.

Else, if $o_t = 5$, then $h_t(b) = 3$ and $\tilde h_t(b) \in \{3, 4, 5, 6\}$ (cf. pathwise monotonicity). If $\tilde h_t(b) \in \{4, 5, 6\}$, then $\tilde o_t(b) = \tilde h_t(b)$ (since rows 4, 5, and 6 of $E(0)$ have 1 on the diagonal). If $\tilde h_t(b) = 3$ $(= h_t(b))$, then $\tilde o_t(b) = o_t = 5$.

Finally, for $t \ge \tau_e + 2$, we know $h_t(b) \ge 4$ and that the corresponding rows in $E(0)$ are 0-1. Hence, the posterior of $V_t$ equals the prior and we can sample $\tilde o_t(b)$ using the categorical distribution over the probability vector $[e_{\tilde h_t(b) \tilde x_t i}]_i$. By construction, the comonotonic copula will obey pathwise monotonicity and hence will ensure that, in the counterfactual world, the patient does not die before period T, i.e., $\tilde H_{T-1} \neq 7$ w.p. 1.

E.5. Results for Path 2

Figure 10. PN results for path 2 as we vary $T \in \{4, \ldots, 10\}$ (analogous to Figure 3 for path 1). Panel (a) shows the UB and LB curves, panel (b) the UB(CS) and LB(CS) curves, and panel (c) the UB(PM) and LB(PM) curves, each alongside the naive, independence, and comonotonic benchmarks. As in Figure 3, observe that the LB, LB(CS), and LB(PM) curves coincide (the lowest curve in each figure). Further, the UB(CS) and UB(PM) curves coincide for path 2.