# counterfactual_strategies_for_markov_decision_processes__e645738e.pdf

Counterfactual Strategies for Markov Decision Processes

Paul Kobialka1 , Lina Gerlach2 , Francesco Leofante3 , Erika Abrah am2 , Silvia Lizeth Tapia Tarifa1 and Einar Broch Johnsen1

1University of Oslo, Oslo, Norway 2RWTH Aachen University, Germany 3Imperial College London, United Kingdom {paulkob, sltarifa, einarj}@ifi.uio.no, f.leofante@imperial.ac.uk, {gerlach, abraham}@cs.rwth-aachen.de

Counterfactuals are widely used in AI to explain how minimal changes to a model s input can lead to a different output. However, established methods for computing counterfactuals typically focus on one-step decision-making, and are not directly applicable to sequential decision-making tasks. This paper fills this gap by introducing counterfactual strategies for Markov Decision Processes (MDPs). During MDP execution, a strategy decides which of the enabled actions (with known probabilistic effects) to execute next. Given an initial strategy that reaches an undesired outcome with a probability above some limit, we identify minimal changes to the initial strategy to reduce that probability below the limit. We encode such counterfactual strategies as solutions to non-linear optimization problems, and further extend our encoding to synthesize diverse counterfactual strategies. We evaluate our approach on four real-world datasets and demonstrate its practical viability in sophisticated sequential decision-making tasks.

1 Introduction

Consider an application procedure in which clients who want to obtain a loan, interact with a bank to establish their eligibility. Although established prediction methods [Leo et al., 2019; Teinemaa et al., 2019] can be used to filter for eligible clients, the overall application procedure is far from automated. In practice, to receive a loan from the bank, a client must follow a complicated application procedure, involving, e.g., multiple consultations with loan advisors, providing various documents and filling out complex forms. Eligible but impatient clients are prone to abandoning the application procedure before they receive their loan, causing losses for both parties: the client spent time without reaching their goal and the bank invested resources without return. Markov Decision Processes (MDPs) [Baier and Katoen, 2008] can be used to model such procedures and improve their transparency; however, methods are currently lacking to address questions about recourse and process improvement; e.g., what would

enable an ineligible client to obtain a loan? and how can we simplify the application procedure for eligible clients? Counterfactual explanations can help answer such questions by showing how minimal changes in user applications would lead to a desired change of the output, e.g., to make the client eligible for the previously refused loan. However, most available methods for computing counterfactuals target onestep prediction tasks [Guidotti, 2024], and are not applicable to sequential decision making settings, where the output of a process is determined by a sequence of steps.

Contributions. To fill this gap, we propose a method to compute counterfactual explanations for sequential decision making processes modeled by an MDP. In each state of these non-deterministic discrete-time models, a finite number of actions with a known probabilistic effect are enabled. During execution, strategies decide which enabled action to execute next. Given an initial strategy that visits some undesired states with a probability above some limit, we propose to compute explanations in terms of counterfactual strategies that reduce the reachability probability below this limit, while staying as close to the initial strategy as possible. To formalize these requirements, we introduce a distance measure d on strategies and encode counterfactual strategy synthesis as nonlinear optimization problems. By enforcing that only user-controllable actions can be selected in the MDP, we ensure that the resulting counterfactual strategies are indeed actionable. Furthermore, we extend our encoding to compute not only a single solution but a collection of counterfactual strategies, which are optimized for diversity, as several studies emphasize that providing different counterfactuals is key for user understanding [Russell, 2019; Mothilal et al., 2020; Bove et al., 2023]. The method is evaluated on four realworld datasets. The evaluation shows that the method is computationally feasible, and can synthesize counterfactual strategies for sophisticated sequential decision making tasks, modeled as MDPs with thousands of states and ten thousands of transitions. In summary, our main contributions are (1) to introduce counterfactual strategies for MDPs as post-hoc explanations for sequential decision making, (2) to encode a counterfactual strategy synthesis problem for MDPs as a nonlinear optimization problem, (3) to synthesize diverse counterfactual strategies, and (4) to experimentally evaluate the feasibility and

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

performance of our counterfactual strategy synthesis method.

Outline. After recalling some background on MDPs, nonlinear optimization and counterfactuals in Section 2, we formalize counterfactual strategies for MDPs in Section 3, and present our encoding of the MDP counterfactual synthesis problem as a non-linear optimization problem in Section 4. We evaluate our approach in Section 5, discuss related work in Section 6, and conclude the paper in Section 7.

2 Preliminaries

A (discrete probability) distribution is a function µ: X [0, 1] with a discrete domain X such that P x X µ(x) = 1. We write Distr(X) for the set of distributions with domain X. Given two distributions µ1 and µ2 with domain X, their total variation distance is (µ1, µ2) = 1 2 P x X |µ1(x) µ2(x)| [Levin and Peres, 2017, Prop. 4.2], where | | stays for the absolute value. For n N, v Rn and i = 1, . . . , n, we denote the ith element of v as vi, and use the standard norms v 0 = |{vi | vi = 0}|, v 1 = P i |vi|, and v = maxi |vi|.

2.1 Markov Decision Processes

A Markov decision process (MDP) M is a tuple S, A, s0, δ , where S is a finite set of states, A is a finite set of actions, s0 S, and δ: S A Distr(S) is a partial function. For each state s S, let A(s) be the set of all actions a A for which δ(s, a) is defined; we require A(s) = and say that the actions in A(s) are enabled in s. By δ(s, a, s ) we denote δ(s, a)(s ) if δ(s, a) is defined and 0 otherwise. A (finite or infinite) path τ of M is a non-empty sequence of alternating states and actions s0a0s1 . . . such that δ(sj, aj, sj+1) > 0 for all j 0. The cylinder set Cyl(ˆτ) of a finite path ˆτ is the set of all infinite paths with ˆτ as a prefix. Let ΩM(s) and and Ωfin M(s) be the set of all infinite respectively finite paths of M starting in the state s S. A state t S can be reached from state s S if there exists a finite path from s to t. We use Reach(t) to denote the set of all states from which t can be reached. A (memoryless) strategy is a function σ: S Distr(A) that maps states to distributions over actions with σ(s)(a) = 0 for all s S and a A\A(s). We denote the set of strategies for M by ΣM; we omit the index if it is clear from the context. Given two strategies σ, σ Σ, we overload notation and define their distance vector as (σ, σ ) = ( (σ(s), σ (s)))s S (entries in fixed but arbitrary order). Applying a strategy σ to an MDP M induces a deterministic model. Thus we omit the actions and define the discrete-time Markov chain (DTMC) induced by σ on M as the tuple Mσ = S, s0, δσ with S and s0 as before and δσ : S S [0, 1] with δσ(s, s ) = P a A σ(s)(a) δ(s, a, s ) for all s, s S. We associate with D=Mσ the probability space (ΩD(s0), {S ˆτ R Cyl(ˆτ) | R Ωfin D(s0)}, Pr D(s0)) where the probability of the cylinder set of a finite path ˆτ = s0 . . . sn is Pr D(s0)(Cyl(ˆτ)) = Qn i=1 δ(si 1, si). By Pr D(s0, t) we denote the probability of reaching state t S from s0 in D.

Application

Consultation

Application+

Granted Rejected

Apply Consult

Consult Quit

1 Provider 0.1

0.9 Submit Quit

(a) MDP model. The only service provider action Provider appears deterministically, thus the user has full strategy control.

state s action a Apply Consult Quit Submit

s0 1 0 0 0 Error 0 0.2 0.8 0 Consultation 0 0 1 0 Rework 0 0 0.7 0.3

(b) Probability values σ(s)(a) of the impatient client strategy σ.

state s action a Apply Consult Quit Submit

s0 1 0 0 0 Error 0 0.2 0.8 0 Consultation 0 0 1 0 Rework 0 0 0.14 0.86

(c) Counterfactual strategy σ for the impatient client.

Figure 1: Running example of a loan application procedure.

Example 1. Figure 1a shows an MDP model M of a loan application procedure. Starting in s0, the client either directly fills out an application or requests a consultation to increase the probability of direct acceptance. However, when independently filling out the application, there is a 5% chance to make a mistake in the application, which requires a consultation to fix. If the application is not accepted directly, it can be reworked before it is evaluated. The client may decide to quit the application procedure after making a mistake in the form, after the consultation, or if the application is not directly accepted. The behavior of the service provider is captured by several occurrences of the Provider action.

The client s goal is to receive a loan, i.e. reach the Rejected state with a probability of at most 20%. For an impatient client who directly fills out the application using strategy σ ΣM in Fig. 1b, the probability of reaching Rejected is Pr Mσ(s0, Rejected) = 0.411.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

2.2 Non-linear Optimization Mixed Integer Quadratically Constrained Quadratic Problems (MIQCQPs) [Billionnet et al., 2016] are a class of nonlinear optimization problems, where the objective function and the constraints are at most quadratic in variables with real and integer domains. Formally, an MIQCQP has the form

min f0(x) subject to fi(x) bi for i = 1, . . . , m, 0 xj uj for xj VZ VR, xj Z for xj VZ, xj R for xj VR,

where m N is the number of constraints, x = (x1, . . . , xn) are the variables divided into the sets VZ and VR of integerrespectively real-valued variables, fi(x) = x T Qix + c T i x for all i {0, . . . , m} with symmetric matrices Qi Rn n and ci Rn. Bounds conform to the variable domains, i.e. uj Z for xj VZ, and uj R for xj VR. MIQCQPs are not convex and in general hard to solve [Billionnet et al., 2016; Garey and Johnson, 1979].

2.3 Counterfactual Explanations Informally, counterfactual explanations answer the question If A were true, would C have been true? by providing a counterfactual antecedent A such that under its observation the counterfactual consequent C would have evaluated to true [Balke and Pearl, 1994a]. Our notion of counterfactual strategies echoes common formalizations in machine learning [Russell, 2019; Mothilal et al., 2020; Guidotti, 2024; Molnar, 2020], which typically define counterfactual explanations as follows. For a set of classes C, a classifier f : Rn C, and an input x Rn, a counterfactual is a closest input to x w.r.t. a distance measure d that yields a desired class c C:

arg min x d(x, x ) subject to f(x ) = c.

This basic formulation of counterfactuals requires x to be close to the initial input x, to ensure that the changes suggested by the counterfactual are realistic.1 Further properties might be required for counterfactual explanations (see, e.g., the recent survey [Guidotti, 2024]). In the next section we discuss some of them, and map them to the MDP setting.

3 Counterfactual Explanations for MDPs

In this section, we first discuss desired properties of counterfactuals (as formalised in the machine learning literature) and how they translate to MDPs. Based on these, we then introduce our definition of counterfactual strategies for MDPs. We refine our definition to account for different notions of distances between the counterfactual strategy and the initial strategy, while also extending the running example. We consider the following four desired properties of counterfactual explanations from machine learning (ML) [Gajcin and Dusparic, 2024; Guidotti, 2024] and translate them to MDPs:

1Note that if f(x) = c then x = x is a counterfactual.

Validity: ML: The counterfactual does change the classification to the desired class, i.e. f(x ) = c. MDP: Following the counterfactual strategy reduces the probability of reaching t below a given threshold. Proximity: ML: The distance between initial input and counterfactual is minimal. MDP: The distance between initial strategy and counterfactual strategy is minimal. Actionability: ML: Only features from a set of actionable features are mutated. MDP: Only actions controlled by the user are altered in the counterfactual strategy. Sparsity: ML: The number of changed features is minimal. MDP: The number of actions changed in the counterfactual strategy is minimal. To define distance measures for strategies, for any two strategies σ, σ Σ, let

d0(σ, σ ) := (σ, σ ) 0 d1(σ, σ ) := (σ, σ ) 1 / |S|

d (σ, σ ) := (σ, σ )

where |S| is the number of states. Here, d0 captures the sparsity of the counterfactual by measuring the number of states where a decision was changed, while d1 and d address proximity by measuring the average, respectively maximal, changes over all states between counterfactual and input strategy. A strategy distance measure for M is a function d : ΣM ΣM R using some r0, r1, r R to define d(σ, σ ) = r0 d0(σ, σ )+r1 d1(σ, σ )+r d (σ, σ ) for σ, σ ΣM. Now we are ready to define counterfactual strategies for MDPs. Definition 1 (Counterfactual Strategy). Assume an MDP M = (S, A, s0, δ), a strategy σ ΣM, a bound γ [0, 1], and a target state t S such that Pr Mσ(s0, t) > γ. Let furthermore d be a strategy distance measure for M. We call a strategy σ ΣM a counterfactual strategy to σ (under d for reaching t within γ in M) if (i) Pr Mσ (s0, t) γ and (ii) d(σ, σ ) d(σ, σ ) for all σ ΣM with Pr Mσ (s0, t) γ. Example 2. Consider again the MDP M and the strategy σ from Ex. 1 with strategy distance measure d(σ , σ ) := d0(σ , σ ) + d1(σ , σ ) + d (σ , σ ). Strategy σ ΣM from Fig. 1c is a counterfactual strategy to σ under d for reaching Rejected within γ = 0.2 in M by asking the client to continue after Rework. To reduce computational cost, we also define ϵcounterfactual strategies by replacing the requirement of smallest distance by the requirement of bounded distance. Definition 2 (ϵ-Counterfactual Strategy). Assume an MDP M = (S, A, s0, δ), a strategy σ ΣM, a bound γ [0, 1], and a target state t S such that Pr Mσ(s0, t) > γ. Let furthermore d be a strategy distance measure for M and ϵ > 0. We call a strategy σ ΣM an ϵ-counterfactual strategy to σ (under d for reaching t within γ in M) if (i) Pr Mσ (s0, t) γ and (ii) d(σ, σ ) ϵ. In this paper we focus on counterfactual strategies, but our methods can be easily adapted to ϵ-counterfactual strategies,

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

which need less computational effort because the optimality criterion is dropped. Example 3. Strategy σ from Ex. 2 changes the decision only in state Rework, thus (σ, σ ) = (0, 0, 0, 0.56). Therefore, strategy σ is a 0.56-counterfactual strategy under d , 0.14-counterfactual strategy under d1 and a 1-counterfactual strategy under d0. Only all three measures combined reveal an accurate picture of the changes required for adapting to the counterfactual. Note that for ϵ < 0.51, there exists no valid counterfactual strategy under d for γ = 0.2.

4 Computing Counterfactual Strategies We propose to generate counterfactual strategies σ by solving non-linear optimization problems, minimizing the distance d(σ, σ ) = r0 d0(σ, σ )+r1 d1(σ, σ )+r d (σ, σ ) to the initial strategy σ over all strategies σ that reach the target state t with a probability below the given limit of γ:

arg min σ ΣM d(σ, σ ) subject to Pr Mσ (s0, t) γ.

To formalize the above optimization problem in arithmetic terms, we use for each s S and a A(s) the following real variables: (1) psa to encode the probability σ (s)(a) that in the state s the counterfactual strategy σ chooses the action a; (2) ps to encode the probability of reaching t from s in Mσ ; (3) s to encode (σ(s), σ (s)); (4) D for {0, 1, } to encode the distances d (σ, σ ). In addition, we introduce for each state s an integer variable is {0, 1}, whose value is 1 iff σ and σ define different distributions at state s. For fixed input MDP M = S, A, s0, δ , state t, limit γ, strategy σ, and real coefficients r0, r1 and r , the encoding is as follows:

min r0 D0 + r1 D1 + r D (1)

s S, a A(s) : 0 psa 1 (2)

a A(s) psa = 1 (3)

pt = 1 (4) s S\Reach(t) : ps = 0 (5) s Reach(t)\{t} : 0 ps 1 (6)

s Reach(t)\{t} : ps = X

s S psa δ(s, a, s ) ps (7)

s S : s = 1

a A(s) |σ(s)(a)9psa| (9)

s S : 0 is 1 s is (10)

s S : D0 = X

s S is (11)

s S : s D (13)

Here, Eq. (1) encodes the objective function value d(σ, σ ). Constraints (2)-(3) encode psa as the probabilistic choices of

σ . Constraints (4)-(7) use the Bellman equations for computing the probabilities to reach t from individual states, where Reach(t) is the set of all states from which t is reachable (this can be easily computed by graph analysis). Constraint (8) enforces that Pr Mσ (s0, t) γ. Finally, Constraints (9)-(13) encode the distances d (σ, σ ). Constraint 10 ensures that a positive distance s, indicating a difference between σ(s) and σ (s), enforces is = 1, and minimization will ensure is = 0 for s = 0. Note that for the infinity norm, even though Constraint 13 only encodes that D is an upper bound on the distribution distance (σ(s), σ (s)) for all s S, minimizing the objective function will ensure that D equals the smallest such value (i.e. the maximum) over all states. Note that the non-linearity of the problem stems from the calculation of ps in Constraint (7), since we allow probabilistic strategy decisions. Let in the following P denote the MIQCQP optimization problem defined by the Constraints (1)-(13).

Lemma 1. Assume a solution to P, assigning to each variable v the value ν(v). Let σ be the strategy for M with σ (s)(a) = ν(psa) for all s S and a A(s). Then the objective function value as specified in Constraint (1) equals d(σ, σ ).

Proof sketch. We observe:

s = (σ(s), σ (s)) according to Constraint (9);

D0 = d0(σ, σ ) denotes the number of non-zero elements in (σ, σ ), using the counting mechanism from Eq. (10). The variables is indicate whether the entry s for s in (σ, σ ) is non-zero. As s 1 holds for all s S, it follows that for is = 1 we have s is. By minimizing D0, each is is set to 1 if and only if s = 0.

D1 = d1(σ, σ ) denotes the average over (σ, σ ).

D = d (σ, σ ) encodes the maximal entry in (σ, σ ): by limiting each element s = (σ(s), σ (s)) from above by D and by minimizing D , D is forced to be the maximum.

Thus, the value of the objective function (1) is per definition d(σ, σ ).

Theorem 1 (Soundness and Completeness). P admits a solution iff there exists a counterfactual strategy to σ (under d for reaching t within γ in M).

Proof sketch. . Let ν be a solution to P assigning to each variable v the value ν(v). The values of the variables psa induce a valid strategy σ Σ for M, with σ (s)(a) = ν(psa) for all s S and a A(s). By satisfying (8), σ satisfies Pr Mσ (s0, t) γ. According to Lemma 1, the objective function value is d(σ, σ ). As the solution minimizes the objective function value, there exists no strategy with smaller distance. Hence, σ is a counterfactual strategy to σ (under d for reaching t within γ in M). . Let σ Σ be a counterfactual strategy to σ. The strategy can be extended to a solution for the optimization problem by (1) encoding σ into variables psa = σ (s)(a), (2) computing reachability values for ps satisfying the Bellman optimality equation, (3) setting is = 1 iff any decision in

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

state s was changed, and (4) setting s and the distance variables D0, D1 and D accordingly. As σ is well-defined, all constraints in P are satisfied, and from Definition 1 it follows that σ minimizes d(σ, σ ).

The following theorem prohibits efficient, i.e. polynomialtime, algorithms for solving the MIQCQP optimization problem for counterfactual strategies.

Theorem 2. The presented optimization problem for counterfactual strategies is generally nonconvex.

Proof. Recall that an optimization problem is convex iff the target function and all constraints are convex [Boyd and Vandenberghe, 2004]. We show that the constraints of our optimization problem are, in general, not convex by providing a counterexample: Consider the MDP defined in Fig. 1a. The quadratic constraint stemming from (7) for encoding ps0 can be expressed as follows:

ps0 = ps0Apply 0.95 p Application + ps0Apply 0.05 p Error+ ps0Consult 1 p Consultation =

ps0Apply p Application p Error ps0Consult p Consultation

2 0 0 0 0.95

2 0 0 0 0.05

2 0 0 0 0 0 0 0 1 2 0 0 0 1 2 0

ps0Apply p Application p Error ps0Consult p Consultation

Our goal is now to check whether the function f : R5 R defined by f(x) = x T Ps0x for x R5 is convex. Oberserve that the Hessian of x T Ps0x is 2Ps0. The eigenvalues of 2Ps0 are 1, 1,

362 20 , and 0. As the matrix is symmetric and has a negative eigenvalue, it is not positive semi-definite. By the second-order condition of convexity [Boyd and Vandenberghe, 2004], the constraint is thus not convex, and the whole problem is neither.

Remark. We note that validity, actionability, proximity, and sparsity are ensured by construction in our approach. Validity of counterfactual strategies requires that Pr Mσ(s0, t) γ, i.e. following the counterfactual strategy σ reduces the chance of reaching t below the limit γ, ensured by Constraint (8). Proximity minimizes the changes in the counterfactual strategy σ . The minimization of the strategy distance measure d ensures that σ is as close to σ as possible but yet satisfies Pr Mσ(s0, t) γ. Actionability of counterfactual strategies follows from a valid MDP where only actually controllable features are controllable. Sparsity between initial strategy σ and counterfactual strategy σ follows from minimizing the strategy distance measure.

Diverse Counterfactual Strategies. A single counterfactual strategy provides only a single alternative, e.g., for recourse. However, recent work [Bove et al., 2023] has shown that offering diverse counterfactuals demonstrating different possibilities for recourse may improve the interpretability of AI decisions. To this end, we extend our method for computing an individual counterfactual strategy

to an iterative method where the nth counterfactual minimizes the distance to the initial strategy while maximizing the distance to all previously generated solutions. We define the diversity of a collection of strategies σ0, . . . , σn Σ as the determinant of the matrix of inverse pairwise distances D Rn n with Dij = 1 1+ (σi,σ j) 1 , as

done in [Mothilal et al., 2020]. Further, we adjust the objective function (1) to additionally optimize for diversity:

min r0 D0 + r1 D1 + r D λ det(D).

To avoid ill-defined determinants, a small perturbation is added to the diagonals. In comparison to [Mothilal et al., 2020], we do not average the 1 norm as each element is smaller than 1 and we wish to maximize for diversity. The parameter λ weights the diversity part of the target function. In this work, we use r0 = r1 = r = 1 and λ = 2 to weight each distance component equally and to weight diversity higher than distances. To evaluate the diversity of counterfactual strategies, we consider the fraction of novel state-action pairs introduced. By this, a diverse counterfactual strategy contains many stateaction pairs not utilized in previously.

5 Experimental Evaluation

Our aim is to demonstrate that (1) counterfactual strategies can be efficiently computed for complex sequential decision processes and (2) diverse counterfactual strategies can be generated. This is done by experiments Exp1 and Exp2, respectively, conducted on complementary real-world datasets.

5.1 Experimental Design and Setup In our experiments, we consider four real-world datasets. Grep S records customer interaction with a programming skill evaluation service [Kobialka et al., 2022]. BPIC12 [van Dongen, 2012] and BPIC17 [van Dongen, 2017], which record the loan application procedure in a bank, stem from the Business Process Intelligence Challenge2 of the IEEE Task Force on Process Mining.3 MSSD is the Music Streaming Sessions Dataset [Brost et al., 2019] from Spotify; we consider the small version of MSSD, with 10 000 listening sessions. We briefly outline the experimental setup (for further details, see the extended version [Kobialka et al., 2025]). After standard preprocessing of the datasets, stochastic automata learning [Mao et al., 2016] was used to generate the MDPs. Given the size of the MSSD dataset, we construct 10 models depending on the number of included traces; e.g., MSSD10 and MSSD40 include 10% and 40% of the data set, respectively. For MSSD, the number of states in each model is drastically higher than for the other datasets: already MSSD10 has on average 40 times more states than BPIC12 or BPIC17, and a 100 times higher max degree. We randomly generate ten initial user strategies for each model and let the target probability γ range over {0.0001} {0.1, 0.2, . . . , 1}, where 0.0001 represents near-perfect performance.

2https://www.tf-pm.org/competitions-awards/bpi-challenge 3https://www.tf-pm.org/

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Figure 2: Runtime comparison.

Model mean(t) std(t) min(t) max(t) Opt. Inf. T.O.

Greps 0.01 0.01 0.01 0.05 90 20 0 BPIC12 0.79 0.94 0.04 4.24 110 0 0 BPIC17 1.00 1.06 0.01 5.01 100 10 0

Table 1: Averaged Grep S and BPIC runtime results in seconds for computing counterfactual strategies.

5.2 Computing Counterfactual Strategies In Exp1, we compute counterfactual strategies for all models, showing that counterfactual strategies for models with thousands of states and tens of thousands of transitions can be computed within minutes. Table 1 compares averaged computation times and outcomes for all values of γ for Grep S, BPIC12 and BPIC17; Opt. denotes optimally solved instances, Inf. infeasible instances and T.O. timeouts. No computation took more than a minute, e.g. for BPIC17 the longest computation took 5.01 seconds. The mean for all models is around one second. Table 2 shows runtimes for the MSSD models; Sub.O. denotes instances solved within the time limit, but not necessarily optimally. While MSSD10 MSSD30 had few timeouts, MSSD70 MSSD100 had 269 300 timeouts. Table 3 shows individual results for sextiles S1 to S3 of γ; see the full table in [Kobialka et al., 2025]. For all MSSD models both trivial and infeasible problems are solved, see S1 and S3 in Table 3a. Figure 2 details runtimes for MSSD10 MSSD30 with γ [0.1, 0.5], highlighting the runtime peak for nontrivial instances. The difficulty lies in computing non-trivial counterfactual strategies around S2, where all non-trivial models from MSSD50 MSSD100 timeout, see Table 3b. The counterfactual strategies can be used to provide interpretable recommendations to users, which is essential to enable users to follow the recourse suggested by the counterfactual strategy. To this aim, counterfactual strategies are presented in a textual representation highlighting the suggested changes. An example for two states of BPIC12 is given below, where the client is asked not to cancel the loan offer after receiving the first offer but to continue in the process:

State negative is reached with probability 0.64. You can reach negative with probability 0.09 as follows: In state q9: A_CANCELLED increase probability of action Nabellen offer. to 0.89

Model mean(t) std(t) min(t) max(t) Opt. Inf. T.O. Sub.O.

MSSD10 56.08 152.35 0.05 T.O. 710 388 2 0 MSSD20 195.77 453.00 0.10 T.O. 707 343 41 9 MSSD30 276.62 556.85 0.14 T.O. 703 320 60 17 MSSD40 412.96 718.68 0.18 T.O. 700 193 207 0 MSSD50 464.38 767.34 0.22 T.O. 700 149 251 0 MSSD60 483.53 788.52 0.24 T.O. 700 123 277 0 MSSD70 485.84 789.70 0.29 T.O. 700 131 269 0 MSSD80 493.91 799.60 0.33 T.O. 700 103 297 0 MSSD90 494.57 799.79 0.36 T.O. 700 100 300 0 MSSD100 494.39 799.90 0.39 T.O. 700 100 300 0

Table 2: MSSD runtime results in seconds for computing counterfactual strategies.

decrease probability of action negative to 0.07 In state q27: Nabellen offer.#0 increase probability of action O_SENT_BACK to 0.81 decrease probability of action negative to 0.0 decrease probability of action O_CANCELLED to 0.0

We summarize our conclusions for Exp1: counterfactual strategies can be efficiently computed for complex MDPs with up to 10 000 states and 20 000 transitions; these models are significantly larger than models used in current process mining benchmarks but occur in, e.g., MSSD.

5.3 Diverse Counterfactuals In Exp2, we generate a collection of diverse counterfactual strategies for Grep S, BPIC12, and BPIC17 and compare the distance between counterfactual strategies as well as the diversity of the counterfactual strategies. To evaluate diversity, we investigate the fraction of novel state-action pairs, i.e., the actions changed in a counterfactual strategy compared to the initial strategy that were not changed in any previous one. Figure 3 compares the distance between each generated counterfactual strategy and the initial strategy (Fig. 3a), and shows the fraction of novel state-action pairs in each strategy (Fig. 3b). While the distance to the initial strategy varies only slightly between consecutive counterfactual strategies, each provides novel recourse strategies. Intermediate values of γ offer the largest range of diversity for counterfactual explanations. The individual distances to the initial strategy and the fraction of novel state-action pairs increase with γ, until the problem is trivially satisfied, see [Kobialka et al., 2025]. We summarize our conclusions for Exp2: diverse counterfactual strategies can be efficiently computed for the benchmark problems considered. The diverse counterfactual strategies introduce new recourse possibilities while remaining at a short distance to the initial strategy, comparable to the distance of the first counterfactual strategy.

6 Related Work We discuss related work with respect to stochastic counterfactuals, model repair and synthesis. Balke and Pearl discuss stochastic evaluations of counterfactual queries in their seminal work [Balke and Pearl, 1994b]. Since then, much work on counterfactual explanations has been published, summarized by Guidotti [Guidotti, 2024]. An adaptation for predictive business process monitoring was presented with LOR-

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

S1 (0, 0.17] S2 (0.17, 0.33] S3 (0.33, 0.5] Model Opt. Inf. T.O. Sub. Opt. Inf. T.O. Sub. Opt. Inf. T.O. Sub.

MSSD10 0 200 0 0 10 188 2 0 200 0 0 0 MSSD20 0 200 0 0 7 143 41 9 200 0 0 0 MSSD30 0 200 0 0 3 120 60 17 200 0 0 0 MSSD40 0 187 13 0 0 6 194 0 200 0 0 0 MSSD50 0 149 51 0 0 0 200 0 200 0 0 0 MSSD60 0 123 77 0 0 0 200 0 200 0 0 0 MSSD70 0 131 69 0 0 0 200 0 200 0 0 0 MSSD80 0 103 97 0 0 0 200 0 200 0 0 0 MSSD90 0 100 100 0 0 0 200 0 200 0 0 0 MSSD100 0 100 100 0 0 0 200 0 200 0 0 0

(a) Categorical results by sextile.

S1 (0, 0.17] S2 (0.17, 0.33] S3 (0.33, 0.5] Model mean(t) std(t) min(t) max(t) mean(t) std(t) min(t) max(t) mean(t) std(t) min(t) max(t)

MSSD10 39 40 0 128 266 267 73 T.O. 1 0 1 2 MSSD20 92 114 0 684 978 603 126 T.O. 2 0 1 4 MSSD30 159 167 0 567 1352 495 335 T.O. 3 1 2 5 MSSD40 468 585 0 T.O. 1793 59 1144 T.O. 3 1 2 7 MSSD50 743 796 0 T.O. T.O. 0 T.O. T.O. 3 1 2 6 MSSD60 847 865 0 T.O. T.O. 0 T.O. T.O. 3 1 2 8 MSSD70 858 867 0 T.O. T.O. 0 T.O. T.O. 4 1 3 7 MSSD80 899 901 0 T.O. T.O. 0 T.O. T.O. 5 1 3 13 MSSD90 900 902 0 T.O. T.O. 0 T.O. T.O. 6 1 3 12 MSSD100 900 902 0 T.O. T.O. 0 T.O. T.O. 5 1 4 12

(b) Runtime results by sextile, rounded to seconds.

Table 3: MSSD results for selected sextiles of γ.

(a) Boxplot over distances from three diverse counterfactual strategies to the initial strategy.

(b) Boxplot showing novel state-action fractions for three diverse counterfactual strategies.

Figure 3: Results for diverse counterfactual strategies.

LEY [Huang et al., 2021]. Notably, MDPs in combination with causal models have been adapted for counterfactual reasoning [Tsirtsis et al., 2021; Kazemi et al., 2024]. In this line of work, the authors define the transition probabilities of the MDP via structural causal models and then search for a counterfactual path that only diverges in k actions from the given path. In contrast, our work does not assume causal knowledge. We further define counterfactuals over strategies, as opposed to paths in the MDP. Although both model repair and counterfactuals start with a model and a violated property, these problems aim for different solutions. Model repair considers the problem of ad-

justing a model to satisfy a desired property [Bartocci et al., 2011; Chatzieleftheriou and Katsaros, 2018; Chen et al., 2013; Pathak et al., 2015]. In contrast, counterfactual strategies do not aim to adjust the underlying model, but rather to propose behavior changes to the user (thus, the transition probabilities of the model remain unchanged). Recent work on model synthesis for parametric MDPs consider transition probabilities expressed as functions over variables. In this setting, one searches for a parameter valuation such that a given property is satisfied under every strategy [Cubuktepe et al., 2017; Cubuktepe et al., 2018; Cubuktepe et al., 2021]. In contrast, our work starts from a fully specified model and a strategy, and we search for a minimal change of the strategy that satisfies the given property.

7 Conclusion

In this work, we introduced counterfactual strategies for Markov decision processes as post-hoc explanations for sequential decision-making tasks. We presented an optimization approach for computing counterfactual strategies and extended it to also optimize for diversity. In extensive experiments on four real-world datasets, we evaluated the generation of diverse counterfactual strategies, showing that counterfactual strategies can be generated within minutes for models significantly larger than current Process Mining benchmarks. Our work opens several interesting avenues for future work. First, we plan to investigate the complexity of generating counterfactual strategies further, as well as techniques to further reduce the runtime of our approach. Approximations of the optimization problem, such as those based on linearization, e.g., [Cubuktepe et al., 2021], promise to reduce the computation time while producing local optimal strategies. Second, it would be interesting to study the problem of generating counterfactual strategies for scenarios where the environment (e.g., the service provider) adapts to counterfactual changes. This will require generalizing our counterfactual strategies from MDPs to Stochastic Games, thus requiring novel theoretical investigations. Finally, it would be interesting to investigate whether our counterfactual strategies result in realistic recourse behaviors, e.g. by measuring similarity to those witnessed in our dataset or by running user studies.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

Acknowledgments

Leofante was supported by Imperial College through the Imperial College Research Fellowship scheme. Kobialka, Tapia Tarifa, and Johnsen were supported by the Smart Journey Mining project, funded by the Research Council of Norway (grant no. 312198). Gerlach is supported by the DFG RTG 2236/2 project Un RAVe L.

References [Baier and Katoen, 2008] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT press, 2008. [Balke and Pearl, 1994a] Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty in artificial intelligence, pages 46 54. Elsevier, 1994. [Balke and Pearl, 1994b] Alexander Balke and Judea Pearl. Probabilistic evaluation of counterfactual queries. In Proc. 12th National Conference on Artificial Intelligence (AAAI), pages 230 237. AAAI Press / The MIT Press, 1994. [Bartocci et al., 2011] Ezio Bartocci, Radu Grosu, Panagiotis Katsaros, CR Ramakrishnan, and Scott A Smolka. Model repair for probabilistic systems. In Proc. 17th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2011), pages 326 340. Springer, 2011. [Billionnet et al., 2016] Alain Billionnet, Sourour Elloumi, and Am elie Lambert. Exact quadratic convex reformulations of mixed-integer quadratically constrained problems. Mathematical Programming, 158(1):235 266, 2016. [Bove et al., 2023] Clara Bove, Marie-Jeanne Lesot, Charles Albert Tijus, and Marcin Detyniecki. Investigating the intelligibility of plural counterfactual examples for non-expert users: an explanation user interface proposition and user study. In Proc. 28th International Conference on Intelligent User Interfaces, pages 188 203. ACM, 2023. [Boyd and Vandenberghe, 2004] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [Brost et al., 2019] Brian Brost, Rishabh Mehrotra, and Tristan Jehan. The music streaming sessions dataset. In The World Wide Web Conference, pages 2594 2600, 2019. [Chatzieleftheriou and Katsaros, 2018] George Chatzieleftheriou and Panagiotis Katsaros. Abstract model repair for probabilistic systems. Information and Computation, 259:142 160, 2018. [Chen et al., 2013] Taolue Chen, Ernst Moritz Hahn, Tingting Han, Marta Kwiatkowska, Hongyang Qu, and Lijun Zhang. Model repair for Markov decision processes. In 2013 International Symposium on Theoretical Aspects of Software Engineering, pages 85 92. IEEE, 2013. [Cubuktepe et al., 2017] Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, Ivan Papusha,

Hasan A. Poonawala, and Ufuk Topcu. Sequential convex programming for the efficient verification of parametric MDPs. In Proc. 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2017), volume 10206 of LNCS, pages 133 150. Springer, 2017.

[Cubuktepe et al., 2018] Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, and Ufuk Topcu. Synthesis in p MDPs: A tale of 1001 parameters. In International Symposium on Automated Technology for Verification and Analysis, pages 160 176. Springer, 2018.

[Cubuktepe et al., 2021] Murat Cubuktepe, Nils Jansen, Sebastian Junges, Joost-Pieter Katoen, and Ufuk Topcu. Convex optimization for parameter synthesis in MDPs. IEEE Transactions on Automatic Control, 67(12):6333 6348, 2021.

[Gajcin and Dusparic, 2024] Jasmina Gajcin and Ivana Dusparic. Redefining counterfactual explanations for reinforcement learning: Overview, challenges and opportunities. ACM Computing Surveys, 56(9):1 33, 2024.

[Garey and Johnson, 1979] Michael R Garey and David S Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[Guidotti, 2024] Riccardo Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Data Mining and Knowledge Discovery, 38(5):2770 2824, 2024.

[Huang et al., 2021] Tsung-Hao Huang, Andreas Metzger, and Klaus Pohl. Counterfactual explanations for predictive business process monitoring. In European, Mediterranean, and Middle Eastern Conference on Information Systems, pages 399 413. Springer, 2021.

[Kazemi et al., 2024] Milad Kazemi, Jessica Lally, Ekaterina Tishchenko, Hana Chockler, and Nicola Paoletti. Counterfactual influence in Markov decision processes. ar Xiv preprint ar Xiv:2402.08514, 2024.

[Kobialka et al., 2022] Paul Kobialka, Silvia Lizeth Tapia Tarifa, Gunnar Rye Bergersen, and Einar Broch Johnsen. Weighted games for user journeys. In Proc. 20th International Conference Software Engineering and Formal Methods (SEFM 2022), volume 13550 of LNCS, pages 253 270. Springer, 2022.

[Kobialka et al., 2025] Paul Kobialka, Lina Gerlach, Francesco Leofante, Erika Abrah am, Silvia Lizeth Tapia Tarifa, and Einar Broch Johnsen. Counterfactual strategies for Markov decision processes. ar Xiv preprint ar Xiv:2505.09412, 2025.

[Leo et al., 2019] Martin Leo, Suneel Sharma, and Koilakuntla Maddulety. Machine learning in banking risk management: A literature review. Risks, 7(1):29, 2019.

[Levin and Peres, 2017] David A Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Soc., 2017.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)

[Mao et al., 2016] Hua Mao, Yingke Chen, Manfred Jaeger, Thomas D. Nielsen, Kim G. Larsen, and Brian Nielsen. Learning deterministic probabilistic automata from a model checking perspective. Machine Learning, 105(2):255 299, 2016. [Molnar, 2020] Christoph Molnar. Interpretable Machine Learning. Lulu.com, 2020. [Mothilal et al., 2020] Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proc. 2020 Conference on Fairness, Accountability, and Transparency, pages 607 617. ACM, 2020.

[Pathak et al., 2015] Shashank Pathak, Erika Abrah am, Nils Jansen, Armando Tacchella, and Joost-Pieter Katoen. A greedy approach for the efficient repair of stochastic models. In NASA Formal Methods Symposium, pages 295 309. Springer, 2015. [Russell, 2019] Chris Russell. Efficient search for diverse coherent explanations. In Proc. Conference on Fairness, Accountability, and Transparency, pages 20 28. ACM, 2019. [Teinemaa et al., 2019] Irene Teinemaa, Marlon Dumas, Marcello La Rosa, and Fabrizio Maria Maggi. Outcomeoriented predictive process monitoring: Review and benchmark. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(2):1 57, 2019. [Tsirtsis et al., 2021] Stratis Tsirtsis, Abir De, and Manuel Rodriguez. Counterfactual explanations in sequential decision making under uncertainty. Advances in Neural Information Processing Systems, 34:30127 30139, 2021. [van Dongen, 2012] Boudewijn van Dongen. BPI Challenge, 2012. [van Dongen, 2017] Boudewijn van Dongen. BPI Challenge, 2017.

Proceedings of the Thirty-Fourth International Joint Conference on Artiﬁcial Intelligence (IJCAI-25)