# counterfactual_plans_under_distributional_ambiguity__37472699.pdf Published as a conference paper at ICLR 2022 COUNTERFACTUAL PLANS UNDER DISTRIBUTIONAL AMBIGUITY Ngoc Bui, Duy Nguyen, Viet Anh Nguyen Vin AI Research, Vietnam Counterfactual explanations are attracting significant attention due to the flourishing applications of machine learning models in consequential domains. A counterfactual plan consists of multiple possibilities to modify a given instance so that the model s prediction will be altered. As the predictive model can be updated subject to the future arrival of new data, a counterfactual plan may become ineffective or infeasible with respect to the future values of the model parameters. In this work, we study the counterfactual plans under model uncertainty, in which the distribution of the model parameters is partially prescribed using only the firstand second-moment information. First, we propose an uncertainty quantification tool to compute the lower and upper bounds of the probability of validity for any given counterfactual plan. We then provide corrective methods to adjust the counterfactual plan to improve the validity measure. The numerical experiments validate our bounds and demonstrate that our correction increases the robustness of the counterfactual plans in different real-world datasets. 1 INTRODUCTION Machine learning models, thanks to their superior predictive performance, are blooming with increasing applications in consequential decision-making tasks. Along with the potential to help make better decisions, current machine learning models are also raising concerns about their explainability and transparency, especially in domains where humans are at stake. These domains span from loan approvals (Siddiqi, 2012), university admission (Waters & Miikkulainen, 2014) to job hiring (Ajunwa et al., 2016). In these applications, it is instructive to understand why a particular algorithmic decision is made, and counterfactual explanations act as a useful toolkit to comprehend (black-box) machine learning models (Wachter et al., 2017). Counterfactual explanation is also known in the field of interpretable machine learning as contrastive explanation (Miller, 2018; Karimi et al., 2020b) or recourse (Ustun et al., 2019). A counterfactual explanation suggests how an instance should be modified so as to receive an alternate algorithmic outcome. As such, it could be used as a suggestion for improvement purposes. For example, a student is rejected from graduate study, and the university can provide one or multiple counterfactuals to guide the applicant for admission in the following year. A concrete example may be of the form get a GRE score of at least 325 or get a 6-month research experience . In practice, providing a counterfactual plan consisting of multiple examples is highly desirable because a single counterfactual to every applicant with the same covariates may be unsatisfactory (Wachter et al., 2017). Indeed, the covariates can barely capture the intrinsic behaviors, constraints, and unrevealed preferences of the person they represent so that the users with the same features may have different preferences to modify their input. As a consequence, a pre-emptive design choice is to provide a menu of possible recourses, and let the applicant choose the recourse that fits them best. Viewed in this way, a counterfactual plan has the potential to increase satisfaction and build trust among the stakeholders of any machine learning application. Constructing a counterfactual plan, however, is not a straightforward task because of the many competing criteria in the design process. By definition, the plan should be valid: by committing to any counterfactual in the plan, the application should be able to flip his current unfavorable outcome to a favorable one. However, each possibility in the plan should be in the proximity of the covariates Published as a conference paper at ICLR 2022 of the applicant so that the modification is actionable. Further, the plan should consist of a diverse range of recourses to accommodate the different tastes and preferences of the population. Russell (2019) propose a mixed-integer programming method to generate a counterfactual plan for a linear classifier, in which the diversity is imposed using a rule-based approach. In Dandl et al. (2020), the authors propose a model-agnostic approach using a multi-objective evolutionary algorithm to construct a diverse counterfactual plan. More recently, Mothilal et al. (2020) use the determinantal point process to measure the diversity of a plan. The authors then formulate an optimization problem to find the counterfactual plan that minimizes the weighted sum of three terms representing validity, proximity, and diversity. A critical drawback of the existing works is the assumption of an invariant predictive model, which often fails to hold in practical settings. In fact, during a turbulent pandemic time, it is difficult to assume that the demographic population of students applying for postgraduate studies remain unchanged. And even in the case that the demography remains unchanged, special pandemic conditions such as hybrid learning mode or travel bans may affect the applicants package, which in turn leads to fluctuations of the covariate distribution in the applicant pool. Figure 1: A student applies in Year 2021 and receives an unfavorable admission outcome. The student implements one of the recommended recourse x chosen from the counterfactual plan {xj} and re-applies in Year 2022. However, the outcome is again unfavorable because of the change in the model parameters θ. These shifts in the data are channeled to the shift in the parameters of the predictive model: when the machine learning models are re-trained or re-calibrated with new data, their parameters also change accordingly (Venkatasubramanian & Alfano, 2020). This raises an emerging concern because the counterfactual plan is usually designed to be valid to only the current model, but that is not enough to guarantee any validity on the future models. Thus, the counterfactual plan carries a promise of a favorable future outcome, nevertheless, this promise is fragile. It is hence reasonable to demand the counterfactual plan to be robust with respect to the shift of the parameters. Pawelczyk et al. (2020) study the sparsity of counterfactuals and its non-robustness under different fixed models (predictive multiplicity). Rawal et al. (2020) consider the counterfactual plan problem and describe several types of model shift related to the correction, temporal, and geospatial shift from data. They also study the trade-off between the recourse proximity and its validity regarding the model updates. Most recently, Upadhyay et al. (2021) leverage robust optimization to generate a counterfactual that is robust to some constrained perturbations of the model s parameters. However, both works consider only the single counterfactual settings. Contributions. We study the many facets of the counterfactual plans with respect to random future model parameters. We focus on a linear classification setting and we prescribe the random model parameters only through the firstand second-moment information. We contribute concretely 1. a diagnostic tool to assess the validity of a counterfactual plan. It provides a lower and upper bound on the probability of joint validity of a given plan subject to uncertain model parameters. 2. a correction tool to improve the validity of a counterfactual plan, while keeping the modifications to each counterfactual at a minimal level. The corrections are intuitive and admit closed-form expression. 3. a COunterfactual Plan under Ambiguity (COPA) framework to construct a counterfactual plan which explicitly takes the model uncertainty into consideration. It minimizes the weighted sum of validity, proximity, and diversity terms, and can be solved efficiently using gradient descents. Each of our above contributions is exposed in Section 2, 3 and 4, respectively. In Section 5, we conduct experiments on both synthetic and real-world datasets to demonstrate the efficiency of our corrections and of our COPA framework. All proofs can be found in the appendix. General setup. Consider a covariate space Rd and a linear binary classification setting. Each linear classifier can be parametrized by θ Rd with decision output Cθ(x) = 1 if θ x 0, and Published as a conference paper at ICLR 2022 0 otherwise, where 0 represents an unfavorable outcome. Note that we omit the bias term to avoid clutter, taking the bias term into account can be achieved by extending the dimension of x and θ by an extra dimension. A counterfactual plan is a set of J counterfactual explanations {xj}j=1,...,J, and we denote {xj} for short. When J = 1, we have a single counterfactual explanation problem, which is the subject of recent works (Ustun et al., 2019; Karimi et al., 2020a; Upadhyay et al., 2021). Next, we define the joint validity of a counterfactual plan. Definition 1.1 (Joint validity). A counterfactual plan {xj} is valid with respect to a realization θ if Cθ(xj) = 1 for all j = 1, . . . , J. Notations. We use Sd ++ (Sd +) to denote the space of symmetric positive (semi)definite matrices. For any A Rm m, the trace operator is Tr A = Pd i=1 Aii. For any integer J, [J] {1, . . . , J}. 2 VALIDITY BOUNDS OF COUNTERFACTUAL PLANS In this section, we propose a diagnostic tool to benchmark the validity of a pre-computed counterfactual plan {xj}. We model the random model parameters θ with a nominal distribution b P. Instead of making a strong assumption on a specific parametric form of b P such as Gaussian distribution, we only assume that b P is known only up to the second moment. More specifically, we assume that under b P, θ has a nominal mean vector bµ and nominal covariance matrix bΣ Sd ++. Definition 2.1 (Gelbrich distance). The Gelbrich distance between two pairs (µ1, Σ1) Rd Sd + and (µ2, Σ2) Rd Sd + is defined as G (µ1, Σ1), (µ2, Σ2) q µ1 µ2 2 2 + Tr Σ1 + Σ2 2 Σ The Gelbrich distance is closely related to the optimal transport distance between Gaussian distributions. Indeed, G (µ1, Σ1), (µ2, Σ2) is equal to the type-2 Wasserstein distance between two Gaussian distributions N(µ1, Σ1) and N(µ2, Σ2) (Gelbrich, 1990). It is thus trivial that G is a distance on Rd Sd +, and as a consequence, it is symmetric and G (µ1, Σ1), (µ2, Σ2) = 0 if and only if (µ1, Σ1) = (µ2, Σ2). Using the Gelbrich distance to design the moment ambiguity set for distributionally robust optimization leads to many desirable properties such as computational tractability and performance guarantees (Kuhn et al., 2019; Nguyen et al., 2021a). Motivated by this idea, we first construct the following uncertainty set U {(µ, Σ) Rd Sd + : G((µ, Σ), (bµ, bΣ)) ρ}, which is formally a ρ-neighborhood in the mean vector-covariance matrix space around the nominal moment (bµ, bΣ). The ambiguity set for the distributions of θ is obtained by lifting U to generate a family of probability measures that satisfy the moment conditions B {Q P : (µ, Σ) U such that Q (µ, Σ)} , where P is a set of all probability measures supported on Rd and Q (µ, Σ) indicates that Q has mean vector µ and covariance matrix Σ. The central question of this section is: If the distribution of θ belongs to B, what is the probability that a given plan {xj} is valid? To answer this question, we define the event set Θ({xj}) that contains all model parameter values that renders {xj} jointly valid. Under the definition of a linear model, Θ({xj}) is an intersection of J open hyperplanes of the form Θ({xj}) θ Rd : x j θ 0 j [J] . (1) We name Θ({xj}) the set of favorable parameters. The probability of validity for a plan under a measure Q is Q( θ Θ({xj})). We are interested in evaluating the lower and the upper bound probability that the plan {xj} is valid uniformly over all distributions Q B. This is equivalent to quantifying the following quantities inf Q B Q( θ Θ({xj})) and sup Q B Q( θ Θ({xj})). Published as a conference paper at ICLR 2022 In the remainder of this section, we discuss how to evaluate the bounds for these terms. Lower bound. We denote by Θ the interior of the set Θ, that is, Θ ({xj}) θ Rd : x j θ > 0 j . Note that all the inequalities defining Θ ({xj}) are strict inequalities. By definition, we have Θ ({xj}) Θ({xj}), and hence inf Q B Q( θ Θ ({xj})) inf Q B Q( θ Θ({xj})). Because Θ ({xj}) is an open set, we can leverage the generalized Chebyshev lower bound to evaluate the minimum quantity of Q( θ Θ ({xj})) over all distributions with a given mean and covariance matrix (Vandenberghe et al., 2007). Adding moment uncertainty via the set U is obtained by rejoining two minimization layers. The next theorem presents this result. Theorem 2.2 (Lower bound). For any ρ R+, bµ Rd and bΣ Sd +, let L be the optimal value of the following semidefinite program j [J] λj s. t. µ Rd, Σ Sd +, C Rd d, M Sd + λj R, zj Rd, Zj Sd j [J] x j zj 0, Zj zj z j λj Zj zj z j λj 0, M Σ µ µ 1 bµ 2 2bµ µ + Tr M + bΣ 2C ρ2. Then we have L = inf Q B Q( θ Θ ({xj})) inf Q B Q( θ Θ({xj})). Upper bound. Because Θ({xj}) is a closed set, we can leverage a duality result to evaluate the maximum quantity of Q( θ Θ({xj})) over all distributions with a given mean and covariance matrix (Isii, 1960). Adding moment uncertainty via the set U is obtained by invoking the support function of the moment set. This result is presented in the next theorem Theorem 2.3 (Upper bound). For any ρ R+, bµ Rd and bΣ Sd +, let U be the optimal value of the following semidefinite program inf z0 + γ(ρ2 bµ 2 2 Tr bΣ ) + q + Tr Q s. t. γ R+, z0 R, z Rd, Z Sd +, q R+, Q Sd +, λ RJ + γI Z γbΣ 1 2 γbΣ 1 2 Q 0, γI Z γbµ + z γbµ + z q 0, Z z z z0 1 0 1 2xj 1 2x j 0 Then we have sup Q B Q( θ Θ({xj})) U . Thanks to the choice of the Gelbrich distance G, both optimization problems in Theorems 2.2 and 2.3 are linear semidefinite programs, and they can be solved efficiently by standard, off-the-shelf solvers such as MOSEK to high dimensions (MOSEK Ap S, 2019). Other choices of distance (divergence) are also available: for example, one may opt for the Kullback-Leibler (KL) type divergence between Gaussian distribution to prescribe U as in Nguyen et al. (2020) and Taskesen et al. (2021). Unfortunately, the KL type divergence entails a log-determinant term, and the resulting optimization problems are no longer linear programs and are no longer solvable using MOSEK. Equipped with L and U , we have the bounds L Q({xj} is a valid plan) U Q B on the validity of the counterfactual plans {xj} under the distributional ambiguity set B. Complementary information. The previous results show that we can compute the lower bound L and upper bound U for the probability of validity by solving semidefinite programs. We now show that the two quantities L and U are complementary to each other in a specific sense. Proposition 2.4 (Complementary information). For any instance, either L = 0 or U = 1. More specifically, we have: (i) If bµ Θ({xj}), then U = 1, and (ii) If bµ Θ({xj}), then L = 0. Published as a conference paper at ICLR 2022 Because L and U are bounds for a probability quantity, they are only informative when they are different from 0 and 1. Proposition 2.4 asserts that the upper bound U is trivial when bµ Θ({xj}), while the lower bound L becomes trival if bµ Θ({xj}). Next, we leverage these insights to improve the validity of a given counterfactual plan. 3 COUNTERFACTUAL PLAN CORRECTIONS Given a counterfactual plan {xj}, it may happen that {xj} have low probability of being valid under random realizations of the future model parameter θ. The diagnostic tools proposed in Section 2 indicate that {xj} has low validity when the bounds are low, and we are here interested in correcting this plan such that the lower bounds L are increased. Indeed, increasing L guarantees higher confidence that the plan is valid, should the distribution of θ belongs to the ambiguity set. At this point, one may be tempted to optimize L directly with {xj} by first converting problem (2) into a maximization problem, and then jointly maximizing with {xj} being decision variables. Unfortunately, this approach entails bilinear terms x j zj in the constraints, and this approach is notoriously challenging to solve. We thus resort to heuristics for correction. Towards this end, the results from Proposition 2.4 suggest that there are two correction operations that we need to perform to improve the validity of the counterfactual plan: (i) When bµ Θ({xj}), then Proposition 2.4 suggests that we should modify the plan {xj} so that the resulting set of favorable parameters contains bµ. We term this type of correction as a Requirement correction, and we consider one specific Requirement correction in Section 3.1. (ii) When bµ Θ({xj}), we can also modify {xj} to as to increase the lower bound L . This type of correction is termed an Improvement correction because its goal is to increase the validity of the counterfactual plans. We consider the Mahalanobis Improvement correction in Section 3.2. We emphasize that the corrections of the plan {xj} are designed such that the modifications to each counterfactual xj should be minimal. This is achieved by two main criteria: the correction should modify as few counterfactuals as possible, and the modification to each counterfactual should also be as small as possible. 3.1 REQUIREMENT CORRECTION We propose a Requirement correction with the goal of obtaining a corrected plan {x j} from the given plan {xj} such that bµ lies inside (or strictly inside) the set Θ({x j}). A simple Requirement correction is to construct the corrected plan {x j} by j [J] : x j = xj if bµ xj ϵ, arg min{ x xj 2 : bµ x ϵ} if bµ xj < ϵ, for some ϵ 0. Using this rule, x j is the smallest modification of xj measured in the Euclidean distance such that x j is valid with ϵ-margin with respect to the expected future parameter bµ. The margin ϵ adds a layer of robustness: if ϵ > 0 then bµ lies in the interior of the set Θ({x j}), while if ϵ = 0 then bµ lies on the boundary of the set Θ({x j})). Moreover, it is easy to see that in the case bµ xj < ϵ, the resulting x j is the Euclidean projection of xj onto the hyperplane bµ xj = ϵ. The proposed Requirement correction admits thus the analytical form: j [J] : x j = xj min{0, bµ xj ϵ} 3.2 MAHALANOBIS IMPROVEMENT CORRECTION Given a plan {xj} such that bµ Θ({xj}) and an integer K between 1 and J, the Mahalanobis Improvement correction aims to modify K out of J plans to obtain the corrected plan {x j}. The goal of this correction is to increase the lower bound value L associated with the plan {x j}, while at the same time keeping the amount of modification as small as possible. To attain this goal, we first describe the geometric intuition behind the lower bound L in (2), and then leverage this intuition to generate the correction. Published as a conference paper at ICLR 2022 Geometric intuition. We first analyze the distribution of the random vector θ that attains the validity lower bound L . To simplify the exposition, we assume that λ > 0 and define λ 0 = 1 P j λ j. Following the same argument as in Vandenberghe et al. (2007, 2.2), the distribution of θ can be constructed as a mixture of J + 1 random vectors θj satisfying: j = 0, . . . , J : θ = θj with probability λ j, where for each j = 1, . . . , J, we have E[ θj] = z j /λ j and θ0 follows a properly chosen distribution. By the validity of z , we can verify that the location z j /λ j lies on the hyperplane {θ : x j θ = 0}. Thus, we can think of λ j as the marginal increase in the lower bound L if we slightly perturb xj so that the point z j /λ j lies inside the set of favorable parameters. This observation underlies the Mahalanobis correction which we describe next. Figure 2: Illustration with d = 2 and J = 3. Shaded area is Θ({xj}), dashed ellipsoid represents (θ bµ) bΣ 1(θ bµ) = 1, black dots are the locations of z j /λ j. Correction procedure. If we can adjust K out of J counterfactuals to improve the validity, then it is reasonable to modify the K counterfactuals associated with the K largest values of λ j, where λ j is the optimal value of the variable λj in problem (2). Without any loss of generality, assume that λ j have decreasing values, and in this case, our correction procedure will modify the counterfactuals xj for j = 1, . . . , K. Further, to correct each counterfactual, we find x j in a -neighborhood of xj such that the Mahalanobis distance from bµ to the hyperplane {θ : θ x j = 0} is maximized, where the Mahalanobis distance is computed with the nominal covariance matrix bΣ. This is equivalent to solving a max-min problem x j = arg max minθ:θ x=0 q (θ bµ) bΣ 1(θ bµ) s. t. x Rd, x xj 2 . (3) The next result indicates that x j can be found by solving a conic optimization problem. Theorem 3.1 (Mahalanobis Improvement correction). The Mahalanobis correction of xj is x j = v /t , where (v , t ) is the optimal solution of the following conic optimization problem min n v bΣv : v Rd, t R+, v txj 2 t, v bµ = 1 o . We have specifically modified x j in (3) with respect to the nominal mean vector and covariance matrix (bµ, bΣ) of the random vector θ. Alternatively, we can also use (µ , Σ ), where (µ , Σ ) is the optimal solution in the variable (µ, Σ) of (2) to form the optimization problem. Theorem 3.1 holds with the corresponding parameters (µ , Σ ). Similarly, equation (4) recovers the Euclidean projection if we use an identity matrix for weighting. The conic optimization problem in Theorem 3.1 can be solved using standard off-the-shelf solvers such as Mosek (MOSEK Ap S, 2019). 4 COUNTERFACTUAL PLAN CONSTRUCTION UNDER AMBIGUITY We propose in this section the COunterfactual Plan under Ambiguity (COPA) framework to devise a counterfactual plan that has high validity under random future model parameters. Given an input instance x0, COPA builds a plan {xj} of J 1 counterfactuals that balances competing objectives including proximity, diversity, and validity. We next describe each cost component. Proximity. It is reasonable to ask that each counterfactual xj should be close to the input x0 so that xj is actionable. We suppose that the distance between x0 and xj can be measured using a function c : Rd Rd R. In general, the cost c is used to capture the ease of adopting the changes for a specific variable (e.g., one could barely change their height or race). The proximity of a plan {xj} is Published as a conference paper at ICLR 2022 simply the average distance from x0 to each counterfactual in the plan. More specifically, we have Proximity({xj}, x0) 1 j=1 c(xj, x0). (4) Diversity. We measure the diversity of a plan using the determinant point process (Kulesza, 2012) similar to the approach in Mothilal et al. (2020). The diversity is given by: Diversity({xj}) det(K), where Ki,j = (1 + c(xi, xj)) 1 1 i, j J. (5) Then, a plan with a larger value Diversity({xj}) is more diverse. Validity. Given the moment information (bµ, bΣ), one potential approach to compute the validity of a plan {xj} is to compute the value L in (2). However, for large covariate dimension d or high number of counterfactual J, the semidefinite program (2) becomes time-consuming to solve and is not practical. This entails us to derive the a computationally efficient proxy for the validity of {xj}. Towards this goal, we use the volume of the maximum-volume ellipsoid with center bµ and covariance bΣ that can be inscribed in Θ({xj}). Following Boyd & Vandenberghe (2004, 8.4.2), an ellipsoid with center bµ, covariance matrix bΣ and radius r can be written in the parametric form as E(bµ,bΣ)(r) = {bΣ 1 2 u + bµ : u 2 r}. The validity of the plan {xj} is thus defined as Validity({xj}, bµ, bΣ) max{r : r 0, E(bµ,bΣ)(r) Θ({xj})}. The next result asserts that the above validity measure can be re-expressed in closed form, which justifies its computational efficiency. Lemma 4.1 (Validity value). If bµ Θ({xj}), then Validity({xj}, bµ, bΣ) = minj bµ xj/ bΣ 1 2 xj 2. Lemma 4.1 and the analysis in Proposition 2.4 also suggest that the counterfactual plan should satisfy bµ Θ({xj}) so as to improve the validity. Similar to Section 3.1, we will impose the constraints that bµ xj ϵ j for some margin ϵ 0 for validity purposes. COPA framework. Our COPA framework finds the counterfactual plan that minimizes the weighted sum of the proximity, the diversity and the validity measure. More precisely, the COPA counterfactual plan is the minimizer of min x1,...,x J Proximity({xj}, x0) λ1Validity({xj}, bµ, bΣ) λ2Diversity({xj}) s. t. bµ xj ϵ j (6) for some non-negative parameters λ1 and λ2. The COPA problem (6) can be solved efficiently under mild conditions using a projected (sub)gradient descent algorithm. A projected gradient descent algorithm can be used to solve the COPA problem (6). The gradient of the objective function of (6) can be computed using auto-differentiation. We now discuss further the projection operator. Let X {x Rd : bµ x ϵ}, then the feasible set of the COPA problem (6) is a product space X J. The projection operator Proj X J on the product set X J is decomposable into simpler projections onto individual set X as Proj X J({x j}) = {Proj X (x 1), . . . , Proj X (x J)}, where each individual projection is Proj X (x j) = arg min{ x x j 2 : bµ x ϵ} = x j min{0, bµ x j ϵ}bµ/ bµ 2 2. Note that the second equality above follows from the analytical formula for the Euclidean projection onto a half-space, which was previously used in Section 3.1. 5 NUMERICAL EXPERIMENTS In this section, we evaluate the correctness of our validity bounds and the performance of our corrections and our COPA framework on both synthetic and real-world datasets. Our baseline for comparison is the counterfactual plan constructed from the state-of-the-art Di CE framework (Mothilal et al., 2020). Throughout the experiments, we set the number of counterfactuals to J = 5. For Di CE, we use the default parameters recommended in the Di CE source code. The Mahalanobis correction will use the counterfactual plan obtained by the Di CE method with K = 3 and the perturbation limit is 0.1. In our COPA framework, we use Adam optimizer to implement Projected Gradient Descent and ℓ2-distance to compute perturbation cost between inputs. Published as a conference paper at ICLR 2022 (a) ˆµ / Θ({xj}) (b) ˆµ Θ({xj}) Figure 3: The impact of the Gelbrich radius on the validity of a counterfactual plan. The vertical axis of each green point represents the empirical validity of the plan with respect to which θ N(µg, Σg) and the horizontal axis is the Gelbrich distance G((bµ, bΣ), (µg, Σg)). (a) Mean shift (b) Covariance shift (c) Mean & Covariance shift Figure 4: The impact of shift magnitudes on the validity of the plans obtained by three algorithms. 5.1 SYNTHETIC DATASET We first generate 1000 samples with two-dimensional features from two Gaussian distributions N(µ0, Σ0) and N(µ1, Σ1) to create a synthetic dataset. Each instance is labelled as 0 or 1 corresponding to the distribution that generated it. For the Gaussian distributions, we use similar parameters as in Upadhyay et al. (2021), where µ0 = [ 2, 2] , µ1 = [2, 2] , Σ0 = Σ1 = 0.5I with I being the identity matrix. This dataset is then used to train a logistic classifier with the present parameter θ0. This classifier Cθ0 is fixed for the experiments that follow. The impact of Gelbrich radius on the validity. Given a counterfactual plan generated by Di CE on the classifier Cθ0, we consider two scenarios: bµ Θ({xj}) and bµ / Θ({xj}). We choose bµ = θ0 for the case bµ Θ({xj}) and bµ = θ0, otherwise. We also set bΣ = 0.5I. We then compute the lower and upper validity bound of this plan with respect to different Gelbrich bounds ρ [0, 1]. To evaluate the empirical validity of this plan, we simulate 1000 futures for θ. For each future, we generate µg and Σg randomly so that G((bµ, bΣ), (µg, Σg)) 1, and then we sample 106 values of θ Ng(µg, Σg). The empirical validity of the plan for each future is the fraction of parameter samples from the future that the prescribed plan is valid. We plot the 1000 empirical validity of the plan in Figure 3. This result is consistent with our guarantees that the validity is between the two bounds. We also observe that increasing ρ loosens the validity bounds. The impact of degree of distribution shift on validity of a plan. We explore the case bµ Θ({xj}), where bµ = θ0 and bΣ = 0.5I, to assess the impact of distribution shift to three algorithms Di CE, Mahalanobis Crr, and COPA. In this experiment, we run our COPA framework with λ1 = 2.0, λ2 = 200.0. To assess the performance of three algorithms, we parameterize the ground truth distribution of the future parameters θ N(µg, Σg) as follows: µg = bµ + α[0, 1, 0] , Σg = (1 + β)I. Here, we simulate three types of distributional shift of the parameters θ: (1) mean shift (α [0, 1], β = 0), (2) covariance shift (α = 0, β [0, 3]), and (3) mean and covariance shift ((α, β) [0, 1] [0, 3]). For each shift s type, we generate 100 counterfactual plans corresponding to 100 original inputs x0 and compute the empirical validity as previously described. The average and confidence range of the empirical validity are plotted in Figure 4. This result shows the tendency of decreasing validity measure of all algorithms when increasing the Gelbrich distance between estimate and ground truth distribution. However, COPA shows stability and robustness for Published as a conference paper at ICLR 2022 all shift types. The validity of Di CE deteriorates when the ground truth distribution is far from θ0. Meanwhile, Mahalanobis Crr increases the robustness of the plans obtained by Di CE significantly. 5.2 REAL-WORLD DATASETS In this experiment, we evaluate the robustness of the counterfactual plans obtained by three frameworks on the real datasets. We use three real-world datasets: German Credit (Dua & Graff, 2017; Groemping, 2019), Small Bussiness Administration (SBA) (Li et al., 2018), and Student performance (Cortez & Silva, 2008). Each dataset contains two sets of data (the present data - D1 and the shifted data D2). The shifted dataset D2 could capture the correction shift (German credit), the temporal shift (SBA), or the geospatial shift (Student). More details for each dataset are provided in Appendix. Experimental settings. For each present dataset D1, we train a logistic classifier Cθ0 with parameter θ0 on 80% of instances of the dataset and fix this classifier to construct counterfactual plans in whole experiment. We generate 100 counterfactual plans for 100 original inputs and report the average values of our evaluation metrics. To estimate bµ and bΣ, we train 1000 different classifiers from the present dataset D1 (each is trained on a random set containing 50% instances of D1), then use the empirical mean and covariance matrix of the parameter. We set Gelbrich radius ρ = 0.01. Metrics. To compute the empirical validity in the shift dataset D2, we sample 50% instances of D2 1000 times to train 1000 different logistic classifiers. We then report the empirical validity of a plan as the fraction of the classifiers with respect to which the plan is valid. We also use the lower validity bound as a metric for evaluating the robustness of a plan. We use the formula in (4) and (5) to measure the proximity and diversity of a counterfactual plan. Table 1: Performance of competing algorithms on real world datasets. For Proximity, lower is better. For Diversity, L and Validity, higher is better. Bold indicate the best performance for each dataset. Dataset Method Proximity Diversity L Empirical Validity Correction Di CE 0.986 0.324 0.072 0.050 0.649 0.073 0.996 0.008 Mahalanobis Crr 1.002 0.323 0.064 0.047 0.750 0.064 0.999 0.003 COPA (λ1 = 0.2; λ2 = 2.0) 0.916 0.178 0.017 0.058 0.944 0.168 0.997 0.018 COPA (λ1 = 0.5; λ2 = 5.0) 1.154 0.253 0.114 0.101 0.946 0.040 1.000 0.000 COPA (λ1 = 1.0; λ2 = 10.0) 1.351 0.166 0.225 0.045 0.911 0.022 1.000 0.000 Temporal Di CE 2.037 0.470 0.089 0.057 0.946 0.014 0.801 0.061 Mahalanobis Crr 2.014 0.473 0.085 0.055 0.966 0.007 0.945 0.062 COPA (λ1 = 0.2; λ2 = 2.0) 1.831 0.139 0.253 0.026 0.994 0.000 1.000 0.000 COPA (λ1 = 0.5; λ2 = 5.0) 1.966 0.112 0.363 0.012 0.995 0.000 1.000 0.000 COPA (λ1 = 1.0; λ2 = 10.0) 2.010 0.124 0.380 0.006 0.995 0.000 1.000 0.000 Geospatial Di CE 1.486 0.325 0.136 0.044 0.549 0.307 0.408 0.363 Mahalanobis Crr 1.497 0.325 0.126 0.044 0.864 0.117 0.757 0.284 COPA (λ1 = 0.2; λ2 = 2.0) 1.779 0.352 0.052 0.047 0.998 0.000 1.000 0.000 COPA (λ1 = 0.5; λ2 = 5.0) 1.882 0.353 0.089 0.032 0.998 0.000 1.000 0.000 COPA (λ1 = 1.0; λ2 = 10.0) 1.926 0.349 0.109 0.024 0.997 0.000 1.000 0.000 Results. The results in Table 1 show that our COPA framework achieves the highest empirical validity, L , and diversity (especially when increasing λ2) in all evaluated datasets. Comparing Di CE and Mahalanobis correction, we can observe that the trade-off of proximity and diversity of Mahalanobis correction is relatively small as compared to its improvement in terms of validity. 6 CONCLUSION This paper studies the problem of generating counterfactual plans under the distributional shift of the classifier s parameters given the fact that the classification model is usually updated upon the arrival of new data. We propose an uncertainty quantification tool to compute the bounds of the probability of validity for a given counterfactual plan, subject to uncertain model parameters. Further, we introduce a correction tool to increase the validity of the given plan. We also propose a COPA framework to construct a counterfactual plan by taking the model uncertainty into consideration. The experiments demonstrate the efficiency of our methods on both synthetic and real-world datasets. Further extensions, notably to incorporate nonlinearities, are presented in the appendix. Published as a conference paper at ICLR 2022 Ifeoma Ajunwa, Sorelle Friedler, Carlos E Scheidegger, and Suresh Venkatasubramanian. Hiring by algorithm: Predicting and preventing disparate impact. Available at SSRN, 2016. Dimitris Bertsimas and Ioana Popescu. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15(3):780 804, 2005. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. Paulo Cortez and Alice Silva. Using data mining to predict secondary school student performance. Proceedings of 5th FUture BUsiness TEChnology Conference, 2008. Susanne Dandl, Christoph Molnar, Martin Binder, and Bernd Bischl. Multi-objective counterfactual explanations. In International Conference on Parallel Problem Solving from Nature, pp. 448 469. Springer, 2020. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive. ics.uci.edu/ml. M. Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185 203, 1990. U Groemping. South German credit data: Correcting a widely used data set. Reports in Mathematics, Physics and Chemistry, Department II, Beuth University of Applied Sciences Berlin, 2019. Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, Gang Wang, and Xinyu Xing. Lemna: Explaining deep learning based security applications. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 364 379, 2018. Keiiti Isii. The extrema of probability determined by generalized moments (i) bounded random variables. Annals of the Institute of Statistical Mathematics, 12(2):119 134, 1960. Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics, pp. 895 905. PMLR, 2020a. Amir-Hossein Karimi, Gilles Barthe, Bernhard Sch olkopf, and Isabel Valera. A survey of algorithmic recourse: Definitions, formulations, solutions, and prospects. ar Xiv preprint ar Xiv:2010.04050, 2020b. Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. INFORMS Tut ORials in Operations Research, pp. 130 169, 2019. Alex Kulesza. Determinantal point processes for machine learning. Foundations and Trends R in Machine Learning, 5(2-3):123 286, 2012. Min Li, Amy Mickel, and Stanley Taylor. Should this loan be approved or denied? : A large dataset with class assignment guidelines. Journal of Statistics Education, 26(1):55 66, 2018. Luigi Malag o, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of Gaussian densities. Information Geometry, 1(2):137 179, 2018. Albert W. Marshall and Ingram Olkin. Multivariate Chebyshev inequalities. The Annals of Mathematical Statistics, 31(4):1001 1014, 1960. Tim Miller. Contrastive explanation: A structural-model approach. ar Xiv preprint ar Xiv:1811.03163, 2018. MOSEK Ap S. MOSEK Optimizer API for Python 9.2.10, 2019. URL https://docs.mosek. com/9.2/pythonapi/index.html. Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 607 617, 2020. Published as a conference paper at ICLR 2022 Viet Anh Nguyen, Nian Si, and Jose Blanchet. Robust Bayesian classification using an optimistic score ratio. In Proceedings of the 37th International Conference on Machine Learning, 2020. Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, Damir Filipovi c, and Daniel Kuhn. Meancovariance robust risk measurement. ar Xiv preprint ar Xiv:2112.09959, 2021a. Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Bridging Bayesian and minimax mean square error estimation via Wasserstein distributionally robust optimization. Mathematics of Operations Research, 2021b. Martin Pawelczyk, Klaus Broelemann, and Gjergji Kasneci. On counterfactual explanations under predictive multiplicity. In Conference on Uncertainty in Artificial Intelligence, pp. 809 818. PMLR, 2020. Imre P olik and Tamas Terlaky. A survey of the S-lemma. SIAM Review, 49(3):371 418, 2007. doi: 10.1137/S003614450444614X. Kaivalya Rawal and Himabindu Lakkaraju. Beyond individualized recourse: Interpretable and interactive summaries of actionable recourses. ar Xiv preprint ar Xiv:2009.07165, 2020. Kaivalya Rawal, Ece Kamar, and Himabindu Lakkaraju. Can i still trust you?: Understanding the impact of distribution shifts on algorithmic recourses. ar Xiv preprint ar Xiv:2012.11788, 2020. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135 1144, 2016. Chris Russell. Efficient search for diverse coherent explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 20 28, 2019. Naeem Siddiqi. Credit risk scorecards: Developing and implementing intelligent credit scoring. John Wiley & Sons, 2012. Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171 176, 1958. Bahar Taskesen, Man-Chung Yue, Jose Blanchet, Daniel Kuhn, and Viet Anh Nguyen. Sequential domain adaptation by synthesizing distributionally robust experts. In Proceedings of the 38th International Conference on Machine Learning, 2021. Sohini Upadhyay, Shalmali Joshi, and Himabindu Lakkaraju. Towards robust and reliable algorithmic recourse. In Advances in Neural Information Processing Systems 35, 2021. Berk Ustun, Alexander Spangher, and Yang Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10 19, 2019. Lieven Vandenberghe, Stephen Boyd, and Katherine Comanor. Generalized Chebyshev bounds via semidefinite programming. SIAM Review, 49(1):52 64, 2007. Suresh Venkatasubramanian and Mark Alfano. The philosophical basis of algorithmic recourse. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 284 293, 2020. Sandra Wachter, Brent Daniel Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 2017. Austin Waters and Risto Miikkulainen. Grade: Machine learning support for graduate admissions. Ai Magazine, 35(1):64 64, 2014. Xingyu Zhao, Wei Huang, Xiaowei Huang, Valentin Robu, and David Flynn. Baylime: Bayesian local interpretable model-agnostic explanations. ar Xiv preprint ar Xiv:2012.03058, 2020. Published as a conference paper at ICLR 2022 A.1 PROOFS OF SECTION 2 Proof of Theorem 2.2. For any (µ, Σ) Rd Sd +, let P(µ, Σ) {Q : EQ[ θ] = µ, EQ[ θ θ ] = µµ + Σ} denote the set of probability measures under which the random vector θ has mean µ and covariance matrix Σ. The infimum probability can be decomposed as inf Q B Q( θ Θ ({xj})) = inf (µ,Σ) U inf Q P(µ,Σ) Q( θ Θ ({xj})) inf (µ,Σ) U inf 1 P j [J] λj s. t. λj R, zj Rd, Zj Sd j [J] x j zj 0, Zj zj z j λj Zj zj z j λj Σ + µµ µ µ 1 where the second equality follows from Vandenberghe et al. (2007, 2). By Malag o et al. (2018, Proposition 2), we have G2((µ, Σ), (bµ, bΣ)) = min C Rd d µ bµ 2 + Tr Σ + bΣ 2C s. t. Σ C C bΣ min C Rd d bµ 2 2bµ µ + Tr Σ + µµ + bΣ 2C s. t. Σ C C bΣ Hence, by combing two infimum operators, we have inf Q B Q( θ Θ ({xj})) = j [J] λj s. t. µ Rd, Σ Sd +, C Rd d λj R, zj Rd, Zj Sd j [J] x j zj 0, Zj zj z j λj Zj zj z j λj Σ + µµ µ µ 1 bµ 2 2bµ µ + Tr Σ + µµ + bΣ 2C ρ2, Σ C C bΣ In the last step, we add an auxiliary variable M Sd + with the constraint M = Σ + µµ . Note that this constraint can be replaced by M Σ + µµ without affecting the optimal value of the optimization problem. Using the Schur complement, this constraint is equivalent to M Σ µ µ 1 This completes the proof. Proof of Theorem 2.3. Let 1Θ(θ) be the indicator function of the set Θ({xj}), that is, 1Θ(θ) = 1 if θ Θ({xj}), 0 otherwise. By defining the loss function ℓ(θ) = 1Θ(θ) and let Z be the convex feasible set defined by Z z0 R, z Rd, Z Sd : z0 + 2z θ + Z, θθ 1Θ(θ) θ Rd . Published as a conference paper at ICLR 2022 Notice that Z is a closed and convex set because it is an intersection of uncountably many closed and convex sets. Denote the following set V of mean - second moment matrices that are induced by U by V {(µ, M) Rd Sd + : (µ, Σ) U such that (µ, M) = (µ, Σ + µµ )}. The support function δ V of the set V is defined as δ V(z, Z) = sup{z µ + Tr ZM : (µ, M) V}. Using these notations, we now have sup Q B Q(θ Θ({xj})) = sup (µ,Σ) U sup Q P(µ,Σ) EQ[1Θ( θ)] (7a) sup (µ,Σ) U inf (z0,z,Z) Z z0 + 2µ z + Tr (Σ + µµ )Z (7b) = sup (µ,M) V inf (z0,z,Z) Z z0 + 2µ z + Tr MZ = inf (z0,z,Z) Z sup (µ,M) V z0 + 2µ z + Tr MZ (7c) = inf (z0,z,Z) Z z0 + δ V(2z, Z), where equality (7a) is from the two layer decomposition of the ambiguity set B, and inequality (7b) is from the Isii s duality result Isii (1960). Equality (7c) follows from the Sion s minimax theorem Sion (1958) which holds because the objective function is linear in each variable and because V is compact by the compactness of U (Nguyen et al., 2021b, Lemma A.6). We thus have sup Q B Q(θ Θ({xj})) = inf z0 + γ(ρ2 bµ 2 2 Tr bΣ ) + q + Tr Q s. t. γ R+, z0 R, z Rd, Z Sd, q R+, Q Sd + γI Z γbΣ 1 2 γbΣ 1 2 Q 0, γI Z γbµ + z γbµ + z q z0 + 2z θ + Z, θθ 1Θ(θ) θ Rd, where the equality follows by substituting the support function of V in Kuhn et al. (2019, Lemma 2). Consider now the last constraint of the above optimization problem, it is easy to see that it is equivalent to z0 + 2z θ + Z, θθ 0 θ Rd, z0 + 2z θ + Z, θθ 1 θ Θ({xj}). The first semi-infinite constraint is equivalent to the semidefinite constraints Z 0, Z z z z0 A sufficient condition for the second semi-infinite constraint is that λ RJ + : Z z z z0 1 0 1 2xj 1 2x j 0 which holds thanks to the S-lemma P olik & Terlaky (2007). Adding these above constraints into the optimization problem leads to the desired upper bound. This completes the proof. The proof of Proposition 2.4 relies on the following result on the multivariate Chebyshev inequalities, which can be found in Marshall & Olkin (1960) and Bertsimas & Popescu (2005). Theorem A.1 (Multivariate Chebyshev inequality). Let S be a convex set, then sup Q (µ,Σ) Q( θ S) = 1 1 + κ, κ = inf θ S (θ µ) Σ 1(θ µ). We are now ready to prove Proposition 2.4. Published as a conference paper at ICLR 2022 Proof of Proposition 2.4. If bµ Θ({xj}), then it is clear that inf θ Θ({xj}) (θ bµ) bΣ 1(θ bµ) = 0, thus we have U sup Q (bµ,bΣ) Q( θ Θ({xj})) = 1 by Theorem A.1. This leads to U = 1. Consider the case when bµ Θ({xj}). By the hyperplane separation theorem, there exists a vector x Rd such that x θ 0 for all θ Θ({xj}) and x bµ < 0. Let T {θ : x θ 0}, then it is trivial that Θ({xj}) T and that T is a convex set. We now have inf Q B Q( θ Θ({xj})) = 1 sup Q B Q( θ Θ({xj})) 1 sup Q B Q( θ T) = 1 1 = 0, where the penultimate equality follows from Theorem A.1. This leads to L = 0. A.2 PROOFS OF SECTION 3 Proof of Theorem 3.1. Notice that the optimal solution in x should satisfy x bµ > 0. Fix any value of x = 0. Consider first the inner minimization problem of (3), and associate with the equality constraint a Lagrangian dual variable ζ R, we have min θ:θ x=0 (θ bµ) bΣ 1(θ bµ) = min θ Rd max ζ R (θ bµ) bΣ 1(θ bµ) + 2ζθ x = max ζ R min θ Rd (θ bµ) bΣ 1(θ bµ) + 2ζθ x = max ζ R ζ2x bΣx + 2ζbµ x where the second equality follows from convex duality result. The third equality follows from the fact that for every value of ζ, the optimal solution in the variable θ is θ (ζ) = bµ ζ bΣx. Moreover, the last equality follows from the optimality condition in ζ which gives ζ = bµ x/(x bΣx). Because the optimal solution in x should satisfy x bµ > 0, problem (3) is hence equivalent to x bΣx s. t. x xk . Adding now two auxiliary variables t R+ and v Rd with the constraints: 1 x bµ = t, v = tx, the claim in the statement of the theorem now follows by a simple substitution to get v bΣv s. t. v Rd, t R+, v txk 2 t, v bµ = 1. Swapping the maximum operator to a minimum operator completes the proof. Published as a conference paper at ICLR 2022 A.3 PROOFS OF SECTION 4 Proof of Lemma 4.1. From the definition of the set Θ({xj}), we have max{r : Er Θ({xj})} = ( max r s. t. sup u 2 r (bΣ 1 2 u + bµ) xj 0 j ( max r s. t. bµ xj + sup u 2 r x j bΣ 1 2 u 0 j = max r s. t. bµ xj + r bΣ 1 2 xj 2 0 j, where the last equality follows from the dual norm property. The proof now follows by finding the maximum value of r so that the problem is feasible. B EXPERIMENTS B.1 EXPERIMENTAL DETAIL Real-world datasets Here, we provide more detail about the three real-world datasets we used. Source code can be found at https://github.com/ngocbh/COPA. i German Credit (Dua & Graff, 2017). The dataset contains the information (e.g. age, gender, financial status,...) of 1000 customers who took a loan from a bank. The classification task is to determine the risk (good or bad) of an individual. There is another version of this dataset regarding corrections of coding error (Groemping, 2019). We use the corrected version of this dataset as shifted data to capture the correction shift. The features we used in this dataset include duration , amount , personal status sex , and age . ii Small Bussiness Administration (SBA) (Li et al., 2018). This data includes 2,102 observations with historical data of small business loan approvals from 1987 to 2014. We divide this dataset into two datasets (one is instances from 1989 - 2006 and one is instances from 2006 - 2014) to capture temporal shift. We use the following features: selected, Term , No Emp , Create Job , Retained Job , Urban Rural , Chg Off Prin Gr , Gr Appv , SBA Appv , New , Real Estate , Portion , Recession . iii Student performance (Cortez & Silva, 2008). This data includes the performance records of 649 students in two schools: Gabriel Pereira (GP) and Mousinho da Silveira (MS). The classification task is to determine if their final score is above average or not. We split this dataset into two sets in two schools to capture geospatial shift. The features we used are: age , Medu , Fedu , studytime , famsup , higher , internet , romantic , freetime , goout , health , absences , G1 , G2 . Classifier Throughout this paper, we use a Logistic Regression for a linear classifier and a threelayer MLP with 20, 50, 20 nodes and Re LU activation in each consecutive layer as the nonlinear classifier. We use one-hot encoding for categorical features in the datasets to convert it to a vector of [0, 1]. We use min-max normalization to scale the numerical features to [0, 1]. We report the performance of the classifiers in three real-world datasets in Table 2 B.2 ADDITIONAL EXPERIMENTS The impact of degree of distribution shift on validity of a plan. We provide an additional experiment in different covariance shift Σg = (1 + β)A, A 0. In this experiment, we choose A as: 1 1 0 1 1 1 0 1 1 The matrix A introduces both positive and negative correlations between the classifier s parameters. Other settings are set the same as the experiment in Section 5.1. Published as a conference paper at ICLR 2022 Table 2: Performance of the underlying classifiers. Logistic Regression Neural Network Accuracy AUC Accuracy AUC German 0.71 0.01 0.64 0.02 0.68 0.02 0.62 0.02 Shifted German 0.71 0.01 0.64 0.02 0.68 0.02 0.62 0.02 SBA 0.71 0.02 0.86 0.02 0.96 0.02 0.99 0.01 Shifted SBA 0.87 0.01 0.90 0.02 0.97 0.01 0.98 0.01 Student 0.83 0.02 0.91 0.02 0.88 0.02 0.95 0.01 Shifted Student 0.87 0.03 0.93 0.03 0.90 0.03 0.96 0.01 (a) Covariance shift (b) Mean & Covariance shift Figure 5: The impact of shift magnitudes on the validity of the plans obtained by three algorithms. Mahalanobis correction on real-world datasets. In this experiment, we evaluate the Mahalanobis correction on different number of corrections K and different perturbation limit . We set ρ = 0.01, ϵ = 0.1, K {0, . . . , J}, J = 5, [0.05, 0.35]. (bµ, bΣ) is estimated using similar manner as in Section 5.2 of the main paper. The results in shown in Figure 6. Counterfactual explanations for real-world datasets. To illustrate the use case of the counterfactual explanations, we provide some examples on the German dataset with J = 3 (Table 3) in which we consider the personal status and sex feature as immutable. Here, we can observe that three algorithms could provide diverse sets of counterfactuals that the users may prefer. However, by providing better empirical validity, the plans generated by Mahalanobis Crr and COPA are more robust with distribution shift than Di CE (generated without considering the shift). C EXTENSION TO NONLINEAR CLASSIFIERS In the main paper, our analysis is based on the linearity in both features and model parameters. We now discuss two extensions of our COPA framework to the nonlinear settings. C.1 NONLINEARITY IN INPUT FEATURES This section extends to any linear classifier Cθ(x) = 1 if θ φ(x) 0, and 0 otherwise, where φ : X Rd is a (possibly nonlinear) feature mapping that maps input features to a latent representation in a covariate space Rd. Note that our bounds in Section 2 still hold in latent space Rd: for a concrete example, Theorem 2.2 holds with xj being replaced by φ(xj). Published as a conference paper at ICLR 2022 Figure 6: Evaluation of Mahalanobis correction on the real-world datasets. We fix = 0.1, evaluate the effect of the number of correction on the lower validity bound, diversity, and proximity (left column). We fix K = 3, evaluate the effect of the perturbation limit on the lower validity bound, diversity, and proximity (right column). The COPA framework is also extendable to incorporate the feature map φ. Assuming that φ is differentiable, the COPA framework solves the following optimization problem: min x1,...,x J Proximity({xj}, x0) + λ1Validity({φ(xj)}, bµ, bΣ) λ2Diversity({xj}) s. t. bµ xj ϵ j. (8) The proximity and diversity are measured in the input space and the validity term is now measured in latent space instead. This optimization problem can be solved efficiently by a projected gradient descent algorithm similar to Section 4. C.2 NONLINEARITY IN MODEL S PARAMETERS Similar to the prior works (Ustun et al., 2019; Rawal & Lakkaraju, 2020; Upadhyay et al., 2021), our work can adapt to nonlinear classifiers Cnl using a local surrogate models such as LIME (Ribeiro et al., 2016). LIME (Ribeiro et al., 2016) is a popular technique for explaining predictions of blackbox machine learning models. The main idea of LIME is to train a local surrogate model Cx0 θ on perturbed samples around a given input instance x0 to approximate the local decision boundary of the black-box models. We thus model the uncertainty of parameters θ in the surrogate model Cx0 θ for x0 instead of the parameters of Cnl. For the experiment, we first generate a local linear model Cx0 θ using LIME method with 5000 perturbed samples. We then choose (bµ, bΣ) = (θ, 0.05I), where I is identity matrix, to model the Published as a conference paper at ICLR 2022 Duration Credit amount Personal status Age L Empirical Validity Instance 30.0 4249.0 A94 28.0 - - Di CE 49.4 596.4 - 28.0 0.00 0.005 72.0 4330.2 - 28.0 59.7 13776.4 - 28.0 Mahalanobis Crr 42.5 69.8 - 30.0 0.15 0.802 40.8 1153.9 - 30.0 27.4 10047.3 - 30.0 COPA 4.0 18424.0 - 28.0 0.11 0.797 72.0 9410.3 - 28.0 40.3 250.0 - 28.0 Instance 42.0 7174.0 A92 30.0 - - Di CE 11.8 250.0 - 30.0 0.38 0.88 4.0 7167.4 - 30.0 13.4 13386.8 - 30.0 Mahalanobis Crr 7.9 523.7 - 33.0 0.61 0.968 3.6 5500.1 - 32.0 9.1 12234.5 - 32.0 COPA 4.0 3884.7 - 30.0 0.88 1.000 16.3 3280.8 - 30.0 15.5 250.0 - 30.0 Instance 24.0 4526.0 A93 74.0 - - Di CE 72.0 2165.7 - 74.0 0.00 0.080 72.0 9907.4 - 74.0 72.0 18424.0 - 74.0 Mahalanobis Crr 62.1 1766.4 - 75.0 0.01 0.614 55.6 8881.0 - 75.0 48.8 16680.9 - 75.0 COPA 4.0 250.0 - 74.0 0.59 0.997 44.8 3070.5 - 74.0 4.0 18424.0 - 74.0 Table 3: Counterfactual examples on German dataset. distributional uncertainty of the parameters. Similar to Section 5.2, we set Gelbrich radius ρ is to 0.01, J = 5, K = 3. Table 4: Performance of competing algorithms on nonlinear classifiers. The current validity is the validity of counterfactual plan with respect to the current nonlinear classifier Cnl (i.e., the fraction of instances that the generated counterfactual plan is feasible). Dataset Method Proximity Diversity L Empirical Validity Current Validity Correction Di CE 0.515 0.204 0.043 0.037 0.005 0.041 0.414 0.238 0.990 Mahalanobis Crr 0.595 0.210 0.035 0.035 0.021 0.058 0.409 0.313 0.670 COPA (λ1 = 0.1; λ2 = 1.0) 0.219 0.183 0.001 0.011 0.065 0.088 0.556 0.331 0.560 COPA (λ1 = 0.1; λ2 = 2.0) 0.432 0.403 0.100 0.116 0.049 0.093 0.301 0.341 0.270 COPA (λ1 = 0.2; λ2 = 2.0) 0.673 0.314 0.162 0.084 0.038 0.097 0.125 0.186 0.040 Temporal Di CE 1.573 0.451 0.107 0.071 0.637 0.350 0.852 0.270 1.000 Mahalanobis Crr 1.567 0.449 0.099 0.070 0.868 0.118 0.987 0.076 1.000 COPA (λ1 = 0.1; λ2 = 1.0) 1.388 0.540 0.002 0.008 0.981 0.014 1.000 0.000 1.000 COPA (λ1 = 0.1; λ2 = 2.0) 1.534 0.408 0.247 0.043 0.976 0.012 1.000 0.000 1.000 COPA (λ1 = 0.2; λ2 = 2.0) 1.447 0.340 0.118 0.072 0.990 0.004 1.000 0.000 1.000 Geospatial Di CE 1.576 0.349 0.175 0.070 0.022 0.046 0.328 0.303 1.000 Mahalanobis Crr 1.594 0.349 0.169 0.071 0.113 0.084 0.689 0.280 1.000 COPA (λ1 = 0.1; λ2 = 1.0) 1.342 0.367 0.000 0.000 0.011 0.007 0.384 0.310 0.710 COPA (λ1 = 0.1; λ2 = 2.0) 1.552 0.292 0.243 0.039 0.010 0.024 0.168 0.210 0.750 COPA (λ1 = 0.2; λ2 = 2.0) 1.637 0.284 0.287 0.017 0.164 0.066 0.679 0.274 1.000 We report the performance of three algorithms on the MLP classifier in the real-world datasets in Table 4. The result is promising since the proposed COPA can increase the empirical validity Published as a conference paper at ICLR 2022 significantly. However, the infidelity of LIME could lead to invalid counterfactual explanations, represented by a lower current validity value. The low current validity is also observed in the literature, see Upadhyay et al. (2021). For further investigation, one can use another local surrogate model that provides a better approximation of the decision boundary (e.g., Bay LIME (Zhao et al., 2020)). Another direction is to use a mixture linear regression model to approximate the decision boundary as in Guo et al. (2018). However, advocating for the mixture of linear models requires further analysis.