# Consistent Counterfactuals for Deep Models

Published as a conference paper at ICLR 2022

Emily Black*, Zifan Wang*, Anupam Datta, Matt Fredrikson
{emilybla, zifan, danupam, mfredrik}@cmu.edu
Carnegie Mellon University

## ABSTRACT

Counterfactual examples are one of the most commonly cited methods for explaining the predictions of machine learning models in key areas such as finance and medical diagnosis. Counterfactuals are often discussed under the assumption that the model on which they will be used is static, but in deployment models may be periodically retrained or fine-tuned. This paper studies the consistency of model predictions on counterfactual examples in deep networks under small changes to initial training conditions, such as weight initialization and leave-one-out variations in data, as often occur during model deployment. We demonstrate experimentally that counterfactual examples for deep models are often inconsistent across such small changes, and that increasing the cost of the counterfactual, a stability-enhancing mitigation suggested by prior work in the context of simpler models, is not a reliable heuristic in deep networks. Rather, our analysis shows that a model's Lipschitz continuity around the counterfactual, along with the confidence of its prediction, is key to its consistency across related models. To this end, we propose Stable Neighbor Search¹ as a way to generate more consistent counterfactual explanations, and illustrate the effectiveness of this approach on several benchmark datasets.

\* Equal contribution.
¹ Implementation is available at https://github.com/zifanw/consistency

## 1 INTRODUCTION

Deep networks are increasingly being integrated into decision-making processes that require explanations during model deployment, from medical diagnosis to credit risk analysis (Bakator & Radosav, 2018; et al., 2017; Liu et al., 2014; Sun et al., 2016; De Fauw et al., 2018; Babaev et al., 2019; Addo et al., 2018; Balasubramanian et al., 2018; Wang & Xu, 2018). Counterfactual examples (Wachter et al., 2018; Van Looveren & Klaise, 2019; Mahajan et al., 2019; Verma et al., 2020; Laugel et al., 2018; Keane & Smyth, 2020; Ustun et al., 2019; Sharma et al., 2019; Poyiadzi et al., 2020; Karimi et al., 2020; Pawelczyk et al., 2020a) are often put forth as a simple and intuitive method of explaining decisions in such high-stakes contexts (McGrath et al., 2018; Yang et al., 2020). A counterfactual example for an input x is a related point x′ that produces a desired outcome y′ from a model. Intuitively, these explanations are intended to answer the question, "Why did point x not receive outcome y′?", either to give instructions for recourse, i.e. how an individual can change their behavior to get a different model outcome, or as a check to ensure a model's decision is well-justified (Ustun et al., 2019). Counterfactual examples are particularly popular in legal and business contexts, as they may offer a way to comply with regulations in the United States and Europe that require explanations for high-stakes decisions (e.g. the Fair Credit Reporting Act (FCRA) and the General Data Protection Regulation (GDPR, 2016)), while revealing little information about the underlying model (Barocas et al., 2020; McGrath et al., 2018). Counterfactual examples are often viewed under the assumption that the decision system on which they will be used is static: that is, the model that creates the explanation will be the same model to which, e.g., a loan applicant soliciting recourse re-applies (Barocas et al., 2020).
However, during real model deployments in high-stakes situations, models are not constant through time: they are often retrained due to small dataset updates, or fine-tuned to ensure consistently good behavior (Merchant, 2020; pwc, 2020). Thus, in order for counterfactuals to be usable in practice, they must return the same desired outcome not only for the model that generates them, but also for similar models created during deployment. This paper investigates the consistency of model predictions on counterfactual examples between deep models with seemingly inconsequential differences, i.e. random seed and one-point changes in the training set.

We demonstrate that some of the most common methods for generating counterfactuals in deep models are either highly inconsistent between models or very costly in terms of distance from the original input. Recent work that has investigated this problem in simpler models (Pawelczyk et al., 2020b) has pointed to increasing counterfactual cost, i.e. the distance between an input point and its counterfactual, as a method of increasing consistency. We show that while higher-than-minimal cost is necessary to achieve a stable counterfactual, cost alone is not a reliable signal to guide the search for stable counterfactuals in deep models (Section 3). Instead, we show that a model's Lipschitz continuity and confidence around the counterfactual are a more reliable indicator of the counterfactual's stability. Intuitively, this is because these factors bound the extent to which a model's local decision boundaries can change across fine-tunings, which we prove in Section 4. Following this result, we introduce Stable Neighbor Search (SNS), which finds counterfactuals by searching for high-confidence points with small Lipschitz constants in the generating model (Section 4). Finally, in Section 5 we empirically demonstrate that SNS generates consistent counterfactuals while maintaining a low cost relative to other methods on several tabular datasets, e.g. Seizure and German Credit from the UCI database (Dua & Karra Taniskidou, 2017).

In summary, our main contributions are: 1) we demonstrate that common counterfactual explanations can have low consistency across nearby deep models, and that cost is an insufficient signal for finding consistent counterfactuals (Theorem 1); 2) to navigate this cost-consistency tradeoff, we prove that counterfactual examples in a neighborhood where the network has a small local Lipschitz constant are more consistent across changes to the last layer of weights, which suggests that such points are more stable across small changes in the training environment (Theorem 2); 3) leveraging this result, we propose SNS as a way to generate consistent counterfactual explanations (Def. 5); and 4) we empirically demonstrate the effectiveness of SNS in generating consistent and low-cost counterfactual explanations (Table 1).

More broadly, this paper further develops a connection between the geometry of deep models and the consistency of counterfactual examples. When considered alongside related findings that focus on attribution methods, our work adds to the perspective that good explanations require good models to begin with (Croce et al., 2019; Wang et al., 2020; Dombrowski et al., 2019; Simonyan et al., 2013; Sundararajan et al., 2017).

## 2 BACKGROUND

**Notation.** We begin with notation, preliminaries, and definitions.
Let $F(x;\theta) = \arg\max_i f_i(x;\theta)$ be a deep network, where $f_i$ denotes the logit output for the $i$-th class and $\theta$ is the vector of trainable parameters. If $F(x;\theta) \in \{0,1\}$, there is only one logit output, so we write $f$. Throughout the paper we assume $F$ is piecewise-linear, i.e. all of its activation functions are ReLUs. We use $\|x\|_p$ to denote the $\ell_p$ norm of a vector $x$, and $B_p(x,\epsilon) \overset{\text{def}}{=} \{x' \mid \|x' - x\|_p \leq \epsilon,\ x' \in \mathbb{R}^d\}$ to denote a norm-bounded ball around $x$.

**Counterfactual Examples.** We introduce some general notation to unify the definition of a counterfactual example across various approaches with differing desiderata. In the most general sense, a counterfactual example for an input $x$ is an example $x_c$ that receives a different, often targeted, prediction while minimizing a user-defined quantity of interest (QoI) (see Def. 1): for example, a counterfactual explanation for a rejected loan application is a related hypothetical application that was accepted. We refer to the point $x$ requiring a counterfactual example as the origin point or the input, interchangeably. We note that there is a different definition of counterfactual widely used in the causality literature, where a counterfactual is given by an intervention on a causal model that is assumed to generate the data observations (Pearl, 2009). This is a case of overlapping terminology, and is orthogonal to this work. We do not consider causality in this paper.

**Definition 1 (Counterfactual Example).** Given a model $F(x;\theta)$, an input $x$, a desired outcome class $c \neq F(x;\theta)$, and a user-defined quantity of interest $q$, a counterfactual example $x_c$ for $x$ is defined as
$$x_c \overset{\text{def}}{=} \arg\min_{x' : F(x';\theta) = c} q(x', x),$$
where the cost of $x_c$ is defined as $\|x - x_c\|_p$.

The majority of counterfactual generation algorithms minimize $q_{\text{low}}(x, x') \overset{\text{def}}{=} \|x - x'\|_p$, potentially along with some constraints, to encourage low-cost counterfactuals (Wachter et al., 2018); we give a minimal sketch of such a search below. Some common variations include ensuring that counterfactuals are attainable, i.e. not changing features that cannot be changed (e.g. sex, age) due to domain constraints (Ustun et al., 2019; Lash et al., 2017); ensuring sparsity, so that fewer features are changed (Dandl et al., 2020; Guidotti et al., 2018); or incorporating user preferences about which features can be changed (Mahajan et al., 2019). Alternatively, a somewhat distinct line of work (Pawelczyk et al., 2020a; Van Looveren & Klaise, 2019; Joshi et al., 2019) adds constraints to ensure that counterfactuals come from the data manifold. Other works integrate causal validity into the counterfactual search (Karimi et al., 2020), or generate multiple counterfactuals at once (Mothilal et al., 2020).

We focus our analysis on the first two approaches, which we denote minimum-cost and data-support counterfactuals. We make this choice because the causal and distributional assumptions used in the other counterfactual generation methods referenced above are specific to a given application domain, whereas our focus is on the general properties of counterfactuals across domains. Specifically, we evaluate our results on minimum-cost counterfactuals introduced by Wachter et al. (2018), and data-support counterfactuals from Pawelczyk et al. (2020a) and Van Looveren & Klaise (2019). We give full descriptions of these approaches in Sec. 5.

**Counterfactual Consistency.** Given two models $F(x;\theta_1)$ and $F(x;\theta_2)$, a counterfactual example $x_c$ for $F(x;\theta_1)$ is consistent with respect to $F(x;\theta_2)$ if $F(x_c;\theta_1) = F(x_c;\theta_2)$.
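To make the minimum-cost counterfactuals of Definition 1 concrete, the following is a minimal sketch of the kind of gradient-based search popularized by Wachter et al. (2018), written against a generic PyTorch classifier. The function name `find_counterfactual`, the trade-off weight `lam`, and the optimizer settings are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch

def find_counterfactual(model, x, target_class, lam=0.1, lr=0.01, steps=500):
    """Sketch of a Wachter-style minimum-cost counterfactual search.

    Minimizes  CE(model(x'), target_class) + lam * ||x' - x||_1
    so that x' crosses into the desired class c while staying close to x.
    """
    x = x.detach()
    x_cf = x.clone().requires_grad_(True)      # start the search at the origin point
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    target = torch.tensor([target_class])

    for _ in range(steps):
        logits = model(x_cf.unsqueeze(0))      # assumes model: (1, d) -> (1, num_classes)
        if logits.argmax(dim=1).item() == target_class:
            break                              # desired outcome reached; stop early
        loss = (torch.nn.functional.cross_entropy(logits, target)
                + lam * torch.norm(x_cf - x, p=1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return x_cf.detach()
```

In practice one would also clamp or project `x_cf` onto valid feature ranges and tune `lam`, which trades off the cost $\|x - x_c\|_p$ against reaching the target class.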
Following Pawelczyk et al. (2020b), we define the invalidation rate for counterfactuals in Def. 2.

**Definition 2 (Invalidation Rate).** Suppose $x_c$ is a counterfactual example for $x$ found with a model $F(x;\theta)$. We define the invalidation rate $\mathrm{IV}(x_c,\Theta)$ of $x_c$ with respect to a distribution $\Theta$ over trainable parameters as
$$\mathrm{IV}(x_c,\Theta) \overset{\text{def}}{=} \mathbb{E}_{\theta' \sim \Theta}\, \mathbb{I}\left[F(x_c;\theta') \neq F(x_c;\theta)\right].$$
(A simple empirical estimator of this rate is sketched below.)

Throughout this paper, we will call the model $F(x;\theta)$ that creates the counterfactual the generating or base model. Recent work has investigated the consistency of counterfactual examples across similar linear and random-forest models (Pawelczyk et al., 2020b). We study the invalidation rate with respect to the distribution $\Theta$ induced by arbitrary differences in the training environment, such as random initialization and one-point differences in the training dataset. We also assume $F(x;\theta')$ uses the same set of hyper-parameters as chosen for $F(x;\theta)$, e.g. the number of epochs, the optimizer, the learning-rate schedule, the loss function, etc.

## 3 COUNTERFACTUAL INVALIDATION IN DEEP MODELS

As we demonstrate in more detail in Section 5, counterfactual invalidation is a problem in deep networks on real data: empirically, we find that counterfactuals produce inconsistent outcomes in near-duplicate deep models up to 94% of the time. Previous work investigating the problem of counterfactual invalidation (Pawelczyk et al., 2020b; Rawal et al., 2021) has pointed to increasing counterfactual cost as a potential mitigation strategy. In particular, these works prove that higher-cost counterfactuals lead to lower invalidation rates in expectation for linear models (Rawal et al., 2021), and demonstrate this relationship for a broader class of well-calibrated models (Pawelczyk et al., 2020b). While this insight provides an interesting challenge to the perspective that low-cost counterfactuals should be preferred, we show that cost alone is insufficient to determine, at generation time, which counterfactual has a greater chance of being consistent in deep models.

The intuition that a larger distance between input and counterfactual leads to lower invalidation rests on the assumption that the distance between a point $x$ and a counterfactual $x_c$ is indicative of the distance from $x_c$ to the decision boundary, with a greater distance making $x_c$'s prediction more stable under perturbations to that boundary. This holds well in a linear model, where there is only one boundary (Rawal et al., 2021). However, among the complex decision boundaries of deep networks, moving farther away from a point across the nearest boundary may bring a counterfactual closer to a different boundary. We prove in Theorem 1 that this holds even for a one-hidden-layer network. This observation shows that a counterfactual example that is farther from its origin point may be just as susceptible to invalidation as one closer to it. In fact, we show that the only models in which $\ell_p$ cost is universally a good heuristic for distance from a decision boundary, and therefore, by the reasoning above, for consistency, are linear models (Lemma 1).

**Theorem 1.** Suppose that $H_1, H_2$ are decision boundaries in a piecewise-linear network $F(x) = \mathrm{sign}\{w_1^\top \mathrm{ReLU}(W_0 x)\}$, and let $x$ be an arbitrary point in its domain. If the projections of $x$ onto the corresponding halfspace constraints of $H_1, H_2$ lie on $H_1$ and $H_2$, then there exists a point $x'$ such that: 1) $d(x', H_2) = 0$; 2) $d(x', H_2)$
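As referenced above, the invalidation rate of Definition 2 can be estimated empirically by retraining the generating model several times under small changes to the training environment (a new seed, a one-point change to the data) and counting disagreements on the counterfactual. The sketch below assumes the counterfactuals are stacked in a tensor and the retrained PyTorch models are already available; the function name and arguments are illustrative, not from the paper's implementation.

```python
import torch

def invalidation_rate(base_model, counterfactuals, retrained_models):
    """Empirical estimate of IV(x_c, Theta) from Definition 2.

    counterfactuals:  tensor of shape (n, d), one counterfactual per row
    retrained_models: models trained under small changes (new seed, leave-one-out data)
    Returns the fraction of (counterfactual, retrained model) pairs whose prediction
    disagrees with the generating (base) model.
    """
    with torch.no_grad():
        base_preds = base_model(counterfactuals).argmax(dim=1)
        rates = []
        for m in retrained_models:
            preds = m(counterfactuals).argmax(dim=1)
            rates.append((preds != base_preds).float().mean().item())
    # averaging over the sampled retrainings approximates the expectation over Theta
    return sum(rates) / len(rates)
```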