Published as a conference paper at ICLR 2025

RETHINKING VISUAL COUNTERFACTUAL EXPLANATIONS THROUGH REGION CONSTRAINT

Bartlomiej Sobieski (corresponding author), University of Warsaw, b.sobieski@uw.edu.pl
Jakub Grzywaczewski, Warsaw University of Technology, jakub.grzywaczewski2.stud@pw.edu.pl
Bartlomiej Sadlej, University of Warsaw, b.sadlej@student.uw.edu.pl
Matthew Tivnan, Harvard Medical School, mtivnan@mgh.harvard.edu
Przemyslaw Biecek, University of Warsaw and Warsaw University of Technology, przemyslaw.biecek@pw.edu.pl

ABSTRACT

Visual counterfactual explanations (VCEs) have recently gained immense popularity as a tool for clarifying the decision-making process of image classifiers. This trend is largely motivated by what these explanations promise to deliver: indicating semantically meaningful factors that change the classifier's decision. However, we argue that current state-of-the-art approaches lack a crucial component, the region constraint, whose absence prevents one from drawing explicit conclusions and may even lead to faulty reasoning due to phenomena like confirmation bias. To address the issue of previous methods, which modify images in a very entangled and widely dispersed manner, we propose region-constrained VCEs (RVCEs), which assume that only a predefined image region can be modified to influence the model's prediction. To effectively sample from this subclass of VCEs, we propose Region-Constrained Counterfactual Schrödinger Bridges (RCSB), an adaptation of a tractable subclass of Schrödinger Bridges to the problem of conditional inpainting, where the conditioning signal originates from the classifier of interest. In addition to setting a new state-of-the-art by a large margin, we extend RCSB to allow for exact counterfactual reasoning, where the predefined region contains only the factor of interest, and to incorporate the user, who can actively interact with the RVCE by predefining the regions manually.

1 INTRODUCTION

Visual counterfactual explanations (VCEs) aim at explaining the decision-making process of an image classifier by modifying the input image in a semantically meaningful and minimal way so that its decision changes. Over time, they have become an independent research direction, with the latest methods presenting impressive and visually appealing results. Nevertheless, in this work we show that they possess a fundamental flaw at a conceptual level: the lack of region constraint and its proper utilization. Consider the image $x^*$ in Fig. 1, which the classifier $f$ correctly predicts to be a jay. In essence, VCEs focus on semantically editing $x^*$ so that the prediction of $f$ changes to some target class (bulbul in this case), hence providing an answer to a specific what-if question, through which the model's reasoning is explained. Consider now an example VCE for $x^*$, denoted as $x^*_{VCE}$, obtained with a recent state-of-the-art (SOTA) method. While $x^*_{VCE}$ is successful at changing the prediction of $f$ and can be considered both realistic and semantically close to $x^*$, answering why $f$ now predicts it as a bulbul is close to impossible.

Figure 1: Previous methods create VCEs with unconstrained changes, making it virtually impossible to understand the decision-making process of a model. We propose region-constrained VCEs, establishing a new paradigm for a comprehensible and actionable explanatory process.
The algorithm simultaneously modifies the bird's head and feathers, changes the texture of the branch, and even modifies the copyright caption. The entanglement and dispersion of the introduced changes hence leave the question unanswered. We argue that to circumvent these fundamental difficulties, VCEs should be synthesized with a hard constraint on the region where the changes are allowed to appear, while leaving the rest of the image unchanged. For example, consider the image $x^*_R$ with regions of the bird's head ($R_1$) and body ($R_2$) overlaid. Constraining the VCEs to introduce changes only to predetermined regions leads to two distinct explanations, $x^*_{R_1}$ and $x^*_{R_2}$, of why the decision changes to bulbul. By isolating the modified factors, the explanatory process greatly simplifies: one can now state with certainty that $f$'s new prediction is based either on the modified feathers ($x^*_{R_2}$) or the changed characteristics of its head ($x^*_{R_1}$). Region-constrained VCEs (RVCEs) therefore allow one to reason about the model's thought process in a causal and principled manner, mitigating potential confirmation bias and clarifying the explanatory process.

By putting RVCEs in the spotlight, our work establishes new frontiers in the field of VCE generation. First, we define the objective of finding RVCEs as solving a conditional inpainting task. By building on top of the Image-to-Image Schrödinger Bridge (I2SB, Liu et al. (2023a)) approach and adapting it to the classifier guidance scheme, we develop an efficient algorithm which synthesizes RVCEs with extreme realism, sparsity and closeness to the original image. Specifically, we set a new quantitative state-of-the-art (SOTA) on ImageNet (Deng et al., 2009) with up to 4 times better FID and 3 times better sFID scores (realism) and up to 2 times higher COUT (sparsity), while matching or exceeding the S3 (similarity) and Flip Rate (efficiency) achieved by previous methods. Through large-scale experiments, we demonstrate that, besides a fully automated way of synthesizing meaningful and highly interpretable RVCEs, our approach, Region-Constrained Counterfactual Schrödinger Bridge (RCSB), allows one to infer causally about the model's change in prediction and enables the user to actively interact with the explanatory process by manually defining the region of interest. Moreover, our results highlight the importance of RVCEs in future research, indicating potential pitfalls of unconstrained methods that could lead to drawing misleading conclusions.

2 BACKGROUND & RELATED WORK

In this section, we introduce the necessary background knowledge connected with score-based generative models (SGMs) and I2SB, which forms the foundation of our method. We then present an overview of recent methods for VCE generation based on SGMs. For an extended literature review and a detailed description of the theoretical basis, please refer to the Appendix.

SGM. Following the work of Song et al. (2021), SGMs can be constructed through the framework of stochastic differential equations (SDEs), where samples from a complex distribution $p_0$ (e.g., natural images) are mapped to a Gaussian distribution $p_1$, while the model is trained to reverse this mapping. Formally, converting data to noise is performed by following the forward SDE (Eq. (1a)), while denoising happens through the reverse SDE (Eq. (1b), Anderson (1982)):

$$dx_t = F_t(x_t)\,dt + \sqrt{\beta_t}\,dw, \tag{1a}$$
$$dx_t = \big(F_t(x_t) - \beta_t \nabla_{x_t} \log p(x_t, t)\big)\,dt + \sqrt{\beta_t}\,d\bar{w}, \tag{1b}$$

where $x_t$ is the noisy version of a clean image $x \in \mathbb{R}^n$ for some $n \in \mathbb{N}$ at timestep $t \in [0, 1]$, $w$ and $\bar{w}$ denote the Wiener process and its time-reversed counterpart, $F_t(x_t): \mathbb{R}^n \to \mathbb{R}^n$ is the drift coefficient, $\beta_t \in \mathbb{R}$ is the diffusion coefficient, and $\nabla_{x_t} \log p(x_t, t)$ is the score function. An SGM $s_\theta$, where $\theta$ denotes the model's parameters, is trained to approximate the score, i.e., $s_\theta(x_t, t) \approx \nabla_{x_t} \log p(x_t, t)$. During sampling, denoising begins from pure noise $x_1 \sim p_1$ and follows some discretized version of Eq. (1b) with the approximate score $s_\theta$.

SGMs can also be adapted to conditional generation, where $y$ represents the conditioning variable. In this case, the score $\nabla_{x_t} \log p(x_t, t)$ is replaced by $\nabla_{x_t} \log p(x_t, t \mid y)$, which can be decomposed with Bayes' theorem into $\nabla_{x_t} \log p(x_t, t \mid y) = \nabla_{x_t} \log p(x_t, t) + \nabla_{x_t} \log p(y \mid x_t, t)$. While $\nabla_{x_t} \log p(x_t, t)$ can be approximated with an already trained $s_\theta$, $\nabla_{x_t} \log p(y \mid x_t, t)$ must be modeled additionally. For $y$ representing class labels, $p(y \mid x_t, t)$ can be approximated with an auxiliary time-dependent classifier $p_\phi(y \mid x_t, t)$ trained on noisy images $\{x_t\}_{t \in [0,1]}$. Incorporating $p_\phi$ into the sampling process is termed classifier guidance (CG) and can be strengthened (or weakened) with a guidance scale $s$ through $\nabla_{x_t} \log p(x_t, t) + s \nabla_{x_t} \log p(y \mid x_t, t)$. Therefore, class-conditional sampling in SGMs amounts to additionally maximizing the likelihood $p_\phi(y \mid x_t, t)$ of the classifier throughout the generative process to arrive at images from the data manifold which resemble (according to $p_\phi$) instances of a specific class. We emphasize this fact here for further reference.
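To make the guided score concrete, the following minimal PyTorch sketch combines an unconditional score estimate with the classifier's log-likelihood gradient, $s_\theta(x_t, t) + s\,\nabla_{x_t}\log p_\phi(y \mid x_t, t)$. The callables `score_model` and `classifier` are illustrative stand-ins for a trained SGM and a time-dependent classifier, not the API of any particular codebase:

```python
import torch

def guided_score(score_model, classifier, x_t, t, y, s=1.0):
    """Classifier-guided score estimate (a sketch).

    score_model(x_t, t) approximates the score grad log p(x_t, t);
    classifier(x_t, t) returns class logits of a time-dependent classifier.
    """
    x_t = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
    log_p_y = log_probs[torch.arange(x_t.shape[0]), y].sum()
    # grad of log p_phi(y | x_t, t) w.r.t. the noisy input
    grad_log_p_y = torch.autograd.grad(log_p_y, x_t)[0]
    with torch.no_grad():
        score = score_model(x_t, t)
    return score + s * grad_log_p_y
```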
Figure 2: Generative trajectories of I2SB and SGM. Intermediate images of I2SB are much closer to the data manifold.

I2SB. The framework of I2SB extends SGMs to $p_1$ representing an arbitrary data distribution. For training, I2SB requires paired data, e.g., in the form of clean and partially masked samples for inpainting, where it learns to infill the missing parts. While SGMs can also be adapted to solve inverse problems like inpainting, I2SB maps these samples directly (see Fig. 2 for a comparison of their generative trajectories). Therefore, I2SB follows the same theoretical paradigm, where sampling is achieved by discretizing Eq. (1b) and using a score approximator $s_\psi$, but the generative process begins from a corrupted (e.g., masked) image instead of pure noise. Hence, I2SB can also be adapted to conditional generation in the same manner as SGMs, especially for class-conditioning with an auxiliary classifier. Importantly, a special case of I2SB follows an optimal transport ordinary differential equation (OT-ODE) when $\beta_t \to 0$, eliminating stochasticity beyond the initial sampling step (see Appendix). We utilize the OT-ODE version of I2SB in our implementation.
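The deterministic OT-ODE sampler can be sketched as follows, following Algorithm 2 in Appendix A. The per-step posterior coefficients `mu` and `mu_bar` come from the analytic posterior (Eq. (15) in the Appendix) and are assumed precomputed here; `predict_x0` is a wrapper around the trained network $s_\psi$ that returns the denoised estimate:

```python
import torch

@torch.no_grad()
def i2sb_ot_ode_sample(predict_x0, x1, timesteps, mu, mu_bar):
    """Deterministic (OT-ODE) I2SB sampling, a sketch of Algorithm 2.

    x1: corrupted starting image (e.g., with the masked region noised).
    mu, mu_bar: per-step coefficients of the analytic posterior (assumed given).
    """
    x = x1  # generation starts from the corrupted image, not pure noise
    for n in range(len(timesteps) - 1, 0, -1):
        x0_hat = predict_x0(x, timesteps[n])          # x0_hat(x_n) via s_psi
        x = mu[n - 1] * x0_hat + mu_bar[n - 1] * x    # x_{n-1} = mu*x0_hat + mu_bar*x_n
    return x
```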
SGM-based VCEs. The initial approach adapting SGMs to VCE generation, DiME (Jeanneret et al., 2022), obtains the classifier's gradient by mapping the noised image to its clean version at each step of the reverse process. Augustin et al. (2022) (DVCE) incorporate the gradient of a robust classifier and a cone projection scheme. Jeanneret et al. (2023) (ACE) decompose VCE generation into pre-explanation construction and refinement using RePaint (Lugmayr et al., 2022). Jeanneret et al. (2024) utilize a foundation model, Stable Diffusion (SD, Rombach et al. (2022)), to generate VCEs in a black-box scenario. Farid et al. (2023) (LDCE) and Motzkus et al. (2024) utilize Latent Diffusion Models (LDMs), including SD, in a white-box context. Weng et al. (2024) propose FastDiME to accelerate the generation process in a shortcut learning scenario. Also in a black-box context, Sobieski & Biecek (2024) utilize a Diffusion Autoencoder (Preechakul et al., 2022) to find semantic latent directions that globally flip the classifier's decision. Augustin et al. (2024) also make use of SD in various contexts, including classifier disagreement and neuron activation besides VCEs. While ACE and FastDiME also assume constraints on some regions, those are always classifier-dependent. In this work, we consider a more general definition, provided in the next section.

3 METHOD

In this section, we describe the details of our approach, beginning with the formulation of RVCEs as solutions to a conditional inpainting task. Next, we motivate the use of I2SB as an effective prior for synthesizing meaningful RVCEs and follow with a series of steps that better align the gradients of a standard classifier w.r.t. corrupted images from its generative trajectory. We conclude with a description of the automated region extraction method, forming the basis of our algorithm.

Figure 3: Series of proposed improvements to better align the gradients of the classifier of interest with the generative trajectory. Changes to the factual image are constrained to the indicated region. Subsequent images illustrate the influence of each new adaptation (Factual, Region, Naive, + Tweedie's, + ADAM stabilization, + Adaptive normalization, + Trajectory truncation). Numbers below images correspond to FID (↓) values obtained in a larger-scale experiment: 46.4, 27.5 (-18.9), 23.1 (-4.4), 20.2 (-2.9), 16.1 (-4.1). For details, see Appendix.

RVCEs through conditional inpainting. We define the problem of finding RVCEs for the classifier $f$ from a given image $x^*$, a region $R$ and a target class label $y$, where $\arg\max_{y'} f(y' \mid x^*) \neq y$, as the task of sampling from

$$p\big(x \mid \arg\max_{y'} f(y' \mid x) = y,\; (1 - R) \odot x = (1 - R) \odot x^*\big), \tag{2}$$

where $R$ is a binary mask with 1 indicating the region, which is not restricted to depend on the classifier $f$, and $\odot$ denotes element-wise multiplication. Intuitively, sampling from Eq. (2) means obtaining $x$ with the complement of $R$ unchanged and the content of $R$ modified in a way that changes the decision of $f$ to $y$, i.e., performing inpainting with an additional condition coming from the classifier $f$.
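The two conditions of Eq. (2) translate directly into code. The single-image sketch below is purely illustrative (names like `satisfies_rvce` are ours, not the paper's); in practice, the constraint outside $R$ is satisfied by construction during sampling rather than checked post hoc:

```python
import torch

def satisfies_rvce(x_cf, x_star, region, classifier, target_y, atol=0.0):
    """Check the two conditions of Eq. (2) for a single candidate RVCE.

    region: binary mask R with 1 marking the editable pixels.
    """
    # Condition 2: the complement of R must be pixel-identical to x_star.
    outside_unchanged = torch.allclose(
        (1 - region) * x_cf, (1 - region) * x_star, atol=atol
    )
    # Condition 1: the classifier's decision must flip to the target class.
    flipped = classifier(x_cf).argmax(dim=-1).item() == target_y
    return outside_unchanged and flipped

# The hard region constraint can also be imposed directly by composition:
# x_cf = (1 - region) * x_star + region * x_generated
```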
Synthesizing meaningful RVCEs. Looking at Eq. (2), one quickly realizes that obtaining semantically meaningful RVCEs requires maximizing the likelihood $f(y \mid x)$ of the classifier while inpainting $R$ with content that keeps $x$ in the data manifold. These conditions greatly resemble the CG scheme in the context of I2SB, since the score estimate $s_\psi$ serves as an effective prior for generating in-manifold infills, while the likelihood $p_\phi(y \mid x)$ of an auxiliary classifier is maximized to ensure that $p_\phi$ predicts them as instances of $y$. Moreover, I2SB maps masked images directly to clean samples, leaving the content outside $R$ unchanged in the final image. The above arguments suggest that inserting $f$ in place of $p_\phi$ should function as an effective mechanism for sampling meaningful RVCEs. However, a fundamental drawback of this naive approach is that, throughout the generative process, $f$'s gradients originate from evaluating it on images with highly noised infills inside $R$ (see Fig. 2). Such corrupted images are far from what $f$ observed during training, leading to a misalignment of its gradients with the correct trajectory and to the generation of out-of-manifold samples. A similar issue has been identified by the previously mentioned SGM-based methods for VCEs, which can be generally unified as attempts to replace the auxiliary classifier $p_\phi$ with $f$ in the CG scheme of SGMs and correct $f$'s gradients. Following Fig. 2, one should expect the misalignment in these methods to be severe, as the generative trajectory consists of highly noised images, leaving no meaningful content for $f$ to provide accurate gradients. Here, I2SB provides a crucial advantage, which stems from its generative trajectory being much closer to the data manifold. Moreover, by using I2SB, $f$ is able to effectively utilize the readily available context outside $R$. Hence, in the following, we focus on reducing the misalignment problem caused by the noised content inside $R$, in the end arriving at a highly effective algorithm for meaningful RVCEs.

Aligning the gradients. We propose to adapt the gradients of $f$ to properly align with the generative trajectory of I2SB through a series of incremental steps. To provide the intuition behind each consecutive improvement, Fig. 3 presents an example RVCE task, where the factual image depicts a zebra correctly predicted by the model (ResNet50 (He et al., 2016)), and the goal is to change the decision to sorrel. We set the region constraint to include the entire animal to make the task challenging enough and verify the improvements quantitatively through a large-scale experiment with around 2000 images. For each step, we compute FID between the RVCEs and the original images to assess their realism. For details on the experimental setup, see Appendix.

Naive. We first verify that naively plugging $f$ in place of $p_\phi$ does not provide meaningful results. Indeed, as shown in Fig. 3, the method struggles to include the information from $f$. The unrealistic infill also suggests that the classifier's signal negatively influences the score from I2SB.

Tweedie's formula. To begin closing the gap between the data manifold and the generative trajectory, we refer to the classic result of Tweedie's formula (Robbins, 1992; Chung et al., 2022; Weng et al., 2024), which states that a denoised estimate of the final image at step $t$ can be obtained by computing the posterior expectation

$$\hat{x}_0(x_t) := \mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p(x_t, t), \tag{3}$$

where $\sigma_t^2 = \int_0^t \beta_\tau \, d\tau$. For visual differences between $x_t$ and $\hat{x}_0(x_t)$, see Appendix. Crucially, one can approximate $\hat{x}_0(x_t)$ at every step $t$ by utilizing I2SB as the approximate score. Replacing $\nabla_{x_t} \log f(y \mid x_t)$ with $\nabla_{x_t} \log f(y \mid \hat{x}_0(x_t))$ brings the inputs of $f$ much closer to what it expects, improving the conditional inpainting process as indicated by Fig. 3, which now shows a structure resembling a sorrel and a much smaller FID.
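A minimal sketch of this replacement, assuming a score network and the accumulated variance $\sigma_t^2$ are available; differentiating through the network is the straightforward (if memory-hungry) way to obtain $\nabla_{x_t} \log f(y \mid \hat{x}_0(x_t))$:

```python
import torch

def tweedie_guidance_grad(x_t, t, score_net, classifier, y, sigma2_t):
    """Gradient of log f(y | x0_hat(x_t)) w.r.t. x_t, with x0_hat from Eq. (3).

    score_net approximates the score and sigma2_t is the accumulated variance
    at step t; both are assumed given. classifier is the standard classifier f,
    evaluated on the denoised estimate instead of the noisy image.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = x_t + sigma2_t * score_net(x_t, t)  # Tweedie's estimate, Eq. (3)
    log_probs = torch.log_softmax(classifier(x0_hat), dim=-1)
    log_p_y = log_probs[torch.arange(x_t.shape[0]), y].sum()
    return torch.autograd.grad(log_p_y, x_t)[0]
```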
ADAM stabilization. Despite utilizing Tweedie's estimate, we observed the norms of $f$'s gradients to behave very noisily throughout the generation process, pointing to a possible cause of the visible artifacts and the missing parts of the animal. Hence, we propose to smooth out the gradients by applying the ADAM update rule at each step (Vaeth et al., 2024; Kingma, 2019), which we refer to simply as ADAM stabilization. Figure 3 indicates that this modification allows for filling in the missing parts of the sorrel and further lowers FID.

Adaptive normalization. Incorporating ADAM stabilization required greatly lowering the guidance scale to values on the order of 1e-2, as using the standard $s = 1$ led to extreme artifacts. This phenomenon suggested that the step size could also be adjusted throughout the generation process. While we initially experimented with various types of schedulers (see Appendix), using adaptive normalization has empirically proven to be the most effective approach. Specifically, at the beginning of the conditional inpainting process, we register the norm of the first encountered gradient of the log-likelihood of $f$. We then use it as a normalizing constant for each subsequent gradient, meaning that the generation begins with a gradient of unit norm. This simple modification not only further lowered FID, but also reduced the remaining visible artifacts and improved color balance (Fig. 3).

Trajectory truncation. Up until this point, we relied solely on the ability of I2SB and the classifier's signal to correctly infill the missing regions with semantically meaningful content, with no knowledge of the structure of the missing objects. Since a possible infill of the region is always available from the original image, one can begin the inpainting process from some intermediate step instead of the final one. This intervention allows for mixing the available information with that coming from the classifier, and gives direct control over the preservation of the original content. As our approach does not bias the conditional score with signal from any additional losses (like Learned Perceptual Image Patch Similarity (LPIPS, Zhang et al. (2018)) or $\ell_2$ in other works), we can fully rely on the conceptual compression of I2SB which, similarly to SGMs (Ho et al., 2020), decomposes the generation process into initial phases responsible for the overall structure of objects and later ones responsible for small details. Figure 3 showcases the effect of using this trajectory truncation ($\tau$) at the 0.4 level, meaning that the infilling process starts from $t = \tau T$, where $T$ denotes the final timestep. Understandably, trajectory truncation greatly lowers the FID score, as much more information is available from the very beginning of the process, and it introduces much more subtle changes to the image. We explore the effect of manipulating $\tau$ further in the Appendix, showing that it functions as a very interpretable mechanism for controlling content preservation.
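Both gradient-processing steps can be sketched as a small stateful module. This is a sketch of the two steps described above, not the reference implementation; the ADAM constants follow PyTorch defaults, and the normalizer is the norm of the first processed gradient, as in Algorithm 3 (Appendix A):

```python
import torch

class GuidanceStabilizer:
    """ADAM stabilization + adaptive normalization for guidance gradients."""

    def __init__(self, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = self.v = None
        self.step = 0
        self.norm0 = None  # norm of the first encountered (processed) gradient

    def __call__(self, grad, scale):
        self.step += 1
        if self.m is None:
            self.m = torch.zeros_like(grad)
            self.v = torch.zeros_like(grad)
        # ADAM stabilization: smooth the raw gradient with moment estimates.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.step)
        v_hat = self.v / (1 - self.beta2**self.step)
        smoothed = m_hat / (v_hat.sqrt() + self.eps)
        # Adaptive normalization: register the first norm, divide all later ones.
        if self.norm0 is None:
            self.norm0 = smoothed.norm()
        return scale * smoothed / self.norm0
```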
Automated region extraction. While the introduced algorithmic improvements effectively incorporate the classifier's signal into the inpainting process, they do not address the issue of predetermining the region for the resulting explanation. To this end, the optimal strategy would be fully automated and focus on regions that are both important to the classifier's prediction and point to semantically meaningful concepts. This description closely resembles the role of visual attribution methods, which assign importance values to pixels based on their relevance to the model's output (Holzinger et al., 2022). Figure 4 shows an example attribution map obtained with the Integrated Gradients (IG, Sundararajan et al. (2017)) method for the squirrel prediction of a ResNet50 model. Perceptually, the highest attributions are focused around the squirrel's head. To extract a region from such attributions, one can threshold them to cover a specific fraction $a$ of the total image area. However, after binarizing the attributions with $a = 0.05$, we observe that the resulting region is highly scattered, losing focus on semantic concepts. To address this issue, we divide the image into a grid of square cells of size $c \times c$, where each cell receives a value equal to the sum of the absolute pixel attributions inside it. Figure 4 shows that this postprocessing mechanism (here with $c = 16$) greatly amplifies the focus of the resulting map. By thresholding it with $a = 0.05$, we observe that the extracted region focuses solely on the squirrel's head. This leads to a fully automated strategy for obtaining regions that are both aligned with semantically meaningful concepts and based on pixels that are important to the classifier.

Figure 4: Example region obtained with our automated region extraction. Instead of directly binarizing an attribution map (upper row), we amplify the focus on semantic concepts (bottom row) with a simple approach based on grid cells.

We term the final version of the algorithm, which combines all of the aforementioned improvements with the automated region extraction, RCSB. For the pseudocode of the entire procedure, see Appendix. We include our implementation at https://github.com/sobieskibj/rcsb.
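A possible implementation of this extraction, assuming a precomputed $(H, W)$ attribution map with $H$ and $W$ divisible by the cell size:

```python
import torch
import torch.nn.functional as F

def extract_region(attr, cell=16, area=0.05):
    """Grid-cell region extraction from a pixel-attribution map (a sketch).

    attr: (H, W) attribution map, e.g., from Integrated Gradients.
    cell: grid cell size c; area: target fraction a of the image to keep.
    """
    h, w = attr.shape
    # Sum the absolute attributions inside each c x c cell
    # (average pooling times the cell area equals the sum).
    cells = F.avg_pool2d(attr.abs()[None, None], cell) * cell**2
    # Keep the highest-valued cells until the fraction `area` is covered.
    k = max(1, int(area * cells.numel()))
    thresh = cells.flatten().topk(k).values.min()
    mask_cells = (cells >= thresh).float()
    # Upsample the cell mask back to pixel resolution -> binary region R.
    return F.interpolate(mask_cells, size=(h, w), mode="nearest")[0, 0]
```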
4 EXPERIMENTS

Following previous works on VCEs for ImageNet, we base the quantitative evaluation on 3 challenging main VCE generation tasks: Zebra → Sorrel, Cheetah → Cougar, and Egyptian Cat → Persian Cat, where each task requires creating VCEs for images from both classes and flipping the decision to their counterparts. We treat this as a general benchmark for evaluating the effectiveness of RCSB in various scenarios. We use FID (↓) and sFID (↓) to assess realism (Heusel et al., 2017), S3 (↑) for representation similarity (Chen & He, 2021), COUT ∈ [-1, 1] (↑) (Khorram & Fuxin, 2022) for sparsity, and Flip Rate (FR) (↑) for efficiency. For qualitative examples, we extend the main tasks with a large array of other tasks, which we show throughout the paper and the Appendix, where more details regarding the experimental setup and the metrics can be found.

Table 1: Quantitative comparison with SOTA. RCSB outperforms previous methods by a large margin across all metrics. The best results are obtained with configuration A ($a = 0.1$, $c = 4$, $s = 3$, $\tau = 0.6$), but the superiority is clear for various configurations, including B ($a = 0.2$, $c = 4$, $s = 1.5$, $\tau = 0.6$) and C ($a = 0.3$, $c = 4$, $s = 1.5$, $\tau = 0.6$).

Zebra → Sorrel

| Method | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|
| ACE l1 | 84.5 | 122.7 | 0.92 | -0.45 | 47.0 |
| ACE l2 | 67.7 | 98.4 | 0.90 | -0.25 | 81.0 |
| LDCE-cls | 84.2 | 107.2 | 0.78 | -0.06 | 88.0 |
| LDCE-txt | 82.4 | 107.2 | 0.71 | -0.21 | 81.0 |
| DVCE | 33.1 | 43.9 | 0.62 | -0.21 | 57.8 |
| RCSB-C | 13.0 | 20.4 | 0.82 | 0.70 | 99.7 |
| RCSB-B | 9.51 | 17.4 | 0.86 | 0.72 | 97.4 |
| RCSB-A | 8.0 | 16.2 | 0.88 | 0.74 | 94.7 |

Cheetah → Cougar

| Method | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|
| ACE l1 | 70.2 | 100.5 | 0.91 | 0.02 | 77.0 |
| ACE l2 | 74.1 | 102.5 | 0.88 | 0.12 | 95.0 |
| LDCE-cls | 71.0 | 91.8 | 0.62 | 0.51 | 100.0 |
| LDCE-txt | 91.2 | 117.0 | 0.59 | 0.34 | 98.0 |
| DVCE | 46.9 | 54.1 | 0.70 | 0.49 | 99.0 |
| RCSB-C | 30.2 | 39.2 | 0.87 | 0.79 | 100.0 |
| RCSB-B | 23.4 | 32.4 | 0.90 | 0.85 | 99.9 |
| RCSB-A | 17.2 | 26.6 | 0.92 | 0.92 | 100.0 |

Egyptian Cat → Persian Cat

| Method | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|
| ACE l1 | 93.6 | 156.7 | 0.85 | 0.25 | 85.0 |
| ACE l2 | 107.3 | 160.4 | 0.78 | 0.34 | 97.0 |
| LDCE-cls | 102.7 | 140.7 | 0.63 | 0.52 | 99.0 |
| LDCE-txt | 121.7 | 162.4 | 0.61 | 0.56 | 99.0 |
| DVCE | 46.6 | 59.2 | 0.59 | 0.60 | 98.5 |
| RCSB-C | 41.1 | 56.3 | 0.79 | 0.82 | 100.0 |
| RCSB-B | 31.3 | 48.1 | 0.84 | 0.87 | 100.0 |
| RCSB-A | 23.0 | 40.0 | 0.87 | 0.92 | 100.0 |

RCSB sets a new SOTA for VCEs. We first verify that synthesizing RVCEs with RCSB leads to a new SOTA in VCE generation. Table 1 quantitatively compares RCSB with recent SOTA approaches to VCEs on ImageNet. Our RVCEs are much more realistic (at least a 2-4x decrease in FID and sFID), stay close to the original images (matching or exceeding the best values of S3), and almost always flip the model's decision (FR ≈ 1.0). RCSB also solves the long-standing challenge of achieving extremely sparse explanations on ImageNet, especially on the Zebra → Sorrel task. While all other methods fail to achieve nonnegative COUT values there, RCSB approaches the upper bound. Our method is clearly the most balanced, as it does not struggle on any specific metric like, e.g., DVCE on S3. In the Appendix, we show that it is also the most computationally efficient.

Figure 5: Qualitative examples obtained with RCSB using automated region extraction. Each task of the form predicted class → target class shows the factual image, the extracted region and the RVCE obtained with RCSB.

Figure 5 shows example explanations obtained with RCSB, greatly highlighting the importance of synthesizing RVCEs instead of standard VCEs. Our region extraction approach is able to precisely localize semantic concepts responsible for the model's decision. For example, in the Guacamole → Cabbage task, RCSB detects the guacamole bowl in the background and, guided by the classifier, infills it with cabbage while leaving the rest of the image unchanged. RCSB is capable of performing a wide range of editing tasks with various levels of difficulty, from textural and color-based edits (e.g., Tench → Goldfish, Mashed Potato → Cauliflower), to partially changing the object's structure (e.g., Limpkin → Flamingo), to infilling the region with new, realistic-looking concepts (e.g., Cougar → Lynx, Green Mamba → Indian Cobra). Most importantly, thanks to the region constraint, our RVCEs greatly limit the potential factors that influenced the model's decision, making the explanations much more interpretable.

RCSB allows for causal inference about the model's reasoning. Drawing definite conclusions about the model's reasoning from an unconstrained VCE is not possible, as one cannot be certain that modifying potentially irrelevant factors did not in fact influence the prediction. RVCEs overcome this limitation when constrained to the region connected with the sole factor of interest, e.g., the body of an animal in a species prediction task. To adapt RCSB to such a scenario, we replace the automated region extraction method with a foundation text-to-object-segmentation model.¹ Using the class name from a given task as the text prompt allows us to obtain highly precise segmentation masks of the relevant objects, enabling the identification of the cause behind the model's prediction change based solely on factors related to the object of interest.

¹ Language Segment Anything (LangSAM) combines the Segment Anything Model (Kirillov et al., 2023) with Grounding DINO (Liu et al., 2023b) to allow object segmentation from text prompts.
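A sketch of how such object-exact regions could be obtained. The `lang_sam` call signature below follows the LangSAM project's published examples, but should be treated as an assumption rather than a stable API; the area-based rejection mirrors the evaluation protocol described next:

```python
from PIL import Image
from lang_sam import LangSAM  # assumed interface of the lang-sam package

def exact_object_region(image_path, class_name, max_area=0.40):
    """Obtain an object-exact binary region R from a text prompt (a sketch).

    Images whose mask covers more than `max_area` of the image are rejected.
    """
    image = Image.open(image_path).convert("RGB")
    model = LangSAM()
    masks, boxes, phrases, logits = model.predict(image, class_name)
    region = masks.any(dim=0).float()  # union of all detected instances
    if region.mean().item() > max_area:
        return None                    # reject overly large regions
    return region
```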
We first quantitatively assess that RCSB is capable of utilizing regions provided by a generic object detector at scale. Table 2(A) shows the results of this evaluation together with the used text prompts. Here, the metrics are computed by first discarding images with a mask that covers an area larger than 40%.

Table 2: Quantitative results from various experiments. A: regions extracted with LangSAM using a text prompt connected to the initial class name. B: regions based on freeform masks that cover a fraction of the total area in the indicated range. C: automatically extracted regions used with adaptations of other inpainting algorithms.

A: Exact regions obtained with LangSAM and prompts "zebra" / "horse", "cheetah" / "cougar", and "cat", respectively

| Task | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|
| Zebra → Sorrel | 32.8 | 41.5 | 0.87 | 0.74 | 98.9 |
| Cheetah → Cougar | 37.2 | 50.6 | 0.91 | 0.84 | 99.4 |
| Egyptian Cat → Persian Cat | 52.0 | 82.8 | 0.81 | 0.84 | 99.2 |

B: Regions based on freeform masks with area in the indicated range

| Mask area | Task | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|---|
| 10-20% | Zebra → Sorrel | 6.7 | 15.0 | 0.85 | 0.85 | 87.6 |
| 10-20% | Cheetah → Cougar | 9.0 | 19.1 | 0.89 | 0.72 | 96.6 |
| 10-20% | Egyptian Cat → Persian Cat | 12.4 | 29.6 | 0.80 | 0.73 | 96.9 |
| 20-30% | Zebra → Sorrel | 7.8 | 15.8 | 0.84 | 0.53 | 92.2 |
| 20-30% | Cheetah → Cougar | 11.6 | 21.3 | 0.88 | 0.71 | 99.6 |
| 20-30% | Egyptian Cat → Persian Cat | 17.7 | 34.0 | 0.78 | 0.74 | 99.3 |

C: Ablation study with adaptations of other inpainting algorithms

| Method | Task | FID ↓ | sFID ↓ | S3 ↑ | COUT ↑ | FR ↑ |
|---|---|---|---|---|---|---|
| RePaint | Zebra → Sorrel | 63.8 | 76.0 | 0.55 | 0.77 | 99.3 |
| RePaint | Cheetah → Cougar | 129.3 | 144.2 | 0.50 | 0.77 | 99.0 |
| RePaint | Egyptian Cat → Persian Cat | 148.7 | 175.2 | 0.38 | 0.76 | 99.5 |
| MCG | Zebra → Sorrel | 43.2 | 55.6 | 0.73 | 0.45 | 96.0 |
| MCG | Cheetah → Cougar | 76.6 | 91.4 | 0.74 | 0.64 | 100.0 |
| MCG | Egyptian Cat → Persian Cat | 93.7 | 117.5 | 0.62 | 0.65 | 99.9 |
| DDRM | Zebra → Sorrel | 42.5 | 49.4 | 0.69 | 0.72 | 99.6 |
| DDRM | Cheetah → Cougar | 60.5 | 68.4 | 0.72 | 0.76 | 100.0 |
| DDRM | Egyptian Cat → Persian Cat | 59.2 | 73.0 | 0.63 | 0.76 | 100.0 |

Despite I2SB being trained on masks covering at most 30% of the image area, we observed that it generalizes well beyond this threshold, with 40% starting to pose a challenge. Crucially, despite the regions being classifier-agnostic and hence not necessarily focused on the most influential pixels, Table 2(A) indicates that RCSB is versatile enough to maintain most of the performance of the automated approach. The efficiency, sparsity and representation similarity of the obtained RVCEs remain very close to the values achieved by the closest configuration (in terms of hyperparameters) from Table 1, as the region area is often close to or exceeds 30%. The slight increase in FID and sFID stems mainly from the regions covering complex objects, whose modification may naturally move RVCEs further from the original data at a distribution level, and from a lower number of images used to compute these metrics (as both are sensitive to sample size) due to the rejection of samples by the area constraint.

Regions that contain exactly the objects of interest provide novel insights about the model's reasoning. For example, consider the Lemon → Orange task from Fig. 6, where the lemons were correctly identified by the ResNet50 model. One would require the VCE for this task to indicate the sole determining factor of "why lemons and not oranges". However, with unconstrained VCEs, this identification process quickly becomes incomprehensible due to small changes added to each object in the image, such as other fruits. By constraining VCEs to the region occupied by the lemons, the reasoning process can be disentangled and simplified, as one can now look for this factor in the modifications of the lemons only. In this case, RCSB is guided by the classifier to change the image in a way that is consistent with human intuition.

RVCEs also allow for clarifying the model's decision-making when its reasoning is not initially understandable. In the Volcano → Seashore task, the image shows both objects, while the model predicts the former. Applying RCSB to the exact region of the seashore results in an RVCE that changes the model's decision when the water's color becomes more light blue and structures like stones start to appear.
Hence, one is able to better understand what the model actually identifies as a seashore. In other examples, the method introduces class-specific characteristics when the changes are constrained precisely and exclusively to the object of interest, assuring the receiver about the general cause of the model's decision change. Such cases are also especially relevant when the generative model used to synthesize explanations is prone to systematic errors like, e.g., SGMs struggling to correctly generate hands. In the Night Snake → Kingsnake task, this error can be bypassed with the region constraint by not allowing the generative model to affect anything other than the animal, hence avoiding the evaluation of the classifier on out-of-manifold samples.

Discovering complex patterns with interactive RVCEs. Despite the impressive capabilities of deep models in object localization, the receiver of the explanation may be interested in testing the model for highly abstract and complex concepts that cannot be localized automatically and must be provided manually by the user. We begin by verifying the capability of RCSB to generate RVCEs based on user-defined regions by simulating such a scenario at scale. Specifically, we randomly match images from the main tasks with regions given by the 10%-20% and 20%-30% freeform masks from the I2SB training data (Saharia et al., 2022). We argue that this serves as a very challenging benchmark, since the algorithm's access to the most influential pixels (for the classifier) might often be very restricted. Despite the task's difficulty, the quantitative results in Table 2(B) highlight the versatility of RCSB, which is able to effectively utilize the restricted resources to influence the classifier's prediction. While S3, COUT and FR are not significantly different from previous results, we observe a decrease in FID and sFID, indicating higher realism and closeness to the data distribution. This is largely due to the fact that freeform masks are often not connected to entire complex objects and do not contain the pixels most important to the classifier. Hence, RCSB may often leave large portions of the regions unchanged, which boosts the realism evaluation.

Figure 6: Qualitative examples obtained with RCSB using exact regions extracted with LangSAM using the text prompt of the predicted class. For each task of the form predicted class → target class, a factual image together with the used region and the resulting RVCE are shown. The used text prompts are emphasized.

To allow for true interaction of the user with the explanatory process, we implement a simple interface that allows for manual image segmentation using a brush-like cursor. Figure 7 shows example results, where we manually predefine the regions on different images. This exploration gives important insights about the added value provided by RVCEs. In the Cat → Tiger task, we discover that the classifier's decision can be flipped by independently modifying either the cat's paws or snout, in both cases introducing a tiger's coloration. Similarly, in the Arctic Fox → Red Fox task, choosing either the ears and muzzle or the paws and stomach area allows for changing the model's decision with the features of a red fox. User-defined regions also allow discovering unusual reasoning patterns of the model. In the Cucumber → Zucchini task, the model's decision can be influenced by modifying only one of the cucumbers to a zucchini, leaving the other unchanged.
This observation connects with recent position papers on the contextual and spatial understanding of predictive models (Tomaszewska & Biecek, 2024), providing new rationale for further exploring how image classifiers actually reason.

Figure 7: Qualitative examples obtained with RCSB from user-defined regions. For each task of the form predicted class → target class, a factual image together with the provided regions is shown. Arrows point to RVCEs obtained by modifying only the indicated region.

Ablating RCSB's components. We empirically verified that combining our novel guidance mechanism with the I2SB prior leads to highly effective RVCEs. To better understand the benefits provided by each component of our framework, we perform an ablation study, where we adapt the proposed improvements to SGM-based inpainters, aiming to assess the influence of the guidance scheme and I2SB in isolation. Specifically, we pick RePaint (Lugmayr et al., 2022), one of the first adaptations of SGMs to inpainting, and MCG (Chung et al., 2022) and DDRM (Kawar et al., 2022), two different adaptations of SGMs to linear inverse problems, which also include inpainting. We manually tune our guidance scheme for each method on a small subset of images and repeat the same evaluation protocol with the automated region extraction method (see Appendix for details of each adaptation). As these methods are much less compute-efficient, we cap their computational budget on each task at 24 A100 GPU hours.

Table 2(C) shows the results of the ablation study. Despite the fact that the used methods were never explicitly trained for inpainting, combining them with our guidance mechanism and region extraction allows for matching or even exceeding previous SOTA. For example, all adaptations achieve very high sparsity, almost always flip the classifier's decision and keep the explanation close to the original. This indicates the benefits of utilizing only the pixels from the extracted region and a proper utilization of the classifier's gradients without biasing them with additional components like LPIPS or $\ell_2$ loss. RCSB differentiates itself from the adaptations with a much higher realism of the obtained RVCEs (significantly lower FID and sFID), more balanced results and a much smaller computational burden, e.g., 24x fewer NFEs than RePaint. These benefits stem from the I2SB prior, which is trained to map corrupted images directly to clean samples, and from the resulting trajectory being much closer to the data manifold, allowing the classifier to more effectively influence the inpainting process.

5 DISCUSSION & LIMITATIONS

RVCEs offer a new perspective on the concept of VCEs, with RCSB effectively demonstrating their versatility in various scenarios that have not been explored in previous work. In view of this, enforcing a hard region constraint, potentially chosen independently of the predictive model, introduces novel challenges and raises important questions. For instance, the explanations do not reveal changes in the interactions between different objects in the image that influence the model's decision. Furthermore, due to the absence of ground truth, verifying the actual reasoning of a model based on the explanation remains difficult, even if RVCEs appear intuitive. Additionally, the evaluation process of RVCEs may be skewed by the preservation of a large portion of the original pixels (e.g., FID).
We address several of these issues in the Appendix, including two user studies on the general usefulness of RVCEs, the possibility of interacting with the explanatory process, and their informativeness regarding model misclassifications. We also present extended qualitative and quantitative results for other classifiers, datasets, and attribution methods, empirical demonstrations of some of RCSB's capabilities (e.g., shape changes), and other key aspects. We believe that the limitations of RVCEs and RCSB offer valuable directions for future research.

ACKNOWLEDGMENTS

This work was financially supported by the Polish National Centre for Research and Development (NCBiR, xLungs grant no. INFOSTRATEG-I/0022/2021-00). The computational resources were provided by the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science, Warsaw University of Technology.

REFERENCES

Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, volume 12. Elsevier, 1982.

Maximilian Augustin, Valentyn Boreiko, Francesco Croce, and Matthias Hein. Diffusion visual counterfactual explanations. In Advances in Neural Information Processing Systems, 2022.

Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. Analyzing and explaining image classifiers via diffusion guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. In PLoS ONE, 2015.

Przemyslaw Biecek and Wojciech Samek. Position: Explain to question not to justify. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 3996-4006, 2024.

Valentyn Boreiko, Maximilian Augustin, Francesco Croce, Philipp Berens, and Matthias Hein. Sparse visual counterfactual explanations in image space. In DAGM German Conference on Pattern Recognition, pp. 133-148. Springer, 2022.

Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67-74. IEEE, 2018.

Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750-15758, 2021.

Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Advances in Neural Information Processing Systems, 2022.

Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023a.

Hyungjin Chung, Jeongsol Kim, and Jong Chul Ye. Direct diffusion bridge using data consistency for inverse problems. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Li Deng. The MNIST database of handwritten digit images for machine learning research. In IEEE Signal Processing Magazine, 2012.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

Logan Engstrom, Andrew Ilyas, Hadi Salman, Shibani Santurkar, and Dimitris Tsipras. Robustness (Python library). https://github.com/MadryLab/robustness, 2019.

Karim Farid, Simon Schrodi, Max Argus, and Thomas Brox. Latent diffusion counterfactual explanations. arXiv preprint arXiv:2310.06668, 2023.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376-2384. PMLR, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Andreas Holzinger, Anna Saranti, Christoph Molnar, Przemyslaw Biecek, and Wojciech Samek. Explainable AI methods: a brief overview. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pp. 13-38. Springer, 2022.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Paul Jacob, Eloi Zablocki, Hedi Ben-Younes, Mickaël Chen, Patrick Pérez, and Matthieu Cord. STEEX: Steering counterfactual explanations with semantics. In European Conference on Computer Vision, pp. 387-403. Springer, 2022.

Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Diffusion models for counterfactual explanations. In Proceedings of the Asian Conference on Computer Vision, pp. 858-876, 2022.

Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Adversarial counterfactual visual explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16425-16435, 2023.

Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Text-to-image models for counterfactual explanations: a black-box approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4757-4767, 2024.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.

Saeed Khorram and Li Fuxin. Cycle-consistent counterfactuals by latent transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10203-10212, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2019.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015-4026, 2023.

Andreas Kirsch. An Introduction to the Mathematical Theory of Inverse Problems. Applied Mathematical Sciences. Springer, New York, 2nd edition, 2011.

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020.

Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T. Freeman, Phillip Isola, Amir Globerson, Michal Irani, et al. Explaining in style: Training a GAN to explain a classifier in StyleSpace. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 693-702, 2021.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.

Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos Theodorou, Weili Nie, and Anima Anandkumar. I2SB: Image-to-Image Schrödinger Bridge. In International Conference on Machine Learning, pp. 22042-22062. PMLR, 2023a.

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022, 2021.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976-11986, 2022.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461-11471, 2022.

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

Robert J. McCann. A convexity principle for interacting gases. In Advances in Mathematics, 1997.

Franz Motzkus, Christian Hellert, and Ute Schmid. CoLa-DCE: Concept-guided latent diffusion counterfactual explanations. arXiv preprint arXiv:2406.01649, 2024.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.

Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619-10629, 2022.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144, 2016.

Herbert E. Robbins. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and Basic Theory, 1992.

Pau Rodríguez, Massimo Caccia, Alexandre Lacoste, Lee Zamparo, Issam Laradji, Laurent Charlin, and David Vazquez. Beyond trivial counterfactual explanations with diverse valuable explanations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1056-1065, 2021.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pp. 234-241. Springer, 2015.

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1-10, 2022.

Jonathan Scarlett, Reinhard Heckel, Miguel R. D. Rodrigues, Paul Hand, and Yonina C. Eldar. Theoretical perspectives on deep learning methods in inverse problems. In IEEE Journal on Selected Areas in Information Theory, 2022.

Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. In Annales de l'institut Henri Poincaré, volume 2, pp. 269-310, 1932.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 2022.

Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. Generating interpretable counterfactual explanations by implicit minimisation of epistemic and aleatoric uncertainties. In International Conference on Artificial Intelligence and Statistics, pp. 1756-1764. PMLR, 2021.
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Journal of Computer Vision, 2020.

Sheng-Min Shih, Pin-Ju Tien, and Zohar Karnin. GANMEX: One-vs-one attributions using GAN-based model explainability. In International Conference on Machine Learning, pp. 9592-9602. PMLR, 2021.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145-3153. PMLR, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, Workshop Track Proceedings, 2014.

Sumedha Singla, Brian Pollack, Junxiang Chen, and Kayhan Batmanghelich. Explanation by progressive exaggeration. In International Conference on Learning Representations, 2020.

Bartlomiej Sobieski and Przemysław Biecek. Global counterfactual directions. In European Conference on Computer Vision, 2024.

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations, Workshop Track Proceedings, 2015.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319-3328. PMLR, 2017.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Jayaraman Thiagarajan, Vivek Sivaraman Narayanaswamy, Deepta Rajan, Jia Liang, Akshay Chaudhari, and Andreas Spanias. Designing counterfactual generators using deep model inversion. In Advances in Neural Information Processing Systems, 2021.

Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, and Yu-Gang Jiang. Deeper insights into the robustness of ViTs towards common corruptions. arXiv preprint arXiv:2204.12143, 2022.

Paulina Tomaszewska and Przemysław Biecek. Position paper: Do not explain (vision models) without context. In International Conference on Machine Learning, 2024.

Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, and Magda Gregorova. GradCheck: Analyzing classifier guidance gradients for conditional diffusion sampling. arXiv preprint arXiv:2406.17399, 2024.

Arnaud Van Looveren and Janis Klaise. Interpretable counterfactual explanations guided by prototypes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 650-665. Springer, 2021.

Nina Weng, Paraskevas Pegios, Aasa Feragen, Eike Petersen, and Siavash Bigdeli. Fast diffusion-based counterfactuals for shortcut removal and generation. In European Conference on Computer Vision, 2024.
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pp. 818-833. Springer, 2014.

Mehdi Zemni, Mickaël Chen, Eloi Zablocki, Hédi Ben-Younes, Patrick Pérez, and Matthieu Cord. OCTET: Object-aware counterfactual explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15062-15071, 2023.

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-595, 2018.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.

CONTENTS

1 Introduction
2 Background & Related Work
3 Method
4 Experiments
5 Discussion & Limitations
A Pseudocode
B Limitations
C Extended Background & Related Work
D Comparison to previous works
E Extended Method
  E.1 Additional figures
  E.2 Incorporating the classifier's signal
  E.3 Analytic posterior and OT-ODE
F Extended Experiments
  F.1 Details of individual experiments
  F.2 Metrics description
  F.3 Adaptation of other inpainting algorithms
    F.3.1 Manifold Constrained Gradient (MCG, Chung et al. (2022))
    F.3.2 Denoising Diffusion Restoration Models (DDRM, Kawar et al. (2022))
    F.3.3 RePaint (Lugmayr et al., 2022)
  F.4 Schedulers for guidance scale
  F.5 Quantitative evaluation of other attribution methods
  F.6 Diversity assessment
  F.7 Computational efficiency assessment
  F.8 Additional quantitative results
  F.9 Freeform masks
  F.10 Unintuitive classes
  F.11 Shape modification
  F.12 Lower-level attributions
  F.13 Other classifiers
  F.14 Other benchmarks
G User studies
  G.1 I: Usefulness and interaction
  G.2 II: Understanding model's failures
H Qualitative examples

The Appendix is structured as follows. Appendix A shows pseudocode for both I2SB (stochastic and deterministic versions) and our RCSB. Appendix B delves deeper into the possible limitations of RVCEs and RCSB.
Appendix C includes additional background knowledge connected with I2SB and an extensive literature review regarding topics connected with our work. Appendix D compares our approach to prior methods for VCE generation in detail. Appendix E shows additional figures from the method s description, considerations regarding the incorporation of the classifier s signal into I2SB and more detailed derivation of the OT-ODE version. Appendix F extends our experimental evaluation with details about the setup, additional results regarding, e.g., efficiency and diversity, other datasets and classifiers, and concludes with details about the adaptation of different inpainting algorithms. Appendix G provides details about the conducted user studies. Appendix H provides qualitative examples for 7 additional classifiers, showing the versatility of RCSB, together with more RVCEs for the Res Net50 model. Published as a conference paper at ICLR 2025 A PSEUDOCODE Algorithm 1 Standard I2SB Generation 1: Input: x N p1(x N), trained sψ( , ) 2: for n = N to 1 do 3: Predict ˆx0(xn) using sψ(xn, tn) 4: xn 1 p(xn 1 | ˆx0, xn) according to DDPM 5: end for 6: return x0 Algorithm 2 OT-ODE I2SB Generation 1: Input: x N p1(x N), trained sψ( , ) 2: for n = N to 1 do 3: Predict ˆx0(xt) using sψ(xn, tn) 4: xn 1 = µn 1ˆx0 + µn 1xn 5: end for 6: return x0 Algorithm 3 RCSB 1: Input: Number of steps N, binary region mask R, trajectory truncation τ, classifier scale s, input image x , trained sψ( , ), trained classifier f(y | ), target class y 2: x1 = (1 R) x + R z, where z N(z; 0, I) 3: Discretize truncated timeline 0 = t0 < t1 < < t N = τ 4: x N q(x N|x0, x1) # sample from analytic posterior (Eq. (15)) 5: for n = N to 1 do 6: Predict ˆx0(xn) using sψ(xn, tn) 7: gn = xn log f(y | ˆx0) 8: gn = ADAM(gn) 9: if n == N then g = g N 2 # register norm of the first gradient 10: end if 11: xn = xn + s gn g 12: xn 1 = µn 1ˆx0 + µn 1 xn 13: end for 14: return x0 Algorithm 4 ADAM Update Rule 1: Input: Gradient at step n gn, hyperparameters α, ϵ, β1, β2 (set to Py Torch (Paszke et al., 2019) defaults) 2: mn = β1mn 1 + (1 β1)gn # update biased first moment estimate 3: vn = β2vn 1 + (1 β2)g2 n # update biased second moment estimate 4: ˆmn = mn/(1 βn 1 ) # compute bias-corrected first moment 5: ˆvn = vn/(1 βn 2 ) # compute bias-corrected second moment 6: gn = α ˆmn/( ˆvn + ϵ) # update gradient 7: return gn # return updated gradient Published as a conference paper at ICLR 2025 Figure 8: Visual difference between xt and its corresponding Tweedie s estimate ˆx0(xt) across different timesteps. Figure 9: Influence of manipulating τ on the final RVCE obtained with the region shown in Fig. 3. B LIMITATIONS Despite setting new quantitative records, our approach comes with natural limitations that must be mentioned. From the user perspective, RVCEs generated with RCSB, especially when the region is visually appealing and extracted independently from the classifier, may lead to overconfidence about the model s decision-making process. As an example, consider the bird s image from Fig. 1 and the RVCE based on the region of its head. Because our approach allows for changing the model s decision by using solely the area of the head, one could interpret that the new features are explicitly and exclusively responsible for the new prediction. However, the situation may be much more complex. 
For example, the appearance of some specific features of the head, e.g., a red eye, may influence the model conditionally on some other features which were already present in the image. Since no ground truth to a counterfactual explanation exists (at least in non-synthetic scenarios), the exact relationships between different aspects of the image remain unknown. Hence, one must be careful when interpreting any kind of counterfactual explanation, especially RVCEs, to not draw incorrect and misleading conclusions. Moreover, modifying a given image to a counterfactual explanation, in particular to an RVCE, does not mean that the model is not relying on some unintuitive shortcuts or spurious correlations which were not modified in the process. Hence, one has to remember that VCEs aim at identifying the minimal semantic change required to change the model s decision, and are not guaranteed to modify every feature that the model relies on. These aspects highlight the importance and need of principled evaluation measures for this kind of sample-based explanations. While attribution methods have been heavily addressed in this context in recent years, the evaluation of VCEs remains a very difficult challenge. In terms of practical limitations, our experiments are based on an I2SB model trained for an inpainting task based on freeform masks with coverage of 20%-30% of the total area. While the model generalizes well to larger fractions of the total image area on the order of 40%-45%, its performance deteriorates above this threshold. From a theoretical point of view, the I2SB algorithm is not limited to any particular upper bound on the total area, but demonstrating that it is possible to obtain good performance over a total area of the order of, e.g., 90%, remains an open research question that has important implications for our work. We believe this to be an interesting direction for future research with many possible extensions. Published as a conference paper at ICLR 2025 C EXTENDED BACKGROUND & RELATED WORK Visual counterfactual explanations. In recent years, increasing attention is being paid to synthesizing VCEs for image classifiers (Goyal et al., 2019; Schut et al., 2021). These explanations aim at elucidating the model s reasoning by modifying the input image in a semantically minimal and meaningful way while flipping its prediction. Utilizing generative models for this task has historically proven to be very effective (Chang et al., 2019; Singla et al., 2020; Lang et al., 2021). Non-SGM-based methods include works like Thiagarajan et al. (2021) which builds on the concept of deep inversion, OCTET from Zemni et al. (2023) focusing on VCEs for complex scenes and more examples built on top of generative models (Rodr ıguez et al., 2021; Jacob et al., 2022; Shih et al., 2021; Zhao et al., 2018; Van Looveren & Klaise, 2021). While offering impressive results (Farid et al., 2023; Jeanneret et al., 2024; Augustin et al., 2024; Motzkus et al., 2024), we argue that utilizing general foundation models like SD in the VCE generation task may cause misleading conclusions, since the explained classifiers are trained on much smaller datasets than the generative model. For example, about 1 million images from Image Net (Deng et al., 2009) are used to train the classifier, while SD is trained on 5 billion images from LAION-5B (Schuhmann et al., 2022). 
This discrepancy may naturally lead to SD synthesizing realistically looking variations of a given image that flip the classifier s decision but simultaneously include semantic attributes never observed by the predictive model during training. Therefore, one may question the counterfactual nature of the explanation, as the classifier should not be expected to correctly treat attributes that it was never close to observing. Hence, in this work, we focus on generative models trained on the same data as the classifier of interest. This way, we can study the behavior of predictive models in a faithfull manner, which is an open challenge for XAI community Biecek & Samek (2024). Inverse problems. Inverse problems (Kirsch, 2011) are defined as the task of recovering an unknown signal x based on a measurement y related via a measurement model H through y = H(x), where H is not necessarily required to be linear or bijective. Hence, for a given measurement y, there may exist a probability distribution over possible solutions p(x | y = H(x)). One special case of an inverse problem is image inpainting, where the missing area of an image, indicated by the mask M, must be infilled using the available context. The measurement model is then defined as H(x) = M x, where denotes an element-wise product and M is a binary mask. In recent years, deep learning methods have proven to be very effective at solving various kinds of inverse problems (Scarlett et al., 2022). Recently, utilizing generative methods, especially SGMs, established itself as the new SOTA approach in the image domain. One way of adapting SGMs to inverse problems is through conditional generation, where the conditional score can be derived with the measurement model. Many additional techniques, such as data consistency (Chung et al., 2023b), manifold constraint (Chung et al., 2022; 2023a) and others (Kawar et al., 2022), are further utilized to improve this adaptation. Image-to-Image Schr odinger Bridges. A much harder but possibly also much more effective approach is to learn direct mappings between the distribution of signals x p0 and measurements y p1 instead of adapting pretrained models. In this line of research, Liu et al. (2023a) propose to learn such mappings by constructing a tractable subclass of Schr odinger bridges (SBs, Schr odinger (1932)), termed Image-to-Image Schr odinger Bridges (I2SBs). The SB is an entropy-regularized optimal transport model, which, resembling the framework of SGMs, considers the following forward and backward SDEs: dxt = (Ft(xt) + βt xt log Ψ(xt, t))dt + p dxt = (Ft(xt) βt xt log bΨ(xt, t))dt + p βtd w, (4b) Similarly to SGMs, the marginal densities of Eqs. (4a) and (4b) are equivalent. The functions Ψ, bΨ C2,1(Rd, [0, 1]) represent time-varying energy potentials and are additionally constrained to solve the following partial differential equations ( Ψ(xt,t) 2β Ψ b Ψ(xt,t) t = ( bΨF) + 1 s.t. Ψ(x0, 0) bΨ(x, 0)=p0(x), Ψ(x, 1) bΨ(x, 1)=p1(x) (5b) Published as a conference paper at ICLR 2025 In general, numerically solving Eqs. (4a) and (4b) is much more difficult compared to SGMs due to nonlinear terms Ψ, bΨ being coupled via Eq. (5b). However, with Theorem 3.1, Liu et al. (2023a) show an important connection between the frameworks of SBs and SGMs. We repeat it here explicitly for later reference. Theorem 1 (Reformulating SB drifts as score functions (Liu et al., 2023a)) If bΨ, Ψ fulfill the constraints given by Eq. 
(5), then xt log bΨ(xt, t), xt log Ψ(xt, t) are the score functions of the following linear SDEs, respectively dxt = Ft(xt)dt + p βtdw, x0 bΨ( , 0), (6) dxt = Ft(xt)dt + p βtd w, x1 Ψ( , 1). (7) Crucially, Theorem 1 states that, while bΨ, Ψ are not in general assumed to be valid probability distributions, it is true that log bΨ = log p6 and log Ψ = log p7 for p6, p7 representing the densities of the respective SDEs. Following this theoretical result, Liu et al. (2023a) also show a principled approach for approximating xt log bΨ(xt, t) with a neural network sψ. In essence, these results allow to train direct inverse problem solvers with the use of paired data, where p0 represents a clean data distribution and p1 the distribution of its corrupted measurements. Additionally, Liu et al. (2023a) show how I2SB connects with flow-based optimal transport (OT) (Peyr e & Cuturi, 2019; Mc Cann, 1997), where assuming that βt 0 leads to an ordinary differential equation (ODE) dxt = vt(xt | x0)dt that provides a deterministic mapping with the use of sψ estimate. In practive, this is achieved by eliminating the noise from the intermediate sampling steps (see Algorithm 2). Visual attribution methods. The very first works in the current era of Explainable Artificial Intelligence (XAI) were concerned with providing explanations of the model s decision through visual heatmaps, which highlighted pixels considered important to the its prediction. One of the first approaches by Simonyan et al. (2014) proposed simple backpropagation of the model s output w.r.t. the input, indicating the direction of its greatest ascent in the pixel space, often termed as Saliency. More sophisticated approaches emerged in the following years, where techniques like Layer-wise Relevance Propagation (LRP, Bach et al. (2015)), Integrated Gradients (IG, Sundararajan et al. (2017)), Deep Lift and Input Gradient (Shrikumar et al., 2017), Guided Backpropagation (Guided Backprop, Springenberg et al. (2015)), Grad CAM (Selvaraju et al., 2020), Deconvolution (Zeiler & Fergus, 2014) and others that utilize the gradient of the neural network promised to indicate more semantically meaningful concepts in a less noisy manner. Concurrent line of research about perturbation-based methods assumed a more general black-box scenario, where explanations could be provided for a broader class of models. There, methods like Occlusion (Zeiler & Fergus, 2014), Local Interpretable Model-agnostic Explanations (LIME, Ribeiro et al. (2016)), SHapley Additive ex Planations (SHAP, Lundberg & Lee (2017)) and its variations quickly advanced the stateof-the-art. In this work, we utilize their unified implementations provided by the Captum package (Kokhlikyan et al., 2020). D COMPARISON TO PREVIOUS WORKS Due to space limits, we only briefly mention the previous diffusion-based approaches to VCE generation in Section 2. In the following, we provide more details about their theoretical formulations and how they relate to our approach. Most of the previously published diffusion-based approaches to VCEs rely on the conditional reverse process obtained by replacing the unconditional score xt log p(xt, t) in Eq. (1b) with the conditional variant of the form xt log p(xt, t) + xt log p(y | xt, t). This formulation allows one to incorporate the classifier-based conditioning in various forms. Jeanneret et al. 
(2022) (Di ME) propose to approximate the likelihood score xt log p(y | xt, t) by fully denoising the image at step t and using it to obtain the classifier s gradient leading to quadratic complexity with respect to the total number of timesteps. The work of Augustin et al. (2022) (DVCE) instead propose to regularize the gradients of the explained classifier with the one coming from a robustly trained classification network. Specifically, the gradient of the former is projected onto a cone around the direction indicated by the gradient of the latter with some predetermined angle. The follow-up work of Jeanneret et al. Published as a conference paper at ICLR 2025 (2023) (ACE) proposes to split the explanation generation into two distinct phases, with the first one responsible for generating the pre-explanation, and the second one performing post-processing. The former combines standard diffusion denoising with a PGD attack performed with the use of the classifier on xt for each t. Then, the latter phase computes the absolute difference between the original image and the pre-explanation, and extracts a binary mask by thresholding this difference. Next, it uses the Re Paint (Lugmayr et al., 2022) algorithm to unconditionally inpaint the masked region beginning from some intermediate timestep. To address the low efficiency of Di ME, Weng et al. (2024) (Fast Di ME) propose to improve the conditioning process by utilizing at each timestep t the gradient of the classifier with respect to the Tweedie s estimate. By default, their approach performs dynamic masking throughout the generation process which indicates the region modified at each timestep. Moreover, the authors propose two 2-step extensions of their approach, namely Fast Di ME-2 and Fast Di ME-2+, that first perform either standard Fast Di ME or Fast Di ME without dynamic masking, then extract the most differing regions and utilize the binary mask resulting from the largest changes to conditionally inpaint it with the use of the classifier. Importantly, ACE and Fast Di ME bear resemblence to our approach in terms of performing some variant of conditional inpainting. Specifically, ACE first combines the classifier with diffusion denoising to extract a region important to the predictive model, but inpaints it unconditionally. Fast Di ME (and its variants) also find a region that is regarded as important to the classifier, but inpaint it conditionally. Our approach solves a more general problem by synthesizing RVCEs which do not assume any dependence of the predetermined region to the classifier of interest, which we show through a large array of experiments with regions coming from sources like automated segmentation or user interaction. Moreover, while all of the previously mentioned works are concerned with adapting standard SGMs to either guided generation or inptainting, we show how a more general class of models (tractable Schr odinger Bridges) can be adapted to the problem of conditional inpainting with the use of a classifier. Additionally, we propose a series of improvements that better align its gradients with the generative trajectory that were not previously present in this line of research connected to XAI, such as ADAM stabilization or adaptive normalization. 
Finally, to the best of our knowledge, our approach is the first one to show that guidance can be performed with the signal coming solely from the classifier of interest, omitting the usage of additional proxy measures (like l2 loss or LPIPS) that maintain similarity to the original image. It it also important to higlight that, while very simple, our automated region extraction approach was also not present in this line of research and, through experimental evaluation, was shown to provide highly interpretable and intuitive regions. E EXTENDED METHOD E.1 ADDITIONAL FIGURES We provide illustrative examples for the visual differences between xt and the Tweedie s estimate ˆx0(xt) in Fig. 8. For the effect of manipulating the τ hyperparameter, see Fig. 9 E.2 INCORPORATING THE CLASSIFIER S SIGNAL In the following, we explicitly define the DDPM (Ho et al., 2020) sampler mentioned in Algorithm 1 and elaborate on the exact way of incorporating the classifier s gradients into the generation process of I2SB. Denote by {ti}i {0,...,N} the discrete sequence of timesteps of length N such that 0 = t0 < t1 < < t N = 1. By σ2 n = R tn 0 βτdτ and σ2 n = R 1 tn βτdτ, we denote the variances accumulated from each side. Additionally, let α2 n 1 = R tn tn 1 βτdτ be the variance accumulated between two consecutive timesteps. For ease of notation, we define µn 1 and µn 1 as µn 1 = α2 n 1 α2 n 1 + σ2 n 1 , (8) µn 1 = σ2 n 1 α2 n 1 + σ2 n 1 . (9) Published as a conference paper at ICLR 2025 With that, we can define the DDPM posterior sampler as xn 1 p(xn 1|ˆx0, xn), (10) xn 1 N µn 1ˆx0 + µn 1xn, α2 n 1σ2 n 1 α2 n 1 + σ2 n 1 I , (11) where ˆx0 = ˆx0(xn) denotes the Tweedie s estimate obtained with sψ, i.e., a trained I2SB. When using the OT-ODE version of I2SB, we replace sampling from the posterior with a deterministic version by following the mean, which yields the update rule xn 1 = µn 1ˆx0 + µn 1xn. (12) By converting the Tweedie s estimate to the conditional score using Eq. (3) and applying Bayes Theorem, we are left with xn 1 = µn 1 xn + σ2 n xn log p(xn, n | y) + µn 1xn = µn 1 xn + σ2 n xn log p(xn, n) + σ2 n xn log p(y | xn, n) + µn 1xn, (13) where xn log p(xn, n) can be approximated by a standard I2SB network trained on the task of inpainting. By manipulating Eq. (13) further, one can arrive at the following update rule xn 1 = µn 1 xn + σ2 n xn log p(xn, n) + µn 1σ2 n xn log p(y | xn, n) + µn 1xn = µn 1 xn + σ2 n xn log p(xn, n) + µn 1 µn 1σ2 n µn 1 xn log p(y | xn, n) + xn = µn 1 xn + σ2 n xn log p(xn, n) + µn 1 (cn xn log p(y | xn, n) + xn) . Here, we explicitly define the time-dependent coefficient cn. While plugging xn log f(y | xn) in place of xn log p(y | xn, n) in Eq. (13) is the most intuitive, we empirically verified that replacing cn xn log p(y | xn, n) with xn log f(y | xn) leads to more semantically meaningful results. Practically, this can be explained by µn 1 achieving its highest values at the end of the generation process, effectively incorporating the classifier s signal to the highest extent in the final steps of the generation. Since we are interested in influencing the generative trajectory with the classifier f along the entire process (and possibly decreasing its influence to the greatest possible extent in the final timesteps to avoid adversarial changes), it seems intuitive that incorporating f into Eq. (14) allows for obtaining more meaningful RVCEs. 
This is due to µn = 1 µn, meaning that the classifier s signal is amplified in the beginning of the generation and decreased in the final steps. This intervention also explains the effectiveness of the introduced improvements, as they break the independence of the classifier s signal between consecutive steps, practically incorporating the time-dependent coefficient cn into the gradient alignment. E.3 ANALYTIC POSTERIOR AND OT-ODE Following the original work of Liu et al. (2023a) (I2SB), the analytic posterior from the forward stochastic process, which governs the mapping between a given boundary pair (x0, x1), is defined as q(xt|x0, x1) = N µ(x0, x1, t) = x0 + t(x1 x0), Σt = αt(1 t)I , (15) where by default α = 1. To arrive at the OT-ODE version of I2SB, one must use α 0, effectively reducing q to a Dirac delta distribution centered at µ(x0, x1, t). F EXTENDED EXPERIMENTS We follow the evaluation protocol from previous works for VCEs on Imagenet, which, for a given task, uses all images from the training subset correctly predicted by the evaluated model. For Res Net50, this results in around 2000 images per task. We extract the results of other methods from the work of Farid et al. (2023), except the DVCE method (Augustin et al., 2022), which evalutes with a protocol that we were not able to fully reproduce. Hence, to ensure fair comparison, we adapted the implementation of DVCE to our evaluation. Specifically, we utilize the multiple-norm Published as a conference paper at ICLR 2025 robust Res Net50 from the work of Boreiko et al. (2022), which the authors of DVCE propose as default, to achieve VCEs for the Res Net50 model. In terms of hyperparameters, we fine-tuned them with grid search on a subset of Zebra Sorrel task and used s = 18.0 as the guidance scale for the non-robust Res Net50, since it performed the best. For I2SB, we utilize the original checkpoint from Liu et al. (2023a) trained on 20 30% freeform masks from Saharia et al. (2022). While the checkpoint trained on the 10 20% variant is also available and verified to work within our framework, we discovered that the former generalizes well to smaller area values. Hence, for the sake of simplicity, we utilize the 20 30% version only. By default, we use NFE=100, which we explored the most, but lower NFE regimes provided promising initial results. For the automated region extraction, we use IG by default, but evaluate 10 other attribution methods in Appendix F.5. F.1 DETAILS OF INDIVIDUAL EXPERIMENTS Fig. 3: Each improvement, together with the naive approach, is evaluated on the zebra-sorrel task with around 2000 images from Image Net training set (following the protocol from the main experimental evaluation). Each image is initially predicted as either zebra or sorrel by the Res Net50 (He et al., 2016) model and the decision must be flipped to the opposite class. FID is computed between the obtained explanations and original images. The hyperparameter values used for all improvements are: a = 0.3, s = 1.0 (except ADAM stabilization with s = 1e 2), c = 16, τ = 1.0 (except trajectory truncation, where τ = 0.6). Table 2: A: For all tasks, we use s = 1.5 and τ = 0.4 (to better preserve the original content). As we cannot control the area of masks provided by Lang SAM, hyperparameters a and c are not applicable in this scenario. Images with masks covering area greater than 40% of the total image are discarded from the evaluation to ensure that we only use meaningful RVCEs. 
B: Across all tasks, the 10% 20% experiment uses configuration B from Table 1, while the 20% 30% experiment uses configuration C. Hyperparameters a and c are not applicable, since masks are provided automatically from the mentioned dataset. C: Each inpainting algorithm is given a 24 A100 GPU hours time budget, resulting in around 2000 images for DDRM, 800 images for MCG and 400 images for Re Paint on each task. Details of their adaptations are provided separately in Appendix F.3. F.2 METRICS DESCRIPTION In the following, we provide detailed description of each metric used in the quantitative evaluation. FID and s FID (realism). Following works on image synthesis, measuring the realism of the obtained explanations at a distribution level is often done with FID and s FID (Heusel et al., 2017). Specifically, FID compares a set of real (r) and generated (g, in this case, the explanations) images by first extracting their corresponding features from the Inception V3 network (Szegedy et al., 2016) and then computing FID = ||µr µg||2 + Tr Σr + Σg 2 (ΣrΣg)1/2 , (16) where µr, µg are the mean vectors and Σr, Σg are the covariance matrices of the respective distributions in the feature space. As comparing original images with their edited versions (e.g., explanations) may bias the metric with original pixels mostly unchanged, artificially boosting the realism evaluation, s FID first divides the sets into folds and averages FID over the independent counterparts. S3 (representation similarity). Explanations should also resemble original images from a representation respective. Here, following the work of Jeanneret et al. (2023), we compute average Sim Siam Similarity (S3) over a set of original images and the resulting counterfactuals. Specifically, S3 utilizes a Sim Siam network (Chen & He, 2021), which encodes both the factual and counterfactual images into their respective representations rf, rcf and computes the cosine similarity as S3 = rf rcf ||rf||2 ||rcf||2 . (17) COUT (sparsity). In the context of VCEs, sparsity is understood as perturbing a minimal number of pixels to flip the model s decision. To quantify this criterion, the COUnterfactual Transition (COUT) Published as a conference paper at ICLR 2025 metric computes COUT =AUPCy AUPCy , 1 2(f(k | x(m)) f(k | x(m+1))) (18) where x(0) is the factual image, x(M) the resulting VCE, and y , y are the class labels predicted by f for x(0), x(M) respectively. In practice, COUT measures how fast the classifier s decision changes when interpolating between the original and the explanation, but the interpolation is defined as inserting pixels to the original image according to the extent (absolute value) of change observed in the VCE through M steps. COUT is typically reported as an average over a set of samples. Flip Rate (efficiency). A major criterion for a VCE method is its efficiency, understood as the ability to effectively flip the model s decision. For a set of triplets {x i , xi, yi}I i=1, where x i is the original image and xi is the resulting VCE targeted to flip f s decision to yi, Flip Rate (FR) is defined as the fraction of cases which correctly flipped the decision to the target class, i.e., i=1 1(arg max y f(y | xi) = yi). (19) F.3 ADAPTATION OF OTHER INPAINTING ALGORITHMS In this subsection, we describe the adaptation of each inpainting algorithm from our ablation study. For each method, we follow the notation from its corresponding original work to make the description easier to follow. 
F.3.1 MANIFOLD CONSTRAINED GRADIENT (MCG, CHUNG ET AL. (2022)) MCG iteratively denoises (inpaints) the missing parts with the following two-step update (Equations 14 and 15, Chung et al. (2022)): x t 1 = f(xi, sθ) xt W(y Hˆx0(xt) 2 2 + g(xt)z, z N(0, I) (20) xi t = Ax t 1 + b (21) where Eq. (20) is a manifold constraint update and Eq. (21) is a data consistency step. As described by the authors, both steps are crucial to ensure that the gradient of the measurement term stays on the manifold and to deal with the potential deviation from the measurement consistency. Since f(xt, sθ) implicitly predicts the mean µt and variance σt at each step t, related to the underlying SDE dynamics, we apply our guidance scheme by modifying the original f(xi, sθ) = µt + σtz, z N(0, I) (22) by adding properly scaled (according to the relationship between the likelihood score and mean) conditioning: f (xi, sθ) = µt + σtz + s σ2 t gn g , z N(0, I) (23) where gn and g are obtained as described in Algorithm 3. F.3.2 DENOSING DIFFUSION RESTORATION MODELS (DDRM, KAWAR ET AL. (2022)) DDRM considers the SVD decomposition of the measurement model matrix H in the linear noisy inverse problem y = Hx + σyz, z N(0, I), (24) Published as a conference paper at ICLR 2025 where σy is the standard divination of the measurement noise. For the task of inpaiting, the H matrix is a diagonal matrix with either 0 or 1 on the diagonal indicating available and missing pixels. Hence, its SVD decomposition simplifies to using identity matrices in place of U and V. The main contribution of DDRM is that it provides a way to include the information from that decomposition and observation y into the generative process, which the authors summarize in Equations 7 and 8 in the original work. The method uses a trained denoising network to obtain a prediction of x0 at timestep t, denoted as xθ,t. In order to include the additional information from the classifier into the DDRM framework, we modify that prediction to include the model s gradients by replacing the update rule xθ,t = V T xθ,t (25) with xθ,t = V T (xθ,t + s g where g and g are obtained as described in Algorithm 3. F.3.3 REPAINT (LUGMAYR ET AL., 2022) Re Paint performs the task of inpainting by modifying the standard denoising process, where, at each timestep t, the network s input is composed of noised pixels known from the original input sampled directly from q and the unknown noisy pixels predicted by the network in the previous timestep. Additionally, to harmonize the two parts of the image, Re Paint samples xt directly from q(xt|xt 1) and repeats the forward procedure. By default, this resampling scheme is repeated 20 times for each of the standard diffusion steps. In order to incorporate the information from the classifier, we modify the unconditional mean of the posterior pθ in the denosing step to conditional one, effectively replacing the mean predictor εθ(Xt, t) (27) shown in the 7-th step of Algorithm 1 from the original work (Lugmayr et al., 2022) with εθ(Xt, t) + s σ2 t g g (28) where g and g are obtained as described in Algorithm 3. F.4 SCHEDULERS FOR GUIDANCE SCALE Figure 10 visualizes example schedulers used throughout the development of our method. Our adaptive normalization technique empirically outperformed all tested schedulers. (a) Interval scheduler (b) Exponential scheduler (c) Gaussian scheduler Figure 10: Visualization of more complex schedulers used throughout the development of our method. 
F.5 QUANTITATIVE EVALUATION OF OTHER ATTRIBUTION METHODS To pick a default attribution method for RCSB, we evaluated it on the Zebra Sorrel using the RCSBB hyperparameter configuration for 11 different attribution methods shown in Table 3. Based on these results, we chose Integrated Gradients (Sundararajan et al., 2017) as the default, since it provides the most balanced performance. Published as a conference paper at ICLR 2025 Zebra Sorrel Attribution method FID s FID S3 COUT FR LRP 7.5 15.5 0.87 0.62 93.6 Input XGradient 9.0 16.8 0.87 0.73 97.8 Deep Lift 9.2 17.0 0.87 0.73 97.9 Integrated Gradients 9.5 17.4 0.86 0.72 97.4 Gradient Shap 10.5 18.5 0.87 0.74 97.4 LIME 12.9 20.7 0.85 0.55 88.4 Guided Backprop 13.8 21.49 0.86 0.72 96.5 Occlusion 13.9 21.7 0.86 0.50 86.0 Grad CAM 14.1 22.15 0.85 0.52 87.1 Guided Grad CAM 15.1 22.5 0.86 0.71 96.1 Saliency 15.2 23.0 0.86 0.75 98.4 Table 3: Quantitative evaluation of 11 attribution methods (described in Appendix C) on the Zebra Sorrel task following our evaluation protocol. F.6 DIVERSITY ASSESSMENT RCSB utilizes the OT-ODE version of the I2SB, which provides a deterministic mapping between the noisy image and the resulting RVCE. The source of randomness comes from the Gaussian noise inserted into the image in the place of missing pixels at the beginning of the inpainting process. In order to examine the diversity of the generated RVCEs, we followed the evaluation procedure from the work of Jeanneret et al. (2023). In essence, we compute the mean pair-wise LPIPS metric between two runs with different seeds (used in generation of the Gaussian noise) for our three main configurations of hyperparameters RCSBA, RCSBB and RCSBC. For each run, 256 RVCEs were generated. Results are shown in Table 4. Naturally, decreasing the area hyperparameter limits the extent of possible changes, leading to a decrease in diversity. Picking a = 0.3 results in diversity comparable to values reported by previous works, e.g., Jeanneret et al. (2023). Zebra Sorrel Cheetah Cougar Egyptian Cat Persian Cat RCSBA 0.067 0.060 0.065 RCSBB 0.092 0.096 0.095 RCSBC 0.129 0.140 0.137 Table 4: Diversity evaluation using 256 images for each task from our experimental protocol. F.7 COMPUTATIONAL EFFICIENCY ASSESSMENT What connects previous SOTA SGM-based methods with our work is the use of large U-Net (Ronneberger et al., 2015) checkpoints for the denoising network with the number of hyperparameters far exceeding (e.g., 10 ) the size of the utilized classifier, effectively dominating the computational burden. Hence, to ensure fair comparison, Table 5 shows the number of Neural Function Evaluations (NFEs) used by each method to produce a single explanation, divided into a. model (U-Net, classifier and other) b. and forward / backward passes, where the backward pass is around 2 more computationally demanding than the forward pass. Importantly, this comparison eliminates the differences stemming from the utilized hardware and optimality of the implementation, and is fair as each method consumes virtually the same amount of GPU memory. One exception to the use of the standard U-Net model is the LDCE method of Farid et al. (2023), which applies it in the latent space of an autoencoder. However, as each latent U-Net step also requires decoding the image with the decoder, the computational demand stays similar to the standard approach. 
As indicated by Table 5, RCSB is the most efficient approach, both in terms of balancing the use of the U-Net and the classifier, and the number of forward/backward passes. Importantly, the other category shows non-zero numbers only for the DVCE method, which additionaly uses the gradients of a robust classifier in the generation process. The high number of forward/backward passes through the classifiers in DVCE stems from applying them to a set of 16 augmented versions of xt at each timestep. Published as a conference paper at ICLR 2025 Inpainting method U-Net Classifier Other forward backward forward backward forward backward RCSB 100 100 100 100 0 0 LDCE 191 191 191 191 0 0 DVCE 200 200 1600 1600 1600 1600 ACE 520 500 25 25 0 0 DDRM 200 200 200 200 0 0 MCG 1000 1000 1000 1000 0 0 Re Paint 2410 2410 2410 2410 0 0 Table 5: Number of NFEs for each respective method with details about the model type and forward/backward passes. F.8 ADDITIONAL QUANTITATIVE RESULTS For most visually appealing results, we found a 0.1 0.15, c 8 16, τ 0.3 0.6 and s 2 3 to perform the best. These hyperparameters were used to create RVCEs for Fig. 5. To assess that the performance of these configurations does not deviate from best configurations of Table 1, we followed the same evaluation protocol on the most challenging Zebra Sorrel task and include the results in Table 6. Crucially, the performance stays virtually the same when comparing with Table 1. Area a Cell size c Guidance scale s Trajectory truncation τ FID s FID S3 COUT FR 0.2 10.1 18.4 0.92 0.79 95.8 0.3 10.7 19.0 0.91 0.76 95.0 0.4 10.8 18.9 0.90 0.74 94.3 0.2 11.0 19.2 0.91 0.81 97.0 0.4 11.4 19.4 0.89 0.77 96.0 0.3 11.6 19.7 0.90 0.79 96.2 0.2 11.7 19.8 0.91 0.83 97.2 0.4 12.3 20.2 0.88 0.79 96.7 0.3 12.4 20.4 0.89 0.80 96.2 0.4 11.2 19.2 0.89 0.77 97.7 0.5 12.1 20.0 0.88 0.74 96.5 0.6 12.7 20.4 0.86 0.70 94.8 0.4 13.5 21.3 0.87 0.82 99.5 0.5 13.9 21.7 0.86 0.80 99.2 0.6 14.2 21.7 0.85 0.78 98.6 0.4 15.3 22.9 0.86 0.84 99.6 0.5 15.5 23.1 0.85 0.82 99.7 0.6 15.8 23.5 0.83 0.81 99.4 Table 6: Quantitative results for hyperparameters that provide the most visually appealing results. F.9 FREEFORM MASKS Figure 11 presents example RVCEs for the sorrel zebra task from the experiments based on freeform masks with quantitative results in Table 2(B). We observe that RCSB focuses on modifying the intersection of the randomly assigned mask with features that should intuitively be important to the classifier, while leaving the unimportant parts, like background and sky, mostly unchanged. This provides additional justification for high performance of RCSB despite not exclusively focusing on the most important regions. F.10 UNINTUITIVE CLASSES Figure 12 shows RVCEs obtained for unintuitive class pairings. Interestingly, RCSB is able to largely preserve the realism of the explanations, while providing unusual compositions of objects, e.g. placing a maltese dog in place of cauliflower. Published as a conference paper at ICLR 2025 F.11 SHAPE MODIFICATION While VCEs, and RVCEs in particular, are focused on providing the minimal semantic change that modifies the classifier s decision, one may be interested in obtaining new content that deviates further from the original. Figure 13 shows example RVCEs focused on modifying the original image to a larger extent, leading to shape and contour changes. 
By increasing the area (a) of the region constraint and the trajectory truncation (τ), RCSB allows for weaker preservation of the original content, leading to more visible changes. Modifying the shape of the original objects is possible in various scenarios. For example, in the white stork black stork task, despite the fact that the bird s color is the dominant differentiating feature, guiding RCSB with the classifier of interest can also lead to large shape changes. This is also visible in the vulture flamingo task, where the bird s legs appear thinner and longer, while its modified head points to an opposite direction. Moreover, Image Net contains task that are mostly characterized by shape differences rather than color or texture changes. Figure 13 shows that the primary characteristic of a pretzel can be easily modified to change the model s decision to bagel. The same can be seen when changing the decision from hatchet to hammer, where the latter s back part becomes longer than the original. Moreover, objects like paperknife and spoon can be easily modified with RCSB to be predicted as spoon and ladle respectively through realistic shape changes. To assess the effectiveness of RCSB in tasks, where the classifier s decision should mostly depend on the objects shape rather than texture or color, we evaluate it on another three tasks: pretzel bagel, hatchet hammer and paperknife wooden spoon. To compare with previous SOTA on Image Net, we evaluate DVCE in the same scenario. The hyperparameters of both methods and experimental details follow those from the main evaluation protocol in Table 1. Results from Table 7 show that, despite the different nature of the considered tasks, RCSB preserves its performance and advantage over DVCE from Table 1. Pretzel Bagel Hatchet Hammer Paperknife Wooden spoon Method FID s FID S3 COUT FR FID s FID S3 COUT FR FID s FID S3 COUT FR DVCE 34.3 43.9 0.59 0.37 77.4 31.2 39.8 0.66 0.43 92.8 29.1 35.4 0.69 0.41 88.2 RCSBA 11.4 22.9 0.86 0.84 97.2 9.8 15.4 0.91 0.89 97.8 9.9 18.2 0.86 0.88 98.9 Table 7: Quantitative results for tasks focused on characteristics connected to shape instead of texture or color. RCSB is compared to DVCE, which is regarded as current SOTA on Image Net. We also address the topic of shape modification further in Appendix F.14, where RCSB is evaluated on the MNIST dataset. In grayscale handwritten digits, individual classes are primarily identified based on the shape of samples, hence serving as a proper proof-of-concept benchmark for evaluating the understanding of shapes by the method. F.12 LOWER-LEVEL ATTRIBUTIONS To verify whether RCSB is able to effectively synthesize RVCEs based on pixels that are considered less important to the classifier, we extract the regions using our automated approach by first zeroing out the absolute attributions above some quantile q before converting it to a binary mask with the approach mentioned at the end of Section 3. Table 9 shows the results of this experiment performed on the three main tasks from Image Net, where we follow the default protocol from the main part of this paper and use the RCSBA hyperparameter configuration from Table 1. We evaluate the RVCEs resulting from q {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, where the extracted region covers 10% of total image area. Crucially, the presented results showcase that RCSB s is largely preserved despite using pixels which should influence the classifier much more weakly. 
FID and s FID both decrease monotonically with respect to q, which indicates that an increasing number of pixels from the background gets modified, resulting in a smaller number of changed data characteristics. S3 remains mostly unchanged, meaning that the representational similarity is not influenced by varying q. Intuitively, both COUT and FR also decrease when picking smaller q. This is because RCSB is not able to utilize the most influential pixels for the classifier, which makes the task harder, thus lowering both the sparsity (changing most influential pixels) and efficiency of flipping the model s decision. Published as a conference paper at ICLR 2025 q Zebra Sorrel Cheetah Cougar Egyptian Cat Persian Cat Metric FID s FID S3 COUT FR FID s FID S3 COUT FR FID s FID S3 COUT FR 0.9 5.2 13.8 0.87 0.62 91.6 6.1 17.1 0.92 0.88 99.8 13.5 31.4 0.86 0.89 99.9 0.8 4.6 13.3 0.87 0.53 88.0 4.4 15.7 0.91 0.80 99.1 11.0 28.6 0.85 0.84 99.3 0.7 3.9 12.7 0.88 0.48 85.7 3.5 15.1 0.91 0.74 97.4 9.2 26.9 0.85 0.79 98.5 0.6 3.6 12.4 0.89 0.45 82.4 2.9 14.6 0.92 0.68 94.3 7.8 25.5 0.85 0.73 96.1 0.5 3.6 12.4 0.89 0.40 78.5 2.7 14.5 0.92 0.65 93.1 6.9 24.7 0.86 0.69 94.6 0.4 3.7 12.3 0.89 0.38 77.7 2.8 14.6 0.92 0.65 93.4 6.6 24.5 0.87 0.65 93.1 0.3 3.9 12.6 0.89 0.40 79.2 3.1 14.9 0.92 0.69 95.7 7.5 25.3 0.86 0.67 94.2 Table 8: Quantitative results for lower-level attributions on three main tasks from Image Net. Here, q denotes the value of the quantile above which absolute attributions are zeroed out before extracting a region with 10% area coverage. F.13 OTHER CLASSIFIERS In Table 9, we provide quantitative results for other classifiers mentioned at the end of the experimental evaluation. The experimental setting follows the same protocol as the one from Table 1 and we use the RCSBA configuration. Note that the results are consistent with those for Res Net50 in Table 1 for all classifiers except the robust Madry Res Net50. Since we use a single hyperparameter configuration, this is probably due to the different nature of the model, and the results could be easily improved by tuning a specific configuration for it. Zebra Sorrel Classifier FID s FID S3 COUT FR Clip Zero Shot 4.13 12.76 0.90 0.93 100.0 Conv Ne Xt Base 15.69 23.55 0.82 0.84 99.8 Madry Res Net50 47.49 55.22 0.65 0.19 36.2 RBDei T 10.00 17.76 0.83 0.70 94.0 RBXCi T 16.04 23.45 0.79 0.46 83.6 Swin B 3.20 12.19 0.94 0.50 88.0 VGG16 7.29 15.39 0.88 0.84 98.0 VGG16 BN 5.44 13.53 0.91 0.87 99.9 Vi TB16 8.60 16.84 0.86 0.80 98.9 Table 9: Quantitative results for other classifiers evaluated on the most challenging (out of the three considered) zebra sorrel task from Image Net. F.14 OTHER BENCHMARKS We extend the evaluation of RCSB with three additional datasets: Celeb A-HQ (Karras et al., 2018) with 30 000 samples of 256 256 resolution face images, Celeb A (Liu et al., 2015) with around 200 000 samples of 128 128 resolution face images, and MNIST (Deng, 2012) with 70 000 samples of 32 32 resolution images of handwritten digits. The first two datasets are chosen to compare with all previously published diffusion-based approaches that did not evaluate on Image Net, effectively complementing our experimental results. While both of these datasets contain face images, they pose unique challenges, since Celeb A-HQ contains much less samples ( 6 ) which are also of higher resolution, making the predictive tasks very different. The MNIST dataset was chosen to provide additional proof of the versatility of RCSB. 
While this dataset is no longer in active use as an evaluation benchmark, we include it here to show that even on the data of much different nature and resolution, RCSB is still able to provide meaningful and informative RVCEs that must focus on modifying the shape of a digit, since there is no notion of texture present in the data. For both Celeb A and Celeb A-HQ, we follow previous works and provide explanations for the Dense Net121 (Huang et al., 2017) model trained in a multilabel scenario consisting of 40 distinct attributes. We consider two tasks evaluated in prior works, i.e., flipping the smile and age classes to the opposite prediction. For MNIST, we train Le Net (Lecun et al., 1998) from scratch using the default training and validation splits. We note that, for each considered dataset, I2SB must be trained independently for the task of inpainting (on 20% 30% freeform masks). We start its training using a pretrained diffusion checkpoint a. from Lugmayr et al. (2022) for Celeb A-HQ with default training hyperparameters from Liu et al. (2023a) and 40 000 iterations, b. from Jeanneret et al. (2022) Published as a conference paper at ICLR 2025 for Celeb A with default training hyperparameters from Liu et al. (2023a) and 100 000 iterations and c. from scratch on MNIST. To provide a more comprehensive comparison, we also adapt the implementation of Di ME (Jeanneret et al., 2022) to Image Net and Fast Di ME to both Celeb A-HQ and Image Net. We first tune the hyperparameters of both methods with a large grid search on a small subset of images, and then evaluate the best configuration using the standard protocol for Image Net (i.e., the same as our method) and 2048 samples from Celeb A-HQ. Both Di ME and Fast Di ME are implemented with the use of the same checkpoints that the training of I2SB starts from, i.e. from the work of Lugmayr et al. (2022) for Celeb A-HQ and Dhariwal & Nichol (2021) for Image Net. Method FID s FID S3 COUT FR Zebra Sorrel Di ME 222.85 243.16 0.19 0.31 0.0 Fast Di ME 96.48 103.45 0.22 0.44 14.0 RCSBA 8.0 16.2 0.88 0.74 94.7 Cheetah Cougar Di ME 268.22 291.99 0.11 0.16 0.0 Fast Di ME 133.01 141.12 0.12 0.11 18.0 RCSBA 17.2 26.6 0.92 0.92 100.0 Egyptian Cat Persian Cat Di ME 322.79 352.08 0.44 0.05 0.0 Fast Di ME 193.63 207.12 0.10 0.01 20.0 RCSBA 23.0 40.0 0.87 0.92 100.0 Table 10: Quantitative results on the Image Net dataset for Di ME and Fast Di ME. In terms of evaluation measures, we follow previous works (Jeanneret et al., 2022; 2023; Weng et al., 2024) and utilize FVA, FS (Cao et al., 2018), MNAC (Rodr ıguez et al., 2021) and CD (Jeanneret et al., 2022) in addition to metrics used on Image Net. Tables 11 and 12 show the quantitative results achieved by all considered methods on Celeb A and Celeb A-HQ. Importantly, RCSB is able to outperform the current SOTA in many cases. For example, it provides new records for COUT, MNAC and FR on all considered (dataset, class) pairs. While high COUT and FR are expected based on the method s performance on Image Net, where it is able to efficiently generate very sparse RVCEs (with respect to the classifier), low MNAC additionaly shows that RCSB focuses on a small subset of face attributes and leaves others mostly unmodified. Our approach is also able to obtain very low FID and s FID values, often performing worse than ACE only, which indicates that the obtained explanations preserve the realism of the original samples. We include example RVCEs obtained with RCSB on Celeb A-HQ in Fig. 14. 
Method FID s FID FVA FS MNAC CD COUT FR FID s FID FVA FS MNAC CD COUT FR Di VE 29.4 - 97.3 - - - - - 33.8 - 98.1 - 4.58 - - - Di VE100 36.8 - 73.4 - 4.63 2.34 - - 39.9 - 52.2 - 4.27 - - - STEEX 10.2 - 96.9 - 4.11 - - - 11.8 - 97.5 - 3.44 - - - ACE ℓ1 1.27 3.97 99.9 0.87 2.94 1.73 0.78 97.6 1.45 4.12 99.6 0.78 3.20 2.94 0.72 96.2 ACE ℓ2 1.90 4.56 99.9 0.87 2.77 1.56 0.62 84.3 2.08 4.62 99.6 0.80 2.94 2.82 0.56 77.5 Di ME 3.17 4.89 98.3 0.73 3.72 2.30 0.53 97.0 4.15 5.89 95.3 0.67 3.13 3.27 0.44 99.0 Fast Di ME 4.18 6.13 99.8 0.76 3.12 1.91 0.44 99.0 4.82 6.76 99.2 0.74 2.65 3.80 0.36 98.6 Fast Di ME-2 3.33 5.49 99.9 0.77 3.06 1.89 0.44 99.4 4.04 6.01 99.6 0.75 2.63 3.80 0.37 99.3 Fast Di ME-2+ 3.24 5.23 99.9 0.79 2.91 2.02 0.41 98.9 3.60 5.59 99.7 0.77 2.44 3.76 0.32 98.7 RCSB 2.98 4.79 100.0 0.91 2.24 2.78 0.87 99.8 2.94 4.94 99.9 0.88 2.14 3.63 0.81 99.3 Table 11: Quantitative results on the Celeb A dataset. We extract the results of other methods (Rodr ıguez et al., 2021; Jacob et al., 2022; Jeanneret et al., 2022; 2023) from the work of Weng et al. (2024). Regarding the MNIST dataset, we provide example RVCEs obtained with RCSB using the automated region extraction for various tasks in Fig. 15. Crucially, the presented samples show that, despite the complexity of the detected regions, RCSB is able to properly modify the shape of the initial digit to change the classifier s decision. As can be observed, this is performed through a hybrid Published as a conference paper at ICLR 2025 Method FID s FID FVA FS MNAC CD COUT FR FID s FID FVA FS MNAC CD COUT FR Di VE 107.0 - 35.7 - 7.41 - - - 107.5 - 32.3 - 6.76 - - - STEEX 21.9 - 97.6 - 5.27 - - - 26.8 - 96.0 - 5.63 - - - Di ME 18.1 27.7 96.7 0.67 2.63 1.82 0.65 97.0 18.7 27.8 95.0 0.66 2.10 4.29 0.56 97.0 ACE ℓ1 3.21 20.2 100.0 0.89 1.56 2.61 0.55 95.0 5.31 21.7 99.6 0.81 1.53 5.4 0.40 95.0 ACE ℓ2 6.93 22.0 100.0 0.84 1.87 2.21 0.60 95.0 16.4 28.2 99.6 0.77 1.92 4.21 0.53 95.0 LDCE 13.6 25.8 99.1 0.76 2.44 1.68 0.34 - 14.2 25.6 98.0 0.73 2.12 4.02 0.33 - Fast Di ME-2+ 16.51 31.4 99.9 0.87 1.43 4.16 0.28 87.1 26.0 40.3 99.6 0.81 3.15 4.36 0.31 92.6 RCSB 3.04 20.0 100.0 0.93 1.22 3.22 0.83 98.9 4.92 27.3 100.0 0.96 1.47 5.16 0.80 99.4 Table 12: Quantitative results on the Celeb A-HQ dataset. We extract the results of other methods from the work of Farid et al. (2023) and Jeanneret et al. (2023) except Fast Di ME which we implement and evaluate ourselves. approach, where some parts of the initial digit remain the same, while new parts appear to either combine the existing elements of the digit or create entirely new ones. G USER STUDIES The main goal of VCEs, and RVCEs in particular, is to explain the model s reasoning to humans. This capability can be evaluated from various perspectives. In the following, we provide a detailed analysis of two independently conducted user studies, with the first one focused on the general usefulness of RVCEs in understanding the model s decision-making and the potential benefits stemming from a possible interaction of humans with the explanation creation process, and the second one concerned with a specific use-case, where RVCEs are used to inform the user about the causes of model s misclassification and what must be changed in a given image for it to predict correctly. For both studies, the model of interest is the Res Net50 used in main experimental evaluation and the samples are extracted from Image Net. 
G.1 I: USEFULNESS AND INTERACTION In this study, 15 participants with background knowledge in machine learning (at the level of MSc studies, not aware of the research conducted for this paper) were presented with comparisons of VCEs and RVCEs for the same factual images, together with absolute differences between the original and the explanation. An example of such comparison is included in Fig. 16. The participants were asked about which type of explanation is more useful in understanding the model s decisionmaking. Here, 86.6% answered in favor of RVCEs. Moreover, each user was asked to provide the reasons for their judgement. The answers generally focused on the semantic change being more localized, easier to interpret and better aligned with human intuition. The second part of this study focused on evaluating the added value provided by the possibility of interacting with the explanation creation process through manual region specification. First, the participants were shown a default interaction with VCEs, i.e., the original image and its VCE were presented to users with no possibility of interacting with it. Then, the participants were presented with the process of manual region specification for which an RVCE was generated, and offered the possibility to provide the region themselves. After that, each participant was asked whether the interactive process may be helpful and more useful in obtaining a better understanding of the model s reasoning than the standard scenario. There, 93.3% of participants answered in favor of the interactive process. They were also asked to justify their choice, and the answers generally focused on the possibility of verifying regions that align with human understanding and incorporating domain experts that would be able to more thoroughly analyze the model. In both parts, RVCEs were extracted from the figures in the main part of the manuscript and DVCE was used to generate standard VCEs, since this method provided the best quantitative results prior to our work. G.2 II: UNDERSTANDING MODEL S FAILURES In this study, a different group of 11 participants with background knowledge in machine learning at a similar level took part in evaluating whether RVCEs help in identifying the reasons for model s Published as a conference paper at ICLR 2025 misclassifications. The study consisted of a general introduction to the problem of explaining deep classifiers with VCEs, the concept of RVCEs and the two-fold goal of the study: to understand why a given model misclassifies the image and to identify the minimal semantic change required to correct the prediction. Then, the participants took part in 5 variations of the same experiment, where in each case a different misclassification was shown. The experiment began with a presentation of the factual image, the initially predicted class, the correct class and two sets of images, each representing instances of one of the two classes taken randomly from the web (see Fig. 17(left) for an example introduction to the experiment). Then, each participant observed the original image, the region constraint from our automated approach and the resulting series of RVCEs (see Fig. 17(right) for an example). After each of the 5 variants of this experiment, the participants were asked whether they were able to identify the semantic features that the model lacked in its initial prediction and that lead to correcting the decision once they appeared on the image. 
The response was, on average, positive in 80.02% of the cases with 14.56% standard deviation. In each case, they were also asked to describe these features. Here, the answers almost always aligned with what the RVCEs were introducing to the image. After all experiments, the participants were asked if RVCEs are able to indicate the semantic features that were missing in the beginning for the model to predict correctly (90.9% positive answers), whether they better understood the initial misclassification (90.9% positive answers) and if they judge RVCE as a useful tool in explaining the model s decision-making (100% positive answers). The conducted user studies highlight that the concept of RVCEs and the application of RCSB to their generation is preferred by the users in comparison to standard VCEs. Our explanations are found to be more useful and helpful in explaining the model s decision-making. The possibility of interacting with the explanation creation is also enjoyed and recognized to improve the explanatory process by almost all participants. Moreover, our method helped the users in obtaining a better understanding of the classifier s failure cases and its potential causes. H QUALITATIVE EXAMPLES We provide additional qualitative examples for different scenarios. Results for other classifiers, obtained with the automated region extraction, are depicted in Fig. 21 (VGG16, Simonyan & Zisserman (2015)), Fig. 22 (VGG16 with Batch Normalization (BN), Simonyan & Zisserman (2015)), Fig. 23 (Conv Ne Xt Base, Liu et al. (2022)), Fig. 24 (Vi T-B/16, Dosovitskiy et al. (2021)), Fig. 25 (Swin B, Liu et al. (2021)) , Fig. 26 (robust Madry Res Net50, Engstrom et al. (2019)), Fig. 27 (robust Tian Dei T, Tian et al. (2022)), Fig. 28 (zero-shot CLIP classifier, Radford et al. (2021)). We used a = 0.2, τ = 0.7, c = 16, s = 1.5 universally across all additional classifiers, showcasing the versatility of RCSB. Additional examples for Res Net50 are shown in Figs. 29 to 32 (automated region extraction with different hyperparameters), Fig. 33 (exact regions from Lang SAM) and Fig. 34 (user-defined regions). Published as a conference paper at ICLR 2025 Figure 11: Example RVCEs obtained in the sorrel zebra task from the experiments based on freeform masks with quantitative results in Table 2(B). The columns show the factual image, the region to which changes are constrained and the resulting RVCE. Figure 12: RVCEs for unusual class pairings. Each task of the form initial class target class depicts the factual image, the region constraint and the resulting explanation respectively. Published as a conference paper at ICLR 2025 Figure 13: RVCEs focused on modifying the shape of objects within the region constraint obtained with the automated extraction approach. By picking larger area (a) and trajectory truncation (τ), the preservation of the original content can be reduced, leading to more diverse infills that change the classifier s decision. Published as a conference paper at ICLR 2025 Figure 14: Qualitative examples of RVCEs obtained with RCSB in smiling not smiling and young old tasks on Celeb A-HQ. Figure 15: Qualitative examples of RVCEs obtained with RCSB in various digit-flipping tasksk on the MNIST dataset. 
Figure 16: An example comparison of VCEs and RVCEs (labeled Type A and Type B) presented to participants of user study I. Each column triplet shows the original image, the explanation and their absolute difference, together with the model's predictions.

Figure 17: An example introduction (left) and evaluation (right) from one of the experiments conducted in user study II.

Figure 18: Qualitative comparison to other VCE generation methods: DiME (Jeanneret et al., 2022), FastDiME (Weng et al., 2024), LDCE (Farid et al., 2023), DVCE (Augustin et al., 2022) and ACE (Jeanneret et al., 2023). For each explanation, the absolute difference from the factual image is additionally provided.

Figure 19: Qualitative comparison to other VCE generation methods: DiME (Jeanneret et al., 2022), FastDiME (Weng et al., 2024), LDCE (Farid et al., 2023), DVCE (Augustin et al., 2022) and ACE (Jeanneret et al., 2023). For each explanation, the absolute difference from the factual image is additionally provided.

Figure 20: Qualitative comparison to other VCE generation methods: DiME (Jeanneret et al., 2022), FastDiME (Weng et al., 2024), LDCE (Farid et al., 2023), DVCE (Augustin et al., 2022) and ACE (Jeanneret et al., 2023). For each explanation, the absolute difference from the factual image is additionally provided.

Figure 21: Extended qualitative evaluation of automated region extraction for the VGG16 (Simonyan & Zisserman, 2015) classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 22: Extended qualitative evaluation of automated region extraction for the VGG16-BN (Simonyan & Zisserman, 2015) classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 23: Extended qualitative evaluation of automated region extraction for the ConvNeXt-Base (Liu et al., 2022) classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 24: Extended qualitative evaluation of automated region extraction for the ViT-B/16 (Dosovitskiy et al., 2021) classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 25: Extended qualitative evaluation of automated region extraction for the Swin-B (Liu et al., 2021) classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 26: Extended qualitative evaluation of automated region extraction for the Madry ResNet50 (Engstrom et al., 2019) l2-norm robust classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 27: Extended qualitative evaluation of automated region extraction for the Tian DeiT (Tian et al., 2022) corruption-robust classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE(s) on the right.
Figure 28: Extended qualitative evaluation of automated region extraction for the CLIP ViT-B/32 (Radford et al., 2021) zero-shot classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 29: Extended qualitative evaluation of automated region extraction with c = 4, a = 0.1 for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 30: Extended qualitative evaluation of automated region extraction with c = 4, a = 0.3 for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 31: Extended qualitative evaluation of automated region extraction with c = 4, a = 0.2 for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE on the right.

Figure 32: Extended qualitative evaluation of automated region extraction with c = 8, a = 0.2 for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE(s) on the right.

Figure 33: Extended qualitative evaluation of exact regions obtained with LangSAM for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE(s) on the right.

Figure 34: Extended qualitative evaluation of user-defined regions for the ResNet50 classifier. For each task, the factual image is shown on the left, with the used region in the middle and the generated RVCE(s) on the right.

REFERENCES

Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

Maximilian Augustin, Valentyn Boreiko, Francesco Croce, and Matthias Hein. Diffusion visual counterfactual explanations. In Advances in Neural Information Processing Systems, 2022.

Maximilian Augustin, Yannic Neuhaus, and Matthias Hein. Analyzing and explaining image classifiers via diffusion guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 2015.

Przemyslaw Biecek and Wojciech Samek. Position: Explain to question not to justify. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pp. 3996–4006, 2024.

Valentyn Boreiko, Maximilian Augustin, Francesco Croce, Philipp Berens, and Matthias Hein. Sparse visual counterfactual explanations in image space. In DAGM German Conference on Pattern Recognition, pp. 133–148. Springer, 2022.

Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. IEEE, 2018.

Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In International Conference on Learning Representations, 2019.
Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758, 2021.

Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. In Advances in Neural Information Processing Systems, 2022.

Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations, 2023a.

Hyungjin Chung, Jeongsol Kim, and Jong Chul Ye. Direct diffusion bridge using data consistency for inverse problems. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 2012.

Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

Logan Engstrom, Andrew Ilyas, Hadi Salman, Shibani Santurkar, and Dimitris Tsipras. Robustness (Python library). https://github.com/MadryLab/robustness, 2019.

Karim Farid, Simon Schrodi, Max Argus, and Thomas Brox. Latent diffusion counterfactual explanations. arXiv preprint arXiv:2310.06668, 2023.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384. PMLR, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

Andreas Holzinger, Anna Saranti, Christoph Molnar, Przemyslaw Biecek, and Wojciech Samek. Explainable AI methods – a brief overview. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pp. 13–38. Springer, 2022.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Paul Jacob, Eloi Zablocki, Hédi Ben-Younes, Mickaël Chen, Patrick Pérez, and Matthieu Cord. STEEX: Steering counterfactual explanations with semantics. In European Conference on Computer Vision, pp. 387–403. Springer, 2022.
Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Diffusion models for counterfactual explanations. In Proceedings of the Asian Conference on Computer Vision, pp. 858–876, 2022.

Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Adversarial counterfactual visual explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16425–16435, 2023.

Guillaume Jeanneret, Loïc Simon, and Frédéric Jurie. Text-to-image models for counterfactual explanations: a black-box approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4757–4767, 2024.

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022.

Saeed Khorram and Li Fuxin. Cycle-consistent counterfactuals by latent transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10203–10212, 2022.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026, 2023.

Andreas Kirsch. An Introduction to the Mathematical Theory of Inverse Problems. Applied Mathematical Sciences. Springer, New York, 2nd edition, 2011.

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020.

Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T Freeman, Phillip Isola, Amir Globerson, Michal Irani, et al. Explaining in style: Training a GAN to explain a classifier in StyleSpace. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 693–702, 2021.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos Theodorou, Weili Nie, and Anima Anandkumar. I2SB: Image-to-Image Schrödinger Bridge. In International Conference on Machine Learning, pp. 22042–22062. PMLR, 2023a.

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.

Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 1997.

Franz Motzkus, Christian Hellert, and Ute Schmid. CoLa-DCE: Concept-guided latent diffusion counterfactual explanations. arXiv preprint arXiv:2406.01649, 2024.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.

Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019.

Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619–10629, 2022.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.

Herbert E Robbins. An empirical Bayes approach to statistics. In Breakthroughs in Statistics: Foundations and Basic Theory, 1992.

Pau Rodríguez, Massimo Caccia, Alexandre Lacoste, Lee Zamparo, Issam Laradji, Laurent Charlin, and David Vazquez. Beyond trivial counterfactual explanations with diverse valuable explanations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1056–1065, 2021.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III, pp. 234–241. Springer, 2015.

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022.

Jonathan Scarlett, Reinhard Heckel, Miguel RD Rodrigues, Paul Hand, and Yonina C Eldar. Theoretical perspectives on deep learning methods in inverse problems. IEEE Journal on Selected Areas in Information Theory, 2022.
Erwin Schrödinger. Sur la théorie relativiste de l'électron et l'interprétation de la mécanique quantique. Annales de l'institut Henri Poincaré, 2:269–310, 1932.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 2022.

Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Yarin Gal, et al. Generating interpretable counterfactual explanations by implicit minimisation of epistemic and aleatoric uncertainties. In International Conference on Artificial Intelligence and Statistics, pp. 1756–1764. PMLR, 2021.

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 2020.

Sheng-Min Shih, Pin-Ju Tien, and Zohar Karnin. GANMEX: One-vs-one attributions using GAN-based model explainability. In International Conference on Machine Learning, pp. 9592–9602. PMLR, 2021.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153. PMLR, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, Workshop Track Proceedings, 2014.

Sumedha Singla, Brian Pollack, Junxiang Chen, and Kayhan Batmanghelich. Explanation by progressive exaggeration. In International Conference on Learning Representations, 2020.

Bartlomiej Sobieski and Przemysław Biecek. Global counterfactual directions. In European Conference on Computer Vision, 2024.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations, Workshop Track Proceedings, 2015.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. PMLR, 2017.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Jayaraman Thiagarajan, Vivek Sivaraman Narayanaswamy, Deepta Rajan, Jia Liang, Akshay Chaudhari, and Andreas Spanias. Designing counterfactual generators using deep model inversion. In Advances in Neural Information Processing Systems, 2021.

Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, and Yu-Gang Jiang. Deeper insights into the robustness of ViTs towards common corruptions. arXiv preprint arXiv:2204.12143, 2022.
Paulina Tomaszewska and Przemysław Biecek. Position paper: Do not explain (vision models) without context. In International Conference on Machine Learning, 2024.

Philipp Vaeth, Alexander M Fruehwald, Benjamin Paassen, and Magda Gregorova. GradCheck: Analyzing classifier guidance gradients for conditional diffusion sampling. arXiv preprint arXiv:2406.17399, 2024.

Arnaud Van Looveren and Janis Klaise. Interpretable counterfactual explanations guided by prototypes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 650–665. Springer, 2021.

Nina Weng, Paraskevas Pegios, Aasa Feragen, Eike Petersen, and Siavash Bigdeli. Fast diffusion-based counterfactuals for shortcut removal and generation. In European Conference on Computer Vision, 2024.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I, pp. 818–833. Springer, 2014.

Mehdi Zemni, Mickaël Chen, Eloi Zablocki, Hédi Ben-Younes, Patrick Pérez, and Matthieu Cord. OCTET: Object-aware counterfactual explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15062–15071, 2023.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. In International Conference on Learning Representations, 2018.