# Diffusion Visual Counterfactual Explanations

Maximilian Augustin*, Valentyn Boreiko*, Francesco Croce, Matthias Hein
University of Tübingen
*Equal contribution.

Abstract

Visual Counterfactual Explanations (VCEs) are an important tool to understand the decisions of an image classifier. They are small but realistic semantic changes of the image that change the classifier decision. Current approaches for the generation of VCEs are restricted to adversarially robust models and often contain non-realistic artefacts, or are limited to image classification problems with few classes. In this paper, we overcome this by generating Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers via a diffusion process. Two modifications to the diffusion process are key for our DVCEs: first, an adaptive parameterization, whose hyperparameters generalize across images and models, together with distance regularization and a late start of the diffusion process, allows us to generate images with minimal semantic changes to the original ones but a different classification. Second, our cone regularization via an adversarially robust model ensures that the diffusion process does not converge to trivial non-semantic changes, but instead produces realistic images of the target class which achieve high confidence by the classifier. Code is available at https://github.com/valentyn1boreiko/DVCEs.

1 Introduction

It can be argued that one of the main problems hindering the widespread use of machine learning, and image classification in particular, is the lack of tools to explain the decisions of black-box models such as neural networks. This is not only a problem for decisions affecting humans, where the current draft for AI regulation in Europe [10] requires transparency, but a pressing problem in all applications of machine learning. The reason, to some extent, is that humans would like to understand, and also verify, whether the learning algorithm has captured the concepts of the underlying classes or whether it just predicts well using spurious features, artefacts in the dataset, or other sources of error. In this paper, we focus on model-agnostic explanations which can, in principle, be applied to any image classifier and do not rely on a specific structure of the classifier such as decision trees or linear models. In this area, in particular for image classification, sensitivity-based explanations [4], explanations based on feature attributions [3], saliency maps [47, 46, 17, 56, 51], Shapley additive explanations [31], and local fits of interpretable models [39] have been proposed. Moreover, [55] proposed counterfactual explanations (CEs), which are instance-specific explanations. They can be applied to any classifier, and it has been argued that they are close to the human justification of decisions [34] using counterfactual reasoning: I would recognize it as a zebra (instead of a horse) if it had black and white stripes. For a given classifier, almost all methods for the generation of CEs [55, 15, 35, 6, 38, 54, 44] try to solve the following problem: given a target class c, what is the minimal change δ to an input x such that x + δ is classified as class c with high probability and is a realistic instance of the data-generating distribution?
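This question is often formalized as a constrained optimization problem. The following generic form captures the common structure; it is our schematic notation, with the distance $d$, confidence level $1-\epsilon$, and realism constraint left abstract, since each of the cited methods instantiates them differently:

$$
\min_{\delta}\; d(x,\, x+\delta)
\quad \text{s.t.} \quad p(c \mid x+\delta) \ge 1-\epsilon,
\quad x+\delta \in \mathcal{M},
$$

where $\mathcal{M}$ denotes the set of realistic inputs (the data manifold); penalized variants replace the constraints by additive loss terms.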
From the perspective of debugging existing machine learning models, CEs are interesting as they construct a concrete input x + δ with a different classification, which allows the developer to test whether the model has learned the correct features.

Figure 1: Our method DVCE creates visual counterfactual explanations (VCEs) that explain any given classifier (here: a non-robust ConvNeXt) by changing an image of one class into a similar target class, changing class-relevant features while preserving the overall image structure.

The main reason why visual counterfactual explanations (VCEs), that is CEs for image classification, are not widely used is that the tasks of generating CEs and adversarial examples [52] are closely related. Even imperceptible changes of the image can already change the prediction of the classifier; however, the resulting noise patterns do not show the user whether the classifier has picked up the right class-specific features. One possible solution is using adversarially robust models [43, 1, 7], which have been shown to produce semantically meaningful VCEs by directly maximizing the probability of the target class in image space. These approaches have the downside that they can only generate VCEs for robust models, which is a significant restriction as these models are not competitive in terms of prediction accuracy. The second approach is to restrict the generation of VCEs using a generative model or by constraining the set of potential image manipulations [19, 20, 42, 8, 18, 45, 27]. However, these approaches are either restricted to datasets with a small number of classes, cannot provide explanations for arbitrary classifiers, or generate VCEs that look realistic but have so little in common with the original image that not much insight can be gained. Recently, [26] trained a StyleGAN2 [24] model to discover and manipulate class attributes. While their approach yields impressive results on smaller-scale datasets with few similar classes (for example, different bird species), the authors did not demonstrate that the method scales to complex tasks such as ImageNet with hundreds of classes, where different classes require different sets of attributes. Another disadvantage is that the StyleGAN model needs to be retrained for every classifier, making it prohibitively expensive to explain multiple large models. Moreover, [25] have proposed a loss for generating VCEs in the latent space of a GAN/VAE. They show promising results for MNIST/FMNIST, but no code is available for ImageNet. Furthermore, to explain a classifier, the conditional information during the generation of explanations should ideally come only from the classifier itself, but [25] rely on a conditional GAN, which might introduce a bias. In this paper, we overcome the aforementioned challenges and generate our Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers (see Fig. 1). We build on the progress in the generation of realistic images using diffusion processes [48, 23, 49, 50], which recently were able to outperform GANs [14]. Similar to [37, 2], we use a classifier and a distance-type regularization to guide the generation of the images.
Two modifications to the diffusion process are key elements for our DVCEs: i) a combination of distance regularization and a late starting point of the diffusion process, together with an adaptive reparameterization, lets us generate VCEs visually close to the original image in a controlled way, so that hyperparameters can be fixed across images and even across models; ii) our cone regularization of the gradient of the classifier via an adversarially robust model ensures that the diffusion process does not converge to trivial non-semantic changes but instead produces realistic images of the target class which achieve high confidence by the classifier. Our approach can be employed for any dataset where a generative diffusion model and an adversarially robust model are available. In a qualitative comparison, a user study, and a quantitative evaluation (Sec. 4.1), we show that our DVCEs achieve higher realism (according to FID) and have more meaningful features (according to the user study) than both of the recent methods of [7] and [2].

2 Diffusion models

Diffusion models [14, 48, 49, 23, 36] are generative models that consist of two steps: a forward diffusion process (a Markov process) that transforms the data distribution into a prior distribution (usually assumed to be a standard normal distribution), and a reverse diffusion process that transforms the prior distribution back into the data distribution. The existence of the (continuous-time) reverse diffusion process was investigated in more detail in [50]. In the discrete-time setting, a Markov chain $\{x_1, \dots, x_T\}$ for any data point $x_0$ in the forward direction is defined via a Markov process that adds noise to the data point at each timestep:

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad (1) $$

where $\{\beta_1, \dots, \beta_T\}$ is some variance schedule, chosen such that $q(x_T) \approx \mathcal{N}(0, I)$. Note that given $x_0$, it is possible to sample from $q(x_t \mid x_0)$ in closed form instead of applying noise $t$ times, using:

$$ q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\, I\big), \qquad (2) $$
$$ x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (3) $$

where $\bar\alpha_t := \prod_{k=1}^{t}(1-\beta_k)$. In [48], the authors have shown that the reverse transitions $q(x_{t-1} \mid x_t)$ approach diagonal Gaussian distributions as $T \to \infty$, and thus one can use a DNN with parameters $\theta$ to approximate $q(x_{t-1} \mid x_t)$ by $p_\theta(x_{t-1} \mid x_t)$, predicting the mean $\mu_\theta(x_t, t)$ and the diagonal covariance matrix $\Sigma_\theta(x_t, t)$:

$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(\mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big). \qquad (4) $$

To sample from $p_\theta(x_0)$, one samples $x_T$ from $\mathcal{N}(0, I)$ and then follows the reverse process by repeatedly sampling from the transition probabilities $p_\theta(x_{t-1} \mid x_t)$. Instead of predicting $\mu_\theta(x_t, t)$ directly, it has been shown in [23] that the best performance is achieved when a neural network $\epsilon_\theta(x_t, t)$ predicts the source noise $\epsilon$ in (3). Then, the loss used for training resembles that of a denoising model:

$$ L_{\text{simple}}(\theta, T, q(x_0)) := \mathbb{E}_{t \sim [1,T],\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0,I)} \left[ \big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2 \right]. \qquad (5) $$

The mean of the reverse step $\mu_\theta$ in (4) can be derived [32] using Bayes' theorem as:

$$ \mu_\theta(x_t, t) = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right). \qquad (6) $$

This is, however, only one of several equivalent parameterizations of the learning objective, as shown in [32]. Because the objective $L_{\text{simple}}$ does not give a learning signal for $\Sigma_\theta(x_t, t)$, in practice one combines $L_{\text{simple}}$ with another loss based on a variational lower bound of the data likelihood [48], which, unlike $L_{\text{simple}}$, allows one to learn the diagonal covariance matrix $\Sigma_\theta(x_t, t)$. Concretely, the network $\epsilon_\theta(x_t, t)$ additionally outputs a vector that is used to parametrize $\Sigma_\theta(x_t, t)$.
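For concreteness, here is a minimal PyTorch-style sketch of the closed-form forward sampling in (3) and the denoising loss (5). It is an illustration under the usual DDPM conventions, not the authors' implementation; the names `q_sample`, `simple_loss`, `eps_model`, and `alphas_cumprod` are ours.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) in closed form, eq. (3):
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)   # \bar{alpha}_t per batch element
    eps = torch.randn_like(x0)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps, eps

def simple_loss(eps_model, x0, alphas_cumprod):
    """L_simple from eq. (5): MSE between the true and the predicted source noise."""
    T = alphas_cumprod.shape[0]
    # sample a timestep uniformly for each image in the batch (0-indexed)
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    x_t, eps = q_sample(x0, t, alphas_cumprod)
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```

Here `alphas_cumprod` is a length-T tensor holding $\bar\alpha_1, \dots, \bar\alpha_T$, and `eps_model(x_t, t)` is assumed to be the noise-prediction network $\epsilon_\theta$.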
2.1 Class conditional sampling

There exist diffusion models trained with and without knowledge about the classes in the dataset [14, 22]. For our experiments, we only use the class-unconditional diffusion model, such that all class-conditional features are introduced by the classifier that we want to explain. For a noise-aware classifier $p_\phi(y \mid \cdot, \cdot): \mathbb{R}^d \times \{1, \dots, T\} \to [0, 1]$, with parameters $\phi$, that is trained on noisy images corresponding to the various timesteps [14], the reverse process transitions are of the form:

$$ p_{\theta,\phi}(x_{t-1} \mid x_t, y) = Z\, p_\theta(x_{t-1} \mid x_t)\, p_\phi(y \mid x_t, t) \qquad (7) $$

for a normalization constant $Z$. As we would like to explain any classifier and not only noise-aware ones, we follow the approach of [2], where a classifier $p_\phi(y \mid \cdot): \mathbb{R}^d \to [0, 1]$ is given as input the denoised sample $\hat{x}_0 = f_{\text{dn}}(x_t, t)$ of $x_t$, using the mapping:

$$ f_{\text{dn}}: \mathbb{R}^d \times \{1, \dots, T\} \to \mathbb{R}^d, \qquad (x_t, t) \mapsto \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}. \qquad (8) $$

This mapping, derived from (3), estimates the noise-free image $x_0$ using the noise approximated by the model $\epsilon_\theta(x_t, t)$ for a given timestep $t$. With this, we can define a timestep-aware posterior $p_\phi(y \mid x_t, t)$ for any classifier $p_\phi(y \mid \cdot)$ as $p_\phi(y \mid x_t, t) := p_\phi(y \mid f_{\text{dn}}(x_t, t))$.

The issue is that efficient sampling from the original diffusion model is possible only because the reverse process consists of normal distributions. As we need to sample from $p_{\theta,\phi}(x_{t-1} \mid x_t, y)$ hundreds of times to obtain a single sample from the data distribution, it is not possible to use MCMC samplers with high complexity to sample from each of the individual transitions. In [14], it was proposed to solve this by approximating $p_{\theta,\phi}(x_{t-1} \mid x_t, y)$ with slightly shifted versions of $p_\theta(x_{t-1} \mid x_t)$ to make closed-form sampling possible. Such transition kernels are given by:

$$ p_{\theta,\phi}(x_{t-1} \mid x_t, y) = \mathcal{N}\big(\mu_t,\ \Sigma_\theta(x_t, t)\big), \qquad (9) $$
$$ \mu_t = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_\phi(y \mid x_t, t), \qquad (10) $$

which we further adapt to the goal of generating VCEs and use in our experiments.

3 Diffusion Visual Counterfactual Explanations

A VCE $x$ for a chosen target class $y$, a given classifier $p_\phi(y \mid \cdot)$, and an input $\hat{x}$ should satisfy the following criteria: i) validity: the VCE $x$ should be classified by $p_\phi(y \mid \cdot)$ as the desired target class $y$ with high predicted probability; ii) realism: the VCE should be as close as possible to a natural image; iii) minimality/closeness: the difference between the VCE $x$ and the original image $\hat{x}$ should be the minimal semantic modification necessary to change the class; in particular, the generated image $x$ should be close to $\hat{x}$ while being valid and realistic, e.g. by changing the object in the image and leaving the background unchanged. Note that targeted adversarial examples are valid, but for a non-robust model they do not show meaningful semantic changes towards the target class and are not realistic. The l1.5-SVCEs of [7] change the image to maximize the predicted probability of the target class inside an l1.5-ball around the image, that is, they are targeted adversarial examples. Thus this only works for robust classifiers, and they use an ImageNet classifier that was trained to be multiple-norm robust (MNR), which we denote in this paper as MNR-RN50 (see Sec. 4.2). The realism of the l1.5-SVCEs comes purely from the generative properties of robust classifiers [43], which can lead to artefacts. In contrast, our Diffusion Visual Counterfactual Explanations (DVCEs) work for any classifier, and our DVCEs are more realistic due to the better generative properties of diffusion models.
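To make the guided transition of (9)-(10) from Sec. 2.1 concrete before introducing the DVCE-specific modifications, the following PyTorch-style sketch implements the denoised estimate $f_{\text{dn}}$ from (8) and the shifted mean from (10). It is an illustration under standard DDPM conventions; `eps_model`, `classifier`, and the function names are placeholders, not the authors' code.

```python
import torch

def f_dn(x_t, t, eps_model, alphas_cumprod):
    """Denoised estimate \\hat{x}_0 of x_t, eq. (8)."""
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    return (x_t - (1 - abar_t).sqrt() * eps_model(x_t, t)) / abar_t.sqrt()

def guided_mean(x_t, t, y, mu_theta, sigma_theta, eps_model, classifier, alphas_cumprod):
    """Shifted reverse-process mean, eq. (10):
    mu_theta + Sigma_theta * grad_x log p_phi(y | f_dn(x_t, t))."""
    x_t = x_t.detach().requires_grad_(True)
    logits = classifier(f_dn(x_t, t, eps_model, alphas_cumprod))
    log_prob = torch.log_softmax(logits, dim=-1)
    selected = log_prob[torch.arange(y.shape[0]), y].sum()   # log p_phi(y | ...) per image
    grad = torch.autograd.grad(selected, x_t)[0]
    return mu_theta + sigma_theta * grad
```

Here `mu_theta` and `sigma_theta` are the mean and diagonal covariance predicted by the diffusion model for the same $x_t$, and the gradient is taken through the denoiser.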
An approach similar to our DVCE framework is Blended Diffusion (BD) [2], which manipulates the image inside a masked region. One can adapt BD for the generation of VCEs by using the whole image as the mask. DVCE and BD share the same diffusion model, but BD cannot be applied to arbitrary classifiers and requires image-specific hyperparameter tuning, see Fig. 2.

Figure 2: DVCEs for the robust MNR-RN50 model [7] in the second column (ours) and BDVCEs generated with Blended Diffusion [2], such that the regularization is applied to the whole image. Due to our adaptive parameterization, the same parameters ($C_c = 0.1$, $C_d = 0.15$) work for our DVCEs across different images and classifiers (see Fig. 5). For BDVCEs, it is difficult to choose a set of hyperparameters that, even for a single classifier, works for multiple images. For BDVCEs, we report the parameter $C_d$ of the LPIPS weight; the l2-weight is 10 times as large (the ratio used in [2]).

3.1 Adaptive Parameterization

For generating the DVCE of the original image $\hat{x}$, we do not want to sample just any image $x$ from $p(x \mid y)$ (high realism and validity), but also need to make sure that $d(x, \hat{x})$ is small. For this, we have to condition our diffusion process on $\hat{x}$. Using the denoising function $f_{\text{dn}}$ and analogously to the derivation of (9) and its applications in [14, 27], the mean of the transition kernel becomes:

$$ \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\, \nabla_{x_t} \Big( C_c \log p_\phi\big(y \mid f_{\text{dn}}(x_t, t)\big) - C_d\, d\big(\hat{x}, f_{\text{dn}}(x_t, t)\big) \Big), \qquad (11) $$

where $C_c$ is the coefficient of the classifier, and $C_d$ the one of the distance guiding loss. Intuitively, we take a step in the direction that increases the classifier score while staying close to $\hat{x}$. Note that the last term can be interpreted as the log-gradient of a distribution with density proportional to $\exp(-C_d\, d_t(\hat{x}, \cdot))$, where $d_t(\hat{x}, \cdot) := d(\hat{x}, f_{\text{dn}}(\cdot, t))$. Thus we introduce a timestep-aware prior distribution that enforces our output to be similar to $\hat{x}$. In our work, we use the l1-distance as it produces sparse changes. Several works have tried to minimize some distance during the diffusion process, implicitly in [9] or explicitly, but only for the background, in BD. However, there is no principled way to choose the coefficient for such a regularization term. It turns out that it is impossible to find a parameter setting for BD that works across images and classifiers. Thus, we propose the parameterization

$$ g_{\text{update}} = C_c\, \frac{\nabla_{x_t} \log p_\phi\big(y \mid f_{\text{dn}}(x_t, t)\big)}{\big\|\nabla_{x_t} \log p_\phi\big(y \mid f_{\text{dn}}(x_t, t)\big)\big\|_2} \;-\; C_d\, \frac{\nabla_{x_t} d\big(\hat{x}, f_{\text{dn}}(x_t, t)\big)}{\big\|\nabla_{x_t} d\big(\hat{x}, f_{\text{dn}}(x_t, t)\big)\big\|_2}, \qquad (12) $$

which adapts to the predicted mean of the diffusion model and which we use to change $\mu_t$ in (10) to

$$ \mu_t = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\, \big\|\mu_\theta(x_t, t)\big\|_2\, g_{\text{update}}. \qquad (13) $$

This adaptive parameterization allows for fine-grained control of the influence of the classifier and the distance regularization, so that the hyperparameters $C_c$ and $C_d$ now have the same influence across images and even classifiers. It facilitates the generation of DVCEs, as otherwise hyperparameter fine-tuning would be necessary for each image as in BD, see Fig. 2 for a comparison. However, even with our adaptive parameterization, it is still not easy to produce semantically meaningful changes close to $\hat{x}$, as can be seen in App. B.2. Thus, as in [2], we vary the starting point of the diffusion process and observe in App.
B.2 that starting from step $T/2$ of the forward diffusion process, together with the adaptive parameterization and using the l1-distance as the distance, provides us with sparse but semantically meaningful changes. In our experiments, we set $T = 200$.

3.2 Cone Projection for Classifier Guidance

A key objective for our DVCEs is that they can be applied to any image classifier, adversarially robust or not. Diffusion with the new mean as in (13) does not work with a non-robust classifier: it leads to very small modifications of the image, similar to adversarial examples, without meaningful changes. The reason is that the gradients of non-robust classifiers are noisier and less semantically meaningful than those of robust classifiers. We illustrate this for a non-robust Swin-TF [28] in Fig. 3, where hardly any changes are generated. This even holds if we first denoise the sample using $f_{\text{dn}}$ from (8) and average gradients of augmented images as in [2] (see Appendix B.6 for details). As a solution, we suggest projecting the gradient of an adversarially robust classifier with parameters $\psi$, $\nabla_{x_t} \log p_{\text{robust},\psi}(y \mid f_{\text{dn}}(x_t, t))$, onto a cone centered at the gradient of the classifier, $\nabla_{x_t} \log p_\phi(y \mid f_{\text{dn}}(x_t, t))$. More precisely, we define

$$ g_{\text{proj}} = P_{\text{cone}\left(\alpha,\, \nabla_{x_t} \log p_\phi(y \mid f_{\text{dn}}(x_t, t))\right)} \left[ \nabla_{x_t} \log p_{\text{robust},\psi}\big(y \mid f_{\text{dn}}(x_t, t)\big) \right], $$

where $\text{cone}(\alpha, v) := \{w \in \mathbb{R}^d : \angle(v, w) \le \alpha\}$ is the cone of angle $\alpha$ around the vector $v$, and the projection $P_{\text{cone}(\alpha, v)}$ onto $\text{cone}(\alpha, v)$ is given as:

$$ P_{\text{cone}(\alpha, v)}[w] := \begin{cases} \langle u, w\rangle\, u, & \angle(w, v) > \alpha, \\ w, & \text{else,} \end{cases} $$

where $P_{v^\perp}(w) := w - \frac{\langle w, v\rangle}{\langle v, v\rangle} v$ and $u = \sin(\alpha)\, \frac{P_{v^\perp}(w)}{\|P_{v^\perp}(w)\|_2} + \cos(\alpha)\, \frac{v}{\|v\|_2}$ (note $\|u\|_2 = 1$). The motivation for this projection is to reduce the noise in the gradient of the non-robust classifier.

Figure 3: DVCEs for the non-robust Swin-TF [28], the robust MNR-RN50 model, and cone-projected DVCEs for the Swin-TF [28]. Similar to adversarial examples, guidance by the non-robust Swin-TF does not yield semantically meaningful changes. In contrast, the DVCEs of the robust MNR-RN50 and the DVCEs with cone projection for the Swin-TF [28] are valid and realistic VCEs.

Note that the projection of the gradient of the robust classifier onto the cone generated by the non-robust classifier with angle $\alpha < 90°$ is always an ascent direction for $\log p_\phi(y \mid f_{\text{dn}}(x_t, t))$, which we would like to maximize (note that $g_{\text{proj}}$ is not necessarily an ascent direction for $\log p_{\text{robust},\psi}(y \mid f_{\text{dn}}(x_t, t))$). Thus, the cone projection is a form of regularization of $\nabla_{x_t} \log p_\phi(y \mid f_{\text{dn}}(x_t, t))$ which guides the diffusion process towards semantically meaningful changes of the image. The averaging over augmentations of $x_t$ has an additional regularizing effect, see Appendix B.6.

3.3 Final Scheme for Diffusion Visual Counterfactuals

Our solution for a non-adversarially robust classifier $p_\phi(y \mid \cdot)$ is to use Algorithm 1 of [14] and replace the update step with:

$$ g_{\text{proj}} = P_{\text{cone}\left(\alpha,\, \nabla_{x_t} \log p_\phi(y \mid f_{\text{dn}}(x_t, t))\right)} \left[ \nabla_{x_t} \log p_{\text{robust},\psi}\big(y \mid f_{\text{dn}}(x_t, t)\big) \right], $$
$$ g_{\text{update}} = C_c\, \frac{g_{\text{proj}}}{\|g_{\text{proj}}\|_2} \;-\; C_d\, \frac{\nabla_{x_t} d\big(\hat{x}, f_{\text{dn}}(x_t, t)\big)}{\big\|\nabla_{x_t} d\big(\hat{x}, f_{\text{dn}}(x_t, t)\big)\big\|_2}, $$
$$ \mu_t = \mu_\theta(x_t, t) + \Sigma_\theta(x_t, t)\, \big\|\mu_\theta(x_t, t)\big\|_2\, g_{\text{update}}, $$
$$ p(x_{t-1} \mid x_t, \hat{x}, y) = \mathcal{N}\big(\mu_t,\ \Sigma_\theta(x_t, t)\big). $$

For an adversarially robust classifier, the cone projection is omitted and one uses $g_{\text{proj}} = \nabla_{x_t} \log p_{\text{robust},\psi}(y \mid f_{\text{dn}}(x_t, t))$. In all our experiments we use $C_c = 0.1$ and $C_d = 0.15$, unless we show ablations for one of the parameters. The angle $\alpha$ for the cone projection is fixed to 30°. In strong contrast to BD, these parameters generalize across images and classifiers.
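The following PyTorch-style sketch shows the cone projection and the resulting mean update as described above. It is our simplified single-image illustration of Sec. 3.2-3.3 (the released code at https://github.com/valentyn1boreiko/DVCEs is authoritative); `cone_project` and `dvce_mean` are illustrative names, all tensors are assumed to have the shape of the image, and the gradients would be computed through $f_{\text{dn}}$ as in the sketch after Sec. 2.1.

```python
import math
import torch

def cone_project(w, v, angle_deg=30.0):
    """Project the 1-D tensor w onto cone(angle, v): if the angle between w and v
    exceeds angle_deg, map w to <u, w> u with u on the cone boundary closest to w;
    otherwise return w unchanged."""
    alpha = math.radians(angle_deg)
    v_unit = v / v.norm()
    cos_wv = torch.dot(w, v_unit) / w.norm()
    if torch.arccos(cos_wv.clamp(-1.0, 1.0)) <= alpha:
        return w
    w_perp = w - torch.dot(w, v) / torch.dot(v, v) * v          # component of w orthogonal to v
    u = math.sin(alpha) * w_perp / w_perp.norm() + math.cos(alpha) * v_unit
    return torch.dot(u, w) * u                                   # ||u||_2 = 1

def dvce_mean(mu_theta, sigma_theta, grad_cls, grad_robust, grad_dist,
              C_c=0.1, C_d=0.15, angle_deg=30.0):
    """Cone-regularized, adaptively scaled mean of the reverse transition (Sec. 3.3)."""
    g_proj = cone_project(grad_robust.flatten(), grad_cls.flatten(), angle_deg)
    g_update = C_c * g_proj / g_proj.norm() - C_d * grad_dist.flatten() / grad_dist.norm()
    return mu_theta + sigma_theta * mu_theta.norm() * g_update.view_as(mu_theta)
```

Since both terms of `g_update` are normalized, only the directions of the cone-projected classifier gradient and of the distance gradient matter, which is what makes the coefficients transferable across images and models.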
4 Experiments

In this section, we evaluate the quality of DVCEs. We compare DVCEs to existing works in Sec. 4.1. In Sec. 4.2, we compare DVCEs for various state-of-the-art ImageNet models and show how DVCEs can be used to interpret differences between classifiers. For our DVCEs, we use in all experiments the fixed hyperparameters from Sec. 3.3. The diffusion model used for DVCE is from [14] and was trained class-unconditionally on 256×256 ImageNet images using a modified U-Net [40]. A user study, the discovery of spurious features using VCEs, further experiments, and ablations are in the appendix.

4.1 Comparison of Methods for VCE Generation

We compare DVCEs with the VCEs produced by BD (BDVCEs) and the l1.5-SVCEs. As the latter only work for adversarially robust classifiers, we use the multiple-norm robust ResNet50 from [7], MNR-RN50, as the classifier to create VCEs for all three methods. As this model is robust on its own, we do not use the cone projection for its DVCEs. First, in Fig. 4 we present a qualitative evaluation, where we transform one image into two different classes that are close to the true one in the WordNet hierarchy [33]. The radius of the l1.5-ball of [7] is chosen per image as the smallest r ∈ {50, 75, 100, 150} such that the confidence in the target class is larger than 0.9. For BD, we select the image with the smallest classifier and regularization weight that reaches a confidence larger than 0.9 from the set of parameters discussed in Sec. 3.3. If 0.9 is not reached by any setting for one of the two baselines, we show the image that achieves the highest confidence.

Figure 4: Comparison of different VCE methods for the robust MNR-RN50 (taken from [7]). For each original image (outer left column) and each target class we show our DVCEs (left column), the l1.5-SVCEs of [7] (middle), and the adaptation of BDVCEs [2] (right). Only DVCEs satisfy all desired properties of VCEs. BD fails for leopard and tiger and often produces images far away from the original one (pizza, potpie, timber wolf, white wolf, and alp). l1.5-SVCEs show artefacts (white wolf, volcano, night snake) and often have lower image quality.

As Fig. 4 shows, DVCE is the only method that satisfies all desired properties of VCEs. For example, for mashed potato, DVCE preserves the bowl and only changes the content into either guacamole or carbonara. Our qualitative comparison shows that the same hyperparameter setting of DVCE can handle different classes and transfers between similar classes, such as different snake or wolf types, as well as different object sizes, e.g. cheetah and snake. In contrast, both l1.5-SVCEs and BDVCEs require different hyperparameters for different images to achieve high confidence in the target class. Even with the six parameter configurations, BD is not able to always produce images with high confidence in the target class. More problematic for VCEs is that the resulting images can have high confidence but are neither realistic (cheetah → tiger) nor resemble the original image at all (dingo → timber wolf/white wolf). Even if the method works with the given parameters, for example, mashed potato →
guacamole or carbonara, the overall image quality cannot match that of DVCE, as the images often contain overly bright colors. In the case of the volcano VCE, the DVCE shows class features like lava, whereas the BDVCE cannot clearly be recognized as a volcano. For the l1.5-SVCEs, one often needs large radii to achieve the desired confidence of 0.9, which often results in images that do not look realistic.

As noted in [7], a quantitative analysis of VCEs using FID scores is difficult, as methods that barely change the original image get a low FID score. Thus, we have developed a cross-over evaluation scheme, where one partitions the classes into two sets and only analyzes cross-over VCEs (more details are in App. E). We show the results in Tab. 1. In terms of closeness, DVCEs are worse than l1.5-SVCEs, which is expected as the latter optimize inside an l1.5-ball. However, DVCEs are the most realistic ones (FID score) and have similar validity to l1.5-SVCEs. BDVCEs are the worst in all categories. Moreover, we have conducted a user study (20 users), in which participants decided if the changes of the VCE are meaningful or subtle and if the generated image is realistic, see App. C for details. The percentage of total images having the three different properties is (order: DVCEs, l1.5-SVCEs, BD): meaningful - 62.0%, 48.4%, 38.7%; realism - 34.7%, 24.6%, 52.2%; subtle - 45.0%, 50.6%, 31.0%. This confirms that DVCEs generate more meaningful features of the target classes. While the result regarding realism seems to contradict the quantitative evaluation, this is due to the fact that realism here means that the user considered the image realistic irrespective of whether it shows the target class or not.

| Method | l1 ↓ | l1.5 ↓ | l2 ↓ | LPIPS-Alex ↓ | Mean Conf. ↑ | Avg. FID ↓ |
|---|---|---|---|---|---|---|
| DVCEs (ours) | 12799 | 293 | 48 | 0.35 | 0.932 | 17.6 |
| l1.5-SVCEs [7] | 5139 | 139 | 26 | 0.20 | 0.945 | 25.6 |
| Blended Diffusion [2] | 35678 | 722 | 108 | 0.58 | 0.825 | 27.9 |

Table 1: Quantitative evaluation of VCEs: the first four metrics measure closeness, Mean Conf. measures validity, and Avg. FID measures realism. DVCEs outperform BDVCEs [2] in all metrics. Moreover, they achieve a Mean Conf. (validity) comparable to l1.5-SVCEs while outperforming them significantly in Avg. FID (realism), but make larger changes to the image than SVCEs (closeness).

4.2 Model comparison

Non-robust models. In Fig. 5 we show that DVCEs can be generated for various state-of-the-art ImageNet [41] models. We use a Swin-TF-L [28], a ConvNeXt-L [29] and a Noisy-Student EfficientNet-B7 [57, 53]. Both the Swin-TF and the ConvNeXt are pretrained on ImageNet-21k [13], whereas the EfficientNet uses noisy-student self-training on a large unlabeled pool. As none of these models yields perceptually aligned gradients, we use the cone projection with a 30° angle described in Sec. 3.2 with the robust model from the previous section. All other parameters are identical to the previous experiment. We highlight that DVCE is the first method capable of explaining (in the sense of VCEs) arbitrary classifiers on a task as challenging as ImageNet. Overall, DVCEs satisfy all desired properties of VCEs. They also allow us to inspect the most important features of each model and class. For the stupa and church classes, for example, it seems that all models use different roof and tower structures as the most prominent feature, as they spend most of their budget on changing those.

Robust models. Here, we evaluate the DVCEs of different robust models trained to be multiple-norm adversarially robust. They are obtained by multiple-norm fine-tuning [11] an initially lp-robust model to become robust with respect to the l1-, l2- and l∞-threat models.
Specifically, an l2-robust ResNet50 [16] results in the MNR-RN50 used in the previous sections, an l∞-robust XCiT transformer model [12] in the MNR-XCiT used in the following, and an l∞-robust DeiT transformer [5] in MNR-DeiT. Their multiple-norm robust accuracies can be found in [11]. In [7], different lp-norm adversarially robust models were compared for the generation of VCEs, and it was shown that for their l1.5-SVCEs the multiple-norm robust model was better (both in terms of FID and qualitatively) than the individual lp-norm robust classifiers. This is the reason why we also use a multiple-norm robust model for the cone projection of a non-robust classifier. In Fig. 6, we show DVCEs of the three different robust models for two example images and two target classes each. All of the multiple-norm robust models produce similarly good DVCEs, showing classifier-specific variations in the semantic changes. This demonstrates again the generative properties of adversarially robust models, in particular of robust transformers. More examples and further experiments are in App. B.4.

Figure 5: DVCEs for three non-robust classifiers: Swin-TF [28], ConvNeXt [29] and EfficientNet [57]. For each original image (outer left column) we show DVCEs for two target classes. Please zoom into the images to see more fine-grained details and differences.

Figure 6: We compare DVCEs for three different robust models (no cone projection), which are all fine-tuned to be multiple-norm adversarially robust [11], that is against l1-, l2- and l∞-perturbations. More examples can be found in Fig. 10.

5 Limitations, Future Work and Societal Impact

In comparison to GANs, diffusion-based approaches can be expensive to evaluate, as they require multiple iterations of the reverse process to create one sample. This can be an issue when creating a large number of VCEs and makes deployment in a time-sensitive setting challenging. We also rely on a robust model during the creation of VCEs for the cone projection. While we show that standard classifiers do not yield the desired gradients, training robust models can be challenging. An interesting direction for future research is thus to replace the cone projection with another denoising procedure for the gradient of a non-robust model. From a theoretical standpoint, using conditional sampling with reverse transitions of the form (7) is justified; in practice, however, we have to approximate the reverse transitions with shifted normal distributions of the form (9). This approximation relies on the conditioning function having low curvature; it can be hard to verify this once we start adding a classifier, a distance and other possible terms, and it is unclear how this influences the outcome of the diffusion process. The quantitative evaluation of VCEs is difficult, as the standard FID metric compares the distribution of features of a classifier over a test and a generated dataset.
However, for VCEs it is not only important to generate realistic images, but also to achieve high confidence and to create meaningful changes. Moreover, metrics such as IM1 and IM2 [30] for VCEs rely on a well-trained (V)AE for every class, which is difficult to achieve for a dataset with 1000 classes and high-resolution images. Future research should therefore try to develop metrics for the quantitative evaluation of VCEs, and we think that the evaluation in Tab. 1 is a first step in that direction. DVCEs, and VCEs in general, help to discover biases of classifiers and thus have a positive societal impact; however, as with any conditional generative model, they can be abused for unintended purposes.

6 Conclusion

We have proposed DVCEs, a novel way to create VCEs for any state-of-the-art ImageNet classifier using diffusion models and our cone projection. DVCEs can handle vastly different image configurations, object sizes, and classes, and satisfy all desired properties of VCEs.

Acknowledgement

The authors acknowledge support by the DFG Excellence Cluster Machine Learning - New Perspectives for Science, EXC 2064/1, Project number 390727645, and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, as well as the DFG grant 389792660 as part of TRR 248.

References

[1] Maximilian Augustin, Alexander Meinke, and Matthias Hein. Adversarial robustness on in- and out-distribution improves explainability. In ECCV, 2020.
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
[3] Sebastian Bach, Alexander Binder, Frederick Klauschen, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 2015.
[4] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. JMLR, 2010.
[5] Yutong Bai, Jieru Mei, Alan Yuille, and Cihang Xie. Are transformers more robust than CNNs? In NeurIPS, 2021.
[6] Solon Barocas, Andrew D. Selbst, and Manish Raghavan. The hidden assumptions behind counterfactual explanations and principal reasons. In FAT*, 2020.
[7] Valentyn Boreiko, Maximilian Augustin, Francesco Croce, Philipp Berens, and Matthias Hein. Sparse visual counterfactual explanations in image space. In GCPR, 2022.
[8] Chun-Hao Chang, Elliot Creager, Anna Goldenberg, and David Duvenaud. Explaining image classifiers by counterfactual generation. In ICLR, 2019.
[9] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. ILVR: Conditioning method for denoising diffusion probabilistic models. In ICCV, 2021.
[10] EU Commission. Regulation for laying down harmonised rules on AI. European Commission, 2021.
[11] Francesco Croce and Matthias Hein. Adversarial robustness against multiple lp-threat models at the price of one and how to quickly fine-tune robust models to another threat model. In ICML, 2022.
[12] Edoardo Debenedetti. Adversarially robust vision transformers. Master's thesis, Swiss Federal Institute of Technology, Lausanne (EPFL), 2022.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. License: No license specified.
[14] Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis.
In NeurIPS, 2021.
[15] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In NeurIPS, 2018.
[16] Logan Engstrom, Andrew Ilyas, Hadi Salman, Shibani Santurkar, and Dimitris Tsipras. Robustness (Python library), 2019. License: MIT.
[17] Christian Etmann, Sebastian Lunz, Peter Maass, and Carola-Bibiane Schönlieb. On the connection between adversarial robustness and saliency map interpretability. In ICML, 2019.
[18] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. Counterfactual visual explanations. In ICML, 2019.
[19] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In ECCV, 2016.
[20] Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. Grounding visual explanations. In ECCV, 2018.
[21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[24] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
[25] Saeed Khorram and Li Fuxin. Cycle-consistent counterfactuals by latent transformations. In CVPR, 2022.
[26] Oran Lang, Yossi Gandelsman, Michal Yarom, Yoav Wald, Gal Elidan, Avinatan Hassidim, William T. Freeman, Phillip Isola, Amir Globerson, Michal Irani, and Inbar Mosseri. Explaining in style: Training a GAN to explain a classifier in StyleSpace. In ICCV, 2021.
[27] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! Image synthesis with semantic diffusion guidance. In WACV, 2023.
[28] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[29] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In CVPR, 2022.
[30] Arnaud Van Looveren and Janis Klaise. Interpretable counterfactual explanations guided by prototypes. In ECML, 2021.
[31] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017.
[32] Calvin Luo. Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970, 2022.
[33] George A. Miller. WordNet: A lexical database for English. Commun. ACM, 1995.
[34] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 2019.
[35] Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In FAccT, 2020.
[36] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[37] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[38] Nick Pawlowski, Daniel Coelho de Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual inference. In NeurIPS, 2020.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In KDD, 2016.
[40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015. License: No license specified.
[42] Pouya Samangouei, Ardavan Saeedi, Liam Nakagawa, and Nathan Silberman. ExplainGAN: Model explanation via decision boundary crossing transformations. In ECCV, 2018.
[43] Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. In NeurIPS, 2019.
[44] Lisa Schut, Oscar Key, Rory McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, and Yarin Gal. Generating interpretable counterfactual explanations by implicit minimisation of epistemic and aleatoric uncertainties. In AISTATS, 2021.
[45] Kathryn Schutte, Olivier Moindrot, Paul Hérent, Jean-Baptiste Schiratti, and Simon Jégou. Using StyleGAN for visual interpretability of deep learning models on medical images. In NeurIPS Workshop, 2020.
[46] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. IJCV, 2019.
[47] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
[48] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[49] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
[50] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
[51] Suraj Srinivas and François Fleuret. Full-gradient representation for neural network visualization. In NeurIPS, 2019.
[52] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
[53] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[54] Sahil Verma, John P. Dickerson, and Keegan Hines. Counterfactual explanations for machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.
[55] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law & Technology, 2018.
[56] Zifan Wang, Haofan Wang, Shakul Ramkumar, Matt Fredrikson, Piotr Mardziel, and Anupam Datta. Smoothed geometry for robust attribution. In NeurIPS, 2020.
[57] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In CVPR, 2020.
[58] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang.
The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Sec. 5.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sec. 5.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] We didn't train models.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] We have conducted qualitative experiments in the main paper for the same seed and show the diversity of VCEs across seeds in App. B.3.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See App. D.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes] Yes, directly in the citation, when applicable.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We include our code in the supplemental material.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] ImageNet and ImageNet-21k are public datasets.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We include the results from the user study together with the screenshots of instructions in App. C.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] Not applicable.
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] See App. C.