# Robust Perception through Equivariance

Chengzhi Mao¹, Lingyu Zhang¹, Abhishek Vaibhav Joshi¹, Junfeng Yang¹, Hao Wang², Carl Vondrick¹

Deep networks for computer vision are not reliable when they encounter adversarial examples. In this paper, we introduce a framework that uses the dense intrinsic constraints in natural images to robustify inference. By introducing constraints at inference time, we can shift the burden of robustness from training to testing, thereby allowing the model to dynamically adjust to each individual image's unique and potentially novel characteristics at inference time. Our theoretical results show the importance of having dense constraints at inference time. In contrast to existing single-constraint methods, we propose to use equivariance, which naturally allows dense constraints at a fine-grained level in the feature space. Our empirical experiments show that restoring feature equivariance at inference time defends against worst-case adversarial perturbations. The method obtains improved adversarial robustness on four datasets (ImageNet, Cityscapes, PASCAL VOC, and MS-COCO) on image recognition, semantic segmentation, and instance segmentation tasks.

1. Introduction

Despite the strong performance of deep networks on computer vision benchmarks (He et al., 2016; 2017; Yu et al., 2017), state-of-the-art systems are not reliable when evaluated in open-world settings (Geirhos et al., 2019; Hendrycks et al., 2021; Szegedy et al., 2013; Hendrycks & Dietterich, 2019; Croce & Hein, 2020; Carlini & Wagner, 2017). However, robustness against a large number of adversarial cases remains a prerequisite for deploying models in real-world applications, such as medical imaging, healthcare, and robotics.

¹Department of Computer Science, Columbia University, New York, USA. ²Department of Computer Science, Rutgers University, New Jersey, USA. Correspondence to: Chengzhi Mao.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 1. Equivariance is shared across the input images (left) and the output labels (right), providing a dense constraint. The predictions from a model F(x) should be identical to performing a spatial transformation on x, a forward pass of F, and undoing that spatial transformation on the output space (black).

Due to the importance of this problem, there have been a large number of investigations aiming to improve the training algorithm to establish reliability. For example, data augmentation (Yun et al., 2019; Hendrycks et al., 2021) and adversarial training (Madry et al., 2017; Carmon et al., 2019) improve robustness by training the model on anticipated distribution shifts and worst-case images. However, placing the burden of robustness on the training algorithm means that the model can only be robust to the corruptions that are anticipated ahead of time, which is an unrealistic assumption in an open world. In addition, retraining the model on each new distribution can be expensive. To address this challenge, we propose to robustify the model at inference time. Specifically, instead of retraining the whole model on the new distribution, our inference-time defense shifts the burden to test time with a robust inference algorithm that does not update the model.
Prior work (Mao et al., 2021; Shi et al., 2020; Wang et al., 2021) is limited to a single constraint at inference time and hence may not provide the model with enough information to dynamically adjust to the unique and potentially novel characteristics of the corruption in the test image. We therefore ask a natural question: can we further improve robustness by increasing the number of constraints? We start with a theoretical analysis and prove that applying more constraints at inference time strictly improves the model's robustness. The next question is then: how can we efficiently apply multiple constraints at inference time? One approach is to directly apply multiple feature invariance constraints in the defense. While this defense is effective, we find the resulting representation can be limited by the invariance property, which harms robust accuracy. For example, after resizing, the representations of segmentation models are not the same, and it is unclear which part should be invariant. Our empirical results suggest that a better approach is to use dense equivariance constraints.

Our main hypothesis is that visual representations must be equivariant under spatial transformations, a dense property that should hold for all natural images (equivariance consistency in Figure 1). This property holds when the test data are from the same distribution that the model has been trained on. However, once there is an adversarial corruption, the equivariance is often broken (Figure 2). Our key insight is therefore that we can repair the model's prediction on corrupted data by restoring the equivariance.

Empirical experiments, theoretical analysis, and visualizations highlight that equivariance significantly improves model robustness over other methods (Mao et al., 2021; Shi et al., 2020; Wang et al., 2021). On four large datasets (ImageNet (Deng et al., 2009), Cityscapes (Cordts et al., 2016), PASCAL VOC (Everingham et al., 2010), and MS-COCO (Lin et al., 2014)), our approach improves adversarial robust accuracy by up to 15 points. Our study shows that equivariance can efficiently improve robustness by increasing the number of constraints (Figure 3). Even under two adaptive adversarial attacks where the attacker knows our defense (Athalye et al., 2018; Mao et al., 2021), adding our method improves robustness. In addition, since equivariance is an intrinsic property of visual models, we do not need to train a separate model to predict equivariance (Shi et al., 2020; Mao et al., 2021). Our code is available at https://github.com/cvlab-columbia/Equi4Rob.

2. Related Work

Equivariance. Equivariance benefits a number of visual tasks (Dieleman et al., 2016; Cohen & Welling, 2016b; Gupta et al., 2021; Zhang, 2019; Chaman & Dokmanić, 2021a;b). Cohen & Welling (2016a) proposed the first group-convolutional operation that produces features equivariant to a symmetry group. However, it can only be equivariant to a discrete subset of transformations (Sosnovik et al., 2019). Steerable equivariance achieves continuous equivariant transformations (Cohen & Welling, 2016b; Weiler et al., 2018) on a defined set of basis functions, but it cannot be applied to arbitrary convolution filters due to the requirement of an equivariant basis. Besides architecture design (Weiler & Cesa, 2019), adding regularization (Barnard & Casasent, 1991) can improve equivariance in the network.
Kamath et al. (2021) show that enforcing equivariance at training time decreases adversarial robustness. Our method sidesteps this issue by promoting equivariance for attacked images at test time, improving equivariance when it is most needed.

Adversarial Attack and Defense. Adversarial attacks (Szegedy et al., 2013; Madry et al., 2017; Cisse et al., 2017a; Dong et al., 2018; Carlini & Wagner, 2017; Croce & Hein, 2020; Arnab et al., 2018) are perturbations optimized to change the prediction of deep networks. Adversarial training (Madry et al., 2017; Rice et al., 2020; Carmon et al., 2019) and its variants (Mao et al., 2019; 2022; Zhang et al., 2019) are the standard way to defend against adversarial examples. Matching algorithms that produce features invariant to adversarial perturbations have been shown to produce robust models (Mahajan et al., 2021; Zhang et al., 2019). However, training-time defenses can only be robust to the attacks they have been trained on. Multitask learning (Mao et al., 2020; Zamir et al., 2020) and regularization (Cisse et al., 2017b) can improve adversarial robustness, but they do not consider spatial equivariance in their tasks. Recently, inference-time defenses using contrastive invariance (Mao et al., 2021) and rotation (Shi et al., 2020) have been shown to improve adversarial robustness without retraining the model on unforeseen attacks. However, they only apply a single constraint, which may not provide enough information.

Test-Time Adaptation. Berthelot et al. (2019) and Pastore et al. (2021) perform test-time training on the entire test set for many iterations; our method only assumes seeing one example at a time and performs test-time adaptation on a single image. Tsai et al. (2023) adapt the model with a convolutional prompt, but this only works with a large batch size. Test-time adaptation is also useful in the language domain (McDermott et al.). By leveraging equivariance, we can efficiently incorporate dense constraints into our framework, which can be orders of magnitude more effective than adding constraints individually (Sun et al., 2020; Lawhon et al., 2022).

In this section, we first introduce equivariance for visual representations, then present algorithms that improve adversarial robustness using equivariance, and finally provide theoretical insight into why multiple constraints lead to such improvement.

Figure 2. Random examples showing equivariance on clean images and non-equivariance on attacked images in Cityscapes (semantic segmentation), PASCAL VOC (semantic segmentation), and COCO (semantic and instance segmentation). The representation is equivariant when the prediction on the original image (2nd column) and the reversed predictions on transformed images (3rd, 4th columns) are the same. By restoring equivariance, our method corrects the prediction.

3.1. Equivariance in Vision Representation

Let x be an input image. A neural network produces a representation h = F(x) for the input image. Assume there is a transformation g for the input image. A neural network is equivariant only when:

$F(x) = g^{-1} \circ F \circ g(x),$  (1)
where $g^{-1}(\cdot)$ denotes the inverse transformation of $g(\cdot)$, and $\circ$ denotes function composition. Equivariant representations change symmetrically with the input transformation: applying the transformation to the input image and undoing it in the representation space should produce the same representation as feeding in the original image. Equivariance is a meta property that can be applied to dense feature maps and generalizes to most existing vision tasks (Gupta et al., 2021; Laptev et al., 2016; Marcos et al., 2016). In contrast, invariance is defined as $F(x) = F \circ g(x)$, which requires the model to produce the same representation after different transformations, such as texture augmentation (Geirhos et al., 2019) and color jittering (Mao et al., 2021; Chen et al., 2020). Because it does not transform the output in the same way as the input, invariance removes all information related to the transformation, which can hurt performance if that information is crucial to the final task (Lee et al., 2021). On the contrary, equivariant models maintain the covariance of the transformations (Gupta et al., 2021; Laptev et al., 2016; Marcos et al., 2016).

Transformations for Equivariance. We use spatial transformations, such as flipping, resizing, and rotation, in our experiments. Assume we apply k different transformations $g_i$, where $i = 1, \dots, k$, and denote the cosine similarity as $\cos(\cdot, \cdot)$. Equivariance across all transformations means the following term is large:

$\sum_{i=1}^{k} \cos\big(g_i^{-1} \circ F \circ g_i(x),\ F(x)\big)$  (2)

Algorithm 1: Equivariance Defense
1: Input: potentially attacked image x, step size η, number of iterations T, deep network F, reverse attack bound v, and equivariance loss function $L_{equi}$.
2: Output: prediction $\hat{y}$.
3: Initialize: $x' \leftarrow x$.
4: for t = 1, ..., T do
5:   $x' \leftarrow x' + \eta \cdot \mathrm{sign}\big(\mathrm{Normalize}(\nabla_{x'} L_{equi}(x')) + \mathcal{N}(0, \tfrac{T-1-t}{T})\big)$
6:   $x' \leftarrow \Pi_{(x, v)}(x')$, which projects the image back into the bounded region.
7: end for
8: Predict the final output $\hat{y} = F(x')$.

3.2. Equivariance for Adversarial Robustness

Let y be the ground-truth category labels for x, and let the network that uses the feature h for the final task prediction be $C'$. For prediction, neural networks learn to predict the category $\hat{y} = C' \circ F(x)$ by minimizing the loss $L(\hat{y}, y)$ between the predictions and the ground truth; for semantic segmentation, for example, L is the cross-entropy over each output pixel. We define the loss for the final task as follows:

$L_t(x, y) = L\big(C' \circ F(x), y\big),$  (3)

Adversarial Attack. To fool the model's prediction, an adversarial attack finds an additive perturbation δ to the image such that the task loss (Equation 3) is maximized:

$x_a = \arg\max_{x_a} L_t(x_a, y), \quad \text{s.t. } \|x_a - x\|_q \le \epsilon,$  (4)

where the perturbation vector $\delta = x_a - x$ has an $\ell_q$ norm smaller than $\epsilon$, keeping the perturbation invisible to humans.

Equivariance Recalibration Defense. Given an input image, we can calculate the equivariance loss $L_{equi}$. As shown in Figure 2, the representations are non-equivariant when the input $x_a$ is adversarially perturbed, i.e., the term $L_{equi}$ is low. We therefore find an intervention that recalibrates the input image $x_a$ so as to improve the feature equivariance of the image. To do this, we optimize a vector r by maximizing the equivariance objective:

$r = \arg\max_{r} L_{equi}(x_a + r), \quad \text{s.t. } \|r\|_q \le v,$  (5)

where v defines the bound of our reverse attack r. The additive intervention r modifies the adversarial image $x_a$ so that it restores equivariance in the feature space (a code sketch of this recalibration is given below).
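To make the objective concrete, the following is a minimal PyTorch sketch of the equivariance loss in Equation 2 and the annealed-noise recalibration loop of Algorithm 1. The function and variable names (`feat`, `transforms`, `restore_equivariance`) and the hyperparameter values are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F_nn

def equivariance_loss(feat, x, transforms):
    """Eq. 2 (sketch): average cosine similarity between F(x) and g_i^{-1}(F(g_i(x)))."""
    h = feat(x)                                    # dense features, shape (B, C, H, W)
    sims = []
    for g, g_inv in transforms:
        h_t = g_inv(feat(g(x)))                    # transform the input, undo in feature space
        if h_t.shape[-2:] != h.shape[-2:]:         # align spatial size after e.g. resizing
            h_t = F_nn.interpolate(h_t, size=h.shape[-2:], mode="bilinear", align_corners=False)
        sims.append(F_nn.cosine_similarity(h, h_t, dim=1).mean())
    return torch.stack(sims).mean()

def restore_equivariance(feat, x_adv, transforms, steps=20, step_size=2 / 255, bound=8 / 255):
    """Algorithm 1 (sketch): maximize equivariance over an additive reverse vector r,
    with Gaussian noise annealed linearly to zero (an SGLD-style warm-up followed by
    what is effectively a MAP estimate)."""
    r = torch.zeros_like(x_adv, requires_grad=True)
    for t in range(steps):
        loss = equivariance_loss(feat, (x_adv + r).clamp(0, 1), transforms)
        grad, = torch.autograd.grad(loss, r)
        noise_scale = (steps - 1 - t) / steps      # linear annealing of the noise level
        with torch.no_grad():
            direction = grad / (grad.norm() + 1e-12) + noise_scale * torch.randn_like(grad)
            r += step_size * direction.sign()      # gradient ascent on the equivariance loss
            r.clamp_(-bound, bound)                # project r back into the L_inf ball of radius v
    return (x_adv + r).detach().clamp(0, 1)

# Example transforms as (g, g_inverse) pairs acting on image/feature tensors:
transforms = [
    (lambda t: torch.flip(t, dims=[-1]),
     lambda t: torch.flip(t, dims=[-1])),          # horizontal flip is its own inverse
    (lambda t: F_nn.interpolate(t, scale_factor=0.5, mode="bilinear", align_corners=False),
     lambda t: F_nn.interpolate(t, scale_factor=2.0, mode="bilinear", align_corners=False)),  # resize
]
```

The defended prediction would then be obtained by running the deployed model on the restored image, e.g., `model(restore_equivariance(feat, x_adv, transforms))`, where `feat` is the model's feature backbone.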
We optimize the above objective via projected gradient descent to repair the input. To avoid the optimization converging to a poor local optimum, we first perform SGLD (Welling & Teh, 2011) to obtain a good Bayesian posterior distribution over solutions (Wang et al., 2019). To avoid sampling from the SGLD posterior and to improve inference speed, we then use maximum a posteriori (MAP) estimation to find a single solution. Empirically, we add Gaussian noise to the gradient during optimization and linearly anneal the noise level to zero. We show the optimization procedure in Algorithm 1. We use the same optimizer for the invariance objective that we compare against. In contrast to Mao et al. (2021) and Shi et al. (2020), we do not need to pre-train another network for a self-supervision task offline. In addition, equivariance in the feature space provides a dense constraint because, by projecting the transformation back to the original space, we can match each element of the feature map. Image-level self-supervision tasks, such as contrastive loss and rotation prediction, do not have this dense supervision advantage.

Figure 3. Adversarial robustness (robust mIoU and instance segmentation score) under an increasing number of constraints log(K) obtained through equivariance at inference time.

Adaptive Attack I. We now analyze our method's robustness when the attacker knows our defense strategy and takes it into consideration. Following the defense-aware attack setup of Mao et al. (2021), the adaptive attacker can maximize the following objective:

$L_l(x_a, y, \lambda_e) = L_t(x_a, y) + \lambda_e L_{equi}(x_a),$  (6)

where the first term fools the final task and the second term optimizes for equivariance. A larger $\lambda_e$ allows the adversarial budget to focus more on respecting feature equivariance, which reduces the effect of our defense. However, with a fixed adversarial budget, increasing $\lambda_e$ also reduces the attack's effectiveness on the final task. Our defense thus creates a lose-lose situation for the attacker: if they account for our defense, they hurt the attack's effectiveness on the final task; if they ignore our defense, our defense undoes the attack.

Adaptive Attack II. The above adaptive attack avoids the unstable gradient of the iterative optimization by using a Lagrangian regularization term. Another way to bypass such a defense is through BPDA (Athalye et al., 2018). Specifically, the equivariance recalibration process formulated in Eq. 5 can be treated as a preprocessor $h(\cdot)$ applied at test time, where $h(x_a) = x_a + r$. Given a pre-trained classifier $f(\cdot)$, this method can be formulated as $f(h(x))$. The proposed process $h(\cdot)$ may cause exploding or vanishing gradients. Following (Athalye et al., 2018; Croce et al., 2022), we use BPDA to approximate $h(\cdot)$, where an identity function is used for the backward pass of the restored images. While this may make the backward gradient inaccurate, it avoids differentiating through the inner optimization procedure, which often leads to vanishing or exploding gradients.

3.3. Theoretical Results for Adversarial Robustness with Multiple Constraints

One major advantage of equivariance is that it allows dense constraints through the inverse transformation. We provide theoretical insight into why a dense intrinsic constraint is preferable to a single intrinsic constraint. Existing methods restore the input image to respect a single self-supervision label $y^{(s_1)}$.
With a dense intrinsic constraint, the defended model can predict with a set of fine-grained self-supervision signals $y^{(s_i)}$, where $i = 1, 2, \dots, K$. In our case, each $y_a^{(s_i)}$ is the predicted self-supervision value under adversarial attack, and each $y^{(s_i)}$ is the predicted self-supervision value in our feature map after the equivariance transformation. Following Mao et al. (2021), we propose the following lemma:

Lemma 3.1. The standard classifier under adversarial attack is equivalent to predicting with $P(Y \mid X_a, y_a^{(s_1)}, \dots, y_a^{(s_k)})$, and our approach is equivalent to predicting with $P(Y \mid X_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)})$.

By adjusting the input image so that it satisfies a set of denser constraints, the predicted task Y uses both the information from the image and the intrinsic equivariance structure. We now show, from an information-theoretic perspective, that restoring the dense constraints in the visual representation strictly improves the upper bound compared with restoring the structure of a single self-supervision task (Mao et al., 2021; Shi et al., 2020).

Theorem 3.2. Assume the classifier operates better than chance and instances in the dataset are uniformly distributed over n categories. Let the prediction accuracy bounds be $P(Y \mid y_a^{(s_1)}, \dots, y_a^{(s_k)}, X_a) \in [b_0, c_0]$, $P(Y \mid y^{(s_1)}, X_a) \in [b_1, c_1]$, $P(Y \mid y^{(s_1)}, y^{(s_2)}, X_a) \in [b_2, c_2]$, ..., and $P(Y \mid y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}, X_a) \in [b_k, c_k]$. If the conditional mutual information $I(Y; Y^{(s_i)} \mid X_a) > 0$ and $I(Y; Y^{(s_i)} \mid X_a, Y^{(s_j)}) > 0$ for $i \neq j$, then $b_0 \le b_1 \le \dots \le b_k$ and $c_0 < c_1 < c_2 < \dots < c_k$, which means our approach strictly improves the upper bound for classification accuracy.

In words, the adversarial perturbation $X_a$ corrupts the shared information between the label Y (our target task) and the equivariance structure $Y^{(s_i)}$ (the self-supervised task). Theorem 3.2 shows that by recovering information from more $Y^{(s_i)}$, the task performance can be improved.

Directly increasing the number of invariance objectives is a straightforward baseline for increasing the number of constraints. However, it is limited because 1) each invariance objective only adds one constraint, which is inefficient, and 2) invariance cannot be directly applied to many transformations, such as resizing and rotation, due to the mismatch in fine-grained representations, whereas equivariance can. In contrast to invariance, dense equivariance allows us to add one constraint on each element of the feature map¹, which increases the number of constraints orders of magnitude faster with a more diverse set of transformations, providing an efficient way to apply multiple constraints. By subsampling different numbers of constraints from equivariance, Figure 3 validates the trend of improving robustness as the number of constraints increases.

¹A 100 × 100 feature map provides 10,000 constraints.

The adaptive attack needs to respect the information in $Y^{(s_i)}$, which itself limits the attacker's ability, since the attacker must perform a multitask optimization, which is harder (Mao et al., 2020). The adaptive attacker predicts the task conditioned on the right set of self-supervision labels $Y^{(s_i)}$, which fulfills the conditions of Theorem 3.2 and improves robustness.

4. Experiments

Our experiments evaluate adversarial robustness on four datasets: ImageNet (Deng et al., 2009), Cityscapes (Cordts et al., 2016), PASCAL VOC (Everingham et al., 2010), and MS-COCO (Lin et al., 2014). We use up to six strong attacks, including Houdini, adaptive attacks, and BPDA, to evaluate robustness.
We first show that our equivariance-based defense improves the robustness of state-of-the-art adversarially trained robust models. We then show that even on standard models without defense training, adding test-time equivariance improves robustness.

4.1. Datasets and Tasks

ImageNet (Deng et al., 2009) contains 1000 categories. Due to its large size, we randomly sample 2% of the data for evaluation. Cityscapes (Cordts et al., 2016) is an urban driving scene dataset; we study the semantic segmentation task. Following Mao et al. (2020), we resize the images to 680 × 340 for fast inference and use a pretrained dilated residual network (DRN) for segmentation. PASCAL VOC (Everingham et al., 2010) is a semantic segmentation dataset; we resize images to 480 × 480 and use the pre-trained DeepLabV3+ model. MS-COCO (Lin et al., 2014) is a large-scale image dataset of common objects that supports both semantic segmentation and instance segmentation. For semantic segmentation, we resize the images to 400 × 400. We use pretrained DeepLabV3 and Mask R-CNN models for semantic segmentation and instance segmentation, respectively.

4.2. Attack Methods

IFGSM (seg) (Arnab et al., 2018) was used to evaluate the robustness of segmentation models with multiple steps of the fast gradient sign method. PGD (Madry et al., 2017) is a standard iterative adversarial attack, which performs gradient ascent and projects the attack vector inside the defined $\ell_p$ norm ball. MIM (Dong et al., 2018) adds a momentum term to the gradient ascent of PGD, yielding a stronger attack that can escape local optima. Houdini (Cisse et al., 2017a) is the state-of-the-art adversarial attack for decreasing the mIoU score of semantic segmentation; it proposes a surrogate objective function that optimizes the mIoU score directly. Adaptive Attack (AA) (Mao et al., 2021) is the standard defense-aware attack for inference-time defense methods: the adaptive attack knows our defense algorithm and optimizes the attack vector to respect equivariance while fooling the final task. Since the attack already respects and adapts to equivariance, our defense has less room to improve by further optimizing for equivariance. BPDA (Athalye et al., 2018; Croce et al., 2022) is an adaptive attack for input purification. In our case, we forward the adapted images in the forward pass and straight-through the gradient from the adapted image to the input image.

Table 1. Classification accuracy on ImageNet and segmentation mIoU on Cityscapes for adversarially trained models with ε = 4/255. Using equivariance improves robustness more than other methods.

ImageNet; adversarially pretrained model (Wong et al., 2020); classification accuracy
Evaluation   Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean        51.5     49.4    49.5      49.2         48.8        49.3
PGD          26.5     28.0    28.2      29.3         28.6        32.2
CW           26.6     28.3    28.6      29.8         32.2        32.2
AA           26.5     28.0    28.2      29.3         28.6        32.2
BPDA         26.5     28.0    27.9      28.8         28.9        30.4

Cityscapes; adversarially trained DRN-22-d; segmentation mIoU
Evaluation   Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean        53.23    52.96   51.72     53.00        49.04       48.74
IFGSM (seg)  33.06    33.21   33.47     33.59        32.36       34.04
PGD          26.61    27.04   27.68     28.14        27.74       29.65
MIM          26.59    27.06   27.72     28.24        27.76       29.56
Houdini      23.47    24.07   25.56     26.61        26.97       29.80
AA           26.61    27.04   27.68     28.14        27.74       29.63
BPDA         26.61    27.00   26.37     28.50        23.31       29.83
4.3. Baselines

We compare our method with vanilla feed-forward inference and four existing inference-time defense methods. Random defense (Kumar et al., 2020) defends against adversarial attacks by adding random noise to the input, and is used as a baseline in (Mao et al., 2021). Rotation defense (Shi et al., 2020) purifies adversarial examples by restoring the performance of the rotation prediction task at inference time, which can recover the image information related to rotation. However, the information related to rotation may be misleading due to the illusion issue (ill), which limits its power on complex tasks. Contrastive defense (Mao et al., 2021) restores the intrinsic structure of the image using the SimCLR (Chen et al., 2020) objective at inference time, which achieves state-of-the-art adversarial robustness on image recognition tasks. Contrastive learning requires images to be object-centric, which may not hold for segmentation and detection datasets where multiple objects appear in the same image. Invariance defense follows the same setup as our equivariance experiment but replaces the equivariance loss with an invariance loss; to obtain several constraints from invariance, we use the same diversified set of transformations as in the equivariance setup. We propose this baseline to study the importance of using equivariance to apply multiple constraints.

4.4. Implementation Details

We choose the number of transformations to be K = 8, which empirically fits into a 2080Ti GPU with batch size 1. To increase the constraints obtained from equivariance, we empirically use a diversified set of transformations, which includes four resizing transformations ranging from 0.3 to 2 times the original size, one color jittering transformation, one horizontal flip transformation, and two rotation transformations between -15 and 15 degrees. For transformations that move part of the original image out of view, we only consider the overlapping region when calculating the loss. An ablation study on the effect of each transformation is shown in Section 4.7. We use T = 20 steps for all our defense tasks. Since the invariance objective cannot be computed in the dense feature space after spatial transformations due to the positional mismatch, we apply average pooling to all the features and then compute the invariance loss.

4.5. Results on Adversarially Trained Models

Adversarial training is the standard way to defend against adversarial examples. We first validate whether our proposed approach can further improve the robustness of adversarially trained models. For ImageNet, we use the adversarially pretrained model with ε = 4/255 from Wong et al. (2020) and set the defense vector bound to v = 2ε. With the state-of-the-art contrastive learning method (Mao et al., 2021), robust accuracy improves by 3 points over the vanilla defended model.

Table 2. Semantic segmentation mIoU on the Cityscapes, PASCAL VOC, and MS-COCO datasets. None of the models is adversarially trained. Under different types of attacks bounded by $\ell_\infty$, ε = 4/255, our method consistently outperforms other defense methods.
Cityscapes; pretrained DRN-22-d model; segmentation mIoU
Evaluation  Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean       58.29    52.38   34.30     37.22        33.84       37.95
PGD         1.31     1.47    13.20     8.44         14.49       30.76
MIM         1.40     1.49    13.80     8.13         14.57       30.10
Houdini     0.00     0.21    16.31     10.12        14.16       30.52
AA          1.31     1.47    13.20     8.44         14.49       30.28
BPDA        1.31     1.47    6.20      4.78         6.82        11.64

PASCAL VOC; pretrained DeepLabV3; segmentation mIoU
Evaluation  Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean       69.52    68.96   28.63     66.92        63.64       56.58
PGD         6.46     6.52    6.91      18.72        39.07       43.51
MIM         5.63     5.74    6.35      18.25        37.43       41.56
Houdini     0.02     0.08    6.14      19.11        31.30       52.26
BPDA        6.46     6.46    8.23      5.45         15.15       25.68

MS-COCO; pretrained DeepLabV3-ResNet50; segmentation mIoU
Evaluation  Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean       63.02    62.97   60.92     57.28        43.07       44.71
PGD         2.62     2.65    5.79      14.75        23.92       24.51
MIM         2.71     2.52    5.66      13.61        20.53       21.30
Houdini     0.05     0.10    4.78      22.69        36.94       37.33
BPDA        2.62     2.63    1.15      2.35         17.13       18.69

Mask R-CNN; instance segmentation mask AP
Evaluation  Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Clean       34.5     33.6    31.2      29.7         14.3        23.4
PGD         0.0      1.6     2.6       8.9          12.9        21.3
MIM         0.0      1.6     2.7       9.1          13.2        21.2
BPDA        0.0      0.0     0.3       1.7          8.7         9.9

Adaptive attack (AA) poses a lose-lose situation and does not further decrease robust accuracy, consistent with the observation of Mao et al. (2021). Under the strongest adaptive attack, BPDA (Croce et al., 2022), it drops by 0.5 points.² Using the equivariance objective, robustness improves more than with the other methods under both the standard attacks and the adaptive attack BPDA. Even though BPDA decreases the equivariance defense by 1.8 points, equivariance still improves robustness by 3.9 points over not using it.

On Cityscapes, we downsample the images from 2048 × 1024 to 680 × 340 to reduce computation, following the setup of Mao et al. (2020). We adversarially train a segmentation model and evaluate it in Table 1, measured with mean Intersection over Union (mIoU) for semantic segmentation. We set the defense vector bound to v = 2.5ε. Among the standard attacks, Houdini reduces robust accuracy the most, and using equivariance constraints at test time recovers 6 points of performance. Using the adaptive attack (Mao et al., 2021), the robust accuracy of equivariance drops by only 0.2 points. Under the BPDA adaptive attack, the robustness of the invariance-based method drops 4 points, which suggests that invariance relies mostly on obfuscated gradients and is not an effective constraint to maintain at inference time for segmentation. In contrast, BPDA cannot undermine the equivariance-based model's robustness. On adversarially trained models, equivariance consistently outperforms all other test-time defenses, demonstrating that equivariance is a better intrinsic structure to respect at inference time.

²Recent work (Croce et al., 2022) uses a batch size of 50 for the contrastive loss, which is a weaker defense due to the small batch size. Here, we use the original batch size of 400 as in (Mao et al., 2021), which provides a stronger defense due to the large batch size, and we observe robust accuracy improve over Vanilla.

Table 3. Segmentation mIoU under targeted attacks on Cityscapes with the DRN-22-d backbone. Restoring the equivariance moves the predicted segmentation map toward the ground truth.
            mIoU to Attack Target (↓)                      mIoU to Ground Truth (↑)
Evaluation  Vanilla  Invariance  Equivariance (Ours)       Vanilla  Invariance  Equivariance (Ours)
PGD         68.03    12.92       14.96                     10.08    25.10       30.01
MIM         71.49    13.78       12.63                     9.78     23.11       28.51
Houdini     54.49    12.83       16.80                     17.17    24.56       30.26
BPDA        68.03    25.64       25.82                     10.08    17.73       20.14

Figure 4. Our method improves robustness under targeted adversarial attacks (random sample); the columns show the final prediction F(x) and the reversed predictions $g_1^{-1} \circ F \circ g_1(x)$ and $g_2^{-1} \circ F \circ g_2(x)$. Under the targeted adversarial attack, the model fails to predict the bicycle on the road and instead predicts a sidewalk. In the middle row, the attacked model's representation produces different segmentation maps under different transformations, suggesting that the model is no longer equivariant. By restoring the equivariance, we correct the model prediction.

4.6. Results on Non-Adversarially Trained Models

We have shown that equivariance improves the robustness of adversarially trained models. However, most pretrained models are not adversarially defended. We thus study whether our method can also improve the robustness of standard models.

Cityscapes Semantic Segmentation. In Table 2, we first conduct five types of attacks on the DRN-22-d segmentation model. We use 20 steps of defense, i.e., K = 20, a step size of 2v, and set the defense vector bound to v = 1.5ε. While the strongest Houdini attack can reduce the mIoU score to 0, our defense restores the mIoU score by over 30 points. For the adaptive attack, we search for the $\lambda_e$ that reduces robust performance the most and find that $\lambda_e = 1000$ produces the most effective attack, which still cannot bypass our defense. For the baselines, we find $\lambda_e = 0$ produces the most effective attack. For standard backbones that are not adversarially trained, we find BPDA to be the most effective attack, so we only evaluate BPDA on the following datasets. We run 10 steps of BPDA with 20 steps of reversal in the inner loop, a total of 200 backward steps. Under the BPDA attack, the equivariance-based defense is still more effective than the other methods, including the invariance-based method.

PASCAL VOC Semantic Segmentation. We show results in Table 2. We use the pretrained DeepLabV3 (Chen et al., 2018; 2017) model with K = 20, a step size of 2v, and v = 1.5ε. Our approach significantly improves robustness compared with the other methods.

MS-COCO Semantic Segmentation. We show results in Table 2. We use the pretrained DeepLabV3 (Chen et al., 2018; 2017) model. On COCO, we use K = 2, a step size of 2v, and v = 1.25ε. Using equivariance outperforms the other test-time defense methods.

Instance Segmentation. Our defense can also secure the more challenging instance segmentation model. In Table 2, our method improves instance segmentation mask AP by up to 21 points, which demonstrates that our method can be applied to a large number of vision applications.

Targeted Attack. The above attacks are untargeted. We also analyze whether our conclusions hold under targeted attacks, where the attacker needs to fool the model into predicting a specific target. In Table 3, the targeted attack successfully misleads the model to predict the target, and our equivariance defense corrects the prediction toward the ground truth. Equivariance improves the mIoU metric by up to 10 points. We show visualizations in Figure 4.

4.7. Analysis
Equivariance Measurement. We calculate the equivariance value measured by Equation 2 for clean images, adversarially attacked images, and our defended images, and report the numbers in Table 5. While adversarial attacks corrupt the equivariance of the image, as shown by the lowered values in the table, our method is able to restore it. The visualizations in Figure 2 also show that our method clearly restores equivariance under attack.

Table 5. Measurement of equivariance on clean images, attacked images, and our restored images. A higher score indicates better equivariance. Adversarial attacks corrupt the equivariance; our method restores it to the same level as the clean images.

Dataset          ImageNet  Cityscapes  PASCAL VOC  COCO
Clean Images     0.539     0.694       0.900       0.901
Attacked Images  0.538     0.448       0.642       0.774
Restored Images  0.581     0.713       0.921       0.914

Ablation Study for Equivariance Transformations. In Table 6, we study the impact of using different transformations in our equivariance defense. We find that the transformations the model should be equivariant to, but in fact is not due to attacks, are the most effective ones for improving robustness. For example, flipping and resizing are the most effective for the semantic segmentation we study. Rotation below 15 degrees helps robustness more than rotation beyond 90 degrees. Large rotations perform worse because segmentation models are not equivariant to large rotations even on clean data, which reduces the effectiveness of our approach. In Section 4.4, we empirically choose the combination of transformations that produces good empirical results for our approach.

Table 6. The impact of using different transformations on the performance of our method, with a standard segmentation model on Cityscapes.

Loss          Flip   Resize  Rotation 15°  Rotation 90°
Invariance    9.56   9.90    9.75          9.60
Equivariance  20.50  26.00   17.03         8.61

The Trade-off between Robustness and Clean Accuracy. In Table 7, we show that increasing the bound v of the defense vector creates a trade-off between clean accuracy and robust accuracy. Specifically, the bound v = 1/255 is a sweet spot, where one can increase robustness by 0.4 points without any loss of clean accuracy. Our method allows dynamically trading off robustness against clean accuracy by controlling the bound of the additive vector.

Runtime Analysis and GPU Memory Usage. In Table 4, we show the running time and GPU memory usage of the studied methods. While our method leads to longer running time and larger GPU memory usage, we believe this is a necessary trade-off to achieve the best robustness. In many important applications, sacrificing accuracy or robustness for the sake of reducing running time or memory usage would be counterproductive. To mitigate this, we also propose to first detect adversarial examples and then perform our test-time adaptation only on the detected adversarial ones.

Table 4. Running time and GPU memory usage for the MS-COCO semantic segmentation task, evaluated on a single A6000 GPU.

                           Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Running Time (sec/sample)  0.016    0.016   0.322     0.152        1.632       1.653
Memory Usage (GB)          0.391    0.391   3.102     0.731        10.049      10.357
Detecting Adversarial Samples. A straightforward way to speed up our inference and improve accuracy on clean samples is to first detect adversarial samples and only run our algorithm on the detected adversarial samples. Table 8 reports the running time on COCO images with a single A6000 GPU, which shows that detection is much cheaper than our defense and can be used to reduce our computational cost. Since test-time optimization on clean examples decreases clean performance, we can also increase clean accuracy by first detecting the adversarial examples. In Appendix A.3.1, we show that we can increase clean performance by performing test-time optimization only on the detected adversaries.

Table 7. Trade-off between robustness and clean accuracy on Cityscapes using our equivariance method under the BPDA attack, as a function of the defense vector bound v = i/255. If clean performance is important, we can simply decrease the defense vector bound to increase clean accuracy.

Accuracy    i=0    i=1    i=2    i=4    i=6    i=8    i=10
Clean       53.23  53.24  53.09  52.38  50.62  48.84  48.74
Robustness  26.61  27.03  27.57  28.53  29.22  29.57  29.83

Table 8. Running time for different methods on vanilla feed-forward inference (Inference), detecting adversarial samples (Detection), and our test-time defense (Defense).

                 Rotation  Contrastive  Invariance  Equivariance (Ours)
Inference (sec)  0.016     0.016        0.016       0.016
Detection (sec)  0.048     0.049        0.147       0.169
Defense (sec)    0.306     0.136        1.616       1.637

5. Conclusion

Robust perception under adversarial attacks has been an open challenge. We find that equivariance is a desirable structure to maintain at inference time because it provides dense structural constraints at a fine-grained level. By dynamically restoring equivariance at inference, we show significant improvements in adversarial robustness across four datasets. Our work hints toward a new direction that uses the right structural information at inference time to improve robustness.

Acknowledgement

This research is based on work partially supported by the DARPA SAIL-ON program, the NSF NRI award #1925157, a GE/DARPA grant, a CAIT grant, and gifts from JP Morgan, DiDi, and Accenture. We thank the anonymous reviewers for their valuable feedback in improving the paper.

References

Rabbit and duck illusion. https://en.wikipedia.org/wiki/Rabbit-duck_illusion.

Arnab, A., Miksik, O., and Torr, P. H. On the robustness of semantic segmentation models to adversarial attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 888–897, 2018.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274–283. PMLR, 2018.

Barnard, E. and Casasent, D. Invariance and neural nets. IEEE Transactions on Neural Networks, 2(5):498–508, 1991.

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. MixMatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.

Carlini, N. and Wagner, D. A. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, pp. 39–57, 2017.

Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J. C., and Liang, P. S. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, volume 32, 2019.

Chaman, A. and Dokmanić, I. Truly shift-invariant convolutional neural networks.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3773–3783, 2021a.

Chaman, A. and Dokmanić, I. Truly shift-equivariant convolutional neural networks with adaptive polyphase upsampling. In 2021 55th Asilomar Conference on Signals, Systems, and Computers, pp. 1113–1120. IEEE, 2021b.

Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818, 2018.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Cisse, M., Adi, Y., Neverova, N., and Keshet, J. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017a.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854–863. PMLR, 2017b.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999. PMLR, 2016a.

Cohen, T. S. and Welling, M. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.

Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020.

Croce, F., Gowal, S., Brunner, T., Shelhamer, E., Hein, M., and Cemgil, T. Evaluating the adversarial robustness of adaptive test-time defenses. arXiv preprint arXiv:2202.13711, 2022.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Dieleman, S., De Fauw, J., and Kavukcuoglu, K. Exploiting cyclic symmetry in convolutional neural networks. In International Conference on Machine Learning, pp. 1889–1898. PMLR, 2016.

Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., and Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185–9193, 2018.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, 2019.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Gupta, D. K., Arya, D., and Gavves, E. Rotation equivariant Siamese networks for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12362–12371, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969, 2017.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. Advances in Neural Information Processing Systems, 32, 2019.

Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., and Gilmer, J. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.

Kamath, S., Deshpande, A., Kambhampati Venkata, S., and N Balasubramanian, V. Can we have it all? On the trade-off between spatial and adversarial robustness of neural networks. Advances in Neural Information Processing Systems, 34, 2021.

Kumar, A., Levine, A., Feizi, S., and Goldstein, T. Certifying confidence via randomized smoothing. Advances in Neural Information Processing Systems, 33:5165–5177, 2020.

Laptev, D., Savinov, N., Buhmann, J. M., and Pollefeys, M. TI-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 289–297, 2016.

Lawhon, M., Mao, C., and Yang, J. Using multiple self-supervised tasks improves model robustness. arXiv preprint arXiv:2204.03714, 2022.

Lee, H., Lee, K., Lee, K., Lee, H., and Shin, J. Improving transferability of representations via augmentation-aware self-supervision. Advances in Neural Information Processing Systems, 34, 2021.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pp. 740–755. Springer, 2014.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Mahajan, D., Tople, S., and Sharma, A. Domain generalization using causal matching. In International Conference on Machine Learning, pp. 7313–7324. PMLR, 2021.

Mao, C., Zhong, Z., Yang, J., Vondrick, C., and Ray, B. Metric learning for adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019.

Mao, C., Gupta, A., Nitin, V., Ray, B., Song, S., Yang, J., and Vondrick, C. Multitask learning strengthens adversarial robustness. In European Conference on Computer Vision, pp. 158–174. Springer, 2020.

Mao, C., Chiquier, M., Wang, H., Yang, J., and Vondrick, C. Adversarial attacks are reversible with natural supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 661–671, 2021.

Mao, C., Geng, S., Yang, J., Wang, X., and Vondrick, C. Understanding zero-shot adversarial robustness for large-scale models. arXiv preprint arXiv:2212.07016, 2022.

Marcos, D., Volpi, M., and Tuia, D. Learning rotation invariant convolutional filters for texture classification. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2012–2017. IEEE, 2016.

McDermott, N. T., Yang, J., and Mao, C. Robustifying language models with test-time adaptation. In ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML.
Mi, L., Wang, H., Tian, Y., and Shavit, N. Training-free uncertainty estimation for neural networks. In AAAI, 2022.

Pastore, G., Cermelli, F., Xian, Y., Mancini, M., Akata, Z., and Caputo, B. A closer look at self-training for zero-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2693–2702, 2021.

Rice, L., Wong, E., and Kolter, Z. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093–8104. PMLR, 2020.

Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., and Kloft, M. Deep one-class classification. In International Conference on Machine Learning, pp. 4393–4402. PMLR, 2018.

Shi, C., Holtz, C., and Mishne, G. Online adversarial purification based on self-supervised learning. In International Conference on Learning Representations, 2020.

Sosnovik, I., Szmaja, M., and Smeulders, A. Scale-equivariant steerable networks. arXiv preprint arXiv:1910.11093, 2019.

Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pp. 9229–9248. PMLR, 2020.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tack, J., Mo, S., Jeong, J., and Shin, J. CSI: Novelty detection via contrastive learning on distributionally shifted instances. In Advances in Neural Information Processing Systems, 2020.

Tsai, Y.-Y., Mao, C., Lin, Y.-K., and Yang, J. Self-supervised convolutional visual prompts. arXiv preprint arXiv:2303.00198, 2023.

Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. Tent: Fully test-time adaptation by entropy minimization. ICLR, 2021.

Wang, H., Mao, C., He, H., Zhao, M., Jaakkola, T. S., and Katabi, D. Bidirectional inference networks: A class of deep Bayesian networks for health profiling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 766–773, 2019.

Weiler, M. and Cesa, G. General E(2)-equivariant steerable CNNs. Advances in Neural Information Processing Systems, 32, 2019.

Weiler, M., Hamprecht, F. A., and Storath, M. Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 849–858, 2018.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688. Citeseer, 2011.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training, 2020.

Yu, F., Koltun, V., and Funkhouser, T. Dilated residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 472–480, 2017.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032, 2019.

Zamir, A. R., Sax, A., Cheerla, N., Suri, R., Cao, Z., Malik, J., and Guibas, L. J. Robust learning through cross-task consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11197–11206, 2020.
Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482. PMLR, 2019.

Zhang, R. Making convolutional networks shift-invariant again. In International Conference on Machine Learning, pp. 7324–7334. PMLR, 2019.

A. Appendix

A.1. Theoretical Results for Adversarial Robustness

We now give detailed proofs for Lemma A.1 and Theorem A.2.

Lemma A.1. The standard classifier under adversarial attack is equivalent to predicting with $P(Y \mid X_a, y_a^{(s_1)}, \dots, y_a^{(s_k)})$, and our approach is equivalent to predicting with $P(Y \mid X_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)})$.

Proof. For the standard classifier under attack, we know that $P(y_a^{(s_1)}, \dots, y_a^{(s_k)} \mid X = x_a) = 1$. Thus the standard classifier under adversarial attack is equivalent to
$P(Y \mid X = x_a) = \sum_{y_a^{(s_1)}, \dots, y_a^{(s_k)}} P(y_a^{(s_1)}, \dots, y_a^{(s_k)} \mid X = x_a)\, P(Y \mid y_a^{(s_1)}, \dots, y_a^{(s_k)}, X = x_a) = P(Y \mid y_a^{(s_1)}, \dots, y_a^{(s_k)}, X = x_a).$

Our algorithm finds a new input image
$x^{(n)}_{\max} = \arg\max_{x^{(n)}} P(X^{(n)} = x^{(n)} \mid X = x_a)\, P(y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)} \mid X^{(n)} = x^{(n)}) = \arg\max_{x^{(n)}} P(X^{(n)} = x^{(n)} \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}).$

Our algorithm first estimates $x^{(n)}_{\max}$ from the adversarial image $x_a$ and the self-supervised labels $y^{(s)}$. We then predict the label Y using the new image $x^{(n)}_{\max}$. Thus, our approach in fact estimates $P(Y \mid X^{(n)} = x^{(n)}_{\max})\, P(X^{(n)} = x^{(n)}_{\max} \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)})$. Note that the following holds:
$P(Y \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}) = \sum_{x^{(n)}} P(Y \mid x^{(n)})\, P(x^{(n)} \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}) \approx P(Y \mid X^{(n)} = x^{(n)}_{\max})\, P(X^{(n)} = x^{(n)}_{\max} \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}).$
Thus our approach is equivalent to estimating $P(Y \mid X = x_a, y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)})$.

We use the maximum a posteriori (MAP) estimate $x^{(n)}_{\max}$ to approximate the sum over $X^{(n)}$ because: (1) sampling a large number of $X^{(n)}$ is computationally expensive; (2) our results show that random sampling is ineffective; (3) our MAP estimate naturally produces a denoised image that can be useful for other downstream tasks.

Theorem A.2. Assume the classifier operates better than chance and instances in the dataset are uniformly distributed over n categories. Let the prediction accuracy bounds be $P(Y \mid y_a^{(s_1)}, \dots, y_a^{(s_k)}, X_a) \in [b_0, c_0]$, $P(Y \mid y^{(s_1)}, X_a) \in [b_1, c_1]$, $P(Y \mid y^{(s_1)}, y^{(s_2)}, X_a) \in [b_2, c_2]$, ..., and $P(Y \mid y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}, X_a) \in [b_k, c_k]$. If the conditional mutual information $I(Y; y^{(s_i)} \mid X_a) > 0$ and $I(Y; y^{(s_i)} \mid X_a, y^{(s_j)}) > 0$ for $i \neq j$, then $b_0 \le b_1 \le \dots \le b_k$ and $c_0 < c_1 < c_2 < \dots < c_k$, which means our approach strictly improves the bound for classification accuracy.

Proof. If $I(Y; y^{(s_i)} \mid X = x_a) > 0$ and $I(Y; y^{(s_i)} \mid X_a, y^{(s_j)}) > 0$ for $i \neq j$, then it is straightforward that:
$I(Y; y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}, X_a) > I(Y; y^{(s_i)}, X_a) > I(Y; y_a^{(s_i)}, X_a) = I(Y; X_a),$
$I(Y; y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}, X_a) > I(Y; y_a^{(s_1)}, \dots, y_a^{(s_k)}, X_a) = I(Y; X_a).$

Let the predicted label be $\hat Y$, assume there are n categories, and let the lower bound for the prediction accuracy be $\Pr(\hat Y = Y) \ge 1 - \bar p$. We define $H(\bar p) = -\bar p \log \bar p - (1 - \bar p)\log(1 - \bar p)$. Using Fano's inequality, we have

$H(Y \mid X_a) \le H(\bar p) + \bar p \log(n - 1)$  (7)

$\bar p \log(n - 1) \ge H(Y \mid X_a) - H(\bar p)$  (8)

We add $H(Y)$ to both sides:

$H(Y) \le \bar p \log(n - 1) + H(\bar p) + I(Y; X_a)$  (9)

because $I(Y; X_a) = H(Y) - H(Y \mid X_a)$. Then we get

$H(\bar p) + \bar p \log(n - 1) \ge -I(Y; X_a) + H(Y)$  (10)

Now we define a new function $G(\bar p) = H(\bar p) + \bar p \log(n - 1)$. Given that in the classification task the number of categories is $n \ge 2$, we know $\log(n - 1) \ge 0$.
Given that the entropy function $H(\bar p)$ first increases and then decreases, the function $G(\bar p)$ should also first increase, peak at some point, and then decrease. We calculate the $\bar p$ at the peak by setting the first-order derivative $G'(\bar p) = 0$. Solving this, we have

$\bar p = 1 - \tfrac{1}{n},$  (11)

which shows that the function $G(\bar p)$ is monotonically increasing when $\bar p \in [0, 1 - \tfrac{1}{n}]$. Since the base classifier already achieves accuracy better than random guessing, the given classifier satisfies $\bar p \in [0, 1 - \tfrac{1}{n}]$. The function $G(\bar p) = H(\bar p) + \bar p \log(n - 1)$ is therefore monotonically increasing in our studied region and has an inverse function $G^{-1}$. Rewriting Equation 10, we then have

$G(\bar p) \ge -I(Y; X_a) + H(Y)$  (12)

Applying the inverse function $G^{-1}$ to both sides:

$\bar p \ge G^{-1}\big(-I(Y; X_a) + H(Y)\big)$  (13)

$1 - \bar p \le 1 - G^{-1}\big(-I(Y; X_a) + H(Y)\big)$  (14)

Note that $(1 - \bar p)$ is our defined accuracy. Similarly, we have:
$1 - \bar p \le c_1 = 1 - G^{-1}\big(-I(Y; y^{(s_1)}, X_a) + H(Y)\big),$
$1 - \bar p \le c_2 = 1 - G^{-1}\big(-I(Y; y^{(s_1)}, y^{(s_2)}, X_a) + H(Y)\big),$
...,
$1 - \bar p \le c_k = 1 - G^{-1}\big(-I(Y; y^{(s_1)}, y^{(s_2)}, \dots, y^{(s_k)}, X_a) + H(Y)\big),$
where each upper bound is a function of the corresponding mutual information. Since $H(Y)$ is a constant, larger mutual information strictly increases the bound. Thus $c_0 < c_1 < c_2 < \dots < c_k$. In addition, the lower bound does not get worse given the additional information. Thus $b_0 \le b_1 \le \dots \le b_k$.

A.2. Detection

Anomaly detection, also referred to as novelty detection or outlier detection, identifies inputs on which the model is uncertain because they deviate from the training distribution. Ruff et al. (2018) conduct anomaly detection by training a binary classifier on in-distribution and collected out-of-distribution data; however, it is hard to foresee the out-of-distribution data. Hendrycks et al. (2019), Gidaris et al. (2018), and Tack et al. (2020) need to first train the model with self-supervision and then perform OOD detection using the performance of the self-supervision task. In this paper, we focus on the training-free approach that uses sensitivity to estimate the uncertainty of the model (Mi et al., 2022).

A.2.1. Equivariance for Anomaly Detection

Let y be the ground-truth category labels for x, and let the network that uses the feature h for the final task prediction be $C'$. For prediction, neural networks learn to predict the category $\hat y = C' \circ F(x)$ by minimizing the loss $L(\hat y, y)$ between the predictions and the ground truth. For example, for semantic segmentation, L is the cross-entropy over each output pixel; for depth prediction, L is an L1 loss over each predicted depth pixel. We define the loss for the final task as follows:

$L_t(x, y) = L\big(C' \circ F(x), y\big),$  (15)

As shown in Figures 6 and 7, when the model is uncertain about the input and makes a wrong prediction, it is often less equivariant. Prior work (Tack et al., 2020; Hendrycks et al., 2019) has shown that self-supervision tasks perform worse when the model is uncertain. Thus, we propose to use the equivariance of the output $\hat y = C' \circ F(x)$ for anomaly detection. We calculate the variance of the output under transformations $g_i$:

$\sum_i \big\| g_i^{-1} \circ C' \circ F \circ g_i(x) - C' \circ F(x) \big\|^2,$  (16)

where we use $C' \circ F(x)$ as the surrogate mean prediction. A larger variance indicates less equivariance and therefore a higher probability that x is an out-of-sample data point (see details in Appendix A.2.2).

A.2.2. Theoretical Results for Anomaly Detection with Multiple Constraints

Below we provide a theoretical analysis of why the equivariance loss can be used for anomaly detection. For each pixel in an image, we denote by X and Y the input pixel and target label.
A.2.2. THEORETICAL RESULTS FOR ANOMALY DETECTION WITH MULTIPLE CONSTRAINTS

Below we provide a theoretical analysis of why the equivariance loss can be used for anomaly detection. For each pixel in an image, we denote by $X$ and $Y$ the input pixel and target label. We use $Z_0 = \tilde{F}(X)$ and $Z = \tilde{g}^{-1} \circ \tilde{F} \circ \tilde{g}(X)$ to denote the model predictions for the original and transformed input, respectively, where $\tilde{g}$ and $\tilde{F}$ are the pixel-level operations associated with $g$ and $F$. Correspondingly, $e = |Z_0 - Y|$ is the error of the model's prediction for the pixel. Note that $\tilde{g}(\cdot)$ and $\tilde{g}^{-1}(\cdot)$ are equivariant transformations. There can be multiple equivariant transformations $\tilde{g}_i(\cdot)$ for the same input pixel $X$, leading to different model predictions. For the same input pixel $X$, we denote by $\mu(X) = \mathbb{E}_{\tilde{g}}[Z \mid X]$ and $\sigma(X)^2 = \mathbb{V}_{\tilde{g}}[Z \mid X]$ the mean and variance of the model predictions over different equivariant transformations. Here, $\sigma(X)$ measures the sensitivity of the model to the input pixel $X$; below we use the shorthand $\sigma$ for $\sigma(X)$ when the context is clear. Following (Mi et al., 2022), we now introduce our model-agnostic assumptions.

Assumption A.3 (Heterogeneous Perturbation). $\epsilon_1 = \frac{Z_0 - \mu(X)}{\sigma(X)} \sim N(0, 1)$. That is, the model prediction given the original input $X$ behaves like a random Gaussian draw from the model predictions produced by different equivariant transformations.

Assumption A.4 (Random Bias). $\epsilon_2 = Y - \mu(X) \sim N(0, B^2)$. That is, the bias of the model prediction behaves like Gaussian noise with bounded variance $B^2$.

Assuming each image contains $n$ input pixels $\{X_i\}_{i=1}^n$, we have the corresponding target labels $\{Y_i\}_{i=1}^n$, errors $\{e_i\}_{i=1}^n$, and sensitivities $\{\sigma_i\}_{i=1}^n$. We denote by $\bar{e} = \frac{1}{n}\sum_i e_i$ the average error of an image. As is usually the case, we further assume the errors are bounded, i.e., $a \le e_i \le b$. Our goal is to bound the average pixel error $\bar{e}$ for an image using the image's computed uncertainty (sensitivity) scores $\{\sigma_i\}_{i=1}^n$. With the assumptions above, we have:

Theorem A.5 (Estimator for $\bar{e}$). With probability at least $1 - \delta$, one can estimate the average error $\bar{e}$ for an image using $\sqrt{\frac{2}{\pi}}\,\mathbb{E}[\sigma_B]$ with the following guarantee:

$\left|\bar{e} - \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]\right| < \frac{b - a}{\sqrt{n}}\sqrt{\log\tfrac{1}{\delta}}$,

where $\sigma_B = \sqrt{\sigma^2 + B^2}$ is the smoothed version of the uncertainty (sensitivity) $\sigma$ and $B$ is the constant from Assumption A.4.

Figure 5. Using equivariance can detect both dataset shift and corruption shift.

Table 9. AUROC (multiplied by 100) of anomaly detection on corrupted images. Our equivariance method achieves better detection performance across 15 types of corruption.

Cityscapes
Model                         Gauss  Shot  Impul  Defoc  Glass  Motion  Zoom  Snow  Frost  Fog  Bright  Cont  Elast  Pixel  JPEG
Rot (Hendrycks et al., 2019)     57    54     49     43     55      35    44    39     64   51      46    52     42     44    59
CSI (Tack et al., 2020)          67    67     62     65     62      57    64    45     55   63      63    65     55     53    54
Inv                              99    99     99    100     94      87    80    86     95   93      86    98     52     98   100
Ours                            100   100    100    100     99      98    99   100    100   98      94   100     77     99   100

PASCAL VOC
Rot (Hendrycks et al., 2019)     37    39     39     55     53      54    54    43     49   55      51    55     50     52    50
CSI (Tack et al., 2020)          49    51     50     55     55      58    54    61     58   58      52    72     53     54    54
Inv                              99    99     99     66     21      36    37    70     74   68      54    66     45     35    40
Ours                             98    98     98     96     95      91    93    85     86   81      60    92     70     85    75

MS-COCO
Rot (Hendrycks et al., 2019)     95    95     95     93     93      93    93    93     94   93      93    93     93     93    92
CSI (Tack et al., 2020)          88    89     86     76     79      77    75    42     47   44      77    22     84     82    85
Inv                              98    98     98     96     97      96    96    98     98   97      98    98     97     97    97
Ours                             98    98     98     98     98      98    98    98     98   98      98    98     98     98    98
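As a quick illustration of Theorem A.5 (not used anywhere in the analysis), the following simulation draws per-pixel errors under Assumptions A.3 and A.4 with made-up sensitivities $\sigma_i$ and bias scale $B$, and compares the empirical average error with the estimate $\sqrt{2/\pi}\,\mathbb{E}[\sigma_B]$.

```python
# A small Monte-Carlo illustration of Theorem A.5 under Assumptions A.3 and A.4.
# The sensitivities sigma_i and the bias scale B below are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                   # number of pixels in the image
B = 0.2                                       # bias scale from Assumption A.4
sigma = rng.uniform(0.05, 0.5, size=n)        # per-pixel sensitivities sigma_i (assumed)

eps1 = rng.standard_normal(n)                 # Assumption A.3: (Z0 - mu) / sigma ~ N(0, 1)
eps2 = rng.normal(0.0, B, size=n)             # Assumption A.4: Y - mu ~ N(0, B^2)
errors = np.abs(sigma * eps1 - eps2)          # e_i = |Z0_i - Y_i|

empirical_avg_error = errors.mean()                                # e_bar
estimate = np.sqrt(2 / np.pi) * np.mean(np.sqrt(sigma**2 + B**2))  # sqrt(2/pi) * E[sigma_B]
print(f"average error {empirical_avg_error:.4f}  vs  estimate {estimate:.4f}")
```

The two printed numbers should closely agree, consistent with the $O(1/\sqrt{n})$ guarantee in the theorem.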
Estimated Average Error for Anomaly Detection. We can see that $\sqrt{\frac{2}{\pi}}\,\mathbb{E}[\sigma_B]$ is a good estimate of the average error of an image, and this estimate becomes more accurate as $n$ grows. Therefore, it can be used directly for anomaly detection: a larger $\sqrt{\frac{2}{\pi}}\,\mathbb{E}[\sigma_B]$ indicates a potentially larger error, meaning that the image is more likely to be an anomaly. Note that the expectation $\mathbb{E}[\sigma_B]$ is over the space of pixels in all images governed by the assumptions above, not over the pixels of a specific image. In practice, we estimate $\mathbb{E}[\sigma_B]$ by averaging the pixel-level sensitivities in an image, i.e., $\frac{1}{n}\sum_i \sqrt{\sigma_i^2 + B^2}$, leading to Eqn. 16.

Extension to the Multivariate Case. Theorem A.6 assumes one scalar output for each pixel in an image; this directly applies to dense regression tasks, e.g., depth estimation. For dense classification tasks, e.g., segmentation, the label for each pixel is represented by a one-hot vector. Fortunately, Theorem A.6 can be naturally extended to the multivariate case, and therefore works for both regression and classification tasks. Note that in classification tasks, $\mu(X)$ and $\sigma(X)$ are both real-valued vectors whose entries sum up to 1, while $Y$ and $Z_0$ are both one-hot vectors (vectors with one entry equal to 1 and the others equal to 0); therefore the two mild assumptions above remain reasonable.

Table 10. AUROC for out-of-distribution detection. Rows indicate the source dataset the model was trained on; columns indicate the dataset from which the OOD samples are drawn.

                Rotation        Contrast        Invariance      Equivariance
Model         CI   VO   CO    CI   VO   CO    CI   VO   CO    CI   VO   CO
CI             -   56   51     -   72   52     -   93   91     -   98   91
VO            67    -   61    54    -   85    95    -   91    71    -   88
CO            72   93    -    69   83    -    51   96    -    55   99    -

A.2.3. ROBUSTNESS ON ANOMALY DETECTION

Dataset and Tasks. We conduct anomaly detection experiments with 15 common corruptions (Hendrycks & Dietterich, 2019) on Cityscapes, PASCAL VOC, and MS-COCO. We also study whether equivariance can detect examples drawn from a different dataset.

Baselines. CSI (Tack et al., 2020) uses a contrastive loss as the indicator for novelty detection. Hendrycks et al. (2019) (Rot) use the performance of a rotation-prediction task for detection. Invariance (Inv) uses the consistency between different views of the same image for anomaly detection; it follows the same setup as our equivariance method except that it omits the reverse transformation $g^{-1}$ in feature space.

Results. We show a visualization of the task in Figure 5. Table 9 shows the detection performance on corrupted images. Our approach generally improves the AUROC score over the baselines, achieving up to 15 AUROC points of improvement, which demonstrates the corruption-detection ability of our approach. We show results on detecting dataset shift in Table 10, where we denote Cityscapes, PASCAL VOC, and COCO as CI, VO, and CO, respectively. Each row indicates the source dataset that the model was trained on, and each column is the dataset from which the OOD samples are drawn. Our method generally achieves better out-of-distribution detection performance than the existing approaches.
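For reference, AUROC numbers of the kind reported in Tables 9 and 10 can be computed by treating the per-image equivariance score as an anomaly score; the sketch below uses synthetic score values (made up for illustration) rather than our actual measurements.

```python
# Sketch of how AUROC can be computed from per-image equivariance scores;
# the score values below are synthetic placeholders, not our measurements.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores_in = rng.normal(0.10, 0.03, size=500)   # equivariance scores on clean, in-distribution images
scores_out = rng.normal(0.30, 0.10, size=500)  # scores on corrupted / out-of-distribution images

# Label 1 = anomaly; a higher score (less equivariant) should rank anomalies higher.
labels = np.concatenate([np.zeros_like(scores_in), np.ones_like(scores_out)])
scores = np.concatenate([scores_in, scores_out])
print(f"AUROC = {100 * roc_auc_score(labels, scores):.1f}")
```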
A.2.4. THEORETICAL RESULTS FOR ANOMALY DETECTION

In the main paper, we provide a theoretical analysis of why the equivariance loss can be used for anomaly detection. We now show the detailed proof of our Theorem 1.

Theorem A.6 (Estimator for $\bar{e}$). With probability at least $1 - \delta$, one can estimate the average error $\bar{e}$ for an image using $\sqrt{\frac{2}{\pi}}\,\mathbb{E}[\sigma_B]$ with the following guarantee:

$\left|\bar{e} - \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]\right| < \frac{b - a}{\sqrt{n}}\sqrt{\log\tfrac{1}{\delta}}$.

Proof. By the law of total expectation, we have

$\mathbb{E}[e] = \mathbb{E}_\sigma \mathbb{E}[e \mid \sigma] = \mathbb{E}_\sigma \mathbb{E}[\,|\sigma\epsilon_1 - \epsilon_2| \mid \sigma\,] = \mathbb{E}_\sigma \mathbb{E}[\,|N(0, \sigma^2 + B^2)| \mid \sigma\,] = \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}_\sigma\sqrt{\sigma^2 + B^2} = \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]$.

Defining the total error for an image of $n$ pixels as $S_n = \sum_{i=1}^{n} e_i$ and applying Hoeffding's inequality, we have

$P(|S_n - \mathbb{E}[S_n]| \ge t) \le \exp\!\left(-\tfrac{t^2}{n(b-a)^2}\right)$,  (17)

$\mathbb{E}[S_n] = \sum_{i=1}^{n} \mathbb{E}[e_i] = n\sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]$.  (18)

Combining Eqn. 18 and Eqn. 17, we have

$P\!\left(\left|\bar{e} - \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]\right| \ge \tfrac{t}{n}\right) \le \exp\!\left(-\tfrac{t^2}{n(b-a)^2}\right)$,

where $\bar{e} = S_n / n$ is the average error of an image. Setting $\delta = \exp\!\left(-\tfrac{t^2}{n(b-a)^2}\right)$, we then have that with probability at least $1 - \delta$,

$\left|\bar{e} - \sqrt{\tfrac{2}{\pi}}\,\mathbb{E}[\sigma_B]\right| < \frac{b - a}{\sqrt{n}}\sqrt{\log\tfrac{1}{\delta}}$.

Table 11. Clean accuracy when detecting adversarial examples first. By avoiding test-time optimization on clean examples, we can largely improve clean accuracy.

Method          Vanilla  Random  Rotation  Contrastive  Invariance  Equivariance (Ours)
Cityscapes        58.29   55.33     49.82        50.85       36.94                51.13
PASCAL VOC        69.52   69.24     49.07        68.22       66.58                63.05
COCO              63.02   63.03     61.69        56.76       56.06                58.71
COCO Instance      34.5    33.6      33.3         29.8        26.7                 34.5

A.3. Additional Analysis

A.3.1. PRESERVING CLEAN ACCURACY BY DETECTING FIRST

When deploying our defense, one can further preserve accuracy on clean images with the detect-then-defend procedure described below. Based on our finding that clean and attacked images differ substantially in their average equivariance score (Table 5), we can set a threshold on the equivariance score of a potentially attacked image to decide whether or not to deploy our defense. Experimental results are reported in Table 11. As shown in the table, clean accuracy can be preserved to a large degree without a significant reduction in defense performance.

A.3.2. ABLATION STUDIES ON OPTIMIZER

In the paper, we use the SGLD optimizer, which adds noise during optimization. We compare SGLD against an SGD optimizer without noise for our defense in Table 12. While our method is effective with both optimization algorithms, SGLD achieves higher robustness.

Table 12. Effect of the optimizer.

MS-COCO    Equivariance (SGLD)  Equivariance (SGD)
PGD                      24.51               16.68

A.3.3. RUNTIME ANALYSIS

We report the inference speed of our defense in Table 13. It is worth noting that if we deploy detection first, as described in Section A.3.1, the defense skips the majority of clean images and does not have a large effect on runtime. For attacked images, our algorithm is about 40 times slower due to the test-time optimization. Given that adversarial examples are rare, this delay is reasonable: we can afford to spend more time on hard adversarial examples, as there is no point in making wrong predictions quickly.

Table 13. Runtime analysis. Results are measured on a single A6000 GPU, averaged across 100 examples.

MS-COCO       Vanilla  Equivariance
Runtime (s)     0.046         1.699

B. Visualization

We show additional visualizations of the equivariance of representations when the input suffers from natural corruptions. In Figures 6 and 7, we show random visualizations on Cityscapes and PASCAL VOC.

Figure 6. Examples showing equivariance on a clean image and on corrupted images (brightness, glass blur, and frost corruption) from the Cityscapes dataset. Columns show the input image, the final prediction $F(x)$, and the predictions $g_i^{-1} \circ F \circ g_i(x)$. The clean image is equivariant; images under corruption are not equivariant, allowing us to detect the corruption using the equivariance measurement.

Figure 7. Examples showing equivariance on a clean image and a randomly corrupted image (motion blur) from the COCO dataset. Columns show the input image, the final prediction $F(x)$, and the predictions $g_i^{-1} \circ F \circ g_i(x)$. The clean image is equivariant.
Images under corruption are not equivariant, allowing us to detect the corruption using the equivariance measurement.