# Pre-trained Adversarial Perturbations

Yuanhao Ban¹,², Yinpeng Dong¹,³

¹ Department of Computer Science & Technology, Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University
² Department of Electronic Engineering, Tsinghua University
³ RealAI

banyh19@mails.tsinghua.edu.cn, dongyinpeng@mail.tsinghua.edu.cn

*This work was done when Yuanhao Ban was an intern at RealAI, Inc. Corresponding author: Yinpeng Dong.*

## Abstract

Self-supervised pre-training has drawn increasing attention in recent years due to its superior performance on numerous downstream tasks after fine-tuning. However, it is well known that deep learning models lack robustness to adversarial examples, which can also raise security issues for pre-trained models, despite being less explored. In this paper, we delve into the robustness of pre-trained models by introducing Pre-trained Adversarial Perturbations (PAPs), which are universal perturbations crafted for pre-trained models that remain effective when attacking fine-tuned ones, without any knowledge of the downstream tasks. To this end, we propose a Low-Level Layer Lifting Attack (L4A) method to generate effective PAPs by lifting the neuron activations of low-level layers of the pre-trained models. Equipped with an enhanced noise augmentation strategy, L4A is effective at generating more transferable PAPs against fine-tuned models. Extensive experiments on typical pre-trained vision models and ten downstream tasks demonstrate that our method improves the attack success rate by a large margin compared with state-of-the-art methods.

## 1 Introduction

Large-scale pre-trained models [40, 13] have recently achieved unprecedented success in a variety of fields, e.g., natural language processing [21, 27, 1] and computer vision [2, 16, 17]. A large body of work proposes sophisticated self-supervised learning algorithms, enabling pre-trained models to extract useful knowledge from large-scale unlabeled datasets. The pre-trained models consequently facilitate downstream tasks through transfer learning or fine-tuning [37, 50, 12]. Nowadays, practitioners without sufficient computational resources or training data increasingly fine-tune publicly available pre-trained models on their own datasets. It has therefore become an emerging trend to adopt the pre-training-to-fine-tuning paradigm rather than training from scratch [13].

Despite their excellent performance, deep learning models are incredibly vulnerable to adversarial examples [44, 11], which are generated by adding small, human-imperceptible perturbations to natural examples but can make the target model output erroneous predictions. Adversarial examples also exhibit an intriguing property called transferability [44, 26, 32], meaning that adversarial perturbations generated for one model or one set of images can remain adversarial for others. For example, a universal adversarial perturbation (UAP) [32] can be generated for the entire distribution of data samples, demonstrating excellent cross-data transferability. Other work [26, 7, 47, 8, 34] has revealed that adversarial examples have high cross-model and cross-domain transferability, making black-box attacks practical without any knowledge of the target model or even the training data. However, much less effort has been devoted to exploring the adversarial robustness of pre-trained models. As these models have been broadly studied and deployed in various real-world applications,
it is of significant importance to identify their weaknesses and evaluate their robustness, especially concerning the pre-training-to-fine-tuning procedure.

In this paper, we introduce Pre-trained Adversarial Perturbations (PAPs), a new kind of universal adversarial perturbation designed for pre-trained models. Specifically, a PAP is generated for a pre-trained model to effectively fool any downstream model obtained by fine-tuning the pre-trained one, as illustrated in Fig. 1. It works under a quasi-black-box setting where the downstream task, dataset, and fine-tuned model parameters are all unavailable. This attack setting is well suited to the pre-training-to-fine-tuning procedure, since many pre-trained models are publicly available and the adversary may generate PAPs before the pre-trained model has been fine-tuned. Although many methods [7, 47] have been proposed for improving transferability, they do not consider the specific characteristics of the pre-training-to-fine-tuning procedure, which limits their cross-finetuning transferability in our setting.

Figure 1: A demonstration of pre-trained adversarial perturbations (PAPs). An attacker first downloads pre-trained weights from the Internet and generates a PAP by lifting the neuron activations of low-level layers of the pre-trained model. We adopt a data augmentation technique called uniform Gaussian sampling to improve the transferability of the PAP. When users fine-tune the pre-trained model to complete downstream tasks, the attacker can add the PAP to the inputs of the fine-tuned models to fool them without knowing the specific downstream tasks.

To generate more effective PAPs, we propose a Low-Level Layer Lifting Attack (L4A) method, which aims to lift the feature activations of low-level layers. Motivated by the finding that the lower the level of a model's layer, the less its parameters change during fine-tuning, we generate PAPs that destroy the low-level feature representations of pre-trained models, so that the attacking effects are better preserved after fine-tuning. To further alleviate the overfitting of PAPs to the source domain, we improve L4A with a noise augmentation technique. We conduct extensive experiments on typical pre-trained vision models [2, 17] and ten downstream tasks. The evaluation results demonstrate that our method achieves a higher average attack success rate than the alternative baselines.

## 2 Related work

Self-supervised learning. Self-supervised learning (SSL) enables learning from unlabeled data. To achieve this, early approaches utilize hand-crafted pretext tasks, including colorization [53], rotation prediction [10], position prediction [36], and Selfie [45]. Another approach to SSL is contrastive learning [25, 39, 2, 22], which maps input images to a feature space and minimizes the distance between similar samples while keeping dissimilar ones far away from each other. In particular, a similar sample is obtained by applying appropriate data augmentation to the original one, while augmented versions of different samples are viewed as dissimilar pairs.

Adversarial examples. With knowledge of the structure and parameters of a model, many algorithms [24, 31, 30, 38] successfully fool the target model in a white-box manner.
An intriguing property of adversarial examples is their good transferability [26, 32]. Universal adversarial perturbations [32] demonstrate good cross-data transferability by optimizing over a distribution of data samples. Cross-model transferability has also been extensively studied [7, 47, 8], enabling attacks on black-box models without any knowledge of their internal working mechanisms.

Robustness of the pre-training-to-fine-tuning procedure. Due to the popularity of pre-trained models, many works [43, 49, 4] study the robustness of this setting. Among them, Dong et al. [6] propose a novel adversarial fine-tuning method from an information-theoretic perspective to retain robust features learned by the pre-trained model. Jiang et al. [20] integrate adversarial samples into the pre-training procedure to defend against attacks. Fan et al. [9] adopt ClusterFit [48] to generate pseudo-labeled data and later use it to train the model in a supervised way, which improves the robustness of the pre-trained model. The main difference between our work and theirs is that we consider the problem from an attacker's perspective.

## 3 Methodology

In this section, we first introduce the notations and the problem formulation of Pre-trained Adversarial Perturbations (PAPs). Then, we detail the Low-Level Layer Lifting Attack (L4A) method.

### 3.1 Notations and problem formulation

Let $f_\theta$ denote a pre-trained model for feature extraction with parameters $\theta$. It takes an image $x \in D_p$ as input and outputs a feature vector $v \in \mathcal{X}$, where $D_p$ and $\mathcal{X}$ refer to the pre-training dataset and the feature space, respectively. We denote by $f_\theta^k(x)$ the $k$-th layer's feature map of $f_\theta$ for an input image $x$. In the pre-training-to-fine-tuning paradigm, a user fine-tunes the pre-trained model $f_\theta$ on a new dataset $D_t$ of the downstream task and finally gets a fine-tuned model $f_{\theta'}$ with updated parameters $\theta'$. Then, let $f_{\theta'}(x)$ be the predicted probability distribution of an image $x$ over the classes of $D_t$, and $F_{\theta'}(x) = \arg\max f_{\theta'}(x)$ be the final classification result.

In this paper, we introduce Pre-trained Adversarial Perturbations (PAPs), which are generated for the pre-trained model $f_\theta$ but can effectively fool fine-tuned models $f_{\theta'}$ on downstream tasks. Formally, a PAP is a universal perturbation $\delta$ within a small budget $\epsilon$, crafted using $f_\theta$ and $D_p$, such that $F_{\theta'}(x + \delta) \neq F_{\theta'}(x)$ for most of the instances belonging to the fine-tuning dataset $D_t$. This can be formulated as the following optimization problem:

$$\max_{\delta}\ \mathbb{E}_{x \sim D_t}\big[F_{\theta'}(x) \neq F_{\theta'}(x + \delta)\big], \quad \text{s.t. } \|\delta\|_p \le \epsilon \ \text{ and } \ x + \delta \in [0, 1], \qquad (1)$$

where $\|\cdot\|_p$ denotes the $\ell_p$ norm; we take the $\ell_\infty$ norm in this work.

There exist some works related to universal perturbations, such as the universal adversarial perturbation (UAP) [32] and the fast feature fool (FFF) [33], as detailed below.

UAP: Given a classifier $f$ (with final prediction $F$) and its dataset $D$, UAP tries to generate a perturbation $\delta$ that can fool the model on most of the instances from $D$, which is usually solved by an iterative method. Each time an image $x$ is sampled from the dataset $D$, the attacker computes the minimal perturbation $\zeta$ that sends $x + \delta$ across the decision boundary by Eq. (2) and then adds it to $\delta$:

$$\zeta \leftarrow \arg\min_{r} \|r\|_2, \quad \text{s.t. } F(x + \delta + r) \neq F(x). \qquad (2)$$

FFF: It aims to produce maximal spurious activations at each layer. To achieve this, FFF starts with a random $\delta$ and solves the following problem:

$$\min_{\delta}\ -\log\Big(\prod_{i} \bar{l}_i(\delta)\Big), \quad \text{s.t. } \|\delta\|_p \le \epsilon, \qquad (3)$$

where $\bar{l}_i(\delta)$ is the mean of the output tensor at layer $i$.
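For concreteness, the following is a minimal PyTorch-style sketch (function and parameter names are our own, not taken from the released code) of the generic $\ell_\infty$-bounded universal-perturbation optimization underlying these formulations: FFF and the L4A variants introduced in Section 3.2 correspond to different choices of `loss_fn`, while UAP instead relies on the per-sample update of Eq. (2). The signed-gradient step is one common choice and is an assumption here, not necessarily the authors' exact optimizer.

```python
import torch

def craft_universal_perturbation(loss_fn, data_loader, epsilon=0.05,
                                 step_size=2e-4, num_iters=1000,
                                 image_shape=(3, 224, 224), device="cpu"):
    """Generic l_inf-bounded universal perturbation optimization.

    `loss_fn(x, delta)` returns a scalar surrogate loss to be minimized;
    different universal attacks plug in different surrogate losses here.
    """
    delta = torch.zeros(1, *image_shape, device=device, requires_grad=True)
    data_iter = iter(data_loader)
    for _ in range(num_iters):
        try:
            x, _ = next(data_iter)
        except StopIteration:                      # restart the loader when exhausted
            data_iter = iter(data_loader)
            x, _ = next(data_iter)
        loss = loss_fn(x.to(device), delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= step_size * grad.sign()       # signed-gradient step (one common choice)
            delta.clamp_(-epsilon, epsilon)        # project onto the l_inf ball of Eq. (1)
    return delta.detach()
```

The projection step enforces the $\ell_\infty$ budget of Eq. (1); clipping $x + \delta$ into $[0, 1]$ is handled at evaluation time in this sketch.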
### 3.2 Our design

However, these attacks show limited cross-finetuning transferability in our problem setting because they ignore the fine-tuning procedure. Two challenges degrade their performance.

Fine-tuning Deviation. The parameters of the model can change a lot during fine-tuning. As a result, the generated adversarial samples may perform well in the feature space of the pre-trained model but fail in the fine-tuned ones.

Figure 2: The ordinate represents the Frobenius norm of the difference between the parameters of the fine-tuned model and its corresponding pre-trained model, scaled into the range from 0 to 1 for easy comparison. The abscissa represents the level of the layer. Note that ResNet-50 and ResNet-101 [14] are pre-trained by SimCLRv2 [2], and ViT-16 [46] is pre-trained by MAE [17].

Datasets Deviation. The statistics (i.e., mean and standard deviation) of different datasets can vary a lot. Generating adversarial samples using only the pre-training dataset with its fixed statistics may therefore suffer a performance drop.

To alleviate the negative effect of the above issues, we propose a Low-Level Layer Lifting Attack (L4A) method equipped with a uniform Gaussian sampling strategy.

Low-Level Layer Lifting Attack (L4A). Our method is motivated by the finding in Fig. 2 that the higher the level of a layer, the more its parameters change during fine-tuning. This is also consistent with the knowledge that low-level convolutional layers act as edge detectors that extract low-level features such as edges and textures and carry little high-level semantic information [37, 50]. Since images from different datasets share the same low-level features, the parameters of these layers tend to be preserved during fine-tuning. In contrast, attack algorithms based on high-level layers or on the scores predicted by the model may not transfer well in such a cross-finetuning setting, as the feature spaces of high-level layers are easily distorted during fine-tuning. The basic version of L4A can be formulated as the following problem:

$$\min_{\delta}\ \mathcal{L}_{\mathrm{base}}(f_\theta, x, \delta) = -\,\mathbb{E}_{x \sim D_p}\big[\|f_\theta^k(x + \delta)\|_F^2\big], \qquad (4)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of the input tensor. In our experiments, we find that the lower the layer, the better it performs, so we choose the first layer by default, i.e., $k = 1$. As Eq. (4) is usually a sophisticated non-convex optimization problem, we solve it using stochastic gradient descent. We also find that fusing the adversarial losses of consecutive low-level layers can boost the performance, which gives the L4A_fuse method:

$$\min_{\delta}\ \mathcal{L}_{\mathrm{fuse}}(f_\theta, x, \delta) = -\,\mathbb{E}_{x \sim D_p}\big[\|f_\theta^{k_1}(x + \delta)\|_F^2 + \lambda\, \|f_\theta^{k_2}(x + \delta)\|_F^2\big], \qquad (5)$$

where $f_\theta^{k_1}(x + \delta)$ and $f_\theta^{k_2}(x + \delta)$ refer to the $k_1$-th and $k_2$-th layers' feature maps of $f_\theta$, respectively, and $\lambda$ is a balancing hyperparameter. We set $k_1 = 1$ and $k_2 = 2$ by default.

Figure 3: Dataset statistics.

Uniform Gaussian Sampling. Nowadays, most state-of-the-art networks apply batch normalization [19] to input images for better performance, so the dataset statistics become an essential factor in training. As shown in Fig. 3, the distribution of the downstream datasets can vary significantly compared to that of the pre-training dataset. However, traditional data augmentation techniques [51, 18] are limited to the pre-training domain and cannot alleviate the problem. Thus, we propose sampling Gaussian noises with various means and standard deviations to avoid overfitting.
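A minimal sketch of the lifting losses in Eqs. (4) and (5), together with the uniform Gaussian noise sampling just described, is given below (PyTorch-style; the hook-based implementation and the choice of `model.layer1`/`model.layer2` as the low-level blocks are our assumptions, not the authors' exact block split):

```python
import torch

def sample_uniform_gaussian(batch_size, image_shape=(3, 224, 224),
                            mu_range=(0.4, 0.6), sigma_range=(0.05, 0.1)):
    """Noise images for the UGS variant: the mean and standard deviation are
    themselves drawn from uniform distributions (ranges from Section 4.3.2)."""
    mu = torch.empty(1).uniform_(*mu_range)
    sigma = torch.empty(1).uniform_(*sigma_range)
    return torch.randn(batch_size, *image_shape) * sigma + mu


class LowLevelLiftingLoss:
    """Sketch of the lifting losses of Eqs. (4)-(5): minimize the negative
    squared Frobenius norm of low-level feature maps, i.e. lift their activations.

    `low_level_modules` lists the sub-modules whose outputs are attacked, e.g.
    [model.layer1] for L4A_base or [model.layer1, model.layer2] for L4A_fuse on
    a torchvision ResNet (an assumption about the block split)."""

    def __init__(self, model, low_level_modules, lam=1.0):
        self.model = model.eval()
        self.lam = lam
        self.features = []
        for m in low_level_modules:
            m.register_forward_hook(lambda _mod, _inp, out: self.features.append(out))

    @staticmethod
    def _lift(feat):
        # Per-sample squared Frobenius norm, averaged over the batch.
        return feat.pow(2).flatten(1).sum(dim=1).mean()

    def __call__(self, x, delta):
        self.features.clear()
        self.model(x + delta)                      # hooks collect the feature maps
        weights = [1.0] + [self.lam] * (len(self.features) - 1)
        return -sum(w * self._lift(f) for w, f in zip(weights, self.features))
```

An L4A_UGS step additionally evaluates the same lifting loss on noise images returned by `sample_uniform_gaussian` and adds it with weight $\lambda$, as formalized in Eq. (6) below; such a loss object can be passed as `loss_fn` to the optimization loop sketched in Section 3.1.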
Combining the base loss using the pre-training dataset and the new loss using uniform Gaussian noises gives the L4A_UGS method as follows:

$$\min_{\delta}\ \mathcal{L}_{\mathrm{UGS}}(f_\theta, x, \delta) = -\,\mathbb{E}_{x \sim D_p,\ \mu, \sigma,\ n_0 \sim \mathcal{N}(\mu, \sigma)}\big[\|f_\theta^k(x + \delta)\|_F^2 + \lambda\, \|f_\theta^k(n_0 + \delta)\|_F^2\big], \qquad (6)$$

where $\mu$ and $\sigma$ are drawn from the uniform distributions $U(\mu_l, \mu_h)$ and $U(\sigma_l, \sigma_h)$, respectively, and $\mu_l, \mu_h, \sigma_l, \sigma_h$ are four hyperparameters.

Table 1: The attack success rate (%) of various attack methods against ResNet-101 pre-trained by SimCLRv2. Note that C10 stands for CIFAR10 and C100 stands for CIFAR100.

| ASR | Cars | Pets | Food | DTD | FGVC | CUB | SVHN | C10 | C100 | STL10 | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FFF_no | 43.81 | 38.62 | 49.95 | 63.24 | 85.57 | 48.38 | 12.55 | 8.53 | 77.74 | 57.11 | 48.55 |
| FFF_mean | 33.93 | 31.37 | 41.77 | 52.66 | 78.94 | 45.00 | 14.85 | 14.42 | 72.59 | 56.66 | 44.22 |
| FFF_one | 31.87 | 29.74 | 39.25 | 46.92 | 74.17 | 43.87 | 9.24 | 11.77 | 65.61 | 50.21 | 40.26 |
| DR | 36.28 | 35.54 | 47.43 | 47.45 | 75.00 | 44.15 | 12.05 | 21.35 | 65.39 | 41.65 | 42.63 |
| SSP | 32.89 | 30.50 | 43.12 | 45.85 | 82.57 | 45.55 | 8.69 | 11.66 | 65.80 | 40.91 | 40.75 |
| ASV | 60.75 | 19.84 | 36.33 | 56.22 | 84.16 | 55.82 | 7.11 | 7.29 | 58.10 | 80.89 | 46.64 |
| UAP | 48.70 | 36.55 | 60.80 | 63.40 | 76.06 | 52.64 | 8.46 | 8.53 | 52.35 | 31.15 | 43.86 |
| UAP_EPGD | 94.12 | 66.66 | 61.30 | 72.55 | 70.34 | 82.72 | 13.88 | 61.65 | 20.04 | 50.13 | 59.34 |
| L4A_base | 94.07 | 61.57 | 71.23 | 69.20 | 96.28 | 81.07 | 11.70 | 12.68 | 80.57 | 90.49 | 66.89 |
| L4A_fuse | 90.98 | 88.53 | 80.65 | 74.31 | 93.79 | 91.23 | 11.40 | 17.40 | 80.98 | 89.69 | 67.10 |
| L4A_UGS | 94.24 | 94.99 | 78.28 | 77.23 | 92.92 | 91.77 | 11.40 | 14.60 | 76.50 | 90.05 | 72.20 |

## 4 Experiments

We provide some experimental results in this section. More results can be found in the Appendix. Our code is publicly available at https://github.com/banyuanhao/PAP.

### 4.1 Settings

Pre-training methods. SimCLR [2, 3] uses the ResNet [14] backbone and pre-trains the model by contrastive learning. We download pre-trained parameters of ResNet-50 and ResNet-101¹ to evaluate the generalization ability of our algorithm across different architectures. We also adopt MoCo [15] with a ResNet-50 backbone². Besides convolutional neural networks, transformers [46] attract much attention nowadays for their competitive performance. Based on transformers and masked image modeling, MAE [17] has become a good alternative for pre-training. We adopt the pre-trained ViT-Base-16 model³. Moreover, vision-language pre-trained models are gaining popularity these days, so we also choose CLIP [41]⁴ for our study. We report the results of SimCLR and MAE in Section 4.2. More results on CLIP and MoCo can be found in Appendix A.1.

Datasets and Pre-processing. We adopt the ILSVRC 2012 dataset [42] to generate PAPs, which is also the dataset used to pre-train the models. We mainly evaluate PAPs on image classification tasks, following the settings of SimCLRv2. Ten fine-grained and coarse-grained datasets are used to test the cross-finetuning transferability of the generated PAPs. We load these datasets from torchvision (details in Appendix D). Before feeding the images to the model, we resize them to 256×256 and then center-crop them to 224×224.

Compared methods. We choose UAP [32] to test whether image-agnostic attacks also bear good cross-finetuning transferability. Since UAP needs the final classification predictions of the inputs, we fit a linear head on the pre-trained feature extractor. Furthermore, by integrating a momentum term into the iterative method, UAP_EPGD [5] is believed to enhance cross-model transferability. Thus, we adopt UAP_EPGD to study the connection between cross-model and cross-finetuning transferability.
As our algorithm is based on the feature level, other feature attacks (including FFF [33], ASV [23], DR [28], and SSP [35]) are chosen for comparison.

¹ https://github.com/google-research/simclr
² https://dl.fbaipublicfiles.com/moco/
³ https://github.com/facebookresearch/mae
⁴ https://github.com/openai/CLIP

Table 2: The attack success rate (%) of various attack methods against ResNet-50 pre-trained by SimCLRv2. Note that C10 stands for CIFAR10, and C100 stands for CIFAR100.

| ASR | Cars | Pets | Food | DTD | FGVC | CUB | SVHN | C10 | C100 | STL10 | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FFF_no | 26.91 | 30.83 | 43.28 | 48.99 | 41.30 | 38.23 | 79.00 | 68.50 | 44.67 | 16.95 | 43.86 |
| FFF_mean | 36.75 | 33.88 | 45.26 | 50.15 | 53.13 | 77.22 | 52.02 | 82.41 | 68.10 | 22.11 | 52.10 |
| FFF_one | 37.88 | 35.30 | 52.79 | 59.52 | 59.62 | 57.04 | 80.33 | 75.40 | 53.58 | 18.31 | 52.98 |
| DR | 38.64 | 34.42 | 50.04 | 45.53 | 39.80 | 75.67 | 47.88 | 76.05 | 60.57 | 13.98 | 48.26 |
| SSP | 41.70 | 43.94 | 50.83 | 48.78 | 47.67 | 82.39 | 48.38 | 86.95 | 66.23 | 19.56 | 53.64 |
| ASV | 74.47 | 36.93 | 45.85 | 73.51 | 64.89 | 92.29 | 73.16 | 45.14 | 53.60 | 22.02 | 58.19 |
| UAP | 44.86 | 46.47 | 64.67 | 65.53 | 49.63 | 82.32 | 52.00 | 79.63 | 46.46 | 19.99 | 55.16 |
| UAP_EPGD | 66.29 | 66.58 | 81.11 | 69.52 | 87.91 | 59.07 | 69.16 | 87.84 | 68.26 | 37.12 | 69.28 |
| L4A_base | 94.86 | 56.30 | 61.31 | 75.37 | 67.61 | 94.87 | 81.45 | 68.25 | 77.04 | 34.56 | 66.89 |
| L4A_fuse | 96.00 | 59.80 | 65.00 | 77.93 | 69.39 | 95.02 | 85.05 | 64.41 | 76.29 | 37.54 | 72.64 |
| L4A_UGS | 96.13 | 79.15 | 74.87 | 82.18 | 78.73 | 94.45 | 95.29 | 55.03 | 77.10 | 45.09 | 77.80 |

Table 3: The attack success rate (%) of various attack methods against ViT-16 pre-trained by MAE. Note that C10 stands for CIFAR10 and C100 stands for CIFAR100.

| ASR | Cars | Pets | Food | DTD | FGVC | CUB | SVHN | C10 | C100 | STL10 | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FFF_no | 64.31 | 88.21 | 95.04 | 88.18 | 81.91 | 92.94 | 76.10 | 49.48 | 79.83 | 60.91 | 77.69 |
| FFF_mean | 40.39 | 67.54 | 54.10 | 61.38 | 71.47 | 73.39 | 92.96 | 86.88 | 94.55 | 67.90 | 71.06 |
| FFF_one | 48.36 | 77.89 | 60.06 | 64.04 | 75.67 | 74.09 | 92.33 | 86.13 | 94.48 | 70.40 | 74.35 |
| DR | 37.02 | 23.84 | 59.54 | 44.73 | 28.01 | 10.41 | 14.30 | 16.66 | 14.12 | 21.54 | 27.02 |
| SSP | 44.15 | 73.31 | 85.42 | 72.82 | 52.57 | 63.10 | 52.45 | 27.94 | 25.32 | 36.90 | 53.40 |
| ASV | 38.46 | 10.17 | 37.49 | 48.31 | 29.14 | 4.97 | 8.41 | 17.04 | 11.34 | 21.14 | 22.64 |
| UAP | 62.71 | 58.90 | 89.90 | 74.92 | 44.69 | 39.56 | 47.65 | 47.77 | 33.80 | 51.70 | 55.16 |
| UAP_EPGD | 63.67 | 73.09 | 96.22 | 76.69 | 57.78 | 73.37 | 79.84 | 45.89 | 47.21 | 55.79 | 66.95 |
| L4A_base | 87.66 | 89.98 | 98.96 | 99.10 | 99.33 | 84.06 | 86.99 | 98.62 | 97.08 | 98.25 | 94.00 |
| L4A_fuse | 83.24 | 89.57 | 98.87 | 98.77 | 98.36 | 93.60 | 89.85 | 98.64 | 95.72 | 97.53 | 94.42 |
| L4A_UGS | 96.49 | 90.00 | 98.97 | 98.89 | 99.48 | 84.01 | 89.56 | 99.43 | 97.27 | 98.96 | 95.30 |

Default settings and Metric. Unless otherwise specified, we choose a batch size of 16 and a step size of 0.0002. All perturbations are constrained within a bound of 0.05 under the $\ell_\infty$ norm. We evaluate the perturbations at 1,000, 5,000, 30,000, and 60,000 iterations and report the best performance. We report results in terms of the attack success rate (ASR), i.e., the classification error rate on the whole test set after adding the perturbation to the legitimate images.

### 4.2 Main results

We craft pre-trained adversarial perturbations (PAPs) for three pre-trained models (i.e., ResNet-50 pre-trained by SimCLRv2, ResNet-101 pre-trained by SimCLRv2, and ViT-16 pre-trained by MAE) and evaluate the attack success rates on ten downstream tasks. The results are shown in Table 1, Table 2, and Table 3, respectively. Note that the first seven datasets are fine-grained, while the last three are coarse-grained. We mark the best results for each dataset in bold, and the best baseline in blue. We highlight the results of L4A_UGS in red to emphasize that the L4A attack equipped with uniform Gaussian sampling shows great cross-finetuning transferability and performs best.
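For reference, the ASR numbers in Tables 1–3 correspond to an evaluation of the following form (a minimal sketch under the pre-processing described above; dataset and model loading are omitted and the names are illustrative):

```python
import torch
from torchvision import transforms

# Pre-processing used for evaluation: resize to 256x256, then center-crop to 224x224
# (applied when constructing `test_loader`).
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

@torch.no_grad()
def attack_success_rate(finetuned_model, test_loader, delta, device="cpu"):
    """ASR: classification error rate on the test set after adding the PAP `delta`."""
    finetuned_model.eval()
    wrong, total = 0, 0
    for x, y in test_loader:
        x, y = x.to(device), y.to(device)
        adv = (x + delta).clamp(0, 1)              # keep x + delta in [0, 1] as in Eq. (1)
        pred = finetuned_model(adv).argmax(dim=1)
        wrong += (pred != y).sum().item()
        total += y.numel()
    return 100.0 * wrong / total
```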
A quick glimpse shows that our proposed methods outperform all the baselines by a large margin. For example, as can be seen from Table 1, when the target model is ResNet-101 pre-trained by SimCLRv2, the best competitor UAP_EPGD achieves an average attack success rate of 59.34%, while the vanilla L4A_base lifts it to 66.89%, and the UGS technique further boosts the performance to 72.20%. Moreover, the STL10 dataset is the hardest for PAPs to transfer to among these tasks. However, L4A_UGS can significantly improve the cross-finetuning transferability, achieving attack success rates of 90.05% and 98.96% on STL10 for ResNet-101 and ViT-16, respectively. Another intriguing finding is that ViT-16 with a transformer backbone shows severe vulnerability to PAPs. Although it performs best on legitimate samples, it bears an attack success rate of 95.30% under the L4A_UGS attack, close to random outputs. These results reveal the serious security problem of the pre-training-to-fine-tuning paradigm and demonstrate the effectiveness of our method in this problem setting.

Figure 4: The attack success rates (%) of L4A_base when using different layers. We show the results on the pre-training domain (Left) and the fine-tuning domains (Right).

Figure 5: The attack success rate (%) for different values of the hyperparameter λ in L4A_UGS on (a) ResNet-101 and (b) ResNet-50.

### 4.3 Ablation studies

#### 4.3.1 Effect of the attacking layer

We analyze the influence of attacking different intermediate layers of the networks on the performance of our proposed L4A_base in the pre-training domain (ImageNet) and the fine-tuning domains (ten downstream tasks). To this end, we divide ResNet-50, ResNet-101, and ViT into five blocks (details in Appendix F.1) and run our algorithm on them. Note that for the fine-tuning domains, the average attack success rates over the ten datasets are reported. As shown in Fig. 4, the lower the level we choose to attack, the better our algorithm performs in the fine-tuning domains. Moreover, for the pre-training domain, attacking the middle layers of the networks results in a higher attack success rate compared to the top and bottom layers, which is also reported in existing works [29, 35, 52]. These results reveal an intrinsic property of the pre-training-to-fine-tuning paradigm: as the lower-level layers change less during the fine-tuning procedure, attacking a low-level layer rather than the middle-level layers becomes more effective when the perturbations generated in the pre-training domain are transferred to fine-tuned models.

#### 4.3.2 Effect of uniform Gaussian sampling

We set $\mu_l, \mu_h, \sigma_l, \sigma_h$ to 0.4, 0.6, 0.05, and 0.1 for all three models, as these values perform best. To study the effect of the hyperparameter $\lambda$ in Eq. (6), which balances the base loss and the UGS loss, we select its values from a grid of 8 logarithmically spaced points between $10^{-2}$ and $10^{2}$. The results are shown in Fig. 5. As shown in Fig. 5(a), the best attack success rate is achieved when $\lambda = 10^{-0.5}$ on ResNet-101, boosting the performance by 1.52% compared to only using the Gaussian noises.

Furthermore, we study whether adopting the fixed statistics of the pre-training dataset (i.e., the mean and standard deviation of ImageNet) can help.

Table 4: Fixed dataset statistics (attack success rate, %).

| Model | R101 | R50 | MAE |
| --- | --- | --- | --- |
| None | 66.95 | 71.16 | 94.00 |
| ImageNet | 68.39 | 69.50 | 95.30 |
| Uniform | 72.20 | 77.80 | 95.30 |
We report the attack success rates (%) in Table 4, where None refers to using no data augmentation, ImageNet adopts the mean and standard deviation of ImageNet, and Uniform samples a pair of mean and standard deviation from the uniform distribution. As can be seen from the table, Uniform outperforms None by 4.39% on average, while ImageNet does not help, which means that our proposed UGS helps to avoid overfitting to the pre-training domain.

### 4.4 Visualization of feature maps

Figure 6: Visualization of feature maps.

We show the feature maps before and after the L4A_fuse attack in Fig. 6. The Left column shows the inputs of the model, while the Middle and the Right show the feature maps of the pre-trained model and the fine-tuned one, respectively. The Upper row represents the pipeline of a clean input from Cars, and the Lower row shows that of its adversarial counterpart. We can see from the upper row that fine-tuning the model makes it sensitive to the defining features of the specific domain, such as tires and lamps. However, adding an adversarial perturbation to the image significantly lifts all the activations and eventually masks the useful features. Moreover, the effect of our attack is well preserved during fine-tuning and fools the fine-tuned model into misclassification, stressing the safety problem of pre-trained models.

### 4.5 Trade-off between clean accuracy and robustness

Figure 7: Model accuracy (%) on the Pets and STL10 datasets under clean inputs, PAP, and FGSM attacks.

We study the effect of the number of fine-tuning epochs on the performance of our attack. To this end, we fine-tune the model until it reports the best result on the testing dataset, and we plot the clean accuracy and the accuracy against FGSM and PAPs on Pets and STL10 in Fig. 7. The figure shows that the clean accuracy and the robustness of the fine-tuned model against PAPs are at odds. In Fig. 7(b), the model shows the best robustness at epoch 5 on STL10, achieving accuracies of 95.38% and 55.02% on clean and adversarial samples, respectively. However, the model does not converge until epoch 19. Though this process boosts the clean accuracy by 1.63%, it suffers a significant drop in robustness, as the accuracy on adversarial samples is lowered to 28.96%. Such findings reveal the safety problem of the pre-training-to-fine-tuning paradigm.

## 5 Discussion

In this section, we first introduce the gradient alignment and then use it to explain the effectiveness of our method. In particular, we show why our algorithms fall behind UAPs in the pre-training domain but have better cross-finetuning transferability when evaluated on the downstream tasks.

### 5.1 Preliminaries

Gradient sequences. Given a network $f_{\theta_0}$ and a sample sequence $\{x_1, x_2, x_3, \ldots, x_N\}$ drawn from the dataset $D$, let $\nabla_{\theta_0, D} = \{\nabla_{\theta_0, x_1}, \nabla_{\theta_0, x_2}, \nabla_{\theta_0, x_3}, \ldots, \nabla_{\theta_0, x_N}\}$ be the sequence of gradients obtained when generating adversarial samples by the following equation:

$$\nabla_{\theta_0, x_i} = \nabla_{\delta} \mathcal{L}(f_{\theta_0}, x_i, \delta_i), \quad \text{with } \delta_i = \mathcal{P}_{\|\cdot\|, \epsilon}\big(\delta_{i-1} + \nabla_{\theta_0, x_{i-1}}\big), \qquad (7)$$

where $\mathcal{L}$ denotes the loss function of iterative attack methods such as UAP, FFF, and L4A, and $\mathcal{P}_{\|\cdot\|, \epsilon}$ denotes the projection onto the $\epsilon$-ball of the norm $\|\cdot\|$.

Definition 1 (Gradient alignment). Given a dataset $D$ and a model $f_{\theta_0}$, the gradient alignment GA of an attack algorithm is defined as the expectation of the cosine similarity of $\nabla_{\theta_0, x_1}$ and $\nabla_{\theta_0, x_2}$, which can be formulated as

$$\mathrm{GA} = \mathbb{E}_{x_1 \sim D,\ x_2 \sim D}\left[\frac{\langle \nabla_{\theta_0, x_1}, \nabla_{\theta_0, x_2} \rangle}{\|\nabla_{\theta_0, x_1}\|_2\, \|\nabla_{\theta_0, x_2}\|_2}\right], \qquad (8)$$

where $\nabla_{\theta_0, x_1}$ and $\nabla_{\theta_0, x_2}$ are two consecutive elements obtained by Eq. (7).
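The gradient alignment can be estimated by Monte-Carlo sampling along the attack trajectory; a minimal sketch (reusing the `loss_fn` interface from the earlier sketches and assuming a signed-gradient instantiation of the update in Eq. (7)) is:

```python
import torch
import torch.nn.functional as F

def gradient_alignment(loss_fn, data_loader, epsilon=0.05, step_size=2e-4,
                       num_pairs=100, image_shape=(3, 224, 224), device="cpu"):
    """Monte-Carlo estimate of the gradient alignment of Eq. (8): the expected
    cosine similarity between consecutive gradients of the iteration in Eq. (7).
    Assumes the loader yields at least `num_pairs + 1` batches."""
    delta = torch.zeros(1, *image_shape, device=device, requires_grad=True)
    data_iter = iter(data_loader)
    sims, prev_grad = [], None
    for _ in range(num_pairs + 1):
        x, _ = next(data_iter)
        loss = loss_fn(x.to(device), delta)
        grad, = torch.autograd.grad(loss, delta)
        if prev_grad is not None:
            sims.append(F.cosine_similarity(grad.flatten(),
                                            prev_grad.flatten(), dim=0).item())
        prev_grad = grad
        with torch.no_grad():                      # the projected update of Eq. (7)
            delta -= step_size * grad.sign()
            delta.clamp_(-epsilon, epsilon)
    return sum(sims) / len(sims)
```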
The L4A algorithm bears a higher gradient alignment than the baselines (a strict statement and a proof in a weaker form can be found in Appendix B.1). In addition, we provide the results of a simulation experiment to justify this in Table 5. We can see a negative correlation between the gradient alignment and the attack success rate on ImageNet. In contrast, a positive correlation exists between the gradient alignment and the attack success rate in the fine-tuning domain.

Effectiveness of the algorithm. Given a pre-trained model $f_\theta$ and the pre-training dataset $D_p$, the generation of PAPs can be reformulated from Eq. (1) to the following problem:

$$\max_{\delta \in \mathrm{Span}\{\nabla_{\theta, D_p}\}}\ \mathbb{E}_{x \sim D_t}\big[F_{\theta'}(x) \neq F_{\theta'}(x + \delta)\big], \quad \text{s.t. } \|\delta\|_p \le \epsilon \ \text{ and } \ x + \delta \in [0, 1], \qquad (9)$$

where $\mathrm{Span}\{\nabla_{\theta, D_p}\}$ denotes the subspace spanned by the elements of $\nabla_{\theta, D_p}$. The perturbation found by the attack can be viewed as a linear combination of the elements of $\nabla_{\theta, D_p}$, so $\mathrm{Span}\{\nabla_{\theta, D_p}\}$ is the feasible region of the problem. In particular, the feasible region of Eq. (9) is smaller than that of Eq. (1), which reflects that stochastic gradient descent may not converge to the global maximum when it is not included in $\mathrm{Span}\{\nabla_{\theta, D_p}\}$, and the local maximum of Eq. (9) within $\mathrm{Span}\{\nabla_{\theta, D_p}\}$ represents the effectiveness of the algorithm in the fine-tuning domain.

Figure 8: Illustration of the gradient alignments of L4A and UAP in the pre-training and fine-tuning domains; the gradient alignment of L4A is much larger than that of UAP, which is close to 0.

Table 5: Left (GA): gradient alignments; Middle (ImageNet): attack success rates on the pre-training dataset; Right (AVG): average attack success rates on downstream tasks. See more details in Appendix B.2.

| ResNet-101 | GA | ImageNet | AVG |
| --- | --- | --- | --- |
| FFF_no | 0.1221 | 38.64% | 48.55% |
| FFF_mean | 0.0194 | 48.37% | 44.22% |
| FFF_one | 0.1222 | 34.15% | 40.26% |
| DR | 0.0165 | 41.06% | 42.62% |
| UAP | 0.0018 | 88.14% | 43.86% |
| UAP_EPGD | 0.0008 | 94.62% | 59.39% |
| SSP | 0.0274 | 41.81% | 40.75% |
| L4A_base | 0.6125 | 37.44% | 66.95% |

### 5.2 Explanation

We aim to explain why the effectiveness of our algorithm is better in the fine-tuning domain and worse in the pre-training domain, as seen from Table 5. Let $\nabla_{\theta, D_p}$ and $\nabla'_{\theta, D_p}$ be the feasible zones of L4A and UAP obtained by feeding instances from $D_p$ into the pre-trained model $f_\theta$, respectively. Similarly, we can define $\nabla_{\theta, D_t}$ and $\nabla'_{\theta, D_t}$. Meanwhile, denote $\delta^* \in \mathrm{Span}\{\nabla_{\theta, D_p}\}$ and $\delta'^* \in \mathrm{Span}\{\nabla'_{\theta, D_p}\}$ as the maxima in the pre-training domain obtained by L4A and UAP, respectively. An illustration is shown in Fig. 8, supposing there is only one step in the iterative method.

Pre-training domain: The gradients of UAP obtained by Eq. (2) represent the directions to the closest points on the decision boundary in the pre-training domain, so limiting the feasible zone to $\mathrm{Span}\{\nabla'_{\theta, D_p}\}$ does little harm to the performance when evaluated in the pre-training domain. Meanwhile, according to its objective, L4A finds next-best directions that are worse than those of UAP. Thus, $\delta'^*$ performs better than $\delta^*$ in the pre-training domain.

Fine-tuning domain: Since UAP bears a gradient alignment close to 0, the subspace spanned by the tensors in $\nabla'_{\theta, D_p}$ is almost orthogonal to that spanned by the tensors in $\nabla'_{\theta, D_t}$, which represent the best directions that send a sample to the decision boundary in the fine-tuning domain. Thus, limiting the feasible zone of Eq. (1) to $\mathrm{Span}\{\nabla'_{\theta, D_p}\}$ as in Eq. (9) suffers a great drop in ASR when evaluated on the downstream tasks.
However, as shown in Table 5, our algorithm achieves a gradient alignment of up to 0.6125, which means that there is considerable overlap between the next-best feasible region obtained by L4A from $D_p$ in the pre-training domain and that obtained from $D_t$ in the fine-tuning domain. Thus, the performance of the best solution in $\mathrm{Span}\{\nabla_{\theta, D_p}\}$ is close to that of the best solution in $\mathrm{Span}\{\nabla_{\theta, D_t}\}$, which represents the next-best solution in the fine-tuning domain. Finally, $\delta^*$ performs better than $\delta'^*$ in the fine-tuning domain. In conclusion, the high gradient alignment guarantees high cross-finetuning transferability.

## 6 Societal impact

A potential negative societal impact of L4A is that malicious adversaries could use it to cause security or safety issues in real-world applications. As more people turn to pre-trained models because of their excellent performance, fine-tuning pre-trained models provided by cloud servers has become a panacea for deep learning practitioners. In such settings, PAPs become a significant security flaw, as one can easily access the prototype pre-trained models and run attack algorithms on them. Our work appeals to big companies to delve further into the safety problems related to the vulnerability of pre-trained models.

## 7 Conclusion

In this paper, we address the safety problem of pre-trained models. In particular, an attacker can use them to generate so-called pre-trained adversarial perturbations, achieving a high success rate on the fine-tuned models without knowing the victim model or the specific downstream tasks. Considering the inner qualities of the pre-training-to-fine-tuning paradigm, we propose a novel algorithm, L4A, which performs well in such problem settings. A limitation of L4A is that it performs worse than UAPs in the pre-training domain; we hope some upcoming work can fill the gap. Furthermore, L4A only utilizes information from the pre-training domain. When the attacker obtains some information about the downstream tasks, such as several unlabeled instances from the fine-tuning domain, they may be able to enhance PAPs using this knowledge and further exacerbate the situation, which we leave to future work. Thus, we hope our work can draw attention to the safety problem of pre-trained models to help guarantee security.

## Acknowledgements

This work was supported by the National Key Research and Development Program of China (2020AAA0106000, 2020AAA0104304, 2020AAA0106302), NSFC Projects (Nos. 62061136001, 62076145, 62076147, U19B2034, U1811461, U19A2081, 61972224), Beijing NSF Project (No. JQ19016), BNRist (BNR2022RC01006), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. Y. Dong was also supported by the China National Postdoctoral Program for Innovative Talents and the Shuimu Tsinghua Scholar Program.

## References

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), pages 1877–1901, 2020.

[2] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607, 2020.

[3] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. In Advances in Neural Information Processing Systems (NeurIPS), pages 22243–22255, 2020.
[4] Tong Chen and Zhan Ma. Towards robust neural image compression: Adversarial attack and model finetuning. arXiv preprint arXiv:2112.08691, 2021.

[5] Yingpeng Deng and Lina J Karam. Universal adversarial attack via enhanced projected gradient descent. In IEEE International Conference on Image Processing (ICIP), pages 1241–1245, 2020.

[6] Xinshuai Dong, Anh Tuan Luu, Min Lin, Shuicheng Yan, and Hanwang Zhang. How should pre-trained language models be fine-tuned towards adversarial robustness? In Advances in Neural Information Processing Systems (NeurIPS), pages 4356–4369, 2021.

[7] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9185–9193, 2018.

[8] Yinpeng Dong, Tianyu Pang, Hang Su, and Jun Zhu. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4312–4321, 2019.

[9] Lijie Fan, Sijia Liu, Pin-Yu Chen, Gaoyuan Zhang, and Chuang Gan. When does contrastive learning preserve adversarial robustness from pretraining to finetuning? In Advances in Neural Information Processing Systems (NeurIPS), pages 21480–21492, 2021.

[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), 2018.

[11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.

[12] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. SpotTune: Transfer learning through adaptive fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4805–4814, 2019.

[13] Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. Pre-trained models: Past, present and future. AI Open, pages 225–250, 2021.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[15] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.

[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.

[17] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.

[18] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.

[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
[20] Ziyu Jiang, Tianlong Chen, Ting Chen, and Zhangyang Wang. Robust pre-training by adversarial contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 16199–16210, 2020.

[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186, 2019.

[22] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 18661–18673, 2020.

[23] Valentin Khrulkov and Ivan Oseledets. Art of singular vectors and universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8562–8570, 2018.

[24] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pages 99–112. 2018.

[25] Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. Contrastive representation learning: A framework and review. IEEE Access, pages 193907–193934, 2020.

[26] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR), 2017.

[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[28] Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, and Senem Velipasalar. Enhancing cross-task black-box transferability of adversarial examples with dispersion reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 940–949, 2020.

[29] Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, and Senem Velipasalar. Enhancing cross-task black-box transferability of adversarial examples with dispersion reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 940–949, 2020.

[30] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.

[31] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.

[32] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1765–1773, 2017.

[33] Konda Reddy Mopuri, Utsav Garg, and R Venkatesh Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. In British Machine Vision Conference (BMVC), 2017.

[34] Muhammad Muzammal Naseer, Salman H Khan, Muhammad Haris Khan, Fahad Shahbaz Khan, and Fatih Porikli. Cross-domain transferability of adversarial perturbations. In Advances in Neural Information Processing Systems (NeurIPS), pages 12905–12915, 2019.
[35] Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Fatih Porikli. A self-supervised approach for adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 262–271, 2020.

[36] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (ECCV), pages 69–84, 2016.

[37] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1717–1724, 2014.

[38] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, 2016.

[39] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision (ECCV), pages 319–345, 2020.

[40] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1872–1897, 2020.

[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.

[42] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), pages 211–252, 2015.

[43] Ananya B Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M Khapra. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. Transactions of the Association for Computational Linguistics (TACL), pages 810–827, 2020.

[44] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[45] Trieu H Trinh, Minh-Thang Luong, and Quoc V Le. Selfie: Self-supervised pretraining for image embedding. arXiv preprint arXiv:1906.02940, 2019.

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 6000–6010, 2017.

[47] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2730–2739, 2019.

[48] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. ClusterFit: Improving generalization of visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6509–6518, 2020.

[49] Zonghan Yang and Yang Liu. On robust prefix-tuning for text classification. In International Conference on Learning Representations (ICLR), 2022.
[50] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NeurIPS), pages 3320–3328, 2014.

[51] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6023–6032, 2019.

[52] Qilong Zhang, Xiaodan Li, Yuefeng Chen, Jingkuan Song, Lianli Gao, Yuan He, and Hui Xue. Beyond ImageNet attack: Towards crafting adversarial examples for black-box domains. In International Conference on Learning Representations (ICLR), 2021.

[53] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision (ECCV), pages 649–666, 2016.