DeSparsify: Adversarial Attack Against Token Sparsification Mechanisms

Oryan Yehezkel, Alon Zolfi, Amit Baras, Yuval Elovici, Asaf Shabtai
Department of Software and Information Systems Engineering
Ben-Gurion University of the Negev, Israel
{oryanyeh,zolfi,barasa}@post.bgu.ac.il, {elovici,shabtaia}@bgu.ac.il

Abstract

Vision transformers have contributed greatly to advancements in the computer vision domain, demonstrating state-of-the-art performance in diverse tasks (e.g., image classification, object detection). However, their computational requirements grow quadratically with the number of tokens used. Token sparsification mechanisms have been proposed to address this issue. These mechanisms employ an input-dependent strategy, in which uninformative tokens are discarded from the computation pipeline, improving the model's efficiency. However, their dynamism and average-case assumption make them vulnerable to a new threat vector: carefully crafted adversarial examples capable of fooling the sparsification mechanism, resulting in worst-case performance. In this paper, we present DeSparsify, an attack targeting the availability of vision transformers that use token sparsification mechanisms. The attack aims to exhaust the operating system's resources while maintaining its stealthiness. Our evaluation demonstrates the attack's effectiveness on three token sparsification mechanisms and examines the attack's transferability between them and its effect on GPU resources. To mitigate the impact of the attack, we propose various countermeasures. The source code is available online.1

1 Introduction

In the last few years, vision transformers have demonstrated state-of-the-art performance in computer vision tasks, outperforming traditional convolutional neural networks (CNNs) in various tasks such as image classification, object detection, and segmentation [13]. While vision transformers have excellent representational capabilities, the computational demands of their transformer blocks make them unsuitable for deployment on edge devices. These demands mainly arise from the quadratic number of interactions (inter-/intra-calculations) between tokens [13]. Therefore, to reduce the computational requirements, various techniques have been proposed to improve their resource efficiency. Token sparsification (TS), in which tokens are dynamically sampled based on their significance, is a prominent technique used for this purpose. The TS approaches proposed include ATS [5], AdaViT [20], and A-ViT [33], each of which adaptively allocates resources based on the complexity of the input image (i.e., input-dependent inference) by deactivating uninformative tokens, resulting in improved throughput with a slight drop in accuracy. Although TS has proven effective in improving the resource efficiency of vision transformers, its test-time dynamism and average-case performance assumption create a new attack surface for adversaries aiming to compromise model availability. Practical implications include various scenarios, such as attacks on cloud-based IoT applications (e.g., surveillance cameras) and attacks on real-time DNN inference in resource- and time-constrained scenarios (e.g., autonomous vehicles) [10].

*Equal contribution
1 https://github.com/oryany12/DeSparsify-Adversarial-Attack
Given the potential impact of availability-oriented attacks, the machine learning (ML) research community has increasingly focused its attention on adversarial attacks aimed at compromising model availability. Shumailov et al. [27] were at the forefront of research in this emerging domain, introducing sponge examples, which leverage data sparsity to escalate GPU operations, resulting in increased inference times and energy consumption. Exploiting this technique, Cinà et al. [3] deliberately poisoned models with sponge examples in the training phase to induce delays during the inference phase. The post-processing phase of deep neural networks (DNNs), particularly in the context of object detection [26] and LiDAR detection [15], has also been shown to be susceptible to availability-oriented attacks. In addition, dynamic neural networks (DyNNs) [7], which adapt their structures or parameters based on the input during the inference phase, have been found to be vulnerable to adversarial attacks. For example, previous studies demonstrated that layer-skipping and early-exit mechanisms are also vulnerable to malicious inputs that aim to induce worst-case performance [8, 11, 24].

Figure 1: Token depth distribution in terms of transformer blocks for a clean (top) and adversarial (bottom) image for the three TS mechanisms (b)-(d). The colors indicate the maximum depth each token reaches before being discarded. The adversarial image is crafted using the single-image attack variant (Section 4.1), which results in worst-case performance.

In this paper, we introduce the DeSparsify attack, a novel adversarial attack that targets TS mechanisms, exploiting their test-time dynamism to compromise model availability. To perform our attack, we craft adversarial examples using a custom loss function aimed at thwarting the sparsification mechanism, generating adversarial examples that trigger worst-case performance, as shown in Figure 1. To increase the stealthiness of our adversarial examples in scenarios where anomaly detection mechanisms are employed (e.g., monitoring shifts in the predicted distribution), the attack is designed to preserve the model's original classification. The experiments performed in our comprehensive evaluation examine the attack's effect for: (i) different sparsification mechanisms (i.e., ATS, AdaViT, and A-ViT); (ii) different transformer models (i.e., DeiT [30], T2T-ViT [34]); (iii) different attack variants (i.e., single-image, class-universal, and universal); and (iv) the transferability of adversarial examples between TS mechanisms and the effect of ensembles. For example, the results of our attack against ATS show that it can increase the number of floating-point operations by 74%, memory usage by 37%, and energy consumption by 72%.

Our contributions can be summarized as follows:
- To the best of our knowledge, we are the first to identify the dynamism of TS mechanisms as a threat vector and to propose an adversarial attack that compromises the availability of vision transformers while preserving the model's original classification.
- We conduct a comprehensive evaluation of various configurations, examining different TS mechanisms and transformer models, reusable perturbations, transferability, and the use of ensembles.
- We discuss countermeasures that can be employed to mitigate the threat posed by our attack.
2 Background

Vision transformers [33, 30, 34] consist of a backbone network, which is usually a transformer encoder comprised of L blocks, each of which consists of a multi-head self-attention (MSA) layer and a feed-forward network (FFN). We consider a vision transformer f : X → R^M that receives an input sample x ∈ X and outputs M real-valued numbers that represent the model's confidence for each class m ∈ M. The input image x ∈ R^{C×W×H} is first sliced into a sequence of N 2D patches, which are then mapped into patch embeddings using a linear projection. Next, a learnable class token is appended, resulting in a sequence of size N + 1. Finally, positional embeddings are added to the patch embeddings to provide positional information. A single-head attention is computed as follows:

Attn(Q, K, V) = Softmax(QK^T / √d) · V    (1)

where Q, K, V ∈ R^{(N+1)×d} represent the query, key, and value matrices, respectively. These matrices are derived from the output of the previous block of the transformer, denoted as Z_l, where l indicates the l-th block. For the first block, i.e., Z_0, the value corresponds to the flattened patches of the input image mentioned above.
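As an illustration, the single-head attention of Eq. (1) can be sketched in a few lines of PyTorch. The tensor names and dimensions below are illustrative only (they are not taken from any of the evaluated models), and the multi-head split and output projection are omitted.

```python
# Minimal sketch of the single-head attention in Eq. (1); illustrative names only.
import torch
import torch.nn.functional as F

def single_head_attention(Z, W_q, W_k, W_v):
    """Compute Attn(Q, K, V) = Softmax(Q K^T / sqrt(d)) V for one head.

    Z:   (N + 1, d_model) token embeddings from the previous block.
    W_*: (d_model, d) projection matrices for queries, keys, and values.
    """
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v          # each of shape (N + 1, d)
    d = Q.shape[-1]
    A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)    # (N + 1, N + 1) attention weights
    return A @ V, A                              # attended values and the weights A

# Toy usage: 4 patch tokens + 1 class token, embedding dimension 8.
Z0 = torch.randn(5, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out, attn = single_head_attention(Z0, W_q, W_k, W_v)
print(out.shape, attn.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```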
3 Related Work

3.1 Token sparsification (TS)

In this section, we provide an overview of the three TS mechanisms we focus on in this paper.

Adaptive Token Sampling (ATS) [5]. ATS is a differentiable, parameter-free module that adaptively downsamples input tokens. It automatically selects an optimal number of tokens in each stage (transformer block) based on the image content. The input tokens in each block are assigned significance scores by employing the attention weights of the classification token in the self-attention layer. A key advantage of ATS is its ability to be seamlessly integrated into pretrained vision transformers without the need for additional parameter tuning.

AdaViT [20]. AdaViT is an end-to-end framework for vision transformers that adaptively determines the use of tokens, self-attention heads, and transformer blocks based on the input image. A lightweight subnetwork (i.e., a decision network) is inserted into each transformer block of the backbone, which learns to make binary decisions on the use of tokens, self-attention heads, and the block's components (MSA and FFN). The decision networks are jointly optimized with the transformer backbone to reduce the computational cost while preserving classification accuracy.

A-ViT [33]. A-ViT is an input-dependent, spatially adaptive inference mechanism that halts the computation of different tokens at different depths, reserving computation for discriminative tokens in a dynamic manner. The halting score-based module is incorporated into an existing vision transformer block by allocating a single neuron in the MLP layer to perform this task, i.e., it does not require any additional parameters or computation for the halting mechanism.

3.2 Availability-oriented attacks

Confidentiality, integrity, and availability, collectively known as the CIA triad, serve as a foundational model for the design of security systems [25]. In the DNN realm, a significant amount of research has been devoted to adversarial attacks, particularly those focused on compromising integrity [29, 6, 21, 22, 28, 32, 36, 37] and confidentiality [1, 12]. However, adversarial attacks targeting the availability of these models have only recently begun to receive attention from the ML research community. Pioneers in the field of availability-oriented adversarial attacks, Shumailov et al. [27] introduced sponge examples, an attack designed to compromise the efficiency of vision and NLP models. The authors presented two attacks exploiting: (i) data sparsity - the assumption of data sparsity, which enables GPU acceleration by employing zero-skipping multiplications; and (ii) computation dimensions - the internal representation size of inputs and outputs (e.g., in transformers, mapping words to tokens). Both attacks aim to maximize GPU operations and memory accesses, resulting in increased inference time and energy consumption. Taking advantage of the data sparsity attack vector, Cinà et al. [3] proposed sponge poisoning, which aims to compromise a model's performance by targeting it with a sponge attack during the training phase. Boucher et al. [2] introduced a notable extension of the sponge attack (exploiting the computation dimension vulnerability), presenting an adversarial strategy for NLP models. This method employs invisible characters and homoglyphs to significantly manipulate the model's output while remaining imperceptible to human detection. Designed to enhance computational efficiency by adapting to input data during runtime, DyNNs [7] have also been shown to be susceptible to adversarial attacks. For example, Haque et al. [8] targeted DNNs employing layer-skipping mechanisms, forcing malicious inputs to go through all of the layers. Hong et al. [11] fooled DNNs that employ an early-exit strategy (dynamic depth), causing malicious inputs to consistently bypass early exits, thereby inducing worst-case performance. In a unified approach, Pan et al. [24] proposed a method for generating adversarial samples that are capable of attacking both dynamic-depth and dynamic-width networks. Another line of research has focused on the post-processing phase of DNNs. Shapira et al. [26] demonstrated that overloading object detection models by massively increasing the total number of candidates input into the non-maximum suppression (NMS) component can result in increased execution times. Building on this, Liu et al. [15] extended the approach to LiDAR detection models. In this paper, we present a novel attack vector that has not been studied before: an adversarial attack that targets the availability of efficient transformer-based models that employ TS mechanisms.

4 Method

4.1 Threat model

Adversary's Goals. We consider an adversary whose primary goal is to generate an adversarial perturbation δ that causes TS mechanisms to use all available tokens, i.e., no tokens are sparsified. Furthermore, as a secondary goal, to increase the stealthiness of the attack, the adversary aims to maintain the model's original classification.

Adversary's Knowledge. To assess the security vulnerability of TS mechanisms, we consider three scenarios: (i) a white-box scenario, in which the attacker has full knowledge about the victim model; (ii) a grey-box scenario, in which the attacker has partial knowledge about the set of potential models; and (iii) a black-box scenario, in which the attacker crafts a perturbation on a surrogate model and applies it to a different victim model.

Attack Variants. Given a dataset D that contains multiple pairs (x_i, y_i), where x_i is a sample and y_i is the corresponding label, we consider three attack variants: (i) single-image - a different perturbation δ is crafted for each x ∈ D; (ii) class-universal - a single perturbation δ is crafted for a target class m ∈ M; and (iii) universal - a single perturbation δ is crafted for all x ∈ D.
4.2 DeSparsify attack

To achieve the goals presented above, we utilize the PGD attack [18] with a modified loss function (a commonly used approach [26, 11]). The update of the perturbation δ in iteration t is formulated as:

δ_{t+1} = Π_{||δ||_p < ε} ( δ_t + α · sgn( ∇_δ Σ_{(x,y)∈D} L(x, y) ) )    (2)

where α is the step size, Π is the projection operator that enforces ||δ||_p < ε for some norm p, and L is the loss function. The selection of D in Eq. (2) depends on the attack variant: (i) for the single-image variant, D = {(x, y)}; (ii) for the class-universal variant with a target class m ∈ M, D = {(x, y) | y = m}; and (iii) for the universal variant, the entire dataset is used. Next, we describe the proposed custom loss function, which is formulated as follows:

L = L_atk + λ · L_cls    (3)

where L_atk is the attacking component, L_cls is the classification preservation component, and λ is a scaling term that is empirically determined using a grid search. The L_cls component, set to achieve the secondary goal, is defined as follows:

L_cls = Σ_{m=1}^{M} ℓ_CE( f_m(x + δ), f_m(x) )    (4)

where f_m denotes the score for class m and ℓ_CE denotes the cross-entropy loss. The L_atk component, set to achieve the main goal, is described separately for each TS mechanism in the subsections below.

4.2.1 Attack on ATS

Preliminaries. To prune the attention matrix A, i.e., remove redundant tokens, ATS [5] uses the weights A_{1,2}, ..., A_{1,N+1} as significance scores, where A_1 represents the attention weights of the classification token, and A_{1,j} represents the importance of the input token j for the output classification token. The significance score for a token j is thus given by:

S_j = ( A_{1,j} · ||V_j|| ) / ( Σ_{i=2}^{N+1} A_{1,i} · ||V_i|| )

where i, j ∈ {2, ..., N + 1}. For multi-head attention, the score is calculated for each head, and the scores are summed over all heads. Since the scores are normalized, they can be interpreted as probabilities, and the cumulative distribution function (CDF) of S can be calculated as CDF_i = Σ_{j=2}^{i} S_j. Given the cumulative distribution function, the token sampling function is obtained by calculating the inverse of the CDF: ψ(r) = CDF^{-1}(r) = n, where r ∈ [0, 1] and n ∈ [2, N + 1] (which corresponds to a token's index). To obtain R samples, a fixed sampling scheme is used by choosing r ∈ { 1/(2R), 3/(2R), ..., (2R-1)/(2R) }. If a token is sampled more than once, only one instance is kept. Next, given the indices of the sampled tokens, the attention matrix is refined by selecting only the rows that correspond to the sampled tokens. For example, in the case in which a token j in S is assigned a high significance score, it will be sampled multiple times, which will result in fewer unique tokens in the final set.

Attack Methodology. Since our goal is to prevent the method from sparsifying any tokens, we want the sampling procedure to sample as many unique tokens as possible, i.e., R' = |{ψ(r') | r' ∈ r}| = R, where R' denotes the number of unique sampled tokens. The number of unique sampled tokens R' depends on the distribution they are drawn from (the CDF of S). In one extreme case, when the scores are not balanced (S_j → 1), only the dominant token j will be sampled, i.e., R' = 1. In the other extreme case, in which the scores are perfectly balanced, each token will only be sampled once, and none of the tokens will be sparsified, resulting in R' = R. Therefore, we want to push the vector S towards a uniform distribution, which will result in a balanced scores vector. Formally, let Ŝ be a vector representing a uniform distribution. The loss component we propose is formulated as:

L_ATS = Σ_l ℓ_KL( S_l, Ŝ )    (5)

where ℓ_KL denotes the Kullback-Leibler (KL) divergence loss and S_l denotes the scores vector of the l-th block. The use of the KL divergence loss enforces the optimization to consider all the elements in the scores vector S_l as a distribution, as opposed to a simple distance metric (e.g., MSE loss) that only considers them as independent values.
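The following minimal sketch illustrates how the update of Eq. (2) and the loss of Eqs. (3)-(5) could be wired together. It assumes a hypothetical `model(x)` interface that returns both the class logits and the per-block ATS significance vectors S_l (in practice these must be hooked out of the sparsification module), omits clamping of x + δ to the valid pixel range, and is written as a minimization of the KL and cross-entropy terms, so the step uses the negative gradient sign; the sign convention of Eq. (2) may instead be folded into the loss definition.

```python
# Sketch of the DeSparsify optimization loop (Eqs. 2-5); `model` is hypothetical.
import torch
import torch.nn.functional as F

def ats_kl_loss(scores_per_block):
    """Eq. (5): KL divergence between each block's significance vector S_l and a
    uniform distribution, pushing the sampler towards keeping every token."""
    loss = 0.0
    for S in scores_per_block:                      # S: (num_tokens,), sums to 1
        uniform = torch.full_like(S, 1.0 / S.numel())
        # PyTorch's kl_div expects log-probabilities as its first argument.
        loss = loss + F.kl_div(S.clamp_min(1e-12).log(), uniform, reduction="sum")
    return loss

def desparsify_pgd(model, dataset, eps=8 / 255, alpha=1 / 255, steps=100, lam=1.0):
    """PGD-style optimization of a perturbation with L = L_atk + lambda * L_cls."""
    delta = torch.zeros_like(dataset[0][0])         # shared across D (universal variants)
    for _ in range(steps):
        delta.requires_grad_(True)
        total = 0.0
        for x, _ in dataset:                        # D is chosen per attack variant
            logits_adv, scores_adv = model(x + delta)      # hypothetical interface
            with torch.no_grad():
                logits_clean, _ = model(x)
            l_atk = ats_kl_loss(scores_adv)
            # Cross-entropy against the clean prediction (Eq. 4) preserves the label.
            l_cls = -(logits_clean.softmax(-1) * logits_adv.log_softmax(-1)).sum()
            total = total + l_atk + lam * l_cls
        grad = torch.autograd.grad(total, delta)[0]
        # Signed gradient step and projection onto the L_inf ball of radius eps.
        delta = (delta.detach() - alpha * grad.sign()).clamp(-eps, eps)
    return delta.detach()
```

For the single-image variant, `dataset` contains a single pair; for the class-universal and universal variants, the same loop is simply run over the corresponding subset of the dataset, as described in Section 4.1.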
4.2.2 Attack on AdaViT

Preliminaries. In AdaViT [20], a decision network is inserted into each transformer block to predict binary decisions regarding the use of patch embeddings, self-attention heads, and transformer blocks. The decision network in the l-th block consists of three linear layers with parameters W_l = {W^p_l, W^h_l, W^b_l}, which produce computation usage policies for patch selection, attention-head selection, and transformer-block selection, respectively. Formally, given the input to the l-th block Z_l, the usage policy matrices for the block are computed as (m^p_l, m^h_l, m^b_l) = (W^p_l, W^h_l, W^b_l) · Z_l, where m^p_l ∈ R^N, m^h_l ∈ R^H, and m^b_l ∈ R^2. Since the decisions are binary, the keep/discard action is resolved by applying Gumbel-Softmax [17] to make the process differentiable: M_l = ( GS(m^b_l), GS(m^h_l), GS(m^p_l) ), where M_l ∈ {0, 1}^{2+H+N} and GS is the Gumbel-Softmax function. For example, the j-th patch embedding in the l-th block is kept when M^p_{l,j} = 1 and dropped when M^p_{l,j} = 0. It should also be noted that the activation of the attention heads depends on the activation of the MSA.

Attack Methodology. The output of the decision network in each block, M_l, provides a binary decision about which parts will be activated. Therefore, our goal is to push all the decision values towards the "activate" decision, which will result in no sparsification. Practically, we want the Gumbel-Softmax values to be equal to one, i.e., M_{l,i} = 1 for all i. Formally, we define the loss component as follows:

L_AdaViT = Σ_l [ (1/2) Σ_b ℓ_MSE( M^b_{l,b}, M̂ ) + 1_{M^0_l = 1} · (1/H) Σ_h ℓ_MSE( M^h_{l,h}, M̂ ) + (1/N) Σ_p ℓ_MSE( M^p_{l,p}, M̂ ) ]    (6)

where ℓ_MSE denotes the MSE loss, M̂ denotes the target value (set to one), and M^0_l denotes the decision regarding the activation of the MSA in block l. We condition the attention-heads term on M^0_l to avoid penalizing the activation of the attention heads when the MSA is deactivated (M^0_l = 0). When M^0_l = 0, the attention heads in that block are also deactivated.
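A minimal sketch of the AdaViT term in Eq. (6) is given below. It assumes the per-block Gumbel-Softmax decisions have already been extracted from the decision networks; the tensor layout (index 0 of M^b_l holding the MSA decision) and the soft threshold used for the indicator are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of the AdaViT attack term (Eq. 6), under assumed tensor shapes.
import torch
import torch.nn.functional as F

def adavit_attack_loss(decisions):
    """decisions: list of per-block tuples (M_b, M_h, M_p) of Gumbel-Softmax
    outputs in [0, 1], with shapes (2,), (H,), and (N,), respectively.

    Pushes every usage decision towards 1 ("activate"); the head term is only
    added when the block's MSA is itself kept, mirroring the indicator in Eq. (6).
    """
    loss = 0.0
    for M_b, M_h, M_p in decisions:
        loss = loss + F.mse_loss(M_b, torch.ones_like(M_b))      # block components (MSA, FFN)
        if M_b[0] >= 0.5:                                        # stand-in for 1_{M^0_l = 1}
            loss = loss + F.mse_loss(M_h, torch.ones_like(M_h))  # attention heads
        loss = loss + F.mse_loss(M_p, torch.ones_like(M_p))      # patch embeddings
    return loss

# Toy usage: 2 blocks, 12 heads, 196 patch tokens per block.
blocks = [(torch.rand(2), torch.rand(12), torch.rand(196)) for _ in range(2)]
print(adavit_attack_loss(blocks))
```

The default mean reduction of `mse_loss` corresponds to the per-component averaging in Eq. (6).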
4.2.3 Attack on A-ViT

Preliminaries. In A-ViT [33], a global halting mechanism that monitors all blocks in a joint manner is proposed; the tokens are adaptively deactivated using an input-dependent halting score. For a token j in block l, the score h^l_j is computed as follows:

h^l_j = H(Z^l_j)    (7)

where H(·) is a halting module, and h^l_j is enforced to be in the range [0, 1]. As the inference progresses into deeper blocks, the score is simply accumulated over the previous blocks' scores. A token is deactivated when the cumulative score exceeds 1 - τ:

I_j = arg min_{n ≤ L} { n : Σ_{l=1}^{n} h^l_j ≥ 1 - τ }    (8)

where τ is a small positive constant that allows halting after one block and I_j denotes the layer index at which the token is discarded. Once a token is halted in block l, it is also deactivated for all remaining depths l > I_j. The halting module H(·) is incorporated into a single neuron in the token's embedding, specifically the first neuron. The neuron is "spared" from the original embedding dimension, and thus no additional parameters are introduced, enabling the halting score to be calculated as:

H(Z^l_j) = σ( γ · Z^l_{j,0} + β )    (9)

where Z^l_{j,0} indicates the first dimension of token Z^l_j, σ(·) denotes the logistic sigmoid function, and γ and β are learnable scaling and shifting parameters, respectively, that are shared across all tokens.

Attack Methodology. As noted above, the decision of whether to deactivate a token j in block n (and in all the remaining blocks) relies on the cumulative score Σ_{l=1}^{n} h^l_j. If the cumulative score exceeds the threshold 1 - τ, the token is halted. Therefore, we want to push the cumulative score for all blocks beneath the threshold; practically, we push it towards zero (the minimum value of the sigmoid function). This will result in the use of token j in all blocks, i.e., I_j = L. Formally, we define the loss component as:

L_A-ViT = 1 n=1 1Ij
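The sketch below illustrates the halting score of Eq. (9) and an attack term that pushes all halting scores towards zero, as described above. Since the exact formulation of L_A-ViT is truncated in this excerpt, the loss shown here is an illustrative assumption rather than the authors' definition.

```python
# Sketch of the A-ViT halting score (Eq. 9) and an assumed attack term pushing
# the per-block scores (and hence their cumulative sum) towards zero.
import torch

def halting_score(Z_l, gamma, beta):
    """Eq. (9): H(Z^l_j) = sigmoid(gamma * Z^l_{j,0} + beta), computed for every
    token j from the first embedding dimension of the block output Z_l."""
    return torch.sigmoid(gamma * Z_l[:, 0] + beta)          # shape: (num_tokens,)

def avit_attack_loss(scores_per_block):
    """Assumed attack term (not the paper's exact L_A-ViT): penalize nonzero
    halting scores so the cumulative score never reaches the 1 - tau threshold
    and every token survives all L blocks."""
    loss = 0.0
    for h_l in scores_per_block:                            # h_l: (num_tokens,)
        loss = loss + (h_l ** 2).mean()                     # push each score towards 0
    return loss

# Toy usage: 3 blocks, 197 tokens, embedding dimension 192, shared gamma and beta.
gamma, beta = torch.tensor(1.0), torch.tensor(-0.5)
scores = [halting_score(torch.randn(197, 192), gamma, beta) for _ in range(3)]
print(avit_attack_loss(scores))
```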