# Spike-driven Transformer

Man Yao1,2, Jiakui Hu3,1, Zhaokun Zhou3,2, Li Yuan3,2, Yonghong Tian3,2, Bo Xu1, Guoqi Li1

1 Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 Peng Cheng Laboratory, Shenzhen, Guangdong, China
3 Peking University, Beijing, China

Equal contribution. Corresponding author: guoqi.li@ia.ac.cn.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties: i) Event-driven, no calculation is triggered when the input of Transformer is zero; ii) Binary spike communication, all matrix multiplications associated with the spike matrix can be transformed into sparse additions; iii) Self-attention with linear complexity at both token and channel dimensions; iv) The operations between spike-form Query, Key, and Value are mask and addition. Together, there are only sparse addition operations in the Spike-driven Transformer. To this end, we design a novel Spike-Driven Self-Attention (SDSA), which exploits only mask and addition operations without any multiplication, and thus has up to 87.2× lower computation energy than vanilla self-attention. Especially in SDSA, the matrix multiplication between Query, Key, and Value is designed as a mask operation. In addition, we rearrange all residual connections in the vanilla Transformer before the activation functions to ensure that all neurons transmit binary spike signals. It is shown that the Spike-driven Transformer can achieve 77.1% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field. All source code and models are available at https://github.com/BICLab/Spike-Driven-Transformer.

1 Introduction

One of the most crucial computational characteristics of bio-inspired Spiking Neural Networks (SNNs) [1] is spike-based event-driven (spike-driven) computing: i) When a computation is event-driven, it is triggered sparsely as events (spikes with address information) occur; ii) If only binary spikes (0 or 1) are employed for communication between spiking neurons, the network's operations are synaptic ACcumulate (AC). When implementing SNNs on neuromorphic chips, such as TrueNorth [2], Loihi [3], and Tianjic [4], only a small fraction of spiking neurons are active at any moment while the rest are idle. Thus, spike-driven neuromorphic computing that only performs sparse addition operations is regarded as a promising low-power alternative to traditional AI [5, 6, 7].

Although SNNs have apparent advantages in bio-plausibility and energy efficiency, their applications are limited by poor task accuracy. Transformers have shown high performance in various tasks thanks to their self-attention [8, 9, 10]. Incorporating the effectiveness of Transformer with the high energy efficiency of SNNs is a natural and exciting idea. There has been some research in this direction, but all of it so far has relied on hybrid computing.
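To make the AC-versus-MAC distinction above concrete, the toy PyTorch snippet below (an illustrative sketch, not code from the released repository) checks that multiplying a weight matrix by a binary spike vector equals summing the weight columns addressed by the active spikes, i.e., only additions are required once the spike addresses are known.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 8)                      # synaptic weight matrix
s = (torch.rand(8) > 0.8).float()          # sparse binary spike vector (0/1)

dense = W @ s                              # MAC formulation: multiply-accumulate
sparse_add = W[:, s.bool()].sum(dim=1)     # AC formulation: gather the columns addressed by spikes, then add

print(torch.allclose(dense, sparse_add))   # True: spike-driven matmul reduces to sparse addition
```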
Figure 1: Comparison between Vanilla Self-Attention (VSA) and our Spike-Driven Self-Attention (SDSA). (a) is a typical Vanilla Self-Attention (VSA) [8]. (b) shows two equivalent versions of SDSA. The inputs of SDSA are binary spikes. In SDSA, there are only mask and sparse addition operations. Version 1: spike Q and K first perform an element-wise mask, i.e., a Hadamard product; then column summation and a spiking neuron layer are adopted to obtain the binary attention vector; finally, the binary attention vector is applied to the spike V to mask some channels (features). Version 2: an equivalent view of Version 1 (see Section 3.3) reveals that SDSA is a unique type of linear attention (the spiking neuron layer is the kernel function) whose time complexity is linear with both the token and channel dimensions. Typically, performing self-attention in VSA and SDSA requires $2N^2D$ multiply-and-accumulate and $0.02ND$ accumulate operations respectively, where $N$ is the number of tokens, $D$ is the channel dimension, and 0.02 is the non-zero ratio of the matrix after the mask of Q and K. Thus, the self-attention operator between the spike Q, K, and V has almost no energy consumption.

Namely, Multiply-and-ACcumulate (MAC) operations dominated by vanilla Transformer components and AC operations caused by spiking neurons both exist in the existing spiking Transformers. One popular approach is to replace some of the neurons in Transformer with spiking neurons to perform various tasks [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], while keeping the MAC-required operations like dot-product, softmax, scale, etc. Though hybrid computing helps reduce the accuracy loss brought on by adding spiking neurons to the Transformer, it is challenging to benefit from the SNN's low energy cost, especially given that current spiking Transformers are hardly usable on neuromorphic chips.

To address this issue, we propose a novel Spike-driven Transformer that achieves the spike-driven nature of SNNs throughout the network while having great task performance. Two core modules of Transformer, Vanilla Self-Attention (VSA) and Multi-Layer Perceptron (MLP), are re-designed to have a spike-driven paradigm.

The three input matrices of VSA are Query ($Q$), Key ($K$), and Value ($V$) (Fig. 1(a)). $Q$ and $K$ first perform a similarity calculation to obtain the attention map, which includes three steps: matrix multiplication, scale, and softmax. The attention map is then used to weight $V$ (another matrix multiplication). The typical spiking self-attentions in current spiking Transformers [20, 19] convert $Q$, $K$, $V$ into spike form before performing two matrix multiplications similar to those in VSA. The distinction is that the spike matrix multiplications can be converted into additions, and softmax is not necessary [20]. But these methods not only yield large integers in the output (thus requiring an additional scale multiplication for normalization to avoid gradient vanishing), but also fail to exploit the full energy-efficiency potential of the spike-driven paradigm combined with self-attention.
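As a quick arithmetic check of the operation counts quoted in the Figure 1 caption ($2N^2D$ MACs for VSA versus roughly $0.02ND$ ACs for SDSA), the snippet below evaluates both for the ImageNet configuration used later in the paper (N = 196 patches, D = 512 channels); the 0.02 factor is the non-zero ratio reported in the caption, taken here as given.

```python
# Operation counts from the Figure 1 caption, evaluated for the
# ImageNet setting used later in the paper (N = 196 patches, D = 512 channels).
N, D = 196, 512
nonzero_ratio = 0.02                      # reported non-zero ratio after masking Q and K

vsa_macs = 2 * N**2 * D                   # QK^T and attention-weighted V: ~2*N^2*D MACs
sdsa_acs = nonzero_ratio * N * D          # column summation over the sparse masked matrix

print(f"VSA : {vsa_macs:,.0f} multiply-accumulate ops")   # ~39.3 million
print(f"SDSA: {sdsa_acs:,.0f} accumulate ops")            # ~2,000
print(f"ratio: {vsa_macs / sdsa_acs:,.0f}x")              # ~19,600x fewer operations
```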
We propose Spike-Driven Self-Attention (SDSA) to address these issues, which includes two aspects (see SDSA Version 1 in Fig. 1(b)): i) a Hadamard product replaces matrix multiplication; ii) matrix column-wise summation and a spiking neuron layer take the role of softmax and scale. The former can be considered as consuming no energy because the Hadamard product between spikes is equivalent to an element-wise mask. The latter also consumes almost no energy since the matrix to be summed column-by-column is very sparse (typically, the ratio of non-zero elements is less than 0.02). We also observe that SDSA is a special kind of linear attention [23, 24], i.e., Version 2 of Fig. 1(b). In this view, the spiking neuron layer that converts Q, K, and V into spike form is a kernel function.

Additionally, existing spiking Transformers [20, 12] usually follow the SEW-SNN residual design [25], whose shortcut is spike addition and thus outputs multi-bit (integer) spikes. This shortcut satisfies event-driven computing but introduces integer multiplication. To address this issue, we modify the residual connections throughout the Transformer architecture into shortcuts between membrane potentials [26, 27] (Section 3.2). The proposed Spike-driven Transformer improves task accuracy on both static and neuromorphic event-based datasets. The main contributions of this paper are as follows:

- We propose a novel Spike-driven Transformer that only exploits sparse addition. This is the first time that the spike-driven paradigm has been incorporated into Transformer, and the proposed model is hardware-friendly to neuromorphic chips.
- We design a Spike-Driven Self-Attention (SDSA) module. The self-attention operator between the spike Query, Key, and Value is replaced by mask and sparse addition with essentially no energy consumption. SDSA is computationally linear in both tokens and channels. Overall, the energy cost of SDSA (including the Query, Key, Value generation parts) is 87.2× lower than its vanilla self-attention counterpart.
- We rearrange the residual connections so that all spiking neurons in the Spike-driven Transformer communicate via binary spikes.
- Extensive experiments show that the proposed architecture can outperform or be comparable to State-Of-The-Art (SOTA) SNNs on both static and neuromorphic datasets. We achieve 77.1% accuracy on ImageNet-1K, which is the SOTA result in the SNN field.

2 Related Works

Bio-inspired Spiking Neural Networks can profit from advanced deep learning and neuroscience knowledge concurrently [28, 5, 7]. Many biological mechanisms have been leveraged to inspire SNN neuron modeling [1, 29], learning rules [30, 31], etc. Existing studies have shown that SNNs are well suited for incorporating brain mechanisms, e.g., long short-term memory [32, 33], attention [34, 27, 35], etc. Moreover, while keeping their own spike-driven benefits, SNNs have greatly improved their task accuracy by integrating deep learning technologies such as network architectures [26, 25, 36], gradient backpropagation [37, 38], normalization [39, 40], etc. Our goal is to combine SNN and Transformer architectures. One way is to discretize a Transformer into spike form through neuron equivalence [41, 42], i.e., ANN2SNN, but this requires a long simulation timestep and boosts the energy consumption. We employ the direct training method, using the first SNN layer as the spike encoding layer and applying surrogate gradient training [43].

Neuromorphic Chips. As opposed to the compute-and-memory-separated processors used in ANNs, neuromorphic chips use non-von Neumann architectures, which are inspired by the structure and function of the brain [5, 7, 28].
Because they use spiking neurons and synapses as basic units, neuromorphic chips [44, 45, 46, 2, 47, 3, 4] have unique features, such as highly parallel operation, collocated processing and memory, inherent scalability, and spike-driven computing. Typical neuromorphic chips consume tens to hundreds of milliwatts [48]. Thanks to sparse spike-driven computing, Conv and MLP layers on neuromorphic chips reduce to a cheap addressing algorithm [49], i.e., finding out which synapses and neurons need to be involved in the addition operations. Our Transformer design strictly follows the spike-driven paradigm; thus it is friendly to deploy on neuromorphic chips.

Efficient Transformers. Transformer and its variants have been widely used in numerous tasks, such as natural language processing [8, 50, 51] and computer vision [52, 53, 54]. However, deploying these models on mobile devices with limited resources remains challenging because of their inherent complexity [24, 55]. Typical optimization methods include mixing convolution and self-attention [55, 56], and optimizing the Transformer's own mechanisms (the token mechanism [57, 58], self-attention [59, 60], multi-head design [61, 62], and so on). An important direction for efficient Transformers is linear attention over tokens, since the computation scale of self-attention is quadratic in the token number. Removing the softmax in self-attention and re-arranging the computation order of Query, Key, and Value is the main way to achieve linear attention [60, 63, 23, 64, 65, 62, 66]. For the spiking Transformer, softmax cannot exist; thus the spiking Transformer can be a kind of linear attention.

Figure 2: The overview of the Spike-driven Transformer. We follow the network structure in [20], but make two new key designs. First, we propose a Spike-Driven Self-Attention (SDSA) module, which consists of only mask and sparse addition operations (Fig. 1(b)). Second, we redesign the shortcuts in the whole network, involving the position embedding, self-attention, and MLP parts. As indicated by the red line, the shortcut is constructed before the spike neuron layer. That is, we establish residual connections between membrane potentials to make sure that the values in the spike matrix are all binary, which allows the multiplication of the spike and weight matrices to be converted into addition operations. By contrast, previous works [20, 19] build shortcuts between spike tensors in different layers, resulting in the output of spike neurons being multi-bit (integer) spikes.

3 Spike-driven Transformer

We propose a Spike-driven Transformer, which incorporates Transformer into the spike-driven paradigm with only sparse addition. We first briefly introduce the spiking neuron layer, then introduce the overview and components of the Spike-driven Transformer one by one. The spiking neuron model is simplified from the biological neuron model [67, 29]. The Leaky Integrate-and-Fire (LIF) spiking neuron [1], which has biological neuronal dynamics and is easy to simulate on a computer, is uniformly adopted in this work.
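Since the LIF neuron is easy to simulate, a minimal sketch of such a simulation is given below (hypothetical PyTorch code for illustration only); it follows the charge-fire-reset dynamics formalized in Eqs. (1)-(3) next, and omits the surrogate gradient that direct training would attach to the firing step.

```python
import torch

def lif_forward(x_seq, u_th=1.0, beta=0.5, v_reset=0.0):
    """Simulate a layer of LIF neurons over T timesteps (cf. Eqs. (1)-(3)).

    x_seq  : tensor of shape [T, ...], the spatial input X[t] at every timestep.
    Returns the binary spike train S[t] with the same shape.
    """
    h = torch.zeros_like(x_seq[0])              # temporal state H[t-1], initialized to 0
    spikes = []
    for x in x_seq:
        u = h + x                               # charge: U[t] = H[t-1] + X[t]
        s = (u >= u_th).float()                 # fire:   S[t] = Hea(U[t] - u_th)
        h = v_reset * s + beta * u * (1.0 - s)  # reset to V_reset if fired, else decay by beta
        spikes.append(s)
    return torch.stack(spikes)

# Example: T = 4 timesteps of random input current for 8 neurons
print(lif_forward(torch.rand(4, 8)))            # 0/1 spike trains
```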
The dynamics of the LIF layer [37] are governed by

$$U[t] = H[t-1] + X[t], \tag{1}$$
$$S[t] = \mathrm{Hea}\left(U[t] - u_{th}\right), \tag{2}$$
$$H[t] = V_{reset}\, S[t] + \beta U[t]\left(1 - S[t]\right), \tag{3}$$

where $t$ denotes the timestep, and $U[t]$ is the membrane potential produced by coupling the spatial input information $X[t]$ and the temporal input $H[t-1]$, where $X[t]$ can be obtained through operators such as Conv, MLP, and self-attention. When the membrane potential exceeds the threshold $u_{th}$, the neuron fires a spike; otherwise it does not. Thus, the spatial output tensor $S[t]$ contains only 1 or 0. $\mathrm{Hea}(\cdot)$ is a Heaviside step function that satisfies $\mathrm{Hea}(x) = 1$ when $x \geq 0$, and $\mathrm{Hea}(x) = 0$ otherwise. $H[t]$ indicates the temporal output, where $V_{reset}$ denotes the reset potential that is set after a spike is fired. $\beta < 1$ is the decay factor; if the spiking neuron does not fire, the membrane potential $U[t]$ decays to $H[t]$.

3.1 Overall Architecture

Fig. 2 shows the overview of the Spike-driven Transformer, which includes four parts: Spiking Patch Splitting (SPS), SDSA, MLP, and a linear classification head. For the SPS part, we follow the design in [20]. Given a 2D image sequence $I \in \mathbb{R}^{T\times C\times H\times W}$, the Patch Splitting Module (PSM), i.e., the first four Conv layers, linearly projects and splits it into a sequence of $N$ flattened spike patches $s$ with $D$-dimensional channels, where $T$ (images are repeated $T$ times as input for static datasets), $C$, $H$, and $W$ denote the timestep, channel, height, and width of the 2D image sequence. Another Conv layer is then used to generate the Relative Position Embedding (RPE). Together, the SPS part is written as:

$$u = \mathrm{PSM}(I), \quad I \in \mathbb{R}^{T\times C\times H\times W},\ u \in \mathbb{R}^{T\times N\times D}, \tag{4}$$
$$s = \mathrm{SN}(u), \quad s \in \mathbb{R}^{T\times N\times D}, \tag{5}$$
$$\mathrm{RPE} = \mathrm{BN}(\mathrm{Conv2d}(s)), \quad \mathrm{RPE} \in \mathbb{R}^{T\times N\times D}, \tag{6}$$
$$U_0 = u + \mathrm{RPE}, \quad U_0 \in \mathbb{R}^{T\times N\times D}, \tag{7}$$

where $u$ and $U_0$ are the output membrane potential tensors of PSM and SPS respectively, and $\mathrm{SN}(\cdot)$ denotes the spike neuron layer. Note that in Eq. 6, before executing $\mathrm{Conv2d}(\cdot)$, $s \in \mathbb{R}^{T\times N\times D}$ is reshaped into $s \in \mathbb{R}^{T\times C\times H\times W}$. We pass $U_0$ to the $L$-block Spike-driven Transformer encoder, in which each block consists of an SDSA part and an MLP part. Residual connections are applied to membrane potentials in both the SDSA and MLP parts. SDSA provides an efficient approach to model the local-global information of images utilizing spike $Q$, $K$, and $V$ without scale and softmax (see Sec. 3.3). Global Average Pooling (GAP) is applied to the processed feature from the spike-driven encoder and outputs a $D$-dimensional channel vector, which is sent to the fully-connected Classification Head (CH) to output the prediction $Y$. The SDSA, MLP, and CH parts can be written as follows:

$$S_0 = \mathrm{SN}(U_0), \quad S_0 \in \mathbb{R}^{T\times N\times D}, \tag{8}$$
$$U'_l = \mathrm{SDSA}(S_{l-1}) + U_{l-1}, \quad U'_l \in \mathbb{R}^{T\times N\times D},\ l = 1 \ldots L, \tag{9}$$
$$S'_l = \mathrm{SN}(U'_l), \quad S'_l \in \mathbb{R}^{T\times N\times D},\ l = 1 \ldots L, \tag{10}$$
$$S_l = \mathrm{SN}(\mathrm{MLP}(S'_l) + U'_l), \quad S_l \in \mathbb{R}^{T\times N\times D},\ l = 1 \ldots L, \tag{11}$$
$$Y = \mathrm{CH}(\mathrm{GAP}(S_L)), \tag{12}$$

where $U'_l$ and $S'_l$ represent the membrane potential and spike tensor output by SDSA at the $l$-th layer.

3.2 Membrane Shortcut in Spike-driven Transformer

Residual connection [68, 69] is a crucial basic operation in the Transformer architecture. There are three shortcut techniques in existing Conv-based SNNs [27]. Vanilla Res-SNN [39], similar to vanilla Res-CNN [68], performs a shortcut between membrane potential and spike. Spike-Element-Wise (SEW) Res-SNN [25] employs a shortcut to connect the output spikes in different layers. Membrane Shortcut (MS) Res-SNN [26] creates a shortcut between the membrane potentials of spiking neurons in different layers. There is no uniform standard shortcut in the current SNN community, and the SEW shortcut is adopted by existing spiking Transformers [20, 12].
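The difference between the SEW and MS shortcuts can be seen in a few lines of hypothetical PyTorch (a schematic contrast under simplified assumptions, not the exact SEW-Res-SNN formulation): adding spikes produces integer values, whereas adding membrane potentials before a single firing step keeps the output binary.

```python
import torch

def sn(u, u_th=1.0):
    # Heaviside firing of Eq. (2); in training a surrogate gradient is attached to this step.
    return (u >= u_th).float()

torch.manual_seed(0)
u_prev = 2.0 * torch.rand(4, 8)    # membrane potential entering a block, e.g., U_{l-1}
update = 2.0 * torch.rand(4, 8)    # residual-branch output, e.g., SDSA(S_{l-1}) or MLP(S'_l)

# SEW-style shortcut: add the *spikes* of the two branches.
sew_out = sn(update) + sn(u_prev)  # entries lie in {0, 1, 2}; 2 appears wherever both branches fire
print(sew_out.unique())            # may contain 2: multi-bit (integer) "spikes"

# Membrane shortcut (MS, used in this work): add *membrane potentials*, then fire once.
ms_out = sn(update + u_prev)       # cf. Eq. (11): S_l = SN(MLP(S'_l) + U'_l)
print(ms_out.unique())             # values drawn from {0, 1} only: strictly binary spikes
```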
As shown in Eq. 7, Eq. 9, and Eq. 11, we leverage the membrane shortcut in the proposed Spike-driven Transformer for four reasons:

- Spike-driven refers to the ability to transform matrix multiplications between weight and spike tensors into sparse additions. Only binary spikes can support the spike-driven function. With the SEW shortcut, however, the values in the spike tensors become multi-bit (integer) spikes, because the SEW shortcut adds binary spikes together. By contrast, as shown in Eq. 8, Eq. 10, and Eq. 11, the spike neuron layer SN follows the MS shortcut, which ensures that there are always only binary spike signals in the spike tensor.
- High performance. The task accuracy of MS-Res-SNN is higher than that of SEW-Res-SNN [26, 27, 25], also in Transformer-based SNNs (see Table 6 in this work).
- Bio-plausibility. The MS shortcut can be understood as an approach to optimize the membrane potential distribution. This is consistent with other neuroscience-inspired methods that optimize the internal dynamics of SNNs, such as complex spiking neuron design [70], attention mechanisms [27], long short-term memory [33], recurrent connections [71], information maximization [72], etc.
- Dynamical isometry. MS-Res-SNN has been proven [26] to satisfy dynamical isometry theory [73], which is a theoretical explanation of well-behaved deep neural networks [74, 75].

3.3 Spike-driven Self-Attention

Vanilla Self-Attention (VSA). Given a float-point input feature sequence $X \in \mathbb{R}^{N\times D}$, the float-point Query ($Q$), Key ($K$), and Value ($V$) in $\mathbb{R}^{N\times D}$ are calculated by three learnable linear matrices, respectively. Standard scaled dot-product self-attention (Fig. 1(a)) is computed as [52]:

$$\mathrm{VSA}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d}}\right)V, \tag{13}$$

where $d = D/H$ is the feature dimension of one head, $H$ is the number of heads, and $\sqrt{d}$ is the scale factor. The time complexity of VSA is $O(N^2D + N^2D)$.

Spike-Driven Self-Attention (SDSA) Version 1. As shown in the left part of Fig. 1(b), given a spike input feature sequence $S \in \mathbb{R}^{T\times N\times D}$, float-point $Q$, $K$, and $V$ in $\mathbb{R}^{T\times N\times D}$ are calculated by three learnable linear matrices, respectively. Note that the linear operation here involves only addition, because the input $S$ is a spike tensor. A spike neuron layer $\mathrm{SN}(\cdot)$ follows, converting $Q$, $K$, $V$ into the spike tensors $Q_S$, $K_S$, and $V_S$. SDSA Version 1 (SDSA-V1) is presented as:

$$\mathrm{SDSA}(Q, K, V) = g(Q_S, K_S) \otimes V_S = \mathrm{SN}\left(\mathrm{SUM}_c\left(Q_S \otimes K_S\right)\right) \otimes V_S, \tag{14}$$

where $\otimes$ is the Hadamard product, $g(\cdot)$ is used to compute the attention map, and $\mathrm{SUM}_c(\cdot)$ represents the sum of each column. The outputs of both $g(\cdot)$ and $\mathrm{SUM}_c(\cdot)$ are $D$-dimensional row vectors. The Hadamard product between spike tensors is equivalent to the mask operation.

Discussion on SDSA. Since the Hadamard product among $Q_S$, $K_S$, and $V_S$ in $\mathbb{R}^{N\times D}$ (we assume $T = 1$ here for mathematical understanding) can be exchanged, Eq. 14 can also be written as:

$$\mathrm{SDSA}(Q, K, V) = Q_S \otimes g(K_S, V_S) = Q_S \otimes \mathrm{SN}\left(\mathrm{SUM}_c\left(K_S \otimes V_S\right)\right). \tag{15}$$

Note that Eq. 14 and Eq. 15 are functionally equivalent. In this view, Eq. 15 is a linear attention [23, 63] whose computational complexity is linear in the token number $N$, because $K_S$ and $V_S$ can be computed first. This is possible because the softmax operation of VSA is dropped here; the function of softmax needs to be replaced by a kernel function, and in our SDSA, $\mathrm{SN}(\cdot)$ is that kernel function. Further, we can consider a special case [62], $H = D$, i.e., the number of channels per head is one. After the self-attention operation is performed on the $H$ heads respectively, the outputs are concatenated together.
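A minimal sketch of SDSA-V1 as written in Eq. (14) is shown below (hypothetical PyTorch code for a single timestep and a single head; the actual module handles the T and head dimensions and uses trainable LIF neurons with surrogate gradients rather than this fixed-threshold stand-in).

```python
import torch

def sn(x, u_th=1.0):
    # Spiking neuron layer used as the kernel function: fire where x >= threshold.
    return (x >= u_th).float()

def sdsa_v1(q_s, k_s, v_s):
    """SDSA Version 1, Eq. (14): SN(SUM_c(Q_S * K_S)) applied as a channel mask on V_S.

    q_s, k_s, v_s : binary spike tensors of shape [N, D] (one timestep, one head).
    """
    attn = sn(torch.sum(q_s * k_s, dim=0, keepdim=True))  # column sum -> [1, D] binary attention vector
    return attn * v_s                                      # column mask: unattended channels of V_S are zeroed

# Toy spike inputs: N = 6 tokens, D = 8 channels, ~15% firing rate
torch.manual_seed(0)
q_s, k_s, v_s = [(torch.rand(6, 8) < 0.15).float() for _ in range(3)]
print(sdsa_v1(q_s, k_s, v_s))   # V_S with non-attended channels masked to zero
```

Because both inputs of the Hadamard product are binary, the only arithmetic that remains is the column summation over an already very sparse matrix, which is why the operator itself is nearly free.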
Specifically,

$$\mathrm{SDSA}(Q^i, K^i, V^i) = \mathrm{SN}(Q^i) \otimes g(K^i, V^i) = \mathrm{SN}(Q^i) \otimes \mathrm{SN}\left(\mathrm{SN}(K^i)^{\mathrm{T}} \cdot \mathrm{SN}(V^i)\right), \tag{16}$$

where $Q^i, K^i, V^i \in \mathbb{R}^{N\times 1}$ are the $i$-th vectors of $Q$, $K$, $V$ respectively, and $\cdot$ is the dot product operation. The output of $g(K^i, V^i)$ is a scalar, 0 or 1. Since the operation between $\mathrm{SN}(Q^i)$ and $g(K^i, V^i)$ is a mask, the whole SDSA only needs to compute $g(K^i, V^i)$, i.e., $H = D$ times. The computational complexity of SDSA is $O(0 + ND)$, which is linear in both $N$ and $D$ (see the right part of Fig. 1(b)). The vectors $K^i$ and $V^i$ are very sparse, typically less than 0.01 (Table 2). Together, the whole SDSA only needs about $0.02ND$ additions, and its energy consumption is negligible.

Interestingly, Eq. 15 actually converts the soft vanilla self-attention into hard self-attention, where the attention scores in soft and hard attention are continuous- and binary-valued, respectively [76]. Thus, the practice of the spike-driven paradigm in this work leverages binary self-attention scores to directly mask unimportant channels in the sparse spike Value tensor. Although this introduces a slight loss of accuracy (Table 6), SDSA(·) consumes almost no energy.

4 Theoretical Energy Consumption Analysis

Three key computational modules in deep learning are Conv, MLP, and self-attention. In this section, we discuss how the spike-driven paradigm achieves high energy efficiency on these operators.

Spike-driven in Conv and MLP. Spike-driven combines two properties, event-driven and binary spike-based communication. The former means that no computation is triggered when the input is zero. The binary restriction in the latter means that there are only additions. In summary, in spike-driven Conv and MLP, matrix multiplication is transformed into sparse addition, which is implemented as addressable addition in neuromorphic chips [49].

Spike-driven in Self-attention. $Q_S$, $K_S$, $V_S$ in spiking self-attention involve two matrix multiplications. One approach is to perform multiplication directly between $Q_S$, $K_S$, $V_S$, which is then converted to sparse addition, like spike-driven Conv and MLP; the previous work [20] did just that. We provide a new scheme that performs element-wise multiplication between $Q_S$, $K_S$, $V_S$. Since all elements in the spike tensors are either 0 or 1, element-wise multiplication is equivalent to a mask operation with no energy consumption. Mask operations can be implemented in neuromorphic chips through addressing algorithms [49] or AND logic operations [4].

Energy Consumption Comparison. The number of floating-point operations (FLOPs) is often used to estimate the computational burden in ANNs, where almost all FLOPs are MACs. Under the same architecture, the energy cost of an SNN can be estimated by combining the spike firing rate $R$ and the simulation timestep $T$ if the FLOPs of the ANN counterpart are known.

Table 1: Energy evaluation. $FL_{Conv}$ and $FL_{MLP}$ represent the FLOPs of the Conv and MLP modules in the ANN, respectively. $R_C$, $R_M$, $R$, $\hat{R}$ denote the spike firing rates (the proportion of non-zero elements in the spike matrix) of the various spike matrices. We give the strict definitions and calculation methods of these indicators in the Supplementary Material due to space constraints.

| Module | Operation | Vanilla Transformer [52] | Spike-driven Transformer (This work) |
| --- | --- | --- | --- |
| SPS | First Conv | $E_{MAC} \cdot FL_{Conv}$ | $E_{MAC} \cdot T \cdot R_C \cdot FL_{Conv}$ |
| SPS | Other Conv | $E_{MAC} \cdot FL_{Conv}$ | $E_{AC} \cdot T \cdot R_C \cdot FL_{Conv}$ |
| Self-attention | $Q, K, V$ | $E_{MAC} \cdot 3ND^2$ | $E_{AC} \cdot T \cdot R \cdot 3ND^2$ |
| Self-attention | $f(Q, K, V)$ | $E_{MAC} \cdot 2N^2D$ | $E_{AC} \cdot T \cdot \hat{R} \cdot ND$ |
| Self-attention | Scale | $E_{M} \cdot N^2$ | - |
| Self-attention | Softmax | $E_{MAC} \cdot 2N^2$ | - |
| Self-attention | Linear | $E_{MAC} \cdot FL_{MLP0}$ | $E_{AC} \cdot T \cdot R_{M0} \cdot FL_{MLP0}$ |
| MLP | Layer 1 | $E_{MAC} \cdot FL_{MLP1}$ | $E_{AC} \cdot T \cdot R_{M1} \cdot FL_{MLP1}$ |
| MLP | Layer 2 | $E_{MAC} \cdot FL_{MLP2}$ | $E_{AC} \cdot T \cdot R_{M2} \cdot FL_{MLP2}$ |
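Each entry of Table 1 has the form energy-per-operation × number of operations, optionally scaled by the timestep $T$ and a firing rate. The hypothetical helper below evaluates one such entry, assuming the 45 nm per-operation energies commonly used in the SNN literature ($E_{MAC} \approx 4.6$ pJ, $E_{AC} \approx 0.9$ pJ, cf. [94]); the paper's exact constants and firing-rate definitions are given in its supplementary material, and the firing rate in the example is purely illustrative.

```python
# Theoretical energy of one layer, following the pattern of Table 1.
# The constants below are the 45 nm estimates commonly used in the SNN
# literature (cf. [94]); the paper's exact values are in its supplementary material.
E_MAC = 4.6e-12   # joules per multiply-accumulate (assumed)
E_AC  = 0.9e-12   # joules per accumulate (assumed)

def ann_layer_energy(flops):
    """Vanilla Transformer layer: every FLOP is a MAC."""
    return E_MAC * flops

def snn_layer_energy(flops, firing_rate, timesteps):
    """Spike-driven layer: only additions, scaled by sparsity R and timestep T."""
    return E_AC * timesteps * firing_rate * flops

# Example: Q/K/V generation with N = 196 tokens, D = 512 channels (3*N*D^2 FLOPs);
# the firing rate 0.15 is illustrative, not a measured value from the paper.
flops_qkv = 3 * 196 * 512**2
print(f"ANN : {ann_layer_energy(flops_qkv) * 1e3:.3f} mJ")
print(f"SNN : {snn_layer_energy(flops_qkv, firing_rate=0.15, timesteps=4) * 1e3:.3f} mJ")
```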
Table 1 shows the energy consumption of Conv, self-attention, and MLP modules of the same scale in the vanilla and our Spike-driven Transformer.

5 Experiments

We evaluate our method on the static datasets ImageNet [77] and CIFAR-10/100 [78], and on the neuromorphic datasets CIFAR10-DVS [79] and DVS128 Gesture [80].

Experimental Setup on ImageNet. For the convenience of comparison, we generally follow the experimental setup in [20]. The input size is set to 224×224. The batch size is set to 128 or 256 during 310 training epochs with a cosine-decay learning rate whose initial value is 0.0005. The optimizer is Lamb. The image is divided into N = 196 patches using the SPS module. Standard data augmentation techniques, such as random augmentation and mixup, are also employed in training. Details of the training and experimental setup on ImageNet are given in the supplementary material.

Accuracy analysis on ImageNet. Our experimental results on ImageNet are given in Table 3. We first compare our model performance with the baseline spiking Transformer (i.e., Spikformer [20]). We adopt the five network architectures consistent with those in Spikformer. Under the same parameter budget, our accuracies are significantly better than those of the corresponding baseline models. For instance, Spike-driven Transformer-8-384 is 2.0% higher than Spikformer-8-384. It is worth noting that Spike-driven Transformer-8-768 obtains 76.32% (input 224×224) with 66.34M parameters, which is 1.5% higher than the corresponding Spikformer. We further enlarge the inference resolution to 288×288, obtaining 77.1%, which is the SOTA result of the SNN field on ImageNet.

Table 2: Spike Firing Rate (SFR) of Spike-driven Self-attention in the 8-512 model. Average SFR is the mean SFR over T = 4 timesteps and the 8 SDSA blocks.

| Tensor in SDSA | Average SFR |
| --- | --- |
| $Q_S$ | 0.0091 |
| $K_S$ | 0.0090 |
| $g(Q_S, K_S)$ | 0.0713 |
| $V_S$ | 0.1649 |
| Output of SDSA(·), $\hat{V}_S$ | 0.0209 |

We then compare our results with existing Res-SNNs. Whether against vanilla Res-SNN [39], MS-Res-SNN [26], or SEW-Res-SNN [25], the accuracy of Spike-driven Transformer-8-768 (77.1%) is the highest. Att-MS-Res-SNN [27] also achieves 77.1% accuracy by plugging an additional attention auxiliary module [81, 82] into MS-Res-SNN, but it destroys the spike-driven nature and requires more parameters (78.37M vs. 66.34M) and training time (1000 epochs vs. 310 epochs). Furthermore, the proposed Spike-driven Transformer achieves accuracies above 72% at various network scales, while Res-SNNs have lower performance with a similar number of parameters. For example, Spike-driven Transformer-6-512 (this work) vs. SEW-Res-SNN-34 vs. MS-Res-SNN-34: Param, 23.27M vs. 21.79M vs. 21.80M; Acc, 74.11% vs. 67.04% vs. 69.15%.

Table 3: Evaluation on ImageNet. Power is the average theoretical energy consumption when predicting an image from the test set. The power data in this work is evaluated according to Table 1; the data for other works were obtained from the related papers. Spiking Transformer-L-D represents a model with L encoder blocks and D channels. *The input crops are enlarged to 288×288 in inference. The default inference input resolution for the other models is 224×224.
| Methods | Architecture | Param (M) | Power (mJ) | Time Step | Acc (%) |
| --- | --- | --- | --- | --- | --- |
| Hybrid training [83] | ResNet-34 | 21.79 | - | 250 | 61.48 |
| TET [84] | SEW-ResNet-34 | 21.79 | - | 4 | 68.00 |
| Spiking ResNet [85] | ResNet-50 | 25.56 | 70.93 | 350 | 72.75 |
| tdBN [39] | Spiking-ResNet-34 | 21.79 | 6.39 | 6 | 63.72 |
| SEW ResNet [25] | SEW-ResNet-34 | 21.79 | 4.04 | 4 | 67.04 |
| | SEW-ResNet-50 | 25.56 | 4.89 | 4 | 67.78 |
| | SEW-ResNet-101 | 44.55 | 8.91 | 4 | 68.76 |
| | SEW-ResNet-152 | 60.19 | 12.89 | 4 | 69.26 |
| MS ResNet [26] | MS-ResNet-18 | 11.69 | 4.29 | 4 | 63.10 |
| | MS-ResNet-34 | 21.80 | 5.11 | 4 | 69.42 |
| | MS-ResNet-104* | 77.28 | 10.19 | 4 | 76.02 |
| Att MS ResNet [27] | Att-MS-ResNet-18 | 11.87 | 0.48 | 1 | 63.97 |
| | Att-MS-ResNet-34 | 22.12 | 0.57 | 1 | 69.15 |
| | Att-MS-ResNet-104* | 78.37 | 7.30 | 4 | 77.08 |
| ResNet (ANN) | Res-CNN-104 | 77.28 | 54.21 | 1 | 76.87 |
| Transformer (ANN) | Transformer-8-512 | 29.68 | 41.77 | 1 | 80.80 |
| Spikformer [20] | Spiking Transformer-8-384 | 16.81 | 7.73 | 4 | 70.24 |
| | Spiking Transformer-6-512 | 23.37 | 9.41 | 4 | 72.46 |
| | Spiking Transformer-8-512 | 29.68 | 11.57 | 4 | 73.38 |
| | Spiking Transformer-10-512 | 36.01 | 13.89 | 4 | 73.68 |
| | Spiking Transformer-8-768 | 66.34 | 21.47 | 4 | 74.81 |
| Spike-driven Transformer (Ours) | Spiking Transformer-8-384 | 16.81 | 3.90 | 4 | 72.28 |
| | Spiking Transformer-6-512 | 23.37 | 3.56 | 4 | 74.11 |
| | Spiking Transformer-8-512 | 29.68 | 1.13 | 1 | 71.68 |
| | Spiking Transformer-8-512 | 29.68 | 4.50 | 4 | 74.57 |
| | Spiking Transformer-10-512 | 36.01 | 5.53 | 4 | 74.66 |
| | Spiking Transformer-8-768* | 66.34 | 6.09 | 4 | 77.07 |

Power analysis on ImageNet. Compared with prior works, the Spike-driven Transformer shines in energy cost (Table 3). We first make an intuitive comparison of energy consumption within the SNN field. Spike-driven Transformer-8-512 (this work) vs. SEW-Res-SNN-50 vs. MS-Res-SNN-34: Power, 4.50 mJ vs. 4.89 mJ vs. 5.11 mJ; Acc, 74.57% vs. 67.78% vs. 69.42%. That is, our model achieves +6.79% and +5.15% higher accuracy than the previous SEW and MS Res-SNN backbones with lower energy consumption. What is more attractive is that the energy efficiency of the Spike-driven Transformer will be further amplified as the model scale grows, because its computational complexity is linear in both the token and channel dimensions. For instance, in an 8-layer network, as the channel dimension increases from 384 to 512 and 768, Spikformer [20] has 1.98× (7.73 mJ/3.90 mJ), 2.57× (11.57 mJ/4.50 mJ), and 3.52× (21.47 mJ/6.09 mJ) higher energy consumption than our Spike-driven Transformer. At the same time, our task performance on these three network structures improves by +2.0%, +1.2%, and +1.5%, respectively.

Table 4: Energy consumption of self-attention. E1 and E2 (both including the energy consumption to generate Q, K, V) represent the power of the self-attention mechanism in the ANN and spike-driven cases, respectively.

| Model | E1 (pJ) | E2 (pJ) | E1/E2 |
| --- | --- | --- | --- |
| 8-384 | 6.7e8 | 1.6e7 | 42.6 |
| 8-512 | 1.2e9 | 2.1e7 | 57.2 |
| 8-768 | 2.7e9 | 3.1e7 | 87.2 |

Then we compare the energy cost between the Spike-driven and ANN Transformers. Under the same structure, such as 8-512, the power required by the ANN-ViT (41.77 mJ) is 9.3× that of the spike-driven counterpart (4.50 mJ). Further, the energy advantage extends to 36.7× if we set T = 1 in the Spike-driven version (1.13 mJ). Although the accuracy at T = 1 (here with direct training) is lower than at T = 4, it can be compensated by special training methods [86] in future work.

Table 5: Experimental results on CIFAR-10/100, DVS128 Gesture, and CIFAR10-DVS.
| Methods | CIFAR10-DVS (T / Acc) | DVS128 Gesture (T / Acc) | CIFAR-10 (T / Acc) | CIFAR-100 (T / Acc) |
| --- | --- | --- | --- | --- |
| tdBN [39] | 10 / 67.8 | 40 / 96.9 | 6 / 93.2 | - |
| PLIF [87] | 20 / 74.8 | 20 / 97.6 | 8 / 93.5 | - |
| Dspike [88] | 10 / 75.4 | - | 6 / 94.3 | 6 / 74.2 |
| DSR [89] | 10 / 77.3 | - | 20 / 95.4 | 20 / 78.5 |
| Spikformer [20] | 16 / 80.9 | 16 / 98.3 | 4 / 95.5 | 4 / 78.2 |
| DIET-SNN [90] | - | - | 5 / 92.7 | 5 / 69.7 |
| ANN (ResNet-19) | - | - | 1 / 94.97 | 1 / 75.4 |
| ANN (Transformer-4-384) | - | - | 1 / 96.7 | 1 / 81.0 |
| This work | 16 / 80.0 | 16 / 99.3 | 4 / 95.6 | 4 / 78.4 |

Sparse spike firing is the key for the Spike-driven Transformer to achieve high energy efficiency. As shown in Table 2, the Spike Firing Rate (SFR) of the self-attention part is very low, where the SFRs of $Q_S$ and $K_S$ are both less than 0.01. Since the mask (Hadamard product) operation does not consume energy, the number of additions required by $\mathrm{SUM}_c(Q_S \otimes K_S)$ is less than $0.02ND$. The operation between the vector output by $g(Q_S, K_S)$ and $V_S$ is still a column mask that does not consume energy. Consequently, over the whole self-attention part, the energy consumption of spike-driven self-attention can be up to 87.2× lower than that of ANN self-attention (see Table 4).

Table 6: Ablation studies on Spiking Transformer-2-512.

| Model | CIFAR-10 | CIFAR-100 |
| --- | --- | --- |
| Baseline [20] | 93.12 | 73.17 |
| + SDSA | 93.09 (-0.03) | 72.83 (-0.34) |
| + MS | 93.93 (+0.81) | 74.63 (+1.46) |
| This work | 93.82 (+0.73) | 74.41 (+1.24) |

Experimental results on CIFAR-10/100, CIFAR10-DVS, and DVS128 Gesture are reported in Table 5. These four datasets are relatively small compared to ImageNet. CIFAR-10/100 are static image classification datasets. Gesture and CIFAR10-DVS are neuromorphic action classification datasets, which require converting the event stream into frame sequences before processing. DVS128 Gesture is a gesture recognition dataset. CIFAR10-DVS is a neuromorphic dataset converted from CIFAR-10 by shifting image samples to be captured by a DVS camera. We basically keep the experimental setup in [20], including the network structure, training settings, etc.; details are given in the supplementary material. As shown in Table 5, we achieve SOTA results on Gesture (99.3%) and CIFAR-10 (95.6%), and results comparable to SOTA on the other datasets.

Figure 3: Attention maps based on the Spike Firing Rate (SFR), shown for the input image and Blocks 1-4. $V_S$ is the Value tensor. $\hat{V}_S$ is the output of SDSA(·). The spike-driven self-attention mechanism masks unimportant channels in $V_S$ to obtain $\hat{V}_S$. Each pixel on $V_S$ and $\hat{V}_S$ represents the SFR at a patch. The spatial resolution of each attention map is 14×14 (196 patches). Redder indicates higher SFR; bluer indicates lower SFR.

Ablation study. To implement the spike-driven paradigm in Transformer, we design a new SDSA module and reposition the residual connections in the entire network based on Spikformer [20]. We conduct ablation studies on CIFAR-10/100 to analyze their impact. Results are given in Table 6. We adopt Spiking Transformer-2-512 as the baseline structure. It can be observed that SDSA incurs a slight performance loss. As discussed in Section 3.3, SDSA actually masks some unimportant channels directly. In Fig. 3, we plot the attention maps (the detailed drawing method is given in the supplementary material), and we can observe: i) SDSA can optimize intermediate features, such as masking background information; ii) SDSA greatly reduces the spike firing rate of $\hat{V}_S$, thereby reducing energy cost.
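The exact procedure used to draw Fig. 3 is given in the supplementary material; the sketch below shows one plausible way to obtain such an SFR-based map, by averaging a binary spike tensor over the timestep and channel dimensions and reshaping the per-patch rates to the 14×14 grid (hypothetical code, assuming the [T, N, D] layout introduced in Section 3.1).

```python
import torch

def sfr_attention_map(spike_tensor, h=14, w=14):
    """Collapse a binary spike tensor [T, N, D] into a per-patch Spike Firing Rate map.

    Each patch's value is the fraction of 1s among its T*D entries, reshaped to the
    h x w patch grid (a plausible stand-in for the plotting method behind Fig. 3).
    """
    sfr = spike_tensor.float().mean(dim=(0, 2))   # average over timesteps and channels -> [N]
    return sfr.reshape(h, w)

# Toy example with T = 4, N = 196 patches, D = 512 channels at ~16% firing rate
v_s = (torch.rand(4, 196, 512) < 0.16).float()
print(sfr_attention_map(v_s).shape)               # torch.Size([14, 14])
```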
On the other hand, the membrane shortcut leads to significant accuracy improvements, consistent with the experience of Conv-based MS-SNNs [26, 25]. Overall, the proposed Spike-driven Transformer simultaneously achieves better accuracy and lower energy consumption (Table 3).

6 Conclusion

We propose a Spike-driven Transformer that combines the low power of SNNs with the excellent accuracy of the Transformer. There is only sparse addition in the proposed Spike-driven Transformer. To this end, we design a novel Spike-Driven Self-Attention (SDSA) module and rearrange the location of the residual connections throughout the network. The complex and energy-intensive matrix multiplication, softmax, and scale in vanilla self-attention are dropped. Instead, we employ mask, addition, and spike neuron layers to realize the function of the self-attention mechanism. Moreover, SDSA has linear complexity with both the token and channel dimensions. Extensive experiments are conducted on static image and neuromorphic datasets, verifying the effectiveness and efficiency of the proposed method. We hope our investigations pave the way for further research on Transformer-based SNNs and inspire the design of next-generation neuromorphic chips.

Acknowledgement

This work was supported by the Beijing Natural Science Foundation for Distinguished Young Scholars (JQ21015), the National Science Foundation for Distinguished Young Scholars (62325603), and the National Natural Science Foundation of China (62236009, U22A20103).

References

[1] Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997.
[2] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
[3] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
[4] Jing Pei, Lei Deng, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019.
[5] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784):607–617, 2019.
[6] Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
[7] Catherine D Schuman, Shruti R Kulkarni, Maryam Parsa, J Parker Mitchell, Bill Kay, et al. Opportunities for neuromorphic computing algorithms and applications. Nature Computational Science, 2(1):10–19, 2022.
[8] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[9] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):87–110, 2022.
[10] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey.
ACM computing surveys (CSUR), 54(10s):1 41, 2022. [11] Yudong Li, Yunlin Lei, and Xu Yang. Spikeformer: A novel architecture for training highperformance low-latency spiking neural network. ar Xiv preprint ar Xiv:2211.10686, 2022. [12] Nathan Leroux, Jan Finkbeiner, and Emre Neftci. Online transformers with spiking neurons for fast prosthetic hand control. ar Xiv preprint ar Xiv:2303.11860, 2023. [13] Jiqing Zhang, Bo Dong, Haiwei Zhang, Jianchuan Ding, Felix Heide, Baocai Yin, and Xin Yang. Spiking transformers for event-based single object tracking. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 8801 8810, 2022. [14] Jiyuan Zhang, Lulu Tang, Zhaofei Yu, Jiwen Lu, and Tiejun Huang. Spike transformer: Monocular depth estimation for spiking camera. In Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23 27, 2022, Proceedings, Part VII, pages 34 52. Springer, 2022. [15] Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, and Zhengqi Wen. Spike Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition. In Proc. Interspeech 2020, pages 5026 5030, 2020. [16] Rui-Jie Zhu, Qihang Zhao, and Jason K Eshraghian. Spikegpt: Generative pre-trained language model with spiking neural networks. ar Xiv preprint ar Xiv:2302.13939, 2023. [17] Etienne Mueller, Viktor Studenyak, Daniel Auge, and Alois Knoll. Spiking transformer networks: A rate coded approach for processing sequential data. In 2021 7th International Conference on Systems and Informatics (ICSAI), pages 1 5. IEEE, 2021. [18] Minglun Han, Qingyu Wang, Tielin Zhang, Yi Wang, Duzhen Zhang, and Bo Xu. Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), 2023. [19] Shihao Zou, Yuxuan Mu, Xinxin Zuo, Sen Wang, and Li Cheng. Event-based human pose tracking by spiking spatiotemporal transformer. ar Xiv preprint ar Xiv:2303.09681, 2023. [20] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023. [21] Guangyao Chen, Peixi Peng, Guoqi Li, and Yonghong Tian. Training full spike neural networks via auxiliary accumulation pathway. ar Xiv preprint ar Xiv:2301.11929, 2023. [22] Chenlin Zhou, Liutao Yu, Zhaokun Zhou, Han Zhang, Zhengyu Ma, Huihui Zhou, and Yonghong Tian. Spikingformer: Spike-driven residual learning for transformer-based spiking neural network. ar Xiv preprint ar Xiv:2304.11954, 2023. [23] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), 2020. [24] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1 28, 2022. [25] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056 21069, 2021. [26] Yifan Hu, Yujie Wu, Lei Deng, Man Yao, and Guoqi Li. Advancing residual learning towards powerful deep spiking neural networks. ar Xiv preprint ar Xiv:2112.08954, 2021. [27] Man Yao, Guangshe Zhao, Hengyu Zhang, Hu Yifan, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li. Attention spiking neural networks. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, PP:1 18, 01 2023. [28] Guoqi Li, Lei Deng, Huajin Tang, Gang Pan, Yonghong Tian, Kaushik Roy, and Wolfgang Maass. Brain inspired computing: A systematic survey and future trends. 2023. [29] Eugene M Izhikevich. Simple model of spiking neurons. IEEE Transactions on Neural Networks, 14(6):1569 1572, 2003. [30] Tielin Zhang, Xiang Cheng, Shuncheng Jia, Mu-ming Poo, Yi Zeng, and Bo Xu. Selfbackpropagation of synaptic modifications elevates the efficiency of spiking and artificial neural networks. Science Advances, 7(43):eabh0146, 2021. [31] Yujie Wu, Rong Zhao, Jun Zhu, Feng Chen, Mingkun Xu, Guoqi Li, Sen Song, Lei Deng, Guanrui Wang, Hao Zheng, et al. Brain-inspired global-local learning incorporated with neuromorphic computing. Nature Communications, 13(1):65, 2022. [32] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. Advances in Neural Information Processing Systems, 31, 2018. [33] Arjun Rao, Philipp Plank, Andreas Wild, and Wolfgang Maass. A long short-term memory for ai applications in spike-based neuromorphic hardware. Nature Machine Intelligence, 4(5):467 479, 2022. [34] Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10221 10230, 2021. [35] Man Yao, Jiakui Hu, Guangshe Zhao, Yaoyuan Wang, Ziyang Zhang, Bo Xu, and Guoqi Li. Inherent redundancy in spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16924 16934, 2023. [36] Man Yao, Hengyu Zhang, Guangshe Zhao, Xiyu Zhang, Dingheng Wang, Gang Cao, and Guoqi Li. Sparser spiking activity can be better: Feature refine-and-mask spiking neural network for event-based visual recognition. Neural Networks, 166:410 423, 2023. [37] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12:331, 2018. [38] Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51 63, 2019. [39] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11062 11070, 2021. [40] Lei Deng, Yujie Wu, Yifan Hu, Ling Liang, Guoqi Li, Xing Hu, Yufei Ding, Peng Li, and Yuan Xie. Comprehensive snn compression using admm optimization and activity regularization. IEEE Transactions on Neural Networks and Learning Systems, 2021. [41] Shikuang Deng and Shi Gu. Optimal conversion of conventional artificial neural networks to spiking neural networks. In International Conference on Learning Representations, 2021. [42] Jibin Wu, Chenglin Xu, Xiao Han, Daquan Zhou, Malu Zhang, Haizhou Li, and Kay Chen Tan. Progressive tandem learning for pattern recognition with deep spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7824 7840, 2021. [43] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan Xie, and Luping Shi. Direct training for spiking neural networks: Faster, larger, better. 
In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 1311 1318, 2019. [44] Johannes Schemmel, Daniel Brüderle, Andreas Grübl, Matthias Hock, Karlheinz Meier, and Sebastian Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1947 1950. IEEE, 2010. [45] Eustace Painkras, Luis A Plana, Jim Garside, Steve Temple, Francesco Galluppi, Cameron Patterson, David R Lester, Andrew D Brown, and Steve B Furber. Spinnaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation. IEEE Journal of Solid-state Circuits, 48(8):1943 1953, 2013. [46] Ben Varkey Benjamin, Peiran Gao, Emmett Mc Quinn, Swadesh Choudhary, Anand R Chandrasekaran, Jean-Marie Bussat, Rodrigo Alvarez-Icaza, John V Arthur, Paul A Merolla, and Kwabena Boahen. Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations. Proceedings of the IEEE, 102(5):699 716, 2014. [47] Juncheng Shen, De Ma, Zonghua Gu, Ming Zhang, Xiaolei Zhu, Xiaoqiang Xu, Qi Xu, Yangjing Shen, and Gang Pan. Darwin: A neuromorphic hardware co-processor based on spiking neural networks. Science China Information Sciences, 59(2):1 5, 2016. [48] Arindam Basu, Lei Deng, Charlotte Frenkel, and Xueyong Zhang. Spiking neural network integrated circuits: A review of trends and future directions. In 2022 IEEE Custom Integrated Circuits Conference (CICC), pages 1 8, 2022. [49] Ole Richer, Ning Qiao, Qian Liu, and Sadique Sheik. Event-driven spiking convolutional neural network. WIPO Patent, page WO2020207982A1, 2020. [50] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. [51] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. [52] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2020. [53] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012 10022, 2021. [54] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347 10357. PMLR, 2021. [55] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems, 35:12934 12949, 2022. [56] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In Computer Vision ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23 27, 2022, Proceedings, Part XXVI, pages 613 630. Springer, 2022. [57] Yongming Rao, Zuyan Liu, Wenliang Zhao, Jie Zhou, and Jiwen Lu. Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. [58] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 558 567, 2021. [59] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12259 12269, 2021. [60] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021. [61] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances in Neural Information Processing Systems, 32, 2019. [62] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, and Judy Hoffman. Hydra attention: Efficient attention with many heads. In Computer Vision ECCV 2022 Workshops: Tel Aviv, Israel, October 23 27, 2022, Proceedings, Part VII, pages 35 49. Springer, 2023. [63] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosformer: Rethinking softmax in attention. In International Conference on Learning Representations, 2022. [64] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531 3539, 2021. [65] Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2021. [66] Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. ar Xiv preprint ar Xiv:2303.15446, 2023. [67] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500 544, 1952. [68] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. [69] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision ECCV 2016, pages 630 645, Cham, 2016. Springer International Publishing. [70] Xingting Yao, Fanrong Li, Zitao Mo, and Jian Cheng. GLIF: A unified gated leaky integrateand-fire neuron for spiking neural networks. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. [71] Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905 913, 2021. [72] Yufei Guo, Yuanpei Chen, Liwen Zhang, Xiaode Liu, Yinglei Wang, Xuhui Huang, and Zhe Ma. 
Im-loss: information maximization loss for spiking neural networks. Advances in Neural Information Processing Systems, 35:156 166, 2022. [73] Zhaodong Chen, Lei Deng, Bangyan Wang, Guoqi Li, and Yuan Xie. A comprehensive and modularized statistical framework for gradient norm equality in deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):13 31, 2022. [74] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393 5402. PMLR, 2018. [75] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. ar Xiv preprint ar Xiv:2203.00555, 2022. [76] Meng-Hao Guo, Tian-Xing Xu, Jiang-Jiang Liu, Zheng-Ning Liu, Peng-Tao Jiang, Tai-Jiang Mu, Song-Hai Zhang, Ralph R Martin, Ming-Ming Cheng, and Shi-Min Hu. Attention mechanisms in computer vision: A survey. Computational Visual Media, 8(3):331 368, 2022. [77] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248 255, 2009. [78] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [79] Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. Cifar10-dvs: an eventstream dataset for object classification. Frontiers in Neuroscience, 11:309, 2017. [80] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey Mc Kinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, Jeff Kusnitz, Michael Debole, Steve Esser, Tobi Delbruck, Myron Flickner, and Dharmendra Modha. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7243 7252, 2017. [81] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7132 7141, 2018. [82] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3 19, 2018. [83] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations, 2020. [84] Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal efficient training of spiking neural network via gradient re-weighting. In International Conference on Learning Representations, 2022. [85] Yangfan Hu, Huajin Tang, and Gang Pan. Spiking deep residual networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1 6, 2021. [86] Sayeed Shafayet Chowdhury, Nitin Rathi, and Kaushik Roy. One timestep is all you need: training spiking neural networks with ultra low latency. ar Xiv preprint ar Xiv:2110.05929, 2021. [87] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2661 2671, 2021. 
[88] Yuhang Li, Yufei Guo, Shanghang Zhang, Shikuang Deng, Yongqing Hai, and Shi Gu. Differentiable Spike: Rethinking Gradient-Descent for Training Spiking Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems (Neur IPS), volume 34, pages 23426 23439, 2021. [89] Qingyan Meng, Mingqing Xiao, Shen Yan, Yisen Wang, Zhouchen Lin, and Zhi-Quan Luo. Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12444 12453, 2022. [90] Nitin Rathi and Kaushik Roy. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems, pages 1 9, 2021. [91] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017. [92] Souvik Kundu, Massoud Pedram, and Peter A Beerel. Hire-snn: Harnessing the inherent robustness of energy-efficient deep spiking neural networks by training with crafted input noise. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5209 5218, 2021. [93] Priyadarshini Panda, Sai Aparna Aketi, and Kaushik Roy. Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization. Frontiers in Neuroscience, 14:653, 2020. [94] Mark Horowitz. 1.1 computing s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10 14. IEEE, 2014. [95] Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J. Davison, Jörg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154 180, 2022. [96] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618 626, 2017.