# online_stabilization_of_spiking_neural_networks__94c5b414.pdf

Published as a conference paper at ICLR 2024

ONLINE STABILIZATION OF SPIKING NEURAL NETWORKS

Yaoyu Zhu1, Jianhao Ding1, Tiejun Huang1,2, Xiaodong Xie1 & Zhaofei Yu1,2

1 School of Computer Science, Peking University 2 Institute for Artificial Intelligence, Peking University

Spiking neural networks (SNNs), attributed to the binary, event-driven nature of spikes, possess heightened biological plausibility and enhanced energy efficiency on neuromorphic hardware compared to analog neural networks (ANNs). Mainstream SNN training schemes apply backpropagation-through-time (BPTT) with surrogate gradients to replace the nondifferentiable spike emitting process during backpropagation. While achieving competitive performance, the requirement for storing intermediate information at all time-steps incurs higher memory consumption and fails to fulfill the online property crucial to biological brains. Our work focuses on online training techniques, aiming for memory efficiency while preserving biological plausibility. The limitation of not having access to future information in early time-steps in online training has constrained previous efforts to incorporate advantageous modules such as batch normalization. To address this problem, we propose Online Spiking Renormalization (OSR) to ensure consistent parameters between testing and training, and Online Threshold Stabilizer (OTS) to stabilize neuron firing rates across time-steps. Furthermore, we design a novel online approach to compute the sample mean and variance over time for OSR. Experiments conducted on various datasets demonstrate the proposed method's superior performance among SNN online training algorithms. Our code is available at https://github.com/zhuyaoyu/SNN-onlinenormalization.

1 INTRODUCTION

Regarded as the third generation of neural networks, spiking neural networks (SNNs) possess a greater level of biological plausibility (Zenke et al., 2021) than their second generation counterparts analog neural networks (ANNs) due to the binary and event-driven nature of spikes. The binary nature of spikes in SNNs eliminates the need for multiplication during inference, leading to improved energy efficiency when deployed on neuromorphic hardware (Furber et al., 2014; Merolla et al., 2014; Shen et al., 2016; Davies et al., 2018; Pei et al., 2019). However, the discontinuity of binary spikes also poses challenges in the training of SNNs.

To address the non-differentiable issue associated with the spike emitting process in SNN training, various approaches have been proposed. The mainstream direct training techniques use surrogate gradients, which replace the non-differentiable Heaviside function in the spike firing process with a differentiable surrogate function (Neftci et al., 2019). In addition, they regard SNNs as binary recurrent neural networks (RNNs) and use backpropagation-through-time (BPTT) for SNN training (Bellec et al., 2018; Zenke & Ganguli, 2018; Wu et al., 2018). Although competitive performances are achieved on the CIFAR-10/100 and ImageNet datasets (Deng et al., 2021; Fang et al., 2021) with a relatively short simulation time, these methods require storing intermediate information of all time-steps for gradient backpropagation. An alternative approach to train SNNs is to use the assistance of ANNs. Several works first train ANNs and then convert them to SNNs (Cao et al., 2015; Rueckauer et al., 2017; Han et al., 2020; Bu et al., 2022a; Deng & Gu, 2021; Bu et al., 2022b). However, these methods often require a longer simulation time and result in more fired spikes. The long simulation time leads to high latency, while more fired spikes consume more energy. Overall, these approaches bring about extra expenses either in the training phase or in the testing phase while not satisfying the online property of the learning process in biological brains.

Corresponding author: yuzf12@pku.edu.cn

Recently, online training techniques have been developed to save memory costs while maintaining the biologically plausible online property during the training process. However, the limitation of not having access to future information in the early time steps has constrained previous efforts to incorporate advantageous modules such as batch normalization (BN). In this work, we design a mechanism that bypasses the need for future information while maintaining consistency across time-steps, thereby reducing the overfitting problem associated with treating different time-steps with different BN. Our main contributions can be summarized as follows:

1. We propose Online Spiking Renormalization (OSR), ensuring consistent scale and shift parameters between testing and training. This helps eliminate the normalization parameter difference when applying BN separately for each time-step. In addition, we introduce an online approach for computing a variable's all-time mean and variance that dynamically changes over time for OSR.

2. We devise Online Threshold Stabilizer (OTS), aiming at stabilizing neuron firing rates across varying time-steps, which also effectively regulates the overall firing rate.

3. We conduct experiments on CIFAR10, CIFAR100, CIFAR10-DVS, DVS-Gesture, and Imagenet datasets and demonstrate that our proposed method achieves state-of-the-art performance among SNN online training algorithms.

2 RELATED WORK

2.1 ONLINE TRAINING APPROACHES

Online training allows real-time parameter updates as new data arrives, especially useful for RNNs and SNNs spanning multiple time-steps. This mechanism serves to curtail memory usage, a particularly advantageous feature when dealing with many time-steps.

Existing literature on RNNs has explored various approaches to online learning. Real-time recurrent learning (RTRL), introduced by Williams & Zipser (1989), propagates partial derivatives of hidden states with respect to parameters forward through time, enabling the computation of gradients in a forward-in-time manner. Many recent works, exemplified by UORO (Tallec & Ollivier, 2017), KF-RTRL (Mujika et al., 2018), and SnAp (Menick et al., 2020), have explored enhancing the memory and time efficiency of RTRL through tailored approximations for more practical use. Another work proposed updating parameters in an online fashion, using decoupled gradients together with regularization at each time-step (Kag & Saligrama, 2021).

In the domain of SNNs, numerous studies have drawn inspiration from online training techniques developed for RNNs. Some of these works adopt the fundamental principles of RTRL and tailor them to streamline the training process for SNNs (Zenke & Ganguli, 2018; Bellec et al., 2020; Bohnstingl et al., 2022). Yin et al. (2022) directly applied the approach proposed by Kag & Saligrama (2021) to train SNNs. Zenke & Ganguli (2018) connected the online learning rule for leaky integrate-and-fire (LIF) neurons with the nonlinear Hebbian three-factor rule, and Kaiser et al. (2020) extended the neuron model to a double-exponential spike response model. Xiao et al. (2022) successfully extended online training methodologies to accommodate large-scale tasks such as ImageNet classification. However, none of these works considered incorporating network modules like batch normalization to enhance the network performance. As a result, they suffer from a performance disadvantage compared to their BPTT counterparts.


2.2 NORMALIZATION MECHANISMS

Normalization mechanisms are commonly used in neural networks to stabilize network training, which speeds up convergence and enhances network performance. Typical normalization techniques include batch normalization (BN) (Ioffe & Szegedy, 2015), instance normalization (IN) (Ulyanov et al., 2016), group normalization (GN) (Wu & He, 2018), and layer normalization (LN) (Ba et al., 2016). A subsequent work, batch renormalization (Ioffe, 2017), improves BN by eliminating the difference between the batch mean and variance between the training and testing phases.

In spiking neural networks, researchers have also tried to incorporate normalization techniques to enhance SNN performance. For instance, Kim & Panda (2021) proposed BNTT to regulate firing rates by utilizing separate BN parameters at different time-steps. Zheng et al. (2021) proposed threshold-dependent batch normalization (tdBN), which extends the scope of BN to the additional temporal dimension and takes into account the impact of the threshold on firing rates. TEBN (Duan et al., 2022) combined elements from both of these approaches by applying BN across the spatial-temporal dimension, while utilizing separate scale and shift parameters at different time-steps. Although most works apply normalization to the input current, some studies explore normalization for other variables. For example, PSP-BN (Ikegawa et al., 2022) used a different statistic, the second raw moment of the post-synaptic potential, as the denominator for normalization, which can be inserted right after the spiking functions. This approach leads to a higher complexity of BN parameters and the potential risk of breaking the temporal coherence of information. Among the aforementioned works, the most successful ones (Duan et al., 2022; Zheng et al., 2021) used information from all time-steps for BN. However, these methods cannot be directly applied to online learning.

3 PRELIMINARIES

3.1 LEAKY INTEGRATE AND FIRE NEURON

Spiking neurons are the basic building blocks of SNNs, with the LIF neuron model being the most commonly used (Gerstner et al., 2014). The dynamic of the LIF neuron before firing can be described by:

$$\tau \frac{du(t)}{dt} = -(u(t) - u_{rest}) + I(t), \tag{1}$$

where $u(t)$ is the membrane potential of the neuron at time $t$, $I(t)$ is the input current received by the neuron, $\tau$ is the membrane time constant, and $u_{rest}$ is the resting potential. When the membrane potential $u(t)$ reaches a certain threshold $\theta$, the neuron emits a spike, and $u(t)$ is immediately reset to a value $u_{reset}$. In practice, we often use a discrete form of Eq. 1, which can be represented as:

$$u^l[t] = \left(1 - \frac{1}{\tau^l}\right) u^l[t - 0.5] + W^l s^{l-1}[t], \tag{2}$$

$$s^l[t] = \Theta(u^l[t] - \theta), \tag{3}$$

$$u^l[t + 0.5] = u^l[t] \odot (1 - s^l[t]). \tag{4}$$

Here, we use a tensor form in which $u^l$, $W^l$, and $s^l$ denote the membrane potential tensor, the weight matrix between layers $l-1$ and $l$, and the output spike tensor of layer $l$, respectively. Among them, $u^l[t]$ is the membrane potential after decay and adding input but before the reset, and $u^l[t + 0.5]$ is the membrane potential after reset. $\Theta$ is the Heaviside step function. An element of $s^l[t]$ equals 1 if the neuron fires and 0 otherwise.
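To make the discrete update concrete, the following is a minimal sketch of Eqs. 2-4 for a single neuron in plain Python. The helper names `lif_step` and `run_lif`, the default values of `tau` and `theta`, and the choice to fire when the potential equals the threshold are illustrative assumptions, not part of the original method description.

```python
def lif_step(u_prev, current, tau=2.0, theta=1.0):
    """One discrete LIF update for a single neuron (illustrative sketch).

    u_prev  : membrane potential after the previous reset, u[t - 0.5]
    current : weighted input W s[t] arriving at this time-step
    Returns (spike, u_after_reset).
    """
    u = (1.0 - 1.0 / tau) * u_prev + current  # Eq. 2: leak, then integrate
    spike = 1 if u >= theta else 0            # Eq. 3: Heaviside threshold
    u_after = u * (1 - spike)                 # Eq. 4: reset to 0 after a spike
    return spike, u_after


def run_lif(currents, tau=2.0, theta=1.0):
    """Feed a sequence of input currents and collect the spike train."""
    u, spikes = 0.0, []
    for c in currents:
        s, u = lif_step(u, c, tau, theta)
        spikes.append(s)
    return spikes
```

With a constant input of 0.6 and the assumed defaults, the potential accumulates over the first two steps, crosses the threshold at the third step, and starts again from zero afterwards.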

Our holistic method is illustrated in Figure 1. In the following parts, we first briefly introduce the forward and backward propagation processes of our algorithm and then elaborate on the modules we add to the network.


[Figure 1 graphic. Panel labels: Online Spiking Renormalization (OSR), for training only; Online Threshold Stabilizer (OTS) applied to LIF layers. Compared schemes: online learning (this work), online learning (OTTT), and surrogate learning (BPTT), each showing forward and backward paths along the layer index.]

Figure 1: Illustration of online stabilization techniques for SNN. Our method uses online spiking renormalization to improve the generalization and an online threshold stabilizer to regulate the firing rate within the framework of online SNN training, which requires less memory usage than BPTT training. OTTT (Xiao et al., 2022) adopts normalization-free networks and thus has no BN modules.

The major modules are online spiking renormalization (OSR) and online threshold stabilizer (OTS). Besides, an online calculation method of all-time mean and variance is introduced in OSR.

4.1 FORWARD AND BACKWARD PROPAGATION

In the forward stage, our method uses the LIF formulas (Eqs. 2-4). The OSR replaces $s^{l-1}[t]W^l$ with $\mathrm{renorm}(s^{l-1}[t]W^l)$ in Eq. 2, and the OTS changes the threshold $\theta$ in Eq. 3 over time-steps.

In the backward stage, we select the TET loss (Deng et al., 2021) as our loss function since the loss function needs to provide feedback at each time-step:

$$L = \sum_{t=1}^{T} L_t, \qquad L_t = \frac{1}{T}\left[-\sum_{i} y_i \log o_i[t] + \epsilon \sum_{i} \left(o_i[t] - \phi(y_i)\right)^2\right], \tag{5}$$

where $T$ is the total simulation time, $y_i$ denotes whether the label is equal to $i$, and $o_i[t]$ is the output of neuron $i$ in the output layer at time $t$. An additional MSE loss is introduced as a regularization term (as proposed by Deng et al. (2021)) with weight $\epsilon$, and $\phi(y_i)$ is the target value of $y_i$ set in the MSE loss.

For the online gradient propagation, we remove the propagation path of neuron membrane potential decay and reset (from the next time step to the last time step) for membrane potential (as shown in Figure 1). Then the gradients received by membrane potential and weights become:

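$$\frac{\partial L}{\partial u^l_y[t]} = \frac{\partial L_t}{\partial s^l_y[t]} \frac{\partial s^l_y[t]}{\partial u^l_y[t]}, \tag{6}$$

$$\frac{\partial L}{\partial w^l_{xy}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial s^l_y[t]} \frac{\partial s^l_y[t]}{\partial u^l_y[t]} \frac{\partial u^l_y[t]}{\partial w^l_{xy}}. \tag{7}$$

Note that Eq. 7 simply sums the derivative of $L_t$ with respect to $w^l_{xy}$ over all time-steps, so there is no backward or forward temporal dependency in either Eq. 6 or Eq. 7.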
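The absence of temporal dependency means the weight gradient can be accumulated one time-step at a time, keeping only the current membrane potential in memory. The sketch below illustrates this for a single postsynaptic neuron. The triangular surrogate derivative, the helper names, and the externally supplied values of the loss derivative with respect to the output spike are assumptions made for illustration; the surrogate function actually used is not specified in this section.

```python
def surrogate_grad(u, theta=1.0, width=1.0):
    """Assumed triangular surrogate for ds/du (an illustrative choice)."""
    return max(0.0, 1.0 - abs(u - theta) / width) / width


def online_weight_grad(pre_spikes, w, dL_ds, tau=2.0, theta=1.0):
    """Accumulate dL/dw for one postsynaptic neuron, step by step.

    pre_spikes[t][x] : presynaptic spike of input x at time-step t
    w[x]             : weight from input x to this neuron
    dL_ds[t]         : dL_t / ds_y[t], assumed to be given at step t
    """
    grad = [0.0] * len(w)
    u = 0.0  # only the current potential is kept between steps
    for s_in, g_out in zip(pre_spikes, dL_ds):
        # Eq. 2: leak plus weighted input
        u = (1.0 - 1.0 / tau) * u + sum(wi * si for wi, si in zip(w, s_in))
        # Eq. 6: gradient reaching the membrane potential at this step
        delta = g_out * surrogate_grad(u, theta)
        # Eq. 7: add this step's term, using du_y[t]/dw_xy = s_x[t]
        for x, si in enumerate(s_in):
            grad[x] += delta * si
        # Eq. 4: reset; no gradient is propagated through decay or reset
        u = u * (0 if u >= theta else 1)
    return grad
```

Nothing from earlier time-steps is stored apart from the running potential and the gradient accumulator, which is why memory does not grow with the number of time-steps in this sketch.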

4.2 INCORPORATING BATCH NORMALIZATION INTO ONLINE ALGORITHMS

Zheng et al. (2021), Kim & Panda (2021), and Duan et al. (2022) show that applying BN once across all time-steps, rather than separately at each time-step, yields superior performance. However, this is impractical in online training, since normalization is needed before information from all time-steps is available. As per Duan et al. (2022), using the mean and variance across all time-steps is crucial for reducing temporal covariate shift and enhancing performance. A key feature of normalizing by the global mean and variance is that the transformation is the same at all time-steps. Therefore, a question naturally arises: can we normalize inputs at all time-steps with the same mean and variance when we do not have the all-time data?

Online Spiking Renormalization (OSR). Although we do not have the whole data of the current batch, we have data from previous batches and can apply the BN transformation at all time-steps based on these data. The running mean $\hat{\mu}$ and running variance $\hat{\sigma}^2$ are a good choice. Using them as the normalization parameters brings an additional benefit: the BN transformation is the same in the training stage and the inference stage. Specifically, we apply the transformation

$$\tilde{I}[t] = \gamma \frac{I[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta \tag{8}$$

during the forward stage in training, where $I[t]$ is the neurons' input currents to be normalized. The next question is: how do we compute gradients in the backward stage if we use this forward transformation? The $\hat{\mu}$ and $\hat{\sigma}^2$ come from previous data instead of the current batch data. Therefore, if no additional mechanisms are involved, this standardization just plays the role of a linear transformation instead of a real normalization. Our solution is online spiking renormalization (OSR), which first applies a real normalization and then unifies the transformation among time-steps by another linear transform. To be specific, we first normalize $I[t]$ to $\hat{I}[t] = \frac{I[t] - \mu[t]}{\sqrt{\sigma^2[t] + \epsilon}}$ and then linearly transform it twice to $\tilde{I}[t] = \gamma \frac{I[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta$:

$$\tilde{I}[t] = \gamma \frac{I[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta = \gamma \left( \hat{I}[t] \frac{\sqrt{\sigma^2[t] + \epsilon}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \frac{\mu[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} \right) + \beta. \tag{9}$$

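Eq. 9 denotes the normalization followed by a linear transformation. The gradients for $\hat{I}[t]$, $\gamma$, and $\beta$ are:

$$\frac{\partial L}{\partial \hat{I}_x[t]} = \gamma \frac{\partial L}{\partial \tilde{I}_x[t]} \frac{\sqrt{\sigma^2[t] + \epsilon}}{\sqrt{\hat{\sigma}^2 + \epsilon}}, \tag{10}$$

$$\frac{\partial L}{\partial \gamma} = \sum_{t} \sum_{x} \frac{\partial L}{\partial \tilde{I}_x[t]} \left( \hat{I}_x[t] \frac{\sqrt{\sigma^2[t] + \epsilon}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \frac{\mu[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} \right), \tag{11}$$

$$\frac{\partial L}{\partial \beta} = \sum_{t} \sum_{x} \frac{\partial L}{\partial \tilde{I}_x[t]}, \tag{12}$$

where $x$ indexes the samples in the batch.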
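The two-step form of the OSR forward pass can be checked numerically against the direct running-statistics transform. The sketch below does this for one channel at one time-step in plain Python. The function names, the scalar `gamma` and `beta`, and the variable names `r` and `d` (borrowed from batch renormalization) are illustrative assumptions; treating `r` and `d` as constants during backpropagation is also an assumption of this sketch.

```python
import math


def osr_forward(I_t, run_mean, run_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Two-step OSR-style forward pass for one channel (illustrative sketch).

    I_t               : input currents of the current batch at time-step t
    run_mean, run_var : running statistics gathered from previous batches
    """
    m = len(I_t)
    mu_t = sum(I_t) / m
    var_t = sum((v - mu_t) ** 2 for v in I_t) / m
    # Step 1: real normalization with the statistics of this time-step.
    I_hat = [(v - mu_t) / math.sqrt(var_t + eps) for v in I_t]
    # Step 2: linear correction so that the overall map matches Eq. 8.
    r = math.sqrt(var_t + eps) / math.sqrt(run_var + eps)
    d = (mu_t - run_mean) / math.sqrt(run_var + eps)
    return [gamma * (h * r + d) + beta for h in I_hat]


def running_stat_transform(I_t, run_mean, run_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Direct form of Eq. 8, the same transform that is used at inference."""
    return [gamma * (v - run_mean) / math.sqrt(run_var + eps) + beta for v in I_t]
```

Both functions produce the same values up to floating-point rounding, which reflects the identity in Eq. 9: the forward output depends only on the running statistics, while the intermediate normalized quantity depends on the current time-step's statistics.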

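Online Calculation of All-time Mean and Variance. In OSR, the running mean $\hat{\mu}$ and running variance $\hat{\sigma}^2$ are the running averages of the all-time mean $\mu$ and variance $\sigma^2$ of a batch. To keep the memory cost low, we need to calculate these all-time statistics in an online fashion, utilizing the mean and variance of each time-step: $\mu[1], \ldots, \mu[T]$ and $\sigma^2[1], \ldots, \sigma^2[T]$. Writing $m$ for the number of samples $I_x[t]$ at each time-step, their relationship can be described by the following equations:

$$\mu[t] = \frac{1}{m} \sum_{x=1}^{m} I_x[t], \qquad \sigma^2[t] = \frac{1}{m} \sum_{x=1}^{m} (I_x[t] - \mu[t])^2, \tag{13}$$

$$\mu = \frac{1}{mT} \sum_{t=1}^{T} \sum_{x=1}^{m} I_x[t] = \frac{1}{T} \sum_{t=1}^{T} \mu[t], \tag{14}$$

$$\sigma^2 = \frac{1}{mT} \sum_{t=1}^{T} \sum_{x=1}^{m} (I_x[t] - \mu)^2 = \frac{1}{T} \sum_{t=1}^{T} \sigma^2[t] + \frac{1}{T} \sum_{t=1}^{T} \mu[t]^2 - \mu^2. \tag{15}$$

Hence, we can initialize $\mu$ and $\sigma^2$ as 0 for each batch, add $\frac{1}{T}\mu[t]$ to $\mu$ and add $\frac{1}{T}(\sigma^2[t] + \mu[t]^2)$ to $\sigma^2$ at each time-step, and subtract $\mu^2$ from $\sigma^2$ at the last time-step.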
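The accumulation rule described here can be sketched as a small accumulator that sees one time-step at a time and never stores past inputs. The class name `AllTimeStats` and the helper `mean_var` are hypothetical names chosen for this illustration, and the example assumes the same number of samples at every time-step.

```python
def mean_var(xs):
    """Per-time-step mean and (biased) variance, as in Eq. 13."""
    m = len(xs)
    mu = sum(xs) / m
    return mu, sum((x - mu) ** 2 for x in xs) / m


class AllTimeStats:
    """Online accumulator for the all-time mean and variance (Eqs. 14-15)."""

    def __init__(self, T):
        self.T = T          # total number of time-steps, known in advance
        self.mu = 0.0       # accumulates (1/T) * mu[t]
        self.second = 0.0   # accumulates (1/T) * (sigma^2[t] + mu[t]^2)

    def update(self, mu_t, var_t):
        self.mu += mu_t / self.T
        self.second += (var_t + mu_t ** 2) / self.T

    def finalize(self):
        # Subtract mu^2 once, after the last time-step.
        return self.mu, self.second - self.mu ** 2


def all_time_stats(steps):
    """Run the accumulator over a list of per-time-step sample lists."""
    acc = AllTimeStats(len(steps))
    for xs in steps:
        acc.update(*mean_var(xs))
    return acc.finalize()
```

For the two time-steps `[1, 2, 3]` and `[2, 4, 6]`, the accumulator returns a mean of 3 and a variance of 8/3, which matches the mean and variance computed directly over the six pooled values.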

Online Threshold Stabilizer (OTS). To enhance the stability of mean and variance in the OSR process during training, we introduce the OTS mechanism. The variable subject to normalization is the input current of neurons, and our objective is to ensure the mean and variance of it remain stable across all time-steps. This raises a question: When should we intervene to stabilize the mean and variance of input currents?

The mean and variance of the input current in a layer are significantly influenced by the output spikes from the preceding layer, making it essential to stabilize the firing rate of each layer. The firing rate is determined by the proportion of membrane potential surpassing the firing threshold within discrete time-steps. Consequently, we can adjust either the membrane potential or the firing threshold to regulate the firing rate. Between these options, regulating the firing threshold stands out as a judicious choice: it leaves the neuronal dynamics unchanged and only impacts backward propagation by altering the values of the surrogate function.

Specifically, we assume the membrane potential of neurons in one layer at time $t$ follows a normal distribution $N(\mu_{mem}[t], \sigma^2_{mem}[t])$ (where $\theta[t]$, $\mu_{mem}[t]$, and $\sigma_{mem}[t]$ denote the threshold, the mean of the membrane potential, and the standard deviation of the membrane potential at time $t$). Then the firing rate of this layer at time $t$ is

$$1 - \Phi\left(\frac{\theta[t] - \mu_{mem}[t]}{\sigma_{mem}[t]}\right),$$

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{y^2}{2}} dy$ is the cumulative distribution function of the standard normal distribution. To keep this ratio constant among time-steps, we need to keep the quantile $\frac{\theta[t] - \mu_{mem}[t]}{\sigma_{mem}[t]}$ constant. Under this control, the adjusted threshold at time $t$ is

$$\theta[t] = \mu_{mem}[t] + \sigma_{mem}[t] \, \frac{\theta[1] - \mu_{mem}[1]}{\sigma_{mem}[1]}.$$

The overall algorithm description is provided in Appendix B.
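As a rough illustration of the threshold rule in the OTS paragraph, the sketch below computes the adjusted threshold from the membrane-potential statistics and checks that the Gaussian firing rate stays equal to that of the first time-step. The function names are hypothetical, and the membrane-potential mean and standard deviation are assumed to be supplied by the caller; how they are estimated in practice is left to the algorithm description in Appendix B.

```python
from statistics import NormalDist


def ots_threshold(theta_1, mu_1, sigma_1, mu_t, sigma_t):
    """Threshold at time t that keeps the Gaussian quantile of step 1.

    theta_1, mu_1, sigma_1 : threshold and membrane statistics at t = 1
    mu_t, sigma_t          : membrane statistics at the current time-step
    """
    quantile = (theta_1 - mu_1) / sigma_1
    return mu_t + sigma_t * quantile


def gaussian_firing_rate(theta, mu, sigma):
    """Fraction of a N(mu, sigma^2) membrane potential that exceeds theta."""
    return 1.0 - NormalDist().cdf((theta - mu) / sigma)
```

For example, with a threshold of 1.0, mean 0.2 and standard deviation 0.5 at the first time-step, a later time-step with mean 0.5 and standard deviation 0.8 gets a threshold of about 1.78, and both time-steps have the same predicted firing rate under the Gaussian assumption.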

4.3 THEORETICAL ANALYSIS

In this section, we discuss how our online threshold stabilizer (OTS) helps stabilize online spiking renormalization (OSR). The process involves three stages: adjusting the firing threshold of layer $l-1$, the corresponding adjustment of the firing rate of layer $l-1$, and regulating the mean and variance before normalization in layer $l$ to ensure stability. This involves two crucial steps: adjusting the threshold for a stable firing rate, which subsequently stabilizes the mean and variance. For the step from threshold to firing rate, existing research has shown that when a LIF neuron receives constant input with Gaussian noise, the membrane potential will have a Gaussian distribution (Hohn & Burkitt, 2001). This supports the reasonableness of our Gaussian distribution assumption for the membrane potential in OTS. For the step from firing rate to mean and variance, studying the property of the all-time sample variance $\sigma^2$ is a good choice. In Eq. 15, $\sigma^2$ can be split into two parts. The first part is $\frac{1}{T}\sum_{t=1}^{T} \sigma^2[t]$, which stands for the average variance inside each time-step. The second part is $\frac{1}{T}\sum_{t=1}^{T} \mu[t]^2 - \mu^2$, which is the variance of the mean at each time-step (the variance of $\mu[1], \ldots, \mu[T]$). To stabilize the whole training process, we want the mean among different time-steps to vary as little as possible. In other words, we want the variance of the mean among time-steps to be low. On the other hand, for the variance inside each time-step, we do not need it to be low.


Denote $p[t]$ to be the firing probability of each neuron at time-step $t$ and the gross firing probability $p = \frac{1}{T}\sum_{t=1}^{T} p[t]$. To proceed with the theoretical derivation, we must establish the following assumptions:

Assumption 4.1. Assume all entries of $s^{l-1}[t]$ (of size $B \times C_{in}$) and $W^l$ (of size $C_{in} \times C_{out}$) are independent for $1 \le t \le T$, all $s^{l-1}_i[t]$ obey an i.i.d. Bernoulli($p[t]$) distribution, and all $w^l_{ji}$ obey any i.i.d. distribution.

Under the above assumptions, we have the following conclusions (note we only discuss the expectation of the target variables since both the sample mean $\mu$ and the sample variance $\sigma^2$ are estimated statistics):

Theorem 4.2. When Assumption 4.1 holds and the gross firing rate $p$ is held constant, the expectation of the sample variance of $\mu[t]$ among time-steps, $E\left[\frac{1}{T}\sum_{t=1}^{T} \mu[t]^2 - \mu^2\right]$, increases when the variance of the firing rate among time-steps, $\frac{1}{T}\sum_{t=1}^{T} p[t]^2 - p^2$, increases.

Theorem 4.3. When Assumption 4.1 holds and the gross firing rate $p$ is held constant, the expectation of the variance within time-steps, $E\left[\frac{1}{T}\sum_{t=1}^{T} \sigma^2[t]\right]$, stays constant.

The detailed proof is provided in Appendix A. These results indicate that, given a constant gross firing rate $p$, reducing the variance of the firing probability $p[t]$ among time-steps will reduce the variance of the mean $\mu[t]$ among time-steps (Theorem 4.2) but will not affect the variance inside time-steps, $\sum_t \sigma^2[t]$ (Theorem 4.3). Thus, a steady firing rate helps stabilize the sample mean, which further indicates that our OTS mechanism helps our OSR mechanism. Related experimental results are shown in the ablation study.
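The direction of Theorem 4.2 can be illustrated with a small Monte Carlo sketch for a single output channel. This is one random draw under assumed settings (20 inputs, weights drawn uniformly from [0, 1], a batch of 2000, fixed seeds), not a substitute for the proof in Appendix A; all function names and parameter values are hypothetical.

```python
import random


def make_weights(n_in=20, seed=1):
    """Assumed i.i.d. weights for one output channel (uniform on [0, 1])."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n_in)]


def between_step_variance(p_schedule, weights, batch=2000, seed=0):
    """Sample variance of the per-time-step mean input current mu[t].

    p_schedule[t] is the Bernoulli firing probability of every
    presynaptic neuron at time-step t.
    """
    rng = random.Random(seed)
    step_means = []
    for p in p_schedule:
        total = 0.0
        for _ in range(batch):
            # Input current of one sample: sum of weights of inputs that spiked.
            total += sum(w for w in weights if rng.random() < p)
        step_means.append(total / batch)
    T = len(step_means)
    mu = sum(step_means) / T
    # Second part of Eq. 15: (1/T) * sum(mu[t]^2) - mu^2
    return sum(v * v for v in step_means) / T - mu * mu
```

Comparing a steady schedule `[0.2, 0.2, 0.2, 0.2]` with a varying schedule `[0.05, 0.25, 0.25, 0.25]`, both with gross firing probability 0.2, the varying schedule yields a clearly larger variance of the per-time-step means in this simulation.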

5 EXPERIMENTS

To show the effectiveness of our proposed method, we conduct experiments on the CIFAR10, CIFAR100 (Krizhevsky et al., 2009), DVS-Gesture (Amir et al., 2017), CIFAR10-DVS (Li et al., 2017), and Imagenet (Deng et al., 2009) datasets. The models we choose are consistent with OTTT (Xiao et al., 2022) to allow a fair comparison. All experiments are run on Nvidia RTX 4090 GPUs with PyTorch 2.0. The implementation details are provided in Appendix C.

5.1 COMPARISON WITH OTHER WORKS

Here we compare our approach with previous SNN training methods. We select the BPTT-based algorithms tdBN (Zheng et al., 2021), SEW (Fang et al., 2021), TET (Deng et al., 2021), and TEBN (Duan et al., 2022), and the online algorithm OTTT (Xiao et al., 2022). The results in Table 1 show that our algorithm performs well on all datasets. For the CIFAR10 dataset, we outperform tdBN, OTTT, and TET. For the CIFAR100 dataset, we outperform TET and OTTT. For the DVS-Gesture dataset, we outperform all listed methods, including tdBN and OTTT. For the CIFAR10-DVS dataset, we outperform tdBN and OTTT. For the Imagenet dataset, we outperform tdBN and OTTT. Note that the network that OTTT uses (NF-Resnet-34) adds membrane potential in the shortcut connection, which enhances its overall performance over SEW-Resnet-34 and Resnet-34. We test our method on the same architecture (the last line) and achieve better performance with fewer time-steps. Among the online algorithms, we outperform OTTT on all datasets, using fewer time-steps on CIFAR10, CIFAR100, and Imagenet (T = 4 vs T = 6). In addition, although state-of-the-art BPTT-based algorithms outperform the online ones in Table 1, they require more memory, especially when the number of total time-steps is large (the detailed information is provided in Section 5.2).

Comparison with Vanilla BN. To show the necessity of our approach, we compare it with a vanilla BN, which applies BN each time-step solely based on data from that time-step. The result is shown in the BN (vanilla) line for the Imagenet dataset, and we can see our approach outperforms this vanilla BN by around 4%. More detailed ablation studies for OSR and OTS on various datasets are provided in Appendix D.


Table 1: Performance comparison on CIFAR-10/100, DVS-Gesture, CIFAR10-DVS, and Imagenet

| Dataset | Model | Online or not | Architecture | Time steps | Accuracy |
|---|---|---|---|---|---|
| CIFAR10 | tdBN (Zheng et al., 2021) | No | Resnet-19 | 4 | 92.92% |
| | TET (Deng et al., 2021) | No | Resnet-19 | 4 | 94.44% |
| | TEBN (Duan et al., 2022) | No | Resnet-19 | 4 | 95.58% |
| | OTTT (Xiao et al., 2022) | Yes | VGGSNN | 6 | 93.58% |
| | Ours | Yes | VGGSNN | 4 | 94.35% |
| | Ours | Yes | Resnet-19 | 4 | 95.20% |
| CIFAR100 | TET (Deng et al., 2021) | No | Resnet-19 | 4 | 74.47% |
| | TEBN (Duan et al., 2022) | No | Resnet-19 | 4 | 78.71% |
| | OTTT (Xiao et al., 2022) | Yes | VGGSNN | 6 | 71.11% |
| | Ours | Yes | VGGSNN | 4 | 76.48% |
| | Ours | Yes | Resnet-19 | 4 | 77.86% |
| DVS-Gesture | tdBN (Zheng et al., 2021) | No | Resnet-17 | 40 | 96.88% |
| | OTTT (Xiao et al., 2022) | Yes | VGGSNN | 20 | 96.88% |
| | Ours | Yes | VGGSNN | 20 | 97.57% |
| CIFAR10-DVS | tdBN (Zheng et al., 2021) | No | Resnet-19 | 10 | 67.80% |
| | TET (Deng et al., 2021) | No | VGG-11 | 10 | 83.17% |
| | TEBN (Duan et al., 2022) | No | VGGSNN | 10 | 84.90% |
| | OTTT (Xiao et al., 2022) | Yes | VGGSNN | 10 | 76.30% |
| | Ours | Yes | VGGSNN | 10 | 82.40% |
| Imagenet | tdBN (Zheng et al., 2021) | No | Resnet-34 | 6 | 63.72% |
| | SEW (Fang et al., 2021) | No | SEW-Resnet-34 | 4 | 67.04% |
| | TET (Deng et al., 2021) | No | SEW-Resnet-34 | 4 | 68.00% |
| | TEBN (Duan et al., 2022) | No | SEW-Resnet-34 | 4 | 68.28% |
| | OTTT (Xiao et al., 2022) | Yes | NF-Resnet-34 | 6 | 65.15% |
| | BN (Vanilla) | Yes | SEW-Resnet-34 | 4 | 60.48% |
| | BN (OSR+OTS) (Ours) | Yes | SEW-Resnet-34 | 4 | 64.14% |
| | BN (OSR+OTS) (Ours) | Yes | NF-Resnet-34 | 4 | 67.54% |

To keep consistent with OTTT, we use the name NF-Resnet here to represent adding membrane potential in the shortcut connection in Resnet. Note that NF in OTTT stands for normalizer-free, but the corresponding part (weight standardization and scaling factors α, β along with the corresponding operations to keep variance stable) of this network is eliminated in our work.

5.2 QUALITATIVE RESULTS

A. Memory Usage: We compare the training memory usage between online algorithms and BPTT algorithms here. We test the case where T = 2, 4, 6, 8, 10, 15, 20, 25, 30 on the CIFAR10 dataset with VGGSNN architecture and a batch size of 128. The memory usage statistics are plotted in Figure 2 (a). We can see that our method maintains a constant memory requirement irrespective of time-steps, whereas BPTT approaches scale memory usage linearly with the number of time-steps. In addition, even when the number of time-steps is as low as 2, the memory cost of our algorithm is still lower than that of its BPTT counterpart.

B. Firing Rate Statistics: We compare the firing rate statistics among different configurations of our proposed modules. We test these statistics on Imagenet, using the SEW-Resnet-34 architecture with total time-steps T = 4. The gross firing rate statistics are listed in Table 2 and the per-time-step firing rates are plotted in Figure 2 (b). The results show that OTS successfully decreases the gross firing rate, which meets our expectations since it raises the thresholds in the later time-steps. The effect of OSR on firing rates is more interesting: when OTS is not added, OSR increases the gross firing rate, but it decreases the gross firing rate when OTS is added. For per-time-step firing rates, when the OTS mechanism is not added, the neurons fire far fewer spikes in the first time-step compared with later time-steps, while the firing rate is relatively stable from the second time-step to the last time-step. Besides, our OTS mechanism greatly alleviates but does not perfectly eliminate the firing rate variation among time-steps. It slightly over-lifts the firing rate of the first time-step, which might be caused by the firing rate distribution difference among time-steps.

[Figure 2 graphic. Panel (a): memory usage versus number of time-steps (5 to 30) for online training and BPTT. Panel (b): average firing rate versus time-step (1 to 4) for Baseline, OTS, OSR, and OTS+OSR.]

Figure 2: (a) Comparison of memory usage between our method and BPTT. BPTT incurs memory costs linearly proportional to time-steps, whereas our approach maintains constant memory usage regardless of time-steps. (b) Firing rate statistics of different configurations. From the figures, we know that the online threshold stabilizer indeed stabilizes the firing rate among time-steps.

Table 2: Gross firing rate

| Configuration | OTTT (Xiao et al., 2022) | Vanilla BN | OSR | OTS | OSR+OTS |
|---|---|---|---|---|---|
| Gross firing rate | 24% | 22.39% | 27.34% | 19.16% | 16.06% |

6 CONCLUSION AND FUTURE WORK

In this paper, we investigate online training for spiking neural networks, aiming to reduce training memory costs. We integrate batch normalization into the online training process by introducing online spiking renormalization and online threshold stabilizers to enhance training stability. Experiments on diverse datasets demonstrate the effectiveness of our proposed modules, showcasing the superior performance of our holistic approach among SNN online training algorithms. However, our approach currently falls short of BPTT in performance, primarily due to the absence of inner-layer and inter-layer reverse-in-time dependencies during backpropagation. Addressing the inner-layer dependency might involve incorporating eligibility traces, but effectively managing the significant inter-layer dependency in online learning remains a challenge. Moreover, achieving biologically plausible learning necessitates local (Journé et al., 2022) and event-driven (Zhu et al., 2022) properties in addition to online behavior, areas we haven't extensively explored in this work. These shortcomings present promising avenues for future research.


ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (62176003, 62088102) and by the Beijing Nova Program (20230484362).

Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey Mc Kinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7243 7252, 2017.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016.

Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long shortterm memory and learning-to-learn in networks of spiking neurons. Advances in Neural Information Processing Systems, 31:795 805, 2018.

Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature Communications, 11(1):3625, December 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-17236-y.

Thomas Bohnstingl, Stanislaw Wozniak, Angeliki Pantazi, and Evangelos Eleftheriou. Online Spatio Temporal Learning in Deep Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 15, 2022. ISSN 2162-237X, 2162-2388. doi: 10.1109/TNNLS.2022.3153985.

Tong Bu, Jianhao Ding, Zhaofei Yu, and Tiejun Huang. Optimized potential initialization for low-latency spiking neural networks. In In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11 20, 2022a.

Tong Bu, Wei Fang, Jianhao Ding, Peng Lin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. In International Conference on Learning Representations, 2022b.

Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energyefficient object recognition. International Journal of Computer Vision, 113(1):54 66, 2015.

Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, Yuyun Liao, Chit-Kwan Lin, Andrew Lines, Ruokun Liu, Deepak Mathaikutty, Steven Mc Coy, Arnab Paul, Jonathan Tse, Guruguhanathan Venkataramanan, Yi-Hsin Weng, Andreas Wild, Yoonseok Yang, and Hong Wang. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82 99, 2018. ISSN 0272-1732. doi: 10.1109/MM.2018.112130359.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009.

Shikuang Deng and Shi Gu. Optimal conversion of conventional artificial neural networks to spiking neural networks. In International Conference on Learning Representations, 2021.

Shikuang Deng, Yuhang Li, Shanghang Zhang, and Shi Gu. Temporal efficient training of spiking neural network via gradient re-weighting. In International Conference on Learning Representations, 2021.

Chaoteng Duan, Jianhao Ding, Shiyan Chen, Zhaofei Yu, and Tiejun Huang. Temporal effective batch normalization in spiking neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 34377–34390. Curran Associates, Inc., 2022.

Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021.

Steve B Furber, Francesco Galluppi, Steve Temple, and Luis A Plana. The SpiNNaker project. Proceedings of the IEEE, 102(5):652–665, 2014.

Wulfram Gerstner, Werner M Kistler, Richard Naud, and Liam Paninski. Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014.

Bing Han, Gopalakrishnan Srinivasan, and Kaushik Roy. RMP-SNN: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13558–13567, 2020.

Nicolas Hohn and Anthony N Burkitt. Shot noise in the leaky integrate-and-fire neuron. Physical Review E, 63(3):031902, 2001.

Shin-ichi Ikegawa, Ryuji Saiin, Yoshihide Sawada, and Naotake Natori. Rethinking the role of normalization and residual blocks for spiking neural networks, March 2022.

Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in neural information processing systems, 30, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR, 2015.

Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learning without feedback. In The Eleventh International Conference on Learning Representations, 2022.

Anil Kag and Venkatesh Saligrama. Training recurrent neural networks via forward propagation through time. In International Conference on Machine Learning, pp. 5189–5200. PMLR, 2021.

Jacques Kaiser, Hesham Mostafa, and Emre Neftci. Synaptic plasticity dynamics for deep continuous local learning (decolle). Frontiers in Neuroscience, 14:424, 2020.

Youngeun Kim and Priyadarshini Panda. Revisiting Batch Normalization for Training Low-Latency Deep Spiking Neural Networks From Scratch. Frontiers in Neuroscience, 15:773954, December 2021. ISSN 1662-453X. doi: 10.3389/fnins.2021.773954.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience, 11:309, 2017.

Yuhang Li, Youngeun Kim, Hyoungseob Park, Tamar Geller, and Priyadarshini Panda. Neuromorphic data augmentation for training spiking neural networks. In European Conference on Computer Vision, pp. 631–649. Springer, 2022.

Jacob Menick, Erich Elsen, Utku Evci, Simon Osindero, Karen Simonyan, and Alex Graves. Practical real time recurrent learning with a sparse approximation. In International conference on learning representations, 2020.

Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

Asier Mujika, Florian Meier, and Angelika Steger. Approximating real-time recurrent learning with random Kronecker factors. Advances in Neural Information Processing Systems, 31, 2018.

Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019. ISSN 1053-5888, 1558-0792. doi: 10.1109/MSP.2019.2931595.

Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature, 572(7767):106–111, 2019.

Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.

Juncheng Shen, De Ma, Zonghua Gu, Ming Zhang, Xiaolei Zhu, Xiaoqiang Xu, Qi Xu, Yangjing Shen, and Gang Pan. Darwin: a neuromorphic hardware co-processor based on spiking neural networks. Science China Information Sciences, 59(2):1–5, 2016.

Corentin Tallec and Yann Ollivier. Unbiased Online Recurrent Optimization, May 2017.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

Ronald J. Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, June 1989. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco.1989.1.2.270.

Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.

Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19, 2018.

Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Di He, and Zhouchen Lin. Online Training Through Time for Spiking Neural Networks, October 2022.

Bojian Yin, Federico Corradi, and Sander M. Bohte. Accurate online training of dynamical spiking neural networks through Forward Propagation Through Time, November 2022.

Friedemann Zenke and Surya Ganguli. SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks. Neural Computation, 30(6):1514–1541, June 2018. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco_a_01086.

Friedemann Zenke, Sander M Bohté, Claudia Clopath, Iulia M Comşa, Julian Göltz, Wolfgang Maass, Timothée Masquelier, Richard Naud, Emre O Neftci, Mihai A Petrovici, et al. Visualizing a joint future of neuroscience and neuromorphic engineering. Neuron, 109(4):571–575, 2021.

Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11062–11070, 2021.

Yaoyu Zhu, Zhaofei Yu, Wei Fang, Xiaodong Xie, Tiejun Huang, and Timothée Masquelier. Training spiking neural networks with event-driven backpropagation. In Advances in Neural Information Processing Systems, 2022.

A THEORETICAL DERIVATION

For ease of reference, we restate Assumption 4.1 here:

Assumption A.1. Assume all entries of $s^{l-1}[t]$ (of size $B \times C_{in}$) and $W^l$ (of size $C_{in} \times C_{out}$) are independent for $1 \le t \le T$, all $s^{l-1}_i[t]$ obey an i.i.d. Bernoulli($p[t]$) distribution, and all $w^l_{ji}$ obey an arbitrary i.i.d. distribution.

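For simplicity, we omit the superscripts $l$ and $l-1$ in the following derivation. Before proving the theorems, we derive the mean, variance, and covariance of the variable $I_{bi}[t]$ in Lemma A.2 and Lemma A.3:

Lemma A.2. When Assumption 4.1 holds, all $I_{bi}[t]$ share an identical distribution, and $E[I_{bi}[t]] = C_{in}\,p[t]\,E[w_{ji}]$, $\mathrm{VAR}[I_{bi}[t]] = C_{in}\big(p[t]\,\mathrm{VAR}[w_{ji}] + (p[t] - p[t]^2)E^2[w_{ji}]\big)$.

Proof. Since $I_{bi}[t] = \sum_{j=1}^{C_{in}} s_{bj}[t]\,w_{ji}$ are all sums of products of independent variables with identical distributions ($s_{bj}[t]$ and $w_{ji}$), they share an identical distribution. We calculate the mean and variance of $I_{bi}[t]$ as follows:

$$
E[I_{bi}[t]] = \sum_{j=1}^{C_{in}} E(s_{bj}[t])\,E(w_{ji}) = C_{in}\,p[t]\,E[w_{ji}] \tag{17}
$$

$$
\begin{aligned}
E[I_{bi}^2[t]] &= E\Big[\Big(\sum_{j=1}^{C_{in}} s_{bj}[t]\,w_{ji}\Big)^2\Big] \\
&= \sum_{j=1}^{C_{in}} E\big[(s_{bj}[t]\,w_{ji})^2\big] + C_{in}(C_{in}-1)\,p[t]^2E^2[w_{ji}] \\
&= \sum_{j=1}^{C_{in}} E[s_{bj}[t]^2]\,E[w_{ji}^2] + C_{in}(C_{in}-1)\,p[t]^2E^2[w_{ji}] \\
&= C_{in}\,p[t]\,E[w_{ji}^2] + C_{in}(C_{in}-1)\,p[t]^2E^2[w_{ji}]
\end{aligned}
\tag{18}
$$

$$
\begin{aligned}
\mathrm{VAR}(I_{bi}[t]) &= E[I_{bi}^2[t]] - E^2[I_{bi}[t]] = C_{in}\big(p[t]\,E[w_{ji}^2] - p[t]^2E^2[w_{ji}]\big) \\
&= C_{in}\big(p[t]\,\mathrm{VAR}[w_{ji}] + (p[t] - p[t]^2)E^2[w_{ji}]\big)
\end{aligned}
\tag{19}
$$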
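Lemma A.2 can be checked on a small instance by enumerating every outcome exactly. The sketch below is only an illustration: the function names, the choice $C_{in} = 3$, $p[t] = 1/4$, and a weight distribution that is uniform over $\{-1, 2, 3\}$ are assumptions made for the example (the lemma itself allows any i.i.d. weight distribution).

```python
from fractions import Fraction
from itertools import product


def exact_moments(c_in, p, weight_values):
    """Exact mean and variance of I = sum_j s_j * w_j by full enumeration.

    s_j ~ Bernoulli(p); w_j is uniform over weight_values (an assumption
    made only for this example).
    """
    pw = Fraction(1, len(weight_values))
    mean = Fraction(0)
    second = Fraction(0)
    for spikes in product((0, 1), repeat=c_in):
        prob_s = Fraction(1)
        for s in spikes:
            prob_s *= p if s else 1 - p
        for weights in product(weight_values, repeat=c_in):
            prob = prob_s * pw ** c_in
            current = sum(s * w for s, w in zip(spikes, weights))
            mean += prob * current
            second += prob * current ** 2
    return mean, second - mean ** 2


def lemma_a2(c_in, p, weight_values):
    """Closed-form mean and variance stated in Lemma A.2."""
    n = len(weight_values)
    e_w = Fraction(sum(weight_values), n)
    var_w = Fraction(sum(w * w for w in weight_values), n) - e_w ** 2
    mean = c_in * p * e_w
    var = c_in * (p * var_w + (p - p ** 2) * e_w ** 2)
    return mean, var


C_IN, P, WEIGHTS = 3, Fraction(1, 4), (-1, 2, 3)
print(exact_moments(C_IN, P, WEIGHTS))
print(lemma_a2(C_IN, P, WEIGHTS))
```

Because the arithmetic uses `Fraction`, the two printed pairs can be compared for exact equality rather than up to sampling noise.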

Lemma A.3. When Assumption 4.1 holds, for all $1 \le b_1, b_2 \le B$, $1 \le i_1, i_2 \le C_{out}$, and $1 \le t_1, t_2 \le T$, $I_{b_1i_1}[t_1]$ and $I_{b_2i_2}[t_2]$ are uncorrelated when $(b_1, t_1) \ne (b_2, t_2)$ and $i_1 \ne i_2$. When $(b_1, t_1) = (b_2, t_2)$ and $i_1 \ne i_2$, $\mathrm{COV}(I_{bi_1}[t], I_{bi_2}[t]) = C_{in}E^2[w_{ji}](p[t] - p[t]^2)$. When $(b_1, t_1) \ne (b_2, t_2)$ and $i_1 = i_2$, $\mathrm{COV}(I_{b_1i}[t_1], I_{b_2i}[t_2]) = C_{in}\,p[t_1]\,p[t_2]\,\mathrm{VAR}[w_{ji}]$.

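Proof. Since $I_{bi}[t] = \sum_{j=1}^{C_{in}} s_{bj}[t]\,w_{ji}$,

$$
\mathrm{COV}(I_{b_1i_1}[t_1], I_{b_2i_2}[t_2]) = \mathrm{COV}\Big(\sum_{j=1}^{C_{in}} s_{b_1j}[t_1]\,w_{ji_1},\ \sum_{j=1}^{C_{in}} s_{b_2j}[t_2]\,w_{ji_2}\Big) \tag{20}
$$

When $(b_1, t_1) \ne (b_2, t_2)$ and $i_1 \ne i_2$, the lemma is trivial since the entries in the summation are all uncorrelated. For the case when $(b_1, t_1) = (b_2, t_2)$ and $i_1 \ne i_2$, we have:

$$
\begin{aligned}
\mathrm{COV}(I_{bi_1}[t], I_{bi_2}[t]) &= E\Big[\Big(\sum_{j=1}^{C_{in}} s_{bj}[t]\,w_{ji_1}\Big)\Big(\sum_{j=1}^{C_{in}} s_{bj}[t]\,w_{ji_2}\Big)\Big] - E[I_{bi_1}[t]]\,E[I_{bi_2}[t]] \\
&= \sum_{j=1}^{C_{in}} \Big(E\big(s_{bj}[t]^2\,w_{ji_1}w_{ji_2}\big) - E\big(s_{bj}[t]\,w_{ji_1}\big)\,E\big(s_{bj}[t]\,w_{ji_2}\big)\Big) \\
&= \sum_{j=1}^{C_{in}} \Big(p[t]\,E^2[w_{ji_1}] - p[t]^2E^2[w_{ji_1}]\Big) \\
&= C_{in}E^2[w_{ji}]\,(p[t] - p[t]^2)
\end{aligned}
\tag{21}
$$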

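For the case when $(b_1, t_1) \ne (b_2, t_2)$ and $i_1 = i_2$, we have:

$$
\begin{aligned}
\mathrm{COV}(I_{b_1i}[t_1], I_{b_2i}[t_2]) &= E\Big[\Big(\sum_{j=1}^{C_{in}} s_{b_1j}[t_1]\,w_{ji}\Big)\Big(\sum_{j=1}^{C_{in}} s_{b_2j}[t_2]\,w_{ji}\Big)\Big] - E[I_{b_1i}[t_1]]\,E[I_{b_2i}[t_2]] \\
&= \sum_{j=1}^{C_{in}} \Big(E\big(s_{b_1j}[t_1]\,s_{b_2j}[t_2]\,w_{ji}^2\big) - E\big(s_{b_1j}[t_1]\,w_{ji}\big)\,E\big(s_{b_2j}[t_2]\,w_{ji}\big)\Big) \\
&= \sum_{j=1}^{C_{in}} \Big(p[t_1]\,p[t_2]\,E[w_{ji}^2] - p[t_1]\,p[t_2]\,E^2[w_{ji}]\Big) \\
&= C_{in}\,p[t_1]\,p[t_2]\,\mathrm{VAR}[w_{ji}]
\end{aligned}
\tag{22}
$$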
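The two non-trivial covariance cases of Lemma A.3 can also be verified exactly on a toy instance. In the sketch below, the helper names, $C_{in} = 2$, the firing rates $1/4$ and $2/3$, and the uniform weight distribution over $\{-1, 2, 3\}$ are all assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def covariance(variables, f, g):
    """Exact COV(f(x), g(x)) for independent discrete variables.

    `variables` is a list of distributions, each a list of (value, prob).
    """
    e_f = e_g = e_fg = Fraction(0)
    for outcome in product(*variables):
        prob = Fraction(1)
        for _, q in outcome:
            prob *= q
        x = [v for v, _ in outcome]
        fx, gx = f(x), g(x)
        e_f += prob * fx
        e_g += prob * gx
        e_fg += prob * fx * gx
    return e_fg - e_f * e_g


def bernoulli(p):
    return [(0, 1 - p), (1, p)]


def uniform(values):
    return [(v, Fraction(1, len(values))) for v in values]


C_IN = 2
WEIGHTS = (-1, 2, 3)  # hypothetical weight support
E_W = Fraction(sum(WEIGHTS), len(WEIGHTS))
VAR_W = Fraction(sum(w * w for w in WEIGHTS), len(WEIGHTS)) - E_W ** 2
P1, P2 = Fraction(1, 4), Fraction(2, 3)

# Case 1: same (b, t), different output channels i1 != i2.
# Layout of x: [s_1, s_2, w_11, w_21, w_12, w_22] (shared spikes).
case1 = covariance(
    [bernoulli(P1)] * C_IN + [uniform(WEIGHTS)] * (2 * C_IN),
    lambda x: sum(x[j] * x[C_IN + j] for j in range(C_IN)),
    lambda x: sum(x[j] * x[2 * C_IN + j] for j in range(C_IN)),
)

# Case 2: different (b, t), same output channel i.
# Layout of x: [s1_1, s1_2, s2_1, s2_2, w_1, w_2] (shared weights).
case2 = covariance(
    [bernoulli(P1)] * C_IN + [bernoulli(P2)] * C_IN + [uniform(WEIGHTS)] * C_IN,
    lambda x: sum(x[j] * x[2 * C_IN + j] for j in range(C_IN)),
    lambda x: sum(x[C_IN + j] * x[2 * C_IN + j] for j in range(C_IN)),
)

print(case1, C_IN * E_W ** 2 * (P1 - P1 ** 2))  # Eq. 21
print(case2, C_IN * P1 * P2 * VAR_W)            # Eq. 22
```

In the first case the two currents share spikes but not weights, so only the spike variance $p[t] - p[t]^2$ contributes; in the second they share weights but not spikes, so only $\mathrm{VAR}[w_{ji}]$ contributes.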

After getting the mean, variance, and covariance of Ibi[t], we can prove the following theorems by calculating the coefficient before the variance of p[t]:

Theorem A.4. When Assumption 4.1 holds and the gross firing rate $p$ holds constant, the expectation of the sample variance of $\mu[t]$ among time-steps, $E\Big[\frac{1}{T}\sum_{t=1}^{T}\mu[t]^2 - \mu^2\Big]$, increases when the variance of the firing rate among time-steps, $\frac{1}{T}\sum_{t=1}^{T}p[t]^2 - p^2$, increases.

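Proof. Here we omit the subscript $b$ (the batch dimension) for $I_{bi}[t]$. First we calculate the expectation of $\mu[t]$ and $\mu$:

$$
E[\mu[t]] = E[I_{bi}[t]] = C_{in}\,p[t]\,E[w_{ji}] \tag{23}
$$

$$
E[\mu] = \frac{1}{T}\sum_{t=1}^{T} E[\mu[t]] = C_{in}\,p\,E[w_{ji}] \tag{24}
$$

Then we calculate the second moments, including $E[\mu[t]^2]$ and $E[\mu[t_1]\mu[t_2]]$:

$$
\begin{aligned}
E[\mu[t]^2] &= E^2[\mu[t]] + \mathrm{VAR}[\mu[t]] = C_{in}^2\,p[t]^2E^2[w_{ji}] + \frac{1}{C_{out}^2}\mathrm{VAR}\Big(\sum_{i=1}^{C_{out}} I_i[t]\Big) \\
&= C_{in}^2\,p[t]^2E^2[w_{ji}] + \frac{1}{C_{out}^2}\Big(\sum_{i=1}^{C_{out}} \mathrm{VAR}(I_i[t]) + 2\sum_{1 \le i_1 < i_2 \le C_{out}} \mathrm{COV}(I_{i_1}[t], I_{i_2}[t])\Big) \\
&= C_{in}^2\,p[t]^2E^2[w_{ji}] + \frac{1}{C_{out}^2}\Big(C_{out}C_{in}\big(p[t]\,E[w_{ji}^2] - p[t]^2E^2[w_{ji}]\big) + C_{out}(C_{out}-1)\,C_{in}E^2[w_{ji}](p[t] - p[t]^2)\Big) \\
&= C_{in}^2\,p[t]^2E^2[w_{ji}] + \frac{C_{in}}{C_{out}}\Big(p[t]\,\mathrm{VAR}[w_{ji}] + C_{out}(p[t] - p[t]^2)E^2[w_{ji}]\Big).
\end{aligned}
\tag{25}
$$

$$
\begin{aligned}
E[\mu[t_1]\mu[t_2]] &= E[\mu[t_1]]\,E[\mu[t_2]] + \mathrm{COV}[\mu[t_1], \mu[t_2]] \\
&= C_{in}^2\,p[t_1]\,p[t_2]\,E^2[w_{ji}] + \frac{1}{C_{out}^2}\mathrm{COV}\Big(\sum_{i=1}^{C_{out}} I_i[t_1],\ \sum_{i=1}^{C_{out}} I_i[t_2]\Big) \\
&= C_{in}^2\,p[t_1]\,p[t_2]\,E^2[w_{ji}] + \frac{1}{C_{out}^2}\sum_{i=1}^{C_{out}} \mathrm{COV}\big(I_i[t_1], I_i[t_2]\big) \\
&= C_{in}^2\,p[t_1]\,p[t_2]\,E^2[w_{ji}] + \frac{C_{in}}{C_{out}}\,p[t_1]\,p[t_2]\,\mathrm{VAR}[w_{ji}].
\end{aligned}
\tag{26}
$$

Finally, we can calculate the target function:

$$
\begin{aligned}
E\Big[\frac{1}{T}\sum_{t=1}^{T}\mu[t]^2 - \mu^2\Big] &= \frac{T-1}{T^2}\sum_{t=1}^{T} E[\mu[t]^2] - \frac{2}{T^2}\sum_{1 \le t_1 < t_2 \le T} E[\mu[t_1]\mu[t_2]] \\
&= \frac{T-1}{T^2}\sum_{t=1}^{T}\Big(C_{in}^2\,p[t]^2E^2[w_{ji}] + \frac{C_{in}}{C_{out}}\big(p[t]\,\mathrm{VAR}[w_{ji}] + C_{out}(p[t] - p[t]^2)E^2[w_{ji}]\big)\Big) \\
&\quad - \frac{2}{T^2}\sum_{1 \le t_1 < t_2 \le T}\Big(C_{in}^2\,p[t_1]\,p[t_2]\,E^2[w_{ji}] + \frac{C_{in}}{C_{out}}\,p[t_1]\,p[t_2]\,\mathrm{VAR}[w_{ji}]\Big) \\
&= C_{in}^2E^2[w_{ji}]\Big(\frac{1}{T}\sum_{t=1}^{T}p[t]^2 - p^2\Big) + C_{in}E^2[w_{ji}]\,\frac{T-1}{T}\Big(p - \frac{1}{T}\sum_{t=1}^{T}p[t]^2\Big) \\
&\quad + \frac{C_{in}}{C_{out}}\mathrm{VAR}[w_{ji}]\Big(\frac{T-1}{T}(p - p^2) + \frac{1}{T}\Big(\frac{1}{T}\sum_{t=1}^{T}p[t]^2 - p^2\Big)\Big)
\end{aligned}
\tag{27}
$$

When $p$ is constant, the variance of the firing rate among time-steps depends only on $\sum_{t=1}^{T}p[t]^2$. In the last line of Eq. 27, the only quantity that can vary is $\sum_{t=1}^{T}p[t]^2$, and the coefficient in front of it is always positive ($C_{in}^2E^2[w_{ji}]\frac{1}{T} \ge C_{in}E^2[w_{ji}]\frac{1}{T} \ge C_{in}E^2[w_{ji}]\frac{T-1}{T^2}$). Therefore, the conclusion holds.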
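As a numerical illustration of Theorem A.4, the following sketch assembles the target quantity directly from Eqs. 25 and 26 and compares two firing-rate profiles with the same gross rate. The function names and the values $C_{in} = 64$, $C_{out} = 128$, $E[w_{ji}] = 0.01$, $\mathrm{VAR}[w_{ji}] = 0.02$ are hypothetical choices for the example, and `closed_form` is one algebraically equivalent way of writing the last line of Eq. 27.

```python
def expected_temporal_variance(p, c_in, c_out, e_w, var_w):
    """E[(1/T) sum_t mu[t]^2 - mu^2], assembled from Eqs. 25 and 26."""
    T = len(p)

    def second_moment(pt):  # Eq. 25
        return (c_in ** 2 * pt ** 2 * e_w ** 2
                + c_in / c_out * (pt * var_w
                                  + c_out * (pt - pt ** 2) * e_w ** 2))

    def cross_moment(p1, p2):  # Eq. 26
        return (c_in ** 2 * p1 * p2 * e_w ** 2
                + c_in / c_out * p1 * p2 * var_w)

    diag = sum(second_moment(pt) for pt in p)
    cross = sum(cross_moment(p[a], p[b])
                for a in range(T) for b in range(a + 1, T))
    return (T - 1) / T ** 2 * diag - 2 / T ** 2 * cross


def closed_form(p, c_in, c_out, e_w, var_w):
    """Grouped form of the same quantity in terms of the firing-rate variance."""
    T = len(p)
    mean_p = sum(p) / T
    mean_p2 = sum(x * x for x in p) / T
    rate_var = mean_p2 - mean_p ** 2
    return (c_in ** 2 * e_w ** 2 * rate_var
            + c_in * e_w ** 2 * (T - 1) / T * (mean_p - mean_p2)
            + c_in / c_out * var_w
            * ((T - 1) / T * (mean_p - mean_p ** 2) + rate_var / T))


ARGS = dict(c_in=64, c_out=128, e_w=0.01, var_w=0.02)  # hypothetical values
flat = [0.2, 0.2, 0.2, 0.2]        # no variation of the firing rate over time
uneven = [0.05, 0.1, 0.25, 0.4]    # same gross rate 0.2, larger variation

for profile in (flat, uneven):
    print(expected_temporal_variance(profile, **ARGS),
          closed_form(profile, **ARGS))
```

Under these assumed values, the uneven profile yields the larger expected temporal variance of $\mu[t]$, in line with the theorem.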

Theorem A.5. When Assumption 4.1 holds and the gross firing rate $p$ keeps constant, the expectation of the variance within time-steps, $E\Big[\frac{1}{T}\sum_{t=1}^{T}\sigma^2[t]\Big]$, keeps constant.

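Proof. We first calculate $E[\sigma^2[t]]$ and then sum over time-steps. $E[\sigma^2[t]]$ can be split into $E[I_i[t]^2]$ and $E[\mu[t]^2]$, which have been calculated before.

$$
\begin{aligned}
E[\sigma^2[t]] &= E\Big[\frac{1}{C_{out}}\sum_{i=1}^{C_{out}} I_i[t]^2 - \mu[t]^2\Big] = \frac{1}{C_{out}}\sum_{i=1}^{C_{out}} E[I_i[t]^2] - E[\mu[t]^2] \\
&= C_{in}\,p[t]\,E[w_{ji}^2] + C_{in}(C_{in}-1)\,p[t]^2E^2[w_{ji}] - C_{in}^2\,p[t]^2E^2[w_{ji}] \\
&\quad - \frac{C_{in}}{C_{out}}\Big(p[t]\,\mathrm{VAR}[w_{ji}] + C_{out}(p[t] - p[t]^2)E^2[w_{ji}]\Big) \\
&= C_{in}\,p[t]\,E[w_{ji}^2] - C_{in}\,p[t]^2E^2[w_{ji}] - \frac{C_{in}}{C_{out}}\Big(p[t]\,\mathrm{VAR}[w_{ji}] + C_{out}(p[t] - p[t]^2)E^2[w_{ji}]\Big) \\
&= C_{in}\big(p[t]\,\mathrm{VAR}[w_{ji}] + (p[t] - p[t]^2)E^2[w_{ji}]\big) - \frac{C_{in}}{C_{out}}\Big(p[t]\,\mathrm{VAR}[w_{ji}] + C_{out}(p[t] - p[t]^2)E^2[w_{ji}]\Big) \\
&= \frac{C_{in}(C_{out}-1)}{C_{out}}\,p[t]\,\mathrm{VAR}[w_{ji}]
\end{aligned}
\tag{28}
$$

$$
E\Big[\frac{1}{T}\sum_{t=1}^{T}\sigma^2[t]\Big] = \frac{1}{T}\sum_{t=1}^{T}\frac{C_{in}(C_{out}-1)}{C_{out}}\,p[t]\,\mathrm{VAR}[w_{ji}] = \frac{C_{in}(C_{out}-1)}{C_{out}}\,p\,\mathrm{VAR}[w_{ji}] \tag{29}
$$

As a result, the variance of $p[t]$ does not affect $E\Big[\frac{1}{T}\sum_{t=1}^{T}\sigma^2[t]\Big]$, which means it keeps constant.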
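The same kind of numerical check applies to Theorem A.5. The sketch below builds $E[\sigma^2[t]]$ from Eqs. 18 and 25 for two firing-rate profiles with the same gross rate and compares the result with the closed form of Eq. 29. The function name and the parameter values are hypothetical and serve only as an example.

```python
def expected_within_variance(p, c_in, c_out, e_w, var_w):
    """(1/T) sum_t E[sigma^2[t]], each term built from Eqs. 18 and 25."""
    e_w2 = var_w + e_w ** 2  # second moment of a weight
    total = 0.0
    for pt in p:
        # Eq. 18: E[I_i[t]^2]
        e_i2 = c_in * pt * e_w2 + c_in * (c_in - 1) * pt ** 2 * e_w ** 2
        # Eq. 25: E[mu[t]^2]
        e_mu2 = (c_in ** 2 * pt ** 2 * e_w ** 2
                 + c_in / c_out * (pt * var_w
                                   + c_out * (pt - pt ** 2) * e_w ** 2))
        total += e_i2 - e_mu2
    return total / len(p)


ARGS = dict(c_in=64, c_out=128, e_w=0.01, var_w=0.02)  # hypothetical values
flat = [0.2, 0.2, 0.2, 0.2]
uneven = [0.05, 0.1, 0.25, 0.4]  # same gross rate 0.2

# Eq. 29: C_in (C_out - 1) / C_out * p * VAR[w]
eq29 = ARGS["c_in"] * (ARGS["c_out"] - 1) / ARGS["c_out"] * 0.2 * ARGS["var_w"]

print(expected_within_variance(flat, **ARGS))
print(expected_within_variance(uneven, **ARGS))
print(eq29)
```

All three printed values agree up to floating-point rounding, since Eq. 28 is linear in $p[t]$ and therefore depends on the profile only through its mean.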

B ALGORITHM DESCRIPTION FOR OUR METHOD

Our algorithm works under the online learning framework, which means the network goes through forward and backward propagations step by step from time-step 1 to T (instead of first forward from time step 1 to T and then backward from time step T to 1). Since the network is processed step by step, it does not require saving the intermediate state from time-step 1 to T as in regular BPTT. In each time-step, the information goes from the input layer to the output layer of the network in the forward pass, and then the gradients go from the output layer to the input layer in the backward pass.
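The step-by-step processing described above can be sketched for the forward pass of a single layer of LIF neurons. This is a minimal illustration in plain Python, not the released implementation: the function names, the values $\tau = 2$ and $\theta = 1$, the convention that a neuron fires when its potential reaches the threshold, and the `on_step` hook (a placeholder for the per-step loss and backward pass) are assumptions of the example.

```python
def lif_step(u_prev, current, tau=2.0, theta=1.0):
    """One forward step of a layer of LIF neurons with reset to zero."""
    # Leaky integration of the input current.
    u = [(1.0 - 1.0 / tau) * up + i for up, i in zip(u_prev, current)]
    # Fire where the membrane potential reaches the threshold.
    spikes = [1 if x >= theta else 0 for x in u]
    # Reset the neurons that fired.
    u_next = [x * (1 - s) for x, s in zip(u, spikes)]
    return spikes, u_next


def run_online(inputs, num_neurons, on_step=None):
    """Process time-steps one by one, keeping only the current state."""
    u = [0.0] * num_neurons
    counts = [0] * num_neurons
    for t, current in enumerate(inputs, start=1):
        spikes, u = lif_step(u, current)
        if on_step is not None:
            on_step(t, spikes)  # hypothetical hook: per-step loss/update
        counts = [c + s for c, s in zip(counts, spikes)]
    return counts


inputs = [[0.6, 0.2]] * 4  # constant input current for T = 4 time-steps
print(run_online(inputs, num_neurons=2))
```

Only the membrane potential `u` is carried from one time-step to the next, which is why no per-time-step history needs to be stored.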

The workflow of each layer is shown in Algorithm 1, while the calculation of $\mu[t]$, $\sigma^2[t]$, $\hat{\mu}$, and $\hat{\sigma}^2$ is shown separately in Algorithm 2.

C IMPLEMENTATION DETAILS

We conduct experiments on the CIFAR10, CIFAR100, DVS-Gesture, CIFAR10-DVS, and Imagenet datasets. The VGGSNN network structure we use for the CIFAR10, CIFAR100, DVS-Gesture, and CIFAR10-DVS datasets is consistent with OTTT (64C3-128C3-AP2-256C3-256C3-AP2-512C3-512C3-AP2-512C3-512C3-GAP-FC), where 64C3 denotes a convolution layer with a 3 × 3 convolution kernel and 64 output channels, AP2 means 2 × 2 average pooling, GAP means global average pooling, and FC means a fully connected layer. For Imagenet classification, we use the standard Resnet-34 architecture.
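The compact architecture string above can be expanded mechanically. The following sketch parses it into a list of layer descriptions; the function name `parse_vggsnn` and the tuple layout are hypothetical and not taken from the released code.

```python
import re


def parse_vggsnn(spec):
    """Expand a string such as '64C3-AP2-GAP-FC' into layer descriptions."""
    layers = []
    for token in spec.split("-"):
        conv = re.fullmatch(r"(\d+)C(\d+)", token)
        pool = re.fullmatch(r"AP(\d+)", token)
        if conv:
            # (kind, output channels, kernel size)
            layers.append(("conv", int(conv.group(1)), int(conv.group(2))))
        elif pool:
            # (kind, pooling window)
            layers.append(("avgpool", int(pool.group(1))))
        elif token == "GAP":
            layers.append(("global_avgpool",))
        elif token == "FC":
            layers.append(("fc",))
        else:
            raise ValueError(f"unknown layer token: {token}")
    return layers


SPEC = "64C3-128C3-AP2-256C3-256C3-AP2-512C3-512C3-AP2-512C3-512C3-GAP-FC"
layers = parse_vggsnn(SPEC)
for layer in layers:
    print(layer)
```

The parsed list contains eight convolution layers, three 2 × 2 average-pooling layers, one global average pooling layer, and one fully connected layer.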

In all experiments, we use an SGD optimizer with a momentum of 0.9 and a cosine annealing learning rate scheduler. The data augmentation we use for each dataset is as follows. For CIFAR10 and CIFAR100, we use RandomCrop(4) + Cutout() + RandomHorizontalFlip() + Normalize(). For DVS-Gesture, we use RandomResizedCrop(128, scale=(0.7, 1.0)) + Resize(48) + RandomRotation(20) + RandomTemporalDelete(14) (recall that the total number of time-steps is 20, so the random temporal delete drops 6 time-steps (30%)). For CIFAR10-DVS, we use the neuromorphic data augmentation (NDA) from Li et al. (2022). For Imagenet, we use RandomResizedCrop(224) + RandomHorizontalFlip() + Normalize() during training. During testing, the image is first resized to 256 × 256, center-cropped to 224 × 224, and then normalized. Other hyperparameters are provided in Table 3, including total training epochs, batch size, learning rate, weight decay, ϵ (weight of MSE loss in Eq. 5), and dropout rate.

Table 3: Experimental configurations

| Dataset | CIFAR10 | CIFAR100 | DVS-Gesture | CIFAR10-DVS | Imagenet |
|---|---|---|---|---|---|
| Epochs | 300 | 300 | 300 | 300 | 100 |
| Batch size | 128 | 128 | 128 | 128 | 256 |
| Learning rate | 0.1 | 0.1 | 0.01 | 0.1 | 0.1 |
| Weight decay | 5e-4 | 5e-4 | 5e-4 | 5e-4 | 2e-5 |
| MSE weight ϵ | 0.05 | 0.05 | 0.001 | 0.001 | 0.05 |
| Dropout rate | 0 | 0 | 0.05 | 0.1 | 0 |

Algorithm 1 The workflow of each layer

Input: Output of the last layer $s^{l-1}[t]$ (the input spike train/image at time $t$ for the input layer) and the weight between the last layer and the current layer $W^l$ ($s^{l-1}[t]$ and $W^l$ are both tensors instead of scalars).

1. Calculate the input current $I[t]$:

   $I[t] = \mathrm{layer}(s^{l-1}[t], W^l)$, where layer means a conv layer, a linear layer, or another type of layer.

2. Apply OSR on $I[t]$ to get the normalized $\tilde{I}[t]$:

   if training then

   Calculate $\mu[t]$, $\sigma^2[t]$, $\hat{\mu}$, $\hat{\sigma}^2$ according to Algorithm 2

   $$\hat{I}[t] = \frac{I[t] - \mu[t]}{\sqrt{\sigma^2[t] + \epsilon}}$$

   $$\tilde{I}[t] = \gamma\left(\hat{I}[t]\cdot\mathrm{no\_grad}\left(\frac{\sqrt{\sigma^2[t] + \epsilon}}{\sqrt{\hat{\sigma}^2 + \epsilon}}\right) + \mathrm{no\_grad}\left(\frac{\mu[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}}\right)\right) + \beta$$

   else

   $$\tilde{I}[t] = \gamma\,\frac{I[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta$$

   (Note that the same linear transformation is applied in training and inference.)

   end if

3. Update the membrane potential of neurons in layer $l$ according to the LIF neuron model and the input $\tilde{I}[t]$:

   $$u^l[t] = \left(1 - \frac{1}{\tau^l}\right)u^l[t - 0.5] + \tilde{I}[t]$$

4. Apply OTS to update the threshold $\theta[t]$:

   $$\theta[t] = \mu_{mem}[t] + \sigma_{mem}[t]\,\frac{\theta[1] - \mu_{mem}[1]}{\sigma_{mem}[1]}$$

5. Fire spikes $s^l[t]$ and then reset the membrane potential:

   $$s^l[t] = \Theta(u^l[t] - \theta[t]), \qquad u^l[t + 0.5] = u^l[t]\,(1 - s^l[t])$$

Algorithm 2 The calculation of $\mu[t]$, $\sigma^2[t]$, $\hat{\mu}$, $\hat{\sigma}^2$

Input: (Additional input) Current time-step $t$ ($1 \le t \le T$), used to determine whether we should initialize variables or calculate the running mean/variance at the current time-step.

Output: $\mu[t]$, $\sigma^2[t]$, $\hat{\mu}$, $\hat{\sigma}^2$

1. Calculate the batch mean $\mu[t]$ and variance $\sigma^2[t]$ of $I[t]$ according to Eq. 13 (here $m$ is the number of elements in a channel, which forms a group for normalization):

   $$\mu[t] = \frac{1}{m}\sum_{x=1}^{m} I_x[t], \qquad \sigma^2[t] = \frac{1}{m}\sum_{x=1}^{m}\big(I_x[t] - \mu[t]\big)^2$$

2. According to Eq. 14–15, we need variables $\mu$ and $\sigma^2$ to accumulate the total mean and variance. Before the first time-step, we initialize $\mu$ and $\sigma^2$ to 0:

   if $t = 1$ then $\mu \leftarrow 0$, $\sigma^2 \leftarrow 0$ end if

   Then we accumulate the total mean and variance according to Eq. 14–15:

   $$\mu \leftarrow \mu + \frac{1}{T}\mu[t], \qquad \sigma^2 \leftarrow \sigma^2 + \frac{1}{T}\big(\sigma^2[t] + \mu[t]^2\big)$$

3. Calculate the running mean $\hat{\mu}$ and running variance $\hat{\sigma}^2$ at time-step $T$ (the last time-step):

   if $t = T$ then

   $\hat{\mu} \leftarrow \hat{\mu} + (1 - \mathrm{momentum})(\mu - \hat{\mu})$ (we take momentum = 0.9 as in BN)

   $\hat{\sigma}^2 \leftarrow \hat{\sigma}^2 + (1 - \mathrm{momentum})(\sigma^2 - \hat{\sigma}^2)$

   end if
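The accumulation of statistics in Algorithm 2 can be illustrated with a small plain-Python sketch. The class name `OnlineStats`, the initial running values (0 and 1), and the example data are hypothetical. In addition, the sketch subtracts the squared total mean at the last time-step so that the accumulated second moment becomes a variance; this is one reading of Eq. 14–15 and should be treated as an assumption rather than as the released implementation.

```python
def batch_stats(values):
    """Mean and (biased) variance of one group of elements, as in Eq. 13."""
    m = len(values)
    mean = sum(values) / m
    var = sum((x - mean) ** 2 for x in values) / m
    return mean, var


class OnlineStats:
    """Accumulates per-time-step statistics without storing past inputs."""

    def __init__(self, total_steps, momentum=0.9):
        self.T = total_steps
        self.momentum = momentum
        self.running_mean, self.running_var = 0.0, 1.0  # assumed initial values
        self.acc_mean = self.acc_sq = 0.0

    def step(self, t, values):
        mu_t, var_t = batch_stats(values)
        if t == 1:  # reset the accumulators before the first time-step
            self.acc_mean = self.acc_sq = 0.0
        self.acc_mean += mu_t / self.T
        self.acc_sq += (var_t + mu_t ** 2) / self.T
        if t == self.T:  # update running statistics once per sample
            # Assumption: convert the accumulated second moment to a variance.
            total_var = self.acc_sq - self.acc_mean ** 2
            k = 1.0 - self.momentum
            self.running_mean += k * (self.acc_mean - self.running_mean)
            self.running_var += k * (total_var - self.running_var)
        return mu_t, var_t


# With momentum = 0 the running statistics equal the totals over all steps.
stats = OnlineStats(total_steps=2, momentum=0.0)
stats.step(1, [1.0, 2.0, 3.0])
stats.step(2, [4.0, 6.0, 8.0])
print(stats.running_mean, stats.running_var)
print(batch_stats([1.0, 2.0, 3.0, 4.0, 6.0, 8.0]))
```

With equally sized groups at every time-step, the accumulated totals match the mean and variance computed over all time-steps at once, even though each time-step is seen only when it arrives.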

The Vanilla BN used as the baseline calculates its statistics solely from the data at each time-step. In this approach, at every time-step during training, the normalized input $\tilde{I}[t]$ is computed by

$$\tilde{I}[t] = \gamma\,\frac{I[t] - \mu[t]}{\sqrt{\sigma^2[t] + \epsilon}} + \beta,$$

where the batch mean $\mu[t]$ and batch variance $\sigma^2[t]$ at time-step $t$ are computed as in Algorithm 2 above.

The slight difference between this baseline and the standard BN is the calculation of the running mean and variance. If we used the same momentum as in our OSR to update the running mean and variance at each time-step, they would be unstable, since they would be updated $T$ times more often than in our OSR. A simple approach is to change the momentum parameter to $1 - (1 - \mathrm{momentum})/T$, but we choose a strictly corresponding implementation: we accumulate the mean and variance during training (for the variance, we accumulate by $\sigma^2 \leftarrow \sigma^2 + \frac{1}{T}\sigma^2[t]$ instead of the more complex way shown above), and only update the running mean and variance at time-step $T$. In this way, there is no need to change the momentum parameter.
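The effect of the momentum choice can be seen with a short calculation. For the update $\hat{\mu} \leftarrow \hat{\mu} + (1 - \mathrm{momentum})(\mu - \hat{\mu})$ toward a fixed target, each update multiplies the remaining gap by the momentum. The sketch below compares three schedules; $T = 4$ and the function name are assumptions of the example.

```python
def remaining_gap(momentum, num_updates):
    """Fraction of the gap to a fixed target left after repeated updates."""
    return momentum ** num_updates


T, momentum = 4, 0.9

once_per_sample = remaining_gap(momentum, 1)          # update only at t = T
every_step = remaining_gap(momentum, T)               # same momentum, T updates
rescaled = remaining_gap(1 - (1 - momentum) / T, T)   # adjusted momentum

print(once_per_sample, every_step, rescaled)
```

Updating at every time-step with the unchanged momentum moves the running statistics much faster (about 0.656 of the gap remains instead of 0.9), whereas the rescaled momentum approximately recovers the once-per-sample behaviour (about 0.904).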

D ABLATION STUDY

Here we show the ablation results of our proposed modules: online spiking renormalization (OSR) and online threshold stabilizer (OTS). Since OTS is proposed to help the training of OSR, we provide the results of Vanilla BN / OSR / OSR+OTS on the CIFAR10/100 and Imagenet datasets. The results are shown in Table 4. Adding OSR improves the performance over vanilla BN, and adding OTS further improves the performance over OSR alone. It is worth noting that OSR alone is sensitive to the weight decay parameter and often requires a lower weight decay to get a better result. For example, both Resnet-19+(Vanilla BN) and Resnet-19+OSR+OTS can be trained on CIFAR10 with a weight decay of 2e-4 while Resnet-19+OSR cannot. Another example is that the performance of SEW-Resnet-34+OSR degrades to 54.97% on Imagenet when using a weight decay of 2e-5. On the other hand, OSR+OTS is much more stable with large weight decay parameters.

Table 4: Ablation results

| Model | CIFAR10 Acc (wd) | CIFAR100 Acc (wd) | Imagenet Acc (wd) |
|---|---|---|---|
| VGG+Vanilla BN | 92.6 (5e-4) | 75.17 (5e-4) | - |
| VGG+OSR | 94.05 (2e-4) | 75.65 (2e-4) | - |
| VGG+OSR+OTS | 94.35 (5e-4) | 76.48 (5e-4) | - |
| Resnet-19+Vanilla BN | 92.96 (2e-5) | 73.68 (2e-4) | - |
| Resnet-19+OSR | 95.14 (2e-5) | 74.03 (2e-5) | - |
| Resnet-19+OSR+OTS | 95.20 (2e-5) | 77.86 (2e-4) | - |
| SEW-Resnet-34+Vanilla BN | - | - | 60.48 (2e-5) |
| SEW-Resnet-34+OSR | - | - | 61.92 (0) |
| SEW-Resnet-34+OSR+OTS | - | - | 64.14 (2e-5) |

Each entry reports the accuracy (%) followed by the weight decay (wd) in parentheses.

E ADDITIONAL EXPERIMENTS

Necessity of "double transformation" in OSR. OSR proved useful among the many mechanisms that we have tried. One simpler mechanism that does not work well is directly using a "linear transformation" instead of the "double transformation" in OSR. It directly applies

$$\tilde{I}[t] = \gamma\,\frac{I[t] - \hat{\mu}}{\sqrt{\hat{\sigma}^2 + \epsilon}} + \beta$$

in both training and inference. We have tested this approach on the CIFAR100 dataset using VGGSNN and find it very hard to train. The final result we get is 53.25%, which is significantly worse than OSR.
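To make the contrast concrete, the sketch below compares the forward values of the two variants on scalar inputs. It assumes that the training-time "double transformation" takes the batch-renormalization form of Ioffe (2017): normalize with the statistics of the current time-step, then rescale and shift with factors that are treated as constants (no gradient). The function names and numeric values are hypothetical, and this form is an assumption of the example rather than a statement of the released code.

```python
import math


def osr_training_forward(x, mu_t, var_t, run_mean, run_var,
                         gamma=1.0, beta=0.0, eps=1e-5):
    """Forward value of the assumed double transformation."""
    # First transformation: normalize with the statistics of this time-step.
    x_hat = (x - mu_t) / math.sqrt(var_t + eps)
    # Second transformation: map to the running statistics. In training,
    # r and d would be excluded from gradient computation (no_grad).
    r = math.sqrt(var_t + eps) / math.sqrt(run_var + eps)
    d = (mu_t - run_mean) / math.sqrt(run_var + eps)
    return gamma * (x_hat * r + d) + beta


def linear_forward(x, run_mean, run_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Single linear transformation with the running statistics."""
    return gamma * (x - run_mean) / math.sqrt(run_var + eps) + beta


for x in (-1.5, 0.0, 0.7, 3.0):
    a = osr_training_forward(x, mu_t=0.4, var_t=2.0, run_mean=0.1, run_var=1.3,
                             gamma=1.2, beta=-0.3)
    b = linear_forward(x, run_mean=0.1, run_var=1.3, gamma=1.2, beta=-0.3)
    print(x, a, b)
```

Under this assumption the two variants produce the same forward values, so the difference between them lies only in how gradients flow through $\mu[t]$ and $\sigma^2[t]$ during training.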

Fixed θ[t] during inference in OTS. In OTS, the threshold θ[t] is dynamically adjusted for each sample batch during both the training and inference phases. Fixing θ[t] during inference would be preferable, provided there is no significant performance decrease, because the inference batch size would then no longer affect performance and a fixed threshold is friendlier to neuromorphic chips. Hence we have conducted two extra experiments:

1. We have tested the performance of using a fixed running θ[t] on Imagenet with our saved OSR+OTS model. The accuracy is 64.06% (the original accuracy is 64.14%). This result shows that a fixed θ works well.

2. We have tested the performance for a batch size of 1 on Imagenet (also using our saved model), and the accuracy is 62.64%. Although there is a performance drop, it is still better than the baseline.
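The threshold rule of OTS, $\theta[t] = \mu_{mem}[t] + \sigma_{mem}[t]\,(\theta[1] - \mu_{mem}[1])/\sigma_{mem}[1]$, keeps the threshold at the same number of standard deviations from the mean membrane potential at every time-step. The sketch below illustrates this on a toy batch; the function names, the use of the population standard deviation, and the numbers are assumptions of the example.

```python
import math


def mean_std(values):
    """Mean and population standard deviation of membrane potentials."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def ots_threshold(u_t, theta_1, mu_1, sigma_1):
    """theta[t] = mu_mem[t] + sigma_mem[t] * (theta[1] - mu_mem[1]) / sigma_mem[1]."""
    mu_t, sigma_t = mean_std(u_t)
    return mu_t + sigma_t * (theta_1 - mu_1) / sigma_1


u_first = [-1.0, 0.0, 1.0, 2.0]            # membrane potentials at t = 1
theta_first = 1.2                           # hypothetical initial threshold
mu_1, sigma_1 = mean_std(u_first)

# At a later step the potentials are shifted and scaled: u_t = 2 * u_1 + 3.
u_later = [2.0 * u + 3.0 for u in u_first]
theta_later = ots_threshold(u_later, theta_first, mu_1, sigma_1)

fired_first = sum(u > theta_first for u in u_first)
fired_later = sum(u > theta_later for u in u_later)
print(theta_later, fired_first, fired_later)
```

In this toy case the threshold follows the shift and scale of the membrane potentials, so the same fraction of neurons fires at both time-steps. A fixed-threshold variant for inference would replace the per-batch statistics with stored ones; how these are stored is not specified here.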

F MEMBRANE POTENTIAL VISUALIZATION

To see whether the Gaussian assumption in OTS is reasonable, we collect the membrane potentials of a VGG network trained with OSR and OTS on the CIFAR-10 dataset and visualize their distribution for each layer and each time-step. The results are shown in Fig. 3. These distributions are bell-shaped, which indicates their similarity to Gaussian distributions, and most of them have a mean value around zero. This result shows that the Gaussian assumption in OTS is reasonable. Therefore, our proposed algorithm exploits the adaptation of batch normalization and can cope with the varied distributions of network features during online learning of spiking neural networks.

[Figure 3 panels: (a) Layer 1, (b) Layer 2, (c) Layer 3, (d) Layer 4, (e) Layer 5, (f) Layer 6, (g) Layer 7, (h) Layer 8. Each panel overlays the membrane potential distributions at T=0, T=1, T=2, and T=3.]

Figure 3: Visualization of the distributions of membrane potentials for each layer and each time step.