# Sparse Spiking Gradient Descent

Nicolas Perez-Nieves
Electrical and Electronic Engineering
Imperial College London
London, United Kingdom
nicolas.perez14@imperial.ac.uk

Dan F.M. Goodman
Electrical and Electronic Engineering
Imperial College London
London, United Kingdom
d.goodman@imperial.ac.uk

## Abstract

There is an increasing interest in emulating Spiking Neural Networks (SNNs) on neuromorphic computing devices due to their low energy consumption. Recent advances have allowed training SNNs to a point where they start to compete with traditional Artificial Neural Networks (ANNs) in terms of accuracy, while at the same time being energy efficient when run on neuromorphic hardware. However, the process of training SNNs is still based on dense tensor operations originally developed for ANNs which do not leverage the spatiotemporally sparse nature of SNNs. We present here the first sparse SNN backpropagation algorithm which achieves the same or better accuracy as current state-of-the-art methods while being significantly faster and more memory efficient. We show the effectiveness of our method on real datasets of varying complexity (Fashion-MNIST, Neuromorphic-MNIST and Spiking Heidelberg Digits), achieving a speedup in the backward pass of up to 150x while being up to 85% more memory efficient, without losing accuracy.

## 1 Introduction

In recent years, deep artificial neural networks (ANNs) have matched and occasionally surpassed human-level performance on increasingly difficult auditory and visual recognition problems [1, 2, 3], natural language processing tasks [4, 5] and games [6, 7, 8]. As these tasks become more challenging, the neural networks required to solve them grow larger and consequently their power efficiency becomes more important [9, 10]. At the same time, the increasing interest in deploying these models in embedded applications calls for faster, more efficient and less memory-intensive networks [11, 12]. The human brain manages to perform similar and even more complicated tasks while consuming only about 20 W [13], which contrasts with the hundreds of watts required for running ANNs [9].

Unlike ANNs, biological neurons in the brain communicate through discrete events called spikes. A biological neuron integrates incoming spikes from other neurons in its membrane potential and, after reaching a threshold, emits a spike and resets its potential [14]. The spiking neuron, combined with the sparse connectivity of the brain, results in highly spatio-temporally sparse activity [15], which is fundamental to achieving this level of energy efficiency.

Inspired by the extraordinary performance of the brain, neuromorphic computing aims to obtain the same level of energy efficiency, preferably while maintaining accuracy on par with ANNs, by emulating spiking neural networks (SNNs) on specialised hardware [16, 17, 18, 19]. These efforts go in two directions: emulating SNNs and training SNNs. While there is a growing number of successful neuromorphic implementations of the former [20, 21, 22, 23, 24, 17, 25, 26], the latter has proven to be more challenging. Some neuromorphic chips implement local learning rules [26, 27, 28] and recent advances have managed to approximate backpropagation on-chip [29]. However, full end-to-end supervised learning via error backpropagation still requires off-chip training [22, 23, 18].
This is usually achieved by simulating and training the entire SNN on a GPU and, more recently, by emulating the forward pass of the SNN on neuromorphic hardware and then performing the backward pass on a GPU [30, 31]. The great flexibility provided by GPUs allows the development of complex training pipelines without being constrained by neuromorphic hardware specifications. However, since GPUs do not leverage the event-driven nature of spiking computations, this results in effective but slower and less energy-efficient training. An alternative to directly training an SNN consists of training an ANN and converting it to an SNN. However, this approach has recently been shown to be no more energy efficient than using the original ANN in the first place [32].

Figure 1: Spiking Neural Network diagram highlighting the forward and backward passes.

Training SNNs has been a challenge in itself, even without considering a power budget, due to the all-or-none nature of spikes, which hinders traditional gradient descent methods [33, 34]. Since the spiking function is a unit step of the membrane potential, its derivative is a unit impulse: zero for all membrane potentials except at the threshold, where it is infinite. Recently developed methods based on surrogate gradients have been shown to be capable of solving complex tasks to near state-of-the-art accuracies [35, 18, 36, 37, 38, 39, 31]. The problem of non-differentiability is solved by adopting a well-behaved surrogate gradient when backpropagating the error [33]. This method, while effective for training very accurate models, is still constrained to dense tensor operations; it therefore does not profit from the sparse nature of spiking and limits the speed and power efficiency of SNN training.

In this work we introduce a sparse backpropagation method for SNNs. Recent work on ANNs has shown that adaptive dropout [40] and adaptive sparsity [41] can be used to achieve state-of-the-art performance at a fraction of the computational cost [42, 43]. We show that by redefining the surrogate gradient functional form, a sparse learning rule for SNNs arises naturally as a three-factor Hebbian-like learning rule. We empirically show backpropagation times up to 70x faster than current implementations and up to 40% better memory efficiency on real datasets (Fashion-MNIST (FMNIST) [44], Neuromorphic-MNIST (N-MNIST) [45] and Spiking Heidelberg Digits (SHD) [46]), thus reducing the computational gap between the forward and backward passes. This improvement will benefit not only training for neuromorphic computing applications but also computational neuroscience research involving training SNNs [39].

## 2 Sparse Spike Backpropagation

### 2.1 Forward Model

We begin by introducing the spiking networks we will work with (Figs. 1 and 2). For simplicity we omit the batch dimension in the derivations, but it is used later in the complexity analysis and experiments. We have a total of $L$ fully connected spiking layers. Each layer $l$ consists of $N^{(l)}$ spiking neurons which are fully connected to the next layer $l+1$ through synaptic weights $W^{(l)} \in \mathbb{R}^{N^{(l)} \times N^{(l+1)}}$. Each neuron has an internal state variable, the membrane potential $V^{(l)}_j[t] \in \mathbb{R}$, which is updated at every discrete time step $t \in \{0, \dots, T-1\}$ for some finite simulation time $T \in \mathbb{N}$.
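To make these shape conventions concrete, the following PyTorch sketch allocates the state of such a network. The layer sizes, batch size and number of time steps are arbitrary placeholder values chosen for illustration, and the variable names are ours; none of this is taken from the paper's released code.

```python
import torch

# Hypothetical layer sizes N^(0), ..., N^(L): input, one hidden, one output layer.
layer_sizes = [700, 256, 20]   # placeholder values, not from the paper
T = 100                        # number of discrete time steps
batch_size = 32                # batch dimension (omitted in the derivations)

# Synaptic weights W^(l) in R^{N^(l) x N^(l+1)} connecting layer l to layer l+1.
weights = [torch.randn(n_in, n_out) / n_in ** 0.5
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Membrane potentials V^(l)_j[t], initialised to 0 (V^(l)_i[0] = 0 for all l, i).
potentials = [torch.zeros(batch_size, n) for n in layer_sizes[1:]]

# Spike trains S^(l)_i[t] in {0, 1}, stored for every time step.
spikes = [torch.zeros(batch_size, T, n) for n in layer_sizes]
```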
Neurons emit spikes according to a spiking function $f : \mathbb{R} \to \{0, 1\}$ such that

$$S^{(l)}_i[t] = f\big(V^{(l)}_i[t]\big) \quad (1)$$

The membrane potential varies according to a simplified Leaky Integrate and Fire (LIF) neuron model in which input spikes are directly integrated into the membrane [14]. After a neuron spikes, its potential is reset to $V_r$. We will work with a discretised version of this model (see Appendix E for a continuous-time version). The membrane potential of neuron $j$ in layer $l+1$ evolves according to the following difference equation

$$V^{(l+1)}_j[t+1] = \alpha\big(V^{(l+1)}_j[t] - V_{\text{rest}}\big) + V_{\text{rest}} + \sum_i S^{(l)}_i[t]\, W^{(l)}_{ij} - (V_{\text{th}} - V_r)\, S^{(l+1)}_j[t] \quad (2)$$

The first two terms account for the leaky part of the model, where we define $\alpha = \exp(-\Delta t / \tau)$ with $\Delta t, \tau \in \mathbb{R}$ being the time resolution of the simulation and the membrane time constant of the neurons respectively. The summation term deals with the integration of incoming spikes. The last term models the resetting mechanism by subtracting the distance from the threshold to the reset potential. For simplicity and without loss of generality, we take $V_{\text{rest}} = 0$, $V_{\text{th}} = 1$, $V_r = 0$ and $V^{(l)}_i[0] = 0$ for all $l, i$ from now on.

Figure 2: Illustration of the gradient backpropagating only through active neurons.

We can unroll (2) to obtain

$$V^{(l+1)}_j[t+1] = \underbrace{\sum_i W^{(l)}_{ij} \overbrace{\sum_{k=0}^{t} \alpha^{t-k} S^{(l)}_i[k]}^{\text{Input trace}}}_{\text{Weighted input}} \;-\; \underbrace{\sum_{k=0}^{t} \alpha^{t-k} S^{(l+1)}_j[k]}_{\text{Resetting term}} \quad (3)$$

Thus, a spiking neural network can be viewed as a special type of recurrent neural network in which the activation function of each neuron is the spiking function $f(\cdot)$. The spiking function is commonly defined as the unit step function centred at a particular threshold $V_{\text{th}}$:

$$f(v) = \begin{cases} 1, & v > V_{\text{th}} \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

Note that while this function is easy to compute in the forward pass, its derivative, the unit impulse function, is problematic for backpropagation, which has led to the adoption of surrogate derivatives [33] to allow the gradient to flow. Interestingly, it has been shown that surrogate gradient descent performs robustly for a wide range of surrogate functional forms [38].

After the final layer, a loss function $\text{loss}(\cdot)$ computes how far the network activity is from some target value for the given input. The loss function can be defined as a function of the network spikes, the membrane potentials, or both. It may also be evaluated at every single time step or only every several steps. We deliberately leave the particular definition of this function open as it does not directly affect the sparse gradient descent method we introduce.

### 2.2 Backward Model

The SNN is updated by computing the gradient of the loss function with respect to the weights in each layer. This can be achieved using backpropagation through time (BPTT) on the unrolled network. Following (1) and (3) we can derive the weight gradients to be

$$\frac{\partial\,\text{loss}}{\partial W^{(l)}_{ij}} = \sum_t \underbrace{\varepsilon^{(l+1)}_j[t]}_{\text{Gradient from next layer}} \;\; \overbrace{\frac{d S^{(l+1)}_j[t]}{d V^{(l+1)}_j[t]}}^{\text{Spike derivative}} \;\; \underbrace{\sum_{k=0}^{t-1} \alpha^{t-1-k} S^{(l)}_i[k]}_{\text{Input trace}} \quad (5)$$
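As a concrete illustration of the forward model in Section 2.1, the sketch below implements equations (1), (2) and (4) in PyTorch, using a fast-sigmoid-shaped surrogate derivative as one common choice from the surrogate-gradient literature. The particular surrogate, the `scale` constant and the function names are our assumptions for illustration; this is a minimal sketch, not the implementation released with the paper.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Unit step spike function (eq. 4) with a surrogate derivative.

    The forward pass is exact; the backward pass replaces the unit impulse
    with a fast-sigmoid-shaped surrogate (a common choice in the
    surrogate-gradient literature, not necessarily the one used in the paper).
    """
    scale = 10.0  # sharpness of the surrogate; arbitrary placeholder value

    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_thresh,) = ctx.saved_tensors
        surrogate = 1.0 / (SurrogateSpike.scale * v_minus_thresh.abs() + 1.0) ** 2
        return grad_output * surrogate

def lif_forward(inputs, weight, alpha, v_th=1.0, v_reset=0.0):
    """Simulate one fully connected LIF layer following eq. (2).

    inputs: (batch, T, N_in) binary spike tensor from the previous layer.
    weight: (N_in, N_out) synaptic weights W^(l).
    Returns the output spike tensor of shape (batch, T, N_out).
    """
    batch, T, _ = inputs.shape
    n_out = weight.shape[1]
    v = torch.zeros(batch, n_out)          # V^(l+1)_j[0] = 0
    out_spikes = []
    for t in range(T):
        s = SurrogateSpike.apply(v - v_th)                          # eqs. (1), (4)
        # Leak, integrate incoming spikes, subtract the reset term: eq. (2).
        v = alpha * v + inputs[:, t] @ weight - (v_th - v_reset) * s
        out_spikes.append(s)
    return torch.stack(out_spikes, dim=1)
```

Chaining `lif_forward` over a list of weight matrices reproduces the unrolled computation in (3), and letting autograd backpropagate through the time loop corresponds to the kind of dense BPTT training performed by current implementations.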
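Reading (5) as a three-factor product, a naive dense evaluation can be sketched as follows. This is only meant to make the factors explicit; the tensor layout and function name are assumptions made for illustration, and this is not the sparse algorithm introduced in the paper.

```python
import torch

def weight_gradient_dense(err, spike_deriv, in_spikes, alpha):
    """Naive dense evaluation of the three-factor weight gradient in eq. (5).

    err:         (batch, T, N_out)  backpropagated error eps^(l+1)_j[t]
    spike_deriv: (batch, T, N_out)  surrogate dS^(l+1)_j[t]/dV^(l+1)_j[t]
    in_spikes:   (batch, T, N_in)   presynaptic spikes S^(l)_i[t]
    Returns the gradient dloss/dW^(l) of shape (N_in, N_out), summed over the batch.
    """
    batch, T, n_in = in_spikes.shape
    # Input trace: sum_{k<t} alpha^(t-1-k) S^(l)_i[k], computed recursively.
    trace = torch.zeros(batch, n_in)
    grad = torch.zeros(n_in, err.shape[2])
    for t in range(T):
        # Three-factor product: (error x spike derivative), outer product with the trace.
        post = err[:, t] * spike_deriv[:, t]        # (batch, N_out)
        grad += trace.transpose(0, 1) @ post        # accumulate over batch and time
        trace = alpha * trace + in_spikes[:, t]     # trace update for the next step
    return grad
```

In this dense sketch every entry of `post` is computed and multiplied even though spiking activity is sparse; by redefining the surrogate so that the spike-derivative factor is exactly zero for most neurons, the sparse rule introduced in this work can skip those products entirely.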