# Fully Spiking Variational Autoencoder

Hiromichi Kamata¹, Yusuke Mukuta¹,², Tatsuya Harada¹,²
¹The University of Tokyo, ²RIKEN
{kamata, mukuta, harada}@mi.t.u-tokyo.ac.jp

## Abstract

Spiking neural networks (SNNs) can run on neuromorphic devices with ultra-high speed and ultra-low energy consumption because of their binary and event-driven nature. SNNs are therefore expected to have various applications, including generative models running on edge devices to create high-quality images. In this study, we build a variational autoencoder (VAE) with SNNs to enable image generation. VAEs are known for their stability among generative models, and the quality of their generated images has recently improved. In a vanilla VAE, the latent space is represented as a normal distribution, and floating-point calculations are required for sampling. However, this is not possible in SNNs, because all features must be binary time-series data. Therefore, we construct the latent space with an autoregressive SNN model and sample the latent variables by randomly selecting from its output. This makes the latent variables follow a Bernoulli process and enables variational learning. We thus build the Fully Spiking Variational Autoencoder, in which all modules are constructed with SNNs. To the best of our knowledge, we are the first to build a VAE only with SNN layers. We experimented on several datasets and confirmed that it can generate images of the same or better quality than conventional ANNs. The code is available at https://github.com/kamata1729/FullySpikingVAE.

## Introduction

Recently, artificial neural networks (ANNs) have been evolving rapidly and have achieved considerable success in computer vision and natural language processing. However, ANNs often require significant computational resources, which is a challenge in settings where such resources are limited, such as on edge devices. Spiking neural networks (SNNs) are neural networks that mimic the structure of the biological brain more closely than ANNs; notably, SNNs are referred to as the third generation of artificial intelligence (Maass 1997). In an SNN, all information is represented as binary time-series data, and processing is event-driven. Therefore, SNNs can run with ultra-high speed and ultra-low energy consumption on neuromorphic devices such as Loihi (Davies et al. 2018), TrueNorth (Akopyan et al. 2015), and Neurogrid (Benjamin et al. 2014). For example, on TrueNorth, the computation time is approximately 1/100 and the energy consumption approximately 1/100,000 of those of conventional ANNs (Cassidy et al. 2014).

Figure 1: Illustration of our FSVAE. The entire model is constructed with SNNs. All features are represented as spike trains, and the latent spike trains follow Bernoulli processes.

With the recent breakthroughs in ANNs, research on SNNs has also been progressing rapidly. Additionally, SNNs are now outperforming ANNs in accuracy on MNIST, CIFAR10, and ImageNet classification tasks (Zheng et al. 2021; Zhang and Li 2020). Moreover, SNNs have been used for object detection (Kim et al. 2020), sound classification (Wu et al. 2018), and optical flow estimation (Lee et al. 2020); however, their applications are still limited. In particular, image generation models based on SNNs have not been studied sufficiently.
Spiking GAN (Kotariya and Ganguly 2021) built a generator and a discriminator with shallow SNNs and generated images of handwritten digits by adversarial learning. However, its generation quality was low, and some of the generated images could not be interpreted as digits. In (Skatchkovsky, Simeone, and Jang 2021), an SNN was used as the encoder and an ANN as the decoder to build a VAE (Kingma and Welling 2014); however, the main focus of that work was efficient spike encoding, not the image generation task.

In ANNs, image generation models have been studied extensively and can generate high-quality images (Razavi, van den Oord, and Vinyals 2019; Karras et al. 2020). However, image generation models are generally computationally expensive, which is problematic for edge devices and for real-time generation. If SNNs can generate images of comparable quality to ANNs, their high speed and low energy consumption can address these problems.

Therefore, we propose the Fully Spiking Variational Autoencoder (FSVAE), which can generate images with the same or better quality than ANNs. VAEs are known for their stability among generative models and are related to the learning mechanism of the biological brain (Han et al. 2018); hence, the VAE is a natural generative model to build with SNNs. In our FSVAE, the entire model is built with SNNs so that it can be implemented on a neuromorphic device in the future. We conducted experiments on MNIST (Deng 2012), Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), CIFAR10 (Krizhevsky and Hinton 2009), and CelebA (Liu et al. 2015), and confirmed that FSVAE can generate images of equal or better quality than an ANN VAE of the same structure. FSVAE can be implemented on neuromorphic devices in the future and is expected to improve further in speed and energy consumption.

The most difficult aspect of building a VAE with SNNs is how to construct the latent space. In ANN VAEs, the latent space is usually represented as a normal distribution. However, within the framework of SNNs, sampling from a normal distribution is not possible, because all features must be binary time-series data. Therefore, we propose autoregressive Bernoulli spike sampling. First, we incorporate the idea of VRNN (Chung et al. 2015) into SNNs and build the prior and posterior models with autoregressive SNNs. The latent variables are then obtained by randomly selecting from the outputs of the autoregressive SNNs, which amounts to sampling from Bernoulli processes. This can be realized on neuromorphic devices because it does not require the floating-point calculations used for sampling in ANNs, and sampling with a random number generator is possible on actual neuromorphic devices (Wen et al. 2016; Davies et al. 2018). In addition, the latent variables can be sampled sequentially, so they can be fed to the decoder incrementally, which saves time.

The main contributions of this study are summarized as follows:

- We propose autoregressive Bernoulli spike sampling, which uses autoregressive SNNs and constructs the latent space as Bernoulli processes. This sampling method is feasible within the framework of SNNs.
- We propose the Fully Spiking Variational Autoencoder (FSVAE), in which all modules are constructed with SNNs.
- We experimented on multiple datasets; FSVAE could generate images of equal or better quality than an ANN VAE of the same architecture.
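To make the autoregressive Bernoulli spike sampling described above concrete, the following is a minimal PyTorch sketch under simplifying assumptions of our own: the autoregressive SNN is abstracted as a callable `prior_snn` that, given the previously sampled latent spike, returns `k` binary candidate spikes per latent dimension, and one candidate is selected uniformly at random. The names `prior_snn` and `sample_bernoulli_spikes`, the tensor shapes, and `k` are illustrative, not the interface of the released code.

```python
import torch


def sample_bernoulli_spikes(candidate_spikes: torch.Tensor) -> torch.Tensor:
    """Randomly pick one of k candidate output spikes per latent dimension.

    candidate_spikes: binary tensor of shape (batch, latent_dim, k), e.g. the
    k output channels of an autoregressive SNN at one timestep. Because each
    entry is 0 or 1, the selected value follows a Bernoulli distribution whose
    probability equals the mean of the k candidates.
    """
    batch, latent_dim, k = candidate_spikes.shape
    idx = torch.randint(k, (batch, latent_dim, 1), device=candidate_spikes.device)
    return torch.gather(candidate_spikes, dim=2, index=idx).squeeze(2)


def sample_latent_spike_train(prior_snn, batch: int, latent_dim: int, timesteps: int):
    """Sequentially sample a latent spike train z_{1:T} from an autoregressive SNN.

    prior_snn(z_prev) is assumed to return binary candidate spikes of shape
    (batch, latent_dim, k), conditioned on the previously sampled spike z_prev.
    """
    z_prev = torch.zeros(batch, latent_dim)
    z_train = []
    for _ in range(timesteps):
        candidates = prior_snn(z_prev)           # binary, (batch, latent_dim, k)
        z_prev = sample_bernoulli_spikes(candidates)
        z_train.append(z_prev)
    return torch.stack(z_train, dim=-1)          # (batch, latent_dim, timesteps)
```

Because every candidate is itself a binary spike, the randomly selected value is always a valid spike, so no Gaussian (floating-point) sampling is required; this is what makes the procedure implementable on neuromorphic hardware.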
## Related Work

### Development of SNNs

SNNs are neural networks that closely mimic the structure of the biological brain. In the biological brain, information is transmitted as spike trains (binary time-series data that are simply on or off). This information is transmitted between neurons via synapses and changes the receiving neuron's membrane potential; when the potential exceeds a threshold, the neuron fires and sends a spike train to the next neuron. SNNs mimic these characteristics of the biological brain, modeling biological neurons with differential equations and representing all features as spike trains. This allows SNNs to run fast and asynchronously, because they require fewer floating-point computations and only compute when an input spike arrives. SNNs can be regarded as recurrent neural networks (RNNs) with the membrane potential as their internal state.

Figure 2: LIF neuron. When spike trains from the previous layer's neurons arrive, the internal membrane potential $u_t$ changes according to Eq. (2). If $u_t$ exceeds $V_{th}$, the neuron fires a spike $o_t = 1$; otherwise, $o_t = 0$.

Learning algorithms for SNNs have been studied extensively in recent years. (Diehl and Cook 2015) used a two-layer SNN to recognize MNIST with STDP, an unsupervised learning rule, and achieved 95% accuracy. Later, (Wu et al. 2019) made it possible to train deep SNNs with backpropagation. Recently, (Zhang and Li 2020) exceeded the accuracy of ANNs on MNIST and CIFAR10 (Krizhevsky and Hinton 2009) with only 5 timesteps (the length of the spike trains), and (Zheng et al. 2021) achieved even higher accuracy with 2 timesteps on CIFAR10 and ImageNet (Deng et al. 2009).

### Spiking Neuron Model

Although there are several learning algorithms for SNNs, in this study we follow (Zheng et al. 2021), which currently achieves the highest recognition accuracy. First, as the neuron model, we use the iterative leaky integrate-and-fire (LIF) model (Wu et al. 2019), which is the LIF model (Stein and Hodgkin 1967) solved with the Euler method:

$$u_t = \tau_{decay} u_{t-1} + x_t \tag{1}$$

where $u_t$ is the membrane potential, $x_t$ is the presynaptic input, and $\tau_{decay}$ is a fixed decay factor. When $u_t$ exceeds a certain threshold $V_{th}$, the neuron fires and outputs $o_t = 1$, and $u_t$ is then reset to $u_{rest} = 0$. This can be written as follows:

$$u_{t,n} = \tau_{decay} u_{t-1,n} (1 - o_{t-1,n}) + x_{t,n-1} \tag{2}$$

$$o_{t,n} = H(u_{t,n} - V_{th}) \tag{3}$$

Here, $u_{t,n}$ is the membrane potential of the $n$-th layer, $o_{t,n}$ is its binary output, and $H$ is the Heaviside step function. The input $x_{t,n-1}$ is the weighted sum of the spikes of the neurons in the previous layer, $x_{t,n-1} = \sum_j w_j o^j_{t,n-1}$. By changing how the weights $w_j$ are connected, we can implement convolution layers, FC layers, and so on.

The next step is to enable learning with backpropagation. As Eq. (3) is non-differentiable, we approximate its derivative as follows:

$$\frac{\partial o_{t,n}}{\partial u_{t,n}} = \frac{1}{a} \, \mathrm{sign}\!\left( |u_{t,n} - V_{th}| < \frac{a}{2} \right) \tag{4}$$

where $a$ is the width of the rectangular gradient window.

### Variational Autoencoder

The Variational Autoencoder (VAE) (Kingma and Welling 2014) is a generative model that explicitly assumes a distribution of the latent variable $z$ behind the input $x$. Typically, the distribution $p(x|z)$ is represented by a deep neural network, so its inverse transformation is approximated by a simple approximate posterior $q(z|x)$. This allows us to compute the evidence lower bound (ELBO) of the log-likelihood:

$$\log p(x) \ge \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL[q(z|x) \,\|\, p(z)] \tag{5}$$

where $KL[Q \| P]$ is the Kullback-Leibler (KL) divergence between distributions $Q$ and $P$. In $q(z|x)$, the reparameterization trick is used to sample from $\mathcal{N}(\mu(x), \mathrm{diag}(\sigma(x)^2))$. VAEs exhibit stable learning among generative models and can be applied to various tasks, such as anomaly detection (An and Cho 2015). As VAEs can generate high-quality images (Razavi, van den Oord, and Vinyals 2019; Vahdat and Kautz 2020), we aimed to build a VAE using SNNs.
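As a concrete reference for Eq. (5), the following is a minimal sketch of how a conventional ANN VAE evaluates the ELBO with the reparameterization trick for a Gaussian latent. Here `encoder` and `decoder` are placeholder networks, and the Bernoulli pixel likelihood is just one common choice; this is background illustration, not the training objective used by FSVAE.

```python
import torch
import torch.nn.functional as F


def ann_vae_loss(x, encoder, decoder):
    """Negative ELBO of Eq. (5) for a Gaussian latent, via the reparameterization trick.

    Assumes encoder(x) -> (mu, logvar) and decoder(z) -> reconstruction logits for x.
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)   # reparameterization: z ~ N(mu, diag(sigma^2))
    recon_logits = decoder(z)

    # E_q(z|x)[log p(x|z)] with a Bernoulli likelihood on pixel intensities
    log_px_z = -F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL[ N(mu, diag(sigma^2)) || N(0, I) ] in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return -(log_px_z - kl)                 # loss to minimize (negative ELBO)
```

The Gaussian sampling and closed-form KL term above are precisely the floating-point operations that have no direct counterpart in an SNN, which motivates the Bernoulli latent construction used in this paper.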
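For completeness, the iterative LIF dynamics of Eqs. (1)–(3) and the rectangular surrogate gradient of Eq. (4) can be sketched in PyTorch as follows. The constants ($V_{th}$, $\tau_{decay}$, $a$) and the module structure are illustrative assumptions, not the exact implementation of (Zheng et al. 2021).

```python
import torch


class SpikeFn(torch.autograd.Function):
    """Heaviside spike of Eq. (3) with the rectangular surrogate gradient of Eq. (4)."""
    V_TH, A = 0.5, 1.0  # illustrative threshold and gradient window width

    @staticmethod
    def forward(ctx, u):
        ctx.save_for_backward(u)
        return (u >= SpikeFn.V_TH).float()            # o_t = H(u_t - V_th)

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        window = (torch.abs(u - SpikeFn.V_TH) < SpikeFn.A / 2).float()
        return grad_out * window / SpikeFn.A          # (1/a) * sign(|u - V_th| < a/2)


class LIFNeuron(torch.nn.Module):
    """Iterative LIF neuron of Eq. (2): leak, reset on spike, add presynaptic input."""

    def __init__(self, tau_decay: float = 0.8):
        super().__init__()
        self.tau_decay = tau_decay
        self.u = None   # membrane potential u_{t-1}
        self.o = None   # previous output spike o_{t-1}

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        if self.u is None:
            self.u = torch.zeros_like(x_t)
            self.o = torch.zeros_like(x_t)
        # Eq. (2): leak the potential, reset where the neuron spiked, add the input
        self.u = self.tau_decay * self.u * (1.0 - self.o) + x_t
        self.o = SpikeFn.apply(self.u)                # Eq. (3)
        return self.o
```

Placing such a neuron after an ordinary convolution or linear layer, which computes the weighted sum $x_{t,n-1} = \sum_j w_j o^j_{t,n-1}$, and unrolling over timesteps yields a deep SNN that can be trained with backpropagation through time.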
### Variational Recurrent Neural Network

The Variational Recurrent Neural Network (VRNN) (Chung et al. 2015) is a VAE for time-series data. Its posterior and prior distributions are set as follows:

$$q(z_{1:T} | x_{1:T}) = \prod_{t=1}^{T} q(z_t | x_{\le t}, z_{<t})$$