# FEARNET: BRAIN-INSPIRED MODEL FOR INCREMENTAL LEARNING

Published as a conference paper at ICLR 2018

Ronald Kemker and Christopher Kanan
Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY 14623, USA
{rmk6217,kanan}@rit.edu

Incremental class learning involves sequentially learning classes in bursts of examples from the same class. This violates the assumptions that underlie methods for training standard deep neural networks, and will cause them to suffer from catastrophic forgetting. Arguably, the best method for incremental class learning is iCaRL, but it requires storing training examples for each class, making it challenging to scale. Here, we propose FearNet for incremental class learning. FearNet is a generative model that does not store previous examples, making it memory efficient. FearNet uses a brain-inspired dual-memory system in which new memories are consolidated from a network for recent memories inspired by the mammalian hippocampal complex to a network for long-term storage inspired by the medial prefrontal cortex. Memory consolidation is inspired by mechanisms that occur during sleep. FearNet also uses a module inspired by the basolateral amygdala for determining which memory system to use for recall. FearNet achieves state-of-the-art performance at incremental class learning on image (CIFAR-100, CUB-200) and audio (AudioSet) classification benchmarks.

## 1 INTRODUCTION

In incremental classification, an agent must sequentially learn to classify training examples, without necessarily having the ability to re-study previously seen examples. While deep neural networks (DNNs) have revolutionized machine perception (Krizhevsky et al., 2012), off-the-shelf DNNs cannot incrementally learn classes due to catastrophic forgetting. Catastrophic forgetting is a phenomenon in which a DNN is unable to learn new data without forgetting much of its previously learned knowledge (McCloskey & Cohen, 1989). While methods have been developed to try to mitigate catastrophic forgetting, as shown in Kemker et al. (2018), these methods are not sufficient and perform poorly on larger datasets. In this paper, we propose FearNet, a brain-inspired system for incrementally learning categories that significantly outperforms previous methods.

The standard way of dealing with catastrophic forgetting in DNNs is to avoid it altogether by mixing new training examples with old ones and completely re-training the model offline. For large datasets, this may require weeks of time, and it is not a scalable solution. An ideal incremental learning system would be able to assimilate new information without needing to store the entire training dataset. A major application for incremental learning is real-time operation on board embedded platforms that have limited computing power, storage, and memory, e.g., smart toys, smartphone applications, and robots. For example, a toy robot may need to learn to recognize objects within its local environment that are of interest to its owner. Using cloud computing to overcome these resource limitations may pose privacy risks and may not scale to a large number of embedded devices. A better solution is on-device incremental learning, which requires the model to use less storage and computational power. In this paper, we propose an incremental learning framework called FearNet (see Fig. 1).
FearNet has three brain-inspired sub-systems: 1) a recent memory system for quick recall, 2) a memory system for long-term storage, and 3) a sub-system that determines which memory system to use for a particular example. FearNet mitigates catastrophic forgetting by consolidating recent memories into long-term storage using pseudorehearsal (Robins, 1995). Pseudorehearsal allows the network to revisit previous memories during incremental training without the need to store previous training examples, which is more memory efficient.

Figure 1: FearNet consists of three brain-inspired modules based on 1) mPFC (long-term storage), 2) HC (recent storage), and 3) BLA, which determines whether to use mPFC or HC for recall.

Problem Formulation: Here, incremental class learning consists of $T$ study-sessions. At time $t$, the learner receives a batch of data $B_t$, which contains $N_t$ labeled training samples, i.e., $B_t = \{(\mathbf{x}_j, y_j)\}_{j=1}^{N_t}$, where $\mathbf{x}_j \in \mathbb{R}^d$ is the input feature vector to be classified and $y_j$ is its corresponding label. The number of training samples $N_t$ may vary between sessions, and the data inside a study-session is not assumed to be independent and identically distributed (iid). During a study session, the learner only has access to its current batch, but it may use its own memory to store information from prior study sessions. We refer to the first session as the model's base-knowledge, which contains exemplars from $M \geq 1$ classes. The batches learned in all subsequent sessions contain only one class, i.e., all $y_j$ will be identical within those sessions.

Novel Contributions: Our contributions include:

1. FearNet's architecture includes three neural networks: one inspired by the hippocampal complex (HC) for recent memories, one inspired by the medial prefrontal cortex (mPFC) for long-term storage, and one inspired by the basolateral amygdala (BLA) that determines whether to use HC or mPFC for recall.
2. Motivated by memory replay during sleep, FearNet employs a generative autoencoder for pseudorehearsal, which mitigates catastrophic forgetting by generating previously learned examples that are replayed alongside novel information during consolidation. This process does not involve storing previous training data.
3. FearNet achieves state-of-the-art results on large image and audio datasets with a relatively small memory footprint, demonstrating how dual-memory models can be scaled.

## 2 RELATED WORK

Catastrophic forgetting in DNNs occurs due to the plasticity-stability dilemma (Abraham & Robins, 2005). If the network is too plastic, older memories will quickly be overwritten; however, if the network is too stable, it is unable to learn new data. This problem was recognized almost 30 years ago (McCloskey & Cohen, 1989). French (1999) extensively reviewed methods developed in the 1980s and 1990s and argued that mitigating catastrophic forgetting requires two separate memory centers: one for the long-term storage of older memories and another to quickly process new information as it comes in. He also theorized that this type of dual-memory system would be capable of consolidating memories from the fast-learning memory center to long-term storage.

Catastrophic forgetting often occurs when a system is trained on non-iid data. One strategy for reducing this phenomenon is to mix old examples with new examples, which simulates iid conditions.
For example, if the system learns ten classes in a study session and then needs to learn ten new classes in a later study session, one solution is to mix examples from the first study session into the later study session. This method is known as rehearsal, and it is one of the earliest methods for reducing catastrophic forgetting (Hetherington & Seidenberg, 1989). Rehearsal essentially uses an external memory to strengthen the model's representations for examples learned previously, so that they are not overwritten when learning data from new classes. Rehearsal reduces forgetting, but performance is still worse than that of offline models. Moreover, rehearsal requires storing all of the training data. Robins (1995) argued that storing training examples was inefficient and of little interest, so he introduced pseudorehearsal. Rather than replaying past training data, in pseudorehearsal the algorithm generates new examples for a given class. In Robins (1995), this was done by creating random input vectors, having the network assign them a label, and then mixing them into the new training data (a minimal sketch of this procedure appears below). This idea was revived in Draelos et al. (2017), where a generative autoencoder was used to create pseudo-examples for unsupervised incremental learning. This method inspired FearNet's approach to memory consolidation. Pseudorehearsal is related to memory replay in mammalian brains, which involves reactivation of recently encoded memories in HC so that they can be integrated into long-term storage in mPFC (Rasch & Born, 2013).

Recently there has been renewed interest in solving catastrophic forgetting in supervised learning. Many new methods are designed to mitigate catastrophic forgetting when each study session contains a permuted version of the entire training dataset (see Goodfellow et al. (2013)). Unlike incremental class learning, all labels are contained in each study session. PathNet uses an evolutionary algorithm to find the optimal path through a large DNN, and then freezes the weights along that path (Fernando et al., 2017). It assumes all classes are seen in each study session, and it is not capable of incremental class learning. Elastic Weight Consolidation (EWC) employs a regularization scheme that redirects plasticity to the weights that are least important to previously learned study sessions (Kirkpatrick et al., 2017). After EWC learns a study session, it uses the training data to build a Fisher matrix that determines the importance of each feature to the classification task it just learned. EWC was shown to work poorly at incremental class learning in Kemker et al. (2018). The Fixed Expansion Layer (FEL) model mitigates catastrophic forgetting by using sparse updates (Coop et al., 2013). FEL uses two hidden layers, where the second hidden layer (i.e., the FEL layer) has connectivity constraints. The FEL layer is much larger than the first hidden layer, is sparsely populated with excitatory and inhibitory weights, and is not updated during training. This limits the learning of dense shared representations, which reduces the risk of new learning interfering with old memories. FEL requires a large number of units to work well (Kemker et al., 2018).
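The pseudorehearsal procedure of Robins (1995) described above reduces to a few steps. The sketch below is only an illustration of the idea under assumed interfaces (a `model` object with scikit-learn-style `fit`/`predict` methods and an arbitrary input range); it is not code from any of the cited papers.

```python
import numpy as np

def pseudorehearsal_step(model, x_new, y_new, n_pseudo=128, input_dim=64):
    """Robins-style pseudorehearsal: label random vectors with the current
    model and mix them into the new data before retraining."""
    # 1) Create random input vectors; no stored training data is used.
    x_pseudo = np.random.uniform(-1.0, 1.0, size=(n_pseudo, input_dim))
    # 2) Let the current network assign labels, capturing its existing mapping.
    y_pseudo = model.predict(x_pseudo)
    # 3) Mix the pseudo-items with the genuinely new examples and retrain, so
    #    old input-output mappings are rehearsed alongside the new class.
    model.fit(np.concatenate([x_new, x_pseudo]),
              np.concatenate([y_new, y_pseudo]))
    return model
```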
Figure 2: iCaRL's performance depends heavily on the number of exemplars per class (EPC) that it stores. Reducing EPC from 20 (blue) to 1 (red) severely impairs its ability to recall older information (plot: mean-class accuracy on the first 100 classes versus the number of classes learned).

Gepperth & Karaoguz (2016) introduced a new approach for incremental learning, which we call GeppNet. GeppNet uses a self-organizing map (SOM) to reorganize the input onto a two-dimensional lattice. This serves as a long-term memory, which is fed into a simple linear layer for classification. After the SOM is initialized, it can only be updated if the input is sufficiently novel. This prevents the model from forgetting older data too quickly. GeppNet also uses rehearsal with all previous training data. A variant of GeppNet, GeppNet+STM, uses a fixed-size memory buffer to store novel examples. When this buffer is full, it replaces the oldest example. At pre-defined intervals, the buffer is used to train the model. GeppNet+STM is better at retaining base-knowledge since it only trains during its consolidation phase, but the STM-free version learns new data better because it updates the model on every novel labeled input.

iCaRL (Rebuffi et al., 2017) is an incremental class learning framework. Rather than directly using a DNN for classification, iCaRL uses it for supervised representation learning. During a study session, iCaRL updates a DNN using the study session's data and a set of J stored examples from earlier sessions (J = 2,000 for CIFAR-100 in their paper), which is a kind of rehearsal. After a study session, the J examples retained are carefully chosen using herding. After learning the entire dataset, iCaRL has retained J/T exemplars per class (e.g., J/T = 20 for CIFAR-100). The DNN in iCaRL is then used to compute an embedding for each stored example, and the mean embedding for each class seen so far is computed. To classify a new instance, the DNN computes an embedding for it, and the class with the nearest mean embedding is assigned. iCaRL's performance is heavily influenced by the number of examples it stores, as shown in Fig. 2.

## 3 MAMMALIAN MEMORY: NEUROSCIENCE AND MODELS

FearNet is heavily inspired by the dual-memory model of mammalian memory (McClelland et al., 1995), which has considerable experimental support from neuroscience (Frankland et al., 2004; Takashima et al., 2006; Kitamura et al., 2017; Bontempi et al., 1999; Taupin & Gage, 2002; Gais et al., 2007). This theory proposes that HC and mPFC operate as complementary memory systems, where HC is responsible for recalling recent memories and mPFC is responsible for recalling remote (mature) memories. GeppNet is the most recent DNN to be based on this theory, but the idea was also explored independently in the 1990s by French (1997) and Ans & Rousset (1997). In this section, we review some of the evidence for the dual-memory model.

One of the major reasons why HC is thought to be responsible for recent memories is that if HC is bilaterally destroyed, anterograde amnesia occurs while old memories for semantic information are preserved. One mechanism HC may use to facilitate the creation of new memories is adult neurogenesis, which occurs in HC's dentate gyrus (Altman, 1963; Eriksson et al., 1998). The new neurons have higher initial plasticity, but it reduces as time progresses (Deng et al., 2010). In contrast, mPFC is responsible for the recall of remote (long-term) memories (Bontempi et al., 1999). Taupin & Gage (2002) and Gais et al. (2007) showed that mPFC plays a strong role in memory consolidation during REM sleep.
McClelland et al. (1995) and Euston et al. (2012) theorized that, during sleep, HC reactivates recent memories to prevent forgetting, which causes these recent memories to replay in mPFC as well, with dreams possibly being caused by this process. After memories are transferred from HC to mPFC, evidence suggests that the corresponding memory in HC is erased (Poe, 2017). Recently, Kitamura et al. (2017) performed contextual fear conditioning (CFC) experiments in mice to trace the formation and consolidation of recent memories to long-term storage. CFC experiments involve shocking mice while subjecting them to various visual stimuli (i.e., colored lights). They found that BLA, which is responsible for regulating the brain's fear response, would shift where it retrieved the corresponding memory from (HC or mPFC) as that memory was consolidated over time. FearNet follows the memory consolidation theory proposed by Kitamura et al. (2017).

## 4 THE FEARNET MODEL

FearNet has two complementary memory centers: 1) a short-term memory system that immediately learns new information for recent recall (HC) and 2) a DNN for the storage of remote memories (mPFC). FearNet also has a separate BLA network that determines which memory center contains the associated memory required for prediction. During sleep phases, FearNet uses a generative model to consolidate data from HC to mPFC through pseudorehearsal. Pseudocode for FearNet is provided in the supplemental material. Because the focus of our work is not representation learning, we use pre-trained ResNet embeddings to obtain features that are fed to FearNet.

### 4.1 DUAL-MEMORY STORAGE

FearNet's HC model is a variant of a probabilistic neural network (Specht, 1990). HC computes class conditional probabilities using stored training examples. Formally, HC estimates the probability that an input feature vector $\mathbf{x}$ belongs to class $k$ as

$$P_{HC}(C = k \mid \mathbf{x}) = \frac{\beta_k}{\sum_{k'} \beta_{k'}}, \tag{1}$$

$$\beta_k = \begin{cases} \left(\epsilon + \min_j \left\| \mathbf{x} - \mathbf{u}_{k,j} \right\|_2\right)^{-1} & \text{if HC contains instances of class } k \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

where $\epsilon > 0$ is a regularization parameter and $\mathbf{u}_{k,j}$ is the $j$-th stored exemplar in HC for class $k$. All exemplars are removed from HC after they are consolidated into mPFC.

FearNet's mPFC is implemented using a DNN trained both to reconstruct its input using a symmetric encoder-decoder (autoencoder) and to compute $P_{mPFC}(C = k \mid \mathbf{x})$. The autoencoder enables us to use pseudorehearsal, which is described in more detail in Sec. 4.2.

Figure 3: The mPFC and BLA sub-systems in FearNet. mPFC is responsible for the long-term storage of remote memories. BLA is used at prediction time to determine whether the memory should be recalled from short- or long-term memory.

The loss function for mPFC is

$$\mathcal{L}_{mPFC} = \mathcal{L}_{class} + \mathcal{L}_{recon}, \tag{3}$$

where $\mathcal{L}_{class}$ is the supervised classification loss and $\mathcal{L}_{recon}$ is the unsupervised reconstruction loss, as illustrated in Fig. 3(a). For $\mathcal{L}_{class}$, we use the standard softmax loss. $\mathcal{L}_{recon}$ is the weighted sum of mean squared error (MSE) reconstruction losses from each layer, given by

$$\mathcal{L}_{recon} = \sum_{j=0}^{M} \lambda_j \frac{1}{H_j} \sum_{i=1}^{H_j} \left( h_{encoder,(i,j)} - h_{decoder,(i,j)} \right)^2, \tag{4}$$

where $M$ is the number of mPFC layers, $H_j$ is the number of hidden units in layer $j$, $h_{encoder,(i,j)}$ and $h_{decoder,(i,j)}$ are the outputs of the encoder and decoder at layer $j$ respectively, and $\lambda_j$ is the reconstruction weight for that layer. mPFC is similar to a Ladder Network (Rasmus et al., 2015), which combines classification and reconstruction to improve regularization, especially during low-shot learning.
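To make Eqs. 1-4 concrete, the sketch below implements the HC class-conditional probabilities and the layer-weighted reconstruction loss in NumPy. It is a simplified rendering of the paper's definitions, not the authors' TensorFlow implementation; the function names and the dictionary-of-exemplars data layout are assumptions.

```python
import numpy as np

def hc_predict_proba(x, hc_exemplars, eps=1e-3):
    """Eqs. 1-2: nearest-exemplar class probabilities from the HC store.
    hc_exemplars maps class k -> array of stored feature vectors u_{k,j}."""
    betas = {}
    for k, u_k in hc_exemplars.items():
        d_min = np.min(np.linalg.norm(u_k - x, axis=1))  # min_j ||x - u_{k,j}||_2
        betas[k] = 1.0 / (eps + d_min)
    total = sum(betas.values())
    return {k: b / total for k, b in betas.items()}      # beta_k / sum_k' beta_k'

def mpfc_recon_loss(h_enc, h_dec, lambdas):
    """Eq. 4: weighted per-layer MSE between encoder and decoder activations.
    h_enc[0] / h_dec[0] are the input and its final reconstruction."""
    return sum(lam * np.mean((e - d) ** 2)
               for lam, e, d in zip(lambdas, h_enc, h_dec))
```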
The $\lambda_j$ hyperparameters were found empirically, with $\lambda_0$ being largest and decreasing for deeper layers (see the supplemental material). This prioritizes the reconstruction task, which makes the generated pseudo-examples more realistic. When training is completed during a study session, all of the data in HC is pushed through the encoder to extract a dense feature representation of the original data, and we then compute a mean feature vector $\boldsymbol{\mu}_c$ and covariance matrix $\boldsymbol{\Sigma}_c$ for each class $c$. These are stored and used to generate pseudo-examples during consolidation (see Sec. 4.2). We study FearNet's performance as a function of how much data is stored in HC in Sec. 6.2.

### 4.2 PSEUDOREHEARSAL FOR MEMORY CONSOLIDATION

During FearNet's sleep phase, the original inputs stored in HC are transferred to mPFC using pseudo-examples created by an autoencoder. This process is known as intrinsic replay, and it was used by Draelos et al. (2017) for unsupervised learning. Using the class statistics from the encoder, pseudo-examples for class $c$ are generated by sampling a Gaussian with mean $\boldsymbol{\mu}_c$ and covariance matrix $\boldsymbol{\Sigma}_c$ to obtain $\hat{\mathbf{x}}_{rand}$. Then, $\hat{\mathbf{x}}_{rand}$ is passed through the decoder to generate a pseudo-example. To create a balanced training set, for each class that mPFC has learned, we generate $m$ pseudo-examples, where $m$ is the average number of examples per class stored in HC. The pseudo-examples are mixed with the data in HC, and the mixture is used to fine-tune mPFC using backpropagation (a sketch of this sampling step is given at the end of this section). After consolidation, all units in HC are deleted.

### 4.3 NETWORK SELECTION USING BLA

During prediction, FearNet uses the BLA network (Fig. 3(b)) to determine whether to classify an input $\mathbf{x}$ using HC or mPFC. This can be challenging because if HC has only been trained on one class, it will put all of its probability mass on that class, whereas mPFC will likely be less confident. The output of BLA, $A(\mathbf{x})$, is a value between 0 and 1, with 1 indicating that mPFC should be used. BLA is trained after each study session using only the data in HC and pseudo-examples generated with mPFC, using the same procedure described in Sec. 4.2. Instead of using BLA alone to determine which network to use, we found that combining its output with those of mPFC and HC improved results. The predicted class $\hat{y}$ is computed as

$$\hat{y} = \begin{cases} \arg\max_k P_{HC}(C = k \mid \mathbf{x}) & \text{if } \psi > \max_k P_{mPFC}(C = k \mid \mathbf{x}) \\ \arg\max_k P_{mPFC}(C = k \mid \mathbf{x}) & \text{otherwise,} \end{cases} \tag{5}$$

where

$$\psi = \frac{\max_k P_{HC}(C = k \mid \mathbf{x}) \, A(\mathbf{x})}{1 - A(\mathbf{x})}.$$

$\psi$ is the probability of the class according to HC weighted by the confidence that the associated memory is actually stored in HC. BLA has the same number of layers and units as the mPFC encoder, and uses a logistic output unit. We discuss alternative BLA models in the supplemental material.
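The consolidation step of Sec. 4.2 reduces to sampling from the stored class statistics and decoding. A minimal sketch, assuming a `decoder` callable that maps encoder-space vectors back to feature space and a dict of per-class (µc, Σc) statistics:

```python
import numpy as np

def generate_pseudo_examples(decoder, class_stats, m):
    """Sample m latent vectors per consolidated class from N(mu_c, Sigma_c)
    and decode them into pseudo-examples for rehearsal (Sec. 4.2)."""
    xs, ys = [], []
    for c, (mu, sigma) in class_stats.items():
        z = np.random.multivariate_normal(mu, sigma, size=m)  # x_hat_rand
        xs.append(decoder(z))            # decoded pseudo-examples for class c
        ys.append(np.full(m, c))
    return np.concatenate(xs), np.concatenate(ys)
```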
## 5 EXPERIMENTAL SETUP

Evaluating Incremental Learning Performance. To evaluate how well the incrementally trained models perform compared to an offline model, we use the three metrics proposed in Kemker et al. (2018). After each study session $t$ in which a model learned a new class $k$, we compute the model's test accuracy on the new class (αnew,t), the accuracy on the base-knowledge (αbase,t), and the accuracy on all of the test data seen to this point (αall,t). After all $T$ study sessions are complete, a model's ability to retain the base-knowledge is given by $\Omega_{base} = \frac{1}{T-1} \sum_{t=2}^{T} \frac{\alpha_{base,t}}{\alpha_{offline}}$, where αoffline is the accuracy of a multi-layer perceptron (MLP) trained offline (i.e., it is given all of the training data at once). The model's ability to immediately recall new information is measured by $\Omega_{new} = \frac{1}{T-1} \sum_{t=2}^{T} \alpha_{new,t}$. Finally, we measure how well the model does on all available test data with $\Omega_{all} = \frac{1}{T-1} \sum_{t=2}^{T} \frac{\alpha_{all,t}}{\alpha_{offline}}$. The Ωall metric shows how well new memories are integrated into the model over time. For all of the metrics, higher values indicate superior performance. Both Ωbase and Ωall are relative to an offline MLP model, so a value of 1 indicates that a model performs similarly to the offline baseline. This allows results across datasets to be compared more easily. Note that Ωbase > 1 and Ωall > 1 are possible only if the incremental learning algorithm is more accurate than the offline model, which can occur due to better regularization strategies employed by different models. (A short computational sketch of these metrics appears at the end of this section.)

Table 1: Dataset specifications.

|  | CIFAR-100 | CUB-200 | AudioSet |
|---|---|---|---|
| Classification Task | RGB Image | RGB Image | Audio |
| Classes | 100 | 200 | 100 |
| Feature Shape | 2,048 | 2,048 | 1,280 |
| Train Samples | 50,000 | 5,994 | 28,779 |
| Test Samples | 10,000 | 5,794 | 5,523 |
| Train Samples/Class | 500 | 29-30 | 250-300 |
| Test Samples/Class | 100 | 11-30 | 43-62 |

Datasets. We evaluate all of the models on three benchmark datasets (Table 1): CIFAR-100, CUB-200, and AudioSet. CIFAR-100 is a popular image classification dataset containing 100 mutually exclusive object categories, and it was used in Rebuffi et al. (2017) to evaluate iCaRL. All images are 32×32 pixels. CUB-200 is a fine-grained image classification dataset containing high-resolution images of 200 different bird species (Welinder et al., 2010); we use the 2011 version of the dataset. AudioSet is an audio classification dataset (Gemmeke et al., 2017). We use the variant of AudioSet used by Kemker et al. (2018), which contains a 100-class subset chosen so that none of the classes are super- or sub-classes of one another. Also, since AudioSet samples can have more than one class, the chosen samples have only one of the 100 classes in this subset. For CIFAR-100 and CUB-200, we extract ResNet-50 image embeddings as the input to each of the models, where ResNet-50 was pre-trained on ImageNet (He et al., 2016). We use the output after the mean pooling layer and normalize the features to unit length. For AudioSet, we use the audio CNN embeddings produced by pre-training the model on the YouTube-8M dataset (Abu-El-Haija et al., 2016). We use the pre-extracted AudioSet feature embeddings, which represent ten-second sound clips (i.e., ten 128-dimensional vectors concatenated in order).

Comparison Models. We compare FearNet to FEL, GeppNet, GeppNet+STM, iCaRL, and a one-nearest-neighbor classifier (1-NN). FEL, GeppNet, and GeppNet+STM were chosen due to their previously reported efficacy at incremental class learning in Kemker et al. (2018). iCaRL is explicitly designed for incremental class learning and represents the state of the art on this problem. We compare against 1-NN due to its similarity to our HC model. 1-NN does not forget any previously observed examples, but it tends to have worse generalization error than parametric methods and requires storing all of the training data.
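Before turning to results, the three metrics defined above can be computed directly from the per-session accuracies. A minimal sketch (the list-based interface is an assumption):

```python
import numpy as np

def incremental_metrics(acc_base, acc_new, acc_all, acc_offline):
    """Omega_base, Omega_new, Omega_all (Kemker et al., 2018). Each list holds
    the accuracies recorded after study sessions t = 2..T; acc_offline is the
    accuracy of the offline MLP baseline."""
    omega_base = np.mean(np.asarray(acc_base) / acc_offline)
    omega_new = np.mean(acc_new)          # not normalized by the offline baseline
    omega_all = np.mean(np.asarray(acc_all) / acc_offline)
    return omega_base, omega_new, omega_all
```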
Figure 4: Mean-class test accuracy on all classes seen so far for (a) CIFAR-100, (b) CUB-200, and (c) AudioSet. Each panel compares the offline MLP, 1-NN, GeppNet+STM, GeppNet, FEL, iCaRL, and FearNet as the number of classes grows.

In each of our experiments, all models take the same feature embedding as input for a given dataset. This required modifying iCaRL by turning its CNN into a fully connected network. We performed a hyperparameter search for each model/dataset combination to tune the number of units and layers (see the supplemental material).

Training Parameters. FearNet was implemented in TensorFlow. For mPFC and BLA, each fully connected layer uses an exponential linear unit (ELU) activation function (Clevert et al., 2016). The output of the encoder also connects to a softmax output layer. Xavier initialization is used to initialize all weight layers (Glorot & Bengio, 2010), and all of the biases are initialized to one. BLA's architecture is identical to mPFC's encoder, except that it has a logistic output unit instead of a softmax layer. mPFC and BLA were trained using NAdam. We train mPFC on the base-knowledge set for 1,000 epochs, consolidate HC over to mPFC for 60 epochs, and train BLA for 20 epochs. Because mPFC's decoder is vital to preserving memories, its learning rate is 100 times lower than the encoder's. We performed a hyperparameter search for each dataset and model, varying the model shape (64-1,024 units), depth (2-4 layers), and how often to sleep (see Sec. 6.2). Across datasets, mPFC and BLA performed best with two hidden layers, but the number of units per layer varied across datasets. The specific values used for each dataset are given in the supplemental material. In preliminary experiments, we found no benefit to adding weight decay to mPFC, likely because the reconstruction task helps regularize the model.

## 6 EXPERIMENTAL RESULTS

Unless otherwise noted, each class is seen in only one unique study-session, and the first base-knowledge study session contains half the classes in the dataset. We perform additional experiments to study how changing the number of base-knowledge classes affects performance in Sec. 6.2. Unless otherwise noted, FearNet sleeps every 10 study sessions across datasets.

### 6.1 STATE-OF-THE-ART COMPARISON

Table 2 shows incremental class learning summary results for all six methods. FearNet achieves the best Ωbase and Ωall on all three datasets. Fig. 4 shows that FearNet more closely resembles the offline MLP baseline than the other methods. Ωnew measures test accuracy on the most recently trained class.¹ For FearNet, this measures the performance of HC and BLA. Ωnew does not account for how well the class was consolidated into mPFC, which happens later during a sleep phase; however, Ωall does account for this. FEL achieves a high Ωnew score because it achieves nearly perfect test accuracy on every new class it learns, but this results in forgetting more quickly than FearNet. 1-NN is similar to our HC model, but on its own it fails to generalize as well as FearNet, is memory inefficient, and is slow to make predictions.
The final mean-class test accuracy of the offline MLP used to normalize the metrics is 69.9% for CIFAR-100, 59.8% for CUB-200, and 45.8% for AudioSet.

¹ Ωnew is not scaled by αoffline, so it does not have the same scale as Ωbase and Ωall.

Table 2: State-of-the-art comparison on CIFAR-100, CUB-200, and AudioSet. The best Ωall for each dataset is in bold. Ωbase and Ωall are normalized by the offline MLP baseline.

| Model | CIFAR-100 (Ωbase / Ωnew / Ωall) | CUB-200 (Ωbase / Ωnew / Ωall) | AudioSet (Ωbase / Ωnew / Ωall) | Mean (Ωbase / Ωall) |
|---|---|---|---|---|
| 1-Nearest Neighbor | 0.878 / 0.648 / 0.879 | 0.746 / 0.434 / 0.694 | 0.655 / 0.269 / 0.613 | 0.760 / 0.729 |
| GeppNet+STM | 0.866 / 0.408 / 0.800 | 0.764 / 0.204 / 0.645 | 0.941 / 0.372 / 0.861 | 0.857 / 0.769 |
| GeppNet | 0.833 / 0.529 / 0.754 | 0.727 / 0.558 / 0.645 | 0.932 / 0.499 / 0.879 | 0.831 / 0.759 |
| FEL | 0.707 / 0.999 / 0.619 | 0.702 / 0.976 / 0.641 | 0.491 / 1.000 / 0.456 | 0.633 / 0.572 |
| iCaRL | 0.746 / 0.807 / 0.749 | 0.942 / 0.547 / 0.864 | 0.740 / 0.487 / 0.733 | 0.801 / 0.782 |
| FearNet | 0.927 / 0.824 / **0.947** | 0.924 / 0.598 / **0.891** | 0.962 / 0.455 / **0.932** | 0.938 / 0.923 |

Table 3: FearNet performance when the location of the associated memory is known using an oracle versus using BLA.

| | CIFAR-100 Oracle | CIFAR-100 BLA | CUB-200 Oracle | CUB-200 BLA | AudioSet Oracle | AudioSet BLA |
|---|---|---|---|---|---|---|
| Ωbase | 0.965 | 0.927 | 0.968 | 0.924 | 0.970 | 0.962 |
| Ωnew | 0.912 | 0.824 | 0.729 | 0.598 | 0.701 | 0.455 |
| Ωall | 1.002 | 0.947 | 0.936 | 0.891 | 0.972 | 0.932 |

### 6.2 ADDITIONAL EXPERIMENTS

Novelty Detection with BLA. We evaluated the performance of BLA by comparing it to an oracle version of FearNet, i.e., a version that knows whether the relevant memory is stored in mPFC or HC. Table 3 shows that FearNet's BLA does a good job of predicting which network to use; however, the decrease in Ωnew suggests that BLA sometimes uses mPFC when it should be using HC.

Figure 5: FearNet performance as the sleep frequency decreases (x-axis: number of classes stored before sleep; curves show Ωbase, Ωnew, and Ωall).

When should the model sleep? To study how the frequency of memory consolidation affects FearNet's performance, we trained FearNet on CUB-200 and varied the sleep frequency from 1 to 15 study sessions. When FearNet increases the number of classes it learns before sleeping (Fig. 5), it is better able to retain its base-knowledge, but this reduces its ability to recall new information. In humans, sleep deprivation is known to impair new learning (Yoo et al., 2007), and forgetting occurs during sleep (Poe, 2017). Each time FearNet sleeps, the mPFC weights are perturbed, which can cause it to gradually forget older memories. Sleeping less causes HC's recall performance to deteriorate.

Table 4: Multi-modal incremental learning experiment. FearNet was trained with various base-knowledge sets (column header) and then incrementally trained on all remaining data.

| | CIFAR-100 | AudioSet | 50/50 Mix |
|---|---|---|---|
| Ωbase | 0.995 | 0.845 | 0.837 |
| Ωnew | 0.693 | 0.903 | 0.822 |
| Ωall | 0.854 | 0.634 | 0.820 |

Multi-Modal Incremental Learning. As shown in Sec. 6.1, FearNet can incrementally learn and retain information from a single dataset, but how does it perform if new inputs differ greatly from previously learned ones? This scenario is one of the first shown to cause catastrophic forgetting in MLPs. To study this, we trained FearNet to incrementally learn CIFAR-100 and AudioSet, which after training yields a 200-way classification problem. To do this, AudioSet's features are zero-padded to make them the same length as CIFAR-100's.
Table 4 shows the performance of FearNet for three separate training paradigms: 1) FearNet learns CIFAR-100 as the base-knowledge and then incrementally learns AudioSet; 2) FearNet learns AudioSet as the base-knowledge and then incrementally learns CIFAR-100; and 3) the base-knowledge contains a 50/50 split from both datasets, with FearNet incrementally learning the remaining classes. Our results suggest FearNet is capable of incrementally learning multi-modal information if the model has a good starting point (high base-knowledge performance); however, if the model starts with lower base-knowledge performance (e.g., AudioSet), it struggles to learn new information incrementally (see the supplemental material for detailed plots).

Base-Knowledge Effect on Performance. In this section, we examine how the size of the base-knowledge (i.e., the number of classes) affects FearNet's performance on CUB-200. To do this, we varied the size of the base-knowledge from 10 to 150 classes, with the remaining classes learned incrementally. Detailed plots are provided in the supplemental material. As the base-knowledge size increases, there is a noticeable increase in overall model performance because 1) mPFC has a better learned representation from a larger quantity of data and 2) there are not as many incremental learning steps remaining for the dataset, so the base-knowledge performance is less perturbed.

## 7 DISCUSSION

FearNet's mPFC is trained both to discriminate examples and to generate new examples. While the main use of mPFC's generative abilities is to enable pseudorehearsal, this ability may also help make the model more robust to catastrophic forgetting. Gillies (1991) observed that unsupervised networks are more robust (but not immune) to catastrophic forgetting because there are no target outputs to be forgotten. Since the pseudo-example generator is learned as an unsupervised reconstruction task, this could explain why FearNet is slow to forget old information.

Table 5: Memory required to train CIFAR-100 and the amount of memory that would be required if these models were trained up to 1,000 classes.

| Model | 100 Classes | 1,000 Classes |
|---|---|---|
| 1-NN | 4.1 GB | 40.9 GB |
| GeppNet+STM | 4.1 GB | 41.0 GB |
| GeppNet | 4.1 GB | 41.0 GB |
| FEL | 272.5 MB | 395.0 MB |
| iCaRL | 17.6 MB | 166.0 MB |
| FearNet | 10.7 MB | 74.4 MB |

Table 5 shows the memory requirements for each model in Sec. 6.1 for learning CIFAR-100 and a hypothetical extrapolation to learning 1,000 classes. This accounting covers a fixed model capacity plus storage of any data or class statistics. FearNet's memory footprint is comparatively small because it only stores class statistics rather than some or all of the raw training data, which makes it better suited for deployment. An open question is how to deal with the storage and updating of class statistics if classes are seen in more than one study session. One possibility is to use a running update for the class means and covariances (a sketch of one such update appears below), but it may be better to favor the data from the most recent study session due to continued learning in the autoencoder. FearNet assumes that the output of the mPFC encoder is normally distributed for each class, which may not be the case. It would be interesting to model the classes with a more complex model, e.g., a Gaussian mixture model.

Robins (1995) showed that pseudorehearsal worked reasonably well with randomly generated vectors because they were associated with the weights of a given class. Replaying these vectors strengthened their corresponding weights, which could be what is happening with the pseudo-examples generated by FearNet's decoder.
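One possible form of the running update mentioned above is a Welford-style incremental estimate of each class mean and covariance. This is only a sketch of that option, not something evaluated in the paper:

```python
import numpy as np

def update_class_stats(mu, sigma, n, x):
    """Fold one new encoded example x into a class's running mean and sample
    covariance (assumes n >= 1 examples have already been absorbed)."""
    n_new = n + 1
    delta = x - mu                          # deviation from the old mean
    mu_new = mu + delta / n_new
    # Rank-one update of the (n-1)-normalized sample covariance.
    sigma_new = (sigma * max(n - 1, 0) + np.outer(delta, x - mu_new)) / (n_new - 1)
    return mu_new, sigma_new, n_new
```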
Table 6: Using a diagonal covariance matrix for FearNet's class statistics instead of a full covariance matrix on CIFAR-100.

| | Full Covariance | Diagonal Covariance |
|---|---|---|
| Ωbase | 0.942 | 0.781 |
| Ωnew | 0.805 | 0.877 |
| Ωall | 0.959 | 0.800 |
| Model Size | 10.7 MB | 3.8 MB |

The largest impact on model size is the stored covariance matrix Σc for each class. We tested a variant of FearNet that uses a diagonal Σc instead of a full covariance matrix. Table 6 shows that performance degrades, but FearNet still works.

FearNet can be adapted to other paradigms, such as unsupervised learning and regression. For unsupervised learning, FearNet's mPFC already performs a form of it implicitly. For regression, this would require changing mPFC's loss function and may require grouping input feature vectors into similar collections. FearNet could also be adapted to perform the supervised data-permutation experiment of Goodfellow et al. (2013) and Kirkpatrick et al. (2017). This would likely require storing statistics from previous permutations and classes. FearNet would sleep between learning different permutations; however, if the number of classes was high, recent recall might suffer.

## 8 CONCLUSION

In this paper, we proposed a brain-inspired framework capable of incrementally learning data with different modalities and object classes. FearNet outperforms existing methods for incremental class learning on large image and audio classification benchmarks, demonstrating that it is capable of recalling and consolidating recently learned information while also retaining old information. In addition, we showed that FearNet is memory efficient, making it well suited to platforms where size, weight, and power are limited. Future work will include 1) integrating BLA directly into the model (versus training it independently); 2) replacing HC with a semi-parametric model; 3) learning the feature embedding from raw inputs; and 4) replacing the pseudorehearsal mechanism with a generative model that does not require the storage of class statistics, which would be more memory efficient.

## REFERENCES

Wickliffe C. Abraham and Anthony Robins. Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):73-78, 2005.

Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, et al. YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675, 2016.

Joseph Altman. Autoradiographic investigation of cell proliferation in the brains of rats and cats. The Anatomical Record, 145(4):573-591, 1963.

Bernard Ans and Stéphane Rousset. Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l'Académie des Sciences - Series III - Sciences de la Vie, 320(12):989-997, 1997.

Bruno Bontempi, Catherine Laurent-Demir, Claude Destrade, and Robert Jaffard. Time-dependent reorganization of brain circuitry underlying long-term memory storage. Nature, 400(6745):671-675, 1999.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.

Robert Coop, Aaron Mishtal, and Itamar Arel. Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting. IEEE Transactions on Neural Networks and Learning Systems, 24(10):1623-1634, 2013.
Wei Deng, James B. Aimone, and Fred H. Gage. New neurons and new memories: how does adult hippocampal neurogenesis affect learning and memory? Nature Reviews Neuroscience, 11(5):339-350, 2010.

Timothy J. Draelos, Nadine E. Miner, Christopher C. Lamb, Jonathan A. Cox, Craig M. Vineyard, Kristofor D. Carlson, William M. Severa, Conrad D. James, and James B. Aimone. Neurogenesis deep learning: Extending deep networks to accommodate new classes. In International Joint Conference on Neural Networks, pp. 526-533. IEEE, 2017.

Peter S. Eriksson, Ekaterina Perfilieva, Thomas Björk-Eriksson, Ann-Marie Alborn, Claes Nordborg, Daniel A. Peterson, and Fred H. Gage. Neurogenesis in the adult human hippocampus. Nature Medicine, 4(11):1313-1317, 1998.

David R. Euston, Aaron J. Gruber, and Bruce L. McNaughton. The role of medial prefrontal cortex in memory and decision making. Neuron, 76(6):1057-1070, 2012.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv:1701.08734, 2017.

Paul W. Frankland, Bruno Bontempi, Lynn E. Talton, Leszek Kaczmarek, and Alcino J. Silva. The involvement of the anterior cingulate cortex in remote contextual fear memory. Science, 304(5672):881-883, 2004.

Robert M. French. Pseudo-recurrent connectionist networks: An approach to the sensitivity-stability dilemma. Connection Science, 9(4):353-380, 1997.

Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.

Steffen Gais, Geneviève Albouy, Mélanie Boly, Thien Thanh Dang-Vu, Annabelle Darsaud, Martin Desseilles, Géraldine Rauchs, Manuel Schabus, Virginie Sterpenich, Gilles Vandewalle, et al. Sleep transforms the cerebral trace of declarative memories. Proceedings of the National Academy of Sciences, 104(47):18778-18783, 2007.

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, New Orleans, LA, 2017.

Alexander Gepperth and Cem Karaoguz. A bio-inspired incremental learning architecture for applied perceptual problems. Cognitive Computation, 8(5):924-934, 2016.

A. J. Gillies. The Stability/Plasticity Dilemma in Self-organising Neural Networks. MSc thesis, Computer Science Department, University of Otago, New Zealand, 1991.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211, 2013.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.

P. Hetherington and Mark S. Seidenberg. Is there catastrophic interference in connectionist networks? In Proceedings of the 11th Annual Conference of the Cognitive Science Society, volume 26, pp. 33. Erlbaum, Hillsdale, NJ, 1989.

Ronald Kemker, Marc McClure, Angelina Abitino, Tyler Hayes, and Christopher Kanan. Measuring catastrophic forgetting in neural networks. In AAAI, 2018.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.
Takashi Kitamura, Sachie K. Ogawa, Dheeraj S. Roy, Teruhiro Okuyama, Mark D. Morrissey, Lillian M. Smith, Roger L. Redondo, and Susumu Tonegawa. Engrams and circuits crucial for systems consolidation of a memory. Science, 356(6333):73-78, 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097-1105, 2012.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Eighth IEEE International Conference on Data Mining (ICDM '08), pp. 413-422. IEEE, 2008.

James L. McClelland, Bruce L. McNaughton, and Randall C. O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3):419, 1995.

Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109-165, 1989.

Gina R. Poe. Sleep is for forgetting. Journal of Neuroscience, 37(3):464-473, 2017.

Björn Rasch and Jan Born. About sleep's role in memory. Physiological Reviews, 93(2):681-766, 2013.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, pp. 3546-3554, 2015.

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017.

Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123-146, 1995.

Peter J. Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212-223, 1999.

Bernhard Schölkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.

Donald F. Specht. Probabilistic neural networks. Neural Networks, 3(1):109-118, 1990.

A. Takashima, Karl Magnus Petersson, F. Rutters, I. Tendolkar, O. Jensen, M. J. Zwarts, B. L. McNaughton, and G. Fernandez. Declarative memory consolidation in humans: a prospective functional magnetic resonance imaging study. Proceedings of the National Academy of Sciences of the United States of America, 103(3):756-761, 2006.

Philippe Taupin and Fred H. Gage. Adult neurogenesis and neural stem cells of the central nervous system in mammals. Journal of Neuroscience Research, 69(6):745-749, 2002.

P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

Seung-Schik Yoo, Peter T. Hu, Ninad Gujar, Ferenc A. Jolesz, and Matthew P. Walker. A deficit in the ability to form new human memories without sleep. Nature Neuroscience, 10(3):385-392, 2007.

## A SUPPLEMENTAL MATERIAL

### A.1 MODEL HYPERPARAMETERS

Table S1 shows the training parameters for the FearNet model for each dataset. We also experimented with various dropout rates, weight decay settings, and activation functions; however, weight decay did not work well with FearNet's mPFC.
Table S1: FearNet training parameters.

| Hyperparameter | Value |
|---|---|
| Learning Rate | 2 × 10⁻³ |
| Mini-Batch Size | 450 (AudioSet & CIFAR-100), 200 (CUB-200) |
| mPFC Base-Knowledge Epochs | 1,000 |
| Memory Consolidation Epochs | 60 |
| BLA Training Epochs | 20 |
| Hidden Layer Size | CIFAR-100: [140, 130]; CUB-200: [350, 300]; AudioSet: [300, 100] |
| Sleep Frequency | 10 (see Sec. 6.2) |
| Dropout Rate | 0.25 |
| Unsupervised Loss Weights (λ) | 10⁴, 1.0, 0.1 |
| Hidden Layer Activation | Exponential Linear Units |
| Weight Decay | 0.0 |

Table S2 shows the training parameters for the iCaRL framework used in this paper. We adapted the code from the authors' GitHub page for our own experiments. The ResNet-18 convolutional neural network was replaced with a fully connected neural network. We experimented with various regularization strategies to increase the initial base-knowledge accuracy, with weight decay working the best. Values given as ranges are the hyperparameter search spaces.

Table S2: iCaRL training parameters.

| Hyperparameter | Value |
|---|---|
| Learning Rate | 2 × 10⁻³ |
| Mini-Batch Size | 450 |
| Exemplars per Class (EPC) | 20 |
| Hidden Layer Size | 64-1,024 |
| Number of Hidden Layers | 2-4 |
| Dropout Rate | [0.5, 0.75, 1.00] |
| Hidden Layer Activation | ReLU |
| Weight Decay | 0.0, 10⁻⁵, 10⁻⁴, 5 × 10⁻⁴ |

Table S3 shows the training parameters for GeppNet and GeppNet+STM. Parameters not listed here are the default parameters defined by Gepperth & Karaoguz (2016). Values given as ranges are the hyperparameter search spaces.

Table S3: GeppNet training parameters.

| Hyperparameter | Value |
|---|---|
| SOM Lattice Shape (N) | 20-36 |
| Non-Linearity Suppression Threshold (θ) | 0.1-0.75 |
| Incremental Class Learning Iterations (T_inc2 − T_inc1) | [2,000, 20,000] |

Table S4 shows the training parameters for the Fixed Expansion Layer (FEL) model. The number of units in the FEL layer is given by

$$\text{FEL Units} = H^2 + HK,$$

where $H$ is the number of units in the first hidden layer and $K$ is the maximum number of classes in the dataset. Values given as ranges are the hyperparameter search spaces.

Table S4: FEL training parameters.

| Hyperparameter | Value |
|---|---|
| Hidden Layer Size (H) | 64-1,800 |
| FEL Layer Size | H² + HK (see above) |
| Number of Hidden Layers | 2 |
| Mini-Batch Size | 8 |
| Initial Learning Rate | 10⁻² |

### A.2 ICARL PERFORMANCE WITH MORE EXEMPLARS

Table S5 provides additional experimental results for the iCaRL framework when more exemplars per class (EPC) are stored. Rebuffi et al. (2017) used 20 EPC in their original paper; we increased the number to 100 EPC to see whether storing more training data helps iCaRL. Although a higher EPC does increase iCaRL's performance, it still does not outperform FearNet. Note that CUB-200 only has about 30 training samples per class, so at 100 EPC iCaRL is storing the entire CUB-200 training set. Our main results use the default value of 20.

Table S5: iCaRL's performance when the stored EPC is increased from 20 to 100.

| Model | CIFAR-100 (Ωbase / Ωnew / Ωall) | CUB-200 (Ωbase / Ωnew / Ωall) | AudioSet (Ωbase / Ωnew / Ωall) | Mean (Ωbase / Ωall) |
|---|---|---|---|---|
| iCaRL (20 EPC) | 0.746 / 0.807 / 0.749 | 0.942 / 0.547 / 0.864 | 0.740 / 0.487 / 0.733 | 0.801 / 0.782 |
| iCaRL (100 EPC) | 0.842 / 0.719 / 0.822 | 0.951 / 0.554 / 0.882 | 0.820 / 0.419 / 0.771 | 0.871 / 0.825 |
| FearNet | 0.927 / 0.824 / 0.947 | 0.924 / 0.598 / 0.891 | 0.962 / 0.455 / 0.932 | 0.938 / 0.923 |

### A.3 BLA VARIANTS

Our BLA model is a classifier that determines whether a prediction should be made using HC (recent memory) or mPFC (remote memory). An alternative approach would be to use an outlier detection algorithm that determines whether the data being processed by a sub-network is an outlier for that sub-network and should therefore be processed by the other sub-network.
To explore this alternative BLA formulation, we experimented with three outlier detection algorithms: 1) a one-class support vector machine (SVM) (Schölkopf et al., 2001), 2) testing whether the data fits a Gaussian distribution using minimum covariance determinant estimation (i.e., an elliptic envelope) (Rousseeuw & Driessen, 1999), and 3) the isolation forest (Liu et al., 2008). All three of these methods set a rejection criterion for whether the test sample exists in HC, whereas the binary MLP reports a probability of how likely it is that the test sample resides in HC. Table S6 compares these methods. Isolation Forest and Elliptic Envelope tend to prefer the data in HC, the one-class SVM prefers the data in mPFC, and our binary MLP worked best at choosing the correct sub-network.

Table S6: Performance of different BLA variants.

| BLA Method | Ωbase | Ωnew | Ωall |
|---|---|---|---|
| Isolation Forest | 0.328 | 0.823 | 0.368 |
| Elliptic Envelope | 0.518 | 0.823 | 0.541 |
| One-Class SVM | 0.718 | 0.433 | 0.702 |
| Binary MLP | 0.927 | 0.924 | 0.947 |

### A.4 FEARNET ALGORITHM

Pseudocode for FearNet's training and prediction procedures is given in Algorithms 1 and 2, respectively. The variables match the ones defined in the paper.

Algorithm 1: FearNet Training
    Data: X, y; classes/study-sessions T; sleep frequency K
    Initialize mPFC with the base-knowledge
    Store µc, Σc for each class in the base-knowledge
    for c = T/2 to T do
        Store X, y for class c in HC
        if c mod K == 0 then
            Fine-tune mPFC with the X, y in HC and pseudo-examples generated by the mPFC decoder
            Update µc, Σc for all classes seen so far
            Clear HC
        else
            Update BLA

Algorithm 2: FearNet Prediction
    Data: X
    A(X) ← P_BLA(C = 1 | X)
    ψ ← max_k P_HC(C = k | X) · A(X) / (1 − A(X))
    if ψ > max_k P_mPFC(C = k | X) then
        return argmax_k P_HC(C = k | X)
    else
        return argmax_k P_mPFC(C = k | X)
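The pseudocode above can be expressed as a compact Python sketch. Everything here is schematic: the `mpfc`, `hc`, and `bla` objects and their methods are assumed interfaces rather than the released implementation.

```python
import numpy as np

def fearnet_train(mpfc, hc, bla, sessions, sleep_every=10):
    """Algorithm 1 (schematic): incremental training with periodic sleep.
    `sessions` is an iterator of (X, y) batches; the first is the base-knowledge."""
    X0, y0 = next(sessions)
    mpfc.fit(X0, y0)                        # train mPFC on the base-knowledge
    mpfc.store_class_stats(X0, y0)          # store mu_c, Sigma_c per class
    for t, (X, y) in enumerate(sessions, start=1):
        hc.store(X, y)                      # recent memories live in HC
        if t % sleep_every == 0:            # sleep: consolidate HC into mPFC
            Xh, yh = hc.dump()
            Xp, yp = mpfc.generate_pseudo_examples(m=hc.mean_examples_per_class())
            mpfc.finetune(np.concatenate([Xh, Xp]), np.concatenate([yh, yp]))
            mpfc.store_class_stats(Xh, yh)  # update statistics for the new classes
            hc.clear()
        else:
            bla.fit(hc, mpfc)               # retrain the HC-vs-mPFC gate

def fearnet_predict(mpfc, hc, bla, x):
    """Algorithm 2 (schematic): route recall through HC or mPFC via BLA."""
    a = bla.predict_proba(x)                # A(x), in (0, 1)
    p_hc = hc.predict_proba(x)              # P_HC(C = k | x)
    p_mpfc = mpfc.predict_proba(x)          # P_mPFC(C = k | x)
    psi = np.max(p_hc) * a / (1.0 - a + 1e-8)   # small guard against A(x) = 1
    return int(np.argmax(p_hc)) if psi > np.max(p_mpfc) else int(np.argmax(p_mpfc))
```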
### A.5 MULTI-MODAL LEARNING EXPERIMENT

Fig. S1 shows the plots for the multi-modal experiments in Sec. 6.2. The three base-knowledge experiments were: 1) CIFAR-100 is the base-knowledge and AudioSet is trained incrementally, 2) AudioSet is the base-knowledge and CIFAR-100 is trained incrementally, and 3) the base-knowledge is a 50/50 mix of the two datasets and the remaining classes are trained incrementally. For all three base-knowledge experiments, we show the mean-class accuracy on the base-knowledge and on the entire test set. FearNet works well when it adequately learns the base-knowledge (Experiments #1 and #3); however, when FearNet learns the base-knowledge poorly, incremental learning deteriorates.

Figure S1: Detailed plots for the multi-modal experiment, comparing FearNet to the offline MLP as the number of classes grows from 100 to 200. The top row is when the base-knowledge was CIFAR-100, the middle row is when the base-knowledge was AudioSet, and the bottom row is when the base-knowledge was a 50/50 mix of the two datasets. The left column shows mean-class accuracy on the base-knowledge test set and the right column shows mean-class accuracy on the entire test set.

### A.6 BASE-KNOWLEDGE EFFECT ON PERFORMANCE

Fig. S2 shows the effect of the base-knowledge's size on FearNet's performance. As expected, Ωbase increases because there are not as many sleep phases to overwrite the existing base-knowledge. Ωnew remains relatively even because the size of the base-knowledge has no effect on the HC model's ability to immediately recall new information; however, there is a very slight decrease that corresponds to the BLA model erroneously favoring mPFC in a few cases. Most importantly, Ωall increases because, like Ωbase, there are not as many sleep phases to perturb older memories in mPFC.

Figure S2: FearNet performance (Ωbase, Ωnew, Ωall) as a function of the base-knowledge size (number of classes in the base-knowledge).