Random Path Selection for Incremental Learning

Jathushan Rajasegaran, Munawar Hayat, Salman Khan, Fahad Shahbaz Khan, Ling Shao
Inception Institute of Artificial Intelligence
first.last@inceptioniai.org

Abstract

Incremental life-long learning is a main challenge towards the long-standing goal of Artificial General Intelligence. In real-life settings, learning tasks arrive in a sequence and machine learning models must continually learn to increment already acquired knowledge. Existing incremental learning approaches fall well below the state-of-the-art cumulative models that use all training classes at once. In this paper, we propose a random path selection algorithm, called RPS-Net, that progressively chooses optimal paths for the new tasks while encouraging parameter sharing. Since the reuse of previous paths enables forward knowledge transfer, our approach requires a considerably lower computational overhead. As an added novelty, the proposed model integrates knowledge distillation and retrospection along with the path selection strategy to overcome catastrophic forgetting. In order to maintain an equilibrium between previous and newly acquired knowledge, we propose a simple controller to dynamically balance the model plasticity. Through extensive experiments, we demonstrate that the proposed method surpasses the state-of-the-art performance on incremental learning and, by utilizing parallel computation, can run in constant time with nearly the same efficiency as a conventional deep convolutional neural network.

1 Introduction

The ability to incrementally learn novel tasks and acquire new knowledge is necessary for life-long machine learning. Deep neural networks suffer from catastrophic forgetting [18], a phenomenon that occurs when a network is sequentially trained on a series of tasks and the learning acquired on new tasks interferes with the previously learned concepts.
As an example, in a typical transfer learning scenario, when a model pre-trained on a source task is adapted to another task by fine-tuning its weights, its performance significantly degrades on the source task, whose weights are overridden by the newly learned parameters [13]. It is, therefore, necessary to develop continual learning models capable of incrementally adding newly available classes without the need to retrain models from scratch using all previous class-sets (a cumulative setting).

An ideal incremental learning model must meet the following criteria. (a) As the model is trained on new tasks, it must maintain its performance on the old ones, thus avoiding catastrophic forgetting. (b) The knowledge acquired on old tasks should help accelerate learning on new tasks (a.k.a. forward transfer) and vice versa. (c) As class-incremental learning progresses, the network must share and reuse previously tuned parameters to realize a bounded computational complexity and memory footprint. (d) At all learning phases, the model must maintain a tight equilibrium between the existing knowledge base and newly presented information (the stability-plasticity dilemma).

Code available at https://github.com/brjathu/RPSnet

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Despite several attempts, existing incremental learning models only partially address the above requirements. For example, [16] employs a distillation loss to preserve knowledge across multiple tasks, but requires prior knowledge about the task corresponding to a test sample during inference. An incremental classifier and representation learning approach [21] jointly uses distillation and prototype rehearsal, but retrains the complete network for new tasks, thus compromising model stability. The progressive network [22] lacks scalability, as it grows paths linearly (and parameters quadratically) with the number of tasks.
The elastic weight consolidation scheme [15] computes synaptic importance offline using the Fisher information metric, thus restricting its scalability; while it works well for permutation tasks, its performance suffers on class-incremental learning [12].

Here, we argue that the most important characteristic of a true incremental learner is to maintain the right trade-off between stability (leading to intransigence) and plasticity (resulting in forgetting). We achieve this requisite via a dynamic path selection approach, called RPS-Net, that proceeds with random candidate paths and discovers the optimal one for a given task. Once a task is learned, we fix the parameters associated with it, which can only be shared, not modified, by future tasks. To complement the previously learned representations, we propose a stacked residual design that focuses on learning the supplementary features suitable for new tasks. Besides, our learning scheme leverages exemplar-based retrospection and introduces an explicit controller module to maintain the equilibrium between stability and plasticity across all tasks. During training, our approach always operates within a constant parameter budget that at most equals that of a conventional linear model (e.g., a ResNet [6]). Furthermore, it can be straightforwardly parallelized during both the train and test stages. With these novelties, our approach obtains state-of-the-art class-incremental learning results, surpassing the previous best model [21] by 7.38% and 10.64% on the CIFAR-100 and ImageNet datasets, respectively.

Our main contributions are:
- A random path selection approach that provides faster convergence through path sharing and reuse.
- A residual learning framework that incrementally learns residual paths, allowing network reuse and accelerating the learning process, resulting in faster training.
- A hybrid approach that combines the respective strengths of knowledge distillation (via regularization), retrospection (via exemplar replay) and dynamic architecture selection to deliver a strong incremental learning performance.
- A novel controller that guides the plasticity of the network to maintain an equilibrium between the previously learned knowledge and the newly presented tasks.

2 Related Work

The catastrophic interference problem was first noted to hinder the learning of connectionist networks by [18]. This highlights the stability-plasticity dilemma in neural networks [1], i.e., a rigid and stable model will not be able to learn new concepts, while an easily adaptable model is susceptible to forgetting old concepts due to major parameter changes. Existing continual learning schemes can be divided into three broad categories: (a) regularization schemes, (b) memory-based retrospection and replay, and (c) dynamic sub-network training and expansion.

A major trend in continual learning research has been to propose novel regularization schemes that avoid catastrophic forgetting by controlling the plasticity of network weights. [16] proposed a knowledge distillation loss [7] which forces the network to retain its predictions on the old tasks. Kirkpatrick et al. [15] proposed an elastic weight consolidation mechanism that quantifies the relevance of parameters to a particular task and correspondingly adjusts the learning rate. In a similar spirit, [28] designed intelligent synapses which measure their relevance to a particular task and consequently adjust plasticity during learning to minimize interference with old tasks. Rebuffi et al. [21] proposed a distillation scheme intertwined with exemplar-based retrospection to retain previously learned concepts. [8] considered a similar approach for cross-dataset continual learning [16].
The combination of episodic (short-term) and semantic (long-term) memory was studied in [11, 5, 10] to perform memory consolidation and retrieval. In particular, [10, 11] avoid explicitly storing exemplars in memory, instead using a generative process to recall memories.

Figure 1: An overview of our RPS-Net. The network architecture utilizes a parallel residual design where the optimal path is selected among a set of randomly sampled candidate paths for new tasks. The residual design allows forward knowledge transfer and faster convergence for later tasks. The random path selection approach is trained with a hybrid objective function that ensures the right trade-off between network stability and plasticity, thus avoiding catastrophic forgetting.

The third stream of works explores dynamically adapting network architectures to cope with the growing learning tasks. [22] proposed a network architecture that progressively adds new branches for novel tasks, laterally connected to the fixed existing branches. Similarly, [26] proposed a network that not only grows incrementally but also expands hierarchically. PathNet [4] selects specific paths through the network for each learning task using a genetic algorithm; afterwards, task-relevant paths are fixed and reused for new tasks to speed up learning. The existing adaptive network architectures come with their respective limitations: the complexity of [22] grows linearly with the tasks, [26] has an expensive training procedure and a somewhat rigid architecture, and [4] does not allow incrementally learning new classes due to a detached output layer and a relatively expensive genetic learning algorithm. In comparison, we propose a random path selection methodology that provides a significant performance boost and enables faster convergence.
Furthermore, our approach combines the respective strengths of the above two types of methods by introducing a distillation procedure alongside an exemplar-based memory replay to avoid catastrophic forgetting.

We consider the recognition problem in an incremental setting where new tasks are sequentially added, assuming a total of K tasks, each comprising U classes. Our goal is to sequentially learn a deep neural network that not only performs well on the new tasks but also retains its performance on the old tasks. To address this problem, we propose a random path selection approach (RPS-Net) for new tasks that progressively builds on the previously acquired knowledge to facilitate faster convergence and better performance. In the following, we explain our network architecture, the path selection strategy, a hybrid objective function and the training procedure for incremental learning.

3.1 RPS-Net Architecture

Our network consists of L distinct layers (see Figure 1). Each layer $\ell \in [1, L]$ comprises a set of basic building blocks, called modules, $\mathcal{M}_\ell$. For simplicity, we consider each layer to contain an equal number of M modules stacked in parallel, i.e., $\mathcal{M}_\ell = \{\mathcal{M}_\ell^m\}_{m=1}^{M}$, along with a skip-connection module $\mathcal{M}_\ell^{skip}$ that carries the bypass signal. The skip-connection module $\mathcal{M}_\ell^{skip}$ is an identity function when the feature dimensions do not change, and a learnable module when the dimensions vary between consecutive layers. A module $\mathcal{M}_\ell^m$ is a learnable sub-network that maps the input features to the outputs. In our case, we consider a simple combination of (conv-bn-relu-conv-bn) layers for each module, similar to a single ResNet block [6]. In contrast to a residual block, which consists of a single identity connection and a residual branch, we have one skip connection and M residual blocks stacked in parallel.
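To make the layer design concrete, the following minimal NumPy sketch (our own illustration, not the authors' released code) mimics one RPS-Net layer. For brevity, each module is reduced to a single random linear map with a ReLU in place of the full conv-bn-relu-conv-bn block; the class and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_module(d_in, d_out):
    """Stand-in for a (conv-bn-relu-conv-bn) module: here, one linear map."""
    return rng.standard_normal((d_in, d_out)) * 0.1

class RPSLayer:
    """One RPS-Net layer: M parallel residual modules plus a skip connection.

    Only modules flagged active in the layer's row of the path matrix are
    evaluated; their outputs and the skip signal are combined by
    element-wise addition, as in the parallel residual design above.
    """
    def __init__(self, M, d_in, d_out):
        self.modules = [make_module(d_in, d_out) for _ in range(M)]
        # Skip is the identity when dimensions match, else a learnable map.
        self.skip = np.eye(d_in) if d_in == d_out else make_module(d_in, d_out)

    def forward(self, x, path_row):
        out = x @ self.skip
        for m, active in enumerate(path_row):
            if active:
                # ReLU'd residual branch (simplified module)
                out = out + np.maximum(0.0, x @ self.modules[m])
        return out

layer = RPSLayer(M=4, d_in=8, d_out=8)
x = rng.standard_normal((2, 8))
y = layer.forward(x, path_row=[0, 1, 0, 0])  # only module 1 is active
```

With an all-zero path row and matching dimensions, the layer reduces to the identity skip, matching the base case where unused modules contribute nothing to the sum.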
The intuition behind developing such a parallel architecture is to ensure that multiple tasks can be continually learned without causing catastrophic interference with other paths, while simultaneously providing parallelism for efficiency. Towards the end of each layer in RPS-Net, all the residual connections, as well as the skip connection, are combined together using element-wise addition to aggregate complementary task-specific features obtained from different paths. Remarkably, for the base case M = 1, the network is identical to a conventional ResNet model. After the Global Average Pooling (GAP) layer that collapses the input feature maps to generate a final feature $f \in \mathbb{R}^D$, we use a fully connected classifier with weights $W_{fc} \in \mathbb{R}^{D \times C}$ (C being the total number of classes) that is shared among all tasks.

For a given RPS-Net with M modules and L layers, we can define a path $P_k \in \{0, 1\}^{L \times M}$ for a task k:

$$P_k(\ell, m) = \begin{cases} 1, & \text{if the module } \mathcal{M}_\ell^m \text{ is added to the path,} \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$

The path $P_k$ is basically arranged as a stack of one-hot encoded row vectors $e(i)$ (the i-th standard basis vector):

$$P_k = \Big\{ P_k(\ell) \in \{0, 1\}^M : P_k(\ell) = e(i), \ \sum_{m=1}^{M} P_k(\ell, m) = 1 \Big\}, \quad \text{s.t. } i \sim \mathcal{U}\big(\mathbb{Z} \cap [1, M]\big), \qquad (2)$$

where i is the selected module index, uniformly sampled using $\mathcal{U}(\cdot)$ over the set of integers [1, M]. We define two sets of paths, $P^{tr}_k$ and $P^{ts}_k$, that denote the train and inference paths, respectively. Both are formulated as binary matrices: $P^{tr,ts}_k \in \{0, 1\}^{L \times M}$. When training the network, any m-th module in the ℓ-th layer with $P^{tr}_k(\ell, m) = 1$ is activated, and all such modules together constitute a training path $P^{tr}_k$ for task k. As we will elaborate in Sec. 3.2, the inference path evolves during training by sequentially adding newly discovered training paths and ends up as a common inference path for all inputs; therefore, our RPS-Net does not require knowledge about the task an input belongs to.
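The sampling in Eqs. (1)-(2) and the growth of the common inference path can be sketched in a few lines of NumPy. This is our own illustration under simplifying assumptions: we sample a single candidate path per selection step (whereas the method evaluates several random candidates and keeps the best one), and we reuse the last path for intermediate tasks, with the reuse interval J and all variable names chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_path(L, M):
    """Sample a random training path: an L x M binary matrix whose rows are
    one-hot vectors e(i), with i drawn uniformly from {1, ..., M}."""
    P = np.zeros((L, M), dtype=int)
    P[np.arange(L), rng.integers(0, M, size=L)] = 1
    return P

def update_inference_path(P_ts, P_tr):
    """Grow the common inference path by adding a newly discovered training
    path (element-wise OR), so inference needs no task identity."""
    return np.maximum(P_ts, P_tr)

L_layers, M, J = 9, 8, 2          # J: a new path is selected every J-th task
P_ts = np.zeros((L_layers, M), dtype=int)
paths = []
for k in range(6):                # a few incremental tasks
    if k % J == 0:
        P_tr = sample_path(L_layers, M)   # new candidate training path
    paths.append(P_tr)            # otherwise the previous path is reused
    P_ts = update_inference_path(P_ts, P_tr)
```

The element-wise maximum makes the inference path a superset of every training path discovered so far, which is one simple way to realize the "common inference path for all inputs" described above.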
Some previous methods (e.g., [16]) need such information, which limits their applicability to real-world incremental class-learning settings, where one does not know in advance the task an input sample corresponds to. Similarly, only the modules with $P^{ts}_k(\ell, m) = 1$ are used at the inference stage.

3.2 Path Selection

With a total of K tasks, we assume a constant number of U classes observed in each k-th task, such that U = C/K. Without loss of generality, the proposed path selection strategy can also be applied to a variable number of classes per task. The path selection scheme enables incremental and bounded resource allocation, with progressive learning that ensures knowledge exchange between the old and new tasks, resulting in positive forward and backward transfer. To promote resource reuse during training, which in turn improves training speed and minimizes computational requirements, we propose to perform path selection after every J tasks, where 1 ≤ J ≤ K.

Selecting the best path through search (BTS) requires evaluating a larger number of random paths, hence BTS has a higher compute complexity. Sophisticated genetic algorithms may beat random selection by a small margin, but likely at a high compute cost, which is not suitable for an incremental classifier learning setting with multiple tasks.

Forward Transfer: The convergence trends shown in Fig. 5d demonstrate the forward knowledge transfer of RPS-Net. We can see that for task 2, the model takes relatively longer to converge than for task 10. Precisely, for the final task, the model achieves 95% of its total performance within only one epoch, while for the second task it starts at 65% and takes up to 20 epochs to achieve 95% of its final accuracy. This trend shows the faster convergence of our model for newer tasks. The effect is due to residual learning as well as overlapping module sharing in the RPS-Net design, demonstrating its forward transfer capability.

Backward Transfer: Fig. 7 shows the evolution of our model with new tasks.
We can see that the performance of the current task k is lower than that of the previous tasks. In practice, our computational complexity therefore stays well below the worst-case logarithmic curve. For example, with the setting M=2, J=2, the computational requirements reduce by 63.7% while achieving the best performance. We also show that even when a single path is used for all the tasks (M=1), our model achieves almost the same performance as the state of the art, with constant computational complexity.

5 Conclusion

Learning tasks appear in a sequential order in real-world problems, and a learning agent must continually increment its existing knowledge. Deep neural networks excel in the cumulative learning setting where all tasks are available at once, but their performance deteriorates rapidly in incremental learning. In this paper, we propose a scalable approach to class-incremental learning that aims to keep the right balance between previously acquired knowledge and newly presented tasks. We achieve this using an optimal path selection approach that supports parallelism and knowledge exchange between old and new tasks. Further, a controlling mechanism is introduced to maintain an equilibrium between the stability and plasticity of the learned model. Our approach delivers strong performance gains on the MNIST, SVHN, CIFAR-100 and ImageNet datasets for incremental learning problems.

References

[1] W. C. Abraham and A. Robins. Memory retention – the synaptic stability versus plasticity dilemma. Trends in Neurosciences, 28(2):73–78, 2005.
[2] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
[3] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–547, 2018.
[4] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
[5] A. Gepperth and C. Karaoguz. A bio-inspired incremental learning architecture for applied perceptual problems. Cognitive Computation, 8(5):924–934, 2016.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[8] S. Hou, X. Pan, C. Change Loy, Z. Wang, and D. Lin. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 437–452, 2018.
[9] Y.-C. Hsu, Y.-C. Liu, A. Ramasamy, and Z. Kira. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. 2018.
[10] N. Kamra, U. Gupta, and Y. Liu. Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368, 2017.
[11] R. Kemker and C. Kanan. FearNet: Brain-inspired model for incremental learning. International Conference on Learning Representations, 2018.
[12] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[13] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun. A guide to convolutional neural networks for computer vision. Synthesis Lectures on Computer Vision, 8(1):1–207, 2018.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. In Proceedings of the National Academy of Sciences, volume 114, pages 3521–3526. National Academy of Sciences, 2017.
[16] Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
[17] D. Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
[18] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[19] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. CoRR, abs/1802.07569, 2018.
[20] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
[21] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
[22] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[23] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370, 2018.
[24] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
[25] G. M. van de Ven and A. S. Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.
[26] T. Xiao, J. Zhang, K. Yang, Y. Peng, and Z. Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177–186. ACM, 2014.
[27] S. Xie, A. Kirillov, R. Girshick, and K. He. Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569, 2019.
[28] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3987–3995. JMLR.org, 2017.
[29] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. J. Kuo. Class-incremental learning via deep model consolidation. arXiv preprint arXiv:1903.07864, 2019.
[30] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.