# Reinforced Continual Learning

Ju Xu (Center for Data Science, Peking University, Beijing, China; xuju@pku.edu.cn) and Zhanxing Zhu (Center for Data Science, Peking University & Beijing Institute of Big Data Research (BIBDR), Beijing, China; zhanxing.zhu@pku.edu.cn)

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

**Abstract.** Most artificial intelligence models are limited in their ability to solve new tasks faster without forgetting previously acquired knowledge. The recently emerging paradigm of continual learning aims to address this issue, in which the model learns various tasks in a sequential fashion. In this work, a novel approach for continual learning is proposed that searches for the best neural architecture for each coming task via carefully designed reinforcement learning strategies. We name it Reinforced Continual Learning. Our method not only performs well at preventing catastrophic forgetting but also fits new tasks well. Experiments on sequential classification tasks for variants of the MNIST and CIFAR-100 datasets demonstrate that the proposed approach outperforms existing continual learning alternatives for deep networks.

## 1 Introduction

Continual learning, or lifelong learning [15], the ability to learn consecutive tasks without forgetting how to perform previously trained tasks, is an important topic for developing artificial intelligence. The primary goal of continual learning is to overcome the forgetting of learned tasks and to leverage earlier knowledge to obtain better performance, or faster convergence and training, on newly arriving tasks.

In the deep learning community, two groups of strategies have been developed to alleviate forgetting of previously trained tasks, distinguished by whether the network architecture changes during learning.

The first category of approaches maintains a fixed network architecture with large capacity. When training the network on consecutive tasks, a regularization term is enforced to prevent the model parameters from deviating too much from the previously learned parameters according to their significance to old tasks [4, 19]. In [6], the authors proposed to incrementally match the moments of the posterior distributions of the neural networks trained on the first and the second task, respectively. Alternatively, an episodic memory [7] is budgeted to store subsets of the previous datasets, which are then trained together with the new task. FearNet [3] mitigates catastrophic forgetting by consolidating recent memories into long-term storage using pseudorehearsal [10], which employs a generative autoencoder to produce previously learned examples that are replayed alongside novel information during consolidation. Fernando et al. [2] proposed PathNet, in which a neural network has ten or twenty modules in each layer, and three or four modules are picked for each task in each layer by an evolutionary approach. However, these methods typically require unnecessarily large-capacity networks, particularly when the number of tasks is large, since the network architecture is never dynamically adjusted during training.

The other group of methods overcomes catastrophic forgetting by dynamically expanding the network to accommodate the newly coming task while keeping the parameters of the previous architecture unchanged.
Progressive networks [11] expand the architecture by a fixed number of nodes or layers per task, leading to an extremely large network structure when faced with a long sequence of tasks. The resulting complex architecture can be expensive to store and is often unnecessary due to its high redundancy. The Dynamically Expandable Network (DEN) [17] alleviates this issue to some extent by introducing group sparsity regularization when adding new parameters to the original network; unfortunately, DEN involves many hyperparameters, including various regularization and thresholding ones, which need to be tuned carefully because the model performance is highly sensitive to them.

In this work, in order to better facilitate knowledge transfer and avoid catastrophic forgetting, we propose a novel framework to adaptively expand the network. When faced with a new task, deciding the optimal number of nodes/filters to add for each layer is posed as a combinatorial optimization problem. We provide a carefully designed reinforcement learning method to solve this problem, and accordingly name our approach Reinforced Continual Learning (RCL). In RCL, a controller implemented as a recurrent neural network determines the best architectural hyperparameters of the neural network for each task. We train the controller by an actor-critic strategy guided by a reward signal derived from both validation accuracy and network complexity. This maintains the prediction accuracy on older tasks as much as possible while reducing the overall model complexity. To the best of our knowledge, our proposal is the first attempt to employ reinforcement learning for solving continual learning problems. RCL not only differs from adding a fixed number of units to the old network for each new task [11], which might be suboptimal and computationally expensive, but also from [17], which performs group sparsity regularization on the added parameters. We validate the effectiveness of RCL on various sequential tasks, and the results show that RCL obtains better performance than existing methods while adding far fewer units.

The rest of this paper is organized as follows. In Section 2, we introduce preliminaries on reinforcement learning. In Section 3, we propose the new method RCL, a model that learns a sequence of tasks dynamically based on reinforcement learning. In Section 4, we present various experiments that demonstrate the superiority of RCL over other state-of-the-art methods. Finally, we conclude in Section 5 and provide some directions for future research.

## 2 Preliminaries of Reinforcement Learning

Reinforcement learning [13] deals with learning a policy for an agent interacting with an unknown environment. It has been applied successfully to various problems, such as games [8, 12], natural language processing [18], neural architecture/optimizer search [20, 1], and so on. At each step, an agent observes the current state $s_t$ of the environment, decides on an action $a_t$ according to a policy $\pi(a_t \mid s_t)$, and observes a reward signal $r_{t+1}$. The goal of the agent is to find a policy that maximizes the expected sum of discounted rewards $R_t$,
$$R_t = \sum_{t'=t+1}^{\infty} \gamma^{\,t'-t-1} r_{t'},$$
where $\gamma \in (0, 1]$ is a discount factor that determines the importance of future rewards. The value function of a policy $\pi$ is defined as the expected return $V_\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\big]$, and its action-value function as $Q_\pi(s, a) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\big]$.
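As a small worked illustration of the discounted return defined above, the NumPy snippet below computes $R_t$ for every step of a finite reward sequence; here `rewards[t]` plays the role of $r_{t+1}$, and the episode is assumed to terminate after the last reward.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """Compute R_t = sum_{k >= 0} gamma^k * r_{t+k+1} backwards over a finite episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # r_{t+1} + gamma * R_{t+1}
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0]))  # [2.62, 1.8, 2.0]
```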
Policy gradient methods address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies $\pi_\theta(a \mid s)$, parameterized by $\theta$. The policy gradient theorem [14] provides expressions for the gradient of the average-reward and discounted-reward objectives with respect to $\theta$. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\big]$. The policy gradient theorem shows that
$$\nabla_\theta \rho(\theta, s_0) = \sum_{s} \mu_{\pi_\theta}(s \mid s_0) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q_{\pi_\theta}(s, a), \tag{1}$$
where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$.

## 3 Our Proposal: Reinforced Continual Learning

In this section, we elaborate on the new framework for continual learning, Reinforced Continual Learning (RCL). RCL consists of three networks: the controller, the value network, and the task network. The controller is implemented as a Long Short-Term Memory (LSTM) network that generates policies determining how many filters or nodes should be added for each task. We design the value network as a fully connected network that approximates the value of the state. The task network can be any network of interest for solving a particular task, such as image classification or object detection. In this paper, we use a convolutional network (CNN) as the task network to demonstrate how RCL adaptively expands the CNN to prevent forgetting, though our method applies to fully connected networks as well as convolutional ones.

### 3.1 The Controller

Figure 1(a) visually shows how RCL expands the network when a new task arrives. After the learning process of task $t-1$ finishes and task $t$ arrives, we use a controller to decide how many filters or nodes should be added to each layer. In order to prevent semantic drift, we withhold modification of network weights for previous tasks and only train the newly added filters. After we have trained the model for task $t$, we timestamp each newly added filter by recording the shape of every layer. At inference time, each task only employs the parameters introduced up to its own stage and does not use the filters added for later tasks, which prevents semantic drift.

Suppose the task network has $m$ layers. When faced with a newly coming task, for each layer $i$ we specify the number of filters to add within the range $0$ to $n_i - 1$. A straightforward idea for obtaining the optimal configuration of added filters for the $m$ layers is to traverse all combinations of actions. However, for an $m$-layer network, the time complexity of finding the best action combination by exhaustive search is $O\big(\prod_{i=1}^{m} n_i\big)$, which is intractable for very deep architectures such as VGG and ResNet. To deal with this issue, we treat a series of actions as a fixed-length string. It is possible to use a controller to generate such a string, representing how many filters should be added in each layer. Since there is a recurrent relationship between consecutive layers, the controller can naturally be designed as an LSTM network. At the first step, the controller network receives an empty embedding as input (i.e., the state $s$) for the current task, which is kept fixed during training. For each task $t$, we equip the controller with a softmax output $p_{t,i} \in \mathbb{R}^{n_i}$ representing the probabilities of sampling each action for layer $i$, i.e., the number of filters to be added. We design the LSTM in an autoregressive manner: as Figure 1(b) shows, the probability vector $p_{t,i}$ from the previous step is fed as input into the next step. This process is repeated until we obtain the actions and probabilities for all $m$ layers, and the policy probability of the sequence of actions $a_{1:m}$ follows the product rule,
$$\pi(a_{1:m} \mid s; \theta_c) = \prod_{i=1}^{m} p_{t,i,a_i}, \tag{2}$$
where $\theta_c$ denotes the parameters of the controller network.
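To make the controller concrete, the following is a minimal PyTorch sketch of such an autoregressive sampler. It is an illustrative re-implementation rather than our TensorFlow code; the class name, hidden size, and the zero-vector encoding of the "empty embedding" and of the fed-back probabilities are illustrative choices.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Controller(nn.Module):
    """Sketch of the LSTM controller: for each of the m task-network layers it
    emits a softmax over n_i choices (how many filters/nodes to add) and feeds
    the previous step's probabilities back in as the next input."""

    def __init__(self, action_sizes, hidden_size=100):
        super().__init__()
        self.action_sizes = action_sizes                  # [n_1, ..., n_m]
        self.max_n = max(action_sizes)
        self.cell = nn.LSTMCell(self.max_n, hidden_size)
        self.heads = nn.ModuleList([nn.Linear(hidden_size, n) for n in action_sizes])

    def sample(self):
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        x = torch.zeros(1, self.max_n)                    # "empty" embedding (state s)
        actions, log_probs = [], []
        for head, n in zip(self.heads, self.action_sizes):
            h, c = self.cell(x, (h, c))
            dist = Categorical(logits=head(h))            # p_{t,i} over n_i actions
            a = dist.sample()
            actions.append(int(a))
            log_probs.append(dist.log_prob(a))
            x = torch.zeros(1, self.max_n)                # autoregressive input:
            x[0, :n] = dist.probs.detach()[0]             # previous probabilities
        # log pi(a_{1:m} | s) is the sum of per-layer log-probabilities, matching Eq. (2).
        return actions, torch.stack(log_probs).sum()

controller = Controller(action_sizes=[30, 30, 30])
actions, log_prob = controller.sample()                   # e.g. actions = [12, 3, 25]
```

The returned `log_prob` corresponds to $\log \pi(a_{1:m} \mid s; \theta_c)$ and is what the policy-gradient update of Section 3.4 differentiates.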
### 3.2 The Task Network

We deal with $T$ tasks arriving in a sequential manner, with training dataset $D_t = \{x_i, y_i\}_{i=1}^{N_t}$, validation dataset $V_t = \{x_i, y_i\}_{i=1}^{M_t}$, and test dataset $T_t = \{x_i, y_i\}_{i=1}^{K_t}$ at time $t$. For the first task, we train a basic task network that performs well enough by solving a standard supervised learning problem,
$$\min_{W_1} L_1(W_1; D_1). \tag{3}$$
We denote the well-trained parameters for task $t$ by $W_t^a$. When the $t$-th task arrives, we already know the best parameters $W_{t-1}^a$ for task $t-1$. We then use the controller to decide how many filters should be added to each layer and obtain an expanded child network, whose parameters to be learned are denoted by $W_t$ (including $W_{t-1}^a$). The training procedure for the new task keeps $W_{t-1}^a$ fixed and back-propagates only through the newly added parameters $W_t \setminus W_{t-1}^a$. Thus, the optimization problem for the new task is
$$\min_{W_t \setminus W_{t-1}^a} L_t(W_t; D_t). \tag{4}$$

Figure 1: (a) RCL adaptively expands each layer of the network when the $t$-th task arrives. (b) The controller, implemented as an RNN, determines how many filters to add for the new task.

We use stochastic gradient descent to learn the newly added filters with learning rate $\eta$,
$$W_t \setminus W_{t-1}^a \leftarrow W_t \setminus W_{t-1}^a - \eta\, \nabla_{W_t \setminus W_{t-1}^a} L_t. \tag{5}$$
The expanded child network is trained until the required number of epochs or convergence is reached. We then test the child network on the validation dataset $V_t$ and obtain the corresponding accuracy $A_t$. The parameters of the expanded network achieving the maximal reward (described in Section 3.3) are taken as the optimal ones for task $t$, and we store them for later tasks.

### 3.3 Reward Design

In order to encourage our controller to generate better actions over time, we need to design a reward function that reflects the performance of our actions. Considering both the validation accuracy and the complexity of the expanded network, we design the reward for task $t$ as a combination of the two terms,
$$R_t = A_t(S_t, a_{1:m}) + \alpha\, C_t(S_t, a_{1:m}), \tag{6}$$
where $A_t$ is the validation accuracy on $V_t$, the network complexity term $C_t$ is determined by $\sum_{i=1}^{m} k_i$ with $k_i$ the number of filters added in layer $i$, and $\alpha$ is a parameter that balances prediction performance against model complexity. Since $R_t$ is non-differentiable, we use policy gradient to update the controller, as described in the following section.
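The sketch below illustrates the expand-and-freeze step of Eqs. (4)-(5) and the reward of Eq. (6) for a fully connected layer, again in PyTorch and only as a sketch: the rows copied from $W_{t-1}^a$ are frozen via a gradient mask, and the reward is assumed to penalize the number of added units (a sign convention consistent with the behaviour reported in Figure 5, not a statement of the exact implementation).

```python
import torch
import torch.nn as nn

def expand_linear(old: nn.Linear, extra_out: int) -> nn.Linear:
    """Widen a layer: reuse (and freeze) the old output units, train only the new ones.
    A full implementation would also widen the input of the following layer."""
    new = nn.Linear(old.in_features, old.out_features + extra_out)
    with torch.no_grad():
        new.weight[:old.out_features] = old.weight        # copy W^a_{t-1}
        new.bias[:old.out_features] = old.bias
    w_mask = torch.zeros_like(new.weight); w_mask[old.out_features:] = 1.0
    b_mask = torch.zeros_like(new.bias);   b_mask[old.out_features:] = 1.0
    new.weight.register_hook(lambda g: g * w_mask)        # zero gradients of old rows,
    new.bias.register_hook(lambda g: g * b_mask)          # so only new units are trained
    return new

def reward(val_acc: float, filters_added: int, alpha: float = 1e-3) -> float:
    """Eq. (6): validation accuracy combined with a complexity term; the penalty
    sign on the number of newly added units is an assumption."""
    return val_acc - alpha * filters_added
```

With this masking trick the frozen rows should also be excluded from weight decay in the optimizer, since the hook only zeroes gradients coming from the loss.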
### 3.4 Training Procedures

The controller's prediction can be viewed as a list of actions $a_{1:m}$, i.e., the number of filters added in each of the $m$ layers, which specifies a new architecture for a child network that is then trained on the new task. At convergence, this child network achieves an accuracy $A_t$ on the validation dataset with model complexity $C_t$, from which we obtain the reward $R_t$ defined in Eq. (6). We can use this reward and reinforcement learning to train the controller. To find the optimal incremental architecture for the new task $t$, the controller aims to maximize its expected reward,
$$J(\theta_c) = V_{\theta_c}(s_t), \tag{7}$$
where $V_{\theta_c}$ is the true value function.

In order to accelerate policy gradient training over $\theta_c$, we use actor-critic methods with a value network parameterized by $\theta_v$ to approximate the state value $V(s_t; \theta_v)$. The REINFORCE algorithm [16] can be used to learn $\theta_c$,
$$\nabla_{\theta_c} J(\theta_c) = \mathbb{E}_{a_{1:m} \sim \pi(\cdot \mid s_t; \theta_c)}\!\left[\big(R(s_t, a_{1:m}) - V(s_t; \theta_v)\big)\,\frac{\nabla_{\theta_c}\pi(a_{1:m} \mid s_t; \theta_c)}{\pi(a_{1:m} \mid s_t; \theta_c)}\right]. \tag{8}$$
A Monte Carlo approximation of this quantity is
$$\nabla_{\theta_c} J(\theta_c) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta_c} \log \pi\big(a^{(i)}_{1:m} \mid s_t; \theta_c\big)\, \Big(R\big(s_t, a^{(i)}_{1:m}\big) - V(s_t; \theta_v)\Big), \tag{9}$$
where $N$ is the batch size. For the value network, we use a gradient-based method to update $\theta_v$ by minimizing the squared error $L_v(\theta_v) = \frac{1}{N}\sum_{i=1}^{N}\big(V(s_t; \theta_v) - R(s_t, a^{(i)}_{1:m})\big)^2$, whose gradient can be evaluated as
$$\nabla_{\theta_v} L_v(\theta_v) = \frac{2}{N}\sum_{i=1}^{N}\Big(V(s_t; \theta_v) - R\big(s_t, a^{(i)}_{1:m}\big)\Big)\, \nabla_{\theta_v} V(s_t; \theta_v). \tag{10}$$

Finally, we summarize our RCL approach for continual learning in Algorithm 1, in which the subroutine for network expansion is described in Algorithm 2; a code-level sketch of the update step in Algorithm 2 is given at the end of this section.

**Algorithm 1: RCL for Continual Learning**

1. **Input:** a sequence of datasets $D = \{D_1, D_2, \dots, D_T\}$
2. **Output:** $W_T^a$
3. **for** $t = 1, \dots, T$ **do**
4. **if** $t = 1$ **then**
5. Train the base network using Eq. (3) on the first dataset $D_1$ and obtain $W_1^a$.
6. **else**
7. Expand the network by Algorithm 2 and obtain the trained $W_t^a$.
8. **end if**
9. **end for**

**Algorithm 2: Routine for Network Expansion**

1. **Input:** current dataset $D_t$; previous parameters $W_{t-1}^a$; the size of the action space for each layer, $n_i$, $i = 1, \dots, m$; number of epochs for training the controller and value network, $T_e$.
2. **Output:** network parameters $W_t^a$
3. **for** $i = 1, \dots, T_e$ **do**
4. Generate actions $a_{1:m}$ from the controller's policy;
5. Generate $W_t^{(i)}$ by expanding the parameters $W_{t-1}^a$ according to $a_{1:m}$;
6. Train the expanded network using Eq. (5) to obtain $W_t^{(i)}$;
7. Evaluate the gradients of the controller and the value network by Eq. (9) and Eq. (10), and update $\theta_c \leftarrow \theta_c + \eta_c \nabla_{\theta_c} J(\theta_c)$, $\theta_v \leftarrow \theta_v - \eta_v \nabla_{\theta_v} L_v(\theta_v)$.
8. **end for**
9. Return the best network parameter configuration, $W_t^a = \arg\max_{W_t^{(i)}} R_t\big(W_t^{(i)}\big)$.

### 3.5 Comparison with Other Approaches

As a new framework for network expansion to achieve continual learning, RCL differs from progressive networks [11] and DEN [17] in the following aspects. Compared with DEN, instead of performing selective retraining and network splitting, RCL keeps the learned parameters for previous tasks fixed and only updates the added parameters. Through this training strategy, RCL completely prevents catastrophic forgetting because the parameters used by each previous task are frozen. Progressive neural networks expand the architecture with a fixed number of units or filters per task. To obtain satisfactory model accuracy when the number of sequential tasks is large, the final complexity of progressive networks has to be extremely high. This directly leads to a high computational burden in both training and inference, and can even make storing the entire model difficult. To handle this issue, both RCL and DEN dynamically adjust the network to reach a more economical architecture. While DEN achieves an expandable network through sparse regularization, RCL adaptively expands the network by reinforcement learning. However, the performance of DEN is quite sensitive to various hyperparameters, including regularization parameters and thresholding coefficients. RCL largely reduces the number of hyperparameters and boils down to balancing validation accuracy against model complexity in the design of the reward function. Through the experiments in Section 4, we demonstrate that RCL achieves more stable results and better model performance, simultaneously using far fewer neurons than DEN.
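As promised above, here is a hedged PyTorch sketch of the update step of Algorithm 2 (Eqs. (9)-(10)), using the `Controller` from the Section 3.1 sketch. `train_child` is a hypothetical callable that expands the task network according to the sampled actions, trains it with Eq. (5), and returns the reward of Eq. (6); `value_net` is any module mapping the state embedding to a scalar.

```python
import torch

def update_step(controller, value_net, opt_c, opt_v, state, train_child, n_samples=5):
    """One controller/value-network update (an illustrative sketch, not the exact code)."""
    opt_c.zero_grad()
    opt_v.zero_grad()
    policy_loss, value_loss = 0.0, 0.0
    for _ in range(n_samples):                        # N Monte Carlo samples
        actions, log_prob = controller.sample()       # a_{1:m} and log pi(a_{1:m} | s_t)
        R = float(train_child(actions))               # reward R_t of Eq. (6)
        baseline = value_net(state).squeeze()         # V(s_t; theta_v)
        advantage = R - baseline
        policy_loss = policy_loss - log_prob * advantage.detach()   # Eq. (9)
        value_loss = value_loss + advantage.pow(2)                   # Eq. (10), up to 2/N
    ((policy_loss + value_loss) / n_samples).backward()
    opt_c.step()   # minimizing -J is gradient ascent on J: theta_c += eta_c * grad J
    opt_v.step()   # theta_v -= eta_v * grad L_v
```

After $T_e$ such iterations, the parameters of the child network with the highest reward are kept as $W_t^a$, as in line 9 of Algorithm 2.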
## 4 Experiments

We perform a variety of experiments to assess the performance of RCL in continual learning. We report accuracy, model complexity, and training time for our RCL and the state-of-the-art baselines. All experiments were implemented in the TensorFlow framework on a Tesla K80 GPU.

**Datasets.** (1) MNIST Permutations [4]: ten variants of the MNIST data, where each task is transformed by a fixed permutation of pixels; the samples from different tasks are therefore not independent and identically distributed. (2) MNIST Mix: five MNIST permutation tasks (P1, ..., P5) and five variants of the MNIST dataset (R1, ..., R5), each containing digits rotated by a fixed angle between 0 and 180 degrees; the tasks are arranged in the order P1, R1, P2, R2, ..., P5, R5. (3) Incremental CIFAR-100 [9]: different from the original CIFAR-100, each task introduces a new set of classes; for a total number of tasks $T$, each new task contains examples from a subset of $100/T$ classes. In this dataset, the distribution of the input is similar across tasks, but the distribution of the output is different.

For all of the above datasets, we set the number of tasks to be learned to $T = 10$. For the MNIST-based datasets, each task contains 60000 training examples and 10000 test examples from 10 different classes. For the CIFAR-100 dataset, each task contains 5000 training examples and 1000 test examples from 10 different classes. The model observes the tasks one by one, and once a task has been observed it is not revisited during training.

**Baselines.** (1) SN, a single network trained across all tasks; (2) EWC, a deep network trained with elastic weight consolidation [4] for regularization; (3) GEM, gradient episodic memory [7]; (4) PGN, the progressive neural network proposed in [11]; (5) DEN, the dynamically expandable network [17].

**Base network settings.** (1) Fully connected networks for the MNIST Permutations and MNIST Mix datasets: we use a three-layer network with 784-312-128-10 neurons and ReLU activations. (2) LeNet is used for Incremental CIFAR-100: LeNet has two convolutional layers and three fully connected layers; its detailed structure can be found in [5].
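For concreteness, here is a minimal NumPy sketch of how the MNIST Permutations tasks can be constructed. The images are assumed to be pre-flattened to 784-dimensional vectors, and whether the first task uses the identity permutation is an assumption of the sketch rather than a detail of our setup.

```python
import numpy as np

def make_permuted_mnist(x_train, y_train, x_test, y_test, n_tasks=10, seed=0):
    """Each task applies one fixed, random pixel permutation to every image."""
    rng = np.random.RandomState(seed)
    n_pixels = x_train.shape[1]                     # 784 for flattened MNIST
    tasks = []
    for t in range(n_tasks):
        perm = np.arange(n_pixels) if t == 0 else rng.permutation(n_pixels)
        tasks.append(((x_train[:, perm], y_train),  # training split for task t
                      (x_test[:, perm], y_test)))   # test split for task t
    return tasks
```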
### 4.1 Results

We evaluate each compared approach by considering the average test accuracy over all tasks, model complexity, and training time. Model complexity is measured by the number of model parameters after training on all tasks. We first report the test accuracy and model complexity of the baselines and our proposed RCL on the three datasets in Figure 2.

Figure 2: Top: average test accuracy across all tasks for SN, EWC, GEM, PGN, DEN, and RCL on MNIST permutations, MNIST mix, and CIFAR-100. Bottom: the number of parameters for the different methods on the same datasets.

**Comparison between fixed-size and expandable networks.** From Figure 2, we can easily observe that the approaches with fixed-size network architectures, such as SN, EWC, and GEM, have low model complexity, but their prediction accuracy is much worse than that of the methods with expandable networks, including PGN, DEN, and RCL. This shows that dynamically expanding the network can indeed improve model performance by a large margin.

**Comparison between PGN, DEN, and RCL.** Among the expandable networks, RCL outperforms PGN and DEN on both test accuracy and model complexity. In particular, RCL achieves a significant reduction in the number of parameters compared with PGN and DEN, e.g., 42% and 53% parameter reduction, respectively, on incremental CIFAR-100. To further examine the difference between the three methods, we vary the hyperparameter settings, train the networks accordingly, and record how test accuracy changes with respect to the number of parameters, as shown in Figure 3. We can clearly observe that RCL achieves a significant model-size reduction at the same test accuracy as PGN and DEN, and an accuracy improvement at the same network size. This demonstrates the benefit of employing reinforcement learning to adaptively control the complexity of the entire model architecture.

Figure 3: Average test accuracy vs. model complexity for RCL, DEN, and PGN on the three datasets.

**Comparison between RCL and random search.** We compare our policy gradient controller with a random search controller on the different datasets. In every experimental setup, the hyperparameters are the same except for the controller (random search vs. policy gradient). We run each experiment four times. We find that random search achieves more than 0.1% lower accuracy and almost the same number of parameters as policy gradient on these three datasets. We note that random search performs surprisingly well, which we attribute to the representational power of our reward design. This indicates that our reward strikes a balance between accuracy and model complexity very effectively.

**Evaluating the forgetting behavior.** Figure 4 shows the evolution of the test accuracy on the first task as more tasks are learned. RCL and PGN exhibit no forgetting, while the approaches that do not expand the network suffer from catastrophic forgetting. Moreover, DEN cannot completely prevent forgetting, since it retrains the previous parameters when learning new tasks.

Figure 4: Test accuracy on the first task as more tasks are learned, for GEM, EWC, SN, PGN, DEN, and RCL.

**Training time.** We report the wall-clock training time for each compared method in Table 1. Since RCL is based on reinforcement learning, a large number of trials is typically required, which leads to longer training time than the other methods. Improving the training efficiency of reinforcement learning is still an open problem, and we leave it as future work.

Table 1: Training time (in seconds) of the experiments for all methods.

| Dataset | SN | EWC | GEM | DEN | PGN | RCL |
| --- | --- | --- | --- | --- | --- | --- |
| MNIST permutations | 173 | 1319 | 1628 | 21686 | 452 | 34583 |
| MNIST mix | 170 | 1342 | 1661 | 19690 | 451 | 23626 |
| CIFAR-100 | 149 | 508 | 7550 | 1428 | 167 | 3936 |

**Balance between test accuracy and model complexity.** We control the trade-off between model performance and complexity through the coefficient $\alpha$ in the reward function (6). Figure 5 shows how varying $\alpha$ affects the test accuracy and the number of model parameters. As expected, with increasing $\alpha$ the model complexity drops significantly while the model performance also deteriorates gradually. Interestingly, when $\alpha$ is small, accuracy drops much more slowly than the number of parameters. This observation can help choose a suitable $\alpha$ such that a medium-sized network still achieves relatively good model performance.
Figure 5: Experiments on the influence of the parameter $\alpha$ (ranging from $10^{-5}$ to $10^{-2}$) in the reward design: test accuracy and number of parameters on the three datasets.

## 5 Conclusion

We have proposed a novel framework for continual learning, Reinforced Continual Learning. Our method searches for the best neural architecture for each coming task by reinforcement learning, which increases its capacity when necessary and effectively prevents semantic drift. We implement both fully connected and convolutional neural networks as task networks and validate them on different datasets. The experiments demonstrate that our proposal significantly outperforms the existing baselines on both prediction accuracy and model complexity.

As for future work, two directions are worth considering. First, we will develop new strategies for RCL to facilitate backward transfer, i.e., improving the performance on previous tasks by learning new tasks. Second, reducing the training time of RCL is particularly important for large networks with more layers.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61806009) and the Beijing Natural Science Foundation (Grant No. 4184090).

## References

[1] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468, 2017.

[2] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[3] Ronald Kemker and Christopher Kanan. FearNet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563, 2017.

[4] James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[5] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[6] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pages 4655–4665, 2017.

[7] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, 2017.

[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[9] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer Society, 2017.

[10] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[11] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[12] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[13] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[14] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 1999.

[15] Sebastian Thrun. A lifelong learning perspective for mobile robot control. In International Conference on Intelligent Robots and Systems, 1995.

[16] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[17] J. Yoon and E. Yang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.

[18] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.

[19] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), 2017.

[20] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.