
Published as a conference paper at ICLR 2019

KNOWLEDGE FLOW: IMPROVE UPON YOUR TEACHERS

Iou-Jen Liu, Jian Peng, Alexander G. Schwing University of Illinois at Urbana-Champaign {iliu3, jpeng, aschwing}@illinois.edu

A zoo of deep nets is available these days for almost any given task, and it is increasingly unclear which net to start with when addressing a new task, or which net to use as an initialization for ﬁne-tuning a new model. To address this issue, in this paper, we develop knowledge ﬂow which moves knowledge from multiple deep nets, referred to as teachers, to a new deep net model, called the student. The structure of the teachers and the student can differ arbitrarily and they can be trained on entirely different tasks with different output spaces too. Upon training with knowledge ﬂow the student is independent of the teachers. We demonstrate our approach on a variety of supervised and reinforcement learning tasks, outperforming ﬁne-tuning and other knowledge exchange methods.

1 INTRODUCTION

Research communities have amassed a sizable number of deep net architectures for different tasks, and new ones are added almost daily. Some of those architectures are trained from scratch while others are ﬁne-tuned, i.e., before training, their weights are initialized using a structurally similar deep net which was trained on different data.

Beyond fine-tuning, particularly in reinforcement learning, teachers have also been considered in one way or another by Rusu et al. (2016b); Fernando et al. (2017); Wang et al. (2017); Li & Hoiem (2016); Bengio et al. (2009); Patel et al. (2015); Chen & Liu (2016); Teh et al. (2017); Parisotto et al. (2016). For instance, progressive neural net (Rusu et al., 2016b) keeps multiple teachers during both training and inference, and learns to extract useful features from the teachers for a new target task. PathNet (Fernando et al., 2017) uses genetic algorithms to choose pathways from a giant network for learning new tasks. Growing a Brain (Wang et al., 2017) fine-tunes a neural network while growing the network's capacity (wider or deeper layers). Actor-mimic (Parisotto et al., 2016) pre-trains a big model on multiple source tasks; the big model is then used as a weight initialization for a new model which is trained on a new target task. Knowledge distillation (Hinton et al., 2015) distills knowledge from a large ensemble of models to a smaller student model.

However, all the aforementioned techniques have limitations. For example, progressive neural net models (Rusu et al., 2016b) grow with the number of teachers. The resulting large number of parameters limits the number of teachers a progressive neural net can handle, and substantially increases the training and testing time. In PathNet (Fernando et al., 2017), searching over a big network for pathways is computationally intensive. For fine-tuning based methods such as Growing a Brain (Wang et al., 2017) and actor-mimic (Parisotto et al., 2016), only one pretrained model can be used at a time. Hence, their performance relies heavily on the chosen pretrained model.

To address these shortcomings, we develop knowledge ﬂow which moves knowledge of multiple teachers when training a student. Irrespective of how many teachers we use, the student is guaranteed to become independent at the ﬁnal stage of training and the size of the resulting student net remains constant. In addition, our framework makes no restrictions on the deep net size of the teacher and student, which provides ﬂexibility in choosing teacher models. Importantly, our approach is applicable to a variety of tasks from reinforcement learning to fully-supervised training.

We evaluate knowledge flow on a variety of tasks from reinforcement learning to fully-supervised learning. In particular, we follow Rusu et al. (2016b); Fernando et al. (2017) and compare on the same Atari games. In addition, we also observed significant top-1 error rate improvements on supervised learning datasets, i.e., CIFAR-10 and CIFAR-100.

2 BACKGROUND

Knowledge ﬂow is applicable to a variety of settings from supervised learning to reinforcement learning, which we brieﬂy review to introduce notation.

Supervised Learning recovers the parameters $\theta$ of a mapping $f_\theta : \mathcal{X} \to \mathcal{Y}$ from data space $\mathcal{X}$ to output space $\mathcal{Y}$. To this end, a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ pairs $(x_i, y_i)$ (assumed to be sampled i.i.d.) is used, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Given this dataset, the parameters $\theta$ of the mapping $f_\theta$ are learned by minimizing a loss function $\ell_{(x,y)}(\theta)$ composed of a regularization term $R(\theta)$ and an empirical risk $\ell(y, f_\theta(x))$ which compares groundtruth label $y$ and prediction $f_\theta(x)$. The parameters $\theta$ are obtained by optimizing the following program:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell_{(x,y)}(\theta)\right] := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(y, f_\theta(x))\right] + R(\theta). \tag{1}$$

Hereby, the mapping $f_\theta$ is obtained by maximizing the logits or a corresponding probability distribution $\hat{f}_\theta(y|x)$, i.e., $f_\theta = \arg\max_{y \in \mathcal{Y}} \hat{f}_\theta(y|x)$. Here and below let the hat ( $\hat{\cdot}$ ) indicate probability distributions over appropriate domains.

Reinforcement Learning considers an agent interacting with an environment according to a policy $\pi_{\theta_\pi} : \mathcal{X} \to \mathcal{A}$ which maps a state $x_t \in \mathcal{X}$ to an action $a_t \in \mathcal{A}$ at time $t$. The policy depends on the parameters $\theta_\pi$. After performing action $a_t$, the agent observes the next state $x_{t+1}$ and receives a scalar reward $r_t$. The discounted return at time $t$ is defined as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma$ is the discount factor. The expected future reward when observing state $x_t$ and following policy $\pi_{\theta_\pi}$ is defined as $V^{\pi_{\theta_\pi}}(x_t) = \mathbb{E}_{\tau \sim \pi_{\theta_\pi}}[R_t \mid x_t]$, where $\tau = \{(x_t, a_t, r_t), (x_{t+1}, a_{t+1}, r_{t+1}), \ldots\}$ is a trajectory generated by following $\pi_{\theta_\pi}$ from state $x_t$.

The goal of reinforcement learning is to find a policy that maximizes the expected future reward from each state $x_t$. Without loss of generality, in this paper, we follow the asynchronous advantage actor-critic (A3C) formulation (Mnih et al., 2016). In A3C, the policy mapping $\pi_{\theta_\pi}(x) = \arg\max_{a \in \mathcal{A}} \hat{\pi}_{\theta_\pi}(a|x)$ is obtained from a probability distribution over actions given the state, where $\hat{\pi}_{\theta_\pi}(a|x)$ is modeled by a deep net with parameters $\theta_\pi$. The value function is also approximated by a deep net $V_{\theta_v}(x)$, having parameters $\theta_v$.
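The discounted return defined above can be illustrated with a minimal sketch. It assumes a finite list of rewards (the infinite sum is truncated at the end of an episode); the function name `discounted_return` is hypothetical and not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k} for a finite reward list.

    Assumption: the episode ends after len(rewards) steps, so the
    infinite sum in the definition is truncated there.
    """
    ret = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

The backward recursion avoids computing powers of the discount factor explicitly.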

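To optimize the policy parameters $\theta_\pi$ given a state $x_t$, a loss function based on a scaled negative log-likelihood and a negative entropy regularizer is common:

$$\ell_\pi^\tau(\theta_\pi) = \frac{1}{|\tau|} \sum_{t \in \tau} \left[ -\log \hat{\pi}_{\theta_\pi}(a_t|x_t)\,\big(R_t - V_{\theta_v}(x_t)\big) - \beta H\big(\hat{\pi}_{\theta_\pi}(\cdot|x_t)\big) \right].$$

Hereby, $R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{\theta_v}(x_{t+k})$ is the empirical $k$-step return obtained when starting in state $x_t$, and $|\tau|$ is the length of the trajectory $\tau$ generated by following $\pi_{\theta_\pi}$. The scalar $\beta \geq 0$ is a user-specified constant, and $H(\hat{\pi}_{\theta_\pi}(\cdot|x_t))$ is the entropy function, which encourages exploration by favoring a uniform probability distribution $\hat{\pi}_{\theta_\pi}(a|x)$. To optimize the value function $V_{\theta_v}$, it is common to use the squared loss

$$\ell_v^\tau(\theta_v) = \frac{1}{2|\tau|} \sum_{t \in \tau} \big(R_t - V_{\theta_v}(x_t)\big)^2.$$

By minimizing the empirical expectation of $\ell_\pi^\tau(\theta_\pi)$ and $\ell_v^\tau(\theta_v)$, i.e., by addressing

$$\min_{\theta_\pi} \; \mathbb{E}_{\tau \sim \pi_{\theta_\pi}}\left[\ell_\pi^\tau(\theta_\pi)\right], \quad \text{and} \quad \min_{\theta_v} \; \mathbb{E}_{\tau \sim \pi_{\theta_\pi}}\left[\ell_v^\tau(\theta_v)\right], \tag{2}$$

alternatingly, we learn a policy and a value function that maximize expected return.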
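To make the two A3C losses concrete, the following sketch evaluates them on a short rollout. It is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, probabilities and values are plain Python lists rather than network outputs, and every step of the rollout bootstraps from the value of the final state (a common A3C convention, so the number of steps $k$ shrinks towards the end of the rollout).

```python
import math


def k_step_returns(rewards, bootstrap_value, gamma):
    """Return R_t for every step of a rollout, bootstrapped from V(x_{t+k})."""
    returns = []
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    returns.reverse()
    return returns


def entropy(probs):
    """Shannon entropy H(p) = -sum_a p(a) log p(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_loss(action_probs, actions, returns, values, beta):
    """(1/|tau|) sum_t [ -log pi(a_t|x_t) (R_t - V(x_t)) - beta H(pi(.|x_t)) ]."""
    total = 0.0
    for probs, a, ret, v in zip(action_probs, actions, returns, values):
        advantage = ret - v  # scaling factor of the negative log-likelihood
        total += -math.log(probs[a]) * advantage - beta * entropy(probs)
    return total / len(actions)


def value_loss(returns, values):
    """(1/(2|tau|)) sum_t (R_t - V(x_t))^2."""
    n = len(returns)
    return sum((ret - v) ** 2 for ret, v in zip(returns, values)) / (2 * n)


# Toy two-step rollout with a uniform two-action policy.
rets = k_step_returns([1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
vals = [1.0, 1.0]
probs = [[0.5, 0.5], [0.5, 0.5]]
print(rets)
print(policy_loss(probs, [0, 1], rets, vals, beta=0.01))
print(value_loss(rets, vals))
```

In a real agent the advantage is treated as a constant when differentiating the policy loss, and the two losses are minimized with respect to $\theta_\pi$ and $\theta_v$ respectively.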

3 KNOWLEDGE FLOW

Instead of optimizing the programs given in Eq. (1) and Eq. (2) from scratch, the aforementioned warm-start techniques (see Sec. 5 for more) are applicable. To address their shortcomings, we propose knowledge flow, a framework that moves knowledge from an arbitrary number of deep nets, henceforth referred to as teachers, to a deep net under training, called the student.


Figure 1: (a) Example of a two-teacher knowledge flow. (b) Deep net transformation of knowledge flow. (c) Average normalized weights for teachers' and the student's layers. At the beginning of training, the student heavily relies on teacher one. As training progresses, teacher one's weight decreases, and the student's weight increases until the student is eventually independent.

3.1 OVERVIEW

Knowledge ﬂow is outlined on example deep nets in Fig. 1 (a,b). We train the parameters of the student net which are randomly initialized. To this end we take advantage of teachers, whose parameters are ﬁxed and obtained from pre-trained models on different source tasks by different algorithms. For example, for reinforcement learning, we may consider teachers trained by A3C (Mnih et al., 2016), A2C (Dhariwal et al., 2017) or DQN (Mnih et al., 2015).

Knowledge of multiple teachers is transferred to a student by adding transformed and scaled intermediate representations from the teacher deep nets to the student net. To achieve this, we modify the student net, i.e., $f_\theta$ in the supervised setting and $\pi_{\theta_\pi}(a|x)$, $V_{\theta_v}(x)$ in the reinforcement learning case. We add teacher representations which are transformed by multiplication with a trainable matrix $Q$ and scaled via a weight $p_w$ that is normalized to sum to one for each student layer and parameterized via trainable parameters $w$. The normalized weights encode which of the teachers' or the student's representation to trust at every layer of the student net. Note that a teacher can help the student at different levels of abstraction with input from different levels of its net.

Importantly, after training, the student model should perform well on the target task without relying on teachers. To achieve this, as training progresses, we increasingly encourage a high normalized weight on the student representation, which forces the student to eventually capture all the knowledge. Due to the trainable scaling, at an early stage of training, we observe the student to rely heavily on the knowledge of the teacher to quickly obtain better performance. However, as training proceeds, the student is encouraged to become more and more independent. During ﬁnal stages of training, the student will no longer be able to rely on teachers, which ensures that the student has learned to master the desired task on its own. This is observed in Fig. 1 (c).

To formally encourage this successive transfer we introduce two additional loss functions. The first, referred to as the dependency loss $\ell_{dep}(w)$, captures how much a student relies on teachers. It depends on the weight vector $w$ which encodes the strength of the coupling. The second one ensures that a student's behavior doesn't change rapidly when the teachers' influence decreases. We use loss $\ell_{KL}(\cdot, \cdot)$ to capture the change.

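By combining student net modifications and additional loss terms, for the supervised task we obtain

$$\min_{\theta, w, Q} \; \mathbb{E}_{(x,y)}\left[ \tilde{\ell}_{(x,y)}(\theta, w, Q) + \lambda_1 \ell_{dep}(w) + \lambda_2 \ell_{KL}\big(\tilde{\hat{f}}_\theta, \tilde{\hat{f}}_{\theta_{old}}\big) \right], \tag{3}$$

and for reinforcement learning the transformed program reads as follows:

$$\min_{\theta_\pi, w, Q} \; \mathbb{E}_{\tau \sim \pi_{\theta_\pi}}\left[ \tilde{\ell}_\pi^\tau(\theta_\pi, w, Q) + \lambda_1 \ell_{dep}(w) + \lambda_2 \ell_{KL}^\tau\big(\tilde{\hat{\pi}}_{\theta_\pi}, \tilde{\hat{\pi}}_{\theta_{\pi,old}}\big) \right], \qquad \min_{\theta_v, w, Q} \; \mathbb{E}_{\tau \sim \pi_{\theta_\pi}}\left[ \tilde{\ell}_v^\tau(\theta_v, w, Q) \right]. \tag{4}$$

Loss $\tilde{\ell}(\theta, w, Q)$ originates from the original loss $\ell(\theta)$ (Eqs. (1)-(2)) by transforming the deep net to include cross-connections, hence its dependence on $w$, $Q$. The tilde ( $\tilde{\cdot}$ ) denotes this dependence, also for probability distribution $\hat{f}$ and policy distribution $\hat{\pi}$. Parameters from the current and a previous iteration are referred to via $\theta$ and $\theta_{old}$ respectively.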


For both supervised and reinforcement learning, λ1 and λ2 control the strength which is used to decrease the inﬂuence of the teacher. A low λ1 allows the student to rely on teachers. Close to the end of training, the student should be independent. Therefore, we set λ1 to a small value at the beginning, and gradually increase its value as training progresses.
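The gradual increase of $\lambda_1$ can be sketched as a simple schedule. The paragraph above only requires a small initial value that grows during training; the linear ramp below is one choice (it matches the reinforcement learning setup in Sec. 4.1, where $\lambda_1$ starts at zero and is increased linearly to its final value). The function name and arguments are hypothetical.

```python
def lambda1_schedule(step, total_steps, lambda1_final, lambda1_start=0.0):
    """Linearly ramp lambda_1 from lambda1_start to lambda1_final.

    Assumption: a linear ramp; other monotonically increasing
    schedules would satisfy the description equally well.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return lambda1_start + frac * (lambda1_final - lambda1_start)


for step in (0, 50, 100):
    print(step, lambda1_schedule(step, total_steps=100, lambda1_final=0.5))
```

A small value early in training leaves the student free to lean on its teachers; the growing weight on the dependency loss then pushes it towards independence.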

Note that we don't make any assumptions about the teachers' and the student's objectives. If a teacher's and the student's objectives differ, negative transfer may occur initially. However, the proposed method quickly decreases the weight for teacher layers to reduce this effect. Despite differences, students could potentially still benefit from the low level representation of the teachers. We do observe this low level knowledge transfer in our experiments.

In the following we ﬁrst describe how to modify the deep nets, before we detail the loss functions ℓdep and ℓKL, which are used to successively decrease the inﬂuence of the teachers.

3.2 DEEP NET TRANSFORMATION AND LOSS TERMS

Deep Net Transformation: Knowledge ﬂow enhances the student by adding transformed and scaled intermediate representations from teacher models. To perform the transformation, intermediate representations from teachers are ﬁrst multiplied by transformation matrices Q. Then the transformed representations from teachers and representations from the student are linearly combined. The weights for this linear combination are determined by a weight pw which is normalized to sum to one for each student layer.

Let index $m = 0$ denote the student model and let $\theta^{(0)}$ refer to its parameters. Further, let $\theta^{(m)}$, $m \in \{1, \ldots, M\}$, denote teacher models. We use $l_m^i$ to refer to deep net layer $i$ of teacher $m$, with $i \in \{1, \ldots, L_m\}$ and $L_m$ the number of layers in teacher $m$. We define layer $j$ of the student model to be $l_0^j$, where $j \in \{1, \ldots, L_0\}$ and $L_0$ is the number of deep net layers in the student model. The output of layer $l_m^k$ right before and after an activation unit is denoted $z(l_m^k)$ and $h(l_m^k)$ respectively.

To align a teacher's layer $l_m^i$ with a student's layer $l_0^j$, we introduce a learnable transformation matrix $Q_j(l_m^i) \in \mathbb{R}^{\dim(l_0^j) \times \dim(l_m^i)}$, where $\dim(\cdot)$ gives the number of elements in the corresponding layer. The matrix multiplication $Q_j(l_m^i) z(l_m^i)$ aligns the representation from layer $i$ of teacher $m$ with the representation of layer $j$ of the student.

For each layer $j$ in the student model, we define a candidate set $\mathcal{L}_j$, which contains $l_0^j$ and all the teachers' layers to be considered. For example, in Fig. 1 (a), layer one of the student model is combined with layer one of teacher one and layer two of teacher two. Therefore, the candidate set of layer one of the student model is given by $\mathcal{L}_1 = \{l_0^1, l_1^1, l_2^2\}$.

To decide which teacher's or the student's representation to trust at every layer of the student net, we introduce a normalized weight $p_w^j(l)$ for all $j \in \{1, \ldots, L_0\}$, where $l \in \mathcal{L}_j$, summing to one for each layer $j$ in the student deep net, i.e.,

$$\sum_{l \in \mathcal{L}_j} p_w^j(l) = 1, \quad \forall j \in \{1, \ldots, L_0\}.$$

To obtain the combined intermediate representation of layer $j$ for the student model, we use

$$h(l_0^j) = \sigma\Big( \sum_{l \in \mathcal{L}_j \setminus \{l_0^j\}} p_w^j(l)\, Q_j(l)\, z(l) + p_w^j(l_0^j)\, z(l_0^j) \Big),$$

where $p_w^j(l_m^i)$ determines how much the student layer $j$ relies on transformed representations of layer $i$ from the $m$-th teacher. Intuitively, if the transformed representation of the $m$-th teacher's layer $i$ is helpful, $p_w^j(l_m^i)$ will be close to one. We visualize the deep net transformation in Fig. 1 (b).

Note that the intermediate representations of teachers are not changed in our framework. To obtain the output of layer $l_m^i$ we apply the original activation unit to the original representation $z(l_m^i)$, i.e., $h(l_m^i) = \sigma(z(l_m^i))$, $\forall m \in \{1, \ldots, M\}$, $i \in \{1, \ldots, L_m\}$.
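The combination of teacher and student representations at one student layer can be sketched in plain Python. This is a minimal illustration, not the authors' code: it assumes that the normalized weights are a softmax over the trainable parameters $w$ (the text only states that they sum to one) and that the activation $\sigma$ is a ReLU. All helper names are hypothetical.

```python
import math


def softmax(ws):
    """Normalize raw weights w so that they are positive and sum to one."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(Q, z):
    """Multiply transformation matrix Q (list of rows) with vector z."""
    return [sum(q * x for q, x in zip(row, z)) for row in Q]


def student_layer_output(z_student, teacher_zs, Qs, w):
    """h(l_0^j) = sigma(sum_l p(l) Q(l) z(l) + p(l_0^j) z(l_0^j)).

    w[0] is the raw weight of the student layer, w[1:] those of the
    teacher layers in the candidate set (assumed ordering).
    """
    p = softmax(w)
    combined = [p[0] * x for x in z_student]
    for p_l, Q, z in zip(p[1:], Qs, teacher_zs):
        for idx, val in enumerate(matvec(Q, z)):
            combined[idx] += p_l * val
    return [max(0.0, x) for x in combined]  # ReLU as sigma (assumption)


# One teacher layer with 3 units is mapped onto a student layer with 2 units.
Q = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
h = student_layer_output([1.0, -2.0], [[1.0, 2.0, 3.0]], [Q], w=[0.0, 0.0])
print(h)  # equal weights of 0.5 give [1.0, 0.5]
```

When the raw weight of the student layer dominates, the teacher term vanishes and the layer reduces to the plain student activation, which is the situation the dependency loss drives towards.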

The maximal number of introduced matrices $Q$ in our framework is $\sum_{m=1}^{M} L_m L_0$. In practice, we don't link a student's layer to every layer of a teacher network. Intuitively, a teacher's bottom layer features are very likely irrelevant to a student's top layer features. Indeed, we observed that linking a teacher's bottom layer to a student's top layer generally doesn't yield improvements. Therefore, in practice, we recommend linking one teacher layer to one or two student layers, in which case we introduce on the order of $M L_0$ matrices $Q$. Also note that while additional trainable parameters $Q$ and $w$ are introduced in our framework, $Q$ and $w$ are not part of the resulting student network since we ensure $p_w^j(l) \approx 0 \;\; \forall l \in \mathcal{L}_j \setminus \{l_0^j\}$ at the end of training, as discussed next. Hence, the additional parameters function as auxiliary knobs that help the student learn faster. In the final stage of training, the student will be independent (see Fig. 1 (c)) and no longer relies on $Q$, $w$, or any transformed representations from teachers.

Table 1: Comparison with PathNet (Fernando et al., 2017) and progressive neural network (PNN) (Rusu et al., 2016b). Since PathNet and PNN don't report exact scores, we read their numbers off their plots, so the PathNet and PNN columns are approximate. The results of the state-of-the-art methods A3C (Mnih et al., 2016), PPO (Schulman et al., 2017), and ACKTR (Wu et al., 2017) on Atari games are also listed for reference.

| Game | Ours (w/ Seaquest teacher) | PathNet (w/ Seaquest teacher) | Ours (w/ Riverraid teacher) | PathNet (w/ Riverraid teacher) | Ours (w/ Sea. and River. teachers) | PNN (w/ Sea. and River. teachers) | A3C (no teachers) | PPO (no teachers) | ACKTR (no teachers) |
|---|---|---|---|---|---|---|---|---|---|
| Alien | 1254 | 1700 | 1259 | 1800 | 1911 | 2000 | 182 | 1850 | 3197 |
| Asterix | 3982 | 2000 | 3823 | 2000 | 6012 | 9000 | 6723 | 4533 | 31583 |
| Boxing | 96 | 70 | 96 | 80 | 99 | 99 | 34 | 95 | 1 |
| Gopher | 4152 | 3900 | 3820 | 2100 | 5233 | 4500 | 8443 | 2933 | 47730 |
| Hero | 21250 | 12500 | 29343 | 12500 | 30928 | 30000 | 28766 | n/a | n/a |
| James. | 857 | 600 | 832 | 600 | 1245 | 850 | 352 | 561 | 512 |
| Krull | 8193 | 7800 | 6890 | 7500 | 10000 | 9954 | 8067 | 7942 | 9689 |
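As a rough worked count of the transformation matrices, the sketch below compares full linking with sparse linking. It assumes two teachers and a student that each have three hidden layers (the architecture used in Sec. 4.1); the function names and the `links_per_teacher_layer` parameter are hypothetical.

```python
def max_num_transforms(teacher_depths, student_depth):
    """Every teacher layer linked to every student layer: sum_m L_m * L_0."""
    return sum(teacher_depths) * student_depth


def sparse_num_transforms(teacher_depths, links_per_teacher_layer=1):
    """Each teacher layer linked to only one (or two) student layers."""
    return sum(teacher_depths) * links_per_teacher_layer


teachers = [3, 3]  # M = 2 teachers with L_m = 3 layers each (assumption)
print(max_num_transforms(teachers, student_depth=3))        # 18
print(sparse_num_transforms(teachers))                      # 6, i.e. M * L_0
print(sparse_num_transforms(teachers, links_per_teacher_layer=2))  # 12
```

With equally deep teachers and student, the sparse count grows like $M L_0$ rather than $M L_0^2$.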

Decreasing Teachers' Influence: We successively decrease the influence of the teachers during training by gradually encouraging the normalized weight $p_w^j(l_0^j)$ to increase to a value of one for all $j \in \{1, 2, \ldots, L_0\}$. To capture how much the student relies on teachers, we introduce the dependence cost as the negative log probability:

$$\ell_{dep}(w) = -\frac{1}{L_0} \sum_{j \in \{1, 2, \ldots, L_0\}} \log p_w^j(l_0^j). \tag{5}$$

By minimizing $\ell_{dep}(w)$, we encourage weights for the layers of the student to increase. Hence we encourage the student to become more and more independent. During the final stage of training, $p_w^j(l_0^j)$ approaches one for all $j \in \{1, \ldots, L_0\}$, making the student independent of the transformed representation obtained from teachers.
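The behavior of the dependence cost can be checked numerically. The sketch below takes the normalized student weights $p_w^j(l_0^j)$, one per student layer, and returns their negative log probability; averaging over layers is an assumption about the normalization, and the function name is hypothetical.

```python
import math


def dependency_loss(student_weights):
    """Negative log of the student's own normalized weights.

    student_weights[j] plays the role of p_w^j(l_0^j) for student layer j.
    Assumption: the per-layer terms are averaged over the L_0 layers.
    """
    return sum(-math.log(p) for p in student_weights) / len(student_weights)


print(dependency_loss([0.2, 0.3, 0.1]))  # student leans heavily on teachers
print(dependency_loss([0.9, 0.95, 0.99]))  # student is almost independent
print(dependency_loss([1.0, 1.0, 1.0]))  # fully independent student
```

The loss is large while the student leans on its teachers and reaches zero only when every student layer carries a weight of one.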

Empirically, we found that a fast decrease of the influence of the teacher can degrade the performance. This is intuitive as it requires some time to find good transformations $Q$. Moreover, decreasing the influence of a teacher too fast may change the output distribution over labels or actions of the student model too much, and thus lead to performance loss. To prevent changing a student's output distribution too fast, we found a Kullback-Leibler (KL) regularizer to yield good results. More specifically, in the case of supervised learning we use

$$\ell_{KL}\big(\tilde{\hat{f}}_\theta, \tilde{\hat{f}}_{\theta_{old}}\big) = D_{KL}\left[ \tilde{\hat{f}}_\theta(\cdot|x) \,\|\, \tilde{\hat{f}}_{\theta_{old}}(\cdot|x) \right]. \tag{6}$$

Hereby, $\theta$ is the set of current parameters, and $\theta_{old}$ are the previous ones. In the reinforcement learning case we use $D_{KL}\left[ \tilde{\hat{\pi}}_\theta(\cdot|x_t) \,\|\, \tilde{\hat{\pi}}_{\theta_{old}}(\cdot|x_t) \right]$.
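The KL regularizer compares the student's current output distribution with the one from a previous iteration. A minimal sketch for discrete distributions follows; it assumes both distributions are given as lists of probabilities over the same labels or actions and that the old distribution is strictly positive wherever the current one is. The function name is hypothetical.

```python
import math


def kl_divergence(p_current, p_old):
    """D_KL[p_current || p_old] for two discrete distributions.

    Assumption: p_old[i] > 0 wherever p_current[i] > 0.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_current, p_old)
        if p > 0.0
    )


old = [0.25, 0.75]
print(kl_divergence(old, old))           # unchanged output: no penalty
print(kl_divergence([0.5, 0.5], old))    # shifted output: positive penalty
```

The penalty is zero when the output distribution is unchanged and grows as the student's predictions drift away from the previous iteration, which discourages abrupt changes while the teachers' weights are being reduced.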

4 EXPERIMENTAL RESULTS

In the following we evaluate knowledge ﬂow on reinforcement and supervised learning tasks. Results are reported by using only the student model to avoid even the smallest inﬂuence from any teacher nets.

4.1 REINFORCEMENT LEARNING

We evaluate knowledge flow on reinforcement learning using Atari games that were used by Rusu et al. (2016b); Fernando et al. (2017). Following existing work, the input to our agent consists of raw images from the environment. The agent learns to predict actions based only on the rewards and the input images from the environment. The agent chooses an action every four frames, and the last action is repeated on the skipped frames. For all teacher models and the student model, we use the feed-forward architecture of A3C (Mnih et al., 2016). The model has three hidden layers. The first layer is a convolutional layer with 16 filters of size 8x8 and stride 4. The second layer is a convolutional layer with 32 filters of size 4x4 and stride 2. The third layer is a fully connected layer with 256 hidden units. Following the third hidden layer are two sets of output. One is a softmax output that provides a probability distribution over all valid actions. The other one is a scalar output that provides the estimated value function.

We use the same hyper-parameter settings as Mnih et al. (2016) except for the learning rate and the optimizer. Mnih et al. (2016) use RMSProp with shared statistics while we use Adam with shared statistics, which we found to give better results when training the baselines. The learning rate is set to $10^{-4}$ and gradually decreased to zero for all experiments. To select $\lambda_1$ and $\lambda_2$ in our framework, we follow progressive neural net (Rusu et al., 2016b): randomly sample $\lambda_1 \in \{0.05, 0.1, 0.5\}$ and $\lambda_2 \in \{0.001, 0.01, 0.05\}$. Note that $\lambda_1$ is set to zero at the beginning of training, and linearly increased to the sampled value at the end of training. Following Rusu et al. (2016b), we repeat each experiment 25 times with different random seeds and randomly sampled $\lambda_1$ and $\lambda_2$. The results of the top three out of 25 runs are reported. As in A3C, we run 16 agents on 16 CPU cores in parallel.

Figure 2: Comparison with progressive neural network and PathNet. Panels: (a) Alien, (b) Boxing, (c) Gopher, (d) Hero, (e) James Bond, (f) Krull. Horizontal axis: # of steps (0 to 4.0e7). Curves: ours w/ seaq. and river.; ours w/ seaq.; ours w/ river.; ours baseline; Prog. Net w/ seaq. and river.; PathNet w/ seaq.; PathNet w/ river.

Evaluation Metrics: We follow the evaluation procedure of Mnih et al. (2015). The trained student models are evaluated by playing each game for 30 episodes. We also follow the no-op procedure: at the beginning of each testing episode, the agents perform up to 30 no-op actions.

Results: We first compare our framework with PathNet (Fernando et al., 2017) and progressive neural net (PNN) (Rusu et al., 2016b), which are state-of-the-art transfer reinforcement learning frameworks, using their experimental settings. The comparison is summarized in Table 1. The state-of-the-art results (Mnih et al., 2016; Schulman et al., 2017; Wu et al., 2017) on Atari games are also included in Table 1 for reference. Compared to PathNet, a student model trained using our transfer framework with one teacher achieves higher scores in 11 out of 14 experiments. Compared with PNN, for a two-teacher framework, our trained student model has only 0.7M parameters while PNN has 16M parameters. Nonetheless we observe higher scores in five out of the seven experiments. The results demonstrate that knowledge flow effectively transfers knowledge from teachers to the student. Table 1 also indicates that, in our framework, when the number of teachers increases from one to two, the student's performance improves significantly across all experiments. The training curves for the experiments are shown in Fig. 2. Each curve is the average of the top three out of 25 runs. We observe our approach to generally perform very well.
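The 0.7M parameter figure for the student can be sanity-checked from the layer sizes given in this section. The sketch below counts weights and biases of the three hidden layers and the two output heads. Several inputs are assumptions that the text does not state: an 84x84 input with 4 stacked frames (the usual Atari preprocessing of Mnih et al.), convolutions without padding, and 18 actions. The function names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def a3c_ff_param_count(in_size=84, in_channels=4, num_actions=18):
    """Count weights and biases of the feed-forward A3C network.

    Assumptions: 84x84x4 input, no padding, 18 actions.
    """
    s1 = conv_out(in_size, kernel=8, stride=4)   # 20
    s2 = conv_out(s1, kernel=4, stride=2)        # 9
    conv1 = 8 * 8 * in_channels * 16 + 16        # 16 filters of size 8x8
    conv2 = 4 * 4 * 16 * 32 + 32                 # 32 filters of size 4x4
    fc = s2 * s2 * 32 * 256 + 256                # 256 hidden units
    policy_head = 256 * num_actions + num_actions
    value_head = 256 + 1
    return conv1 + conv2 + fc + policy_head + value_head


print(a3c_ff_param_count())  # 681027, i.e. roughly 0.7M
```

Almost all parameters sit in the fully connected layer, so the count changes little with the number of actions.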


Table 2: Comparison with fine-tuning and baseline A3C on different environment/teacher settings. The parentheses following each number indicate the teachers being used. E.g., (alien, space I.) indicates that one teacher is an Alien expert and the other is a Space Invaders expert.

| Game | Ours w/ expert (teachers) | Ours w/ non-expert (teachers) | Fine-tune (teacher) | A3C baseline | A3C |
|---|---|---|---|---|---|
| Alien | 1705 (alien, space I.) | 1923 (bank., space I.) | 996 (bank.) | 1303 | 182 |
| Breakout | 400 (breakout, space I.) | 306 (pong, space I.) | 261 (space I.) | 99 | 552 |
| Chopper Command | 8120 (chopper., space I.) | 6013 (sea., space I.) | 3789 (sea.) | 4513 | 4669 |
| Kung Fu Master | 29458 (kungfu., sea.) | 35103 (sea., hero) | 26752 (hero) | 29446 | 3046 |
| Ms Pacman | 2411 (mspac., alien) | 2450 (alien, space I.) | 1324 (alien) | 1628 | 594 |
| Seaquest | 1873 (sea., chopper.) | 32103 (chopper., space I.) | 1590 (chopper.) | 1670 | 2300 |

Figure 3: Comparison with fine-tuning and baseline A3C on different combinations of environment/teacher settings. Panels: (a) Seaquest (ours w/ Seaquest and Chopper Command; ours w/ Chopper Command and Space Invaders; fine-tune from Chopper Command; baseline), (b) Kung Fu Master (ours w/ Kung Fu Master and Seaquest; ours w/ Seaquest and Hero; fine-tune from Hero; baseline), (c) Alien (ours w/ Alien and Space Invaders; ours w/ Bank Heist and Space Invaders; fine-tune from Bank Heist; baseline). Horizontal axis: # of steps (0 to 4.0e7).

To further evaluate knowledge flow, we experiment with different combinations of environment/teacher settings. These settings are not used by PathNet and progressive neural network. The results are summarized in Table 2, where "ours w/ expert" means that one teacher is an expert for the target game; "ours w/ non-expert" means that neither teacher is an expert for the target game; "Fine-tune" means fine-tuning from a non-expert on a new target game; "A3C baseline" is our implementation of the A3C baseline; and "A3C" gives the scores reported originally (Mnih et al., 2016). Note that our A3C implementation achieves better scores than those reported by Mnih et al. (2016) for most of the games.

As shown in Table 2, knowledge flow with an expert teacher performs better than the baseline across all experiments, which we interpret as evidence that knowledge flow successfully transfers knowledge from an expert teacher to the student. In addition, knowledge flow with non-expert teachers also outperforms fine-tuning on a non-expert teacher. The reasons are twofold: First, a student model in knowledge flow can learn from multiple teachers while the fine-tuning method can only start from one setting. Second, in knowledge flow, the student can avoid the negative impact from insufficiently pretrained teachers, while fine-tuning from an insufficiently pretrained model slows down the training process and may degrade the overall performance. The training curves for the experiments are shown in Fig. 3. More training curves are in the Appendix (Fig. 6).

Note that in knowledge flow, the student can benefit from the intermediate representations of the teacher, even if input space, output space and objectives differ. For example, in Fig. 3 (a), the two teachers are Chopper Command and Space Invaders, which are quite different from the target game Seaquest. The student model still benefits from learning from the teachers and achieves scores more than ten times larger than learning without a teacher or fine-tuning from a teacher.

4.2 SUPERVISED LEARNING

For supervised learning, we use a variety of image classification benchmarks, including CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), and EMNIST (Cohen et al., 2017). The parameters $\lambda_1$ for the dependency loss and $\lambda_2$ for the KL loss are determined using the validation set of each dataset.

Evaluation Metrics: To evaluate the trained student model we report top-1 error rate on the test set of each dataset. All plots and reported numbers are the average of three runs obtained using different random seeds.


Table 3: Test error (%) on CIFAR-10/100. The parentheses following "Ours" indicate the teachers we use. E.g., Ours (C100, SVHN) indicates that we use a C100 expert and an SVHN expert as teachers.

(a) Target task: C10

| | Baseline DenseNet | Fine-tune from C100 | Fine-tune from SVHN | Ours (C100, SVHN) |
|---|---|---|---|---|
| C10 | 4.44 | 4.27 | 4.58 | 3.88 |

(b) Target task: C100

| | Baseline DenseNet | Fine-tune from C10 | Fine-tune from SVHN | Ours (C10, SVHN) |
|---|---|---|---|---|
| C100 | 21.64 | 20.83 | 21.02 | 20.78 |

CIFAR-10/CIFAR-100: The CIFAR-10 and CIFAR-100 datasets consist of colored images of size 32x32. CIFAR-10 (C10) has 10 classes and CIFAR-100 (C100) has 100 classes. For both datasets, the training and test sets contain 50,000 and 10,000 images respectively. We perform all experiments on CIFAR-10 and CIFAR-100 with standard data augmentation (Huang et al., 2017).

We use DenseNet (Huang et al., 2017) (depth 100, growth rate 24) as a baseline and follow their hyper-parameter settings to train our baseline, teacher and student models. For our approach, we first train teachers on CIFAR-10, CIFAR-100, and SVHN (Netzer et al., 2011). We then train the student model using different combinations of teachers. We compare our results to fine-tuning and the baseline model. As shown in Table 3 (a), for the CIFAR-10 target task, fine-tuning from the CIFAR-100 expert reduces the test error by about 4% relative to the baseline (4.44% to 4.27%). Fine-tuning from the SVHN expert performs worse than the baseline model. Intuitively, for the CIFAR-10 target task, the CIFAR-100 deep net is a good teacher while a deep net trained with SVHN isn't. Presented with both good and inadequate teachers, knowledge flow reduces the test error by about 13% relative to the baseline (4.44% to 3.88%). This demonstrates that knowledge flow can not only leverage a good teacher's knowledge, but can also avoid misleading influence. As detailed in Table 3 (b), the results are similar on the CIFAR-100 dataset.
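The improvements quoted above are relative reductions of the test error, which can be reproduced from the numbers in Table 3. The helper name below is hypothetical.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative reduction of the error rate, in percent of the baseline."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# CIFAR-10 target task, test errors from Table 3 (a).
baseline = 4.44
for name, err in [("fine-tune from C100", 4.27),
                  ("fine-tune from SVHN", 4.58),
                  ("ours (C100, SVHN)", 3.88)]:
    print(f"{name}: {relative_error_reduction(baseline, err):.1f}%")
```

A negative value, as for fine-tuning from the SVHN expert, means the error increased relative to the baseline.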

To further demonstrate the properties of knowledge ﬂow, additional results are in the appendix.

5 RELATED WORK

As mentioned before, knowledge transfer has been considered using a variety of techniques. We briefly discuss related work in contrast to our approach in the following and defer details to Sec. 8.

PathNet (Fernando et al., 2017) enables multiple agents to train the same deep net while reusing parameters and avoiding catastrophic forgetting. In contrast to this formulation we consider availability of multiple pre-trained teacher nets.

Progressive Net (Rusu et al., 2016b) leverages transfer and avoids catastrophic forgetting by introducing lateral connections to previously learned features. Our method uses similar lateral connections. However, in contrast to Rusu et al. (2016b), our method ensures independence of the student upon training, addressing a limitation of Rusu et al. (2016b) where only a fraction of the capacity of the student is eventually utilized.

Distral (Teh et al., 2017), a neologism combining distill & transfer learning, considers joint training of multiple tasks. Multiple tasks share a distilled policy which encodes common behavior between different tasks. While each worker addresses its own task, a shared policy encourages consistency between the policies. Different from Distral, which is a multi-task learning framework where multiple tasks are addressed at the same time, knowledge flow addresses a single task. Common to multi-task learning and knowledge flow is a transfer of information. However, in multi-task learning, information extracted from different tasks is shared to boost performance, while in knowledge flow, the information of multiple teachers is leveraged to help a student better learn a single, new, previously unseen task.

Other related work includes actor-mimic (Parisotto et al., 2016), learning without forgetting (Li & Hoiem, 2016), growing a brain (Wang et al., 2017), policy distillation (Rusu et al., 2016a), domain adaptation (Pan & Yang, 2010; Long et al., 2015; Tzeng et al., 2015), knowledge distillation (Hinton et al., 2015) or lifelong learning (Chen & Liu, 2016). A more detailed discussion on related work is provided in Sec. 8 of the supplementary material.

6 CONCLUSION

We developed a general knowledge flow approach that permits training a deep net from any number of teachers. We showed results for reinforcement learning and supervised learning, demonstrating improvements compared to training from scratch and to fine-tuning. In the future we plan to learn when to use which teacher and how to actively swap teachers during training of a student.

REFERENCES

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proc. ICML, 2009.

Aaron Chen. pytorch-playground. https://github.com/aaron-xichen/pytorch-playground, 2017.

Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016.

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proc. AISTATS, 2011.

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.

Prafulla Dhariwal, Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI baselines, 2017.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

Tommaso Furlanello, Jiaping Zhao, Andrew M. Saxe, Laurent Itti, and Bosco S. Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proc. CVPR, 2017.

Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in deep neural networks. arXiv, 2016.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Zhizhong Li and Derek Hoiem. Learning without forgetting. In Proc. ECCV, 2016.

Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In Proc. ICML, 2015.

T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohammad, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. In Proc. AAAI, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. In Nature, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. ICML, 2016.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.


Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Trans. on Knowl. and Data Eng., 2010.

Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proc. ICLR, 2016.

Vishal M. Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Process. Mag., 2015.

Andrei A. Rusu, Sergio Gomez Colmenarejo, Çaglar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Proc. ICLR, 2016a.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016b.

Paul Ruvolo and Eric Eaton. ELLA: An efficient lifelong learning algorithm. In Proc. ICML, 2013.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Yee Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Proc. NIPS, 2017.

Martin Thoma. Analysis and optimization of convolutional neural network architectures. arXiv preprint arXiv:1707.09725, 2017.

Sebastian Thrun. Lifelong learning algorithms. In Learning to Learn. Springer US, 1998.

Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In Proc. ICCV, 2015.

Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In Proc. CVPR, 2017.

Yuhuai Wu, Elman Mansimov, Shun Liao, Roger B. Grosse, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proc. NIPS, 2017.

Junbo Jake Zhao, Michaël Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where autoencoders. arXiv preprint arXiv:1506.02351, 2015.


Table 4: Test error (%) of distilled student net.

| Model | MNIST | MNIST w/o digit 3 | C100 | ImageNet |
|---|---|---|---|---|
| Student alone | 1.46 | 11.06 | 31.87 | 30.24 |
| KD (Hinton et al., 2015) | 0.74 | 2.06 | 30.28 | 30.04 |
| Ours | 0.73 | 1.05 | 30.07 | 29.05 |

Table 5: Our approach on the EMNIST Letters dataset.

| Model (Teacher) | Test error (%) |
|---|---|
| Cohen et al. (2017) | 14.85 |
| Fine-tune from EMNIST digits | 9.04 |
| Baseline | 9.20 |
| Ours (EMNIST letters) | 7.13 |
| Ours (EMNIST half letters) | 8.13 |
| Ours (EMNIST digit) | 8.11 |

7.1 SUPERVISED LEARNING

Comparison with Knowledge Distillation: We follow knowledge distillation (KD) (Hinton et al., 2015) to distill knowledge from a larger model (teacher) to a smaller model (student). The student models have between 5% and 50% of the parameters of the teacher models. Following their setup, we conduct experiments on MNIST, MNIST with digit 3 missing in the training set, CIFAR-100, and ImageNet. For MNIST and MNIST with digit 3 missing, following KD, the teacher model is an MLP with two hidden layers of 1200 hidden units each, and the student model is an MLP with two hidden layers of 800 hidden units each. For CIFAR-100, we use the model from Chen (2017) as the teacher model. The student model follows the structure of the teacher, but the number of output channels of each convolutional layer is halved. For ImageNet, the teacher model is a 50-layer ResNet (He et al., 2016), and the student model is an 18-layer ResNet. The test errors of the distilled student models are summarized in Table 4. Our framework consistently outperforms KD because the student model in our framework benefits not only from the output layer behavior of the teacher but also from the teacher's intermediate layer representations.
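For readers unfamiliar with the KD baseline, the following is a minimal NumPy sketch of a soft-target loss in the style of Hinton et al. (2015), in which the student is trained to match temperature-softened teacher outputs. The function names, the temperature value, and the T² scaling are assumptions made for illustration; they are not taken from the experiments reported here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperature (4.0) and the T^2 scaling are illustrative assumptions.
    """
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    cross_entropy = -(p_teacher * log_p_student).sum(axis=-1).mean()
    return temperature ** 2 * cross_entropy

# Hypothetical logits for a batch of one example with three classes.
teacher = np.array([[2.0, 0.5, -1.0]])
matching_student = teacher.copy()
mismatched_student = -teacher
loss_match = soft_target_loss(matching_student, teacher)
loss_mismatch = soft_target_loss(mismatched_student, teacher)
```

The loss is smallest when the student reproduces the teacher's softened distribution, which is why this baseline uses only the teacher's output layer, whereas knowledge flow additionally uses intermediate representations.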

The EMNIST Letters dataset consists of images of size 28 × 28 pixels showing handwritten letters. It has 26 balanced classes. Each class contains lower and upper case letters. The training and test sets contain 124,800 and 20,800 images respectively. The EMNIST Digits dataset consists of images of size 28 × 28 pixels showing handwritten digits. It has 10 balanced classes. The training and test sets contain 240,000 and 40,000 images respectively.

In this case we use the MNIST model from Chen (2017) as the baseline, teacher and student model. We trained teachers on EMNIST Digits, EMNIST Letters, and EMNIST Letters with only 13 classes. Our target task is EMNIST Letters. The student model is trained with different teachers and the results are compared to fine-tuning, the baseline model, and the state-of-the-art results on EMNIST. The results are summarized in Table 5. Compared to the baseline and fine-tuning, a student trained in our framework performs better with each kind of teacher: an expert teacher (EMNIST Letters), a semi-expert teacher (half of EMNIST Letters), and a non-expert teacher (EMNIST Digits). In Fig. 4 we illustrate the accuracy over epochs for training of different models.

The STL-10 dataset consists of colored images of size 96 × 96 pixels. It has 10 balanced classes. The training set contains 5,000 labeled images and 100,000 unlabeled images. The test set contains 8,000 images. In our experiment, we only use the 5,000 labeled images for training.

We use the STL-10 model from Chen (2017) as our baseline, teacher and student model. We trained teachers on CIFAR-10 and CIFAR-100. We compare our results to fine-tuning and the baseline in Table 6. Note that STL-10 is very similar to CIFAR-10 and CIFAR-100. Therefore, both CIFAR-10 and CIFAR-100 are very good teachers. As shown in Table 6, compared to the baseline, fine-tuning a model using weights pretrained on CIFAR-10 or CIFAR-100 reduces the test error by more than 10 percentage points. Compared with fine-tuning, student model training in our framework further reduces the test error by about 3 percentage points. Note that we train only on the labeled data, while other approaches use this dataset to evaluate semi-supervised methods. Hence our results are obtained using less data and may not be directly comparable. We still list their results in Table 6 for reference. In Fig. 5 we illustrate the accuracy over the epochs of training.

Figure 4: Comparison of top-1 accuracy of our approach, fine-tuning and baseline on the EMNIST Letters test dataset. (Plot omitted; x-axis: number of epochs, 2 to 10; curves: Ours w/ expert teacher, Ours w/ semi-expert teacher, Ours w/ non-expert teacher, Fine-tune from non-expert, Baseline.)

Table 6: Our approach on the STL-10 dataset (fully supervised).

| Model | Test error (%) |
|---|---|
| Zhao et al. (2015) | 25.20 |
| Thoma (2017) | 21.34 |
| Baseline | 25.50 |
| Fine-tune from C10 | 14.32 |
| Fine-tune from C100 | 14.38 |
| Ours (C100) | 12.35 |
| Ours (C10, C100) | 11.09 |
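The reductions quoted for STL-10 can be checked directly against the entries of Table 6. The short Python sketch below does this arithmetic; the dictionary keys and the helper name `reduction` are illustrative choices, while the numbers are copied from the table.

```python
# Test errors (%) copied from Table 6.
errors = {
    "Baseline": 25.50,
    "Fine-tune from C10": 14.32,
    "Fine-tune from C100": 14.38,
    "Ours (C100)": 12.35,
    "Ours (C10, C100)": 11.09,
}

def reduction(before, after):
    """Absolute reduction in test error, in percentage points."""
    return round(errors[before] - errors[after], 2)

# Baseline versus fine-tuning from CIFAR-10.
fine_tune_gain = reduction("Baseline", "Fine-tune from C10")
# Best fine-tuning result versus our two-teacher student.
ours_gain = reduction("Fine-tune from C10", "Ours (C10, C100)")
# Same single teacher (CIFAR-100), fine-tuning versus our framework.
single_teacher_gain = reduction("Fine-tune from C100", "Ours (C100)")

print(fine_tune_gain, ours_gain, single_teacher_gain)
```

Read as absolute differences, fine-tuning improves on the baseline by a little over 11 points, and the two-teacher student improves on the best fine-tuned model by a little over 3 points.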

7.2 REINFORCEMENT LEARNING

We also compare to Distral (Teh et al., 2017), which is the state-of-the-art multi-task reinforcement learning framework. We used the KL+ent 1col variant, which has a central model (m0) and a task model (mi) for each task. We perform the experiments on Atari games. In the experiments, we have three tasks (task 1, task 2, task 3). The teachers of task 2 (m2) and task 3 (m3) are provided for our framework. Distral is trained for 120M steps (40M steps per task), and our model is trained for 40M steps. For a fair comparison, we report results of Distral's task 1 model (m1), which is better than its central model (m0). The results are summarized in Table 7. Distral is suboptimal because it aims to learn a multi-task agent. In addition, it assumes identical action and state spaces. When the target task is very different from the source tasks, Distral cannot decrease the teacher influence. In contrast, our framework can decrease a teacher's influence, and thus reduce negative transfer.

7.3 VISUALIZATION OF NORMALIZED WEIGHTS OF TEACHERS AND STUDENT

Following the reviewer's suggestion, we plot the averaged normalized weight (p_w) for teachers and the student in the C10 experiment, where C100 and SVHN experts are teachers. Intuitively, the C100 teacher should have a higher p_w value than the SVHN teacher, because C100 is more relevant to C10. The plot verifies this intuition. As shown in Fig. 7, p_w of the C100 teacher is higher than that of the SVHN teacher over the entire training. Note that both teachers' normalized weights approach zero at the end of training.
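To make the behavior of the normalized weights concrete, the sketch below normalizes a vector of raw weights, one for the student and one per teacher, so that the result is non-negative and sums to one. Using a softmax for the normalization is an assumption of this sketch, and the raw values are hypothetical; only the qualitative pattern (C100 above SVHN, both teachers near zero late in training) comes from the text.

```python
import numpy as np

def normalized_weights(raw_weights):
    """Map raw weights to non-negative weights that sum to one.

    Softmax normalization is an assumption made for this illustration.
    """
    w = np.asarray(raw_weights, dtype=float)
    e = np.exp(w - w.max())  # subtract max for numerical stability
    return e / e.sum()

# Index 0: student, 1: C100 teacher, 2: SVHN teacher (hypothetical raw values).
early = normalized_weights([0.0, 1.0, 0.2])   # teachers still influential
late = normalized_weights([6.0, 0.5, -0.5])   # student dominates

print(early, late)
```

When the student's weight dominates, the teachers' contributions become negligible, which is the condition under which the student can be used without its teachers.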


Figure 5: Comparison of top-1 accuracy of our approach, fine-tuning and baseline on the STL-10 test dataset. (Plot omitted; x-axis: number of epochs, 0 to 140; curves: Ours w/ CIFAR-10 and CIFAR-100, Ours w/ CIFAR-100, Fine-tune from CIFAR-100, Fine-tune from CIFAR-10, Baseline.)

Figure 6: Comparison with fine-tuning and baseline A3C on different combinations of environment/teacher settings. (Plots omitted; x-axis: number of steps, 0 to 4e7.) Panels and curves:

(a) Alien: Ours w/ Alien and Space Invaders; Ours w/ Bank Heist and Space Invaders; Fine-tune from Bank Heist; Baseline.
(b) Breakout: Ours w/ Breakout and Space Invaders; Ours w/ Pong and Space Invaders; Fine-tune from Space Invaders; Baseline.
(c) Chopper Command: Ours w/ Chopper Command and Space Invaders; Ours w/ Seaquest and Space Invaders; Fine-tune from Seaquest; Baseline.
(d) Kung Fu Master: Ours w/ Kung Fu Master and Seaquest; Ours w/ Seaquest and Hero; Fine-tune from Hero; Baseline.
(e) Ms Pacman: Ours w/ Ms Pacman and Alien; Ours w/ Alien and Space Invaders; Fine-tune from Alien; Baseline.
(f) Seaquest: Ours w/ Seaquest and Chopper Command; Ours w/ Chopper Command and Space Invaders; Fine-tune from Chopper Command; Baseline.

7.4 ABLATION STUDIES

7.4.1 UNTRAINED TEACHER MODELS

To verify that the student really benefits from the knowledge of teachers, we conduct an ablation study suggested by a reviewer. We use teacher models that have not been trained at all. Intuitively, learning with untrained teachers should perform worse than learning with knowledgeable teachers. Our experiments verify this intuition. In Fig. 8 (a), where the target task is Hero, learning with untrained teachers ("w/ untrained teacher") achieves an average reward of 15934. Learning with knowledgeable teachers ("w/ seaquest and riverraid teacher") achieves an average reward of 30928. More results are presented in Figs. 8 (b, c). The results show that knowledge flow achieves higher rewards than training with untrained teachers in different environments and teacher-student settings.


Table 7: Comparison with Distral on Task 1 score.

| Task 1, Task 2, Task 3 | Distral (Teh et al., 2017) | Ours |
|---|---|---|
| Kung Fu Master, Hero, Seaquest | 27433 | 35103 |
| Hero, Seaquest, Riverraid | 15096 | 30928 |
| James Bond, Seaquest, Riverraid | 550 | 1245 |

Figure 7: Normalized weights for the teachers and the student in C10 experiments. (Plot omitted; x-axis: epoch, 0 to 300; y-axis: normalized weight; curves: student, C100 teacher, SVHN teacher.)

7.4.2 TRAINING WITHOUT KL TERM

The KL term prevents the student's output distribution over actions or labels from changing drastically when the teachers' influence is decreasing. To investigate the importance of the KL term, we conduct an ablation study where the KL coefficient (λ2) is set to zero. The results are summarized in Fig. 9. Consider Fig. 9 (a), where the target task is Ms Pacman and the teachers are Riverraid and Seaquest experts. Without the KL term, when a teacher's influence decreases, the rewards drop drastically. In contrast, with the KL term, we do not observe performance drops. At the end of training, learning with the KL term achieves an average reward of 2907 and learning without the KL term achieves an average reward of 1215. More results are presented in Fig. 9 (b, c), which show that training with the KL term achieves higher reward than training without the KL term.
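As a rough illustration of how such a penalty behaves, the sketch below computes a KL divergence between a previous output distribution and the current one and scales it by the coefficient λ2. The direction of the divergence (previous relative to current), the function names, and the example distributions are assumptions of this sketch rather than a statement of the exact loss used in the experiments.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def kl_penalty(p_old, p_new, lambda2):
    """Penalty on the change of the output distribution, scaled by lambda2.

    Setting lambda2 = 0 corresponds to the ablation without the KL term.
    """
    return lambda2 * kl_divergence(p_old, p_new)

# Hypothetical action distributions before and after an update.
p_old = [0.7, 0.2, 0.1]
small_change = [0.65, 0.25, 0.10]
large_change = [0.10, 0.20, 0.70]

penalty_small = kl_penalty(p_old, small_change, lambda2=1.0)
penalty_large = kl_penalty(p_old, large_change, lambda2=1.0)
penalty_ablated = kl_penalty(p_old, large_change, lambda2=0.0)
```

A drastic change in the output distribution yields a much larger penalty than a small one, so with λ2 > 0 the optimizer is discouraged from abrupt policy changes; with λ2 = 0 no such pressure exists.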

7.5 TEACHERS WITH DIFFERENT ARCHITECTURE THAN STUDENT

In additional experiments, following the suggestion of a reviewer, we use architectures for the teacher which differ from the student model. More specifically, we use the model of Mnih et al. (2015) as the teacher model. The teacher model consists of 3 convolutional layers, which have 32, 64, and 64 filters, followed by a hidden fully connected layer which has 512 ReLUs. We use the model of Mnih et al. (2016) as the student model. The student model consists of 2 convolutional layers, which have 16 and 32 filters respectively, followed by a hidden fully connected layer which has 256 ReLUs. In both models, the fully connected layer is followed by two output layers for actions and values. In the experiments, we link each teacher's first convolutional layer to the student's first convolutional layer. Moreover, we link each teacher's third convolutional layer to the student's second convolutional layer, and each teacher's fully connected layer to the student's fully connected layer. In the experiment, the target task is Kung Fu Master, and the teachers are experts for Seaquest and Riverraid. The results are summarized in Fig. 10. We observe that learning with teachers whose architecture differs from the student's achieves similar performance to learning with teachers which have the same architecture. Consider as an example Fig. 10 (a), which shows this setting. At the end of training, learning with teachers of different architectures achieves an average reward of 37520, and learning with teachers of the same architecture achieves an average reward of 35012. More results are shown in Fig. 10 (b, c). The results show that knowledge flow can enable higher rewards even if the architectures of the teachers and the student differ.
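The layer links described above can be written down as a small lookup table, which also makes the width mismatch at each link explicit. The sketch below is only a bookkeeping illustration: the class and variable names are hypothetical, and how the mismatch in widths is bridged at each link is not specified in this section and is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str
    width: int  # number of filters (conv) or hidden units (fully connected)

# Layer widths as stated in the text.
TEACHER = [Layer("conv1", 32), Layer("conv2", 64), Layer("conv3", 64), Layer("fc", 512)]
STUDENT = [Layer("conv1", 16), Layer("conv2", 32), Layer("fc", 256)]

# Teacher layer -> student layer, following the links described in the text.
LINKS = {"conv1": "conv1", "conv3": "conv2", "fc": "fc"}

def link_shapes(teacher, student, links):
    """Return (teacher layer, teacher width, student layer, student width) per link."""
    t_width = {layer.name: layer.width for layer in teacher}
    s_width = {layer.name: layer.width for layer in student}
    return [(src, t_width[src], dst, s_width[dst]) for src, dst in links.items()]

for src, t_w, dst, s_w in link_shapes(TEACHER, STUDENT, LINKS):
    print(f"teacher {src} ({t_w}) -> student {dst} ({s_w})")
```

In this configuration every linked teacher layer is twice as wide as the student layer it feeds, and the teacher's second convolutional layer is not linked at all.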


Figure 8: Ablation study: using untrained teachers. (Plots omitted; x-axis: number of frames, 0 to 4e7.) Panels and curves: (a) Hero: w/ Seaquest and Riverraid teacher, w/ untrained teacher; (b) James Bond: w/ Seaquest and Riverraid teacher, w/ untrained teacher; (c) Kung Fu Master: w/ Hero teacher, w/ untrained teacher.

Figure 9: Ablation study regarding KL term. Seaquest and Riverraid experts are used as teachers for all experiments. (Plots omitted; x-axis: number of frames, 0 to 4e7; panels: (a) Ms Pacman, (b) Kung Fu Master, (c) Boxing; curves: w/ KL term, w/o KL term.)

7.6 AVERAGE NETWORK AS θold

For the parameters θold an average network can be used. To investigate how using an average network to obtain θold affects the performance, we conduct an experiment where θold is computed using an exponential running average of the model weights. More specifically, θold is updated as follows: $\theta_{\text{old}} \leftarrow \alpha\,\theta_{\text{old}} + (1 - \alpha)\,\theta$, where $\alpha = 0.9$. The results are summarized in Fig. 11. We observe that using an exponential average to compute θold results in very similar performance to using a single model. Consider Fig. 11 (a), where the target task is Boxing and the teacher is a Riverraid expert. At the end of training, using an average network to obtain θold achieves an average reward of 96.2 and using a single network to obtain θold achieves an average reward of 96.0. More results on using an average network are shown in Fig. 11 (b, c).
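The exponential running average described above amounts to a one-line update per parameter tensor. A minimal NumPy sketch follows; storing the parameters as a dictionary of arrays and the function name `update_theta_old` are assumptions made for illustration, while α = 0.9 is the value stated in the text.

```python
import numpy as np

def update_theta_old(theta_old, theta, alpha=0.9):
    """Return alpha * theta_old + (1 - alpha) * theta for every parameter tensor."""
    return {
        name: alpha * theta_old[name] + (1.0 - alpha) * theta[name]
        for name in theta_old
    }

# Hypothetical parameters with a single weight tensor.
theta_old = {"w": np.array([1.0, 2.0])}
theta = {"w": np.array([2.0, 0.0])}

theta_old = update_theta_old(theta_old, theta)
print(theta_old["w"])
```

With α = 0.9 the averaged parameters move only one tenth of the way toward the current weights at each update, so θold changes smoothly even when θ fluctuates.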

Figure 10: Teachers' architecture differs from the student's architecture. Seaquest and Riverraid experts are used as teachers for all experiments. (Plots omitted; x-axis: number of frames, 0 to 4e7; panels: (a) Kung Fu Master, (b) Boxing, (c) Gopher; curves: w/ same-architecture teacher model, w/ different-architecture teacher model.)

Figure 11: Average network to compute θold. Riverraid expert is used as teacher for all experiments. (Plots omitted; x-axis: number of frames, 0 to 4e7; panels: (a) Boxing, (b) Kung Fu Master, (c) Gopher; curves: w/o exponential average, w/ exponential average.)

8 RELATED WORK

As mentioned before, variants of knowledge transfer have been considered using a variety of techniques, for instance, fine-tuning, progressive neural nets (Rusu et al., 2016b), PathNet (Fernando et al., 2017), Growing a Brain (Wang et al., 2017), actor-mimic (Parisotto et al., 2016), and learning without forgetting (Li & Hoiem, 2016). Also related are techniques for transfer learning and lifelong learning. We discuss those methods and contrast them to our approach in the following.

PathNet (Fernando et al., 2017) enables multiple agents to train the same giant deep net while reusing parameters and avoiding catastrophic forgetting. To this end, agents embedded in the neural net discover which weights can be reused for new tasks and restrict application of gradients to those parameters. In contrast to this formulation, we consider the availability of multiple teacher nets, which are already trained.

Progressive Net (Rusu et al., 2016b) leverages transfer and avoids catastrophic forgetting by introducing lateral connections to previously learned features. Our method uses similar lateral connections. However, in contrast to Rusu et al. (2016b), we introduce scaling with normalized weights. This ensures independence of the student upon training, addressing a limitation of Rusu et al. (2016b), where only a fraction of the capacity of the student is eventually utilized.

Distral (Teh et al., 2017), a neologism combining "distill" and "transfer learning", considers joint training of multiple tasks. Multiple tasks share a distilled policy which encodes common behavior between different tasks. While each worker addresses its own task, a shared policy encourages consistency between the policies. Unlike Distral, which is a multi-task learning framework that addresses multiple tasks at the same time, knowledge flow addresses a single task. Common to multi-task learning and knowledge flow is a transfer of information. However, in multi-task learning, information extracted from different tasks is shared to boost performance, while in knowledge flow the information of multiple teachers is leveraged to help a student better learn a single, new, previously unseen task.

Knowledge distillation (Hinton et al., 2015) distills information from a larger deep net into a smaller one. It assumes both nets are trained on the same dataset. In contrast, our technique allows knowledge transfer between different source and target domains.

Actor-mimic (Parisotto et al., 2016) enables an agent to learn how to address multiple tasks simultaneously and generalize the extracted knowledge to new domains. A single policy net learns how to act in a set of tasks following the guidance of several expert teachers. A combination of feature regression and cross-entropy loss is used to encourage the student to produce similar actions and representations. Our proposed technique differs in that we take advantage of the teachers' representations at the beginning of training.

Learning without forgetting (Li & Hoiem, 2016) permits adding a new task to a deep net without forgetting the original capabilities. Importantly, only data from the new task is used, and the old capabilities are retained by first recording the old network's output on the new data. Similar techniques have been developed by Furlanello et al. (2016) and Jung et al. (2016). In contrast, we transfer knowledge from teacher networks more explicitly.


Growing a Brain (Wang et al., 2017) analyzes the parameters which change during ﬁne-tuning and points out that more natural model adaptation is obtained when increasing the model capacity, by either extending width or depth. Appropriate normalization is essential to signiﬁcantly outperform classical ﬁne-tuning. Since this technique is based on ﬁne-tuning, it differs from our student-teacher based approach.

Other related work includes policy distillation (Rusu et al., 2016a), domain adaptation (Pan & Yang, 2010; Long et al., 2015; Tzeng et al., 2015), and lifelong learning (Chen & Liu, 2016; Thrun, 1998; Mitchell et al., 2015; Ruvolo & Eaton, 2013).