# Decoupled Greedy Learning of CNNs

Eugene Belilovsky (MILA), Michael Eickenberg (Center for Computational Mathematics, Flatiron Institute), Edouard Oyallon (CNRS, LIP6)

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.

Abstract

A commonly cited inefficiency of neural network training by back-propagation is the update locking problem: each layer must wait for the signal to propagate through the full network before updating. Several alternatives that can alleviate this issue have been proposed. In this context, we consider a simpler, but more effective, substitute that uses minimal feedback, which we call Decoupled Greedy Learning (DGL). It is based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization in layers. With the use of a replay buffer, we show this approach can be extended to asynchronous settings, where modules can operate with possibly large communication delays. We show theoretically and empirically that this approach converges. Then, we empirically find that it can lead to better generalization than sequential greedy optimization. We demonstrate the effectiveness of DGL against alternative approaches on the CIFAR-10 dataset and on the large-scale ImageNet dataset.

1. Introduction

Jointly training all layers using back-propagation is the standard method for learning neural networks, including the computationally intensive Convolutional Neural Networks (CNNs). Due to the sequential nature of gradient processing, standard back-propagation has several well-known inefficiencies that prohibit parallelization of the computations of the different constituent modules. Jaderberg et al. (2017) characterize these in order of severity as the forward, update, and backward locking problems. Backward unlocking would permit updates of all modules once forward signals have propagated to all subsequent modules; update unlocking would permit updates of a module before a signal has reached all subsequent modules; and forward unlocking would permit a module to operate asynchronously from its predecessor and dependent modules. Methods addressing backward locking to a certain degree have been proposed in (Huo et al., 2018b;a; Choromanska et al., 2018; Nøkland, 2016). However, update locking is a far more severe inefficiency. Thus Jaderberg et al. (2017) and Czarnecki et al. (2017) propose and analyze Decoupled Neural Interfaces (DNI), a method that uses an auxiliary network to predict the gradient of the backward pass directly from the input. This method unfortunately does not scale well computationally or in terms of accuracy, especially in the case of CNNs (Huo et al., 2018a;b). Indeed, auxiliary networks must predict a weight gradient, which is usually high dimensional for larger models and input sizes. A major obstacle to update unlocking is the heavy reliance on the upper modules for feedback. Several works have recently revisited the classic (Ivakhnenko & Lapa, 1965; Bengio et al., 2007) approach of supervised greedy layerwise training of neural networks (Huang et al., 2018a; Marquez et al., 2018).
In Belilovsky et al. (2019) it is shown that such an approach, which relaxes the joint learning objective and does not require global feedback, can lead to high-performance deep CNNs on large-scale datasets. We will show that the greedy learning objective used in these papers can be solved with an alternative optimization algorithm, which permits decoupling the computations and achieves update unlocking. It can be augmented with replay buffers (Lin, 1992) to permit forward unlocking, a challenge not effectively addressed by any of the prior work. This simpler strategy can be shown to be a superior baseline for parallelizing the training across modules of a neural network.

The paper is structured as follows. In Sec. 2 we propose an optimization procedure for a decoupled greedy learning objective that achieves update unlocking, and then extend it to an asynchronous setting (async-DGL) using a replay buffer, addressing forward unlocking. In Sec. 3 we show that the proposed optimization procedure converges and recovers standard rates of non-convex optimization, motivating empirical observations in the subsequent experimental section. In Sec. 4 we show that DGL can outperform competing methods in terms of scalability to larger and deeper models, stability to optimization hyperparameters, and overall parallelism, allowing it to be applied to large datasets such as ImageNet. We extensively study async-DGL and find that it is robust to significant delays. We also empirically study the impact of parallelized training on convergence. Code for experiments is included in the submission.

Figure 1. Comparison of DNI (synthetic gradients), synchronous DGL, and asynchronous DGL. Note that in DGL, subsequent modules do not provide feedback, thus removing the dependencies of the auxiliary network on them. Asynchronous DGL additionally achieves forward unlocking.

2. Parallel Decoupled Greedy Learning

In this section we formally define the greedy objective and the parallel optimization that we study in both the synchronous and asynchronous settings. We mainly consider the online setting and assume a stream of samples or mini-batches denoted $S \triangleq \{(x_0^t, y^t)\}_{t \le T}$, run during $T$ iterations.

2.1. Preliminaries

For comparison purposes, we briefly review the update unlocking approach from DNI (Jaderberg et al., 2017). There, each network module has an associated auxiliary net which, given the output activation of the module, predicts the gradient signal from subsequent modules: the module can thus perform an update while modules above are still forward processing. The DNI auxiliary model is trained using true gradients provided by upper modules when they become available, which requires activation caching. This also means that the auxiliary module can become out of sync with the changing output activation distribution, often requiring slow learning rates. Due to this, and the high dimensionality of the predicted gradient (which scales with module size), this estimate is challenging. One may ask how well a method that entirely avoids the use of feedback from upper modules would fare given similarly-sized auxiliary networks.
We will show that adapting the objective in (Belilovsky et al., 2019; Bengio et al., 2007) can also allow for update unlocking and a degree of forward unlocking, with better properties.

Algorithm 1: Synchronous DGL
Input: Stream $S \triangleq \{(x_0^t, y^t)\}_{t \le T}$ of samples or mini-batches.
1: Initialize parameters $\{\theta_j, \gamma_j\}_{j \le J}$
2: for $(x_0^t, y^t) \in S$ do
3:   for $j \in \{1, \dots, J\}$ do
4:     $x_j^t \leftarrow f_{\theta_{j-1}}(x_{j-1}^t)$
5:     Compute $\nabla_{(\gamma_j, \theta_j)} \hat{L}(y^t, x_j^t; \gamma_j, \theta_j)$
6:     $(\theta_j, \gamma_j) \leftarrow$ Update parameters $(\theta_j, \gamma_j)$

Algorithm 2: Asynchronous DGL with Replay
Input: Stream $S \triangleq \{(x_0^t, y^t)\}_{t \le T}$; distribution of the delay $p = \{p(j)\}_j$; buffer size $M$.
1: Initialize: buffers $\{B_j\}_j$; parameters $\{\theta_j, \gamma_j\}_j$
2: while training do
3:   Sample $j$ in $\{1, \dots, J\}$ following $p$
4:   if $j = 1$ then
5:     $(x_0, y) \leftarrow S$
6:   else
7:     $(x_{j-1}, y) \leftarrow B_{j-1}$
8:   end if
9:   $x_j \leftarrow f_{\theta_{j-1}}(x_{j-1})$
10:  Compute $\nabla_{(\gamma_j, \theta_j)} \hat{L}(y, x_j; \gamma_j, \theta_j)$
11:  $(\theta_j, \gamma_j) \leftarrow$ Update parameters $(\theta_j, \gamma_j)$
12:  if $j < J$ then $B_j \leftarrow (x_j, y)$

2.2. Optimization for Greedy Objective

Let $X_0$ and $Y$ be the data and labels, and let $X_j$ be the output representation of module $j$. We denote the per-module objective function $\hat{L}(X_j, Y; \theta_j, \gamma_j)$, where the parameters $\theta_j$ correspond to the module parameters (i.e. $X_{j+1} = f_{\theta_j}(X_j)$). Here $\gamma_j$ represents the parameters of an auxiliary network used to predict the final target and compute the local objective. In our case, $\hat{L}$ will be the empirical risk with a cross-entropy loss. The greedy training objective is thus given recursively by defining $P_j$:

$$\min_{\theta_j, \gamma_j} \hat{L}(X_j, Y; \theta_j, \gamma_j), \qquad (P_j)$$

where $X_j = f_{\theta_{j-1}^*}(X_{j-1})$ and $\theta_{j-1}^*$ is the minimizer of Problem $(P_{j-1})$. A natural way to solve the optimization problem for $J$ modules, $(P_J)$, is thus to sequentially solve the problems $\{P_j\}_{j \le J}$ starting with $j = 1$. This is the approach taken in e.g. Marquez et al. (2018); Huang et al. (2018a); Bengio et al. (2007); Belilovsky et al. (2019). Here we consider an alternative procedure for optimizing the same objective, which we refer to as Sync-DGL. It is outlined in Alg. 1. In Sync-DGL, individual updates of each set of parameters are performed in parallel across the different layers. Each layer processes a sample or mini-batch, then passes it to the next layer, while simultaneously performing an update based on its own local loss. Note that while line 5 is being executed, the subsequent layer can already begin computing line 4: once $x_j^t$ has been computed, subsequent layers can begin processing. Therefore, this algorithm achieves update unlocking. Sync-DGL can also be seen as a generalization of the biologically plausible learning method proposed in concurrent work (Nøkland & Eidnes, 2019). Appendix D also gives an explicit multi-worker version of this pseudo-code. Fig. 1 illustrates the decoupling compared to how samples are processed in the DNI algorithm. In this work we solve the sub-problems $P_j$ by back-propagation, but we note that any iterative solver available for $P_j$ is applicable (e.g. (Choromanska et al., 2018)). Finally, we emphasize that unlike the sequential solvers of e.g. Bengio et al. (2007); Belilovsky et al. (2019), the distribution of inputs to each sub-problem solver changes over time, resulting in a learning dynamic whose properties have never been studied nor contrasted with sequential solvers.
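To fix ideas, the following is a minimal single-process sketch of the per-sample loop of Alg. 1, assuming a PyTorch-style setup in which each module is paired with an auxiliary classifier and its own optimizer over $(\theta_j, \gamma_j)$; the function and variable names are illustrative rather than taken from the released code.

```python
import torch.nn.functional as F

def sync_dgl_step(x0, y, modules, auxiliaries, optimizers):
    """One iteration of a sketch of Alg. 1: every module is updated from its
    own local loss; no gradient ever crosses a module boundary."""
    x = x0
    for module, aux, opt in zip(modules, auxiliaries, optimizers):
        h = module(x)                      # forward through module j
        loss = F.cross_entropy(aux(h), y)  # local loss via the auxiliary head
        opt.zero_grad()
        loss.backward()                    # gradients stay within (theta_j, gamma_j)
        opt.step()
        x = h.detach()                     # hand the activation forward, blocking gradient flow
    return x
```

The `detach()` call is what makes the training decoupled: each module backpropagates only through itself and its auxiliary head, so in a multi-worker implementation every module can update as soon as it has received its input activation.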
2.3. Asynchronous DGL with Replay

We can now extend this framework to address forward unlocking (Jaderberg et al., 2017). DGL modules already do not depend on their successors for updates. We can further reduce the dependency on previous modules so that modules can operate asynchronously. This is achieved via a replay buffer shared between adjacent modules, which enables them to reuse older samples. Scenarios with communication delays or substantial variations in speed between layers/modules benefit from this. We study one instance of such an algorithm, which uses a replay buffer of size $M$; it is shown in Alg. 2 and illustrated in Fig. 1.

Our minimal distributed setting is as follows. Each worker $j$ has a buffer that it writes to and that worker $j+1$ can read from. The buffer uses a simple read/write protocol: a buffer $B_j$ lets layer $j$ write new samples and, once it reaches capacity, overwrites the oldest sample; layer $j+1$ requests samples from the buffer $B_j$, which are selected by a last-in-first-out (LIFO) rule with precedence given to the least reused samples. Alg. 2 simulates potential delays in such a setup through a probability mass function (pmf) $p(j)$ over workers, analogous to typical asynchronous settings such as (Leblond et al., 2017). At each iteration, a layer is chosen at random according to $p(j)$ to perform a computation. In our experiments we limit ourselves to pmfs that are uniform over workers except for a single layer, which is selected less frequently on average. Even in the case of a uniform pmf, asynchronous behavior naturally arises, requiring the reuse of samples. Alg. 2 permits a controlled simulation of processing-speed discrepancies and will be used over settings of $p$ and $M$ to demonstrate that training and testing accuracy remain robust in practical regimes. Appendix D also provides pseudo-code for implementation in a parallel environment.

Unlike common data-parallel asynchronous algorithms (Zhang et al., 2015), asynchronous DGL does not rely on a master node and requires only local communication, similar to recent decentralized schemes (Lian et al., 2017). Contrary to decentralized SGD, DGL nodes only need to maintain and update the parameters of their local module, permitting much larger modules. Combining asynchronous DGL with distributed synchronous SGD for sub-problem optimization is a promising direction; for example, it can alleviate a common issue of the popular distributed synchronous SGD in deep CNNs, namely the often limiting maximum batch size (Goyal et al., 2017).
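To make the read/write protocol above concrete, here is a minimal sketch of a buffer $B_j$ under our reading of the rules (overwrite the oldest sample on write; read among the least reused samples, breaking ties last-in-first-out); the class and method names are illustrative, and synchronization between workers is omitted.

```python
class ReplayBuffer:
    """Sketch of the buffer B_j shared by worker j (writer) and worker j+1 (reader)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []          # each entry: [activation, label, reuse_count]

    def write(self, x, y):
        # Writer appends; once at capacity, the oldest sample is overwritten.
        if len(self.items) >= self.capacity:
            self.items.pop(0)    # drop the oldest entry
        self.items.append([x, y, 0])

    def read(self):
        # Reader picks among the least-reused samples, breaking ties LIFO
        # (most recently written first), then increments the reuse count.
        assert self.items, "buffer is empty"
        min_reuse = min(item[2] for item in self.items)
        candidates = [i for i, item in enumerate(self.items) if item[2] == min_reuse]
        idx = candidates[-1]     # last-in-first-out among the least reused
        self.items[idx][2] += 1
        x, y, _ = self.items[idx]
        return x, y
```

In an actual multi-worker implementation, reads and writes would additionally need a lock or a lock-free queue; the sketch only fixes the selection rule used in Alg. 2.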
2.4. Auxiliary and Primary Network Design

Like DNI, our procedure relies on an auxiliary network to obtain an update signal; both methods thus require auxiliary network design in addition to the main CNN architecture. Belilovsky et al. (2019) have shown that spatial averaging operations can be used to construct a scalable auxiliary network for the same objective as used in Sec. 2.2. However, they did not directly consider the parallel training use case, where additional care must be taken in the design: the primary consideration is the relative speed of the auxiliary network with respect to its associated main network module. We primarily use FLOP count in our analysis and aim to restrict our auxiliary networks to less than 5% of the FLOP count of the main network. Although auxiliary network design might seem like an additional layer of complexity in CNN design and may require invoking slightly different architecture principles, this is not inherently prohibitive, since architecture design is often related to training (e.g., the use of residuals is originally motivated by optimization issues inherent to end-to-end backprop (He et al., 2016)).

Finally, we note that although we focus on the distributed learning context, this algorithm and the associated theory for greedy objectives are generic and have other potential applications. For example, greedy objectives have recently been used in (Haarnoja et al., 2018; Huang et al., 2018a), and even with a single worker DGL reduces memory.

3. Theoretical Analysis

We now study the convergence of DGL. Since we do not rely on any approximated gradients, we can derive stronger properties than DNI (Czarnecki et al., 2017), such as a rate of convergence in our non-convex setting. To do so, we analyze Alg. 1 when the update steps are obtained from stochastic gradient methods, and we show convergence guarantees (Bottou et al., 2018) under reasonable assumptions. In standard stochastic optimization schemes, the input distribution fed to a model is fixed (Bottou et al., 2018). In this work, the input distribution to each module is time-varying and depends on the convergence of the previous module. At time step $t$, for simplicity we denote all parameters of a module (including the auxiliary) as $\Theta_j^t \triangleq (\theta_j^t, \gamma_j^t)$, and samples as $Z_j^t \triangleq (X_j^t, Y^t)$, which follow the density $p_j^t(z)$. For each auxiliary problem, we aim to prove the strongest existing guarantees (Bottou et al., 2018; Huo et al., 2018a) for the non-convex setting despite time-varying input distributions from prior modules. Proofs are given in the Appendix.

Let us fix a depth $j$ with $j > 1$ and consider the converged density of the previous layer, $p_{j-1}^*(z)$. We define the following distance:

$$c_{j-1}^t \triangleq \int \big|p_{j-1}^t(z) - p_{j-1}^*(z)\big| \, dz.$$

Denoting by $\ell$ the composition of the non-negative loss function and the network, we study the expected risk $L(\Theta_j) \triangleq \mathbb{E}_{p_{j-1}^*}[\ell(Z_{j-1}; \Theta_j)]$. We now state several standard assumptions.

Assumption 1 (L-smoothness). $L$ is differentiable and its gradient is $L$-Lipschitz.

We consider the SGD scheme with learning rates $\{\eta_t\}_t$:

$$\Theta_j^{t+1} = \Theta_j^t - \eta_t \nabla_{\Theta_j} \ell(Z_{j-1}^t; \Theta_j^t), \qquad (1)$$

where $Z_{j-1}^t \sim p_{j-1}^t$.

Assumption 2 (Robbins-Monro conditions). The step sizes satisfy $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$.

Assumption 3 (Finite variance). There exists $G > 0$ such that for all $t$ and $\Theta_j$, $\mathbb{E}_{p_{j-1}^t}\big[\|\nabla_{\Theta_j} \ell(Z_{j-1}; \Theta_j)\|^2\big] \le G$.

Assumptions 1, 2 and 3 are standard (Bottou et al., 2018; Huo et al., 2018a), and we show in the following that our proof of convergence leads to similar rates, up to a multiplicative constant. The following assumption is specific to our setting, where we consider a time-varying distribution:

Assumption 4 (Convergence of the previous layer). We assume that $\sum_t c_{j-1}^t < \infty$.

Lemma 3.1. Under Assumptions 3 and 4, for all $\Theta_j$ one has $\mathbb{E}_{p_{j-1}^*}\big[\|\nabla_{\Theta_j} \ell(Z_{j-1}; \Theta_j)\|^2\big] \le G$.

We are now ready to prove the core statement for the convergence results in this setting:

Lemma 3.2. Under Assumptions 1, 3 and 4, we have:

$$\mathbb{E}[L(\Theta_j^{t+1})] \le \mathbb{E}[L(\Theta_j^t)] + G \eta_t \big(2 c_{j-1}^t + L \eta_t\big) - \eta_t\, \mathbb{E}\big[\|\nabla L(\Theta_j^t)\|^2\big],$$

where the expectation is taken over each random variable. Note that without the temporal dependency (i.e. $c_{j-1}^t = 0$), this becomes analogous to Lemma 4.4 in (Bottou et al., 2018).
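For intuition, the core of such results is the standard descent-lemma step below (a sketch under Assumption 1, not a reproduction of the appendix proof): expanding smoothness around the SGD update (1) and then splitting the cross term into the gradient under the converged density $p_{j-1}^*$ plus a bias controlled by $c_{j-1}^t$ is what produces the extra $G c_{j-1}^t \eta_t$ contribution in Lemma 3.2.

```latex
% Smoothness (Assumption 1) applied to the SGD step (1):
\begin{align*}
L(\Theta_j^{t+1})
 &\le L(\Theta_j^t)
   + \big\langle \nabla L(\Theta_j^t),\, \Theta_j^{t+1} - \Theta_j^t \big\rangle
   + \tfrac{L}{2}\,\big\|\Theta_j^{t+1} - \Theta_j^t\big\|^2 \\
 &=  L(\Theta_j^t)
   - \eta_t \big\langle \nabla L(\Theta_j^t),\, \nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t) \big\rangle
   + \tfrac{L\eta_t^2}{2}\,\big\|\nabla_{\Theta_j}\ell(Z_{j-1}^t;\Theta_j^t)\big\|^2 .
\end{align*}
% Taking expectations, the last term is bounded by L G \eta_t^2 / 2 (Assumption 3),
% and the inner product differs from \|\nabla L(\Theta_j^t)\|^2 only through the
% mismatch between p_{j-1}^t and p_{j-1}^*, which Assumption 4 controls via c_{j-1}^t.
```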
It naturally follows that:

Proposition 3.1. Under Assumptions 1, 2, 3 and 4, each term of the following inequality converges:

$$\sum_{t=0}^{\infty} \eta_t\, \mathbb{E}\big[\|\nabla L(\Theta_j^t)\|^2\big] \;\le\; \mathbb{E}[L(\Theta_j^0)] + G \sum_{t=0}^{\infty} \eta_t \big(2 c_{j-1}^t + L \eta_t\big).$$

Thus the DGL scheme converges in the sense of (Bottou et al., 2018; Huo et al., 2018a). We can also obtain the following rate:

Corollary 3.1. The sequence of expected gradient norms accumulates around 0 at the following rate:

$$\inf_{t \le T} \mathbb{E}\big[\|\nabla L(\Theta_j^t)\|^2\big] \;\le\; O\!\left(\frac{1 + \sum_{t=0}^{T} \big(c_{j-1}^t \eta_t + \eta_t^2\big)}{\sum_{t=0}^{T} \eta_t}\right).$$

Thus, compared to the sequential case, the parallel setting adds a delay term controlled by the quantities $c_{j-1}^t$, i.e. by the convergence of the previous module.

4. Experiments

We conduct experiments that empirically show that DGL optimizes the greedy objective well, showing that it compares favorably against recent state-of-the-art proposals for decoupling the training of deep network modules. We show that, unlike previous decoupled proposals, it still works on a large-scale dataset (ImageNet) and that it can, in some cases, generalize better than standard back-propagation. We then extensively evaluate asynchronous DGL, simulating large delays. For all experiments we use architectures taken from prior works and standard optimization settings.

4.1. Other Approaches and Auxiliary Network Designs

This section presents experiments evaluating DGL on the CIFAR-10 dataset (Krizhevsky, 2009) with standard data augmentation. We first use a setup that permits us to compare against the DNI method and which also highlights the generality and scalability of DGL. We then consider the design of a more efficient auxiliary network, which will help to scale to the ImageNet dataset. We also show that DGL is effective at optimizing the greedy objective compared to a naive sequential algorithm.

Figure 2. Comparison of DNI, cDNI, and DGL in terms of training loss and test accuracy on CIFAR-10 for the experiment from (Jaderberg et al., 2017). DGL converges better than cDNI and DNI with the same auxiliary net, and generalizes better than backprop.

Comparison to DNI
We reproduce the CIFAR-10 CNN experiment described in (Jaderberg et al., 2017), Appendix C.1. This experiment utilizes a 3-layer network with auxiliary networks of 2 hidden CNN layers. We compare our reproduction to the DGL approach. For DGL, instead of the final synthetic gradient prediction we apply a final projection to the target prediction space. Here, we follow the prescribed optimization procedure from (Jaderberg et al., 2017), using Adam with a learning rate of $3 \times 10^{-5}$. We run training for 1500 epochs and compare standard backprop, DNI, context DNI (cDNI) (Jaderberg et al., 2017), and DGL. Results are shown in Fig. 2; details are included in the Appendix. The DGL method outperforms DNI and cDNI by a substantial amount, both in test accuracy and training loss. Also, in this setting DGL generalizes better than standard backprop and obtains a close final training loss. We also attempted DNI with the more commonly used optimization settings for CNNs (SGD with momentum and step decay), but found that DNI would diverge when larger learning rates were used, although DGL sub-problem optimization worked effectively with common CNN optimization strategies. We also note that the prescribed experiment uses a setting where the scalability of our method is not fully exploited. Each layer of the primary network of (Jaderberg et al., 2017) has a pooling operation, which permits the auxiliary network to be small for synthetic gradient prediction. This however severely restricts the architecture choices in the primary network to using a pooling operation at each layer.
In DGL, we can apply the pooling operations in the auxiliary network, thus permitting the auxiliary network to be negligible in cost even for layers without pooling (whereas synthetic gradient predictions often have to be as costly as the base network). Overall, DGL is more scalable, accurate, and robust to changes in optimization hyper-parameters than DNI.

| Auxiliary net | Relative FLOPS | Acc. |
|---|---|---|
| CNN-aux | 200% | 92.2 |
| MLP-aux | 0.7% | 90.6 |
| MLP-SR-aux | 4.0% | 91.2 |

Table 1. Comparison of auxiliary networks on CIFAR-10. CNN-aux, as applied in previous work, is inefficient w.r.t. the primary module. We report the FLOP count of the auxiliary net relative to the largest module. MLP-aux and MLP-SR-aux, applied after spatial averaging operations, are far more efficient with minimal accuracy loss.

Auxiliary Network Design
We consider different auxiliary networks for CNNs. As a baseline we use convolutional auxiliary layers as in (Jaderberg et al., 2017) and (Belilovsky et al., 2019). For the distributed training application this approach is sub-optimal, as the auxiliary network can be substantial compared to the base network, leading to poorer parallelization gains. We note, however, that even in those cases (which we do not study here) where the auxiliary network computation is potentially on the order of the primary network, it can still give advantages for parallelization for very deep networks and many available workers. The primary network architecture we use for this study is a simple CNN similar to VGG family models (Simonyan & Zisserman, 2014) and those used in (Belilovsky et al., 2019). It consists of 6 convolutions of size 3×3 with batchnorm and shape-preserving padding, and 2×2 max-pooling at layers 1 and 3. The width of the first layer is 128 and is doubled at each downsampling operation. For all experiments, the final layer does not have an auxiliary model; it is followed by a pooling operation and a 2-hidden-layer fully connected network.

Two alternatives to the CNN auxiliary of (Belilovsky et al., 2019) are explored (Tab. 1). The baseline auxiliary strategy based on (Belilovsky et al., 2019) and (Jaderberg et al., 2017) applies 2 CNN layers followed by a 2×2 averaging and projection, denoted CNN-aux. First, we explore a direct application of the spatial averaging to a 2×2 output shape (regardless of the resolution) followed by a 3-layer MLP of constant width. This is denoted MLP-aux and drastically reduces the FLOP count with minimal accuracy loss compared to CNN-aux. Finally, we study a staged spatial resolution reduction: first reducing the spatial resolution by 4× (and the total size by 16×), then applying three 1×1 convolutions followed by a reduction to 2×2 and a 3-layer MLP, which we denote MLP-SR-aux. These latter two strategies, which leverage spatial averaging, produce auxiliary networks that are less than 5% of the FLOP count of the primary network even for large spatial resolutions as in real-world image datasets. We will show that MLP-SR-aux is still effective even for the large-scale ImageNet dataset. We note that these more efficient auxiliary models are not easily applicable in the case of DNI's gradient prediction.
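The description above fixes the overall shape of MLP-SR-aux but not every hyper-parameter; the following PyTorch-style sketch is therefore only an illustration of the design (average pooling to reduce resolution by 4×, 1×1 convolutions, pooling to 2×2, then a small MLP), with illustrative names and widths rather than the exact released configuration.

```python
import torch.nn as nn

class MLPSRAux(nn.Module):
    """Sketch of an MLP-SR-aux style auxiliary head for a module with
    `channels` output channels and `classes` target classes."""

    def __init__(self, channels, classes, hidden=1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.AvgPool2d(4),                 # reduce spatial resolution by 4x (total size by 16x)
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(2),         # reduce to a 2x2 spatial map
            nn.Flatten(),
            nn.Linear(4 * channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, classes),      # 3-layer MLP classifier
        )

    def forward(self, x):
        return self.head(x)
```

A module's local loss is then simply the cross-entropy between this head's output and the labels, exactly as in the sketch of Alg. 1 given earlier.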
Sequential vs. Parallel Optimization of Greedy Objective
We briefly compare the sequential optimization of the greedy objective (Belilovsky et al., 2019; Bengio et al., 2007) to DGL (Alg. 1). We use a 6-layer CIFAR-10 network with an MLP-SR-aux auxiliary model. In the parallel case we train the layers together for 50 epochs, and in the sequential case we train each layer for 50 epochs before moving to the subsequent one. Thus the difference to DGL lies only in the input received at each layer (fully converged previous layer versus not fully converged previous layer); the rest of the optimization settings are identical. Fig. 3 shows the learning curves for sequential training and DGL at layer 4 (layer 1 is the same for both, as its input representation does not vary over the training period). DGL quickly catches up with the sequential training scheme and appears to sometimes generalize better. Like Oyallon (2017), we also visualize the per-layer training dynamics in Fig. 4, which shows that after just a few epochs the individual layers build a dynamic of progressive improvement with depth.

Figure 3. Comparison of sequential and parallel training. Parallel training catches up rapidly to sequential training.

Figure 4. Per-layer loss on CIFAR-10: after a few epochs, the layers build a dynamic of progressive improvement with depth.

Multi-Layer Modules
We have so far mainly considered the setting of layer-wise decoupling. This approach, however, can easily be applied to generic modules. Indeed, approaches such as DNI (Jaderberg et al., 2017) often consider decoupling entire multi-layer modules. Furthermore, the propositions for backward unlocking (Huo et al., 2018b;a) report that they can often only decouple 100-layer networks into 2 or 4 blocks before observing optimization issues or performance losses, and they require that the number of parallel modules be much lower than the network depth for their theoretical guarantees to hold. As in those cases, using multi-layer decoupled modules can improve performance and is natural for deeper networks. We now use such a multi-layer approach to directly compare to the backward unlocking of (Huo et al., 2018b), and subsequently we apply it to deep networks for ImageNet. From here on, we denote by K the number of modules a network is split into.

| | Backprop | DDG | DGL |
|---|---|---|---|
| Acc. | 93.53 | 93.41 | 93.5 ± 0.1 |

Table 2. ResNet-110 (K = 2) for backprop and the DDG method from (Huo et al., 2018b). DGL is run for 3 trials to compute variance. All methods give the same accuracy, with DGL being update unlocked and DDG only backward unlocked. DNI is reported not to work in this setting (Huo et al., 2018b).

Comparison to DDG
Huo et al. (2018b) propose a solution to backward locking (less efficient than solving update locking; see the discussion in Sec. 5). We show that even in this situation the DGL method provides a strong baseline for work on backward unlocking. We take the experimental setup from (Huo et al., 2018b), which considers a ResNet-110 parallelized into K = 2 blocks. We use the auxiliary network MLP-SR-aux, which has less than 0.1% of the FLOP count of the primary network. We use the exact optimization settings and network split points as in (Huo et al., 2018b). To assess variance in CIFAR-10 accuracy, we perform 3 trials. Tab. 2 shows that the accuracy is the same across the DDG method, backprop, and our approach. DGL achieves better parallelization because it is update unlocked. We use the parallel implementation provided by (Huo et al., 2018b) to obtain a direct wall-clock time comparison. We note that there are multiple considerations for comparing speed across these methods (see Appendix C).
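As an illustration of the K-module setting, the sketch below (assuming PyTorch; the helper name `split_into_modules` and the even-split rule are ours, whereas the experiments use the exact split points of (Huo et al., 2018b)) chops an ordered list of layers into K decoupled blocks and attaches an auxiliary head to every block except the last, which keeps the network's own classifier.

```python
import torch.nn as nn

def split_into_modules(layers, K, aux_factory):
    """Split an ordered list of layers into K blocks for decoupled training.

    `aux_factory(block)` should return an auxiliary head for a block
    (e.g. an MLP-SR-aux style classifier); the last block is assumed to end
    in the network's own classifier, so it receives no auxiliary head.
    """
    size = (len(layers) + K - 1) // K          # ceiling split into K chunks
    blocks, heads = [], []
    for k in range(K):
        chunk = layers[k * size:(k + 1) * size]
        block = nn.Sequential(*chunk)
        blocks.append(block)
        heads.append(aux_factory(block) if k < K - 1 else None)
    return blocks, heads
```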
Wall Time Comparison
We compare to the parallel implementation of (Huo et al., 2018b) using the same communication protocols, run on the same hardware. We find that for K = 2 and K = 4 GPUs, DGL gives a 5% and 18% speedup over DDG, respectively, with DDG at K = 4 giving approximately a 2.3× speedup over standard backprop on the same hardware (close to the results reported by Huo et al. (2018b)).

4.2. Large-scale Experiments

Existing methods that consider update or backward locking have not been evaluated on large image datasets, as they are often unstable or already show large losses in accuracy on smaller datasets. Here we study the optimization of several well-known architectures, mainly the VGG family (Simonyan & Zisserman, 2014) and ResNets (He et al., 2016), with DGL on the ImageNet dataset. In all our experiments we use the MLP-SR-aux auxiliary net, which scales well from the smaller CIFAR-10 to the larger ImageNet; the final module has no auxiliary network. For all optimization of auxiliary problems, and for the end-to-end optimization of reference models, we use the shortened optimization schedule prescribed in (Xiao et al., 2019). Results are shown in Tab. 3. We see that for all models, DGL performs as well as, and sometimes better than, the end-to-end trained models, while permitting parallel training. In all these cases the auxiliary networks are negligible (see Appendix Table 4 for more details). For the VGG-13 architecture we also evaluate the case where the model is trained layer by layer (K = 10). Although performance is slightly degraded here, we find it is surprisingly high given that no backward communication is performed. We conjecture that improved auxiliary models, and combinations with methods such as (Huo et al., 2018a) that allow feedback on top of the local model, may further improve performance. Also, for settings with larger potential parallelization, slower but more performant auxiliary models could be considered as well. Synchronous DGL also has favorable memory usage compared to DDG and to the DNI method: DNI requires storing larger activations, and DDG has high memory usage compared to the base network even for few splits (Huo et al., 2018a). Although not our focus, the single-worker version of DGL has favorable memory usage compared to standard backprop training. For example, the ResNet-152 DGL K = 2 setting can fit 38% more samples on a single 16GB GPU than standard end-to-end training.

4.3. Asynchronous DGL with Replay

We now study the effectiveness of Alg. 2 with respect to delays. We use a 5-layer CIFAR-10 network with the MLP-aux auxiliary model, with all other architecture and optimization settings as in the auxiliary network experiments of Sec. 4.1. Each layer is equipped with a buffer of size M. At each iteration, a layer is chosen according to the pmf p(j), and a batch is selected from buffer B_{j-1}. One layer is slowed down by decreasing its selection probability in the pmf p(j) by a factor S. We evaluate different slowdown factors (up to S = 2.0). Accuracy versus S is shown in Fig. 5. For this experiment we use a buffer of size M = 50. We run separate experiments with the slowdown applied at each layer of the network, with 3 random seeds for each of these settings (thus 18 experiments per data point), and show evaluations for 10 values of S. To ensure a fair comparison, we stop updating layers once they have completed 50 epochs, ensuring an identical number of gradient updates for all layers in all experiments; in practice, one could continue updating until all layers are trained. In Fig. 5 we compare to the synchronous case.
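The text does not spell out exactly how the slowdown factor S is turned into the pmf p(j); one natural reading, assumed in the sketch below, is to divide the uniform probability of the slowed layer by S and renormalize.

```python
import numpy as np

def slowdown_pmf(num_layers, slow_layer, S):
    """One possible construction of p(j): uniform over layers, except that the
    selection probability of `slow_layer` is divided by the slowdown factor S,
    after which the vector is renormalized."""
    p = np.ones(num_layers)
    p[slow_layer] /= S
    return p / p.sum()

# Example: 5 layers, layer index 2 slowed by S = 2.0; at each iteration the
# active layer would be drawn as
#   j = np.random.choice(5, p=slowdown_pmf(5, 2, 2.0))
```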
First, observe that the accuracy of the synchronous algorithm is maintained in the setting where S = 1.0 and the pmf is uniform. Even this is a non-trivial case, since it means that layers inherently experience random delays (as compared to Alg. 1). Secondly, observe that accuracy is fully maintained until approximately S = 1.2, and that for larger slowdowns the accuracy loss remains small. Note that even the case S = 2.0 is somewhat drastic: after 50 epochs, the slowed-down layer is only on epoch 25 while the layers following it are at epoch 50.

We now consider performance with respect to the buffer size. Results are shown in Fig. 6. For this experiment we set S = 1.2. Observe that even a tiny buffer yields only a slight loss in accuracy. Building on this demonstration, there are multiple directions to improve async-DGL with replay: for example, improving the efficiency of the buffer (Oyallon et al., 2018), including data augmentation in feature space (Verma et al., 2018), mixing samples in batches, or improved batch sampling, among others.

Figure 5. Evaluation of async-DGL. A single layer is slowed down on average relative to the others, with negligible losses of accuracy even at substantial delays.

| Model (training method) | Top-1 | Top-5 |
|---|---|---|
| VGG-13 (DGL per layer, K = 10) | 64.4 | 85.8 |
| VGG-13 (DGL, K = 4) | 67.8 | 88.0 |
| VGG-13 (backprop) | 66.6 | 87.5 |
| VGG-19 (DGL, K = 4) | 69.2 | 89.0 |
| VGG-19 (DGL, K = 2) | 70.8 | 90.2 |
| VGG-19 (backprop) | 69.7 | 89.7 |
| ResNet-152 (DGL, K = 2) | 74.5 | 92.0 |
| ResNet-152 (backprop) | 74.4 | 92.1 |

Table 3. ImageNet results using the training schedule of (Xiao et al., 2019) for DGL and standard end-to-end backprop. DGL with VGG and ResNet obtains similar or better accuracies, while enabling parallelization and reduced memory.

Figure 6. Buffer size vs. accuracy for async-DGL. Smaller buffers produce only a small loss in accuracy.

5. Related Work

To the best of our knowledge, (Jaderberg et al., 2017) is the first work to directly consider the update or forward locking problems in deep feed-forward networks. Other works (Huo et al., 2018a;b) study the backward locking problem. Furthermore, a number of back-propagation alternatives (Choromanska et al., 2018; Lee et al., 2014; Nøkland, 2016) can address backward locking. However, update locking is a more severe inefficiency. Consider the case where each layer's forward processing time is T_F and is equal across a network of L layers. Given that the backward pass is a constant multiple in time of the forward pass, in the most ideal case backward unlocking will still only scale as O(L T_F) with L parallel nodes, while update unlocking could scale as O(T_F).

One class of alternatives to standard back-propagation aims to avoid its biologically implausible aspects, most notably the weight transport problem (Bartunov et al., 2018; Nøkland, 2016; Lillicrap et al., 2014; Lee et al., 2014; Ororbia et al., 2018; Ororbia & Mali, 2019). Some of these methods (Lee et al., 2014; Nøkland, 2016) can also achieve backward unlocking, as they permit all parameters to be updated at the same time, but only once the signal has propagated to the top layer. However, they do not solve the update or forward locking problems. Target propagation uses a local auxiliary network, as in our approach, to propagate backward optimal activations computed from the layer above. Feedback alignment replaces the symmetric weights of the backward pass with random weights.
Direct feedback alignment extends the idea of feedback alignment by passing errors from the top layer to all layers, potentially enabling simultaneous updates. These approaches have not been shown to scale to large datasets (Bartunov et al., 2018), obtaining only 17.5% top-5 accuracy on ImageNet (with the reference model achieving 59.8%). On the other hand, greedy learning has been shown to work well on this task (Belilovsky et al., 2019). We also note concurrent work in the context of biologically plausible models by Nøkland & Eidnes (2019), which improves on results from (Mostafa et al., 2018) and shows an approach similar to a specific instantiation of the synchronous version of DGL. This work, however, does not consider the applications to unlocking nor asynchronous training, and cannot currently scale to ImageNet.

Another line of related work, inspired by optimization methods such as the Alternating Direction Method of Multipliers (ADMM) (Taylor et al., 2016; Carreira-Perpinan & Wang, 2014; Choromanska et al., 2018), uses auxiliary variables to break the optimization into sub-problems. These approaches are fundamentally different from ours, as they optimize the joint training objective, with the auxiliary variables providing a link between a layer and its successive layers, whereas we consider a different objective in which a layer has no dependence on its successors. None of these methods can achieve update or forward unlocking, although some (Choromanska et al., 2018) are able to perform simultaneous weight updates (backward unlocking). Another issue with ADMM methods is that most existing approaches, except for (Choromanska et al., 2018), require standard ("batch") gradient descent and are thus difficult to scale. They also often involve an inner minimization problem and have thus not been demonstrated to work on large-scale datasets. Furthermore, none of these have been combined with CNNs.

Distributed optimization based on data parallelism is a popular area in machine learning beyond deep learning models and is often studied in the convex setting (Leblond et al., 2018). For deep network optimization, the predominant method is distributed synchronous SGD (Goyal et al., 2017) and its variants, as well as asynchronous variants (Zhang et al., 2015). Our work is closer to a form of model parallelism rather than data parallelism, and can easily be combined with many data-parallel methods (e.g. distributed synchronous SGD). Finally, recent proposals for pipelining (Huang et al., 2018b) consider systems-level approaches to optimize latency. These methods do not address the update and forward locking problems (Jaderberg et al., 2017), which are algorithmic constraints of the learning objective and back-propagation. Pipelining can be seen as attempting to work around these restrictions, with the fundamental limitations remaining; removing or reducing update, backward, and forward locking would simplify the design and improve the efficiency of such systems-level machinery. Finally, tangential to our work, Lee et al. (2015) considers auxiliary objectives, but with a joint learning objective, which is not capable of addressing any of the problems considered in this work.

6. Conclusion

We have analyzed and introduced a simple and strong baseline for parallelizing per-layer and per-module computations in CNN training. This approach matches or exceeds state-of-the-art approaches addressing these problems and is able to scale to much larger datasets than the alternatives.
Future work can develop improved auxiliary problem objectives and combinations with delayed feedback.

Acknowledgements

EO acknowledges NVIDIA for its GPU donation. EB acknowledges funding from IVADO. We would like to thank John Zarka, Louis Thiry, Georgios Exarchakis, Fabian Pedregosa, Maxim Berman, Amal Rannen, Kyle Kastner, Aaron Courville, and Nicolas Pinto for helpful discussions.

References

Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pp. 9389-9399, 2018.

Belilovsky, E., Eickenberg, M., and Oyallon, E. Greedy layerwise learning can scale to ImageNet. Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153-160, 2007.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10-19, 2014.

Choromanska, A., Kumaravel, S., Luss, R., Rish, I., Kingsbury, B., Tejwani, R., and Bouneffouf, D. Beyond backprop: Alternating minimization with co-activation memory. arXiv preprint arXiv:1806.09077, 2018.

Czarnecki, W. M., Swirszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. CoRR, abs/1703.00522, 2017. URL http://arxiv.org/abs/1703.00522.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Haarnoja, T., Hartikainen, K., Abbeel, P., and Levine, S. Latent space policies for hierarchical reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1851-1860. PMLR, 2018. URL http://proceedings.mlr.press/v80/haarnoja18a.html.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep ResNet blocks sequentially using boosting theory. International Conference on Machine Learning (ICML), 2018a.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018b.

Huo, Z., Gu, B., and Huang, H. Training neural networks using features replay. Advances in Neural Information Processing Systems, 2018a.

Huo, Z., Gu, B., Yang, Q., and Huang, H. Decoupled parallel backpropagation with convergence guarantee. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2098-2106. PMLR, 2018b. URL http://proceedings.mlr.press/v80/huo18a.html.

Ivakhnenko, A. G. and Lapa, V. G. Cybernetic Predicting Devices. CCM Information Corporation, 1965.
Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. International Conference on Machine Learning, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. ASAGA: Asynchronous parallel SAGA. In 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Leblond, R., Pedregosa, F., and Lacoste-Julien, S. Improved asynchronous parallel optimization analysis for stochastic incremental methods. Journal of Machine Learning Research, 19(81):1-68, 2018. URL http://jmlr.org/papers/v19/17-650.html.

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply-supervised nets. In Artificial Intelligence and Statistics, pp. 562-570, 2015.

Lee, D., Zhang, S., Biard, A., and Bengio, Y. Target propagation. CoRR, abs/1412.7525, 2014. URL http://arxiv.org/abs/1412.7525.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. Random feedback weights support learning in deep neural networks. CoRR, abs/1411.0247, 2014. URL http://arxiv.org/abs/1411.0247.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.

Marquez, E. S., Hare, J. S., and Niranjan, M. Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11):5475-5485, 2018. doi: 10.1109/TNNLS.2018.2805098.

Mostafa, H., Ramesh, V., and Cauwenberghs, G. Deep supervised learning using local errors. Frontiers in Neuroscience, 12:608, 2018.

Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems 29, pp. 1037-1045, 2016.

Nøkland, A. and Eidnes, L. H. Training neural networks with local error signals. arXiv preprint arXiv:1901.06656, 2019.

Ororbia, A. G. and Mali, A. Biologically motivated algorithms for propagating local target representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4651-4658, 2019.

Ororbia, A. G., Mali, A., Kifer, D., and Giles, C. L. Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834, 2018.

Oyallon, E. Building a regular decision boundary with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5106-5114, 2017.

Oyallon, E., Belilovsky, E., Zagoruyko, S., and Valko, M. Compressing the input for CNNs with the first-order scattering transform. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 301-316, 2018.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable ADMM approach. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 2722-2731. PMLR, 2016. URL http://proceedings.mlr.press/v48/taylor16.html.
Verma, V., Lamb, A., Beckham, C., Najafi, A., Courville, A., Mitliagkas, I., and Bengio, Y. Manifold mixup: Learning better representations by interpolating hidden states. 2018.

Xiao, W., Chen, H., Liao, Q., and Poggio, T. A. Biologically-plausible learning algorithms can scale to large datasets. International Conference on Learning Representations, 2019.

Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems 28, pp. 685-693, 2015.