# Understanding Synthetic Gradients and Decoupled Neural Interfaces

Wojciech Marian Czarnecki¹, Grzegorz Swirszcz¹, Max Jaderberg¹, Simon Osindero¹, Oriol Vinyals¹, Koray Kavukcuoglu¹

¹DeepMind, London, United Kingdom. Correspondence to: WM Czarnecki.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking, i.e. without waiting for a true error gradient to be backpropagated, resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al. (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

1. Introduction

Neural networks can be represented as a graph of computational modules, and training these networks amounts to optimising the weights associated with the modules of this graph to minimise a loss. At present, training is usually performed with first-order gradient descent style algorithms, where the weights are adjusted along the direction of the negative gradient of the loss. In order to compute the gradient of the loss with respect to the weights of a module, one performs backpropagation (Williams & Hinton, 1986), sequentially applying the chain rule to compute the exact gradient of the loss with respect to a module. However, this scheme has many potential drawbacks, as well as lacking biological plausibility (Marblestone et al., 2016; Bengio et al., 2015). In particular, backpropagation results in locking: the weights of a network module can only be updated after a full forwards propagation of the data through the network, followed by loss evaluation, and then only after waiting for the backpropagation of error gradients. This locking constrains us to updating neural network modules in a sequential, synchronous manner.

[Figure 1. Visualisation of SG-based learning (b) vs. regular backpropagation (a). Legend: differentiable vs. non-differentiable modules, forward connections, error gradients, and synthetic error gradients.]

One way of overcoming this issue is to apply Synthetic Gradients (SGs) to build Decoupled Neural Interfaces (DNIs) (Jaderberg et al., 2016). In this approach, models of error gradients are used to approximate the true error gradient.
These models of error gradients are local to the network modules they predict the error gradient for, so that an update to the module can be computed by using the predicted, synthetic gradients, thus bypassing the need for subsequent forward execution, loss evaluation, and backpropagation. The gradient models themselves are trained at the same time as the modules they feed synthetic gradients to. The result is effectively a complex dynamical system composed of multiple sub-networks cooperating to minimise the loss.

There is a very appealing potential in using DNIs, e.g. the potential to distribute and parallelise training of networks across multiple GPUs and machines, the ability to asynchronously train multi-network systems, and the ability to extend the temporal modelling capabilities of recurrent networks. However, it is not clear that introducing DNIs and SGs into a learning system will not negatively impact the learning dynamics and solutions found. While the empirical evidence in Jaderberg et al. (2016) suggests that SGs do not have a negative impact and that this potential is attainable, this paper digs deeper and analyses the result of using SGs in order to answer the question of the impact of synthetic gradients on learning systems. In particular, we address the following questions, using feed-forward networks as our probe network architecture:

Does introducing SGs change the critical points of the neural network learning system? In Section 3 we show that the critical points of the original optimisation problem are maintained when using SGs.

Can we characterise the convergence and learning dynamics for systems that use synthetic gradients in place of true gradients? Section 4 gives first convergence proofs when using synthetic gradients, together with empirical expositions of the impact of SGs on learning.

What is the difference in the representations and functional decomposition of networks learnt with synthetic gradients compared to backpropagation? Through experiments on deep neural networks in Section 5, we find that while networks trained with backpropagation and networks trained with synthetic gradients perform identically as functions, their layer-wise functional decomposition is markedly different due to SGs.

In addition, in Section 6 we look at formalising the connection between SGs and other forms of approximate error propagation such as Feedback Alignment (Lillicrap et al., 2016), Direct Feedback Alignment (Nøkland, 2016; Baldi et al., 2016), and Kickback (Balduzzi et al., 2014), and show that all these error approximation schemes can be captured in a unified framework, but that crucially only with synthetic gradients can one achieve unlocked training.

2. DNI using Synthetic Gradients

The key idea of synthetic gradients and DNI is to approximate the true gradient of the loss with a learnt model which predicts gradients without performing full backpropagation.

Consider a feed-forward network consisting of $N$ layers $f_n$, $n \in \{1, \ldots, N\}$, each taking an input $h_i^{n-1}$ and producing an output $h_i^n = f_n(h_i^{n-1})$, where $h_i^0 = x_i$ is the input data point $x_i$. A loss is defined on the output of the network, $L_i = L(h_i^N, y_i)$, where $y_i$ is the given label or supervision for $x_i$ (which comes from some unknown $P(y|x)$). Each layer $f_n$ has parameters $\theta_n$ that can be trained jointly to minimise $L_i$ with the gradient-based update rule

$$\theta_n \leftarrow \theta_n - \alpha \frac{\partial L(h_i^N, y_i)}{\partial h_i^n} \frac{\partial h_i^n}{\partial \theta_n},$$

where $\alpha$ is the learning rate and $\partial L_i / \partial h_i^n$ is computed with backpropagation.
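To make the locking constraint concrete, the following is a minimal NumPy sketch of this update rule for a small deep linear network with a squared loss; the layer sizes, data, loss, and learning rate are illustrative assumptions and not taken from the paper. The point it illustrates is that no $\theta_n$ can be updated until the full forward pass, the loss evaluation, and the backward pass down to layer $n$ have all completed.

```python
# A minimal NumPy sketch (illustrative, not the authors' code) of the standard
# update-locked rule: each theta_n is updated with dL/dh^n * dh^n/dtheta_n,
# which is only available after the full forward pass, the loss evaluation,
# and backpropagation through every layer above n.
# Layer sizes, data, the squared loss and the learning rate are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]                       # h^0 = x_i is 4-d, h^N is 2-d
thetas = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
alpha = 0.01

x = rng.normal(size=(1, sizes[0]))         # a single data point x_i
y = rng.normal(size=(1, sizes[-1]))        # its label y_i

# Forward pass: h^n = f_n(h^{n-1}); here each f_n is linear, h^n = h^{n-1} theta_n.
hs = [x]
for theta in thetas:
    hs.append(hs[-1] @ theta)

loss = 0.5 * np.sum((hs[-1] - y) ** 2)     # L_i = ||h^N - y||^2 / 2

# Backward pass: only now can any theta_n be updated (update and backwards locking).
dL_dh = hs[-1] - y                         # dL/dh^N
for n in reversed(range(len(thetas))):
    dL_dtheta = hs[n].T @ dL_dh            # (layer input)^T (dL/d layer output)
    dL_dh = dL_dh @ thetas[n].T            # gradient w.r.t. the layer below's output
    thetas[n] -= alpha * dL_dtheta
```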
The reliance on $\partial L_i / \partial h_i^n$ means that an update to layer $n$ can only occur after every subsequent layer $f_j$, $j \in \{n+1, \ldots, N\}$, has been computed, the loss $L_i$ has been computed, and the error gradient $\partial L_i / \partial h_i^N$ backpropagated to get $\partial L_i / \partial h_i^n$. An update rule such as this is update locked, as it depends on computing $L_i$, and also backwards locked, as it depends on backpropagation to form $\partial L_i / \partial h_i^n$.

Jaderberg et al. (2016) introduces a learnt prediction of the error gradient, the synthetic gradient

$$SG(h_i^n, y_i) = \widehat{\partial L_i / \partial h_i^n} \approx \partial L_i / \partial h_i^n,$$

resulting in the update

$$\theta_k \leftarrow \theta_k - \alpha\, SG(h_i^n, y_i) \frac{\partial h_i^n}{\partial \theta_k} \quad \forall k \leq n.$$

This approximation to the true loss gradient allows us to have both update and backwards unlocking: the update to layer $n$ can be applied without any other network computation as soon as $h_i^n$ has been computed, since the SG module is not a function of the rest of the network (unlike $\partial L_i / \partial h_i^n$). Furthermore, note that since the true $\partial L_i / \partial h_i^n$ can be described completely as a function of just $h_i^n$ and $y_i$, from a mathematical perspective this approximation is sufficiently parameterised.

The synthetic gradient module $SG(h_i^n, y_i)$ has parameters $\theta_{SG}$ which must themselves be trained to accurately predict the true gradient by minimising the L2 loss

$$L_{SG_i} = \left\| SG(h_i^n, y_i) - \partial L_i / \partial h_i^n \right\|^2.$$

The resulting learning system consists of three decoupled parts: first, the part of the network above the SG module, which minimises $L$ with respect to its parameters $\{\theta_{n+1}, \ldots, \theta_N\}$; second, the SG module, which minimises $L_{SG}$ with respect to $\theta_{SG}$; and finally the part of the network below the SG module, which uses $SG(h, y)$ as its learning signal to train $\{\theta_1, \ldots, \theta_n\}$, and is thus minimising the loss modelled internally by SG.

Assumptions and notation. Throughout the remainder of this paper, we consider the use of a single synthetic gradient module at a single layer $k$ and for a generic data sample $j$, and so refer to $h = h_j = h_j^k$; unless specified we drop the superscript $k$ and subscript $j$. This model is shown in Figure 1 (b). We also focus on SG modules which take the point's true label/value as conditioning, $SG(h, y)$, as opposed to $SG(h)$. Note that without label conditioning, an SG module is trying to approximate not $\partial L / \partial h$ but rather $\mathbb{E}_{P(y|x)}[\partial L / \partial h]$, since $L$ is a function of both input and label. In theory the parametrisation without the label is still sufficient, but learning becomes harder, since the SG module has to additionally learn $P(y|x)$.

We also focus most of our attention on models that employ linear SG modules, $SG(h, y) = hA + yB + C$. Such modules have been shown to work well in practice, and furthermore are more tractable to analyse. As a shorthand, we denote by $\theta_{<h}$ the parameters of the part of the network below the SG module (i.e. if $h$ is the output of the $k$th layer then $\theta_{<h} = \{\theta_1, \ldots, \theta_k\}$), which are trained with the synthetic gradient signal $SG(h, y)\, \partial h / \partial \theta_{<h}$.
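To ground these definitions, here is a minimal NumPy sketch, under the same illustrative assumptions as before (linear layers, squared loss, a single data point, and an arbitrary choice of the decoupling layer $k$ and learning rates), of a single linear SG module $SG(h, y) = hA + yB + C$: the layers below the module are updated immediately from the synthetic gradient, while the layers above and the SG module itself are trained later from the true backpropagated gradient. None of the variable names come from the paper.

```python
# A minimal NumPy sketch (not the authors' code) of one linear SG module
# SG(h, y) = hA + yB + C decoupling a deep linear network at layer k.
# Layer sizes, data, the squared loss and learning rates are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
thetas = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
k = 2                                       # the SG module sits on h = h^k
A = np.zeros((sizes[k], sizes[k]))          # theta_SG = (A, B, C)
B = np.zeros((sizes[-1], sizes[k]))
C = np.zeros((1, sizes[k]))
alpha, alpha_sg = 0.01, 0.01

x = rng.normal(size=(1, sizes[0]))          # data point x
y = rng.normal(size=(1, sizes[-1]))         # its label y

# Forward pass through the lower part only, up to h = h^k.
hs = [x]
for theta in thetas[:k]:
    hs.append(hs[-1] @ theta)
h = hs[-1]

# Update- and backwards-unlocked step: theta_{<h} is trained from SG(h, y)
# as soon as h is available, with no dependence on the layers above.
sg = h @ A + y @ B + C                      # SG(h, y), the predicted dL/dh
delta = sg
for n in reversed(range(k)):
    grad_theta = hs[n].T @ delta            # SG(h, y) chained through dh/dtheta_n
    delta = delta @ thetas[n].T             # pass the signal to the layer below
    thetas[n] -= alpha * grad_theta

# Later, possibly asynchronously: the upper part runs, the loss is evaluated,
# and the true dL/dh is obtained by backpropagation through the upper layers.
h_up = [h]
for theta in thetas[k:]:
    h_up.append(h_up[-1] @ theta)
dL_dh = h_up[-1] - y                        # dL/dh^N for L = ||h^N - y||^2 / 2
for m in reversed(range(k, len(thetas))):
    grad_theta = h_up[m - k].T @ dL_dh      # true gradient for the upper layers
    dL_dh = dL_dh @ thetas[m].T
    thetas[m] -= alpha * grad_theta
true_grad = dL_dh                           # dL/dh at the decoupling point

# Train the SG module by minimising ||SG(h, y) - dL/dh||^2 w.r.t. (A, B, C).
err = sg - true_grad                        # L2-loss gradient, up to a factor of 2
A -= alpha_sg * (h.T @ err)
B -= alpha_sg * (y.T @ err)
C -= alpha_sg * err
```

The design point the sketch is meant to surface is that the update of $\theta_{<h}$ depends only on $h$, $y$ and $\theta_{SG}$, which is exactly what makes the lower part of the network update- and backwards-unlocked; the upper part and the SG regression can run whenever the true gradient eventually becomes available.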