# BP(λ): Online Learning via Synthetic Gradients

Published in Transactions on Machine Learning Research (04/2024)

Joseph Pemberton (joe.pemberton@bristol.ac.uk) — Computational Neuroscience Unit, Faculty of Engineering, University of Bristol, United Kingdom; Centre for Neural Circuits and Behaviour, Department of Physiology, Anatomy and Genetics, Medical Sciences Division, University of Oxford, United Kingdom

Rui Ponte Costa (rui.costa@dpag.ox.ac.uk) — Centre for Neural Circuits and Behaviour, Department of Physiology, Anatomy and Genetics, Medical Sciences Division, University of Oxford, United Kingdom; Computational Neuroscience Unit, Faculty of Engineering, University of Bristol, United Kingdom

Reviewed on OpenReview: https://openreview.net/forum?id=3kYgouAfqk

Training recurrent neural networks typically relies on backpropagation through time (BPTT). BPTT requires both the forward and backward passes to be completed, leaving the network locked to these computations before loss gradients are available. Recently, Jaderberg et al. proposed synthetic gradients to alleviate the need for full BPTT. In their implementation synthetic gradients are learned through a mixture of backpropagated gradients and bootstrapped synthetic gradients, analogous to the temporal difference (TD) algorithm in Reinforcement Learning (RL). However, as in TD learning, heavy use of bootstrapping can result in bias which leads to poor synthetic gradient estimates. Inspired by accumulate TD(λ) in RL, we propose a fully online method for learning synthetic gradients which avoids the use of BPTT altogether: accumulate BP(λ). As in accumulate TD(λ), we show analytically that accumulate BP(λ) can control the level of bias by using a mixture of temporal difference errors and recursively defined eligibility traces.
We next demonstrate empirically that our model outperforms the original implementation for learning synthetic gradients in a variety of tasks, and is particularly suited for capturing longer timescales. Finally, building on recent work we reflect on accumulate BP(λ) as a principle for learning in biological circuits. In summary, inspired by RL principles we introduce an algorithm capable of bias-free online learning via synthetic gradients.

## 1 Introduction

A common approach for solving temporal tasks is to use recurrent neural networks (RNNs), which with the right parameters can effectively integrate and maintain information over time. The temporal distance between inputs and subsequent task loss, however, can make optimising these parameters challenging. The backpropagation through time (BPTT) algorithm is the classical solution to this problem: once the task is complete, all task losses are propagated backwards in time through the preceding chain of computation. Exact loss gradients are thereby derived and used to guide updates to the network parameters. However, BPTT can be undesirably expensive to perform, with its memory and computational requirements scaling intractably with the task duration. Moreover, the gradients can only be obtained after the RNN forward and backward passes have been completed, leaving the network parameters effectively locked until those computations are carried out. One common solution to alleviate these issues is to apply truncated BPTT, where error gradients are only backpropagated within fixed truncation windows, but this approach can limit the network's ability to capture long-range temporal dependencies.

Figure 1: Schematic of a recurrent neural network (RNN) which learns via synthetic gradients. (a) External input $x_t$ is provided to the RNN which has hidden state $h_t$.
Due to recurrency this state will affect the task loss at the current timestep $L_t$ and future timesteps $L_{>t}$ not yet seen. A distinct synthesiser network receives $h_t$ as input and estimates its future loss gradient $\hat{G}_t \approx \partial L_{>t}/\partial h_t$, which is provided to the RNN for learning. The synthesiser learns to mimic a target gradient $G_t$. How $G_t$ is defined and learned is the focus of this paper. (b) An illustration of the accumulate BP(λ) algorithm for learning synthetic gradients in an unrolled version of the network. Current activity $h_t$ must be correctly associated with the later task loss $L_T$. Here the parameters θ of the synthesiser are updated via a mixture of temporal difference errors δ (red) and eligibility traces e (green). As in accumulate TD(λ) in RL, δ is computed online using bootstrapping whilst e propagates forwards with a decay component λ, $0 \le \lambda \le 1$. Together, they approximate the true loss gradient. In contrast to the original synthetic gradient algorithm by Jaderberg et al. (2017), our model does not require BPTT.

One proposed method which avoids the need for many-step BPTT whilst still capturing long-range dependencies is to apply synthetic gradients (Jaderberg et al., 2017; Czarnecki et al., 2017). In this method gradients from future errors are predicted by a separate network, a "synthesiser", given the current RNN activity (Fig. 1a). Synthetic gradients enable the network to model long-range dependencies on future errors whilst avoiding the waiting time imposed by BPTT. The independence of memory and computational complexity with respect to the total task length makes synthetic gradients an attractive alternative to BPTT (Marschall et al., 2020). Recently, these properties have led neuroscientists to speculate that synthetic gradients are computed at the systems level in the brain, explaining a range of experimental observations (Marschall et al., 2019; Pemberton et al., 2021; Boven et al., 2023).
Despite their promise, the full potential of approximating BPTT with synthetic gradients has not yet been realised. In particular, it is not yet clear what the optimal conditions for learning synthetic gradients are. In its original implementation, Jaderberg et al. use synthetic gradients alongside truncated BPTT, and define the synthesiser target as a mixture of backpropagated gradients with its own predicted (future) gradient. That is, the synthesiser uses its own estimations, i.e. bootstrapping, for learning. As the original authors note, this is highly reminiscent of the temporal difference (TD) algorithms used in Reinforcement Learning (RL), which use bootstrapping for estimating the future return (Sutton & Barto, 2018). Indeed, in their supplementary material Jaderberg et al. extend this analogy and introduce the notion of the λ-weighted synthetic gradient, which is analogous to the λ-return in RL. However, λ-weighted synthetic gradients were only presented conceptually, and it remained unclear whether they would be of practical benefit as they still require BPTT. In this study, inspired by established RL theory, we make conceptual and experimental advancements on λ-weighted synthetic gradients. In particular, we propose an algorithm for learning synthetic gradients, accumulate BP(λ), which mirrors the accumulate TD(λ) algorithm in RL (Van Seijen et al., 2016). Just as accumulate TD(λ) provides an online solution to learning the λ-return in RL, we show that accumulate BP(λ) provides an online solution to learning λ-weighted synthetic gradients. The algorithm uses forward-propagating eligibility traces and has the advantage of not requiring (even truncated) BPTT at all. Moreover, we demonstrate that accumulate BP(λ) can alleviate the bias involved in directly learning bootstrapped estimations as suffered in the original implementation.
We now provide a brief background on the application of synthetic gradients for RNN learning. We then introduce the accumulate BP(λ) algorithm and demonstrate both analytically and empirically that it alleviates the problem of bias suffered in the original implementation. Next, we touch upon accumulate BP(λ) as a mechanism for learning in biological circuits. Finally, we discuss the limitations and conclusions of our work.

## 2 Background

### 2.1 Synthetic gradients for supervised learning

Consider an RNN with free parameters Ψ performing a task of sequence length T (which may be arbitrarily long). At time t RNN dynamics follow

$$h_t = f(x_t, h_{t-1}; \Psi) \quad (1)$$

where $h$ is the RNN hidden state, $x$ is the input, and $f$ is the RNN computation. Let $L_t = L(\hat{y}_t, y_t)$ denote the loss at time $t$, where $\hat{y}_t$ is the RNN-dependent prediction and $y_t$ is the desired target. Let $L_{>t} = \sum_{t < \tau \le T} L_\tau$ denote the total loss strictly after timestep $t$. During training, we wish to update Ψ to minimise all losses from timestep $t$ onwards. Using gradient descent this is achieved with

$$\Psi \leftarrow \Psi - \eta \, \frac{\partial L_{\ge t}}{\partial \Psi} \quad (2)$$

for some RNN learning rate η, where $L_{\ge t} = L_t + L_{>t}$. We can write this gradient as

$$\frac{\partial L_{\ge t}}{\partial \Psi} = \left( \frac{\partial L_t}{\partial h_t} + \frac{\partial L_{>t}}{\partial h_t} \right) \frac{\partial h_t}{\partial \Psi} \quad (3)$$

Whilst the terms $\frac{\partial L_t}{\partial h_t}$ and $\frac{\partial h_t}{\partial \Psi}$ above are relatively easy to compute and available at timestep $t$, $\frac{\partial L_{>t}}{\partial h_t}$ can present challenges. Without BPTT, this term is simply taken as zero, $\frac{\partial L_{>t}}{\partial h_t} = 0$, and future errors are effectively ignored. With BPTT, this term is only computed after the forward pass and the corresponding backward pass, so that all future errors are observed and appropriately backpropagated. This has memory and computational complexity which scales with T, and thereby relies on arbitrarily long waits before the loss gradient is available. The aim of synthetic gradients is to provide an immediate prediction $\hat{G}_t$ of the future loss gradient,

$$\hat{G}_t \approx G_t := \frac{\partial L_{>t}}{\partial h_t} \quad (4)$$

(Jaderberg et al., 2017). We use the notation $G_t$ both to represent the gradient and to highlight its resemblance to the return in RL. Note that this gradient is a vector of the same size as $h$.
As in the original implementation, we consider $\hat{G}_t$ to be a computation of the current RNN state with a separate synthesiser network: $\hat{G}_t = g(h_t; \theta)$, where $g$ denotes the synthesiser computation which depends on its free parameters θ. An approximation for the loss gradient with respect to the RNN parameters can then be written as

$$\frac{\partial L_{\ge t}}{\partial \Psi} \approx \left( \frac{\partial L_t}{\partial h_t} + g(h_t; \theta) \right) \frac{\partial h_t}{\partial \Psi} \quad (5)$$

This is available at timestep $t$ and removes the dependence of the memory and computational complexity on $T$ as in Eq. 3.

### 2.2 Learning synthetic gradients

How the synthesiser parameters θ should be learned remains relatively unexplored and is the focus of this paper. The problem can be stated as trying to minimise the synthesiser loss function $L_g(\theta)$ defined as

$$L_g(\theta) := \mathbb{E}_{h_t \sim P} \left[ \left\lVert \tilde{G}_t - g(h_t; \theta) \right\rVert_2^2 \right] \quad (6)$$

where $P$ is the probability distribution over RNN hidden states, $\tilde{G}_t$ is the target synthetic gradient for state $h_t$, and $\lVert \cdot \rVert_2$ denotes the Euclidean norm. By taking samples of hidden states $h_t$ and applying the chain rule, the stochastic updates for θ can be written as

$$\theta_{t+1} = \theta_t + \alpha \left( \tilde{G}_t - g(h_t; \theta_t) \right)^{\top} \frac{\partial g(h_t; \theta_t)}{\partial \theta} \quad (7)$$

where α is the synthesiser learning rate. Note the transpose operation, since all terms in Equation 7 are vectors. Ideally $\tilde{G}_t$ is the true error gradient, $\tilde{G}_t = G_t$, but this requires full BPTT, which is exactly what synthetic gradients are used to avoid. In its original formulation, Jaderberg et al. instead propose a target which is based on a mixture of backpropagated error gradients within a fixed window and a bootstrapped synthetic gradient to incorporate errors beyond. The simplest version of this, which only uses one-step BPTT, only incorporates the error at the next timestep and relies on a bootstrap prediction for all timesteps onwards. We denote this target $G^{(1)}_t$:

$$G^{(1)}_t = \frac{\partial L_{t+1}}{\partial h_t} + \gamma \, \hat{G}_{t+1} \, \frac{\partial h_{t+1}}{\partial h_t} \quad (8)$$

where, inspired by RL, $\gamma \in [0, 1]$ is the gradient discount factor which, if $\gamma < 1$, scales down later error gradients.
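To make the update rule and the bootstrapped one-step target concrete, the following sketch implements Equations 7 and 8 for a linear synthesiser $g(h) = \theta h$. This is an illustrative NumPy toy, not the paper's code: the hidden size, learning rates, and the toy Jacobian `J` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8            # hidden size (illustrative)
gamma = 0.9      # gradient discount factor
alpha = 1e-2     # synthesiser learning rate

theta = np.zeros((H, H))          # linear synthesiser: g(h) = theta @ h

def g(h, theta):
    return theta @ h

def one_step_update(theta, h_t, h_tp1, dL_dh_t, J):
    """One synthesiser update with the bootstrapped one-step target (Eq. 8).

    J is the one-step RNN Jacobian dh_{t+1}/dh_t; dL_dh_t is dL_{t+1}/dh_t.
    """
    # target G^(1)_t = dL_{t+1}/dh_t + gamma * g(h_{t+1}) @ J  (bootstrap term)
    target = dL_dh_t + gamma * (g(h_tp1, theta) @ J)
    pred = g(h_t, theta)
    # gradient step on ||target - pred||^2, treating the target as constant (Eq. 7);
    # for g = theta h the parameter gradient is the outer product with h_t
    return theta + alpha * np.outer(target - pred, h_t)

h_t = rng.standard_normal(H)
J = 0.5 * np.eye(H)               # toy Jacobian (assumed, for illustration)
h_tp1 = J @ h_t
dL_dh_t = rng.standard_normal(H)
theta = one_step_update(theta, h_t, h_tp1, dL_dh_t, J)
```

Note that, as in the paper, the target is treated as fixed during the update ("semi-gradient" style), so the bootstrap term does not itself receive gradients.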
Note that in its original implementation γ = 1, but in our simulations we find it important to set γ < 1 (see Appendix section C). Notably, Equation 8 resembles the bootstrapped target involved in the TD algorithm in RL (Sutton & Barto, 2018). In particular, $G^{(1)}_t$ can be considered analogous to the one-step return at state $S_t$ defined as $R_{t+1} + \gamma V(S_{t+1})$, where $R_{t+1}$ is the reward, $S_{t+1}$ is the subsequent state, and $V$ is the (bootstrapped) value function.

Table 1: Summary of terms used in value estimation in Reinforcement Learning (RL) and synthetic gradient (SG) estimation in supervised learning. For RL, $R_t$ denotes the reward at timestep $t$, $\phi_t$ denotes the feature-based representation of the environmental state, and $V$ is the value function. See main text and Van Seijen et al. (2016) for further details of the terms.

| Term | Reinforcement Learning | Synthetic Gradients |
|---|---|---|
| $G_t$ | $\sum_{\tau \ge 1} \gamma^{\tau-1} R_{t+\tau}$ | $\sum_{\tau \ge 1} \gamma^{\tau-1} \frac{\partial L_{t+\tau}}{\partial h_t}$ |
| $\hat{G}_t$ | $V(\phi_t; \theta)$ | $g(h_t; \theta)$ |
| $G^{(n)}_t$ | $\sum_{\tau=1}^{n} \gamma^{\tau-1} R_{t+\tau} + \gamma^n \hat{G}_{t+n}$ | $\sum_{\tau=1}^{n} \gamma^{\tau-1} \frac{\partial L_{t+\tau}}{\partial h_t} + \gamma^n \hat{G}_{t+n} \frac{\partial h_{t+n}}{\partial h_t}$ |
| $G^{\lambda}_t$ | $(1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t$ | same form, with the SG terms above |
| $G^{\lambda \mid H}_k$ | $(1-\lambda) \sum_{n=1}^{H-k-1} \lambda^{n-1} G^{(n)}_k + \lambda^{H-k-1} G^{(H-k)}_k$ | same form, with the SG terms above |

### 2.3 n-step synthetic gradients

In practice, in its original implementation Jaderberg et al. primarily consider applying synthetic gradients alongside truncated BPTT for truncation size n > 1. In this case, the left-hand term in Equation 8 can be extended to incorporate loss gradients backpropagated within the n timesteps of the truncation. The synthesiser target is then set to the n-step synthetic gradient $G^{(n)}_t$, defined as

$$G^{(n)}_t = \sum_{t < \tau \le t+n} \gamma^{\tau-t-1} \frac{\partial L_\tau}{\partial h_t} + \gamma^n \, \hat{G}_{t+n} \, \frac{\partial h_{t+n}}{\partial h_t} \quad (9)$$

which is analogous to the n-step return in RL. Importantly, in practice this scheme uses truncated BPTT both to define the synthesiser target and to obtain the error gradient used to update the RNN. Specifically, Jaderberg et al.
apply the n backpropagated gradients to directly learn Ψ as well as θ, with the synthetic gradients themselves only applied at the end of each truncation window (see original paper for details). Just as in RL, increasing the truncation size n provides a target which assigns more weight to the observations than to the bootstrapped term, thus reducing its bias and potentially leading to better synthesiser learning. On the other hand, Equation 9 requires n-step BPTT and thereby enforces undesirable waiting time and complexity for large n.

## 3 Model and analytical results

In this study we formulate a learning algorithm, accumulate BP(λ), which has the advantage of reduced bias compared to the original n-step implementation. Furthermore, the algorithm can be implemented at each timestep and does not require BPTT at all. We first define λ-weighted synthetic gradients. We highlight that our definition is similar to, but not the same as, that first introduced in Jaderberg et al. (2017). Specifically, our definition incorporates gradients only for losses strictly after the current timestep and can therefore naturally be used in the context of learning the synthesiser.

### 3.1 λ-weighted synthetic gradient

Let λ be such that $0 \le \lambda \le 1$. We define the λ-weighted synthetic gradient as

$$G^{\lambda}_t := (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t \quad (10)$$

$$= (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G^{(n)}_t + \lambda^{T-t-1} G_t \quad (11)$$

which is analogous to the λ-return in RL. Note the distinction in notation between the λ-weighted synthetic gradient $G^{\lambda}_t$ and the n-step synthetic gradient $G^{(n)}_t$. Moreover, note that Equation 11 is similar to, but distinct from, the recursive definition proposed in the original paper (see Appendix section B). In particular, whilst Jaderberg et al. also incorporate the loss gradient at the current timestep in their definition, Equation 11 only considers strictly future losses.
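As a quick numerical check of Equation 11, the weights $(1-\lambda)\lambda^{n-1}$ on the n-step targets together with the tail weight $\lambda^{T-t-1}$ on the true gradient always form a convex combination. A small sketch (illustrative; the function name and arguments are our own):

```python
import numpy as np

def lambda_weights(lam, steps_remaining):
    """Weights on the n-step targets G^(n), n = 1..steps_remaining-1,
    plus the weight on the true gradient G_t, as in Equation 11
    (steps_remaining = T - t)."""
    n = np.arange(1, steps_remaining)          # n = 1, ..., T-t-1
    w_nstep = (1 - lam) * lam ** (n - 1)
    w_true = lam ** (steps_remaining - 1)      # lambda^{T-t-1}
    return w_nstep, w_true

w_nstep, w_true = lambda_weights(lam=0.9, steps_remaining=10)
total = w_nstep.sum() + w_true                 # sums to 1 for any lam in [0, 1]
```

Setting `lam=0` puts all weight on the one-step target $G^{(1)}_t$, while `lam=1` puts all weight on the true gradient $G_t$, matching the limiting cases discussed in the text.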
Since the synthesiser itself is optimised to produce future error gradients (Equation 4), this enables the λ-weighted synthetic gradient to be directly used as a synthesiser target. As in RL, a higher choice of λ results in stronger weighting of observed gradients compared to the bootstrapped terms. For example, whilst $G^{0}_t = G^{(1)}_t$ is just the one-step synthetic gradient, which relies strongly on the bootstrapped prediction (cf. Equation 8), $G^{1}_t = G_t$ is the true (unbiased) gradient as obtained via full BPTT. We also define the interim λ-weighted synthetic gradient $G^{\lambda \mid H}_k$ as

$$G^{\lambda \mid H}_k := (1-\lambda) \sum_{n=1}^{H-k-1} \lambda^{n-1} G^{(n)}_k + \lambda^{H-k-1} G^{(H-k)}_k \quad (12)$$

Unlike Equation 11, this is available at time ("horizon") $H$ with $k < H < T$. Table 1 provides an overview of the terms defined along with their respective counterparts in RL.

Algorithm 1: RNN learning with accumulate BP(λ). Updates RNN parameters using estimated gradients provided by synthesiser function g.

    Input: Ψ0, θ0, {(x_t, y_t)} for 1 ≤ t ≤ T, η, α, γ, λ
    Ψ ← Ψ0                                    {init. RNN parameters}
    θ ← θ0                                    {init. synthesiser parameters}
    h, J, e ← 0                               {init. RNN state, Jacobian, and elig. trace}
    for t = 1 to T do
        e ← γλ J e + ∂g(h; θ)/∂θ              {update eligibility trace}
        h' ← f(x_t, h; Ψ);  L ← L(h', y_t)    {compute next hidden state and task loss}
        J ← ∂h'/∂h;  J_Ψ ← ∂h'/∂Ψ             {compute local gradients}
        δ ← [∂L/∂h' + γ g(h'; θ)] J - g(h; θ) {compute synthesiser TD error}
        θ ← θ + α δᵀ e                        {update synthesiser parameters}
        Ψ ← Ψ - η [∂L/∂h' + g(h'; θ)] J_Ψ     {update RNN parameters}
        h ← h'                                {update RNN hidden state}
    end for

### 3.2 Offline λ-SG algorithm

We define the offline λ-SG algorithm to learn θ, which is analogous to the offline λ-return algorithm in RL but for synthetic gradients (SG), by taking the target in Equation 7 to be $G^{\lambda}_t$. It is offline in the sense that it requires the completion of the sequence at timestep T before updates are possible.

### 3.3 Online λ-SG algorithm

We define the online λ-SG algorithm to learn θ, which is analogous to the online λ-return algorithm in RL but for synthetic gradients.
At the current timestep t, the algorithm updates θ based on its predictions over all prior RNN hidden states $h_k$ from k = 0 up to k = t. In particular, at timestep t the algorithm applies as many weight updates, iterating over k according to

$$\theta^{t}_{k} = \theta^{t}_{k-1} + \alpha \left( G^{\lambda \mid t}_{k} - g(h_k; \theta^{t}_{k-1}) \right)^{\top} \frac{\partial g(h_k; \theta^{t}_{k-1})}{\partial \theta} \quad (13)$$

where $\theta^{0}_{0} = \theta_{\mathrm{init}}$ is the initialisation weight and subsequent initial weight vectors are inherited from the previous timestep, $\theta^{t+1}_{0} = \theta^{t}_{t}$. This latter quantity, the final weight vector at time t, defines $\theta_t$ (without superscript). The information used for the weight updates in Equation 13 is available at the current timestep and the algorithm is therefore online. Moreover, as in RL, the online λ-SG algorithm produces weight updates similar to the offline λ-SG algorithm. In particular, at the end of the sequence with the horizon H = T, note that $G^{\lambda \mid H}_t$ and $G^{\lambda}_t$ are the same. However, storing the previous hidden states and iteratively applying the online λ-SG algorithm requires undesirable computational cost. In particular, the $1 + 2 + \dots + T$ operations in Equation 13 result in computational complexity which scales intractably with $T^2$.

### 3.4 Accumulate BP(λ)

In this study we propose the accumulate BP(λ) algorithm, which is directly inspired by the accumulate TD(λ) algorithm in RL (Van Seijen et al., 2016). Like accumulate TD(λ), the motivation for accumulate BP(λ) is to enable relatively cheap, online computations whilst alleviating the problem of bias which comes from bootstrapping, using eligibility traces (Fig. 1b). In accumulate BP(λ), the weight update at timestep t is defined by

$$\theta_{t+1} = \theta_t + \alpha \, \delta_t^{\top} e_t \quad (14)$$

where $\delta_t$ is the temporal difference error at timestep t,

$$\delta_t = \frac{\partial L_{t+1}}{\partial h_t} + \gamma \, g(h_{t+1}; \theta_t) \, \frac{\partial h_{t+1}}{\partial h_t} - g(h_t; \theta_t) \quad (15)$$

and $e_t$ is the eligibility trace of θ at time t,

$$e_t = \gamma \lambda \, \frac{\partial h_t}{\partial h_{t-1}} \, e_{t-1} + \frac{\partial g(h_t; \theta_t)}{\partial \theta} \quad (16)$$

with $e_0$ defined as the zero vector, $e_0 = 0$.

Point of clarity.
Note that there is some abuse of notation with respect to the matrix multiplication operations defined in Equations 14 and 16. If $in_\theta$, $out_\theta$ are the sizes of the input and output dimensions of θ, respectively, then the eligibility trace $e_t$ is a three-dimensional tensor of shape $(|h|, out_\theta, in_\theta)$, where $|h|$ is the size of $h$. To compute the matrix product $A e_t$ for a matrix $A$ of shape $(r, |h|)$, we concatenate the latter two dimensions of $e_t$ so that it is of shape $(|h|, out_\theta \cdot in_\theta)$ and, once the product is computed, reshape it to $(r, out_\theta, in_\theta)$. Note that if r = 1 (as in Equation 14) then the first dimension is removed.

The core idea behind accumulate BP(λ) is that these eligibility traces keep track of the weights which have contributed to previous error-gradient predictions, where the rate at which this contribution fades over time is λγ. These traces then inform the synthesiser which weights are more liable to change once a temporal difference error occurs (Equation 15). Importantly, accumulate BP(λ) applies a forward-view (online) method of learning which operates timestep by timestep, in stark contrast to the (backward-view) offline λ-SG algorithm for which the task sequence must first be completed before updates can occur.

The main analytical result in this paper is that accumulate BP(λ) provides an approximation to the online λ-SG algorithm. This theorem uses the term $\Delta^t_i$ defined as

$$\Delta^t_i := \left( G^{\lambda \mid t}_{i,0} - g(h_i; \theta_0) \right)^{\top} \frac{\partial g(h_i; \theta_i)}{\partial \theta}$$

where $G^{\lambda \mid t}_{i,0}$ is the λ-weighted synthetic gradient which uses the initial weight vector $\theta_0$ for all synthetic gradient estimations.

Theorem 3.1. Let $\theta_0$ be the initial weight vector, $\theta^{BP}_t$ be the weight vector at time t computed by accumulate BP(λ), and $\theta^{\lambda}_t$ be the weight vector at time t computed by the online λ-SG algorithm. Furthermore, assume that $\sum_{i=0}^{t-1} \Delta^t_i$ does not contain any zero elements. Then, for all time steps t:

$$\frac{\left\lVert \theta^{BP}_t - \theta^{\lambda}_t \right\rVert_2}{\left\lVert \theta^{BP}_t - \theta_0 \right\rVert_2} \to 0 \quad \text{as} \quad \alpha \to 0$$

Proof: The full proof can be found in Appendix section A.
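To ground Equations 14-16, here is a minimal NumPy sketch of accumulate BP(λ) for a toy setting similar to Section 4.1: a fixed linear RNN, a loss only at the final timestep, and a linear synthesiser. This is an illustrative reimplementation under assumptions of our own (identity readout, zero bootstrap at the terminal step, and einsum contractions following our reading of the shape convention above), not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
H, T = 6, 10                            # hidden size, task length (illustrative)
gamma, lam, alpha = 1.0, 1.0, 5e-3

W = rng.standard_normal((H, H)) * 0.3   # fixed linear RNN: h_{t+1} = W h_t
h1 = rng.standard_normal(H)             # state after the single input at t = 1
y = rng.standard_normal(H)              # target read out (identity readout) at t = T
theta = np.zeros((H, H))                # linear synthesiser g(h) = theta h

def g(h, th):
    return th @ h

def dg(h):
    # d g_i / d theta_kl = delta_ik * h_l for the linear synthesiser,
    # stored with shape (|h|, out, in) as in the point of clarity above
    out = np.zeros((H, H, H))
    out[np.arange(H), np.arange(H), :] = h
    return out

for epoch in range(300):
    h, e = h1, dg(h1)                   # e_1 = dg(h_1) since e_0 = 0   (Eq. 16)
    for t in range(1, T):
        h_next = W @ h                  # one RNN step; Jacobian dh_next/dh = W
        dL = h_next - y if t + 1 == T else np.zeros(H)   # MSE loss only at T
        boot = gamma * g(h_next, theta) if t + 1 < T else np.zeros(H)
        delta = (dL + boot) @ W - g(h, theta)            # TD error      (Eq. 15)
        theta = theta + alpha * np.einsum('i,ikl->kl', delta, e)       # (Eq. 14)
        # decay the trace, propagate it through the Jacobian, add new term (Eq. 16)
        e = gamma * lam * np.einsum('ij,jkl->ikl', W, e) + dg(h_next)
        h = h_next
```

Note that the per-step cost is independent of T: only the current state, the one-step Jacobian, and the trace tensor are kept in memory, which is the property the complexity comparison in Table 3 refers to.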
In general, the structure of the proof closely follows that provided in the analogous RL paradigm for accumulate TD(λ) (Van Seijen et al., 2016). Accumulate BP(λ) thus provides an online method for learning the λ-weighted synthetic gradient which, analogous to accumulate TD(λ), avoids the memory and computational requirements of the online λ-SG algorithm. For example, when eligibility traces are unused and λ = 0, i.e. accumulate BP(0), the synthesiser learns with the one-step synthetic gradient $G^{(1)}_t$ as its target, and we arrive at the original implementation with truncation size n = 1 (Jaderberg et al., 2017). When λ = 1, i.e. accumulate BP(1), for an appropriately small learning rate α the synthesiser learns with the true BPTT gradient $G_t$ as its target. Importantly, Equations 15 and 16 only consider gradients between variables at most one timestep apart. In this respect, at least in the conventional sense, there is no BPTT.

Figure 2: BP(λ) derives true BPTT gradients over multiple timesteps. In this toy paradigm, input is only provided at timestep 1 and the task target is only available at the end of the task at time T = 10. (a) Alignment between synthetic gradients and true gradients for a fixed RNN model across different timesteps within the task, where the synthetic gradients are learned using (accumulate) BP(λ). Alignment is defined using the cosine similarity metric. (b) The average alignment over the last 10% of epochs in (a) across all timesteps.

## 4 Experiments

Next, we empirically tested the ability of accumulate BP(λ), which we henceforth simply call BP(λ), to produce good synthetic gradients and drive effective RNN parameter updates.
In all experiments we take the synthesiser computation g simply as a linear function of the RNN hidden state, $g(h_t; \theta) = \theta h_t$.

### 4.1 Approximating true error gradients in a toy task

We first analyse the alignment of BP(λ)-derived synthetic gradients and true gradients derived by full BPTT, which we use to quantify the bias in synthesiser predictions. For this we consider a toy task in which a fixed (randomly connected) linear RNN receives a static input $x_1$ at timestep 1 and null input onwards, $x_t = 0$ for t > 1. To test the ability of BP(λ) to transfer error information across time, the error is only defined at the last timestep, where $L_T$ is the mean-squared error (MSE) between a two-dimensional target $y_T$ and a linear readout of the final hidden activity $h_T$. We use a task length of T = 10 timesteps. Note that since the network is linear and the loss is a function of the MSE, the linear synthesiser should in principle be able to perfectly model the true BPTT gradient (Czarnecki et al., 2017).

As expected, we find that a high λ improves the alignment of synthetic gradients and true gradients compared to the λ = 0 case (i.e. BP(0) as in Jaderberg et al. (2017); Fig. 2a). Specifically, these results show, as predicted, that the heavy reliance on bootstrapping of BP(0) means that loss gradients near the end of the sequence must first be faithfully captured before earlier timesteps can be learned. When eligibility traces are applied in BP(1), however, the over-reliance on the bootstrapped estimate is drastically reduced and the synthesiser can very quickly learn faithful predictions across all timesteps. Indeed, we observe that high λ avoids the deterioration of synthetic gradient quality at earlier timesteps as suffered by BP(0) (Fig. 2b). We also observe that BP(λ) often outperforms n-step synthetic gradients (Fig. 6).

Next, to verify that BP(λ) is actually beneficial for RNN learning, we followed the scheme of Jaderberg et al.
(2017) and applied RNN weight updates together with synthesiser weight updates (Algorithm 1). To ensure that the RNN needs to learn a temporal association (as opposed to a fixed readout), in this case we consider 3 input/target pairs and large sequence lengths T; we also now apply a tanh non-linearity to bound RNN activity. Consistent with our results for the case of fixed RNNs, we observe that BP(λ) gives rise to better predictions for high λ when the RNN is learning (Fig. 3a). Notably, however, the alignment is weaker for plastic RNNs when compared to fixed RNNs. This is because the challenge is now harder for the synthesiser: changes to the RNN will affect its error gradients and therefore lead to a moving synthesiser target.

Figure 3: BP(λ) drives better RNN learning in a toy task. (a) Average cosine similarity between synthetic gradients and true gradients for fixed (left) and plastic (right) RNNs. Cosine similarity for plastic RNNs is taken over the first 5 training epochs, since this initial period is the key stage of learning (i.e. before the task is perfected). (b) Learning curves of RNNs which are updated using synthetic gradients derived by BP(λ) over different sequence lengths T. (c) Effect of learning rate α (cf. Equation 14) on task error (averaged over the last 10 epochs of training) for plastic RNNs using synthetic gradients derived by BP(λ) and the online λ-SG algorithm (λ = 1, T = 10); this can be compared to Figure 2 in Van Seijen et al. (2016), which displays analogous results in the RL paradigm. Here standard stochastic gradient descent is used (all other main experiments use Adam). Results show average (with SEM) over 5 different initial conditions.
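For reference, the alignment metric reported in Figures 2 and 3 is plain cosine similarity between the synthetic and true gradient vectors. A minimal sketch (the function name and the epsilon guard against zero-norm vectors are our own):

```python
import numpy as np

def alignment(synthetic_grad, true_grad, eps=1e-12):
    """Cosine similarity between a synthetic gradient and the true BPTT gradient.

    Returns 1 for perfectly aligned directions, 0 for orthogonal ones,
    and -1 for opposed directions; magnitude is ignored."""
    num = float(np.dot(synthetic_grad, true_grad))
    den = float(np.linalg.norm(synthetic_grad) * np.linalg.norm(true_grad)) + eps
    return num / den

g_hat = np.array([1.0, 2.0, 0.0])
g_true = np.array([2.0, 4.0, 0.0])   # same direction, different scale
score = alignment(g_hat, g_true)     # close to 1.0
```

Because the metric is scale-invariant, it isolates the direction of the predicted gradient, which is what determines whether a synthetic-gradient update moves the RNN parameters the right way.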
Nonetheless, we find that BP(λ) improves RNN learning even in the presence of relatively long temporal credit assignment (Fig. 3b). Next, we contrasted the ability of both BP(λ) and n-step methods to achieve near-zero error. Our results show that BP(1) is able to solve tasks more than double the length of those solved by the next best model (Table 2 and Fig. 7). Finally, we explored the effect of the learning rate with BP(λ). Indeed, Theorem 3.1 only applies as the step-size α approaches zero, and empirically TD(λ) is known to diverge if α is too large (Van Seijen et al., 2016). We observe similar results in our experiments, and note that BP(λ) is only applicable with modest step-sizes (Fig. 3c). This is true over a range of values for λ, though, consistent with TD(λ), we observe that mid-range λ values are more robust to the learning rate (Fig. 8; e.g. compare λ = 0.75 with λ = 1).

### 4.2 Sequential MNIST task

To test the ability of BP(λ) to generalise to non-trivial tasks, we now consider the sequential MNIST task (Le et al., 2015). In this task the RNN is provided with a row-by-row representation of an MNIST image which it must classify at the end (Fig. 4a). That is, as in the toy task above, the task loss is only defined at the final timestep and must be effectively associated with prior inputs. Since this is a harder task we now use non-linear LSTM units in the RNN, which are better placed for these temporal tasks (Hochreiter & Schmidhuber, 1997).

Table 2: Overview of model performance for sequential MNIST and copy-repeat tasks. Results for the toy and copy-repeat tasks show the average task sequence length solved by the models (see text). Results for the sequential MNIST task show the average test accuracy as a percentage after training. Values denote averages over 5 different initial conditions.
| Task | BPTT n=2 | BPTT n=3 | BPTT n=4 | BPTT n=5 | BPTT+SG n=2 | BPTT+SG n=3 | BPTT+SG n=4 | BPTT+SG n=5 | no BPTT n=1 | BP(0) | BP(0.5) | BP(1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| toy task | 16 | 10 | 20 | 8 | 36 | 38 | 24 | 20 | 4 | 16 | 32 | 90 |
| sequential MNIST | 73.4 | 78.4 | 82.0 | 87.1 | 78.2 | 83.0 | 86.5 | 90.4 | 64.1 | 69.6 | 76.6 | 90.6 |
| copy-repeat | 9.0 | 9.0 | 12.0 | 15.0 | 12.0 | 9.0 | 15.0 | 23.0 | 8.2 | 9.0 | 15.8 | 29.0 |

Figure 4: Performance of BP(λ) in sequential MNIST task. (a) Schematic of task. Rows of an MNIST image are fed sequentially as input and the model must classify the digit at the end. (b) Validation accuracy during training for BP(λ) models. (c) Validation accuracy during training for models which learn synthetic gradients (SG) with n-step truncated BPTT as in the original implementation (Jaderberg et al., 2017); the final performance of BP(1) (as in (b); dotted green) is given for reference. Results show mean performance over 5 different initial conditions with shaded areas representing the standard error of the mean.

Our results show that the use of BP(λ) significantly improves on a standard no-BPTT model, i.e. with $\frac{\partial L_{>t}}{\partial h_t} = 0$ in Eq. 3 (Figs. 4b and 9). Consistent with our results from the toy task, we find that a high λ produces faster rates of learning and higher performance. For example, BP(1) achieves an error less than a third of that achieved by the eligibility-trace-free model, BP(0) (~10% vs ~30%). Moreover, BP(1) generally outperforms the models considered by Jaderberg et al. (2017) which rely on truncated BPTT and the n-step synthetic gradient with n > 1 (Fig. 4c and Table 2).

### 4.3 Copy-repeat task

Finally, we consider a task with more complex and longer temporal dependencies: the copy-repeat task (Graves et al., 2014). In this task the model receives as input a start delimiter followed by an 8-dimensional binary sequence of length N and a repeat character R. The model must then output the sequence R times before finishing with a stop character. The total sequence length is then T = N(R + 1) + 3. We follow the procedure as set out in Jaderberg et al.
(2017) and deem a sequence length solved if the average error is less than 0.15 bits. Once solved, we increment N and R alternately. We again observe that BP(1) provides the best BPTT-free solution in solving large sequence lengths and outperforms the n-step synthetic gradient methods (Fig. 5 and Table 2). We do however note that our implementation of these n-step methods fails to reach the performance thresholds as presented in the original paper (Jaderberg et al., 2017), but that they are consistent with those reported in Bellec et al. (2019). This is perhaps due to the more modest hyperparameter choices (e.g. fewer hidden units) used in our models.

Figure 5: Performance of BP(λ) in copy-repeat task. (a) Maximum sequence length solved for BP(λ) models. A sequence length is considered solved if the model achieves an average of 0.15 bits error for a given length. (b) Maximum sequence length solved for models with n-step synthetic gradient (SG) learning methods (Jaderberg et al., 2017); the best task performance of BP(1) (as in (a); dotted green) is shown for reference. See also Table 2 for more details. Results show mean performance over 5 different initial conditions with shaded areas representing the standard error of the mean.

## 5 Relevance for learning in biological networks

Table 3: Complexity of the BP(λ) algorithm against backpropagation through time (BPTT) and real-time recurrent learning (RTRL) based algorithms for fully connected recurrent neural networks (RNNs). $|h|$ is the hidden size of the RNN; T is the task length; n is the imposed truncation size.
Time complexity is reported per timestep (i.e. the average complexity over the sequence of timesteps) and per "pass", where a pass refers to the period over which the gradient is computed (e.g. over each truncation for truncated BPTT, or over each timestep for RTRL and BP(λ)). |Ψ| denotes the number of parameters in the RNN, whilst |θ| denotes the number of parameters in the synthesiser.

Algorithm                        Memory    Time (per timestep)   Time (per pass)
Full BPTT                        |h|T      |h|^2                 |h|^2 T
Truncated BPTT                   |h|n      |h|^2                 |h|^2 n
Synthetic gradients (original)   |h|n      |h|^2                 |h|^2 n
BP(λ) (ours)                     |h||θ|    |h|^2 |θ|             |h|^2 |θ|
RTRL                             |h||Ψ|    |h|^2 |Ψ|             |h|^2 |Ψ|

The BP(λ) algorithm is of potential interest to neuroscience. Unlike BPTT, which is considered biologically implausible (Lillicrap & Santoro, 2019; Prince et al., 2021), BP(λ) is fully online and avoids the need to store intermediate activations (and complex gradient calculations back in time) across an arbitrary number of timesteps. Moreover, the application of synthetic gradients has relatively cheap computational and memory costs when compared to other online learning algorithms (Marschall et al., 2020). Perhaps most interestingly, BP(λ) employs a combination of retrospective and prospective learning, each of which is thought to take place in the brain (Namboodiri & Stuber, 2021). Specifically, whilst the main network learns via prospective learning signals (i.e. synthetic gradients), the synthesiser itself learns in a retrospective manner by using forward-propagating eligibility traces.

One possible candidate for the expression of synthetic gradients in the nervous system is neuromodulators. For example, dopaminergic neurons are known to encode expectation of future reward or error signals (Hollerman & Schultz, 1998) and have been observed to play an important role in mediating synaptic plasticity (Yagishita et al., 2014; Gerstner et al., 2018).
Furthermore, as required for synthetic gradient vectors, there is increasing evidence for significant heterogeneity in the dopaminergic population in areas such as the ventral tegmental area, both in the variety of its encoded signals and the targeted downstream circuits (Lerner et al., 2015; Beier et al., 2015; Avvisati et al., 2022). Each signal may thus reflect a predicted gradient with respect to the target cortical cell or cell ensemble. More recently, it has also been suggested that a particular subcortical structure, the cerebellum, predicts cortical error gradients via the cortico-cerebellar loop (Pemberton et al., 2021; Boven et al., 2023). The cerebellum would thus act as a synthesiser for the brain, and it is suggested that a bootstrapped learning strategy may be in line with experimental observations (Ohmae & Medina, 2019; Kawato et al., 2021). BP(λ) offers an extra degree of biological plausibility to these studies by removing the need for BPTT altogether. Moreover, the algorithm makes specific predictions regarding the need for eligibility traces (cf. Eq. 16) at key learning sites such as the cerebellar parallel fibres (Kawato et al., 2011).

6 Limitations

Like the originally proposed algorithm for synthetic gradients, BP(λ) makes the assumption that future error gradients can be modelled as a function of the current activity in the task-performing network. The algorithm may therefore suffer when the mapping between activity and future errors becomes less predictable, for example when the task involves stochastic inputs. Indeed, we find that whilst BP(λ) works well for tasks with fairly regular temporal structure, as in the experiments presented, the algorithm can fall short in tasks with little or no temporal correlation between task inputs. For example, we failed to observe meaningful gains with BP(λ) in the ListOps task, which employs randomly generated inputs (Tay et al., 2020).
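The predictability assumption above can be made concrete with a small numerical sketch. This is entirely our own illustration (the linear RNN, the loss, and all helper names are hypothetical, not from the paper): when the dynamics between the probed state and the loss are deterministic, the true future-loss gradient is an exact linear function of the current hidden state and an idealised linear synthesiser can fit it; with stochastic future inputs, a large unpredictable component remains.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, tp, N = 4, 8, 3, 200          # hidden size, task length, probe step, trajectories
W = 0.8 * rng.normal(size=(n, n)) / np.sqrt(n)  # fixed linear RNN weights (hypothetical)

def sample_pairs(noise_scale):
    """Roll h_{t+1} = W h_t + x_t and collect (h_tp, dL/dh_tp) pairs for
    the loss L = 0.5 ||h_T||^2, whose true gradient is (W^{T-tp})^T h_T."""
    M = np.linalg.matrix_power(W, T - tp)
    H, G = [], []
    for _ in range(N):
        h = rng.normal(size=n)
        states = [h]
        for _t in range(T):
            h = W @ h + noise_scale * rng.normal(size=n)
            states.append(h)
        H.append(states[tp])
        G.append(M.T @ states[T])
    return np.array(H), np.array(G)

def linear_fit_residual(H, G):
    """Relative residual of the best linear map (an idealised linear
    synthesiser) from hidden state to future-loss gradient."""
    A, *_ = np.linalg.lstsq(H, G, rcond=None)
    return np.linalg.norm(H @ A - G) / np.linalg.norm(G)

r_det = linear_fit_residual(*sample_pairs(0.0))   # deterministic dynamics: near-exact fit
r_rand = linear_fit_residual(*sample_pairs(1.0))  # stochastic inputs: irreducible error
```

With no input noise the residual is at numerical precision, whereas with unit-scale noise a substantial fraction of the gradient cannot be predicted from the current state, mirroring the behaviour observed on tasks with random inputs.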
On a related note, in variable, non-deterministic settings BP(λ) may be prone to high variance in the synthesiser target gradient compared to the more biased BP(0), resulting in potential instability during learning, as can occur, for example, in Monte Carlo algorithms in RL. However, how the bias-variance tradeoff in RL exactly translates to BP(λ) is not trivial. This is because in RL the biases and variances are computed on a more linear relationship between states and values, whereas in our case gradients have complex dependencies induced by the RNN Jacobian (cf. Eq. 15). Indeed, when we tried to directly analyse the bias and variance of the gradient vector, we were unable to obtain meaningful results. Instead, in our results we simply quantify the goodness-of-fit by the cosine similarity metric and the overall RNN performance. In this case a high λ is consistently the best choice, but whether mid-range values can be beneficial for certain task conditions, as can happen in RL, remains to be tested in future work. Additionally, whilst the complexity of BP(λ) does not depend on time, the algorithm is still expensive, particularly if there are many hidden units in the RNN (Table 3). For example, if there are |h| RNN units and a linear synthesiser is applied then, due to the requirements of storing and updating the synthesiser eligibility traces, the memory and computational complexity of the algorithm is O(|h|^3) and O(|h|^4) respectively (cf. Table 3); this is as costly as the notoriously expensive real-time recurrent learning algorithm (Williams & Zipser, 1989). We postulate that one way to alleviate this issue is to learn not the error gradients with respect to RNN activity directly, but instead some low-dimensional representation of that error gradient. In particular, it has recently been demonstrated that gradient descent often takes place within a small subspace (Gur-Ari et al., 2018).
In principle, therefore, the synthesiser could learn some encoding of the error gradient of dimensionality s ≪ |h|, significantly reducing the complexity of BP(λ). We predict that such low-dimensional representations, which can reduce the noise of high-dimensional error gradients, may also lead to more stable synthesiser learning. Speculating further, it may be that this bottleneck role is played by the thalamus in the communication of predicted error feedback in cortico-cerebellar loops (see previous section; Pemberton et al. (2021)). Finally, we highlight that BP(λ) requires the dynamics of the main model to be represented by sequentially dependent states; i.e. the Jacobian $\frac{\partial h_{t+1}}{\partial h_t} \neq 0$, as is inherently satisfied by RNNs. We note, however, that BP(λ) may in principle be applied to feedforward networks, where activities are now defined across layers instead of timesteps, similar to what was done by Jaderberg et al. (2017). According to our formulation of BP(λ), however, the same synthesiser parameters should then operate across layers; future work should explore how best to relax this constraint.

7 Conclusion

BPTT can be expensive and enforce long waiting times before gradients become available. Synthetic gradients remove these locking constraints imposed by BPTT as well as the associated computational and memory complexity with respect to the task length (Jaderberg et al., 2017). However, the bootstrapped n-step algorithm for learning synthetic gradients as proposed by Jaderberg et al. can lead to biased estimates and also maintains some dependence on (truncated) BPTT being performed. Inspired by the TD(λ) algorithm in RL, we propose a novel algorithm for learning synthetic gradients: BP(λ). This algorithm applies forward-propagating eligibility traces in order to reduce the bias of its estimates and is fully online. We thus extend the work of Jaderberg et al.
in developing a computational bridge between estimating expected return in the RL paradigm and future error gradients in supervised learning. Through a combination of analytical and empirical work we show that BP(λ) outperforms the original implementation for learning synthetic gradients. Moreover, our model offers an efficient online solution for temporal supervised learning that is of relevance for both artificial and biological networks.

Acknowledgments

We would like to thank the Neural & Machine Learning group and the Joao Sacramento group for useful feedback. J.P. was funded by an EPSRC Doctoral Training Partnership award (EP/R513179/1) and R.P.C. by the Medical Research Council (MR/X006107/1), BBSRC (BB/X013340/1) and an ERC-UKRI Frontier Research Guarantee Grant (EP/Y027841/1). This work made use of the HPC system Blue Pebble at the University of Bristol, UK. We would like to thank Dr Stewart for a donation that supported the purchase of GPU nodes embedded in the Blue Pebble HPC system.

References

Riccardo Avvisati, Anna-Kristin Kaufmann, Callum J Young, Gabriella E Portlock, Sophie Cancemi, Rui Ponte Costa, Peter J Magill, and Paul D Dodson. Distributional coding of associative learning within projection-defined populations of midbrain dopamine neurons. bioRxiv, 2022.

Kevin T Beier, Elizabeth E Steinberg, Katherine E DeLoach, Stanley Xie, Kazunari Miyamichi, Lindsay Schwarz, Xiaojing J Gao, Eric J Kremer, Robert C Malenka, and Liqun Luo. Circuit architecture of VTA dopamine neurons revealed by systematic input-output mapping. Cell, 162(3):622–634, 2015.

Guillaume Bellec, Franz Scherr, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets. arXiv preprint arXiv:1901.09049, 2019.

Ellen Boven, Joseph Pemberton, Paul Chadderton, Richard Apps, and Rui Ponte Costa. Cerebro-cerebellar networks facilitate learning through feedback decoupling.
Nature Communications, 14(1):1–18, 2023.

Wojciech Marian Czarnecki, Grzegorz Swirszcz, Max Jaderberg, Simon Osindero, Oriol Vinyals, and Koray Kavukcuoglu. Understanding synthetic gradients and decoupled neural interfaces. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 904–912. JMLR.org, 2017.

Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea. Eligibility traces and plasticity on behavioral time scales: experimental support of neoHebbian three-factor learning rules. Frontiers in Neural Circuits, 12:53, 2018.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.

Jeffrey R Hollerman and Wolfram Schultz. Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1(4):304–309, 1998.

Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1627–1635. JMLR.org, 2017.

Mitsuo Kawato, Shinya Kuroda, and Nicolas Schweighofer. Cerebellar supervised learning revisited: biophysical modeling and degrees-of-freedom control. Current Opinion in Neurobiology, 21(5):791–800, 2011.

Mitsuo Kawato, Shogo Ohmae, Huu Hoang, and Terry Sanger.
50 years since the Marr, Ito, and Albus models of the cerebellum. Neuroscience, 462:151–174, 2021.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

Talia N Lerner, Carrie Shilyansky, Thomas J Davidson, Kathryn E Evans, Kevin T Beier, Kelly A Zalocusky, Ailey K Crow, Robert C Malenka, Liqun Luo, Raju Tomer, et al. Intact-brain analyses reveal distinct information carried by SNc dopamine subcircuits. Cell, 162(3):635–647, 2015.

Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current Opinion in Neurobiology, 55:82–89, 2019.

Owen Marschall, Kyunghyun Cho, and Cristina Savin. Evaluating biological plausibility of learning algorithms the lazy way. In Real Neurons & Hidden Units: Future directions at the intersection of neuroscience and artificial intelligence @ NeurIPS 2019, 2019. URL https://openreview.net/forum?id=HJgPEXtIUS.

Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms for training recurrent neural networks. Journal of Machine Learning Research, 21(135):1–34, 2020.

Vijay Mohan K Namboodiri and Garret D Stuber. The learning of prospective and retrospective cognitive maps within neural circuits. Neuron, 109(22):3552–3575, 2021.

Shogo Ohmae and Javier F Medina. Plasticity of ponto-cerebellar circuits generates a prospective error signal in climbing fiber. Program No. 579.01. Neuroscience 2019 Abstracts. Chicago, IL: Society for Neuroscience, 2019. Online.

Joseph Pemberton, Ellen Boven, Richard Apps, and Rui Ponte Costa. Cortico-cerebellar networks as decoupling neural interfaces. Advances in Neural Information Processing Systems, 34, 2021.
Luke Y Prince, Roy Henha Eyono, Ellen Boven, Arna Ghosh, Joe Pemberton, Franz Scherr, Claudia Clopath, Rui Ponte Costa, Wolfgang Maass, Blake A Richards, et al. Current state and future directions for learning in biological recurrent neural networks: A perspective piece. arXiv preprint arXiv:2105.05382, 2021.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020.

Harm Van Seijen, A Rupam Mahmood, Patrick M Pilarski, Marlos C Machado, and Richard S Sutton. True online temporal-difference learning. The Journal of Machine Learning Research, 17(1):5057–5096, 2016.

Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

Sho Yagishita, Akiko Hayashi-Takagi, Graham C R Ellis-Davies, Hidetoshi Urakubo, Shin Ishii, and Haruo Kasai. A critical time window for dopamine actions on the structural plasticity of dendritic spines. Science, 345(6204):1616–1620, 2014.

A Proof of Theorem 3.1

As this section is more involved, for clarity we denote scalar variables in italics and vectors in bold.

Approach: We demonstrate that Theorem 3.1 holds for an arbitrary synthesiser parameter $\theta^{ji}$ (at the $i$th column and $j$th row of $\theta$); that is,

$$\frac{\left\lVert \theta^{ji,\mathrm{BP}}_t - \theta^{ji,\lambda}_t \right\rVert_2}{\left\lVert \theta^{ji,\mathrm{BP}}_t - \theta^{ji}_0 \right\rVert_2} \to 0 \quad \text{as } \alpha \to 0.$$

This is proved in Proposition A.7. For reference, we explicitly write the weight updates for parameter $\theta^{ji}$ according to the accumulate BP(λ) algorithm:

$$e^{ji}_t = \gamma\lambda \frac{\partial h_t}{\partial h_{t-1}} e^{ji}_{t-1} + \frac{\partial g(h_t; \theta_t)}{\partial \theta^{ji}_t}, \tag{18}$$
$$\delta'_t = \frac{\partial L_{t+1}}{\partial h_t} + \gamma\, g(h_{t+1}; \theta_t) \frac{\partial h_{t+1}}{\partial h_t} - g(h_t; \theta_t), \tag{19}$$
$$\theta^{ji}_{t+1} = \theta^{ji}_t + \alpha\, \delta'_t e^{ji}_t. \tag{20}$$
We also define the helper variable $\delta_{a,t}$ as

$$\delta_{a,t} = \frac{\partial L_{t+1}}{\partial h_a} + \gamma\, g(h_{t+1}; \theta_t) \frac{\partial h_{t+1}}{\partial h_a} - g(h_t; \theta_t) \frac{\partial h_t}{\partial h_a} = \delta'_t \frac{\partial h_t}{\partial h_a}. \tag{21}$$

In general, we follow the structure of the analogous proof for the accumulate TD(λ) algorithm (Van Seijen et al., 2016).

A.1 Statements and proofs

Lemma A.1. $G^{\lambda|t+1}_a = G^{\lambda|t}_a + (\lambda\gamma)^{t-a} \delta'_{a,t}$, where
$$\delta'_{a,t} = \frac{\partial L_{t+1}}{\partial h_a} + \gamma\, g(h_{t+1}; \theta_t) \frac{\partial h_{t+1}}{\partial h_a} - g(h_t; \theta_{t-1}) \frac{\partial h_t}{\partial h_a}.$$

Proof:
$$\begin{aligned}
G^{\lambda|t+1}_a - G^{\lambda|t}_a &= (1-\lambda)\sum_{n=1}^{t-a}\lambda^{n-1}G^{(n)}_a + \lambda^{t-a}G^{(t+1-a)}_a && \text{Definition: Equation 12} \\
&\quad - (1-\lambda)\sum_{n=1}^{t-a-1}\lambda^{n-1}G^{(n)}_a - \lambda^{t-a-1}G^{(t-a)}_a \\
&= (1-\lambda)\lambda^{t-a-1}G^{(t-a)}_a + \lambda^{t-a}G^{(t+1-a)}_a - \lambda^{t-a-1}G^{(t-a)}_a \\
&= \lambda^{t-a}\left(G^{(t+1-a)}_a - G^{(t-a)}_a\right) \\
&= \lambda^{t-a}\Biggl(\sum_{k=1}^{t+1-a}\gamma^{k-1}\frac{\partial L_{a+k}}{\partial h_a} + \gamma^{t+1-a}\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_a} - \sum_{k=1}^{t-a}\gamma^{k-1}\frac{\partial L_{a+k}}{\partial h_a} - \gamma^{t-a}\, g(h_t;\theta_{t-1})\frac{\partial h_t}{\partial h_a}\Biggr) \\
&= \lambda^{t-a}\left(\gamma^{t-a}\frac{\partial L_{t+1}}{\partial h_a} + \gamma^{t+1-a}\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_a} - \gamma^{t-a}\, g(h_t;\theta_{t-1})\frac{\partial h_t}{\partial h_a}\right) \\
&= (\lambda\gamma)^{t-a}\underbrace{\left(\frac{\partial L_{t+1}}{\partial h_a} + \gamma\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_a} - g(h_t;\theta_{t-1})\frac{\partial h_t}{\partial h_a}\right)}_{\delta'_{a,t}}
\end{aligned}$$

Lemma A.2. $G^{\lambda|t}_a = G^{\lambda|a+1}_a + \sum_{b=a+1}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b}$

Proof: We apply Lemma A.1 recursively:
$$\begin{aligned}
G^{\lambda|t}_a &= G^{\lambda|t-1}_a + (\lambda\gamma)^{t-1-a}\delta'_{a,t-1} \\
&= G^{\lambda|t-2}_a + (\lambda\gamma)^{t-2-a}\delta'_{a,t-2} + (\lambda\gamma)^{t-1-a}\delta'_{a,t-1} \\
&\;\;\vdots \\
&= G^{\lambda|a+1}_a + (\lambda\gamma)^{1}\delta'_{a,a+1} + \dots + (\lambda\gamma)^{t-1-a}\delta'_{a,t-1} \\
&= G^{\lambda|a+1}_a + \sum_{b=a+1}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b}
\end{aligned}$$

Lemma A.3. $G^{\lambda|a+1}_a = \delta'_{a,a} + g(h_a; \theta_{a-1})$

Proof:
$$\begin{aligned}
G^{\lambda|a+1}_a &= G^{(1)}_a = \frac{\partial L_{a+1}}{\partial h_a} + \gamma\, g(h_{a+1};\theta_a)\frac{\partial h_{a+1}}{\partial h_a} \\
&= \frac{\partial L_{a+1}}{\partial h_a} + \gamma\, g(h_{a+1};\theta_a)\frac{\partial h_{a+1}}{\partial h_a} - g(h_a;\theta_{a-1})\frac{\partial h_a}{\partial h_a} + g(h_a;\theta_{a-1})\frac{\partial h_a}{\partial h_a} \\
&= \delta'_{a,a} + g(h_a;\theta_{a-1})
\end{aligned}$$

Lemma A.4. $\sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b} = G^{\lambda|t}_a - g(h_a;\theta_{a-1})$

Proof:
$$\begin{aligned}
G^{\lambda|t}_a &= G^{\lambda|a+1}_a + \sum_{b=a+1}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b} && \text{Lemma A.2} \\
&= \delta'_{a,a} + g(h_a;\theta_{a-1}) + \sum_{b=a+1}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b} && \text{Lemma A.3} \\
&= g(h_a;\theta_{a-1}) + \sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b}
\end{aligned}$$
Rearranging gives the result.

Lemma A.5. $e^{ji}_b = \sum_{a=0}^{b}(\gamma\lambda)^{b-a}\frac{\partial h_b}{\partial h_a}\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a}$

Proof: Unrolling the recursive definition of the eligibility trace (Equation 18):
$$\begin{aligned}
e^{ji}_b &= \gamma\lambda\frac{\partial h_b}{\partial h_{b-1}}e^{ji}_{b-1} + \frac{\partial g(h_b;\theta_b)}{\partial \theta^{ji}_b} \\
&= (\gamma\lambda)^2\frac{\partial h_b}{\partial h_{b-1}}\frac{\partial h_{b-1}}{\partial h_{b-2}}e^{ji}_{b-2} + \gamma\lambda\frac{\partial h_b}{\partial h_{b-1}}\frac{\partial g(h_{b-1};\theta_{b-1})}{\partial \theta^{ji}_{b-1}} + \frac{\partial g(h_b;\theta_b)}{\partial \theta^{ji}_b} \\
&\;\;\vdots \\
&= (\gamma\lambda)^{b}\frac{\partial h_b}{\partial h_0}\frac{\partial g(h_0;\theta_0)}{\partial \theta^{ji}_0} + \dots + \gamma\lambda\frac{\partial h_b}{\partial h_{b-1}}\frac{\partial g(h_{b-1};\theta_{b-1})}{\partial \theta^{ji}_{b-1}} + \frac{\partial g(h_b;\theta_b)}{\partial \theta^{ji}_b} \\
&= \sum_{a=0}^{b}(\gamma\lambda)^{b-a}\frac{\partial h_b}{\partial h_a}\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a}
\end{aligned}$$

Lemma A.6.
$$\sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta_{a,b} = G^{\lambda|t}_a - g(h_a;\theta_{a-1}) + O(\alpha)$$

Proof: Using $\delta_{a,b} = \delta'_{a,b} - g(h_b;\theta_b)\frac{\partial h_b}{\partial h_a} + g(h_b;\theta_{b-1})\frac{\partial h_b}{\partial h_a}$, we have
$$\begin{aligned}
\sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta_{a,b} &= \sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\left(\delta'_{a,b} - g(h_b;\theta_b)\frac{\partial h_b}{\partial h_a} + g(h_b;\theta_{b-1})\frac{\partial h_b}{\partial h_a}\right) \\
&= \sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b} - \sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\bigl(g(h_b;\theta_b) - g(h_b;\theta_{b-1})\bigr)\frac{\partial h_b}{\partial h_a} \\
&= \sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta'_{a,b} + O(\alpha) \\
&= G^{\lambda|t}_a - g(h_a;\theta_{a-1}) + O(\alpha) && \text{Lemma A.4}
\end{aligned}$$
where the third line follows since consecutive synthesiser parameters differ by an update of size $O(\alpha)$.

Proposition A.7. Let $\theta_0$ be the initial weight vector and let $\theta^{ji}$ be an arbitrary parameter of $\theta$. Let $\theta^{ji}_t$ be the weight at time $t$ computed by accumulate BP(λ), and $\theta^{t,ji}_t$ be the weight at time $t$ computed by the online λ-SG algorithm. Furthermore, assume that $\sum_{a=0}^{t-1}\Delta^{t,ji}_a \neq 0$, where $\Delta^{t,ji}_a := \bigl(G^{\lambda|t}_{a,0} - g(h_a;\theta_0)\bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a}$ and $G^{\lambda|t}_{a,0}$ is the λ-weighted synthetic gradient which uses $\theta_0$ for all synthetic gradient estimations. Then, for all time steps $t$:

$$\frac{\left\lVert \theta^{ji}_t - \theta^{t,ji}_t \right\rVert_2}{\left\lVert \theta^{ji}_t - \theta^{ji}_0 \right\rVert_2} \to 0 \quad \text{as } \alpha \to 0.$$

Proof: The updates according to accumulate BP(λ) (Equations 18 to 20) follow
$$\begin{aligned}
\theta^{ji}_t &= \theta^{ji}_0 + \alpha\sum_{b=0}^{t-1}\delta'_b e^{ji}_b \\
&= \theta^{ji}_0 + \alpha\sum_{b=0}^{t-1}\delta'_b\sum_{a=0}^{b}(\gamma\lambda)^{b-a}\frac{\partial h_b}{\partial h_a}\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a} && \text{Lemma A.5} \\
&= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta'_b\frac{\partial h_b}{\partial h_a}\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a} \\
&= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\left(\sum_{b=a}^{t-1}(\gamma\lambda)^{b-a}\delta_{a,b}\right)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a} && \text{Definition: Equation 21} \\
&= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\Bigl(G^{\lambda|t}_a - g(h_a;\theta_{a-1}) + O(\alpha)\Bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a} && \text{Lemma A.6} \\
&= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\Bigl(G^{\lambda|t}_{a,0} - g(h_a;\theta_0) + O(\alpha)\Bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a},
\end{aligned}$$
where the last line holds as α → 0. On the other hand, the updates according to the online λ-SG algorithm produce
$$\begin{aligned}
\theta^{t,ji}_t &= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\Bigl(G^{\lambda|t}_a - g(h_a;\theta_a)\Bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a} && \text{Definition: Equation 13} \\
&= \theta^{ji}_0 + \alpha\sum_{a=0}^{t-1}\Bigl(G^{\lambda|t}_{a,0} - g(h_a;\theta_0) + O(\alpha)\Bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a}.
\end{aligned}$$
Hence
$$\frac{\lVert\theta^{ji}_t - \theta^{t,ji}_t\rVert_2}{\lVert\theta^{ji}_t - \theta^{ji}_0\rVert_2} = \frac{\lVert\theta^{ji}_t - \theta^{t,ji}_t\rVert_2/\alpha}{\lVert\theta^{ji}_t - \theta^{ji}_0\rVert_2/\alpha} = \frac{O(\alpha)}{C + O(\alpha)},$$
where $C = \bigl\lVert\sum_{a=0}^{t-1}\bigl(G^{\lambda|t}_{a,0} - g(h_a;\theta_0)\bigr)\frac{\partial g(h_a;\theta_a)}{\partial \theta^{ji}_a}\bigr\rVert_2 = \bigl\lVert\sum_{a=0}^{t-1}\Delta^{t,ji}_a\bigr\rVert_2$. From the condition that $\sum_{a=0}^{t-1}\Delta^{t,ji}_a \neq 0$ we have $C > 0$. Therefore
$$\frac{\lVert\theta^{ji}_t - \theta^{t,ji}_t\rVert_2}{\lVert\theta^{ji}_t - \theta^{ji}_0\rVert_2} \to 0 \quad \text{as } \alpha \to 0.$$

B Recursive definition of $G^\lambda_t$

In Proposition B.2 we provide a recursive definition of the λ-weighted synthetic gradient $G^\lambda_t$ as defined in Equation 11.
Note that this is similar to, but not the same as, the recursive definition given in A.3 of the supplementary material of Jaderberg et al. (2017). In particular, this definition strictly considers future losses, as in the context of a synthesiser target, whilst that provided in Jaderberg et al. (2017) also considers the loss at the current timestep ($L_t$). We first require the following lemma.

Lemma B.1. For any sequence $\{x_n\}_{n\geq 1}$,
$$\sum_{n=1}^{\infty}\lambda^{n-1}x_n = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}x_k. \tag{23}$$

Proof: By the sum rule, exchanging the order of summation,
$$(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}x_k = (1-\lambda)\sum_{k=1}^{\infty}x_k\sum_{n=k}^{\infty}\lambda^{n-1} = (1-\lambda)\sum_{k=1}^{\infty}x_k\frac{\lambda^{k-1}}{1-\lambda} = \sum_{k=1}^{\infty}\lambda^{k-1}x_k.$$

Proposition B.2. We can define the λ-weighted synthetic gradient (Eq. 11) incrementally. In particular,
$$G^{\lambda}_t = \frac{\partial L_{t+1}}{\partial h_t} + \gamma\lambda\, G^{\lambda}_{t+1}\frac{\partial h_{t+1}}{\partial h_t} + \gamma(1-\lambda)\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_t}.$$

Proof: Let $T$ be the sequence length and define $G^{(\tau)}_t := 0$ for $\tau > T$. Using the definition $G^{(n)}_t = \sum_{k=1}^{n}\gamma^{k-1}\frac{\partial L_{t+k}}{\partial h_t} + \gamma^{n}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_t}$ together with Lemma B.1,
$$\begin{aligned}
G^{\lambda}_t &= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G^{(n)}_t \\
&= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\sum_{k=1}^{n}\gamma^{k-1}\frac{\partial L_{t+k}}{\partial h_t} + (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\gamma^{n}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_t} \\
&= \sum_{n=1}^{\infty}(\lambda\gamma)^{n-1}\frac{\partial L_{t+n}}{\partial h_t} + (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\gamma^{n}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_t} && \text{Lemma B.1} \\
&= \frac{\partial L_{t+1}}{\partial h_t} + \gamma(1-\lambda)\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_t} \\
&\quad + \sum_{n=2}^{\infty}(\lambda\gamma)^{n-1}\frac{\partial L_{t+n}}{\partial h_t} + (1-\lambda)\sum_{n=2}^{\infty}\lambda^{n-1}\gamma^{n}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_t}.
\end{aligned}$$
Applying the same expansion at time $t+1$ (and reindexing $n \to n-1$) gives
$$G^{\lambda}_{t+1} = \sum_{n=2}^{\infty}(\lambda\gamma)^{n-2}\frac{\partial L_{t+n}}{\partial h_{t+1}} + (1-\lambda)\sum_{n=2}^{\infty}\lambda^{n-2}\gamma^{n-1}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_{t+1}},$$
so that, by the chain rule $\frac{\partial\,\cdot}{\partial h_{t+1}}\frac{\partial h_{t+1}}{\partial h_t} = \frac{\partial\,\cdot}{\partial h_t}$,
$$\gamma\lambda\, G^{\lambda}_{t+1}\frac{\partial h_{t+1}}{\partial h_t} = \sum_{n=2}^{\infty}(\lambda\gamma)^{n-1}\frac{\partial L_{t+n}}{\partial h_t} + (1-\lambda)\sum_{n=2}^{\infty}\lambda^{n-1}\gamma^{n}\, g(h_{t+n};\theta_{t+n-1})\frac{\partial h_{t+n}}{\partial h_t}.$$
Substituting into the expression for $G^{\lambda}_t$ yields
$$G^{\lambda}_t = \frac{\partial L_{t+1}}{\partial h_t} + \gamma\lambda\, G^{\lambda}_{t+1}\frac{\partial h_{t+1}}{\partial h_t} + \gamma(1-\lambda)\, g(h_{t+1};\theta_t)\frac{\partial h_{t+1}}{\partial h_t}.$$

C Experimental Details

For the experiments which analyse the alignment of synthetic gradients to true gradients (Figures 2 and 6), we use an RNN with a linear activation function.
For the experiments for which we train the RNN using synthetic gradients (Figures 4 and 5), we use LSTM units in our RNN architecture (Hochreiter & Schmidhuber, 1997). The model output for a given task is a (trained) linear readout of the RNN (output) state. The synthesiser network performs a linear operation (with a bias term) on the RNN hidden state (concatenation of cell state and output state for LSTM) to produce an estimate of its respective loss gradient. The synthetic gradient at the final timestep is defined as zero. As in Jaderberg et al. (2017), the synthesiser parameters θ are initialised at zero. When truncated BPTT with truncation size n is used for a task sequence length T, the sequence is divided such that the first truncation is of size R, where R = T mod n, and each following truncation is of size n. For (accumulate) BP(λ) we often find it empirically important to use a discount factor γ < 1. Except in the experiments using fixed RNNs (Fig. 2), for which γ = 1, we set γ = 0.9 for all conditions. We often find it necessary to scale down the synthetic gradient before it is received by the RNN. We find this particularly necessary when using synthetic gradients alongside truncated BPTT, and in this case generally find a scaling factor of 0.1 optimal, as in Jaderberg et al. (2017). We find that BP(λ) is less sensitive to this condition and choose scaling factors depending on the task (see below). Data is provided to the model in batches. Though it is in principle possible to update the model as soon as the synthetic gradient is provided (e.g. every timestep), we find that this can lead to unstable learning, particularly when explicit supervised signals are sparse (as in the sequential-MNIST task). For this reason we instead accumulate gradients over timesteps and update the model at the end of the batch sequence.
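The truncation split described above (first truncation of size R = T mod n, then blocks of size n) can be sketched as follows; the helper name is our own and not from the paper's codebase:

```python
def truncation_spans(T, n):
    """Split a length-T sequence for truncated BPTT: the first truncation
    has size R = T mod n (omitted when R == 0) and all others have size n."""
    R = T % n
    spans = [(0, R)] if R else []
    start = R
    while start < T:
        spans.append((start, start + n))
        start += n
    return spans
```

For example, `truncation_spans(10, 4)` yields `[(0, 2), (2, 6), (6, 10)]`: a first truncation of size 2 followed by two truncations of size 4.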
We use the Adam optimiser for gradient descent on the model parameters (Kingma & Ba, 2014). All experiments are run using the PyTorch library. Code used for the experiments can be found on the GitHub page: https://github.com/neuralml/bp_lambda. The toy task experiments used to analyse gradient alignment were conducted with an Intel i7-8665U CPU, where each run with a particular seed took approximately one minute. The sequential MNIST and copy-repeat tasks were conducted on NVIDIA GeForce RTX 2080 Ti GPUs. Each run in the sequential MNIST task took approximately 3 hours or less (depending on the model used); each run in the copy-repeat task took approximately 12 hours or less.

C.1 Toy task used to analyse gradient alignment

To analyse gradient alignment for fixed RNNs in the toy task we provide the model with an input at timestep 1, and the model is trained to produce a target 2-d coordinate on the unit circle at the final timestep T, where T = 10. The input is a (fixed) randomly generated binary vector of dimension 10. The task error is defined as the mean-squared error between the model output and target coordinate. We also consider plastic RNNs which are themselves updated at the same time as the synthesiser. To ensure that the RNN has to discover the temporal association between input and target, and not simply read out a fixed value, in this case we consider 3 input/target pairs, where the targets are spread equidistantly on the unit circle. For this case we are interested in the limits of temporal association that can be captured, and consider increasing sequence lengths T; specifically, we consider T as multiples of 10 up to 100. We deem a sequence length solved by a model with a certain initialisation (i.e. a given seed) if the model is able to achieve an error of less than 0.025 averaged over the last 20 of 250 training epochs for that sequence length and all smaller sequence lengths.
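The 3 equidistant unit-circle targets described above can be generated as in the following sketch (the function name and starting phase are our own choices, not taken from the paper):

```python
import numpy as np

def toy_targets(k=3):
    """k 2-d targets spread equidistantly on the unit circle (phase 0 assumed)."""
    angles = 2.0 * np.pi * np.arange(k) / k
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

targets = toy_targets(3)  # rows: (1, 0), (-1/2, sqrt(3)/2), (-1/2, -sqrt(3)/2)
```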
To improve readability, for this task the training curves are smoothed using a Savitzky–Golay filter (with a filter window of length 25 and polynomial of order 3). For this task we provide data to the model in batches of size 10, where 1 epoch involves 100 batches. The number of RNN units is 30, and the initial learning rate for the synthesiser is set as 1 × 10^-4 and 1 × 10^-3 for the fixed and plastic RNN cases, respectively. Finally, to elucidate whether, like the TD(λ) algorithm (see e.g. Figure 2 in Van Seijen et al. (2016)), BP(λ) can only operate for modest learning rates, we test the model with different step sizes α and compare the resulting performance against the online λ-SG algorithm (Figure 3). Here we consider a plastic RNN trained using the respective synthetic gradients with the task sequence length T = 10. Note that in this case synthesiser weights are updated at each timestep (Eq. 14) as opposed to only at the end of the sequence once all gradients are accumulated. For these experiments standard stochastic gradient descent is used.

C.2 Sequential-MNIST task

For the sequential-MNIST task we present a given MNIST image to the model row by row (Deng, 2012; Le et al., 2015). That is, the task is divided into 28 timesteps, where at the ith timestep the ith row of 28 pixels is provided to the model. At the end of the presentation, the model must classify which digit the input represents. The task loss is defined as the cross entropy between the model output and the target digit. For this task we provide data to the model in batches of size 50 with an initial learning rate of 3 × 10^-4. The number of hidden LSTM units is 30. For BP(λ) the synthetic gradient is scaled by a factor of 0.1. During training, the models with the lowest validation score over 50 epochs are selected to produce the final test error.
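The row-by-row presentation and final-timestep cross-entropy loss described above can be sketched as follows (helper names are ours; in the actual experiments an LSTM consumes the rows and a linear readout produces the logits):

```python
import numpy as np

def row_sequence(image):
    """Split a 28x28 image into 28 timesteps, one 28-pixel row per step."""
    assert image.shape == (28, 28)
    return [image[t] for t in range(28)]

def final_step_loss(logits, target):
    """Cross entropy between the readout at the last timestep and the target digit."""
    z = logits - logits.max()                  # stabilised softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

seq = row_sequence(np.zeros((28, 28), dtype=np.float32))
loss_uniform = final_step_loss(np.zeros(10), target=3)  # uniform logits -> ln(10)
```

Because the loss is only defined at the final timestep, every earlier update in a no-BPTT model must rely on the synthetic gradient alone, which is why sparse supervision makes per-timestep updates unstable.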
To verify that BP(λ) can operate on non-trivial tasks with simple learning optimizers, we also perform weight updates for this task with stochastic gradient descent with momentum term ζ = 0.9 (Fig. 9). In this case the learning rate for each network is fixed at 0.01.

C.3 Copy-repeat task

For the copy-repeat task (Graves et al., 2014) the model receives a delimiter before an 8-dimensional binary pattern of length N and then a repeat character R. The model must then repeat the binary pattern R times, followed by a stop character. The total sequence length is therefore N(R + 1) + 3. For easier absorption, R is normalised by 10 when consumed by the model. We follow the curriculum in Jaderberg et al. (2017) and alternately increment N and R when a batch average of less than 0.15 bits is achieved. For this task we provide data to the model in batches of size 100. For the RNN and readout parameters we use an initial learning rate of 1 × 10^-3, whilst we find a smaller learning rate of 1 × 10^-5 for the synthesiser parameters necessary for stable learning. The number of hidden LSTM units is 100. For BP(λ) we do not apply scaling to the synthetic gradient.

D Supplementary Figures and Tables

Figure 6: Learning synthetic gradients with the n-step synthetic gradient (Eq. 9) as in Jaderberg et al. (2017) in the toy task. Inputs are provided at timestep 1 and the corresponding target is only available at the end of the task at time T = 10.
(a) Alignment between synthetic gradients and true gradients for a fixed RNN model across different timesteps within the task, where the synthetic gradients are learned with BPTT truncation size n = 2 (left) and n = 3 (right). Alignment is defined using the cosine similarity metric. (b) The average alignment over the last 10% of epochs in (a) across all timesteps. (c) Alignment of synthetic gradients at the first timestep of the task for different n; BP(1) (dotted green) is shown for reference. Results show average (with SEM) over 5 different initial conditions.

Figure 7: RNN learning using n-step synthetic gradients in the toy task. (a) (Left) Learning curves of RNNs which are updated using n-step truncated BPTT over different sequence lengths T. (Right) Task error at end of training for increasing T. (b) Same as (a) but n-step synthetic gradients are also applied. (c) Task error for RNNs updated using BP(λ) at end of training for increasing T. Results show average (with SEM) over 5 different initial conditions.

Figure 8: Effect of step-size α on performance in toy task for (top) BP(λ) and (bottom) the online λ-SG algorithm. Task errors are averaged over the last 10 epochs of training (out of 100). Results show average (with SEM) over 5 different initial conditions.
Figure 9: BP(λ) in sequential MNIST task with stochastic gradient descent. (a) Validation accuracy during training for BP(λ) models. (b) Validation accuracy during training for models which learn synthetic gradients (SG) with n-step truncated BPTT as in the original implementation (Jaderberg et al., 2017); final performance of BP(1) (as in (a); dotted green) is given for reference. Results show average (with SEM) over 5 different initial conditions.