Published as a conference paper at ICLR 2023

LEARNING IN TEMPORALLY STRUCTURED ENVIRONMENTS

Matt Jones,1,2 Tyler R. Scott,1 Mengye Ren,1,3 Gamaleldin F. Elsayed,1 Katherine Hermann,1 David Mayo,1,4 Michael C. Mozer1
1Brain Team, Google Research  2University of Colorado  3NYU  4MIT
mcj@colorado.edu  dmayo2@mit.edu  mengye@cs.nyu.edu
{tylersco,gamaleldin,hermannk,mcmozer}@google.com

ABSTRACT

Natural environments have temporal structure at multiple timescales. This property is reflected in biological learning and memory but typically not in machine learning systems. We advance a multiscale learning method in which each weight in a neural network is decomposed as a sum of subweights with different learning and decay rates. Thus knowledge becomes distributed across different timescales, enabling rapid adaptation to task changes while avoiding catastrophic interference. First, we prove that previous models that learn at multiple timescales, but with complex coupling between timescales, are equivalent to multiscale learning via a reparameterization that eliminates this coupling. The same analysis yields a new characterization of momentum learning, as a fast weight with a negative learning rate. Second, we derive a model of Bayesian inference over 1/f noise, a common temporal pattern in many online learning domains that involves long-range (power-law) autocorrelations. The generative side of the model expresses 1/f noise as a sum of diffusion processes at different timescales, and the inferential side tracks these latent processes using a Kalman filter. We then derive a variational approximation to the Bayesian model and show how it is an extension of the multiscale learner. The result is an optimizer that can be used as a drop-in replacement in an arbitrary neural network architecture. Third, we evaluate the ability of these methods to handle nonstationarity by testing them in online prediction tasks characterized by 1/f noise in the latent parameters. We find that the Bayesian model significantly outperforms online stochastic gradient descent and two batch heuristics that rely preferentially or exclusively on more recent data. Moreover, the variational approximation performs nearly as well as the full Bayesian model, with memory requirements that are linear in the size of the network.

1 INTRODUCTION

Many online tasks facing both biological and artificial intelligence systems involve changes in data distribution over time. Natural environments exhibit correlations at a wide range of timescales, a pattern variously referred to as self-similarity, power-law correlations, and 1/f noise (Keshner, 1982). This is in stark contrast with the iid environments assumed by many machine learning (ML) methods, and with diffusion or random-walk environments that exhibit only short-range correlations. Moreover, biological learning systems are well tuned to the temporal statistics of natural environments, as seen in phenomena of human cognition including power laws in learning (Anderson, 1982), power-law forgetting (Wixted & Ebbesen, 1997), long-range sequential effects (Wilder et al., 2013), and spacing effects (Anderson & Schooler, 1991; Cepeda et al., 2008). An important goal is to incorporate similar inductive biases into ML systems for online or continual learning. This paper analyzes a framework for learning in temporally structured environments, multiscale learning, which for neural networks (NNs) can be implemented as a new kind of optimizer.
A common explanation for self-similar temporal structure in nature is that it arises from a mixture of events at various timescales. Indeed, many generative models of 1/f noise involve summing independent stochastic processes with varying time constants (Eliazar & Klafter, 2009). Accordingly, the multiscale optimizer comprises multiple learning processes operating in parallel at different timescales. In a NN, every weight $w_j$ is replaced by a family of subweights $\omega_{ij}$, each with its own learning rate and decay rate, that sum to determine the weight as a whole. Learning at multiple timescales is a key idea in several theories in neuroscience, spanning conditioning (Staddon et al., 2002), learning (Benna & Fusi, 2016), memory (Howard & Kahana, 2002; Mozer et al., 2009), and motor control (Kording et al., 2007), and it has also been exploited in ML (Hinton & Plaut, 1987; Rusch et al., 2022). The multiscale learner isolates and simplifies this idea by assuming that knowledge at different timescales evolves independently and that credit assignment follows gradient descent.

The first part of this paper (Sections 2 and 3) proves that three other models are formally equivalent to instances of the multiscale optimizer: a new variant of fast weights (cf. Ba et al., 2016; Hinton & Plaut, 1987), the model synapse of Benna & Fusi (2016), and momentum learning (Rumelhart et al., 1986; Qian, 1999). The insight behind these proofs is that each of these models can be written in terms of a linear update rule with a diagonalizable transition matrix. Thus the eigenvectors of this matrix correspond to states that evolve independently. By writing the state of the model as a mixture of eigenvectors, we effect a coordinate transformation that exactly yields the multiscale optimizer. These results imply that the complicated coupling among timescales assumed by some models can be superfluous. They also provide a new perspective on momentum learning, with implications for how and when it is beneficial and how it interacts with nonstationarity in the task environment.

In Section 4, we provide a normative grounding for multiscale learning in terms of Bayesian inference over 1/f noise. Our starting point is a generative model of 1/f noise as a sum of diffusion processes at different timescales. Exact Bayesian inference with respect to this generative process is possible using a Kalman filter (KF) that tracks the component processes jointly (Kording et al., 2007). When learning a single environmental parameter θ, such as the mean reward for some action in a bandit task, this amounts to modeling $\theta(t) = \sum_{i=1}^{n} z_i(t)$, where each $z_i$ is a diffusion process with a different characteristic timescale $\tau_i$, and doing joint inference over $Z = (z_1, \ldots, z_n)$. We then generalize this approach to an arbitrary statistical model, $h(x, \theta)$, where x is the input and $\theta \in \mathbb{R}^m$ is a parameter vector to be estimated. For instance, h might be a NN with parameters θ. Our Bayesian model places a 1/f prior on θ (as a stochastic process) by assuming $\theta(t) = \sum_{i=1}^{n} z_i(t)$ for diffusion processes $z_i \in \mathbb{R}^m$ with characteristic timescales $\tau_i$. We then do approximate inference over the joint state $Z = (z_1, \ldots, z_n)$, using an extended Kalman filter (EKF) that linearizes h by calculating its Jacobian at each step (Singhal & Wu, 1989; Puskorius & Feldkamp, 2003). Next, we derive a variational approximation to the EKF that constrains the covariance matrix to be diagonal, and we show how it extends the multiscale optimizer.
Specifically, writing $w_j$ and $\omega_{ij}$ for the current mean estimates of $\theta_j$ and $z_{ij}$ (for weight j and timescale i), the variational update to each $\omega_{ij}$ follows that of the multiscale optimizer, with additional machinery for determining decay rates based on $\tau_i$ and for adapting learning rates based on the current prior variance $s^2_{ij}(t)$.

In Section 5, we test our methods in online prediction and classification tasks with nonstationary distributions. In online learning, nonstationarity often manifests as poorer generalization performance on future data than on held-out data from within the training interval. Common solutions are to train on a window of fixed length (to exclude stale data) or to use stochastic gradient descent (SGD) with a fixed learning rate and weight decay, which leads older observations to have less influence (Ditzler et al., 2015). Here, we demonstrate that performance can be significantly improved by retaining all data and using a learning model that accounts for the temporal structure of the environment. We introduce nonstationarity in our simulations by varying the latent data-generating parameters according to 1/f noise. Thus an important caveat is that the task domains are matched to the Bayesian model. Notwithstanding, we test robustness by using a different set of timescales for task generation versus learning (Section 5.1), a generative process that mismatches the NN architecture (Section 5.2), and a construction of 1/f noise that differs from the sum-of-diffusions process the model assumes (Section 5.3). Results show that the Bayesian methods (KF and EKF) outperform windowing and online SGD, as well as a novel heuristic of training the network on all past data with gradients weighted by recency. We also find that the variational approximation performs nearly as well as the full model (Section 5.1) and scales well to a multilayer NN trained on real data (Section 5.3).

2 MULTISCALE OPTIMIZER

Assume a statistical model $\hat{y}(t) = h(x(t), w(t))$ and loss function $L(y, \hat{y})$, where x(t) is the input on step t, w(t) is the parameter estimate, $\hat{y}(t)$ is the model output, and y(t) is the target output. In a NN, w(t) is the vector of current weights. (Under the Bayesian framing in Section 4, w is the mean estimate of the optimal parameters θ.) For exposition, we assume the weights are updated by SGD,

$$w(t+1) = w(t) - \alpha \nabla_{w(t)} L(y(t), \hat{y}(t)), \tag{1}$$

and we henceforth abbreviate the gradient as $\nabla_{w(t)} L$. However, the following approach can be naturally composed with other optimizers, such as extensions of SGD or Hebbian learning, by replacing $\alpha \nabla_{w(t)} L$ with the appropriate update term.

Figure 1: Toy illustration of fast weights. A single weight w (blue) with constant input ($x \equiv 1$) predicts a target signal T (black) with square loss $L = \frac{1}{2}(T - w)^2$. The weight is a sum of subweights $\omega_{\text{slow}}$ (yellow) and $\omega_{\text{fast}}$ (red). Initial learning is rapid, due to $\omega_{\text{fast}}$. Because of decay and the shared error signal, knowledge is gradually transferred to $\omega_{\text{slow}}$ while $\omega_{\text{fast}}$ returns to zero. When the task switches (trial 151), $\omega_{\text{fast}}$ enables rapid adaptation while long-term knowledge is preserved in $\omega_{\text{slow}}$. Thus the model recovers quickly on the second reversal (compare the blue curve beginning on trials 1 vs. 156). The general multiscale optimizer extends this idea to an array of faster and slower weights.

The multiscale optimizer is motivated by the assumption that, in online learning tasks, the true or optimal parameters change over time, on multiple timescales. Accordingly, it expands each weight into a sum of subweights, $w_j = \sum_i \omega_{ij}$, each with a different learning rate $\alpha_i$ and decay rate $\gamma_i$. Here j indexes weights in w, and i indexes timescales. The subweights evolve according to:

$$\omega_{ij}(t+1) = \gamma_i \, \omega_{ij}(t) - \alpha_i \nabla_{w_j(t)} L. \tag{2}$$

Each $\omega_{ij}$ has characteristic timescale $\tau_i := (-\log \gamma_i)^{-1}$. Note that $\nabla_{w_j(t)} L = \nabla_{\omega_{ij}(t)} L$, so one can think of the gradient for $w_j$ as being apportioned among the subweights (with total learning rate $\alpha = \sum_i \alpha_i$), or equivalently of each subweight as following its own gradient.
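To make Equation 2 concrete, here is a minimal NumPy sketch of the multiscale update, together with a two-timescale run in the spirit of Figure 1. The class name, interface, and specific rates are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class MultiscaleSGD:
    """Minimal sketch of the multiscale update (Eq. 2). Each weight w_j is
    the sum of subweights omega_ij, one per timescale i, with its own
    learning rate alpha_i and decay rate gamma_i. Illustrative only."""

    def __init__(self, n_weights, alphas, gammas):
        self.alphas = np.asarray(alphas, dtype=float)
        self.gammas = np.asarray(gammas, dtype=float)
        self.omega = np.zeros((len(alphas), n_weights))  # omega[i, j]

    @property
    def w(self):
        return self.omega.sum(axis=0)  # w_j = sum_i omega_ij

    def step(self, grad_w):
        # All subweights of w_j share the same gradient dL/dw_j (Eq. 2).
        self.omega = (self.gammas[:, None] * self.omega
                      - self.alphas[:, None] * np.asarray(grad_w)[None, :])

# Two-timescale (fast-weight) run in the spirit of Figure 1: one weight
# tracks a target T under square loss L = 0.5 * (T - w)^2, with a
# non-decaying slow subweight and a fast, decaying one. Rates are made up.
opt = MultiscaleSGD(n_weights=1, alphas=[0.02, 0.3], gammas=[1.0, 0.9])
for t in range(300):
    T = 1.0 if t < 150 else -1.0   # task reversal, as in Figure 1
    opt.step(-(T - opt.w))         # gradient of the square loss w.r.t. w
```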
2.1 FAST WEIGHTS

A potentially important special case of multiscale learning arises with two timescales, $w = \omega_{\text{slow}} + \omega_{\text{fast}}$. We assume $\gamma_{\text{slow}} = 1$ (no decay) and $\alpha_{\text{fast}} > \alpha_{\text{slow}}$. Thus each $\omega_{\text{slow},j}$ can be thought of as the original weight, which is augmented by $\omega_{\text{fast},j}$, a second channel between the same neurons that both learns and decays rapidly. The fast weight enables the system to adapt quickly to distribution shifts while resisting catastrophic forgetting (Figure 1).

This model is conceptually similar to the fast-weights approach of Ba et al. (2016) and Hinton & Plaut (1987). In that work, the fast weights are updated by a different mechanism (Hebbian learning) than the primary weights, and they act as a memory of recent hidden states in a recurrent network. In the present conception, fast weights optimize the same loss as the primary weights, only with different temporal properties, and they act as a memory for recent learning signals (e.g., loss gradients). Thus they are perhaps better suited for handling distribution shifts of the sort considered here.

3 EQUIVALENCE RESULTS

3.1 BENNA-FUSI SYNAPSE

Benna & Fusi's (2016) model synapse is designed to capture how biochemical mechanisms in real synapses implement a cascading hierarchy of timescales, and it has been adopted in ML for continual reinforcement learning (Kaplanis et al., 2018; 2019). We consider a single weight w in a network, suppressing the index j. The Benna-Fusi model assumes that the information in w is maintained in a 1D hierarchy of variables $u_1, \ldots, u_n$, each dynamically coupled to its immediate neighbors:

$$C_1 \left( u_1(t+1) - u_1(t) \right) = g_1 \left( u_2(t) - u_1(t) \right) - \nabla_{w(t)} L \tag{3}$$
$$C_k \left( u_k(t+1) - u_k(t) \right) = g_{k-1} \left( u_{k-1}(t) - u_k(t) \right) + g_k \left( u_{k+1}(t) - u_k(t) \right) \tag{4}$$

for $2 \le k \le n$, with $g_n = 0$. The external behavior of the synapse comes from $u_1$ alone (i.e., $w = u_1$), while $u_{2:n}$ act as stores with progressively longer timescales. This update rule can be rewritten as

$$u(t+1) = T u(t) - d(t), \tag{5}$$

with transition matrix T determined by the coefficients in Equations 3 and 4, and external signal d(t) defined by $d_1(t) = \frac{1}{C_1} \nabla_{w(t)} L$ and $d_{2:n} \equiv 0$. It can be shown that the transition matrix is diagonalizable, $T = V \Lambda V^{-1}$, with eigenvalues $\Lambda_{ii} = \lambda_i < 1$ (see Appendix A). We can further enforce $V_{1i} = 1$ for all i, for a purpose explained below. We refer to the eigenvectors (columns $V_{\cdot i}$) as modes of the system, because they are preserved over time up to a scalar. That is, if the initial state is proportional to mode i, then in the absence of an external signal ($d \equiv 0$), the system will remain in that mode, decaying exponentially with rate factor $\lambda_i$:

$$u(0) \propto V_{\cdot i} \implies \forall t : u(t) = \lambda_i^t \, u(0). \tag{6}$$

In general, any state can be written uniquely as a linear combination of modes, $u = \sum_i \omega_i V_{\cdot i} = V \omega$. Therefore, reparameterizing the model as $\omega := V^{-1} u$ yields the simplified update equation:

$$\omega(t+1) = \Lambda \omega(t) - V^{-1} d(t), \tag{7}$$

where $V^{-1} d(t) = \frac{1}{C_1} [V^{-1}]_{\cdot 1} \nabla_{w(t)} L$. Because Λ is diagonal, there is no cross-talk between the modes, unlike in the original dynamics. Thus we have derived an instance of the multiscale optimizer, with subweights $\omega_i(t)$, decay rates $\lambda_i$, and learning rates $\frac{1}{C_1} [V^{-1}]_{i1}$. The assumption above, $V_{1i} = 1$, implies $w = u_1 = \sum_i \omega_i$, so the two models agree on the external behavior of the weight as a whole. Figure 2 illustrates the translation between the two models.
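The translation can also be checked numerically. The sketch below builds the transition matrix T implied by Equations 3 and 4, rescales the eigenvectors so that $V_{1i} = 1$, and verifies that freely decaying the coupled dynamics (Equation 5 with $d \equiv 0$) matches decaying each mode independently and recombining, as in Figures 2B-2E. The coefficients $C_k$ and $g_k$ here are placeholder choices, not the paper's defaults from Appendix A.

```python
import numpy as np

n = 8
C = 2.0 ** np.arange(n)          # placeholder coefficients; the paper's
g = 2.0 ** -np.arange(1, n + 1)  # defaults are given in Appendix A
g[-1] = 0.0                      # g_n = 0

# Transition matrix T implied by Eqs. 3-4 (gradient term omitted here).
T = np.eye(n)
for k in range(n):
    if k > 0:
        T[k, k - 1] += g[k - 1] / C[k]
        T[k, k] -= g[k - 1] / C[k]
    T[k, k] -= g[k] / C[k]
    if k < n - 1:
        T[k, k + 1] += g[k] / C[k]

lam, V = np.linalg.eig(T)
lam, V = lam.real, V.real        # eigenstructure is real for this chain
V = V / V[0]                     # rescale each column so that V_1i = 1

u = np.random.default_rng(0).standard_normal(n)  # arbitrary state (Fig. 2B)
omega = np.linalg.solve(V, u)    # modes: omega = V^{-1} u (Fig. 2C)

for _ in range(1000):            # coupled free decay of the Benna-Fusi model
    u = T @ u
omega = omega * lam ** 1000      # decoupled decay, one rate per mode (Fig. 2D)

assert np.allclose(V @ omega, u) # reconstruction matches exactly (Fig. 2E)
```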
Figure 2: Translation between the model of Benna & Fusi (2016) and the multiscale optimizer, by decomposing the state of the former model into modes: eigen-patterns of activation that decay independently and correspond to subweights in the multiscale optimizer. A: All modes for a default Benna-Fusi model with eight variables (n = 8). B: An arbitrary initial state of the model. C: Unique eigendecomposition of the state in Figure 2B. Implied values of the corresponding multiscale optimizer's subweights can be read off as the values of the curves at k = 1. D: Decay of the individual modes or subweights for 1000 steps (with no external input) at rates given by their eigenvalues. E: Reconstruction of the final state exactly matches the result of iterating the Benna-Fusi update (dotted arrow from Figure 2B). F: Decomposition of a unit impulse to $u_1$ (e.g., a loss gradient, shown as a grey bar) as a weighted sum of modes. Learning rates for the corresponding subweights, $\omega_i$, can be read off as the values of the curves at k = 1 (because $V_{1i} = 1$).

3.2 MOMENTUM LEARNING

The standard rationale for momentum learning is to smooth updates over time, so that oscillations along directions of high curvature cancel out while progress can be made in directions with consistent gradients (Rumelhart et al., 1986). To simplify notation, we again focus on a single weight w in the network, suppressing the index j. The momentum g is defined as an exponentially filtered running average of gradients, with the weight update determined by the current momentum:

$$g(t+1) = \beta g(t) + (1-\beta) \nabla_{w(t)} L \tag{8}$$
$$w(t+1) = w(t) - \eta \, g(t+1). \tag{9}$$

This formulation is equivalent to one in which the update $\Delta w(t) = w(t+1) - w(t)$ includes a portion of the previous update, $\Delta w(t) = -\alpha \nabla_{w(t)} L + \beta \, \Delta w(t-1)$, with $\alpha = \eta(1-\beta)$. Paralleling the analysis in Section 3.1, we write the state of the momentum optimizer as $[w, g]^\top$ and use Equations 8 and 9 to obtain the update rule:

$$\begin{bmatrix} w(t+1) \\ g(t+1) \end{bmatrix} = \begin{bmatrix} 1 & -\eta\beta \\ 0 & \beta \end{bmatrix} \begin{bmatrix} w(t) \\ g(t) \end{bmatrix} + \begin{bmatrix} -\eta(1-\beta) \\ 1-\beta \end{bmatrix} \nabla_{w(t)} L. \tag{10}$$

The transition matrix has eigenvectors $[1, 0]^\top$ with eigenvalue 1, and $\left[1, \frac{1-\beta}{\eta\beta}\right]^\top$ with eigenvalue β. Now use this eigenbasis to define a reparameterization:

$$\begin{bmatrix} w \\ g \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & \frac{1-\beta}{\eta\beta} \end{bmatrix} \begin{bmatrix} \omega_{\text{slow}} \\ \omega_{\text{fast}} \end{bmatrix}. \tag{11}$$

Substitution into Equation 10 yields the reparameterized update rule:

$$\begin{bmatrix} \omega_{\text{slow}}(t+1) \\ \omega_{\text{fast}}(t+1) \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \beta \end{bmatrix} \begin{bmatrix} \omega_{\text{slow}}(t) \\ \omega_{\text{fast}}(t) \end{bmatrix} + \begin{bmatrix} -\eta \\ \eta\beta \end{bmatrix} \nabla_{w(t)} L. \tag{12}$$

Thus we recover the fast-weight optimizer, with decay $\gamma_{\text{fast}} = \beta$ and learning rates $\alpha_{\text{slow}} = \eta$ and $\alpha_{\text{fast}} = -\eta\beta$. The negative fast learning rate is perhaps surprising but can be understood as follows: when $\alpha_{\text{fast}} < 0$, the subweights learn in opposite directions, with the latent knowledge in $\omega_{\text{slow}}$ overshooting the observable knowledge in $w = \omega_{\text{slow}} + \omega_{\text{fast}}$. As $\omega_{\text{fast}}$ decays toward 0, w catches up to $\omega_{\text{slow}}$, so that the model appears to continue learning from past input, just as it would with momentum. This analysis highlights the contrasting rationales of the two methods: learning at multiple timescales is motivated by an expectation of positive autocorrelation in the environment, whereas momentum is effective at smoothing out negative autocorrelation in the gradient signal.
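The equivalence is exact and easy to confirm numerically: the sketch below runs standard momentum (Equations 8 and 9) alongside the decoupled form (Equation 12) on the same arbitrary gradient stream and checks that $w = \omega_{\text{slow}} + \omega_{\text{fast}}$ on every step. The hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, beta = 0.1, 0.9

w, g = 0.0, 0.0             # standard momentum state (Eqs. 8-9)
w_slow, w_fast = 0.0, 0.0   # decoupled form (Eq. 12)

for _ in range(1000):
    grad = rng.standard_normal()        # arbitrary gradient stream
    # Momentum learning:
    g = beta * g + (1 - beta) * grad
    w = w - eta * g
    # Equivalent multiscale form: gamma_slow = 1, alpha_slow = eta;
    # gamma_fast = beta, alpha_fast = -eta * beta (negative learning rate).
    w_slow = w_slow - eta * grad
    w_fast = beta * w_fast + eta * beta * grad
    assert np.isclose(w, w_slow + w_fast)
```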
4 BAYESIAN MULTISCALE OPTIMIZER

We turn now to a normative analysis of learning at multiple timescales, based on Bayesian inference over 1/f noise. The Bayesian model introduced here assumes that the latent parameters θ governing the observed data in some learning task vary over time according to 1/f noise. When the statistical model h(x, θ) is linear in θ, exact Bayesian inference is possible with a KF that maintains a posterior over an expanded representation of θ. When the model is nonlinear, approximate Bayesian inference is achieved by an EKF that uses a linear approximation of h. We then show that a variational approximation of the KF or EKF, in which the posterior covariance matrix is constrained to be diagonal, yields an extension of the multiscale optimizer that adapts its learning rates online by tracking uncertainty.

4.1 GENERATIVE MODEL FOR 1/f NOISE

Let $z_i(t)$ be an Ornstein-Uhlenbeck process (i.e., diffusion with decay), with timescale or inverse decay rate $\tau_i$ and diffusion rate $\sigma_i^2$, defined by the following stochastic differential equation:

$$dz_i = -\tau_i^{-1} z_i \, dt + \sigma_i \, dW. \tag{13}$$

Here W(t) is a standard Wiener process (Brownian motion). As a Gaussian process, $z_i$ has kernel $\mathbb{E}[z_i(t) \, z_i(t+s)] \propto e^{-|s|/\tau_i}$, implying exponentially decaying autocorrelations. However, a superposition of such processes at different timescales can have qualitatively different properties (Eliazar & Klafter, 2009). In particular, consider

$$\xi(t) = \sum_{i=1}^{n} z_i(t), \tag{14}$$

where $\tau_i = \nu^i$ and $\sigma_i = \nu^{-i/2}$ for a chosen ν > 1, and n is an integer such that $\tau_n$ is very large. We show in Appendix B that ξ has power-law (i.e., long-range) autocorrelations, $\mathbb{E}[\xi(t)\xi(t+s)] \propto |s|^{-1}$ for $s \lesssim \tau_n$, and accordingly a power spectrum that is well approximated by 1/f for frequencies $f \gtrsim \tau_n^{-1}$. Moreover, m independent copies of this process constitute m-dimensional 1/f noise, due to the rotational invariance of multidimensional Ornstein-Uhlenbeck processes.

This construction affords a flexible generative model of nonstationarity in a variety of online learning domains, obtained by applying it to the latent parameters governing the relationships among observable variables. Assume we receive observations x(t), y(t) that we wish to model with a statistical model h that is parameterized by $\theta \in \mathbb{R}^m$:

$$y(t) = h(x(t), \theta(t)). \tag{15}$$

For example, h may be a NN with weights θ, input x, and target output y. The generative side of our Bayesian model posits latent variables $z_i$ (i = 1, ..., n) such that each $z_i$ is an Ornstein-Uhlenbeck process in $\mathbb{R}^m$ with timescale $\tau_i$, and these processes sum to determine the original parameters:

$$\theta(t) = \sum_{i=1}^{n} z_i(t). \tag{16}$$

These assumptions imply that θ follows a 1/f process, and they entail an expanded state representation, $Z = (z_1, \ldots, z_n) \in \mathbb{R}^{nm}$, that enables efficient inference as described in Section 4.2.

4.2 INFERENCE OVER 1/f NOISE VIA EXTENDED KALMAN FILTER

We consider Bayesian methods that adopt the construction in Section 4.1 as a generative model to account for nonstationarity. Equations 13 and 14 describe a linear dynamical system with state $Z = (z_1, \ldots, z_n) \in \mathbb{R}^n$. If ξ is directly observed at discrete intervals, then optimal Bayesian online prediction of each ξ(t) based on all preceding observations can be implemented by a KF over Z (Kording et al., 2007) (see Appendix D). We extend this approach to arbitrary statistical models with nonstationarity in their latent parameters, as in Equations 15 and 16.
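For the scalar, directly observed case, both the generative construction (Equations 13 and 14) and KF inference over Z are compact enough to sketch together. The code below uses an exact unit-step discretization of each Ornstein-Uhlenbeck process; the observation-noise level and other constants are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nu, n = 4.0, 8
tau = nu ** np.arange(1, n + 1)              # tau_i = nu^i
sigma = nu ** (-np.arange(1, n + 1) / 2.0)   # sigma_i = nu^(-i/2)

# Exact unit-step discretization of each OU process (Eq. 13):
# z_i(t+1) = a_i z_i(t) + noise with variance q_i.
a = np.exp(-1.0 / tau)
var_stat = sigma ** 2 * tau / 2              # stationary variance of z_i
q = var_stat * (1 - a ** 2)
R = 0.1                                      # observation noise (illustrative)
steps = 10_000

# Generate xi(t) = sum_i z_i(t) (Eq. 14), observed with noise.
z = rng.normal(0.0, np.sqrt(var_stat))
obs = np.empty(steps)
for t in range(steps):
    z = a * z + rng.normal(0.0, np.sqrt(q))
    obs[t] = z.sum() + rng.normal(0.0, np.sqrt(R))

# Kalman filter over the joint latent state Z = (z_1, ..., z_n).
H = np.ones((1, n))                          # observation model: xi = H @ Z
m, P = np.zeros(n), np.diag(var_stat)        # prior mean and covariance
preds = np.empty(steps)
for t in range(steps):
    m = a * m                                # predict step (A = diag(a))
    P = (a[:, None] * P) * a[None, :] + np.diag(q)
    preds[t] = m.sum()                       # one-step prediction of xi(t)
    S = H @ P @ H.T + R                      # innovation variance
    K = (P @ H.T) / S                        # Kalman gain
    m = m + K[:, 0] * (obs[t] - preds[t])
    P = P - K @ H @ P

# The filter's one-step predictions should beat simply repeating the
# previous observation (a persistence baseline).
print(np.mean((preds - obs) ** 2), np.mean((obs[1:] - obs[:-1]) ** 2))
```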
When h is linear in θ (and hence in Z), as in the regression task and one-layer perceptron model of Section 5.1, exact inference is possible with a standard KF (Appendix D). For a general h, such as a multilayer NN, we use an EKF. The EKF makes a local linear approximation of h based on its Jacobian, the matrix of gradients of the predictions ŷ with respect to θ (Appendix E). We use Ollivier's (2018) generalization of the EKF, which replaces Gaussian observation noise with an arbitrary exponential-family likelihood p(y|ŷ); this is better suited for modeling discrete outcomes, such as in the classification tasks of Sections 5.2 and 5.3.

4.3 VARIATIONAL APPROXIMATION

Finally, we derive a variational approximation of the EKF that extends the multiscale optimizer and affords efficient implementation in large NNs (Appendix F). As is standard, the EKF maintains an iterative prior over the latent state based on all previous observations: $p(Z(t) \mid x(1{:}t-1), y(1{:}t-1))$.
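While the full derivation is given in Appendix F, the overall shape of the variational update previewed in Section 1 can be sketched: each subweight carries a scalar variance $s^2_{ij}$ that decays and diffuses at its own timescale, and the effective learning rate for $\omega_{ij}$ scales with that variance in Kalman-gain fashion. The following is a rough caricature under those assumptions, with a constant stand-in for the EKF's linearized observation precision; it should not be read as the paper's exact algorithm.

```python
import numpy as np

class VariationalMultiscale:
    """Caricature of the diagonal (variational) multiscale update: each
    subweight omega_ij carries a variance s2_ij; decay follows tau_i and
    the learning rate tracks s2_ij, Kalman-gain style. The paper's exact
    recursions (Appendix F) may differ."""

    def __init__(self, n_weights, taus, sigmas, obs_var=1.0):
        taus, sigmas = np.asarray(taus), np.asarray(sigmas)
        decay = np.exp(-1.0 / taus)                     # gamma_i from tau_i
        self.gamma = decay[:, None]
        self.q = ((sigmas ** 2 * taus / 2) * (1 - decay ** 2))[:, None]
        self.omega = np.zeros((len(taus), n_weights))   # mean estimate of z_ij
        self.s2 = np.broadcast_to(                      # prior variance s2_ij
            (sigmas ** 2 * taus / 2)[:, None], (len(taus), n_weights)).copy()
        self.obs_var = obs_var  # stand-in for the linearized likelihood term

    @property
    def w(self):
        return self.omega.sum(axis=0)

    def step(self, grad_w):
        # Predict: subweights decay; uncertainty decays and diffuses.
        self.omega *= self.gamma
        self.s2 = self.gamma ** 2 * self.s2 + self.q
        # Update: per-subweight learning rate = s2_ij / total predictive
        # variance, mirroring a scalar Kalman gain; then shrink variances.
        total = self.s2.sum(axis=0) + self.obs_var      # one total per weight j
        self.omega -= (self.s2 / total) * np.asarray(grad_w)[None, :]
        self.s2 -= self.s2 ** 2 / total
```

In a full treatment, the gain would additionally depend on the model's Jacobian and the curvature of the exponential-family likelihood, as in the EKF of Section 4.2.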