Published as a conference paper at ICLR 2022

BAM: BAYES WITH ADAPTIVE MEMORY

Josue Nassar, Department of Electrical and Computer Engineering, Stony Brook University, josue.nassar@stonybrook.edu
Jennifer Brennan, Department of Computer Science, University of Washington, jrb@cs.washington.edu
Ben Evans, Department of Computer Science, New York University, benevans@nyu.edu
Kendall Lowrey, Department of Computer Science, University of Washington, klowrey@cs.washington.edu

ABSTRACT

Online learning via Bayes' theorem allows new data to be continuously integrated into an agent's current beliefs. However, a naive application of Bayesian methods in non-stationary environments leads to slow adaptation and results in state estimates that may converge confidently to the wrong parameter value. A common solution when learning in changing environments is to discard or downweight past data; however, this simple mechanism of forgetting fails to account for the fact that many real-world environments involve revisiting similar states. We propose a new framework, Bayes with Adaptive Memory (BAM), that takes advantage of past experience by allowing the agent to choose which past observations to remember and which to forget. We demonstrate that BAM generalizes many popular Bayesian update rules for non-stationary environments. Through a variety of experiments, we demonstrate the ability of BAM to continuously adapt in an ever-changing world.

1 INTRODUCTION

The ability of an agent to continuously modulate its belief while interacting with a non-stationary environment is a hallmark of intelligence and has garnered much attention in recent years (Zhang et al., 2020; Ebrahimi et al., 2020; Xie et al., 2020). The Bayesian framework enables online learning by providing a principled way to incorporate new observations into an agent's model of the world (Jaynes, 2003; Gelman et al., 2013). Through the use of Bayes' theorem, the agent can combine its own (subjective) a priori knowledge with data to arrive at an updated belief encoded by the posterior distribution. The Bayesian framework is a particularly appealing option for online learning because Bayes' theorem is closed under recursion, enabling continuous updates in what is commonly referred to as the recursive Bayes method (Wakefield, 2013). As an example, suppose the agent first observes a batch of data, D1, and then later observes another batch of data, D2. We can express the agent's posterior distribution over the world, where the world is represented by θ, as

    p(θ | D1, D2) = p(D2 | θ) p(θ | D1) / p(D2 | D1),    (1)

where

    p(D2 | D1) = ∫ p(D2 | θ) p(θ | D1) dθ.    (2)

Equation 1 demonstrates the elegance and simplicity of recursive Bayes: at time t, the agent recycles its previous posterior, p(θ | D1:t−1), as its current prior.

A.1 ESTIMATING THE MEAN OF A GAUSSIAN

We first consider estimating the mean of a Gaussian with known variance, where the likelihood is of the form

    p(yi | θ) = N(θ, σ²),

where σ² > 0 is known. We use a normal prior

    p(θ) = N(θ0, τ0),    (26)

where τ0 > 0. Given arbitrary data y1, . . . , yN ∼ p(y1:N), we get that the posterior is of the form

    p(θ | y1:N) = N(θN, τN),    (27)

where

    τN = (τ0⁻¹ + N σ⁻²)⁻¹ = σ² τ0 / (σ² + N τ0),    (28)

    θN = τN (τ0⁻¹ θ0 + σ⁻² Σ_{n=1}^N yn).    (29)

We observe that the posterior variance, equation 28, is not a function of the observed data. In fact, the posterior variance is deterministic given N, τ0 and σ². In this particular setting, we can show that τN is a strictly decreasing function of N. To prove that τ0 > τ1 > · · · > τn > · · · > τN, it suffices to show that

    τn−1 > τn,  ∀ n ∈ {1, . . . , N},    (30)

which is equivalent to showing that

    τn / τn−1 < 1,  ∀ n ∈ {1, . . . , N}.    (31)

Before proceeding, we note that because Bayes' theorem is closed under recursion, we can always express the posterior variance as

    τn = (τn−1⁻¹ + σ⁻²)⁻¹ = σ² τn−1 / (σ² + τn−1).    (32)

Computing τn / τn−1,

    τn / τn−1 = [σ² τn−1 / (σ² + τn−1)] · (1 / τn−1)    (33)
              = σ² / (σ² + τn−1).    (34)

Because

    τn > 0,  ∀ n ∈ {0, . . . , N},    (35)

we have that σ² < σ² + τn−1, and conclude that τn / τn−1 < 1.
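As an illustration of the recursion in equations 1–2 and of the variance behaviour proved above, the short NumPy sketch below (not from the paper; the data, seed, and constants are made up for the example) repeatedly folds observations into a conjugate Gaussian belief and checks that the resulting posterior variance matches equation 28 and shrinks no matter which values are observed.

```python
import numpy as np

def recursive_gaussian_update(theta, tau, y, sigma2):
    """One recursive Bayes step for a N(theta, tau) belief over the mean of a
    Gaussian likelihood with known noise variance sigma2 (cf. equation 32)."""
    tau_new = (1.0 / tau + 1.0 / sigma2) ** -1            # posterior variance
    theta_new = tau_new * (theta / tau + y / sigma2)      # posterior mean
    return theta_new, tau_new

rng = np.random.default_rng(0)
theta0, tau0, sigma2 = 0.0, 1.0, 0.5
ys = rng.normal(3.0, np.sqrt(sigma2), size=20)            # arbitrary data

theta, tau = theta0, tau0
variances = [tau0]
for y in ys:                                              # posterior becomes the next prior
    theta, tau = recursive_gaussian_update(theta, tau, y, sigma2)
    variances.append(tau)

# Posterior variance is data-independent (equation 28) and strictly decreasing.
N = len(ys)
closed_form = sigma2 * tau0 / (sigma2 + N * tau0)
assert np.isclose(tau, closed_form)
assert all(v_next < v for v, v_next in zip(variances, variances[1:]))
print(f"posterior mean={theta:.3f}, variance={tau:.5f}")
```

Running it shows the posterior mean tracking the data while the variance follows the deterministic schedule of equation 28, which is exactly why recursive Bayes grows overconfident in non-stationary environments.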
A.2 BAYESIAN LINEAR REGRESSION

Next, we consider the setting of Bayesian linear regression with known variance. The likelihood is of the form

    p(yi | xi, θ) = N(θ xi, σ²),  xi ∈ R,    (36)

where σ² > 0 is known. We use a normal prior

    p(θ) = N(θ0, τ0),    (37)

where τ0 > 0. Given arbitrary observations (x1, y1), . . . , (xN, yN), we have that the posterior is of the form

    p(θ | x1:N, y1:N) = N(θN, τN),    (38)

where

    τN = (τ0⁻¹ + σ⁻² Σ_{n=1}^N xn²)⁻¹ = σ² τ0 / (σ² + τ0 Σ_{n=1}^N xn²),    (39)

    θN = τN (τ0⁻¹ θ0 + σ⁻² Σ_{n=1}^N xn yn).    (40)

To prove that τ0 ≥ τ1 ≥ · · · ≥ τn ≥ · · · ≥ τN, it suffices to show that

    τn / τn−1 ≤ 1,  ∀ xn ∈ R,  ∀ n ∈ {1, . . . , N}.    (41)

Again, because Bayes' theorem is closed under recursion, we can always rewrite the posterior variance as

    τn = (τn−1⁻¹ + σ⁻² xn²)⁻¹ = σ² τn−1 / (σ² + τn−1 xn²).    (42)

So

    τn / τn−1 = [σ² τn−1 / (σ² + τn−1 xn²)] · (1 / τn−1)    (43)
              = σ² / (σ² + τn−1 xn²).    (44)

As xn² ≥ 0, we have that τn / τn−1 ≤ 1, which completes the proof.

B PROOF OF PROPOSITION 1

For clarity, we restate the proposition below.

Proposition. Let p(θ | D . . .

    . . . > 0 then scores[i] ← log ∫ p(Dt | θt) p(θt | W, D . . .
    . . . > priorscore then W[idx] ← 1 ; priorscore ← score ; p ← posterior(p, D . . .
    . . . priorscore, q) ;
    for each i in scores do
        if scores[i] > cutoff then W[i] ← 1 else W[i] ← 0 end
    end
    Result: Readout weights W

D EXPERIMENTAL SETTINGS

D.1 CONTROLS

For our controls experiments, we used Model Predictive Path Integral control (Williams et al., 2017), a model predictive control (MPC) algorithm, with a planning horizon of 50 timesteps and 32 sampled trajectories. Our sampling covariance was 0.4 for each controlled joint; in the case of Cartpole, the action space is one-dimensional. The temperature parameter we used was 0.5. Planning with a probabilistic model has each sampled trajectory use a different model drawn from the current belief (as opposed to drawing a model per timestep); planning rollouts included noise, such that

    xt = xt−1 + M φ(xt−1, at) + εt,  εt ∼ N(0, σ² I),    (60)

where M is sampled from the current belief. φ is the random Fourier features function from Rahimi & Recht (2007), where we use 200 features with a bandwidth calculated as the mean pairwise distance of the inputs (states and actions), which is 6.0. To learn M, we use Bayesian linear regression where each row of M is modeled as independent. We place a multivariate Normal prior on each of the rows with a prior mean of all 0s and a prior precision of 10⁻⁴ I.
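The dynamics model just described can be made concrete with a short sketch. This is not the authors' code: the exact feature construction, the assumed observation noise, and all array shapes and names are illustrative stand-ins; only the 200 features, the 6.0 bandwidth, and the 10⁻⁴ prior precision are taken from the text.

```python
import numpy as np

def make_rff(input_dim, n_features=200, bandwidth=6.0, seed=0):
    """Random Fourier feature map in the style of Rahimi & Recht (2007);
    frequencies are scaled by the bandwidth, the rest is illustrative."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(0.0, 1.0 / bandwidth, size=(input_dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return lambda z: np.sqrt(2.0 / n_features) * np.cos(z @ Omega + b)

def blr_posterior(Phi, Y, prior_precision=1e-4, noise_var=1e-4):
    """Row-wise Bayesian linear regression for x_t - x_{t-1} = M phi(x, a) + eps.
    Rows of M are independent and share one posterior covariance; noise_var is assumed."""
    D = Phi.shape[1]
    precision = prior_precision * np.eye(D) + Phi.T @ Phi / noise_var
    cov = np.linalg.inv(precision)                 # posterior covariance of each row
    mean = (cov @ Phi.T @ Y / noise_var).T         # one posterior mean row per state dim
    return mean, cov

def sample_model(mean, cov, rng):
    """Draw one M from the current belief; MPPI would use a fresh draw per rollout."""
    return np.stack([rng.multivariate_normal(m, cov) for m in mean])

# Illustrative usage with made-up data (state_dim=4, action_dim=1, as in Cartpole).
rng = np.random.default_rng(1)
states, actions = rng.normal(size=(500, 4)), rng.normal(size=(500, 1))
deltas = rng.normal(size=(500, 4))                 # placeholder for x_t - x_{t-1}
phi = make_rff(input_dim=5)
Phi = phi(np.hstack([states, actions]))
M_mean, M_cov = blr_posterior(Phi, deltas)
M = sample_model(M_mean, M_cov, rng)
x_next = states[0] + M @ phi(np.hstack([states[0], actions[0]]))   # one noiseless step of eq. 60
```

Conditioning on all past transitions as above corresponds to recursive Bayes; BAM would instead condition on the subset of past transitions selected by its readout weights W.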
The Cartpole model's initial state distribution for positions and velocities was sampled uniformly from -0.05 to 0.05, with the angle of the cart being π such that the pole points down. This sets up the swing-up problem. For the episodic one-shot experiment, we perform MPC for 200 timesteps as one trial. 15 trials make one episode, with the dynamical properties of the environment (i.e., gravity) fixed for the duration of the trial. We vary the gravity parameter of the model by selecting gravity values from celestial bodies of the Solar System; we used Earth, Mars, and Neptune at 9.81, 3.72, and 11.15 m/s², respectively. At the start of a new episode, each method's beliefs are reset to the base prior, and each method proceeds to update its beliefs accordingly. BAM retains each trial's data in memory across episodes.

For the continual learning experiment, we do not inform our agent that the model dynamics have changed, i.e., we never reset the agent's belief to a prior. Instead, we use Bayesian Online Changepoint Detection (BOCD) to discern whether the underlying model distribution has changed. BOCD is compared against BAM, both with and without changepoint detection; while BOCD resets to a prior when a change is detected, BAM optimizes for a weight vector over the previously experienced data. The BOCD switching parameter λ for its hazard function was set to 0.11. The agent attempts the task for 60 trials, with the environment changing 3 times over the course of those trials.

D.2 DOMAIN ADAPTATION WITH ROTATED MNIST

We ran 10 independent Bayesian linear regressions, one for each dimension of the one-hot encoded target. As the prior, we use a multivariate Normal distribution with a prior mean of all 0s and a prior precision of 0.1 I. Similar to the controls experiment, we assume the additive noise is fixed and set to σ² = 10⁻⁴. As regularization had little effect, we set λ = 0.

D.3 NON-STATIONARY BANDITS

For both UCB and UCBAM, we use a confidence-level function of f(t) = 1 + t log²(t). The timescale parameter for BOCD + Thompson sampling is 0.016, which is the expected frequency of the arm switches. The weighting term for Bayesian exponential forgetting + Thompson sampling is 0.8.

D.3.1 DESCRIPTION OF UCBAM

The challenge of bandit settings is the need to explore, especially in the non-stationary setting we devised. UCB is a well-known algorithm for leveraging the uncertainty in the arm values to drive exploration, and we combine this frequentist method with BAM as follows. When we believe we know the current best arm's value, we exploit it and maintain a belief over its distribution with BAM. The signal for whether the best arm is known is whether the likelihood of the current arm's observed value is higher under our current arm belief or under the naive base prior. If the base prior produces a higher likelihood, we assume the current arm belief is incorrect (and it will be updated with BAM), and we default to the UCB metric for arm selection. This simple combination allows for the exploration benefits of UCB together with BAM's quick recognition, and subsequent exploitation, of high-value arms.

Figure 5: For reference, the arm values of the bandit experiments without added noise (σ = 0.25), with only 2000 time steps shown for clarity. Each arm switches between a high and a low value at variable times, such that the highest-value arm may be a previously low-value arm. The simplest strategy would be to find the highest mean-value arm, which is what UCB does in this case. Recursive Bayes attempts the same, but does not explore sufficiently to achieve an accurate estimate of the mean arm values.

Algorithm 3: UCBAM
Data: prior distribution p, K number of arms
bk ← copy(p), Dk ← ∅, for k = 1, . . . , K ;  # belief and memory per arm
known ← false ;
for each iteration do
    if known then arm ← thompson(b1, . . . , bK) else arm ← UCB choice end
    v ← pull(arm) ;
    if log p(v) > log barm(v) then known ← false else known ← true end
    barm ← BAM(p, Darm)
end
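Since Algorithm 3 is cut off above, the following Python sketch shows one way the UCBAM loop could be realized for Gaussian-valued arms. It is not the authors' implementation: the conjugate per-arm update is a plain recursive-Bayes stand-in for the BAM step (which the excerpt does not fully specify), and the priors, names, and pull interface are illustrative; only the confidence-level function f(t) = 1 + t log²(t) and the Thompson/UCB switching rule come from the text.

```python
import numpy as np

def ucb_index(mean, count, t):
    """UCB score with the confidence-level function f(t) = 1 + t*log(t)**2 (Section D.3)."""
    if count == 0:
        return np.inf                                     # always try an arm at least once
    f_t = 1.0 + t * np.log(t) ** 2
    return mean + np.sqrt(2.0 * np.log(f_t) / count)

class GaussianBelief:
    """Conjugate N(mu, tau) belief over an arm's mean with known noise variance;
    a simple stand-in for the per-arm belief that BAM would maintain."""
    def __init__(self, mu0=0.0, tau0=1.0, noise_var=0.25 ** 2):
        self.mu, self.tau, self.noise_var = mu0, tau0, noise_var
    def update(self, v):
        tau_new = 1.0 / (1.0 / self.tau + 1.0 / self.noise_var)
        self.mu = tau_new * (self.mu / self.tau + v / self.noise_var)
        self.tau = tau_new
    def loglik(self, v):
        var = self.tau + self.noise_var                   # predictive variance
        return -0.5 * (np.log(2 * np.pi * var) + (v - self.mu) ** 2 / var)
    def sample(self, rng):
        return rng.normal(self.mu, np.sqrt(self.tau))

def ucbam(pull, K, steps, rng):
    prior = GaussianBelief()                              # naive base prior, never updated
    beliefs = [GaussianBelief() for _ in range(K)]        # per-arm beliefs (BAM's role)
    means, counts = np.zeros(K), np.zeros(K)
    known = False
    for t in range(1, steps + 1):
        if known:                                         # exploit via Thompson sampling
            arm = int(np.argmax([b.sample(rng) for b in beliefs]))
        else:                                             # explore via the UCB index
            arm = int(np.argmax([ucb_index(means[k], counts[k], t) for k in range(K)]))
        v = pull(arm)
        known = beliefs[arm].loglik(v) > prior.loglik(v)  # is the arm belief trustworthy?
        beliefs[arm].update(v)                            # stand-in for the BAM update
        counts[arm] += 1
        means[arm] += (v - means[arm]) / counts[arm]
    return means
```

For example, `ucbam(lambda k: rng.normal(true_means[k], 0.25), K=8, steps=2000, rng=np.random.default_rng(0))` runs the loop on a synthetic bandit with the noise level quoted in the Figure 5 caption, where `true_means` is any hypothetical length-K array of arm values.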