# Memory-Based Meta-Learning on Non-Stationary Distributions

Tim Genewein\*¹, Grégoire Delétang\*¹, Anian Ruoss\*¹, Li Kevin Wenliang¹, Elliot Catt¹, Vincent Dutordoir¹ ², Jordi Grau-Moya¹, Laurent Orseau¹, Marcus Hutter¹, Joel Veness¹

\*Equal contribution. ¹DeepMind, ²University of Cambridge. Correspondence to: Tim Genewein, Grégoire Delétang, Anian Ruoss. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023.

Memory-based meta-learning is a technique for approximating Bayes-optimal predictors. Under fairly general conditions, minimizing sequential prediction error, measured by the log loss, leads to implicit meta-learning. The goal of this work is to investigate how far this interpretation can be realized by current sequence prediction models and training regimes. The focus is on piecewise stationary sources with unobserved switching-points, which arguably capture an important characteristic of natural language and action-observation sequences in partially observable environments. We show that various types of memory-based neural models, including Transformers, LSTMs, and RNNs, can learn to accurately approximate known Bayes-optimal algorithms and behave as if performing Bayesian inference over the latent switching-points and the latent parameters governing the data distribution within each segment.

## 1. Introduction

Memory-based meta-learning (MBML) has recently risen to prominence due to breakthroughs in sequence modeling and the proliferation of data-rich multi-task domains. Previous work (Ortega et al., 2019; Mikulik et al., 2020) showed how, in principle, MBML can lead to Bayes-optimal predictors by learning a fixed parametric model that performs amortized inference via its activations. This interpretation of MBML can provide theoretical understanding for counter-intuitive phenomena, such as in-context learning, that emerge in large language models with frozen weights (Xie et al., 2022).

In this work, we investigate the potential of MBML to learn parametric models that implicitly perform Bayesian inference with respect to more elaborate distributions than the ones investigated in Mikulik et al. (2020). We focus on piecewise stationary Bernoulli distributions, which produce sequences that consist of Bernoulli segments (see Figure 1). The predictor only observes a stream of samples (0s and 1s), with abrupt changes to the local statistics at the unobserved switching-points between segments. The focus on piecewise stationary sources is inspired by natural language, where documents often switch topic without explicit indication (Xie et al., 2022), and by observation-action streams in environments with discrete latent variables, e.g., multi-task RL without task indicators. In both domains, neural models that minimize sequential prediction error demonstrate hallmarks of sequential Bayesian prediction: strong context sensitivity or in-context learning (Reed et al., 2022), and rapid adaptation or few-shot learning (Brown et al., 2020).

To solve the sequential prediction problem, Bayes-optimal (BO) predictors simultaneously consider a number of hypotheses over switching-points and use prior knowledge over switching-points and segment statistics. Tractable exact BO predictors require non-trivial algorithmic derivations and are only known for certain switching-point distributions.
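To make the data-generating process concrete, the following minimal Python sketch samples one sequence from a piecewise Bernoulli source. It is illustrative only: the switching-points follow a simple geometric process controlled by a hypothetical `switch_prob` parameter, and segment biases are drawn uniformly, rather than from the priors used in the paper's experiments (Sections 5 and 6).

```python
import numpy as np


def sample_piecewise_bernoulli(length, switch_prob=0.05, rng=None):
    """Sample a binary sequence whose Bernoulli bias is redrawn at hidden switching-points.

    Illustrative stand-in for the paper's sources: switching-points are geometric
    here, whereas the experiments draw them from the priors of Sections 5 and 6.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.uniform()  # bias of the current segment
    xs, switches = [], []
    for t in range(length):
        if t > 0 and rng.uniform() < switch_prob:  # unobserved switching-point
            theta = rng.uniform()  # redraw the segment statistics
            switches.append(t)
        xs.append(int(rng.uniform() < theta))  # Bernoulli(theta) observation
    return np.array(xs), switches


# Example: a stream of 200 binary observations; the predictor sees only `xs`.
xs, switches = sample_piecewise_bernoulli(200, switch_prob=0.02)
```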
The main question of this paper is whether neural predictors with memory, trained by minimizing sequential prediction error (log loss), can learn to mimic Bayes-optimal solutions and match their prediction performance. Our contributions are:

- A review of the theoretical connection between minimizing sequential prediction error, meta-learning, and its implied Bayesian objective (Section 3).
- A theoretical argument for the necessity of memory to minimize this (Bayesian) objective (Section 4).
- An empirical demonstration that meta-learned neural predictors can match the prediction performance of two general non-parametric Bayesian predictors (Section 7).
- A comparison of the off-distribution generalization of the learned solutions and the Bayesian algorithms (Section 7).

Source code is available at: https://github.com/deepmind/nonstationary_mbml.

Figure 1. A single sequence from a piecewise Bernoulli source with three switching-points drawn from the PTW prior (see Section 6). Top: the predictors observe a stream of binary samples $x_t$ and, at each step, predict the probability of the next observation; the solid lines show the predictions $p(x_t \mid x_{<t})$.

For $m > n$, we define $x_{1:m} := x_{1:n}$ and $x_{n:m} := \epsilon$ (the empty string). The concatenation of two strings $s$ and $r$ is denoted by $sr$.

**Probabilistic Data Generating Sources** A probabilistic data generating source $\rho$ is defined by a sequence of probability mass functions $\rho_n : \mathcal{X}^n \to [0, 1]$, for all $n \in \mathbb{N}$, satisfying the compatibility constraint $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n}y)$ for all $x_{1:n} \in \mathcal{X}^n$, with base case $\rho_0(\epsilon) = 1$. From here onward, whenever the meaning is clear from the argument to $\rho$, the subscripts on $\rho$ will be dropped. Under this definition, the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is $\rho(x_n \mid x_{<n}) := \rho(x_{1:n}) / \rho(x_{<n})$ whenever $\rho(x_{<n}) > 0$, with the familiar chain rule $\rho(x_{1:n}) = \prod_{i=1}^{n} \rho(x_i \mid x_{<i})$.

The optimal predictor under this objective is a Bayesian mixture whose posterior weights $w^\rho_{n-1}$ over the sources $\rho \in \mathcal{M}$ are, in general, not fixed for $n > 1$; in other words, $w^\rho_{n-1}$ can depend upon the whole history. A complete proof is given in Appendix D. Importantly, this argument is independent of the representation capacity of $\nu_\theta$, and for example still holds even if $\nu_\theta$ is a universal function approximator, or if $\nu_\theta$ can represent each possible $\rho \in \mathcal{M}$ given data only from $\rho$. The same argument extends to any $k$-Markov stationary model for finite $k$, though one would expect much better approximations to be possible in practice with larger $k$.

## 5. Priors and Exact Inference Baselines

This section describes our baseline algorithms for exact Bayesian inference on piecewise stationary Bernoulli data. The algorithms make different assumptions regarding the statistical structure of the switching-points. If the data generating source satisfies these assumptions, then the baselines are theoretically known to perform optimally in terms of expected cumulative regret. This allows us to assess the quality of the meta-learned solutions against known optimal predictors. Note that while exact Bayesian inference is often computationally intractable, the cases we consider here are noteworthy in the sense that they can be computed efficiently, in some cases via quite elaborate algorithms that combine dynamic programming (see Koolen & de Rooij (2008) for a comprehensive overview) and the generalized distributive law (Aji & McEliece, 2000). To ensure that the data generating source matches the statistical prior assumptions made by the different baselines, we use their underlying priors as data generating distributions in our experiments (see Appendix E for details on the algorithms that sample from the priors).
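As a minimal illustration of such Bayesian predictors (our toy example, not one of the paper's baselines), the sketch below maintains a posterior over a small finite class of candidate Bernoulli biases and accumulates the log loss via the chain rule. Note that the posterior weights, which play the role of the $w^\rho_{n-1}$ from the memory argument above, are recomputed from the entire observation history.

```python
import numpy as np

# Toy Bayesian mixture over a finite model class M of candidate Bernoulli biases.
thetas = np.array([0.1, 0.5, 0.9])                  # model class M
log_w = np.log(np.ones_like(thetas) / len(thetas))  # uniform prior weights


def predict_and_update(x, log_w):
    """Return the mixture's log predictive probability of x and the updated posterior."""
    probs = np.where(x == 1, thetas, 1.0 - thetas)  # rho(x | history) for each model
    log_joint = log_w + np.log(probs)
    log_pred = np.logaddexp.reduce(log_joint)       # log of the posterior-weighted average
    return log_pred, log_joint - log_pred           # normalized posterior (log) weights


# Cumulative log loss via the chain rule: -log p(x_{1:n}) = -sum_t log p(x_t | x_{<t}).
log_loss = 0.0
for x in [1, 1, 0, 1, 1, 1]:
    log_pred, log_w = predict_and_update(x, log_w)
    log_loss -= log_pred
```

The exact baselines in this section implement the same principle with richer priors, e.g., a continuous Beta prior over the bias in the KT estimator below.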
**KT Estimator** The KT estimator is a simple Beta-Binomial model which efficiently implements a Bayesian predictor for Bernoulli($\theta$) sources with unknown $\theta$ by maintaining sufficient statistics in the form of counts. By using a Beta($\tfrac{1}{2}, \tfrac{1}{2}$) prior over $\theta$, we obtain the KT estimator (Krichevsky & Trofimov, 1981), which has optimal worst-case regret guarantees with respect to data generated from an unknown Bernoulli source. Conveniently, the predictive probability has the closed form

$$\text{KT}(x_{n+1} = 1 \mid x_{1:n}) = \frac{c(x_{1:n}) + \tfrac{1}{2}}{n + 1},$$

where $c(x_{1:n})$ returns the number of ones in $x_{1:n}$, and $\text{KT}(x_{n+1} = 0 \mid x_{1:n}) = 1 - \text{KT}(x_{n+1} = 1 \mid x_{1:n})$. This can be implemented efficiently online by maintaining two counters, and the associated marginal probability can be obtained via the chain rule $\text{KT}(x_{1:n}) = \prod_{i=1}^{n} \text{KT}(x_i \mid x_{<i})$.
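For completeness, here is a minimal sketch of the KT estimator as just described (our illustration for a binary alphabet, not the paper's implementation): two counters suffice for the predictive rule, and the log marginal is accumulated via the chain rule.

```python
import math


class KTEstimator:
    """Krichevsky-Trofimov estimator: Beta(1/2, 1/2)-Bernoulli predictor over {0, 1}."""

    def __init__(self):
        self.counts = [0, 0]     # numbers of observed 0s and 1s (sufficient statistics)
        self.log_marginal = 0.0  # log KT(x_{1:n}), accumulated via the chain rule

    def prob_one(self):
        n = self.counts[0] + self.counts[1]
        return (self.counts[1] + 0.5) / (n + 1.0)  # KT(x_{n+1} = 1 | x_{1:n})

    def update(self, x):
        p = self.prob_one() if x == 1 else 1.0 - self.prob_one()
        self.log_marginal += math.log(p)  # add log KT(x_{n+1} | x_{1:n})
        self.counts[x] += 1


# Usage: predict, then update, for each symbol of a binary stream.
kt = KTEstimator()
for x in [1, 1, 0, 1]:
    p1 = kt.prob_one()  # predictive probability of a 1 before observing x
    kt.update(x)
```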