# Learning Universal Predictors

Jordi Grau-Moya\*¹, Tim Genewein\*¹, Marcus Hutter\*¹, Laurent Orseau\*¹, Grégoire Delétang¹, Elliot Catt¹, Anian Ruoss¹, Li Kevin Wenliang¹, Christopher Mattern¹, Matthew Aitchison¹, Joel Veness¹

\*Equal contribution. ¹Google DeepMind, London, UK. Correspondence to: Jordi Grau-Moya. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Meta-learning has emerged as a powerful approach to train neural networks to learn new tasks quickly from limited data by pre-training them on a broad set of tasks. But what are the limits of meta-learning? In this work, we explore the potential of amortizing the most powerful universal predictor, namely Solomonoff Induction (SI), into neural networks by leveraging (memory-based) meta-learning to its limits. We use Universal Turing Machines (UTMs) to generate training data that exposes networks to a broad range of patterns. We provide a theoretical analysis of the UTM data-generation processes and meta-training protocols. We conduct comprehensive experiments with neural architectures (e.g. LSTMs, Transformers) and algorithmic data generators of varying complexity and universality. Our results suggest that UTM data is a valuable resource for meta-learning, and that it can be used to train neural networks capable of learning universal prediction strategies.

## 1. Introduction

Meta-learning has emerged as a powerful approach to enable AI systems to learn new tasks quickly from limited data (Hospedales et al., 2021). By training a model on a diverse set of tasks, meta-learning encourages the discovery of representations and learning strategies that generalize to new, unseen tasks. Memory-based meta-learning (Santoro et al., 2016), a type of meta-learning that relies on memory updates instead of a two-level optimization procedure (Finn et al., 2017), has been shown to be a promising way of teaching neural networks to implicitly implement Bayesian inference (Ortega et al., 2019; Mikulik et al., 2020; Genewein et al., 2023) tailored to a prior task distribution.

Figure 1: Summary of our meta-learning methodology.

A key challenge in meta-learning is to design task distributions that are sufficiently broad, exposing the model to a rich variety of structures and patterns. Such broad exposure could lead to universal representations, enabling the system to tackle a wide range of problems and bringing us closer to the goal of artificial general intelligence (AGI).

Solomonoff Induction (SI)¹ offers a compelling theoretical foundation for constructing such an ideal universal prediction system (Solomonoff, 1964a;b)². At its core, SI elegantly integrates three fundamental principles (see Figure 1):

- Consideration of all computable hypotheses: Unlike traditional approaches, SI explores the entire space of computable hypotheses (i.e. anything generated by a computer program) as potential explanations for the observed data.
- Occam's Razor: SI assigns higher prior probabilities to simpler hypotheses, i.e. those with shorter descriptions.
- Bayesian updating: With new data, SI employs Bayes' rule to refine its belief about each hypothesis (a toy sketch of this mixture is given below).
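The following toy sketch illustrates these three ingredients in code. It is our illustration, not the paper's implementation: the four hand-written "programs" and their bit lengths are made-up stand-ins for a real enumeration over a universal machine, whose exact mixture is incomputable.

```python
# Toy illustration of Solomonoff-style prediction: a Bayesian mixture over a
# small, hand-enumerated set of "programs", each weighted by 2^(-length) and
# scored with a 0/1 likelihood on the observed prefix.  Real SI sums over all
# programs of a universal machine and is therefore incomputable.

def programs():
    # (assumed description length in bits, generator of an infinite bit stream)
    yield 2, lambda t: 0             # "print zeros"
    yield 2, lambda t: 1             # "print ones"
    yield 3, lambda t: t % 2         # "alternate 0101..."
    yield 5, lambda t: (t % 3) == 2  # "001001001..."

def posterior_predictive(history):
    """P(next bit = 1 | history) under the truncated mixture."""
    mass = [0.0, 0.0]  # unnormalised posterior mass on next bit being 0 / 1
    for length, prog in programs():
        prior = 2.0 ** (-length)                                   # Occam's razor
        if all(int(prog(t)) == b for t, b in enumerate(history)):  # 0/1 likelihood
            mass[int(prog(len(history)))] += prior                 # Bayesian update
    total = sum(mass)
    return mass[1] / total if total > 0 else 0.5

print(posterior_predictive([0, 1, 0, 1, 0]))  # -> high probability of 1
```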
The theoretical strength of SI lies in its ability to rapidly converge on the true data-generating process, if computable (Li & Vitanyi, 1992; Hutter, 2004; Sunehag & Hutter, 2013; Li et al., 2019). Yet a significant barrier is its practical incomputability: the exhaustive exploration of algorithmic hypotheses demands immense computational resources. To address this, approximations of SI were developed, e.g. the Speed Prior (Schmidhuber, 2002; Filan et al., 2016) and the Context Tree Weighting algorithm (Willems et al., 1995; Willems, 1998; Veness et al., 2012).

¹ SI arguably solved the century-old induction problem (Rathmanner & Hutter, 2011), is the basis of the Hutter prize (Hutter, 2006/2020), and has been praised by the father of AI, Marvin Minsky, as "the most important discovery since Gödel".
² For an introduction see (Hutter et al., 2007; Hutter, 2017), and see (Hutter, 2007) for technical details.

To understand the power of SI, imagine a program that generates an infinite stream of data $x$, e.g., a fluid dynamics simulation or an AI movie generator. Let's say the length of the shortest possible version of this program, i.e. its Kolmogorov complexity (Li et al., 2019), is $N$ bits. This could be approximated by removing all unnecessary elements of the program and using compression to further reduce its size. Now, feeding the data stream $x$ to SI and letting it predict each bit, something remarkable happens: after making fewer than $N$ prediction errors, SI predicts the future data perfectly! This occurs because SI effectively learns the underlying rules of the data-generating program: with each incorrect prediction, it eliminates a range of possible explanations, allowing it to quickly find the correct program behind the data.

In this paper we explore the potential of amortizing Solomonoff Induction into neural networks via memory-based meta-learning (see Figure 1). A key challenge is finding neural architectures and training data distributions that guide networks towards learning SI in the limit. While neural networks are theoretically capable of universal computation (Chen et al., 2017; Stogin et al., 2020; Mali et al., 2023), practical training methods (e.g., stochastic gradient descent) can limit this ability (Deletang et al., 2022). Here we simply use off-the-shelf architectures like Transformers (Vaswani et al., 2017) and LSTMs (Hochreiter & Schmidhuber, 1997), while focusing on designing a suitable training-data protocol. To this end, we generate data from Universal Turing Machines (UTMs), which are fully general computers. Training on this universal data exposes the network to a broad space of computable patterns that guide it towards learning universal inductive strategies.

Our key contributions are:

1) UTM data: We use, for the first time, UTM data to meta-train neural networks.
2) Theoretical analysis: We provide a theoretical analysis of the UTM data-generation process and a training protocol that converges to SI in the limit.
3) Extensive experiments: We conduct comprehensive experiments with a variety of neural architectures (e.g. LSTMs, Transformers) and algorithmic data generators of varying complexity and universality.
4) Open-source generators: We open-sourced all our generators at https://github.com/google-deepmind/neural_networks_solomonoff_induction.
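As a rough illustration of contribution 1, the sketch below samples random programs for a tiny BrainFuck-like machine and runs each for a bounded number of steps to obtain an output sequence. It is only a simplified stand-in written for this text; the actual data generators are the open-sourced ones linked above.

```python
import random

# Sketch of UTM-style data generation: sample a random program for a tiny
# BrainFuck-like machine and run it for a bounded number of steps, collecting
# whatever it prints as one meta-training sequence.

OPS = "+-<>[]."  # '.' writes the current cell to the output

def _match_brackets(program):
    stack, jumps = [], {}
    for i, op in enumerate(program):
        if op == "[":
            stack.append(i)
        elif op == "]" and stack:
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    return jumps

def run(program, max_steps=200, tape_len=32, alphabet_size=2, max_out=64):
    tape, ptr, out, pc = [0] * tape_len, 0, [], 0
    jumps = _match_brackets(program)
    for _ in range(max_steps):
        if pc >= len(program) or len(out) >= max_out:
            break
        op = program[pc]
        if op == "+":   tape[ptr] = (tape[ptr] + 1) % alphabet_size
        elif op == "-": tape[ptr] = (tape[ptr] - 1) % alphabet_size
        elif op == ">": ptr = (ptr + 1) % tape_len
        elif op == "<": ptr = (ptr - 1) % tape_len
        elif op == ".": out.append(tape[ptr])
        elif op == "[" and tape[ptr] == 0: pc = jumps.get(pc, pc)
        elif op == "]" and tape[ptr] != 0: pc = jumps.get(pc, pc)
        pc += 1
    return out

def sample_training_sequence(max_program_len=30):
    length = random.randint(1, max_program_len)
    program = "".join(random.choice(OPS) for _ in range(length))
    return program, run(program)

for _ in range(3):
    print(sample_training_sequence())
```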
Our results show that increasing model size leads to improved performance, demonstrating that model scaling helps in learning increasingly universal prediction strategies. We find that:

- Large Transformers trained on UTM data successfully transfer their learning to other tasks, suggesting they acquired reusable universal patterns.
- On variable-order Markov sources, large LSTMs and Transformers achieve optimal performance, highlighting their ability to model the Bayesian mixtures over programs necessary for SI.

## 2. Background

**Notation.** An alphabet $\mathcal{X}$ is a finite, non-empty set of symbols. A string $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. For $m > n$ we define $x_{1:m} := x_{1:n}$ and $x_{n:m} := \epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$. The expression $[\![A]\!]$ is 1 if $A$ is true and 0 otherwise.

**Semimeasures.** A semimeasure is a probability measure $\mu$ over infinite and finite sequences $\mathcal{X}^\infty \cup \mathcal{X}^*$ for some finite alphabet $\mathcal{X}$, here assumed to be $\{0,1\}$ (most statements hold for arbitrary finite $\mathcal{X}$). Let $\mu(x)$ be the probability that an (in)finite sequence starts with $x$. While proper distributions satisfy $\sum_{a \in \mathcal{X}} \mu(xa) = \mu(x)$, semimeasures exhibit a probability gap and only satisfy $\sum_{a \in \mathcal{X}} \mu(xa) \le \mu(x)$.

**Turing machines.** A Turing machine (TM) takes a string of symbols $z$ as input and outputs a string of symbols $x$ (after reading $z$ and halting), i.e. $T(z) = x$. For convenience we denote the output string at computation step $s$ by $T^s(z) = x$, which may be the empty string $\epsilon$. We adopt similar notation for universal Turing machines $U$. Monotone TMs (see Definition 1 below) are special TMs that can incrementally build the output string while incrementally reading the input program, which is a convenient practical property we exploit in our experiments.

**Definition 1 (Monotonicity).** A universal machine $U$ is monotone if for all $p, q, x, y$ with $U(p) = y$ and $U(q) = x$ we have that $\ell(x) \ge \ell(y)$ and $p \sqsubseteq q$ imply $y \sqsubseteq x$, where $p \sqsubseteq q$ means that $p$ is a prefix string of $q$. See Appendix C for a more thorough description.

**Solomonoff Induction (SI).** The optimal prediction of the next symbol $x_{n+1}$ given an observed sequence $x_{1:n}$ is $\mu(x_{n+1}|x_{1:n}) = \mu(x_{1:n+1})/\mu(x_{1:n})$, assuming that $\mu$ is the true (but unknown) computable probability distribution over sequences. In contrast, SI predicts the next symbol $x_{n+1}$ using a single universal semimeasure $M$, widely known as the Solomonoff universal prior (see definition below).

**Definition 2 ((Monotone) Solomonoff prior).** Let $U$ be a universal monotone machine; then the Solomonoff prior is defined as
$$M(x) := \sum_{p\,:\,U(p)=x*} 2^{-\ell(p)},$$
where the sum is over all $p \in \{0,1\}^*$ such that the output $x*$ is any string that starts with $x$ and the whole program $p$ has been read by $U$.

We can use $M$ to construct the posterior predictive distribution $M(x_{n+1}|x_{1:n}) = M(x_{1:n}x_{n+1})/M(x_{1:n})$ (see Figure 1). This is equivalent to performing Bayesian inference in program space, $M(x_{n+1}|x_{1:n}) = \sum_p P(p|x_{1:n})\,[\![U(p) = x_{1:n}x_{n+1}*]\!]$ (for prefix-free programs, and any continuation of the sequence), where $P(p|x_{1:n})$ is the Bayesian posterior over programs given the data, using the prior $P(p) = 2^{-\ell(p)}$ and the zero-one likelihood $P(x|p) = [\![U(p) = x*]\!]$. Solomonoff (1964a) showed that $M$ converges fast to the true $\mu$ if the data is generated by any computable probability distribution $\mu$.

[…]

Consistency of the meta-training data is shown next.

**Proposition 6.** Let $D_J := (x_1, \ldots, x_J)$ be samples from the measure $M_{s,L,n}$. Given
$$\hat M_{D_J}(x) := \frac{1}{J} \sum_{y \in D_J} [\![\ell(y) \ge \ell(x) \wedge y_{1:\ell(x)} = x]\!],$$
then $\hat M_{D_J}(x) \to M_{s,L,n}(x)$ for $J \to \infty$.

We get better approximations to the Solomonoff prior $M$ as we increase the number of UTM steps $s$, the program length $L$, the sequence length $n$, and the number of samples $J$.
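In code, Proposition 6's estimator is just the fraction of sampled outputs that extend $x$. The sketch below is our illustration (the toy list `D` stands in for outputs of randomly sampled programs run for $s$ steps), and the conditioned one-step predictor at the end is only a rough stand-in for the normalised estimator discussed next, not the paper's exact construction.

```python
def empirical_prior(samples, x):
    """Monte-Carlo estimate of M_hat_{D_J}(x) from Proposition 6: the fraction
    of sampled outputs y that are at least as long as x and start with x."""
    hits = sum(1 for y in samples if len(y) >= len(x) and y[: len(x)] == x)
    return hits / len(samples)

def empirical_predictive(samples, prefix, symbol):
    """One-step predictor obtained by conditioning the empirical prior
    (a rough stand-in for the normalised estimator, not the exact one)."""
    denom = sum(empirical_prior(samples, prefix + [a]) for a in (0, 1))
    return empirical_prior(samples, prefix + [symbol]) / denom if denom else 0.5

# Toy usage with hand-made "UTM outputs"; in the paper these would be the
# outputs of randomly sampled programs run for s steps.
D = [[0, 1, 0, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1], [0, 1, 0, 1]]
print(empirical_prior(D, [0, 1, 0]))          # -> 0.75
print(empirical_predictive(D, [0, 1, 0], 1))  # -> 0.666...
```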
More formally, since $M(x) = \lim_{s,L,n\to\infty} M_{s,L,n}(x) = \sup_{s,L,n} M_{s,L,n}(x)$, we have that $\hat M_{D_J} \to M$ for $s, L, n, J \to \infty$. Note that $D_J$ depends on $s, L, n$, but this can easily be avoided by choosing $s(j), L(j), n(j)$ to be any functions tending to infinity and sampling $x_j$ from $M_{s(j),L(j),n(j)}(x)$ for $j = 1, 2, 3, \ldots$

**Remark 7.** Although $M_{s,L,n}$ is computable, it still suffers from two inconveniences. First, sampling from it is inefficient because it is a semimeasure and exhibits a probability gap. Second, we need to differentiate whether programs halt or end up in an infinite non-printing loop (to fill the probability gap with absorbing tokens when training). We can bypass these inconveniences by estimating the normalized and computable Solomonoff prior, combining Definitions 3 and 5. We can estimate the (computable) normalized Solomonoff prior, $M^{\mathrm{norm}}_{s,L,n}(x)$, as follows.

**Proposition 8.** Using the definitions from Proposition 6 we have that $\hat M^{\mathrm{norm}}_{s,L,n}(x_t \mid x_{<t})$ […]

$Q(q) > 0$ for all $q \in \mathcal{X}^*$ and $Q(q_{1:n}) \to 0$ for $n \to \infty$. More precisely, for all universal monotone TMs $U$ and all $Q$ with the above properties, there exists a universal MTM $V$ (as constructed in the proof) such that $M^Q_U(x) = M_V(x)$ for all $x$. Proof in Appendix C.

**Note on the assumptions above.** We assumed an infinite number of data points and universality (and learnability) of the approximator, which are difficult to obtain in practice and diminish the relevance of the inductive biases of neural models. For finite data, however, inductive biases are important for strong generalization. We leave the theoretical study of the effect of the inductive bias and universality of neural models out of the scope of this paper and simply provide experimental evidence of neural-network performance in the next section.

## 4. Experimental Methodology

We aim to evaluate various neural architectures and sizes trained on UTM data and on two other types of algorithmically generated data for comparison and analysis.

**Variable-order Markov Sources (VOMS).** A $k$-Markov model assigns probabilities to a string of characters by, at any step $t$, using only the last $k$ characters to output the next-character probabilities. A VOMS is a Markov model where the value of $k$ is variable and is obtained from a tree of non-uniform depth. A tree here is equivalent to a program that generates data: we sample trees and meta-train on the generated data. We consider binary VOMS, for which a Bayes-optimal predictor exists: the Context Tree Weighting (CTW) predictor (Willems et al., 1995; 1997), to which we compare our models. CTW is only universal w.r.t. $n$-Markov sources, not w.r.t. all computable functions like SI. See Appendix D.2 for more intuition on VOMS, how we generate the data, and how to compute the CTW Bayes-optimal predictor.

Figure 2: Evaluation on VOMS data. Left: Example sequence and the highly overlapping predictions of Transformer-L (red) and the Bayes-optimal CTW predictor (blue); the lower panels show instantaneous and cumulative regret w.r.t. the ground truth. Middle: Mean cumulative regret over 6k sequences (length 256, max. CTW tree depth 24, in-distribution) for different networks (3 seeds) and sizes (S, M, L). Larger models perform better for all architectures, and Transformer-L and LSTM-L match the optimal CTW predictor. Right: Length generalization (1024 steps). LSTMs generalize to longer lengths, whereas Transformers do not.
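The regret curves in Figure 2 can be read as excess log-loss in bits. A minimal sketch of that computation, assuming regret is the per-step log-loss difference between the model and the ground-truth source (our reading of the "regret [bits]" panels; the probability arrays below are placeholders):

```python
import numpy as np

# Instantaneous regret at step t (in bits): log-loss of the model minus
# log-loss of the ground-truth predictor on the realised symbol x_t.
# Cumulative regret is its running sum, as plotted in Figure 2.

def regret_bits(x, p_true, p_model):
    """x: realised binary sequence; p_true / p_model: per-step probabilities
    assigned to the symbol '1' by the ground truth and by the model."""
    x = np.asarray(x)
    p_true = np.asarray(p_true, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    lik_true = np.where(x == 1, p_true, 1.0 - p_true)
    lik_model = np.where(x == 1, p_model, 1.0 - p_model)
    instantaneous = np.log2(lik_true) - np.log2(lik_model)
    return instantaneous, np.cumsum(instantaneous)

# Toy usage: a slightly miscalibrated model (0.8) vs the true source (0.9).
x = np.array([1, 1, 0, 1, 1])
inst, cum = regret_bits(x, p_true=np.full(5, 0.9), p_model=np.full(5, 0.8))
print(cum[-1])  # total excess log-loss in bits over the 5 steps
```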
**Chomsky Hierarchy (CH) Tasks.** We take the 15 algorithmic tasks (e.g. arithmetic, reversing strings) from (Deletang et al., 2022), lying on different levels of the Chomsky hierarchy (see Appendix D.3 for a description of all tasks). These tasks are useful for comparison and for assessing the algorithmic power of our models. In contrast to (Deletang et al., 2022), where models are trained on individual tasks, we are interested in meta-training on all tasks simultaneously. We make sure that all tasks use the same alphabet $\mathcal{X}$ (expanding the alphabet of tasks with smaller alphabets). We do not consider transduction as in (Deletang et al., 2022) but sequence prediction; thus we concatenate inputs and outputs with additional delimiter tokens, i.e. for pairs $\{(x_i, y_i)\}_{i=1}^{I}$ and delimiters ',' and ';', we construct sequences of the form $z := (x_1, y_1; x_2, y_2; \ldots x_n, y_n; \ldots)$. We evaluate our models using the regret (and accuracy) only on the output symbols, masking the inputs because they are usually random and non-informative of task performance. Denoting by $O_z$ the set of output time-indices, we compute the accuracy for trajectory $z$ as $A(z) := \frac{1}{|O_z|}\sum_{t \in O_z} [\![\arg\max_y \pi_\theta(y \mid z_{<t}) = z_t]\!]$.
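The concatenation-with-delimiters scheme and the output-masked accuracy just described can be sketched as follows; the delimiter token ids and the toy reversal pairs are made up for illustration, and the real pipeline lives in the released code.

```python
import numpy as np

# Build a meta-training sequence z = (x1 , y1 ; x2 , y2 ; ...) from
# input/output pairs, remember which positions are outputs, and compute
# accuracy only on those positions (inputs are masked out).

COMMA, SEMI = 100, 101  # made-up delimiter token ids

def build_sequence(pairs):
    z, output_positions = [], []
    for x, y in pairs:
        z.extend(x)
        z.append(COMMA)
        output_positions.extend(range(len(z), len(z) + len(y)))
        z.extend(y)
        z.append(SEMI)
    return np.array(z), np.array(output_positions)

def masked_accuracy(z, output_positions, predictions):
    """predictions[t] is argmax_y pi_theta(y | z_<t); accuracy is measured
    on output positions only."""
    return float(np.mean(predictions[output_positions] == z[output_positions]))

pairs = [([1, 2, 3], [3, 2, 1]), ([4, 5], [5, 4])]  # e.g. a string-reversal task
z, out_pos = build_sequence(pairs)
perfect = z.copy()                                   # a predictor that is always right
print(masked_accuracy(z, out_pos, perfect))          # -> 1.0
```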
[…] $Q(q) > 0$ for all $q \in \mathcal{X}^*$ and $Q(q_{1:n}) \to 0$ for $n \to \infty$. More precisely, for all universal monotone TMs $U$ and all $Q$ with the above properties, there exists a universal MTM $V$ (as constructed in the proof) such that $M^Q_U(x) = M_V(x)$ for all $x$. Proof in Appendix C.

We can effectively sample from any computable $Q$ if we have access to infinitely many fair coin flips. The conditions on $Q$ ensure that the entropy of $Q$ is infinite, and stays infinite even when conditioned on any $q \in \mathcal{X}^*$. This also allows the reverse: converting a sample from $Q$ into infinitely many uniform random bits. Forward and backward conversion can be achieved sample-efficiently via (bijective) arithmetic (de)coding. This forms the basis of the proof below. The condition of $Q$ being a proper measure rather than just a semimeasure is also necessary: for instance, for $Q(q) = 4^{-\ell(q)}$, on a Bernoulli($\tfrac12$) sequence $x_{1:\infty}$, $M_U(x_t \mid x_{<t})$ […]

The assumption $Q(q) > 0$ for all $q \in \mathcal{X}^*$ implies that $F$ is strictly increasing, and the assumption $Q(q_{1:n}) \to 0$ implies that $F$ is continuous. Since $F(0) = 0$ and $F(1) = 1$, this implies that $F$ is a bijection. Let $0.p_{1:\infty} = F(0.q_{1:\infty})$ and $0.q_{1:\infty} = F^{-1}(0.p_{1:\infty})$.⁴ Further, for some finite prefix $q \sqsubseteq q_{1:\infty}$, we partition the interval $[0.p^0_{1:\infty};\, 0.p^1_{1:\infty}) := [F(0.q0^\infty);\, F(0.q1^\infty))$ into a minimal set of binary intervals $0.\Gamma_p$, $p \in \Phi(q)$, where $\Phi(q)$ is a minimal prefix-free set in the sense that for any $p$, at most one of $p, p0, p1$ is in $\Phi(q)$. An explicit representation is $\Phi(q) := \{p^0_{<t}1 : t > t_0 \wedge p^0_t = 0\} \cup \{p^1_{<t}0 : t > t_0 \wedge p^1_t = 1\}$, where $t_0$ is the first $t$ for which $p^0_t \neq p^1_t$. Now we plug
$$Q(q) = F(0.q1^\infty) - F(0.q0^\infty) = \sum_{p \in \Phi(q)} |0.\Gamma_p| = \sum_{p \in \Phi(q)} 2^{-\ell(p)}$$
into
$$M^Q_U(x) := \sum_{q:U(q)=x*} Q(q) = \sum_{q:U(q)=x*} \sum_{p \in \Phi(q)} 2^{-\ell(p)} = \sum_{p:V(p)=x*} 2^{-\ell(p)} = M_V(x),$$
where $V(p) := U(q)$ for the maximal $q$ such that $p \in \Phi(q)$. The maximal $q$ is unique, since $\Phi(q) \cap \Phi(q') = \{\}$ if $q \not\sqsubseteq q'$ and $q' \not\sqsubseteq q$, and finite since $F$ is continuous. It remains to show that $V$ is universal. Let $p_{i'}$ be such that $0.\Gamma_{p_{i'}} \subseteq [F(0.i'0^\infty);\, F(0.i'1^\infty))$. The choice doesn't matter as long as it is a computable function of $i'$, but "shorter is better". This choice ensures that $F^{-1}(0.p_{i'}\ldots) = 0.i'\ldots$, whatever the continuation is. Now let $F(q_{1:\infty})_{\mathrm{tail}} := F(q_{1:\infty})_{\ell(p_{i'})+1:\infty} = p_{\ell(p_{i'})+1:\infty}$ if $q_{1:\infty}$ starts with $i'$, and arbitrary, e.g. $F(q_{1:\infty})$, otherwise. Let $T$ be an MTM with $T(q_{1:\infty}) := U_0(F(q_{1:\infty})_{\mathrm{tail}})$ for some universal MTM $U_0$. By Kleene's 2nd recursion theorem (Sipser, 2012, Chp. 6), there exists an $i$ such that $T_i(q) = T(i'q)$ for all $q$. Let $k := \ell(i') + 1$ and $\ell := \ell(p_{i'}) + 1$ and $q_{<k} := i'$, hence $p_{<\ell} = p_{i'}$. Now $V(p_{1:\infty}) = U(q_{1:\infty})$ implies
$$V(p_{i'} p_{\ell:\infty}) = U(i' q_{k:\infty}) = T_i(q_{k:\infty}) = T(i' q_{k:\infty}) = U_0(F(i' q_{k:\infty})_{\mathrm{tail}}) = U_0(p_{\ell:\infty}),$$
hence $V$ is universal, which concludes the proof.

⁴ Note that $p_{1:m}$ is uniformly distributed and is (for some $m$) essentially the arithmetic encoding of $q_{1:n}$, with one caveat: the mapping from sequences to reals conflates $0.q10^\infty = 0.q01^\infty$. Since the set of all conflated sequences has probability 0 (under $Q$ as well as Bernoulli($\tfrac12$)), any error introduced due to this conflation has no effect on the distribution $M^Q_U(x)$.

**Practical universal streaming functions.** Turing machines are impractical, and writing a program for a universal streaming function is another layer of indirection which is best avoided. Programming languages are already universal machines. We can define a conversion of real programs from/to binary strings and prepend it to the input stream. When sampling input streams $q_{1:\infty}$, we convert the beginning into a program of the desired programming language and feed it the tail as input stream.

## D. Experiment methodology details

### D.1. Architecture details

**RNN.** A vanilla multi-layer RNN (Elman, 1990) with hidden sizes and a multi-layer perceptron (MLP) before and after the RNN layers, as described in Table 1.

**Stack-RNN.** A multi-layer RNN controller with hidden sizes and MLPs exactly the same as the RNN and LSTM in Table 1, with access to a differentiable stack (Joulin & Mikolov, 2015). The controller can perform any linear combination of PUSH, POP, and NO-OP on the stack, with action weights given by a softmax over a linear readout of the RNN output (a minimal sketch of this soft update appears at the end of this subsection). Each cell of the stack contains a real vector of dimension 6, and the stack size is 64 for all (S, M and L) sizes.

Table 1: Architectures.

| | S | M | L |
|---|---|---|---|
| **RNN and LSTMs** | | | |
| Hidden size | 16 | 32 | 128 |
| Number of RNN layers | 1 | 2 | 3 |
| MLP before RNN layers | (16,) | (32, 32) | (128, 128, 128) |
| MLP after RNN layers | (16,) | (32, 32) | (128, 128, 128) |
| **Transformer (SINCOS)** | | | |
| Embedding dimension | 16 | 64 | 256 |
| Number of heads | 2 | 4 | 4 |
| Number of layers | 2 | 4 | 6 |

**Tape-RNN.** A multi-layer RNN controller with hidden sizes according to Table 1, with access to a differentiable tape, inspired by the Baby-NTM architecture (Suzgun et al., 2019). The controller can perform any linear combination of WRITE-RIGHT, WRITE-LEFT, WRITE-STAY, JUMP-LEFT, and JUMP-RIGHT on the tape, with action weights given by a softmax. The actions correspond to: writing at the current position and moving to the right (WRITE-RIGHT), writing at the current position and moving to the left (WRITE-LEFT), writing at the current position (WRITE-STAY), jumping $\ell$ steps to the right without writing (JUMP-RIGHT), where $\ell$ is the length of the input, and jumping $\ell$ steps to the left without writing (JUMP-LEFT). As in the Stack-RNN, each cell of the tape contains a real vector of dimension 6, and the tape size is 64 for all (S, M and L) sizes.

**LSTM.** A multi-layer LSTM (Hochreiter & Schmidhuber, 1997) with hidden sizes according to Table 1.

**Transformer decoder.** A vanilla Transformer decoder (Vaswani et al., 2017). See Table 1 for the embedding dimension, number of heads, and number of layers for each model size (S, M and L). Each layer is composed of an attention layer, two dense layers, and a layer normalization. We add residual connections as in the original architecture (Vaswani et al., 2017) and use the standard sin/cos positional encoding (Vaswani et al., 2017).
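Returning to the Stack-RNN above: the differentiable stack is updated as a convex combination of the hard PUSH, POP and NO-OP results, weighted by a softmax over the controller's action logits. The sketch below is our own minimal NumPy rendering of that idea, with a small stack depth chosen only to keep the example readable.

```python
import numpy as np

def soft_stack_update(stack, push_value, action_logits):
    """stack: (depth, cell_dim); push_value: (cell_dim,); action_logits: (3,)
    for [PUSH, POP, NO-OP].  Returns the softly updated stack, i.e. a convex
    combination of the three hard updates."""
    weights = np.exp(action_logits - action_logits.max())
    w_push, w_pop, w_noop = weights / weights.sum()

    pushed = np.vstack([push_value[None, :], stack[:-1]])      # shift down, insert on top
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])  # shift up, pad bottom
    return w_push * pushed + w_pop * popped + w_noop * stack

# Toy usage: depth-4 stack of 6-dimensional cells (the experiments use a
# stack of size 64 with 6-dimensional cells; depth 4 here is illustrative).
stack = np.zeros((4, 6))
stack = soft_stack_update(stack, np.ones(6), np.array([2.0, 0.0, 0.0]))
print(stack[0])  # top cell is mostly the pushed vector
```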
**Model sizes.** Model sizes (see Table 2) were determined by pilot experiments on VOMS data using the LSTM and Transformer architectures, where we increased the architecture size until the models could fit the VOMS data well, i.e. reach minimal log-loss within a budget of 500k iterations. This formed the L-variants of the models; the M and S variants use reduced numbers of layers and hidden sizes. All other recurrent networks use the same hidden sizes and numbers of layers as the LSTM for better comparison. We used this procedure to assess performance within an architecture class, since the number of parameters is not a good metric across architecture classes.

Below is an ultra-compact introduction to (sampling from) CTW (Willems et al., 1995; 1997). For more explanations, details, discussion, and derivations, see (Catt et al., 2024, Chp. 4).

**A variable-order Markov process** is a probability distribution over (binary) sequences $x_1, x_2, x_3, \ldots$ with the following property: let $S \subset \{0,1\}^*$ be a complete suffix-free set of strings (a reversed prefix-free code), which can equivalently be viewed as a perfect binary tree. Then $p(x_t = 0 \mid x_{<t})$ depends only on the unique $s \in S$ that is a suffix of $x_{<t}$.
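To make the sampling side concrete, here is a small sketch that draws a random binary variable-order Markov source and samples a sequence from it. The recursive context splitting and the Beta(0.5, 0.5) draw for each context probability are our illustrative choices, not the paper's exact generator.

```python
import random

# Sketch of sampling from a binary variable-order Markov source: a complete
# suffix-free set S, a probability of emitting '0' for each context in S, and
# a generator that at every step finds the matching context of the history.

def random_suffix_set(max_depth=3, split_prob=0.5):
    """Grow a complete suffix-free set S by recursively splitting contexts."""
    contexts, final = [""], []
    while contexts:
        c = contexts.pop()
        if len(c) < max_depth and random.random() < split_prob:
            contexts += ["0" + c, "1" + c]  # extend the context further into the past
        else:
            final.append(c)
    return final

def sample_source(S):
    """Attach to each context a probability of emitting '0'."""
    return {s: random.betavariate(0.5, 0.5) for s in S}  # illustrative prior choice

def sample_sequence(theta, length, history="1111"):
    seq = []
    for _ in range(length):
        context = next(s for s in theta if history.endswith(s))  # unique by suffix-freeness
        bit = "0" if random.random() < theta[context] else "1"
        seq.append(bit)
        history += bit
    return "".join(seq)

S = random_suffix_set()
theta = sample_source(S)
print(S)
print(sample_sequence(theta, 32))
```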