# Deep Symbolic Regression for Recurrent Sequences

Stéphane d'Ascoli*, Pierre-Alexandre Kamienny*, Guillaume Lample, François Charton

## Abstract

Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task. In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, e.g. bessel0(x) ≈ (sin(x) + cos(x))/√(πx) and 1.644934 ≈ π²/6.

## 1. Introduction

Given the sequence [1, 2, 4, 7, 11, 16], what is the next term? Humans usually solve such riddles by noticing patterns in the sequence. In the easiest cases, one can spot a function: [1, 4, 9, 16, 25] are the first five squares, so the n-th term in the series is u_n = n², and the next one is 36. Most often however, we look for relations between successive terms: in the sequence [1, 2, 4, 7, 11, 16], the differences between successive values are 1, 2, 3, 4, and 5, which makes it likely that the next term will be 16 + 6 = 22. Mathematically, we are inferring the recurrence relation u_n = u_{n-1} + n, with u_0 = 1.

In all cases, we handle such problems as symbolic regressions: starting from a sequence of numbers, we try to discover a function or a recurrence relation that they satisfy, and use it to predict the next terms. This can lead to very challenging problems as the complexity of the unknown recurrence relation u_n = f(n, {u_i}_{i<n}) grows, e.g. u_n = arctan(u_{n-3}) exp(cos(n²)). In this paper, we train neural networks to infer the recurrence relation f from the observation of the first terms of the sequence.

The majority of studies in machine learning for symbolic regression have focused on non-recurrent functions, i.e. expressions of the form y = f(x). Recurrence relations provide a more general setting, which gives a deeper account of the underlying process: if the sequence corresponds to successive time steps, the recurrence relation is a discrete-time differential equation for the system considered.

### 1.1. Contributions

We show that Transformers can learn to infer a recurrence relation from the observation of the first terms of a sequence. We consider both sequences of integers and floats, and train our models on a large set of synthetic examples. We first demonstrate that our symbolic regression model can predict complex recurrence relations that were not seen during training. We also show that those recurrence relations can be used to extrapolate the next terms of the sequence with better precision than a numeric model of similar architecture, trained specifically for extrapolation.

We then test the out-of-domain generalization of our models. On a subset of the Online Encyclopedia of Integer Sequences, our integer models outperform built-in Mathematica functions, both for sequence extrapolation and recurrence prediction; see Table 1 for some examples.
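To make "recurrence prediction" concrete, here is a minimal check, written by us for illustration rather than taken from the paper, of the recurrence reported for OEIS sequence A008954 in Table 1 below (the final digits of the triangular numbers n(n+1)/2):

```python
# Verify the recurrence predicted for OEIS A008954 (see Table 1):
# u_n = (u_{n-1} + n) % 10 with u_0 = 0 should equal the final digit of n(n+1)/2.
ground_truth = [n * (n + 1) // 2 % 10 for n in range(25)]

u = [0]                         # initial term u_0
for n in range(1, 25):
    u.append((u[-1] + n) % 10)  # predicted recurrence

assert u == ground_truth        # exact on the first 25 terms
print(u[:10])                   # [0, 1, 3, 6, 0, 5, 1, 8, 6, 5]
```

The recurrence is exact here because n(n+1)/2 − (n−1)n/2 = n, so the final digit satisfies u_n = (u_{n-1} + n) mod 10.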
We also show that our symbolic float model is capable of predicting approximate expressions for out-of-vocabulary functions and constants (e.g. bessel0(x) ≈ (sin(x) + cos(x))/√(πx) and 1.644934 ≈ π²/6); see Tables 2 and 3 for more examples. We conclude by discussing potential limitations of our models and future directions.

| OEIS | Description | First terms | Predicted recurrence |
|---|---|---|---|
| A000792 | a(n) = max{(n − i) a(i), i < n} | 1, 1, 2, 3, 4, 6, 9, 12, 18, 27 | u_n = u_{n-1} + u_{n-3} − u_{n-1} % u_{n-3} |
| A000855 | Final two digits of 2^n | 1, 2, 4, 8, 16, 32, 64, 28, 56, 12 | u_n = (2 u_{n-1}) % 100 |
| A006257 | Josephus sequence | 0, 1, 1, 3, 1, 3, 5, 7, 1, 3 | u_n = (u_{n-1} + n) % (n − 1) + 1 |
| A008954 | Final digit of n(n+1)/2 | 0, 1, 3, 6, 0, 5, 1, 8, 6, 5 | u_n = (u_{n-1} + n) % 10 |
| A026741 | a(n) = n if n odd, n/2 if n even | 0, 1, 1, 3, 2, 5, 3, 7, 4, 9 | u_n = u_{n-2} + n // (u_{n-1} + 1) |
| A035327 | n in binary, switch 0's and 1's, then back to decimal | 1, 0, 1, 0, 3, 2, 1, 0, 7, 6 | u_n = (u_{n-1} − n) % (n − 1) |
| A062050 | n-th chunk contains the numbers 1, ..., 2^{n-1} | 1, 1, 2, 1, 2, 3, 4, 1, 2, 3 | u_n = (n % (n − u_{n-1})) + 1 |
| A074062 | Reflected Pentanacci numbers | 5, −1, −1, −1, −1, 9, −7, −1, −1, −1 | u_n = 2 u_{n-5} − u_{n-6} |

Table 1. Our integer model yields exact recurrence relations on a variety of interesting OEIS sequences. Predictions are based on observing the first 25 terms of each sequence.

| Constant | Approximation | Rel. error |
|---|---|---|
| 0.3333 | (3 + exp(−6))^(-1) | 10^-5 |
| 0.33333 | 1/3 | 10^-5 |
| 3.1415 | 2 arctan(exp(10)) | 10^-7 |
| 3.14159 | π | 10^-7 |
| 1.6449 | 1/arctan(exp(4)) | 10^-7 |
| 1.64493 | π²/6 | 10^-7 |
| 0.123456789 | 10/9² | 10^-9 |
| 0.987654321 | 1 − (1/9)² | 10^-11 |

Table 2. Our float model learns to approximate out-of-vocabulary constants with its own vocabulary. We obtain the approximation of each constant C by feeding our model the first 25 terms of u_n = Cn.

| Expression u_n | Approximation û_n | Comment |
|---|---|---|
| arcsinh(n) | log(n + √(n² + 1)) | Exact |
| arccosh(n) | log(n + √(n² − 1)) | Exact |
| arctanh(1/n) | ½ log(1 + 2/n) | Asymptotic |
| catalan(n) | u_{n-1} (4 − 6/n) | Asymptotic |
| dawson(n) | n / (2n² − 1/u_{n-1}) | Asymptotic |
| j0(n) (Bessel) | (sin(n) + cos(n)) / √(πn) | Asymptotic |
| i0(n) (modified Bessel) | e^n / √(2πn) | Asymptotic |

Table 3. Our float model learns to approximate out-of-vocabulary functions with its own vocabulary. For simple functions, our model predicts an exact expression in terms of its operators; for complex functions which cannot be expressed, our model manages to predict the first order of the asymptotic expansion.

### 1.2. Related work

**AI for maths.** The use of neural networks for mathematics has recently gained attention: several works have demonstrated their surprising reasoning abilities (Saxton et al., 2019; Cobbe et al., 2021), and have even sparked some interesting mathematical discoveries (Davies et al., 2021). In particular, four types of tasks have recently been tackled in the deep learning literature. First, converting a symbolic expression to another symbolic expression. Direct examples are integration and differentiation (Lample & Charton, 2019), expression simplification (Allamanis et al., 2017) or equation solving (Arabshahi et al., 2018). Second, converting a symbolic expression to numerical data. This includes predicting quantitative properties of a mathematical structure, for example the local stability of a differential system (Charton et al., 2020). Third, converting numerical data to numerical data using mathematical rules. Examples range from learning basic arithmetic (Kaiser & Sutskever, 2015; Trask et al., 2018) to linear algebra (Charton, 2021). Fourth, converting numerical data to a symbolic expression: this is the framework of symbolic regression, which we will be focusing on.
**Symbolic regression.** Two types of approaches exist for symbolic regression, which one could name selection-based and pretraining-based. In selection-based approaches, we only have access to one set of observations: the values of the function we are trying to infer. Typical examples of this approach are evolutionary programs, which have long been the de facto standard for symbolic regression (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009; Murari et al., 2014; McKay et al., 1995), despite their computational cost and limited performance. More recently, neural networks were used following this approach (Sahoo et al., 2018; Kim et al., 2020; Petersen et al., 2019). In pretraining-based approaches, we train neural networks on a large dataset containing observations from many different functions (Biggio et al., 2021; Valipour et al., 2021), hoping that the model can generalize to new expressions. Although the pretraining is computationally expensive, inference is much quicker, as one simply needs to perform a forward pass rather than search through a set of functions. In this work, we choose this approach.

**Recurrent sequences.** All the studies on symbolic regression cited above consider the task of learning non-recurrent functions from their values at a set of points. Our contribution is, to the best of our knowledge, the first to tackle the setup of recurrent expressions. Although this includes non-recurrent expressions as a special case, one slight restriction is that the inputs need to be sampled uniformly. Hence, instead of feeding the model with input-output pairs {(x_i, y_i)}, we only need to feed it the terms of the sequence {u_i}. Another important difference is that the order of these terms matters, hence permutation-invariant representations (Valipour et al., 2021) cannot be used.

Integer sequences, in particular those from the Online Encyclopedia of Integer Sequences (OEIS) (Sloane, 2007), have been studied using machine learning methods in a few recent papers. Wu (2018) trains various classifiers to predict the label of OEIS sequences, such as "interesting", "easy", or "multiplicative". Ryskina & Knight (2021) use embeddings trained on OEIS to investigate the mathematical properties of integers. Ragni & Klein (2011) use fully connected networks to predict the next term of OEIS sequences (a numeric rather than symbolic regression task). Nam et al. (2018) investigate different architectures (most of them recurrent networks) for digit-level numeric regression on sequences such as Fibonacci, demonstrating the difficulty of the task.

## 2. Methods

Broadly speaking, we want to solve the following problem: given a sequence of n points {u_0, ..., u_{n-1}}, find a function f such that for any i ∈ ℕ, u_i = f(i, u_{i-1}, ..., u_{i-d}), where d is the recursion degree. Since we cannot evaluate at an infinite number of points, we declare that a function f is a solution of the problem if, given the first n_input terms in the sequence, it can predict the n_pred following ones. Under this form, the problem is underdetermined: given a finite number of input terms, an infinite number of recurrence relations exist. However, in practice one would like the model to give us the simplest solution possible, following Occam's principle. This is generally ensured by the fact that simple expressions are more likely to be generated during training.
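As an illustration of this solution criterion, here is a minimal sketch (ours; the function signature and the tolerance handling are illustrative assumptions, not the paper's implementation):

```python
from typing import Callable, Sequence

def is_solution(f: Callable[..., float], seq: Sequence[float],
                degree: int, n_pred: int, tol: float = 1e-10) -> bool:
    """Accept f if, seeded with the observed prefix, the recurrence
    u_i = f(i, u_{i-1}, ..., u_{i-degree}) reproduces the last n_pred terms."""
    n_input = len(seq) - n_pred
    u = list(seq[:n_input])                      # observed input terms
    for i in range(n_input, len(seq)):
        u.append(f(i, *u[i - degree:i][::-1]))   # pass u_{i-1}, ..., u_{i-degree}
    return all(abs(p - t) <= tol * max(1.0, abs(t))   # relative tolerance
               for p, t in zip(u[n_input:], seq[n_input:]))

# The introductory example: u_n = u_{n-1} + n with u_0 = 1.
seq = [1, 2, 4, 7, 11, 16, 22, 29]
print(is_solution(lambda i, u1: u1 + i, seq, degree=1, n_pred=3))  # True
```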
**Integer vs. float.** We consider two settings for the numbers in the sequences: integers and floats. For integer sequences, the recurrence formula only uses operators which are closed in ℤ (e.g. +, −, abs, modulo and integer division). For float sequences, additional operators and functions are allowed, such as real division, exp and cos (see Table 4 for the list of all operators used). These two setups are interesting for different reasons. Integer sequences are an area of strong interest in mathematics, in particular for their relation with arithmetic. The float setup is interesting to see how our model can generalize to a larger set of operators, which provides more challenging problems.

**Symbolic vs. numeric.** We consider two tasks: symbolic regression and numeric regression. In symbolic regression, the model is tasked to predict the recurrence relation the sequence was generated from. At test time, this recurrence relation is evaluated by how well it approximates the n_pred following terms of the sequence. In numeric regression, the model is tasked to directly predict the values of the n_pred following terms, rather than the underlying recurrence relation. At test time, the model predictions are compared with the true values of the sequence.

| | Integer | Float |
|---|---|---|
| Unary | abs, sqr, sign, relu | abs, sqr, sqrt, inv, log, exp, sin, cos, tan, atan |
| Binary | add, sub, mul, intdiv, mod | add, sub, mul, div |

Table 4. Operators used in our generators.

### 2.1. Data generation

All training examples are created by randomly generating a recurrence relation, randomly sampling its initial terms, then computing the next terms using the recurrence relation. More specifically, we use the steps below (a simplified code sketch of this pipeline is given after Table 5):

1. Sample the number of operators o between 1 and o_max, and build a unary-binary tree with o nodes, as described in Lample & Charton (2019). The number of operators determines the difficulty of the expression.
2. Sample the nodes of the tree from the list of operators in Table 4. Note that the float case uses more operators than the integer case, which makes the task more challenging by expanding the problem space.
3. Sample the recurrence degree d between 1 and d_max, which defines the recurrence depth: for example, a degree of 2 means that u_{n+1} depends on u_n and u_{n-1}.
4. Sample the leaves of the tree: either a constant, with probability p_const, or the current index n, with probability p_n, or one of the previous terms of the sequence u_{n-i}, with i ∈ [1, d], with probability p_var.
5. Recalculate the effective recurrence degree d_eff by considering the deepest leaf u_{n-i} sampled during the previous step, then sample d_eff initial terms from a random distribution P.
6. Sample l between l_min and l_max and compute the next l terms of the sequence using the initial conditions. The total sequence length is hence n_input = d_eff + l.

We provide the values of the parameters of the generator in Table 5. Note that in the last step, we interrupt the computation if we encounter a term larger than 10^100, or outside the range of one of the operators: for example, a division by zero, or a negative square root.

| Parameter | Description | Value |
|---|---|---|
| d_max | Max degree | 6 |
| o_max | Max number of operators | 10 |
| l_min | Min length | 5 |
| l_max | Max length | 30 |
| p_const | Prob. of constant leaf | 1/3 |
| p_index | Prob. of index leaf | 1/3 |
| p_var | Prob. of variable leaf | 1/3 |
| P | Distribution of first terms | U(−10, 10) |

Table 5. Hyperparameters of our generator.
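The sketch below, ours and deliberately simplified, illustrates this pipeline: it nests random operators instead of sampling proper unary-binary trees as in Lample & Charton (2019), and it skips the recomputation of the effective degree, but it follows the same sample-formula / sample-initial-terms / roll-out-and-abort logic:

```python
import math
import random

# Small operator pools for illustration (see Table 4 for the full lists).
UNARY = {"abs": abs, "sqr": lambda x: x * x, "exp": math.exp, "cos": math.cos}
BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
          "mul": lambda a, b: a * b, "div": lambda a, b: a / b}

def random_leaf(degree):
    """Constant, current index n, or previous term u_{n-i}, each with prob. 1/3."""
    r = random.random()
    if r < 1 / 3:
        c = random.randint(-10, 10)
        return lambda n, prev: c
    if r < 2 / 3:
        return lambda n, prev: n
    i = random.randint(1, degree)
    return lambda n, prev: prev[-i]

def random_expr(n_ops, degree):
    """Build a random expression f(n, prev) by stacking n_ops operators."""
    expr = random_leaf(degree)
    for _ in range(n_ops):
        if random.random() < 0.5:
            op = random.choice(list(UNARY.values()))
            expr = lambda n, prev, op=op, e=expr: op(e(n, prev))
        else:
            op, right = random.choice(list(BINARY.values())), random_leaf(degree)
            expr = lambda n, prev, op=op, l=expr, r=right: op(l(n, prev), r(n, prev))
    return expr

def sample_example(max_ops=10, max_degree=6, min_len=5, max_len=30):
    degree = random.randint(1, max_degree)                     # step 3
    f = random_expr(random.randint(1, max_ops), degree)        # steps 1, 2, 4
    terms = [random.uniform(-10, 10) for _ in range(degree)]   # step 5 (simplified)
    try:
        for n in range(degree, degree + random.randint(min_len, max_len)):  # step 6
            val = f(n, terms)
            if not math.isfinite(val) or abs(val) > 1e100:     # same guard as the paper
                return None
            terms.append(val)
    except (ZeroDivisionError, OverflowError, ValueError):
        return None                                            # e.g. division by zero
    return f, terms
```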
**Limitations.** One drawback of our approach is the poor out-of-distribution generalization of deep neural networks, demonstrated in a similar context to ours by Charton et al. (2020): when shown an example that cannot be generated by the procedure described above, the model will inevitably fail. For example, as shown in Appendix B, our model performs poorly when the initial terms of the sequence are sampled outside their usual range, which is chosen to be [−10, 10] in these experiments. One easy fix would be to increase this range, but this would only shift the problem.

### 2.2. Encodings

Model inputs are sequences of integers or floats. The outputs are recurrence relations for the symbolic task, and sequences of numbers for the numeric task. To be processed by Transformers, inputs and outputs must be represented as sequences of tokens from a fixed vocabulary. To this effect, we use the encodings presented below.

**Integers.** We encode integers using their base-b representation, as a sequence of 1 + ⌈log_b |x|⌉ tokens: a sign (which also serves as a sequence delimiter) followed by the ⌈log_b |x|⌉ digits from 0 to b − 1, for a total vocabulary of b + 2 tokens. For instance, x = −325 is represented in base b = 10 as the sequence [-, 3, 2, 5], and in base b = 30 as [-, 10, 25]. Choosing b involves a tradeoff between the length of the sequences fed to the Transformer and the vocabulary size of the embedding layer. We choose b = 10,000 and limit integers in our sequences to absolute values below 10^100, for a maximum of 26 tokens per integer and a vocabulary of order 10^4 words.

**Floats.** Following Charton (2021), we represent float numbers in base-10 floating-point notation, round them to four significant digits, and encode them as sequences of 3 tokens: their sign, mantissa (between 0 and 9999), and exponent (from E-100 to E100). For instance, 1/3 is encoded as [+, 3333, E-4]; by convention, we represent zero as [+, 0, E0]. Again, the vocabulary is of order 10^4. For all operations in floating-point representation, precision is limited to the length of the mantissa. In particular, when summing elements with different magnitudes, subdominant terms may be rounded away. Partly due to this effect, when approximating complex functions, our symbolic model typically only predicts the largest terms of the asymptotic expansion, as shown in Table 3. We discuss two methods for increasing precision when needed in Section E of the Appendix.

**Recurrence relations.** To represent mathematical trees as sequences, we enumerate the trees in prefix order, i.e. direct Polish notation, and encode each node as a single autonomous token. For instance, the expression cos(3x) is encoded as [cos, mul, 3, x]. Note that the generation procedure implies that the recurrence relation is not simplified (i.e. expressions like 1 + u_{n-1} − 1 can be generated). We tried simplifying them using SymPy before the encoding step (see Appendix C), but this slows down generation without any benefit on the performance of our models, which turn out to be surprisingly insensitive to the syntax of the formulas.
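A small sketch of the number encodings described above (ours; only the token layout follows the paper, the exact rounding and overflow conventions are our assumptions):

```python
import math

def encode_int(x: int, base: int = 10_000) -> list:
    """Sign token (also a delimiter) followed by base-`base` digits, most significant first."""
    sign = "+" if x >= 0 else "-"
    digits, x = [], abs(x)
    while True:
        digits.append(x % base)
        x //= base
        if x == 0:
            break
    return [sign] + digits[::-1]

def encode_float(x: float) -> list:
    """Sign, four-significant-digit mantissa in [0, 9999], and exponent token."""
    if x == 0:
        return ["+", 0, "E0"]                    # zero convention from the paper
    sign = "+" if x > 0 else "-"
    exp = math.floor(math.log10(abs(x))) - 3     # keep a 4-digit mantissa
    mantissa = round(abs(x) / 10 ** exp)
    if mantissa == 10_000:                       # rounding spill-over, e.g. 0.99995
        mantissa, exp = 1000, exp + 1
    return [sign, mantissa, f"E{exp}"]

print(encode_int(-325, base=10))   # ['-', 3, 2, 5]
print(encode_int(-325, base=30))   # ['-', 10, 25]
print(encode_float(1 / 3))         # ['+', 3333, 'E-4']
```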
### 2.3. Experimental details

Similarly to Lample & Charton (2019), we use a simple Transformer architecture (Vaswani et al., 2017) with 8 hidden layers, 8 attention heads and an embedding dimension of 512, both for the encoder and the decoder.

**Training and evaluation.** The tokens generated by the model are supervised via a cross-entropy loss. We use the Adam optimizer, warming up the learning rate from 10^-7 to 2·10^-4 over the first 10,000 steps, then decaying it as the inverse square root of the number of steps, following Vaswani et al. (2017). We train each model for a minimum of 250 epochs, each epoch containing 5M equations in batches of 512. On 16 GPUs with Volta architecture and 32 GB of memory, one epoch is processed in about an hour. After each epoch, we evaluate the in-distribution performance of our models on a held-out dataset of 10,000 equations. Unless specified otherwise, we generate expressions using greedy decoding. Note that nothing prevents the symbolic model from generating an invalid expression such as [add, 1, mul, 2]. These mistakes, counted as invalid predictions, tend to be very rare: in models trained for more than a few epochs, they occur in less than 0.1% of the test cases.

**Hypothesis ranking.** To predict recurrence relations for OEIS sequences (Section 4.1), or the examples displayed in Tables 1, 2 and 3, we used a beam size of 10. Usually, hypotheses in the beam are ranked according to their log-likelihood, normalized by the sequence length. For our symbolic model, we can do much better, by ranking beam hypotheses according to how well they approximate the initial terms of the original sequence. Specifically, given a recurrence relation of degree d, we recompute for each hypothesis the first n_input terms of the sequence (using the first d terms of the original sequence as initial conditions), and rank the candidates according to how well they approximate these terms. This ranking procedure is impossible for the numeric model, which can only perform extrapolation. In some sense, this relates our method to evolutionary algorithms, which use the input terms for candidate selection. Yet, as shown by the failure modes presented in Section G of the Appendix, our model is less prone to overfitting, since the candidates are not directly chosen by minimizing a loss on the input terms.

## 3. In-domain generalization

We begin by probing the in-domain accuracy of our model, i.e. its ability to generalize to unseen sequences generated with the same procedure as the training data. As discussed in Section D of the Appendix, the diversity of mathematical expressions and the random sampling of the initial terms ensure that almost all the examples presented at test time have not been seen during training: one cannot attribute the generalization to mere memorization.

**Prediction accuracy.** Due to the large number of equivalent ways one can represent a mathematical expression, one cannot directly evaluate the accuracy by comparing, token by token, the predicted recurrence relation to the ground truth. Instead, we use the predicted expression to compute the next n_pred terms of the sequence, {û_i}, and compare them with those computed from the ground truth, {u_i}. The prediction accuracy is then defined as

acc(n_pred, τ) = P( max_{1 ≤ i ≤ n_pred} |û_i − u_i| / |u_i| ≤ τ ).

By choosing a small enough τ > 0 and a large enough n_pred, one can ensure that the predicted formula matches the true formula. In the float setup, τ = 0 must be avoided for two reasons: (i) equivalent solutions represented by different expressions may evaluate differently due to finite machine precision, and (ii) setting τ = 0 would penalize the model for ignoring sub-dominant terms which are undetectable due to the finite-precision encodings. Hence, we select τ = 10^-10 and n_pred = 10 unless specified otherwise. (For the float numeric model, which can only predict values up to a finite precision ε, we round the values of the target function to the same precision; this explains the plateau of the accuracy at τ < ε in Figure 1.) Occasionally, we will consider larger values of τ, to assess the ability of our model to provide approximate expressions.

| Model | Integer, n_op ≤ 5 | Integer, n_op ≤ 10 | Float, n_op ≤ 5 | Float, n_op ≤ 10 |
|---|---|---|---|---|
| Symbolic | 92.7 | 78.4 | 74.2 | 43.3 |
| Numeric | 83.6 | 70.3 | 45.6 | 29.0 |

Table 6. Average in-distribution accuracies of our models. We set τ = 10^-10 and n_pred = 10.
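Read as code, the metric amounts to the following sketch (ours; the guard against zero ground-truth values is an assumption, since the text does not specify that case):

```python
import numpy as np

def accuracy(pred_seqs, true_seqs, tol=1e-10):
    """Fraction of test sequences whose n_pred predicted terms all fall within
    relative error `tol` of the ground truth, i.e. acc(n_pred, tau) above."""
    hits = 0
    for pred, true in zip(pred_seqs, true_seqs):
        pred, true = np.asarray(pred, float), np.asarray(true, float)
        rel_err = np.abs(pred - true) / np.maximum(np.abs(true), 1e-100)
        hits += float(np.max(rel_err) <= tol)
    return hits / len(true_seqs)

# Two test sequences with n_pred = 3 terms each; only the first one is a hit.
print(accuracy([[1.0, 2.0, 4.0], [1.0, 1.0, 2.0]],
               [[1.0, 2.0, 4.0], [1.0, 1.0, 3.0]]))  # 0.5
```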
**Results.** The average in-distribution accuracies of our models are reported in Table 6, with τ = 10^-10 and n_pred = 10. Although the float setup is significantly harder than the integer setup, our symbolic model reaches good accuracies in both cases. In comparison, the numeric model obtains worse results, particularly in the float setup. The interested reader may find additional training curves, attention maps and visualizations of the model predictions in Appendix G.

**Ablation over the evaluation metric.** The first two panels of Figure 1 illustrate how the accuracy changes as we vary the tolerance level τ and the number of predictions n_pred. The first important observation is that the symbolic model performs much better than the numeric model at low tolerance. At high tolerance, it still keeps an advantage in the integer setup, and performs similarly in the float setup. This demonstrates the advantage of a symbolic approach for high-precision predictions. Second, we see that the accuracy of both models degrades as we increase the number of predictions n_pred, as one could expect. However, the decrease is less pronounced for the symbolic model, especially in the float setup, where the curve is essentially flat. This demonstrates another strong advantage of the symbolic approach: once it has found the correct formula, it can predict the whole sequence, whereas the precision of the numeric model deteriorates as it extrapolates further.

**Ablation over the example difficulty.** The three last panels of Figure 1 decompose the accuracy of our two models along three factors of difficulty: the number of operators o (since expressions are not simplified, o may be overestimated), the recurrence degree d and the sequence length n_input (see Section 2.1).

[Figure 1: plots omitted.]
Figure 1. The symbolic model extrapolates further and with higher precision than the numeric model. From left to right, we vary the tolerance τ, the number of predictions n_pred, the number of operators o, the recurrence degree d and the number of input terms l. In each plot, we use the following defaults for quantities which are not varied: τ = 10^-10, n_pred = 10, o ∈ [[1, 10]], d ∈ [[1, 6]], l ∈ [[5, 30]].

[Figure 2: bar charts omitted; (a) In-domain groups: base, div, sqrt, exp, trig; (b) Out-of-domain groups: poly, hyper, fresnel, bessel.]
Figure 2. Accuracy of our models on various in-domain and out-of-domain groups. We set τ = 10^-10 and n_pred = 10.

Unsurprisingly, accuracy degrades rapidly as the number of operators increases, particularly in the float setting where the operators are more diverse: the accuracy of the symbolic model drops from 100% for o = 1 to 10% for o = 10.
We attempted a curriculum learning strategy to alleviate the drop, by giving higher probabilities to expressions with many operators as training advances, but this did not bring any improvement. Increasing the recurrence degree has a similar but more moderate effect: the accuracy decreases from 70% for d = 0 (non-recurrent expressions) to 20% for d = 6 in the float setup. Finally, we observe that shorter sequences are harder to predict, as they give less information on the underlying recurrence; however, even when fed with fewer than 10 terms, our models achieve surprisingly high accuracies.

**Ablation over operator families.** To understand what kinds of operators are the hardest for our float model, we bunch them into 5 groups:

- base: {add, sub, mul}
- division: base + {div, inv}
- sqrt: base + {sqrt}
- exponential: base + {exp, log}
- trigonometric: base + {sin, cos, tan, arcsin, arccos, arctan}

Results are displayed in Figure 2a. We see that the main difficulties lie in the division and trigonometric operators, but the performance of both models stays rather good in all categories.

**Visualizing the embeddings.** To give more intuition on the inner workings of our symbolic models, we display a t-SNE (van der Maaten & Hinton, 2008) projection of the embeddings of the integer model in Figure 3a and of the exponent embeddings of the float model in Figure 3b. Both reveal a sequential structure, with the embeddings organized in a clear order, as highlighted by the color scheme. In Appendix F, we study in detail the pairwise distances between embeddings, unveiling interesting features such as the fact that the integer model naturally learns a base-6 representation.

[Figure 3: t-SNE plots omitted; (a) Integer embeddings, (b) Exponent embeddings.]
Figure 3. The number embeddings reveal intriguing mathematical structure. We represent the t-SNE of the embeddings of the integer model and the exponent embeddings of the float model. We depict the first 100 integer embeddings (10,000 in the model), and the exponent embeddings −40 to 40 (−100 to 100 in the model).

## 4. Out-of-domain generalization

In this section, we evaluate the ability of our model to generalize out-of-domain. Recurrence prediction being a previously unexplored branch of symbolic regression, there are no official benchmarks we can compare our models to. For integer sequences, we use a subset of OEIS as our out-of-domain benchmark; for float sequences, we use a generator with out-of-vocabulary constants and operators. In Appendix A, we also show that our models can be made robust to noise in the inputs.

### 4.1. Integer sequences: OEIS dataset

The Online Encyclopedia of Integer Sequences (OEIS) is an online database containing over 300,000 integer sequences. It is tempting to directly use OEIS as a testbed for prediction; however, many sequences in OEIS do not have a closed-form recurrence relation, such as the stops on the New York City Broadway line subway (A000053). These will naturally cause our model to fail.

**Preprocessing.** Luckily, OEIS comes with keywords, and 22% of the sequences are labelled as "easy", meaning that there is a logic to find the next terms (although this logic is by no means easy in most cases). Note that this logic cannot always be formulated as a recurrence relation: for example, the sequence of primes or the decimals of π are included in this category, but are intractable for our models. We keep the first 10,000 of these sequences as our testbed.
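A minimal sketch of this preprocessing and of the evaluation split used below (ours; it assumes the OEIS entries and their keyword lists have already been retrieved in some form):

```python
def easy_testbed(oeis_entries, max_sequences=10_000):
    """Keep the first `max_sequences` OEIS sequences labelled 'easy'.
    `oeis_entries`: iterable of (id, terms, keywords) triples, assumed pre-downloaded."""
    kept = []
    for seq_id, terms, keywords in oeis_entries:
        if "easy" in keywords:
            kept.append((seq_id, terms))
            if len(kept) == max_sequences:
                break
    return kept

def split_sequence(terms, n_input=25, n_pred=10):
    """First n_input terms are shown to the model; the next n_pred are held out."""
    return terms[:n_input], terms[n_input:n_input + n_pred]

# Example with A008954 (final digit of n(n+1)/2):
shown, target = split_sequence([n * (n + 1) // 2 % 10 for n in range(40)])
```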
**Evaluation** consists in showing our models the first n_input ∈ {15, 25} terms of each sequence and asking them to predict the n_pred ∈ {1, 10} following terms.

**Results.** Results are reported in Table 7. With only n_input = 15 terms, the numeric model reaches an impressive accuracy of 53% at next-term prediction, and 27% for predicting the next ten terms. The symbolic model achieves lower results, with 33% and 19% respectively; we attribute this to the large number of non-analytic sequences in the testbed. Nonetheless, this means that our model can retrieve a valid recurrence relation for almost a fifth of the sequences, which is rather impressive: we give a few interesting examples in Table 1. Increasing n_input to 25 improves our performance rather marginally.

As a comparison, we ran two built-in Mathematica functions on the task at hand: FindSequenceFunction, which finds non-recurrent expressions, and FindLinearRecurrence, which finds linear recurrence relations. These functions are much more sensitive to the number of terms given as input: they obtain similar accuracies at n_input = 15, but FindLinearRecurrence performs significantly better at n_input = 25, while FindSequenceFunction performs pathologically worse. Both functions perform less well than our symbolic model in all cases.

### 4.2. Float sequences: robustness to out-of-vocabulary tokens

One major difficulty in symbolic mathematics is dealing with out-of-vocabulary constants and operators: the model is forced to approximate them using its own vocabulary. We investigate these two scenarios separately for the float model.

**Out-of-vocabulary constants.** The first possible source of out-of-vocabulary tokens is prefactors. For example, a formula as simple as u_n = 0.33n is hard to predict perfectly, because our decoder only has access to integers between −10 and 10 and a few mathematical constants, and needs to write 0.33 as [div, add, mul, 3, 10, 3, mul, 10, 10]. To circumvent this issue, Biggio et al. (2021) use a separate optimizer to fit the prefactors once the skeleton of the equation is predicted. In contrast, our model is end-to-end, and is surprisingly good at approximating out-of-vocabulary prefactors with its own vocabulary. For u_n = 0.33n, one could expect the model to predict u_n = n/3, which would be a decent approximation. Yet, our model goes much further, and outputs u_n = cos(3)n/3, which is a better approximation. We give a few other spectacular examples in Table 2. Our model is remarkably capable of using the values of operators such as exp and arctan, as if it were able to perform computations internally.

To investigate the approximation capabilities of our model systematically, we evaluate its performance when sampling the prefactors uniformly in [−10, 10], rather than in {−10, −9, ..., 9, 10} ∪ {e, π, γ} as done usually. It is impossible for the symbolic model to perfectly represent the correct formulas, but since we are interested in its approximation capabilities, we set the tolerance to 0.01. Results are shown in Table 8. Unsurprisingly, the performance of the numeric model is unaffected, as it does not suffer from any out-of-vocabulary issues, and it becomes better than the symbolic model. However, the symbolic model maintains very decent performance, with its approximation accuracy dropping from 60% to 42%. This approximation ability is in itself an impressive feat, as the model was not explicitly trained to achieve it. It can potentially have strong applications for mathematics, as exploited by the recently proposed Ramanujan Machine (Raayoni et al., 2021): for example, if a sequence converges to a numerical value such as 1.64493, it can be useful to ask the model to approximate it, yielding π²/6. In fact, one could further improve this approximation ability by training the model only on degree-0 sequences with constant leaves; we leave this for future work.
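For instance, the probe described in the caption of Table 2 can be reproduced as follows (our illustration; the model call itself is omitted):

```python
import math

# To ask the model for a closed form of a constant C, we feed it the first
# 25 terms of u_n = C * n (Table 2); the sequence below is what the model sees.
C = 1.644934
probe = [C * n for n in range(25)]

# The closed form it returns for this constant, pi^2 / 6, is indeed very close:
print(abs(math.pi ** 2 / 6 - C) / C)   # ~4e-8 relative error
```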
| Model | n_input = 15, n_pred = 1 | n_input = 15, n_pred = 10 | n_input = 25, n_pred = 1 | n_input = 25, n_pred = 10 |
|---|---|---|---|---|
| Symbolic (ours) | 33.4 | 19.2 | 34.5 | 21.3 |
| Numeric (ours) | 53.1 | 27.4 | 54.9 | 29.5 |
| FindSequenceFunction | 17.1 | 12.0 | 8.1 | 7.2 |
| FindLinearRecurrence | 17.4 | 14.8 | 21.2 | 19.5 |

Table 7. Accuracy of our integer models and of Mathematica functions on OEIS sequences. We use as input the first n_input ∈ {15, 25} terms of OEIS sequences and ask each model to predict the next n_pred ∈ {1, 10} terms. We set the tolerance τ = 10^-10.

| Model | [[−10, 10]] ∪ {e, π, γ}, n_op ≤ 5 | [[−10, 10]] ∪ {e, π, γ}, n_op ≤ 10 | U(−10, 10), n_op ≤ 5 | U(−10, 10), n_op ≤ 10 |
|---|---|---|---|---|
| Symbolic | 81.9 | 60.7 | 60.1 | 42.1 |
| Numeric | 72.4 | 60.4 | 72.2 | 60.2 |

Table 8. Our symbolic model can approximate out-of-vocabulary prefactors. We report the accuracies achieved when sampling the constants uniformly from [[−10, 10]] ∪ {e, π, γ}, as during training, versus sampling them uniformly in [−10, 10]. We set τ = 0.01 (note the higher tolerance threshold, as we are considering approximation) and n_pred = 10.

**Out-of-vocabulary functions.** A similar difficulty arises when dealing with out-of-vocabulary operators; yet again, our model is able to express or approximate them with its own vocabulary. We show this by evaluating our model on various families of functions from scipy.special:

- polynomials: base + orthogonal polynomials of degree 1 to 5 (Legendre, Chebyshev, Jacobi, Laguerre, Hermite, Gegenbauer)
- hyperbolic: base + {sinh, cosh, tanh, arccosh, arcsinh, arctanh}
- bessel: base + {Bessel and modified Bessel functions of the first and second kinds}
- fresnel: base + {erf, Faddeeva, Dawson and Fresnel integrals}

The results in Figure 2b show that both the numeric and symbolic models cope surprisingly well with these functions. The symbolic model has more contrasted results than the numeric model, and excels particularly on functions which it can easily build with its own vocabulary, such as polynomials. Surprisingly however, it also outperforms the numeric model on the other groups. In Table 3, we show a few examples of the remarkable ability of our model to yield high-quality asymptotic approximations, either using a recurrence relation, as for the Catalan numbers and the Dawson function, or with a non-recurrent expression, as for the Bessel functions.
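As a quick numerical sanity check of the Bessel approximation reported in Table 3 (our own verification, not part of the paper's pipeline), the predicted expression matches the leading term of the asymptotic expansion of J0:

```python
import numpy as np
from scipy.special import j0

# Compare J0(n) with the approximation predicted by the symbolic model,
# (sin(n) + cos(n)) / sqrt(pi * n), for moderately large n.
n = np.arange(5, 30, dtype=float)
approx = (np.sin(n) + np.cos(n)) / np.sqrt(np.pi * n)
print(np.max(np.abs(j0(n) - approx)))  # ~1e-2 at n = 5, shrinking as n grows
```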
## 5. Conclusion

In this work, we have shown that Transformer models can successfully infer recurrence relations from observations. We applied our model to challenging out-of-distribution tasks, showing that it outperforms Mathematica functions on integer sequences and yields informative approximations of complex functions as well as numerical constants.

**Scope of our approach.** One may ask to what extent our model can be used for real-world applications, such as time-series forecasting. Although robustness to noise is an encouraging step in this direction, we believe our model is not directly adapted to such applications, for two reasons. First, real-world time series often cannot be described by a simple mathematical formula, in which case numeric approaches will generally outperform symbolic approaches. Second, even when they can be expressed by a formula, the latter will contain complex prefactors and non-deterministic terms which will make the task extremely challenging.

**Future directions.** To bring our model closer to real-world problems, we need our model to be able to handle arbitrary prefactors. We introduce a method to solve this problem in our follow-up work, by introducing numeric tokens in the decoder (Kamienny et al., 2022). Another important extension of our work is the setting of multi-dimensional sequences. With two dimensions, one could study complex sequences, which are a well-studied branch of mathematics given their relationship with fractals. Finally, recurrence relations being a discrete version of differential equations, the most natural extension of this work is to infer differential equations from trajectories; this will be an important direction for future work.

## References

Allamanis, M., Chanthirasegaran, P., Kohli, P., and Sutton, C. Learning continuous semantic representations of symbolic expressions. In International Conference on Machine Learning, pp. 80–88. PMLR, 2017.

Arabshahi, F., Singh, S., and Anandkumar, A. Towards solving differential equations through neural programming. In ICML Workshop on Neural Abstract Machines and Program Induction (NAMPI), 2018.

Augusto, D. A. and Barbosa, H. J. Symbolic regression via genetic programming. In Proceedings, Vol. 1, Sixth Brazilian Symposium on Neural Networks, pp. 173–178. IEEE, 2000.

Biggio, L., Bendinelli, T., Neitz, A., Lucchi, A., and Parascandolo, G. Neural symbolic regression that scales. arXiv preprint arXiv:2106.06427, 2021.

Charton, F. Linear algebra with transformers. arXiv preprint arXiv:2112.01898, 2021.

Charton, F., Hayat, A., and Lample, G. Learning advanced mathematical computations from examples. arXiv preprint arXiv:2006.06462, 2020.

Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Davies, A., Velickovic, P., Buesing, L., Blackwell, S., Zheng, D., Tomasev, N., Tanburn, R., Battaglia, P., Blundell, C., Juhasz, A., et al. Advancing mathematics by guiding human intuition with AI. Nature, 2021.

Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kamienny, P.-A., d'Ascoli, S., Lample, G., and Charton, F. End-to-end symbolic regression with transformers. arXiv preprint arXiv:2204.10532, 2022.

Kim, S., Lu, P. Y., Mukherjee, S., Gilbert, M., Jing, L., Čeperić, V., and Soljačić, M. Integration of neural network-based symbolic regression in deep learning for scientific discovery. IEEE Transactions on Neural Networks and Learning Systems, 2020.

Lample, G. and Charton, F. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412, 2019.

McKay, B., Willis, M. J., and Barton, G. W. Using a tree structured genetic algorithm to perform symbolic regression. In First International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, pp. 487–492. IET, 1995.

Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B., Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., et al. SymPy: symbolic computing in Python. PeerJ Computer Science, 3:e103, 2017.
Murari, A., Peluso, E., Gelfusa, M., Lupelli, I., Lungaroni, M., and Gaudio, P. Symbolic regression via genetic programming for data driven derivation of confinement scaling laws without any assumption on their mathematical form. Plasma Physics and Controlled Fusion, 57(1):014008, 2014.

Nam, H., Kim, S., and Jung, K. Number sequence prediction problems for evaluating computational powers of neural networks, 2018.

Paccanaro, A. and Hinton, G. E. Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232–244, 2001.

Petersen, B. K., Larma, M. L., Mundhenk, T. N., Santiago, C. P., Kim, S. K., and Kim, J. T. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv preprint arXiv:1912.04871, 2019.

Raayoni, G., Gottlieb, S., Manor, Y., Pisha, G., Harris, Y., Mendlovic, U., Haviv, D., Hadad, Y., and Kaminer, I. Generating conjectures on fundamental constants with the Ramanujan Machine. Nature, 590(7844):67–73, 2021.

Ragni, M. and Klein, A. Predicting numbers: An AI approach to solving number series. In Bach, J. and Edelkamp, S. (eds.), KI 2011: Advances in Artificial Intelligence, pp. 255–259. Springer Berlin Heidelberg, 2011.

Ryskina, M. and Knight, K. Learning mathematical properties of integers. arXiv preprint arXiv:2109.07230, 2021.

Sahoo, S., Lampert, C., and Martius, G. Learning equations for extrapolation and control. In International Conference on Machine Learning, pp. 4442–4450. PMLR, 2018.

Saxton, D., Grefenstette, E., Hill, F., and Kohli, P. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557, 2019.

Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.

Sloane, N. J. The On-Line Encyclopedia of Integer Sequences. In Towards Mechanized Mathematical Assistants, pp. 130–130. Springer, 2007.

Trask, A., Hill, F., Reed, S., Rae, J., Dyer, C., and Blunsom, P. Neural arithmetic logic units. arXiv preprint arXiv:1808.00508, 2018.

Valipour, M., You, B., Panju, M., and Ghodsi, A. SymbolicGPT: A generative transformer model for symbolic regression. arXiv preprint arXiv:2106.14131, 2021.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wu, C. W. Can machine learning identify interesting mathematics? An exploration using empirically observed laws. arXiv preprint arXiv:1805.07431, 2018.

## A. Robustness to noise

One particularity of our model is that it is entirely trained and evaluated on synthetic data which is completely noise-free. Can our model also predict recurrence relations when the inputs are corrupted? In this section, we show that the answer is yes, provided the model is trained with noisy inputs. For simplicity, we restrict ourselves here to the setup of float sequences, but the setup of integer sequences can be dealt with in a similar manner, trading continuous random variables for discrete ones.

**Setup.** Considering the wide range of values that are observed in recurrent sequences, corruption via additive noise with constant variance, i.e. u_n = f(n, {u_i}_{i<n})