# Deep Generative Symbolic Regression with Monte-Carlo-Tree-Search

Pierre-Alexandre Kamienny¹ ² Guillaume Lample¹ Sylvain Lamprier³ Marco Virgolin⁴

¹Meta AI, Paris, France. ²ISIR MLIA, Sorbonne Université, France. ³LERIA, Université d'Angers, France. ⁴Centrum Wiskunde & Informatica, the Netherlands. Correspondence to: Pierre-Alexandre Kamienny.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Symbolic regression (SR) is the problem of learning a symbolic expression from numerical data. Recently, deep neural models trained on procedurally-generated synthetic datasets showed competitive performance compared to more classical Genetic Programming (GP) algorithms. Unlike their GP counterparts, these neural approaches are trained to generate expressions from datasets given as context. This allows them to produce accurate expressions in a single forward pass at test time. However, they usually do not benefit from search abilities, which results in low performance compared to GP on out-of-distribution datasets. In this paper, we propose a novel method which provides the best of both worlds, based on a Monte-Carlo Tree Search procedure using a context-aware neural mutation model, which is initially pre-trained to learn promising mutations, and further refined from successful experiences in an online fashion. The approach demonstrates state-of-the-art performance on the well-known SRBench benchmark.

## 1. Introduction

Symbolic Regression (SR) is the problem of simultaneously finding the structure (operators, variables) and optimising the parameters (constants) of an expression that describes experimental data in an interpretable manner. SR can produce human-readable expressions that are particularly useful in natural sciences, e.g., materials sciences (Wang et al., 2019; Kabliman et al., 2021; Ma et al., 2022) or physics (Schmidt & Lipson, 2009; Vaddireddy et al., 2020; Sun et al., 2022; Cranmer et al., 2020; Hernandez et al., 2019; Udrescu & Tegmark, 2020). The recent benchmarking effort SRBench (La Cava et al., 2021) has shown that SR algorithms can additionally outperform over-parametrized models, e.g., decision-tree ensembles or neural networks, on a set of real-world and synthetic datasets.

SR is a challenging task: it requires composing inherently discrete objects (operators and variables) so that the resulting expression fits the given data well. As seeking the optimal composition is intractable (Virgolin & Pissis, 2022), typical approaches to SR are based on heuristics. In fact, the leading algorithms on SRBench are modern versions of genetic programming (GP) (Koza, 1994), a class of genetic algorithms where evolving solutions need to be executed to determine their quality (i.e., SR expressions need to be evaluated on the data to determine their quality of fit).

Lately, there has been a growing interest in the SR community for neural network-based approaches. For example, Udrescu & Tegmark (2020) use neural networks (NNs) to explicitly detect data properties (e.g., symmetries in the data) useful to reduce the search space of SR. Other approaches, which we group together under the name of Deep Generative Symbolic Regression (DGSR), have used NNs to learn to generate expressions directly from the data.

Similarly to GP, DSR (Petersen et al., 2019) and Mundhenk et al. (2021) face a tabula rasa setup of the problem for each new dataset. On the other hand, several DGSR approaches are inductive: they are pre-trained to predict an expression in a single forward pass for any new dataset, by feeding the dataset as input tokens (Biggio et al., 2021; Valipour et al., 2021; Kamienny et al., 2022). As such, these approaches have the appeal of generating expressions extremely quickly. However, their lack of a search component makes them unable to improve for the specific dataset at hand. This aspect can be particularly problematic when the given data is out-of-distribution compared to the synthetic data the NN was pre-trained upon.

A promising direction to cope with the limitations of inductive DGSR is therefore to include a suitable search strategy. The use of neural policies in Monte-Carlo Tree Search (MCTS) has led to improved exploration via long-term planning in the context of automated theorem proving with formal systems (Polu & Sutskever, 2020; Lample et al., 2022), algorithm discovery (Fawzi et al., 2022), and also games (Silver et al., 2016; 2018). In the context of SR, search tree nodes are expressions (e.g., x1 + x2), and edges between them represent mutations (e.g., x2 ← 7 · x3, leading to the expression x1 + 7 · x3). White et al. (2015) and Sun et al. (2022) proposed a classic formulation of MCTS for SR, where possible mutations are pre-defined along with prioritization rules to decide which nodes to mutate first. Li et al. (2019) use a recurrent neural network to produce possible mutations, which is conditioned on shape constraints for the expressions, but not on the dataset. Most similar to our approach is the study of Lu et al. (2021), which uses a pre-trained model to sample promising but rather simple mutations (up to one expression term). However, the model is not fine-tuned on the specific dataset at hand as the search progresses. In our proposal, MCTS is augmented with neural policies that are both pre-trained and fine-tuned. Existing works have only shown good performance on simple benchmark problems (e.g., no observed competitive performance on SRBench real-world problems). Furthermore, we found in preliminary experiments that combining NNs within MCTS with pre-training is key to achieving good performance.

Summary of the approach. In this paper, we seek to overcome the limitations of DGSR by proposing a synergistic combination, which we call DGSR-MCTS, where MCTS is seeded with pre-trained DGSR, and DGSR is fine-tuned over time on multiple datasets simultaneously as the search progresses. A mutation policy is responsible for producing mutations that expand expressions from the tree search. A selection policy prioritizes which expression to mutate next, by trading off exploration (low number of visits) against exploitation (most promising expressions) using a critic network. The mutation policy first undergoes a pre-training phase. Then, both the mutation policy and the critic network are updated in an online fashion, by leveraging results from search trials on new provided datasets. We first position our approach within a unifying view of SR that encapsulates both earlier work in GP and more recent deep learning-based approaches. Then, we describe our approach in detail and evaluate it on the SRBench benchmark. We show that our approach achieves state-of-the-art performance, and discovers expressions that are both simple and accurate.

## 2. Unifying View of SR

Let us denote by F the family of symbolic expressions¹ that form the search space of the SR problem. Generally speaking, F is defined by the set of building blocks from which expressions can be composed (Virgolin & Pissis, 2022). These usually include constants (possibly sampled from a distribution, e.g., N(0, 1)), variables, and basic operators as well as trigonometric or transcendental ones (e.g., +, −, ×, ÷, sin, cos, exp, log).

¹ Note that even though different expressions may be functionally equivalent, this is normally not taken into account in existing approaches, as determining functional equivalence can be undecidable (Buchberger & Loos, 1982).

A common formulation of SR poses to seek a well-fitting expression for the dataset at hand. We now generalize this idea to the case where good performance is sought across multiple datasets (as one usually seeks a generally-competent search algorithm rather than a dataset-specific one). Given Ω, a distribution over datasets $D = \{(x_i, y_i)\}_{i \le N}$, with $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ a given point (descriptor, image) from a specific dataset D, and a limited budget T for the exploration process, the general objective of SR is to define an algorithm that produces distributions over expressions f ∈ F that minimize the following theoretical risk at step T:

$$\mathcal{R}_{\Omega,\mathcal{F}} = \mathbb{E}_{D \sim \Omega}\, \mathbb{E}_{f \sim P^T_f(D)}\big[\mathcal{L}(f, D)\big] \tag{1}$$

with $P^T_f(D)$ the final distribution over expressions provided by the considered algorithm. A typical loss function $\mathcal{L}$ considered in the SR literature is the negative R²:

$$R^2(f, D) = 1 - \frac{\mathrm{MSE}(y, f(x))}{\mathrm{VAR}(y)} = 1 - \frac{\sum_i \big(y_i - f(x_i)\big)^2}{\sum_i \big(y_i - \bar{y}\big)^2}, \tag{2}$$

where $\bar{y} = (1/N)\sum_i y_i$.

Considering algorithms that start from $P^0_f(D)$ to incrementally build $P^T_f(D)$, the problem can be decomposed into two main steps: 1) define an accurate starting point $P^0_f(D)$ for the search process; 2) specify the exploration process that updates $P^0_f(D)$ to form $P^T_f(D)$ in a maximum of T iterations (search trials). Some approaches in the literature only consider the first step (i.e., $P^T_f(D) = P^0_f(D)$, no search process). Others only investigate step 2 (i.e., $P^0_f(D) = P^0_f(\emptyset)$, no inductive context), as described in the following.

There are many ways to define $P^0_f$. For example, traditionally in GP, the dataset does not play a role (i.e., it is $P^0_f(\emptyset)$), but there exist different strategies to randomly initialize the population (Looks, 2007; Pawlak & Krawiec, 2016; Ahmad & Helmuth, 2018). Since we deal with DGSR, from here on we focus on definitions of $P^0_f(D)$ that entail pre-training. Note that Ω is unknown during this pre-training, and that we only observe its datasets at search time (datasets from SRBench in our experiments).
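
For concreteness, the loss in Equation (2) can be computed directly from a dataset and a candidate expression. The snippet below is a minimal NumPy sketch (not code from the paper); the callable interface assumed for the candidate `f` is purely illustrative.

```python
import numpy as np

def neg_r2(f, X, y):
    """Negative R^2 of a candidate expression on a dataset D = (X, y), as in Equation (2).

    `f` is assumed to be a callable mapping an (N, d) input array to (N,) predictions;
    this interface is an illustration, not the paper's implementation.
    """
    y = np.asarray(y, dtype=float)
    residual = np.sum((y - f(X)) ** 2)       # sum_i (y_i - f(x_i))^2
    total = np.sum((y - y.mean()) ** 2)      # sum_i (y_i - y_bar)^2, with y_bar = (1/N) sum_i y_i
    return -(1.0 - residual / total)         # loss L(f, D) = -R^2(f, D)
```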

### 2.1. Pre-training: How to define $P^0_f(D)$

The formulation of Equation (1) is reminiscent of meta-learning or few-shot learning, where the goal is to train a model on a set of training tasks such that it can solve new ones more efficiently (Schmidhuber, 1987; Thrun & Pratt, 2012; Finn et al., 2017). In that vein, many approaches focus on learning a generative model to induce an implicit form of $P^0_f$.

Table 1. Unifying view of SR. θ represents the weights of a probabilistic neural network that embodies $P_f$; ψ denotes the parameters of a critic network (see Section 3).

| Algorithm | Pre-training of $P^0_f$ | $P^t_f$ conditioned by | Update of $P^t_f$ |
| --- | --- | --- | --- |
| GP (Poli et al., 2008) | No | Population & genetic operators | Selection operators |
| EDA (Kim et al., 2014) | No | Explicit factorization | Selection operators |
| E2E (Kamienny et al., 2022) | SSL on synthetic data | D & $\theta_0$ | No |
| DSR (Petersen et al., 2019) | No | $\theta_t$ | Update $\theta_t$ with policy gradients |
| uDSR (Landajuela et al., 2022) | SSL on synthetic data | D & $\theta_t$ | Update $\theta_t$ with policy gradients |
| DGSR-MCTS (ours) | SSL and MCTS on synthetic data | MCTS using D & $\theta_t$ & $\psi_t$ | Update $\theta_t$ & $\psi_t$ via selection & imitation |

However, since Ω is unknown during pre-training, those models are usually trained on large synthetic datasets built from randomly generated expressions (Lample & Charton, 2019). Rather than focusing on the accuracy of the sampled expressions with respect to some criterion such as Equation (2), the vast majority of these approaches are neural language models (and thus belong to the DGSR family) trained to express $P^0_f$ by auto-regressively predicting a given ground-truth expression from the training data (e.g., using a recurrent neural network or a transformer). Although these methods tend to produce expressions similar to the synthetic ones seen during pre-training, they have the advantage of explicitly considering the dataset as an input: $P_f$ is conditioned on D (Biggio et al., 2021; Valipour et al., 2021; Kamienny et al., 2022). Consequently, pre-training approaches for DGSR can produce expressions in a one-shot inference process by the simple act of sampling, eliminating the need for search iterations. However, sampling from $P^0_f$ is limited, as the accuracy improvement stops after a small number of samples (50 in Kamienny et al. (2022)), and can perform badly on out-of-domain datasets given at test time.

### 2.2. Search Process: How to build $P^T_f(D)$

Given a dataset D, the aim of any process at search time is to define a procedure that builds an accurate distribution $P^T_f(D)$ from the starting one $P^0_f(D)$, via a mix of exploration and exploitation. Most approaches seek to maximize $\mathbb{E}_{f \sim P^T_f}[R^2(f, D)]$. Several approaches additionally include in that objective a functional C(f) that penalizes overly complex expressions (Vladislavleva et al., 2008; Chen et al., 2018). Even though there exist different definitions of complexity (Kommenda et al., 2015; Virgolin et al., 2020), we consider expression size (i.e., the number of operators, variables and constants in the expression) as complexity measure.

To this end, the learning process of different SR algorithms can be unified into a general, iterative framework: while the termination condition is not met (e.g., runtime budget exceeded or satisfactory accuracy reached), (i) sample one or more symbolic expressions f ∈ F using the current probability distribution $P^t_f$ at step t; (ii) update $P^t_f$ using statistics of the sample, such as the accuracy of the expressions. Steps (i) and (ii) constitute an iteration of the framework. Different SR algorithms have considered various definitions of $P^t_f$ and strategies to update it. Its definition can be implicit, explicit, or learned, and be biased towards certain expression structures. Table 1 provides a non-exhaustive summary of popular algorithms, which we elaborate upon in the following paragraphs.
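
Steps (i) and (ii) can be written as a schematic loop, shown below. This is an illustrative sketch rather than any algorithm's actual implementation: `P_f` is a placeholder object whose `sample` and `update` methods stand in for the algorithm-specific choices summarized in Table 1.

```python
def generic_sr_loop(P_f, D, T, loss):
    """Unified framework of Section 2.2: iterate sampling from P_f^t and updating it.

    `P_f` is a placeholder exposing `sample(D)` and `update(stats)`; GP, EDAs, DSR
    and DGSR-MCTS differ only in how these two calls are realised (see Table 1).
    """
    best_f, best_loss = None, float("inf")
    for t in range(T):
        candidates = P_f.sample(D)                       # (i) sample expressions from P_f^t
        stats = [(f, loss(f, D)) for f in candidates]    # evaluate, e.g., with the negative R^2
        P_f.update(stats)                                # (ii) update P_f^t from the sample
        for f, l in stats:
            if l < best_loss:
                best_f, best_loss = f, l
    return best_f
```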

Genetic Programming (GP) is a meta-heuristic inspired by natural evolution and a popular approach to SR (Koza, 1994; Poli et al., 2008). GP implements step (i) by maintaining a population of expressions which undergoes stochastic modifications (via crossover and mutation), and step (ii) by selection, i.e., stochastic survival of the fittest, where better expressions are duplicated and worse ones are removed. In other words, sampling and updating $P^t_f$ is implicitly defined by the heuristics that are chosen to realize the evolution.

Estimation of Distribution Algorithms (EDAs) attempt to be more principled than GP by explicitly defining the form of $P^t_f$ (Salustowicz & Schmidhuber, 1997; Hemberg et al., 2012; Kim et al., 2014). Normally, $P^t_f$ is factorized according to the preference of the practitioner: e.g., expression terms can be modelled and sampled independently or jointly in tuples, or chosen among multiple options as the search progresses, using, e.g., the minimum description length principle (Harik et al., 1999). EDAs use methods similar to those of GP to realize step (ii).

DSR (Petersen et al., 2019) proposed an approach where $P^t_f$ is realized by an NN, whose parameters θ are updated using policy gradients with the accuracy of sampled expressions as rewards. Though very general and only biased by the model parametrization θ, this approach was found to generate very short expressions with low accuracy on SRBench (see Section 4). This is likely due to sparse-reward and credit-assignment issues typical of reinforcement learning (Sutton, 1984). Mundhenk et al. (2021) add GP search samples on top of DSR. Very recently, the work by Petersen et al. (2019) was integrated with a pre-training component (Landajuela et al., 2022), albeit in a large pipeline that also includes GP, NNs for search space reduction (Udrescu & Tegmark, 2020), and linear regression.

In the recent benchmarking effort SRBench by La Cava et al. (2021), modern GP algorithms have been shown to be the most successful. An interesting property that distinguishes GP from other DGSR approaches is that its crossover and mutation operators tend to edit existing expressions (thus preserving substantial parts of them), rather than sampling them from scratch (Koza, 1994; Poli et al., 2008). The algorithm we develop in Section 3 expands expressions over time, similarly to GP. At the same time, our algorithm is powered by pre-training and by its ability to learn over time, a feature missing from GP and other approaches.

## 3. Method

Following the unified framework of SR detailed in the previous section, we derive our method, DGSR-MCTS, as an expert-iteration algorithm (Anthony et al., 2017), which iteratively 1) samples expressions from $P^t_f$ and 2) updates $P^t_f$ by improving the distribution via imitation learning (i.e., log-likelihood maximization of solution expressions). Recall that in DGSR methods, sampling expressions from $P^t_f$ involves producing tokens step by step from a next-token distribution, with techniques such as Monte Carlo sampling or beam search. Turning pre-trained DGSR methods into ones that can update $P^t_f$ with expert iteration is challenging, as 1) the reward signal can be very sparse (i.e., very few sequences may correspond to accurate expressions for the given dataset); 2) such a blind left-to-right way of decoding does not allow for accurate planning; 3) using accuracy objectives to guide the decoding would be difficult with such auto-regressive approaches, since intermediate sequences of tokens are not valid expressions in that setting.

Rather, we propose to derive $P^t_f$ as the distribution induced by a Monte-Carlo Tree Search (MCTS) process (Browne et al., 2012), which we iteratively refine via imitation learning on the samples it produces that solve the problem. Our MCTS process also considers expression mutations, following a mutation policy $M^t_\theta$ which deals with transformations of existing expressions, rather than a greedy concatenation of tokens from a given vocabulary, allowing a more systematic evaluation of intermediate solutions for a more guided exploration. This section first explains our MCTS search process and then the way we pre-train the mutation policy from synthetic data.
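
Before detailing each component, the sketch below summarizes this expert-iteration loop. It is illustrative only: `run_mcts` and `fine_tune` are hypothetical helpers standing in for the search trial of Section 3.1 and the update of Section 3.2, respectively.

```python
def dgsr_mcts(mutation_policy, critic, datasets, n_trials):
    """Outer expert-iteration loop of DGSR-MCTS (illustrative sketch only).

    mutation_policy : pre-trained M_theta, proposes mutations conditioned on (f, D)
    critic          : C_psi, estimates how likely an expression is to lead to a solution
    """
    for _ in range(n_trials):
        experiences = []
        for D in datasets:
            # 1) sample expressions from P_f^t: one search trial of 1000 MCTS iterations
            tree = run_mcts(D, mutation_policy, critic, iterations=1000)
            experiences.append((D, tree))
        # 2) update P_f^t: imitation learning on mutation sequences that produced solutions,
        #    plus value targets for the critic; pre-training data is mixed in to avoid collapse
        fine_tune(mutation_policy, critic, experiences)
    return mutation_policy, critic
```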

### 3.1. MCTS Sampling

Our sampling process for new expressions is derived from an MCTS process, where each node of the search tree² corresponds to an expression f ∈ F. Following a similar process as in Lample et al. (2022), our MCTS process is made of three steps: 1) selection, 2) expansion, 3) back-propagation, which we detail below. In our MCTS, $M_\theta$ is used to sample promising mutations that populate the MCTS search tree via expansions (described below). Besides $M_\theta$, we additionally leverage a critic network $C_\psi$ to assess how likely an expression is to lead to a solution. We use the same NN (weights are shared) to realize $M_\theta$ and $C_\psi$, with the exception that $C_\psi$ includes an additional head that is trained during the process to output a value in $\mathbb{R}$. These three steps are iteratively repeated 1000 times, a period³ that we call a search trial, before $M_\theta$ and $C_\psi$ are updated.

² Not to be confused with tree-based representations of expressions, which we discussed in the previous section.
³ This corresponds to a single step (i) in the unifying framework of Section 2.

Selection. The selection step of MCTS follows a path in the search tree, from its root, i.e., the empty expression, to a leaf of the search tree, trading off between exploration and exploitation. Following PUCT (Silver et al., 2016), at each node f along the path, we select the child f′ that maximizes

$$V(f') + \text{puct} \cdot E(f') \cdot M_\theta(f' \mid f, D), \tag{3}$$

with $E(f') = \frac{\sqrt{\sum_{f'' \in \mathrm{child}(f)} N(f'')}}{1 + N(f')}$. Here, puct ∈ $\mathbb{R}^+$ is the exploration coefficient, N(f) is the number of times f was selected so far during the search, and V(f) is an estimate of the expected value of the expression, expressed as v(f)/N(f), with v(f) accumulating the values of all the nodes depending on f in the search tree. This selection criterion balances exploration and exploitation, controlled by the hyper-parameter puct.

Expansion. Once the selection step reaches a leaf f of the tree, this leaf is expanded to produce new child nodes by applying K mutations to f, sampled from $M_\theta(\cdot \mid f, D)$. This leads to modified expressions $\{f'_k\}_{k \le K}$. In our case, the distribution $P_f(f' \mid f, D)$ is induced by the application of $m \sim M_\theta$ on f, resulting in f′ = m(f). For each k ≤ K, we add a node $f'_k$ to the search tree, as well as an edge between $f'_k$ and f, labeled $m_k$. Each expression resulting from an expansion is evaluated (with or without constant optimization, as discussed in Section 4.1) to check whether it solves the SR problem.
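
The selection score of Equation (3) can be sketched as follows. The `SearchNode` container and the `prior_fn` callable are illustrative assumptions, and the square root in the exploration term follows the standard PUCT formulation cited above rather than being spelled out in the text.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SearchNode:
    """Minimal MCTS node container (illustrative; the real node stores an expression)."""
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

def puct_score(parent, child, prior, c_puct):
    """Selection score of Equation (3) for a child f' of node f.

    `prior` is M_theta(f' | f, D); `c_puct` is the exploration coefficient written
    `puct` in the text.
    """
    total_child_visits = sum(c.visits for c in parent.children)
    exploration = math.sqrt(total_child_visits) / (1 + child.visits)        # E(f')
    value = child.value_sum / child.visits if child.visits > 0 else 0.0     # V(f') = v(f')/N(f')
    return value + c_puct * exploration * prior

def select_child(parent, prior_fn, c_puct):
    """Follow the child maximising the PUCT score during the selection step."""
    return max(parent.children, key=lambda c: puct_score(parent, c, prior_fn(c), c_puct))
```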

Figure 1. Example of data generation to train the mutation model. Given a starting ground-truth expression (e.g., $f^\star(x_0, x_1, x_2) = 6.67\, x_1 x_2 / x_0^2$), represented as a tree, we procedurally dismantle the tree until no node is left. This is done by, at each step (red arrows), a) picking a node (dashed contour), b) removing the picked node and, if the operator is binary, additionally removing the subtree rooted in one of the two child nodes (B), and c) adding an edge (black dotted line) between the parent node and the remaining child node (A), so as to obtain a well-formed expression. When the picked node is the root node, the entire tree is removed and the dismantling stops. Then, we train the mutation model to assemble the tree back via subsequent mutations (green arrows), which revert the dismantling process. The mutation model is conditioned on the current tree (initially empty) as well as on the dataset D.

To determine whether an expression is a solution for the given dataset, we assess whether a relatively high accuracy is reached (we use R² ≥ 0.99 in our experiments). Nodes that solve the problem obtain a value v(f) = 1. Others obtain an estimate from the critic network: v(f) = $C_\psi(D, f)$. We remark that a simpler strategy would be to define v(f) as the accuracy of f, i.e., v(f) = R²(f, D). However, this strategy usually induces deceptive rewards, because a few mutations that lead to less accurate expressions may be needed before a very accurate expression is found (akin to sacrificing pieces in chess to eventually reach a winning position). We confirmed in preliminary experiments that our strategy outperforms the naive use of accuracy.

Back-propagation. After each expansion, the value of every ancestor f′ of any new node f is updated as V(f′) ← V(f′) + v(f). Note that this form of back-propagation concerns the weighing of the nodes of the MCTS search tree, and should not be confused with the algorithm used to train NNs.

As mentioned before, selection, expansion, and back-propagation are repeated 1000 times, after which the search trial is completed. At completion, the parameters of $M_\theta$ and $C_\psi$ are updated as described in Section 3.2. Finally, the MCTS search tree is reset, and built anew using the updated $M_\theta$ and $C_\psi$ during the next trial.

### 3.2. Learning critic $C_\psi$ and $M_\theta$

After each search trial, we update the two parametrized components $C_\psi$ and $M_\theta$. To that end, training samples from the previous search trials are stored in two separate first-in-first-out queues: one buffer stores the mutation sequences $(f^{(\tau)}, m_\tau)$ that produced a solution expression f, used to update $M_\theta$⁴; the other contains the V values of nodes. For the latter, nodes that lead to a solution expression f are assigned a score of 1.0. Others are considered for training only if their visit count is higher than a given threshold $N^{\min}_{\text{visits}}$, in order to focus training on sufficiently reliable estimates. At each training step, batches containing an equal proportion of mutation and critic data points are sampled from the queues to train $M_\theta$ and $C_\psi$, respectively. Both training objectives are weighted equally. To prevent any mode collapse, we continue sampling training examples from the supervised data used to pre-train the mutation model $M_\theta$, as described in Section 3.3. Note that even though the pre-training data generation is biased in different ways, the stochasticity of $M_\theta$ enables its adaptation over search trials thanks to these updates; for instance, even if $M_\theta$ was trained to output mutations with arguments B of size at most 10, it can learn mutations of size 1, and vice-versa. As a result, $M_\theta$ can automatically learn the appropriate (and dataset-dependent) size of mutations through MCTS search.

⁴ If multiple such sequences exist, we select the one with the smallest number of mutations.
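
The back-propagation rule and the two training queues described above could look as follows; this is a sketch under assumed interfaces (the `tree` methods, node attributes, and queue sizes are hypothetical), not the paper's implementation.

```python
import random
from collections import deque

# Two FIFO queues, as described in Section 3.2 (maxlen values are illustrative).
mutation_queue = deque(maxlen=50_000)   # (D, f^(tau), m_tau) pairs from solving sequences
value_queue = deque(maxlen=50_000)      # (D, f, target value) pairs for the critic C_psi

def backpropagate(node, v):
    """MCTS back-propagation: V(f') <- V(f') + v(f) for every ancestor f' of the new node."""
    while node.parent is not None:
        node = node.parent
        node.value_sum += v
        node.visits += 1            # standard MCTS visit bookkeeping (an assumption here)

def collect_trial_data(tree, D, n_min_visits):
    """Fill the two queues after a search trial; `tree` exposes a hypothetical interface."""
    for solution, sequence in tree.solving_sequences():   # shortest mutation sequence per solution
        for f_prev, m in sequence:
            mutation_queue.append((D, f_prev, m))          # imitation targets for M_theta
        value_queue.append((D, solution, 1.0))             # solution nodes are scored 1.0
    for f in tree.non_solution_nodes():
        if f.visits >= n_min_visits:                       # keep only reliable value estimates
            value_queue.append((D, f, f.value_sum / f.visits))

def balanced_batch(k):
    """Equal proportion of mutation (imitation) and critic (value) examples per batch."""
    return (random.sample(list(mutation_queue), k // 2),
            random.sample(list(value_queue), k // 2))
```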

We set up our MCTS search to operate on multiple datasets simultaneously, so that $M_\theta$ and $C_\psi$ can leverage potential information that is shared across different SR instances (transfer learning). Given a set of datasets, a controller schedules a queue of datasets, which includes a proportion of unsupervised datasets (i.e., datasets for which the ground-truth expression is not known), to send to workers in charge of handling search trials. To avoid catastrophic forgetting, we also continue training on pre-training examples from synthetic datasets, i.e., with example mutations as described in Section 3.3, in the spirit of AlphaStar (Vinyals et al., 2019) or HTPS (Lample et al., 2022) with human-annotated data. $M_\theta$ and $C_\psi$ updates are handled by trainers.

### 3.3. Mutation Policy

Our search process relies on a mutation policy $M^t_\theta$, a distribution over promising mutations of a given expression f ∈ F. In what follows, we drop the t index from $M_\theta$ for clarity. Expressions are represented as a tree structure, and mutations act by modifying it.

Definition of $M_\theta$. We define mutations as transformations that can be applied to any node of a given expression tree. Table 2 contains the list of considered transformations, where A stands for an existing node of the expression tree and B stands for a new (sub-)expression to be included in the tree. The symbol ← represents the replacement operation, e.g., A ← A + B means that node A of f is replaced by a new addition node with A as its left child and B as its right one. Thus, a valid mutation from $M_\theta$ is a triplet ⟨A, op, B⟩, where A is a node from the expression tree, op is an operation from Table 2, and B is a new expression whose generation is described below. A constant optimization step, detailed in Appendix E.2, can be performed after the mutation to better fit the data; we explore in Section 4 whether including constant optimization improves the performance. We call mutation size the size of the expression B.

Table 2. Set of operators that can be applied to a sub-expression A of a function, with optional argument B as a new sub-expression to include in the tree structure. 0 refers to the root node of the function.

| Arity | Operators |
| --- | --- |
| Unary | A ← cos(A), A ← sin(A), A ← tan(A), A ← exp(A), A ← log(A), A ← A^0.5, A ← A^(−1), A ← A², 0 ← B |
| Binary | A ← A + B, A ← A − B, A ← A × B, A ← A / B, A ← B + A, A ← B − A, A ← B × A, A ← B / A |

The mutation policy $M_\theta$ provides a distribution over possible mutations. Rather than generating B completely at random, we parameterize $M_\theta$ so that mutations are conditioned on the dataset D and on the current expression f. The dependence on D is akin to the approach in inductive DGSR works, while the dependence on f is novel. Both are passed as inputs to $M_\theta$ as sequences of tokens, f through its prefix notation and D as in Kamienny et al. (2022). We use the transformer-based NN architecture from Kamienny et al. (2022), but task the model to decode, token by token, a sequence ω (a flattened version of ⟨A, op, B⟩) until an EOS token is reached. A is represented in ω as the index of that node (i.e., an integer in [1, n] for an expression that contains n nodes). While this may allow the model to output an invalid mutation, this happens very rarely in practice, as shown in Appendix D, thanks to an efficient pre-training of the policy (described below).
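
To make the ⟨A, op, B⟩ encoding concrete, the sketch below shows a minimal expression tree and the application of one unary or binary mutation from Table 2; the `ExprNode` class, the prefix indexing, and the example values are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ExprNode:
    symbol: str                                   # operator, variable or constant token
    children: list = field(default_factory=list)

def prefix_order(tree):
    """Enumerate nodes in prefix order, so that A can be referred to by its index."""
    nodes = [tree]
    for child in tree.children:
        nodes += prefix_order(child)
    return nodes

def apply_mutation(tree, a_index, op, B=None):
    """Apply a triplet <A, op, B>: A <- op(A) if B is None, otherwise A <- A op B.

    The left/right placement of B and the root case (0 <- B) of Table 2 are omitted
    for brevity.
    """
    A = prefix_order(tree)[a_index]
    old = ExprNode(A.symbol, A.children)          # keep the original sub-expression A
    A.symbol = op
    A.children = [old] if B is None else [old, B]
    return tree

# Example: start from x1 + x2 and apply <A = x2 (prefix index 2), op = '*', B = 7>,
# which yields x1 + (x2 * 7).
expr = ExprNode("+", [ExprNode("x1"), ExprNode("x2")])
apply_mutation(expr, a_index=2, op="*", B=ExprNode("7"))
```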

We remark that our mutation distribution is different from those commonly used in GP, in that the latter are conditioned neither on D nor on parameters (i.e., they are not updated via learning), and they can also shrink the size of an expression or keep it as is, whereas our mutations strictly increase the expression size. Note that it is possible to consider mutations that remove and/or replace parts of the expression, but we leave exploring this to future work. We also restrict our mutation process to only generate expressions with fewer than 60 operators and without nesting operations other than the basic arithmetic ones (+, −, ×, ÷).

Pre-training of $M_\theta$. Since our mutation policy $M_\theta$ is expected to produce mutations for a given expression, and not the final expression directly (as is the case in the majority of DGSR approaches), it requires a specifically designed pre-training process. To that end, pre-training labeled examples (dataset & ground-truth expression with up to 10 features) are first sampled from a hand-crafted generator (Lample & Charton, 2019), as done in most pre-training neural SR (NSR) approaches (cf. Appendix A). Next, given a ground-truth expression f*, we extract a sequence of mutations $[m_l]_{l \le L}$ that iteratively map the empty expression $f^{(0)}$ to the final expression f*. As illustrated in Figure 1, starting from the ground-truth expression f*, we deconstruct f* by procedurally removing a node (and, if the node is binary, also one of its child subtrees B) from the current f until we reach the empty expression $f^{(0)}$. After reversing this sequence, we obtain a training sequence of expressions and mutations $f^{(0)} \xrightarrow{m_1} f^{(1)} \xrightarrow{m_2} f^{(2)} \xrightarrow{m_3} \cdots \xrightarrow{m_L} f^{(L)} = f^\star$ (more details in Appendix A). After tokenization, every mutation $m_l$ serves as a target for the pre-training process: $M_\theta$ is classically trained as a sequence-to-sequence encoder-decoder, using a cross-entropy loss, to sequentially output each token of $m_l$, given the considered dataset D, the previous expression $f^{(l-1)}$, and the sequence of previously generated tokens.
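
A possible sketch of the dismantling procedure of Figure 1, producing (partial expression, mutation) training pairs in rebuilding order, is given below. The uniform node choice, the restriction to operator nodes, and the encoding of mutations as (index, operator, subtree) triplets are assumptions of this illustration, not the paper's generator.

```python
import copy
import random
from dataclasses import dataclass, field

@dataclass
class ExprNode:
    symbol: str
    children: list = field(default_factory=list)

def prefix_nodes(tree, parent=None):
    """(node, parent) pairs in prefix order."""
    pairs = [(tree, parent)]
    for child in tree.children:
        pairs += prefix_nodes(child, tree)
    return pairs

def dismantle(ground_truth):
    """Build (partial expression, mutation) pre-training pairs by reversing the
    dismantling of Figure 1; left/right ordering of A and B is ignored in this sketch."""
    tree, steps = copy.deepcopy(ground_truth), []
    while tree is not None:
        candidates = [(n, p) for n, p in prefix_nodes(tree) if n.children or p is None]
        node, parent = random.choice(candidates)               # a) pick a node
        if parent is None:                                      # root picked: remove everything
            steps.append((None, (0, "root", tree)))             # inverse mutation: 0 <- B
            tree = None
            continue
        kept = random.choice(node.children)                     # b) keep child A, drop sibling B
        dropped = next((c for c in node.children if c is not kept), None)
        idx = next(i for i, c in enumerate(parent.children) if c is node)
        parent.children[idx] = kept                             # c) splice A onto the parent
        a_index = next(i for i, (n, _) in enumerate(prefix_nodes(tree)) if n is kept)
        steps.append((copy.deepcopy(tree), (a_index, node.symbol, dropped)))  # A <- A op B
    return list(reversed(steps))                                # f(0) --m1--> ... --mL--> f*
```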