Online Symbolic Regression with Informative Query

Pengwei Jin 1,2,3, Di Huang 1,2,3, Rui Zhang 1,3, Xing Hu 1, Ziyuan Nan 1,2,3, Zidong Du 1, Qi Guo 1, Yunji Chen 1,2*
1 State Key Lab of Processors, Institute of Computing Technology, CAS
2 University of Chinese Academy of Sciences
3 Cambricon Technologies
{jinpengwei20z, huangdi20b, zhangrui, huxing, nanziyuan21s, duzidong, guoqi, cyj}@ict.ac.cn
*Yunji Chen (cyj@ict.ac.cn) is the corresponding author.

Symbolic regression, the task of extracting mathematical expressions from observed data $\{x_i, y_i\}$, plays a crucial role in scientific discovery. Despite the promising performance of existing methods, most of them conduct symbolic regression in an offline setting. That is, they treat the observed data points as given, simply sampled from uniform distributions, without exploring the expressive potential of the data. For real-world scientific problems, however, the data used for symbolic regression are usually obtained actively by running experiments, which is an online setting. How to obtain informative data that can facilitate the symbolic regression process is therefore an important problem that remains challenging. In this paper, we propose QUOSR, a query-based framework for online symbolic regression that automatically obtains informative data in an iterative manner. Specifically, at each step, QUOSR receives the historical data points, generates new inputs $x$, and then queries the symbolic expression to get the corresponding $y$, where each pair $(x, y)$ serves as a new data point. This process repeats until the maximum number of query steps is reached. To make the generated data points informative, we implement the framework with a neural network and train it by maximizing the mutual information between generated data points and the target expression. Through comprehensive experiments, we show that QUOSR can facilitate modern symbolic regression methods by generating informative data.

Introduction

Symbolic regression plays a central role in scientific discovery: it extracts the underlying mathematical relationship between variables from observed data. Formally, given a physical system $f$ and $N$ data points $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^m$ and $y_i = f(x_i) \in \mathbb{R}$, symbolic regression tries to find a mathematical expression $\hat{f}: \mathbb{R}^m \to \mathbb{R}$ that best fits the physical system $f$. Conventional symbolic regression methods include genetic evolution algorithms (Augusto and Barbosa 2000; Schmidt and Lipson 2009; Chen, Luo, and Jiang 2017) and neural network methods (Udrescu and Tegmark 2020; Petersen et al. 2020; Valipour et al. 2021). Despite their promising performance, most of them perform symbolic regression in an offline setting. That is, they take the data points $D = \{(x_i, y_i)\}_{i=1}^{N}$ as given and simply sample them from uniform distributions without exploring the expressive potential of the data. In real-world scientific discovery, however, data can be collected actively by running specific experiments (Hernandez et al. 2019; Weng et al. 2020), which is an online setting. Overall, online symbolic regression is a worthwhile and challenging problem that remains underexplored. The unique problem introduced by online symbolic regression is how to actively collect informative data that best distinguish the target expression from other, similar expressions.
Specifically, this process aims to actively collect data by generating informative input data $x$ (i.e., queries) and then querying $f$ to get the output data $y = f(x)$. Since the regression process, which uses the data $D = \{(x_i, y_i)\}_{i=1}^{N}$ to find the mathematical expression $\hat{f}$ that best fits the physical system, can be borrowed directly from offline symbolic regression, we mainly focus on the key problem of how to obtain informative data for symbolic regression.

In this paper, we propose QUOSR, a query-based framework that can acquire informative data for online symbolic regression. QUOSR works in an iterative manner: at each query step, it receives the historical data points, generates a new input $x$, and then queries the target $f$ to get the corresponding $y$, where the pair $(x, y)$ serves as a new data point. This process repeats until the maximum number of query steps is reached.

Our contribution is twofold: giving theoretical guidance for learning the query process and implementing QUOSR for online symbolic regression. Specifically, we theoretically show the relationship between informative data and mutual information, which can guide the generation of queries. This relationship can be proved by modeling the query process as the expansion of a decision tree, where the nodes correspond to the query selections $x$, the edges correspond to the responses $y$, and the leaves correspond to the physical systems $f$. Although the computation of mutual information is intractable, we can instead optimize QUOSR by minimizing the InfoNCE loss from contrastive learning (van den Oord, Li, and Vinyals 2018), which bounds the mutual information from below. However, it is still difficult to implement QUOSR for online symbolic regression in practice, for two reasons. (1) The amount of information contained in each data point is so small that, to obtain enough information, the number of query steps becomes extremely large, which makes the query process not only time-consuming but also hard to optimize. (2) The InfoNCE loss needs to compute the similarity between queries and the target system $f$ in a latent space, but $f$ is hard to represent. Conventionally, $f$ would be represented by expression strings, which is sub-optimal. First, the same function $f$ can be expressed by different symbolic forms. For example, $f(x) = \cos(x)$ can also be written as $f(x) = \cos(x + 2k\pi)$ or $f(x) = \sin(x + \frac{4k+1}{2}\pi)$, where $k \in \mathbb{N}$, with exactly the same functionality. Second, a small change to an expression can make a huge difference in functionality.

To tackle these challenges, we propose the query-by-distribution strategy and the modified InfoNCE loss. First, to increase the amount of information obtained in each query step, the query-by-distribution strategy aims to find informative function features, which can be represented by a set of data points instead of a single one. Second, to eliminate the influence of the expression representation problem, we remove the need to compute the similarity between data and expressions in the InfoNCE loss. Instead, we show that the mutual information between data and expressions can be estimated by simply measuring the similarity among different sets of queries. Through comprehensive experiments, QUOSR shows its advantage in exploring informative data for online symbolic regression. We combine QUOSR with a transformer-based method called SymbolicGPT (Valipour et al. 2021) and find an over 11% improvement in average $R^2$.
Related Work

Symbolic regression. Symbolic regression in an offline setting has been studied for years (Koza 1994; Martius and Lampert 2016; Sahoo, Lampert, and Martius 2018; Lample and Charton 2019; La Cava et al. 2021; Xing, Salleb-Aouissi, and Verma 2021; Zheng et al. 2022). Traditionally, methods based on genetic evolution algorithms have been used to tackle offline symbolic regression (Augusto and Barbosa 2000; Schmidt and Lipson 2009; Arnaldo, Krawiec, and O'Reilly 2014; La Cava et al. 2018; Virgolin et al. 2021; Mundhenk et al. 2021). Many recent methods leverage neural networks, following the development of deep learning. AI Feynman (Udrescu and Tegmark 2020) recursively decomposes an expression with physics-inspired techniques and trains a neural network to fit the complex parts. Petersen et al. (2020) propose a reinforcement learning based method in which expressions are generated by an LSTM. Recently, several transformer-based methods have been proposed (Biggio et al. 2021; Kamienny et al. 2022). SymbolicGPT (Valipour et al. 2021) generates a large number of expressions and pretrains a transformer-based model to predict expressions. Similar to QUOSR, Ansari et al. (2022) and Haut, Banzhaf, and Punch (2022) also acquire data points from a physical system. However, there are two main differences. In terms of purpose, they focus on training efficiency in an active learning style, while we focus on both training and inference. In terms of method, theirs are strongly coupled to the symbolic regression process, while QUOSR is a general framework because its query process is decoupled from the regression step.

Active learning. Active learning is a subfield of machine learning in which learners ask an oracle to label representative examples (Settles 1995). Three main scenarios in which learners can ask queries are considered in active learning: (1) pool-based sampling (Lewis and Gale 1994), which evaluates all examples in the dataset and samples some of them; (2) stream-based selective sampling (Cohn, Atlas, and Ladner 1994; Angluin 2001), in which each example in the dataset is independently assessed for the amount of information it carries; and (3) membership query synthesis (Angluin 1988; King et al. 2004), in which the examples to query can be generated by the learner. Note that active learning assumes a single oracle that provides labels, so the learner only needs to model one oracle; in QUOSR, each expression is an oracle, which makes selecting examples much more difficult. Also, in active learning, examples are sampled only during training to make use of abundant unlabeled data, whereas QUOSR aims to find informative examples during both training and inference to assist the subsequent expression generation stage.

Learning to acquire information. Pu, Kaelbling, and Solar-Lezama (2017) and Pu et al. (2018) study the query problem in a finite space, aiming to select representative examples from a given set. Huang et al. (2021) propose an iterative query-based framework for program synthesis and generalize the query problem to a nearly infinite space. They take the symbolic representation of programs as reference substances during optimization and adopt a query-by-point strategy, where the framework iterates for several steps and only one data point is generated at each step, which fails on symbolic regression.

Problem Statement

In this section, we give a formal statement of online symbolic regression to make the problem clear.

Definition 1 (The physical system f). The physical system $f$ is a function $f: \mathbb{R}^M \to \mathbb{R}$.
Given an input $x \in \mathbb{R}^M$, $f$ responds with $y = f(x)$. Intuitively, $f$ can be seen as a scientific rule, and the input data $x$ and the response $y$ can be seen as experimental results.

Definition 2 (Online symbolic regression). Given a physical system $f \in F$, where $F$ denotes the set of all possible systems, online symbolic regression aims to find a mathematical expression $\hat{f}$ that best fits $f$ by interacting with $f$, i.e., $\forall x \in \mathbb{R}^M, |\hat{f}(x) - f(x)| < \tau$, where $\tau$ is the error tolerance.

Unlike conventional symbolic regression tasks, online symbolic regression allows interactions with the physical system $f$, which is a more common setting in scientific discovery.

Methods

In this section, we first describe the query-based framework in detail, including how the framework operates, what "informative" means, and how informative data can be obtained in theory. We then point out the difficulties of applying the query-based framework and give corresponding solutions. At the end of this section, we present the architecture details and the training algorithm of the query-based framework.

The Query-Based Framework

We propose the query-based framework, which aims to generate informative data by querying the target physical system. The framework works in an iterative manner: it queries the target system for several steps, generating a data point at each step. This process is shown in Algorithm 1. Specifically, at each step, the framework receives the historical data points $D$ and generates the next query $x$ with a trained query module $Q$ according to $D$. Then, $x$ is used to query the target system $f$ and get the corresponding output $y = f(x)$. Finally, the newly queried data point $(x, y)$ is added to $D$, and the next query step begins. This process repeats until the maximum number of query steps is reached. After collecting the data points $D$, the regression process begins to fit the symbolic expression $\hat{f}$, which can be done by conventional symbolic regression algorithms.

Algorithm 1: The query-based framework

Query
1: Input: physical system $f$, max query steps $K$, query module $Q$, initial data points $D$
2: for $k \in \{1 \ldots K\}$ do
3:     $x \leftarrow Q(D)$
4:     $y \leftarrow f(x)$
5:     $D \leftarrow D \cup \{(x, y)\}$
6: end for
7: return $D$

Regression
1: Input: data points $D$, regression algorithm $R$
2: $\hat{f} \leftarrow R(D)$
3: return $\hat{f}$

Obviously, the design of the query module $Q$ is the central problem of the query process. To make the design principle clear, two questions need to be answered: what is informative data, and how can we obtain it?

What is informative data? Intuitively, if data $D_B$ can lead us to the target $f$ but $D_A$ cannot, then $D_B$ is more informative than $D_A$. Furthermore, if data $D_C$ can also lead us to the target $f$ and contains fewer data points than $D_B$, then $D_C$ is even more informative than $D_B$. Based on this observation, we conclude that being informative means leading to the target $f$ with as few data points as possible. Formally, we give the following definitions.

Definition 3 (Distinguish). Given the target physical system $f \in F$ and a set of input data $Q = \{x \mid x \in \mathbb{R}^m\}$, we say that $f$ is distinguished from $F$ by $Q$ if and only if $\forall f_i \in F$ with $f_i \ne f$, $\exists x \in Q$ such that $f(x) \ne f_i(x)$.

Definition 4 (Informative queries). Given the target physical system $f \in F$ and two sets of queries $Q_i$ and $Q_j$, we say $Q_i$ is more informative than $Q_j$ if and only if $Q_i$ can distinguish $f$ from $F$ but $Q_j$ cannot, or they can both distinguish $f$ from $F$ but $|Q_i| < |Q_j|$. A minimal code sketch of the query loop and of this "distinguish" test is given below.
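To make Algorithm 1 and Definition 3 concrete, here is a minimal Python sketch under assumed interfaces: `physical_system` stands in for the oracle $f$ and `query_module` for the trained query module $Q$; these names, and the toy random query module in the usage example, are illustrative rather than part of any released implementation.

```python
import numpy as np

def run_query_phase(physical_system, query_module, initial_points, max_steps):
    """Algorithm 1 (Query): iteratively grow the data set D by querying f.

    physical_system: callable f(x) -> y, the oracle being regressed.
    query_module:    callable Q(D) -> x, proposes the next query from history.
    initial_points:  list of (x, y) pairs sampled before querying starts.
    """
    D = list(initial_points)
    for _ in range(max_steps):
        x = query_module(D)      # propose an informative input
        y = physical_system(x)   # "run the experiment"
        D.append((x, y))         # the new pair joins the history
    return D

def distinguishes(f, candidates, queries, tol=1e-6):
    """Definition 3: the queries distinguish f if every other candidate
    disagrees with f on at least one queried input."""
    return all(
        any(abs(f(x) - g(x)) > tol for x in queries)
        for g in candidates if g is not f
    )

# Toy usage: a random query module against f(x) = x**2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = lambda x: x ** 2
    random_Q = lambda D: rng.uniform(-3.0, 3.0)
    D = run_query_phase(f, random_Q, [(0.0, f(0.0))], max_steps=9)
    print(len(D), "data points collected")
```

In QUOSR, the random `query_module` above is replaced by the trained query network described in the following sections.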
How can we obtain informative data? From the perspective of decision processes, we can model the query process as the expansion of a decision tree. Specifically, the nodes correspond to the query selections $x$, the edges correspond to the responses $y$, and the leaves correspond to the physical systems $f$. The problem of obtaining informative data can then be reinterpreted as minimizing the average length of the decision paths.

Definition 5 (Average decision path). Given a decision tree with leaves denoted as $f \in F$, let $L(f)$ denote the decision path to $f$ and $P(f)$ the probability of arriving at $f$. The average decision path is defined as $\bar{L} = \sum_{f} P(f) L(f)$.

Obviously, the average decision path here is equivalent to the average number of query steps. Traditionally, the problem of minimizing the average decision path can be solved by maximizing the information gain at each node expansion, which is the same as maximizing mutual information (Gallager 1968). Thus, with the help of decision trees, we conclude that informative data is the data that maximizes the mutual information with the target functions. Formally, we make the following claim.

Claim 1 (Mutual information guides the query). The collection of informative data can be guided by mutual information:

$D^* = \arg\max_{D} I(F; D). \quad (1)$

Here $F$ denotes the random variable that represents functions. The detailed proof is shown in Appendix A (Jin et al. 2023).

The Modified InfoNCE

InfoNCE. Although Equation 1 gives the optimization direction of the query process, $I(F; D)$ is intractable to compute due to the large space of data and expressions. Fortunately, recent work in contrastive learning shows that the InfoNCE loss estimates a lower bound of the mutual information (van den Oord, Li, and Vinyals 2018; Poole et al. 2019):

$\mathcal{L}_{NCE} = -\mathbb{E}\left[\log \frac{\exp(\mathrm{sim}(d_i, f_i))}{\sum_{j=1}^{N} \exp(\mathrm{sim}(d_i, f_j))}\right], \quad (2)$

and

$I(F; D) \ge \log(B) - \mathcal{L}_{NCE}, \quad (3)$

where $d$ denotes the data points, $f$ denotes the physical system, $B$ denotes the batch size, $i, j$ denote sample indices, and $\mathrm{sim}$ denotes a similarity function.

Figure 1: Overview of QUOSR.

Thus, we can implement the query process as a query network and optimize it with the InfoNCE loss to obtain informative queries. Specifically, in contrastive learning, we construct several positive and negative pairs, and the InfoNCE loss guides the network to maximize the similarity of positive pairs and minimize the similarity of negative pairs. For the query network, we can construct positive pairs from queried data points and the corresponding physical system, $(d_i, f_i)$, and negative pairs from the others, $(d_i, f_{j \ne i})$.
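For reference, here is a minimal PyTorch sketch of the standard InfoNCE estimator in Equation 2; cosine similarity is used as a placeholder for $\mathrm{sim}$, and the data and expression encoders that would produce the two embedding batches are assumed rather than shown.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(data_emb, expr_emb, temperature=1.0):
    """Standard InfoNCE (Equation 2): the i-th data embedding should be most
    similar to the i-th expression embedding within the batch.

    data_emb: (B, H) embeddings of queried data sets d_i
    expr_emb: (B, H) embeddings of the corresponding systems f_i
    """
    data_emb = F.normalize(data_emb, dim=-1)
    expr_emb = F.normalize(expr_emb, dim=-1)
    logits = data_emb @ expr_emb.T / temperature     # sim(d_i, f_j) for all i, j
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)          # = -E[log softmax over the batch]
```

As discussed next, this form requires a good expression encoder for $f$, which is exactly what QUOSR avoids.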
The modified InfoNCE. According to Equation 2, we could simply design the query network with a data encoder to extract the embedding of $d_i$ and an expression encoder to extract the embedding of $f_i$, and then compute their similarity to get the InfoNCE loss. However, the expression encoder is hard to optimize, for several reasons. First, the same function $f$ can be expressed by different symbolic forms, which increases the difficulty of learning. For example, $f(x) = \cos(x)$ can also be written as $f(x) = \cos(x + 2k\pi)$ or $f(x) = \sin(x + \frac{4k+1}{2}\pi)$, where $k \in \mathbb{N}$, with exactly the same functionality. Second, a small change to an expression, such as adding a negative sign, can make a huge difference in functionality. To solve this problem, we propose a modified InfoNCE loss that guides the learning process without requiring a good expression encoder:

$\mathcal{L}'_{NCE} = -\mathbb{E}_{(d_i, d'_i) \sim \mathbb{E}_f[P_{d|f} P_{d'|f}]}\left[\log \frac{\exp(\mathrm{sim}(d_i, d'_i))}{\frac{1}{N}\sum_{j=1}^{N}\exp(\mathrm{sim}(d_i, d'_j))}\right], \quad (4)$

where $d'$ denotes another set of queried data points with different initial conditions from $d$. Specifically, we first sample a physical system $f$ from the space $F$, sample two different groups of data points $d_i$ and $d'_i$ from a uniform distribution as the initial conditions, and then obtain two different data sets by querying $f$ and updating $d_i$ and $d'_i$ (adding the corresponding queried data point to $d_i$ and another to $d'_i$); this process repeats $N$ times to obtain $N$ different $(d_i, d'_i)$ pairs for computing the loss. Intuitively, the main difference between Equation 4 and Equation 2 is that the former replaces the expression $f$ with another set of queried points $d'$, which avoids the expression representation problems mentioned above. In practice, we share parameters between the data encoders for $d$ and $d'$, which is the classical siamese architecture used in contrastive learning (Chen et al. 2020). We prove that optimizing $\mathcal{L}'_{NCE}$ also achieves the purpose of maximizing the mutual information $I(F; D)$:

Claim 2 ($\mathcal{L}'_{NCE}$ bounds mutual information).

$-\mathcal{L}'_{NCE} \le \mathrm{KL}\big(\mathbb{E}_F[P_{D|F} P_{D'|F}] \,\|\, P_D P_{D'}\big) \le \min\{I(F; D), I(F; D')\}, \quad (5)$

where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. A similar proof is given by Tsai et al. (2021), and we adapt it to our setting in Appendix A (Jin et al. 2023). We write $\mathcal{L}_{NCE}$ for $\mathcal{L}'_{NCE}$ in the rest of the paper for simplicity.

Query-by-Distribution

Another problem is that the amount of information contained in each data point is extremely small, which increases the number of query steps and makes the query network hard to optimize. To alleviate this problem, we compress the number of query steps by querying a set of $m$ data points at each step instead of a single data point. However, simply increasing the output dimension of the query network to produce $m$ data points is sub-optimal, because the number of data points needed in symbolic regression varies with the specific problem (i.e., $m$ changes). Since a set can be represented by exponential family distributions (Sun and Nielsen 2019), we represent the query result as a Normal distribution, from which we can flexibly sample any number of latent vectors and use them to generate the final queries. The training process also benefits from the diversity introduced by sampling, which helps it avoid local minima. We choose the Normal distribution for its convenient properties: (1) it has a differentiable closed-form KL divergence, which is convenient for optimization; (2) the product of two Normal distributions (i.e., the intersection of two sets) is still a scaled Normal distribution (Huang et al. 2021). Details are given in the next subsection.
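A minimal sketch of the query-by-distribution idea, under assumed layer sizes: a head predicts one Normal distribution from the data embedding, any number $m$ of latent vectors is sampled from it with the reparameterization trick, and a small inversion MLP maps each latent back to an input $x$ inside the allowed range. The module and parameter names here are hypothetical, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class DistributionQueryHead(nn.Module):
    """Query-by-distribution: represent the next batch of queries as a Normal
    distribution and sample as many queries as needed from it."""

    def __init__(self, hidden_dim, x_dim, x_range=3.0):
        super().__init__()
        # predicts [mu_q, log(sigma_q^2)] from the data embedding
        self.to_query_dist = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        # projects a sampled latent vector back to input space R^M
        self.inversion = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, x_dim)
        )
        self.x_range = x_range

    def forward(self, data_embedding, m):
        mu_q, log_var_q = self.to_query_dist(data_embedding).chunk(2, dim=-1)
        std_q = (0.5 * log_var_q).exp()
        # reparameterized sampling of m latent vectors v_i ~ N(mu_q, sigma_q)
        eps = torch.randn(m, *mu_q.shape, device=mu_q.device)
        v = mu_q + std_q * eps
        # squash with tanh so queries stay inside the allowed input range
        return self.x_range * torch.tanh(self.inversion(v))
```

Because $m$ only enters through the sampling step, the same head can emit 1, 3, or 30 queries per step without changing any learned parameter, which is the point of the strategy.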
QUOSR

Next, we introduce the architecture and training details. The architecture is shown in Figure 1. As mentioned before, QUOSR adopts a siamese architecture in which the two branches share parameters. Each branch mainly contains two parts: a physical system $f$, which is queried with the generated $x$, and a query network, which receives the historical data points $\{(x_i, y_i)\}_i$ and generates the next queries $x$. The query network consists of an encoder that embeds the historical data points and a decoder that receives the data embedding and generates the next queries. Details are described in the following.

Encoder. The design of the encoder follows Ren and Leskovec (2020) and Huang et al. (2021). First, given a set of data points $D = \{(x_i, y_i)\}_{i=1}^{N}$, each data point is projected into a Normal distribution in latent space. This distribution can be seen as the representation of the set of candidate expressions that satisfy the given data point:

$[\mu_i, \log(\sigma_i^2)] = \mathrm{MLP}_{data}((x_i, y_i)), \quad (6)$

where $[\cdot]$ denotes concatenation, and $\mu_i$ and $\sigma_i$ parameterize the Normal distribution $\mathcal{N}(\mu_i, \sigma_i)$. Then, we treat the $N$ distributions as independent (i.e., the input order is irrelevant) and compute their intersection as the final embedding of the data points $D$:

$[\mu, \log(\sigma^2)] = \sum_{i=1}^{N} w_i [\mu_i, \log(\sigma_i^2)], \quad (7)$

$w_i = \frac{\exp(\mathrm{MLP}_{attention}([\mu_i, \log(\sigma_i^2)]))}{\sum_{j=1}^{N} \exp(\mathrm{MLP}_{attention}([\mu_j, \log(\sigma_j^2)]))}, \quad (8)$

where MLP denotes a multi-layer perceptron. A further discussion of the intersection operation is beyond the scope of this paper; please refer to the original papers.

Decoder. The decoder consists of two parts: generating a Normal distribution and sampling $m$ points from it. Unlike the distribution produced by the encoder, which represents the set of candidate expressions, this distribution represents the set of possible queries, as mentioned in the last subsection:

$[\mu_q, \log(\sigma_q^2)] = \mathrm{MLP}_{query}([\mu, \log(\sigma^2)]), \quad (9)$

where the subscript $q$ denotes "query". Then, we sample $m$ latent vectors $v_i$ from the Normal distribution $\mathcal{N}(\mu_q, \sigma_q)$ and project them back to $\mathbb{R}^M$ with an MLP:

$x_i = \mathrm{MLP}_{inversion}(v_i). \quad (10)$

Training. At the beginning of the query process, we sample two groups of data points uniformly as the initial conditions (the number of initial data points equals the number of data points queried in each step). Since $\mathcal{L}_{NCE}$ is calculated by measuring the similarity between two different sets of data points $d$ and $d'$, we maintain two query processes with different initial conditions simultaneously. Specifically, at each query step, QUOSR receives $N$ historical data points $\{(x_i, y_i)\}_{i=1}^{N}$ as input, generates $m$ new inputs $\{x_i\}_{i=N+1}^{N+m}$, and then gets the corresponding $\{y_i\}_{i=N+1}^{N+m}$ from $f$. These $N+m$ data points $\{(x_i, y_i)\}_{i=1}^{N+m}$ are taken as the input of QUOSR at the next step, and the process repeats. As mentioned in Equation 7, the data embedding represents a Normal distribution, so the similarity of two sets of data points is measured with the Kullback-Leibler divergence:

$\mathrm{sim}(d, d') = -\mathrm{KL}\big(\mathcal{N}(\mu_d, \sigma_d)\,\|\,\mathcal{N}(\mu_{d'}, \sigma_{d'})\big). \quad (11)$

We merge $d'$ into $d$ in Equation 4 by interleaving their elements, so the loss function in Equation 4 is finally rewritten as

$\mathcal{L}_{NCE} = \frac{1}{2N}\sum_{i=1}^{N}\big[\ell(d_{2i-1}, d_{2i}) + \ell(d_{2i}, d_{2i-1})\big], \quad (12)$

$\ell(d_i, d_j) = -\log \frac{\exp(\mathrm{sim}(d_i, d_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\mathrm{sim}(d_i, d_k)/\tau)}, \quad (13)$

where $d_{2i-1}$ denotes $d$, $d_{2i}$ denotes $d'$, and $\tau$ denotes a temperature parameter. See Figure 1 and Algorithm 2 in Appendix C (Jin et al. 2023) for details.
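A minimal sketch of the training loss in Equations 11-13, assuming each branch's data embedding is a diagonal Gaussian $[\mu, \log(\sigma^2)]$: similarity is taken as the negative KL divergence between Gaussians, and the two branches are interleaved so that $(d_{2i-1}, d_{2i})$ form the positive pairs. Tensor shapes and function names are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) for diagonal Gaussians,
    summed over the latent dimension."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1).sum(-1)

def modified_info_nce(mu, logvar, mu_prime, logvar_prime, temperature=0.1):
    """Equations 11-13: contrast the two query branches d and d'.

    mu, logvar:             (N, H) Gaussian embeddings of branch d
    mu_prime, logvar_prime: (N, H) Gaussian embeddings of branch d'
    """
    # interleave the branches so that (d_{2i-1}, d_{2i}) are positive pairs
    all_mu = torch.stack([mu, mu_prime], dim=1).reshape(-1, mu.size(-1))          # (2N, H)
    all_logvar = torch.stack([logvar, logvar_prime], dim=1).reshape(-1, mu.size(-1))
    n2 = all_mu.size(0)
    # sim(d_i, d_j) = -KL(d_i || d_j), computed for every ordered pair (Eq. 11)
    sim = -gaussian_kl(all_mu.unsqueeze(1), all_logvar.unsqueeze(1),
                       all_mu.unsqueeze(0), all_logvar.unsqueeze(0)) / temperature
    mask = torch.eye(n2, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))       # exclude k = i from the denominator
    targets = torch.arange(n2, device=sim.device) ^ 1  # indices 2i-1 <-> 2i are positives
    return F.cross_entropy(sim, targets)             # Eq. 12-13 averaged over both directions
```

This is structurally the same symmetric contrastive loss used in siamese setups such as SimCLR, with negative KL between Gaussian embeddings in place of cosine similarity.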
Experiments

We study the ability of QUOSR to find informative data points through experiments. First, we combine QUOSR with a symbolic regression method and present the main results. Then, some cases are given to visualize the strategy learned by QUOSR. Finally, we conduct ablation experiments.

Settings

We combine QUOSR with a transformer-based method named SymbolicGPT (Valipour et al. 2021), which achieves competent performance in symbolic regression. SymbolicGPT uses PointNet (Qi et al. 2017) to encode any number of data points and then generates the skeleton of an expression character by character with GPT, where constants are represented by a <C> token. To learn the constants in the skeleton, SymbolicGPT employs BFGS optimization (Fletcher 2013). More experiments are shown in Appendix B (Jin et al. 2023).

Figure 2: The performance of the original SymbolicGPT model and three models finetuned on Uniform, Normal, and Query.

The training process is split into three stages: training QUOSR, generating a new dataset, and finetuning SymbolicGPT. We train QUOSR on the expressions in the dataset of Valipour et al. (2021), which contains about 500k one-variable expressions. The maximum number of query steps $K$ is set to 9; we first uniformly sample 3 points and then generate the other 27 data points in the following 9 query steps. We limit QUOSR to generating values of $x \in [-3.0, 3.0]$ to match the range of the original dataset. We then generate a new dataset named Query for all expressions. For comparison, we also sample $x$ from $U(-3.0, 3.0)$ and $\mathcal{N}(0, 1)$ to generate another two datasets, Uniform and Normal. We finetune the pretrained SymbolicGPT model on these three datasets.

The original test set of SymbolicGPT contains 966 expressions. For each expression, 30 data points sampled uniformly in the range $[-3.0, 3.0]$ are used to generate expressions, and another 30 data points sampled uniformly in the range $[-5.0, -3.0] \cup [3.0, 5.0]$ are used to evaluate the predicted expression. We replace the data points used for generating expressions with ones sampled by three methods: QUOSR, uniform sampling, and normal sampling, and use the original test data points to evaluate two metrics for the three models: the proportion of predictions with $\log \mathrm{MSE}_N < -10$ and the average $R^2$ (Kamienny et al. 2022):

$\mathrm{MSE}_N = \frac{\|y - \hat{y}\|^2}{\|y + \epsilon\|^2}, \quad (14)$

$R^2 = \max\left(0,\; 1 - \frac{\sum_{i}^{N_{test}} (y_i - \hat{y}_i)^2}{\sum_{i}^{N_{test}} (y_i - \bar{y})^2}\right), \quad (15)$

where $\epsilon$ is set to 0.00001 to avoid division by zero and $\bar{y} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} y_i$. We set the lower bound of $R^2$ to 0 since a bad prediction can make $R^2$ an extremely large negative number.
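A small sketch of the two evaluation metrics described above; since only a fragment of Equation 14 survives in the source, the exact normalization inside $\mathrm{MSE}_N$ is a reconstruction and should be treated as approximate.

```python
import numpy as np

def normalized_log_mse(y_true, y_pred, eps=1e-5):
    """log(MSE_N): squared error normalized by ||y + eps||^2 (Equation 14,
    reconstructed). A prediction counts as a hit when this falls below -10."""
    mse_n = np.sum((y_true - y_pred) ** 2) / np.sum((y_true + eps) ** 2)
    return np.log(mse_n)

def r2_clipped(y_true, y_pred):
    """R^2 with a floor at 0 (Equation 15): very bad predictions would
    otherwise dominate the average with large negative values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)
```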
Results

Figure 2 shows the proportion of predicted expressions with $\log \mathrm{MSE}_N < -10$ and the average $R^2$. Although the results of uniform sampling improve after finetuning, QUOSR performs the best on both metrics, which indicates that in the online setting of symbolic regression, QUOSR explores more informative data points than uniform sampling and thus helps SymbolicGPT achieve better results.

Figure 3: Results for different amounts of data points used for generating expressions.

We also evaluate the performance of QUOSR at each query step, as shown in Figure 3. Since the more information a data point contains, the sooner it is queried, the performance of QUOSR improves significantly in the first four steps and grows steadily in the next five.

Case Study

Figure 4 gives a visualization of the latent space using t-SNE (Van der Maaten and Hinton 2008). We cluster data-point embeddings (Figure 4, Left), symbolic-form embeddings (Figure 4, Mid), and functions (into 9 classes; Figure 4, Right). To see which embedding method better matches functionality, we label the 9 classes in Figure 4, Right, with 9 different colors and color Figure 4, Left and Mid, with the corresponding colors. Graphs of expressions in the same cluster are close, but their symbolic-form embeddings are chaotic.

Figure 4: The clustering results of different embedding methods (Left and Mid). Different colors represent different sets of expressions that have similar functionality (Right).

Figure 5 presents graphs of two expressions on which QUOSR outperforms uniform sampling. Data points sampled uniformly are marked with blue dots and the ones generated by QUOSR are marked in red. We observe two interesting phenomena: (1) QUOSR pays more attention to the neighborhood of extreme points and discontinuity points. (2) For periodic functions, QUOSR samples multiple points in one cycle and only a few points in the other cycles.

Figure 5: Graphs of two expressions on which QUOSR outperforms uniform sampling. The blue points are sampled uniformly and the red ones are generated by QUOSR.

Ablation Study

Table 1 presents results for the ablation study over the following four variants (a code sketch of the intersection variants in item (1) is given at the end of this subsection):

(1) Methods of intersection: attention in Equation 7, mean in Equation 16, or max in Equation 17:

$[\mu, \log(\sigma^2)] = \left[\frac{1}{N}\sum_{i=1}^{N}\mu_i,\; \log\Big(\frac{1}{N}\sum_{i=1}^{N}\sigma_i^2\Big)\right], \quad (16)$

$[\mu, \log(\sigma^2)] = \left[\max_i \mu_i,\; \log(\max_i \sigma_i^2)\right], \quad (17)$

(2) The similarity function: KL divergence in Equation 11 or cosine similarity (cos) in Equation 18:

$\mathrm{sim}(d, d') = \frac{[\mu_d, \sigma_d]\,[\mu_{d'}, \sigma_{d'}]^{T}}{\|[\mu_d, \sigma_d]\|\,\|[\mu_{d'}, \sigma_{d'}]\|}, \quad (18)$

(3) Representation of the physical system $f$: data points (data) or symbolic forms (expr).

(4) Query strategies: query-by-distribution (QBD), query-by-set (QBS) in Equation 19, or query-by-point (QBP) in Equation 20:

$\{x_i\}_{i=1}^{m} = \mathrm{MLP}_{QBS}([\mu, \log(\sigma^2)]), \quad (19)$

$x_i = \mathrm{MLP}_{QBP}([\mu, \log(\sigma^2)]). \quad (20)$

| Intersect | Sim | Rep | Stra | Proportion | $R^2$ |
|-----------|-----|-----|------|------------|-------|
| mean | KL | data | QBD | 41.20 | 0.5370 |
| max | KL | data | QBD | 45.65 | 0.5774 |
| attention | cos | data | QBD | 41.72 | 0.5380 |
| attention | KL | expr | QBD | 44.20 | 0.5721 |
| attention | KL | expr | QBS | 40.89 | 0.5290 |
| attention | KL | data | QBS | 41.61 | 0.5388 |
| attention | KL | data | QBP | N/A | N/A |
| attention | KL | data | QBD | 50.41 | 0.6177 |

Table 1: Ablation over intersection methods (Intersect), the similarity function (Sim), expression representation (Rep), and query strategies (Stra). N/A means an out-of-memory error occurs during training.

The results are presented in Table 1. We can conclude that: (1) Experiments with the KL similarity and attention intersection achieve better results than the others, suggesting the advantage of projecting data points to a Normal distribution. (2) The performance drops when using symbolic forms of expressions, which indicates that representing an expression with its data points is more appropriate than with its symbolic forms. (3) Among all query strategies, our query-by-distribution strategy performs the best. The query-by-point strategy does not produce useful results due to its optimization difficulty. The query-by-set strategy performs worse than ours, probably because sampling from a distribution introduces more diversity, which helps the query network avoid local minima during training.
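To make the ablated intersection operators concrete, here is a sketch of the attention (Equations 7-8), mean (Equation 16), and max (Equation 17) variants acting on per-point Gaussian embeddings; the attention MLP's width and the module name are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

class GaussianIntersection(nn.Module):
    """Combine per-point embeddings [mu_i, log sigma_i^2] into one embedding."""

    def __init__(self, hidden_dim, mode="attention"):
        super().__init__()
        self.mode = mode
        self.attn = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, mu, logvar):
        # mu, logvar: (N, H) per-data-point Gaussian parameters
        if self.mode == "attention":              # Equations 7-8
            w = torch.softmax(self.attn(torch.cat([mu, logvar], dim=-1)), dim=0)
            return (w * mu).sum(0), (w * logvar).sum(0)
        if self.mode == "mean":                   # Equation 16
            return mu.mean(0), logvar.exp().mean(0).log()
        if self.mode == "max":                    # Equation 17
            return mu.max(0).values, logvar.exp().max(0).values.log()
        raise ValueError(self.mode)
```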
Conclusion

In this work, we propose a query-based framework, QUOSR, to tackle online symbolic regression. To maximize the mutual information between data points and expressions, we modify the InfoNCE loss so that a good expression encoder becomes unnecessary. Furthermore, we use the query-by-distribution strategy instead of query-by-point, which increases the amount of information obtained in each query step. Combined with offline symbolic regression methods, QUOSR achieves better performance. We leave the query-based framework for multi-variable expressions as future work, given the challenge of spurious multi-variable expressions.

Acknowledgments

This work is partially supported by the NSF of China (under Grants 61925208, 62222214, 62102399, 62002338, U22A2028, U19B2019), Beijing Academy of Artificial Intelligence (BAAI), CAS Project for Young Scientists in Basic Research (YSBR-029), Youth Innovation Promotion Association CAS, and Xplore Prize.

References

Angluin, D. 1988. Queries and concept learning. Machine Learning, 2(4): 319-342.
Angluin, D. 2001. Queries revisited. In International Conference on Algorithmic Learning Theory, 12-31. Springer.
Ansari, M.; Gandhi, H. A.; Foster, D. G.; and White, A. D. 2022. Iterative symbolic regression for learning transport equations. AIChE Journal, e17695.
Arnaldo, I.; Krawiec, K.; and O'Reilly, U.-M. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, 879-886.
Augusto, D. A.; and Barbosa, H. J. 2000. Symbolic regression via genetic programming. In Proceedings. Vol. 1. Sixth Brazilian Symposium on Neural Networks, 173-178. IEEE.
Biggio, L.; Bendinelli, T.; Neitz, A.; Lucchi, A.; and Parascandolo, G. 2021. Neural symbolic regression that scales. In International Conference on Machine Learning, 936-945. PMLR.
Chen, C.; Luo, C.; and Jiang, Z. 2017. Elite bases regression: A real-time algorithm for symbolic regression. In 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 529-535. IEEE.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597-1607. PMLR.
Cohn, D.; Atlas, L.; and Ladner, R. 1994. Improving generalization with active learning. Machine Learning, 15(2): 201-221.
Fletcher, R. 2013. Practical Methods of Optimization. John Wiley & Sons.
Gallager, R. G. 1968. Information Theory and Reliable Communication, volume 588. Springer.
Haut, N.; Banzhaf, W.; and Punch, B. 2022. Active Learning Improves Performance on Symbolic Regression Tasks in StackGP. arXiv preprint arXiv:2202.04708.
Hernandez, A.; Balasubramanian, A.; Yuan, F.; Mason, S. A.; and Mueller, T. 2019. Fast, accurate, and transferable many-body interatomic potentials by symbolic regression. npj Computational Materials, 5(1): 1-11.
Huang, D.; Zhang, R.; Hu, X.; Zhang, X.; Jin, P.; Li, N.; Du, Z.; Guo, Q.; and Chen, Y. 2021. Neural Program Synthesis with Query. In International Conference on Learning Representations.
Jin, P.; Huang, D.; Zhang, R.; Hu, X.; Nan, Z.; Du, Z.; Guo, Q.; and Chen, Y. 2023. Online Symbolic Regression with Informative Query. arXiv:2302.10539.
Kamienny, P.-A.; d'Ascoli, S.; Lample, G.; and Charton, F. 2022. End-to-end symbolic regression with transformers. arXiv, abs/2204.10532.
King, R. D.; Whelan, K. E.; Jones, F. M.; Reiser, P. G.; Bryant, C. H.; Muggleton, S. H.; Kell, D. B.; and Oliver, S. G. 2004. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427(6971): 247-252.
Koza, J. R. 1994. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2): 87-112.
La Cava, W.; Orzechowski, P.; Burlacu, B.; de França, F. O.; Virgolin, M.; Jin, Y.; Kommenda, M.; and Moore, J. H. 2021. Contemporary symbolic regression methods and their relative performance. arXiv preprint arXiv:2107.14351.
La Cava, W.; Singh, T. R.; Taggart, J.; Suri, S.; and Moore, J. H. 2018. Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations.
Lample, G.; and Charton, F. 2019. Deep learning for symbolic mathematics. arXiv preprint arXiv:1912.01412.
Lewis, D. D.; and Gale, W. A. 1994. A sequential algorithm for training text classifiers. In SIGIR '94, 3-12. Springer.
Martius, G.; and Lampert, C. H. 2016. Extrapolation and learning equations. arXiv preprint arXiv:1610.02995.
Mundhenk, T. N.; Landajuela, M.; Glatt, R.; Santiago, C. P.; Faissol, D. M.; and Petersen, B. K. 2021. Symbolic regression via neural-guided genetic programming population seeding. arXiv preprint arXiv:2111.00053.
Petersen, B. K.; Larma, M. L.; Mundhenk, T. N.; Santiago, C. P.; Kim, S. K.; and Kim, J. T. 2020. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations.
Poole, B.; Ozair, S.; Van Den Oord, A.; Alemi, A.; and Tucker, G. 2019. On variational bounds of mutual information. In International Conference on Machine Learning, 5171-5180. PMLR.
Pu, Y.; Kaelbling, L. P.; and Solar-Lezama, A. 2017. Learning to Acquire Information. In UAI. AUAI Press.
Pu, Y.; Miranda, Z.; Solar-Lezama, A.; and Kaelbling, L. 2018. Selecting representative examples for program synthesis. In International Conference on Machine Learning, 4161-4170. PMLR.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652-660.
Ren, H.; and Leskovec, J. 2020. Beta Embeddings for Multi-Hop Logical Reasoning in Knowledge Graphs. In NeurIPS.
Sahoo, S.; Lampert, C.; and Martius, G. 2018. Learning equations for extrapolation and control. In International Conference on Machine Learning, 4442-4450. PMLR.
Schmidt, M.; and Lipson, H. 2009. Distilling free-form natural laws from experimental data. Science, 324(5923): 81-85.
Settles, B. 1995. Active Learning Literature Survey. Science, 10(3): 237-304.
Sun, K.; and Nielsen, F. 2019. Information-Geometric Set Embeddings (IGSE): From Sets to Probability Distributions. arXiv, abs/1911.12463.
Tsai, Y.-H. H.; Li, T.; Liu, W.; Liao, P.; Salakhutdinov, R.; and Morency, L.-P. 2021. Integrating auxiliary information in self-supervised learning. arXiv preprint arXiv:2106.02869.
Udrescu, S.-M.; and Tegmark, M. 2020. AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16): eaay2631.
Valipour, M.; You, B.; Panju, M.; and Ghodsi, A. 2021. SymbolicGPT: A Generative Transformer Model for Symbolic Regression. arXiv, abs/2106.14131.
van den Oord, A.; Li, Y.; and Vinyals, O. 2018. Representation Learning with Contrastive Predictive Coding. arXiv, abs/1807.03748.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
Virgolin, M.; Alderliesten, T.; Witteveen, C.; and Bosman, P. A. 2021. Improving model-based genetic programming for symbolic regression of small expressions. Evolutionary Computation, 29(2): 211-237.
Weng, B.; Song, Z.; Zhu, R.; Yan, Q.; Sun, Q.; Grice, C. G.; Yan, Y.; and Yin, W.-J. 2020. Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite catalysts. Nature Communications, 11(1): 1-8.
Xing, H.; Salleb-Aouissi, A.; and Verma, N. 2021. Automated symbolic law discovery: A computer vision approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 660-668.
Zheng, W.; Chen, T.; Hu, T.-K.; and Wang, Z. 2022. Symbolic Learning to Optimize: Towards Interpretability and Scalability. arXiv preprint arXiv:2203.06578.