Published as a conference paper at ICLR 2020

DEEP SYMBOLIC SUPEROPTIMIZATION WITHOUT HUMAN KNOWLEDGE

Hui Shi1, Yang Zhang2, Xinyun Chen3, Yuandong Tian4, Jishen Zhao1
UC San Diego1, MIT-IBM Watson AI Lab2, UC Berkeley3, Facebook AI Research4
{hshi, jzhao}@ucsd.edu, yang.zhang2@ibm.com, xinyun.chen@berkeley.edu, yuandong@fb.com

Deep symbolic superoptimization refers to the task of applying deep learning methods to simplify symbolic expressions. Existing approaches either perform supervised training on human-constructed datasets that define equivalent expression pairs, or apply reinforcement learning with human-defined equivalent transformation actions. In short, almost all existing methods rely on human knowledge to define equivalence, which incurs a large labeling cost and introduces learning bias. We thus propose HISS, a reinforcement learning framework for symbolic superoptimization that keeps humans outside the loop. HISS introduces a tree-LSTM encoder-decoder network with attention to ensure tractable learning. Our experiments show that HISS can discover more simplification rules than existing human-dependent methods, and can learn meaningful embeddings for symbolic expressions that are indicative of equivalence.1

1 INTRODUCTION

Superoptimization refers to the task of simplifying and optimizing a set of machine instructions, or code (Massalin, 1987; Schkufza et al., 2013), which is a fundamental problem in computer science. As an important direction in superoptimization, symbolic expression simplification, or symbolic superoptimization, aims at transforming a symbolic expression into a simpler form in an effective way, so as to avoid unnecessary computation. Symbolic superoptimization is an important component in compilers, e.g. LLVM and Halide, and it is also widely used in mathematical engines including Wolfram2, Matlab, and Sympy.
Over recent years, applying deep learning methods to symbolic superoptimization has attracted great attention. Despite their variety, existing algorithms can be roughly divided into two categories. The first category is supervised learning, i.e. learning a mapping between input expressions and output simplified expressions from a large number of human-constructed expression pairs (Arabshahi et al., 2018; Zaremba & Sutskever, 2014). Such methods rely heavily on a human-constructed dataset, which is time- and labor-consuming to build. Worse, such systems are highly susceptible to bias, because it is generally very hard to define a minimal and comprehensive axiom set for training: it is highly possible that some forms of equivalence are not covered in the training set and thus fail to be recognized by the model. In order to remove the dependence on human annotations, the second category of methods leverages reinforcement learning to autonomously discover simplifying equivalences (Chen et al., 2018). However, to make the action space tractable, such systems still rely on a set of equivalent transformation actions defined by humans, which again incurs labeling cost and learning bias.

In short, the existing neural symbolic superoptimization algorithms all require human input to define equivalences. An algorithm independent of human knowledge would benefit from improved efficiency and better simplification. In fact, symbolic superoptimization should naturally be a task that keeps humans outside the loop, because it directly operates on machine code, whose consumers and evaluators are machines, not humans.

Authors contributed equally to this paper.
1The code is available at https://github.com/shihui2010/symbolic_simplifier.
2https://www.wolframalpha.com/

Therefore, we propose Human-Independent Symbolic Superoptimization (HISS), a reinforcement learning framework for symbolic superoptimization that is completely independent of human knowledge. Instead of using human-defined equivalence, HISS adopts a set of unsupervised techniques to maintain the tractability of the action space. First, HISS introduces a tree-LSTM encoder-decoder architecture with attention to ensure that its exploration is confined within the set of syntactically correct expressions. Second, the process of generating a simplified expression is broken into two stages: the first stage selects a sub-expression that can be simplified, and the second stage simplifies that sub-expression. We performed a set of evaluations on artificially generated expressions as well as a publicly available code dataset, the Halide dataset (Chen & Tian, 2018), and show that HISS achieves competitive performance. We also find that HISS can automatically discover simplification rules that are not included in the human-predefined rules of the Halide dataset.

2 RELATED WORK

Superoptimization originates in 1987 with the first design by Massalin (1987). Using probabilistic testing to reduce the testing cost, its brute-force search is aided by a pruning strategy that avoids sub-spaces containing pieces of code with known shorter alternatives. Due to the explosive search space of exhaustive search, the capability of this first superoptimizer is limited to very short programs. More than a decade later, Joshi et al. (2002) presented Denali, which splits the superoptimization problem into two phases to extend the capability to longer programs. STOKE (Schkufza et al., 2013) follows the same two phases but sacrifices completeness for efficiency in the second phase.
Recent attempts to improve superoptimization fall into two directions: exploring transformation rules and accelerating trajectory search. Searching for rules is similar to superoptimization on limited-size programs, but targets the comprehensiveness of the rules. Buchwald (2015) exhaustively enumerates all possible expressions given the syntax and checks the equivalence of pairs of expressions with an SMT solver. A similar method, with an adaptation of the SMT solver that reuses previous results, is proposed by Jangda & Yorsh (2017). On the other hand, deep neural networks have been trained to guide trajectory search (Cai et al., 2018; Chen & Tian, 2018). Viewing transformation rule discovery as limited-space superoptimization, the large action space and sparse reward are the main challenges for using neural networks. Special neural generator structures have been proposed for decoding valid symbolic programs, which leverage syntax constraints to reduce the search space as well as to learn the reasoning of operations, and are gaining popularity in program synthesis (Parisotto et al., 2016; Zhong et al., 2017; Bunel et al., 2018), program translation (Chen et al., 2018; Drissi et al., 2018), and other code generation tasks (Ling et al., 2016; Alvarez-Melis & Jaakkola, 2016). Among symbolic expression decoders, the family of tree-structured RNNs (Parisotto et al., 2016; Drissi et al., 2018; Alvarez-Melis & Jaakkola, 2016; Chen et al., 2018) is more flexible than template-based predictors (Ling et al., 2016; Zhong et al., 2017).

3 THE HISS ARCHITECTURE

In this section, we detail our proposed HISS architecture. We first introduce a few notations: T denotes a tree, a denotes a vector, and A denotes a matrix. We introduce an LSTM(·) function that summarizes the standard one-step LSTM operation as

[h_t, c_t] = LSTM(x_t, h_{t-1}, c_{t-1}),    (1)

where h_t, c_t and x_t denote the output, cell and input at time t of a standard LSTM, respectively.
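As a concrete reference for Eq. (1), the following is a minimal NumPy sketch of one standard LSTM step. The weight layout, names, and dimensions are our own illustrative choices; the paper does not prescribe an implementation.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM: [h_t, c_t] = LSTM(x_t, h_{t-1}, c_{t-1}).

    W stacks the four gate weight matrices, shape (4*d, d_in + d); b has shape (4*d,).
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = 1.0 / (1.0 + np.exp(-z[:d]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2*d]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*d:3*d]))   # output gate
    g = np.tanh(z[3*d:])                    # candidate cell
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d = 8, 16
W = rng.standard_normal((4 * d, d_in + d)) * 0.1
b = np.zeros(4 * d)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d), np.zeros(d), W, b)
```

Both the tree encoder and the tree decoder described below reuse this basic recurrence, only rewiring where the previous state comes from.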
3.1 FRAMEWORK OVERVIEW

Our problem can be formulated as follows. Given a symbolic expression T_I, represented in expression tree form, our goal is to find a simplified expression T_O such that 1) the two expressions are equivalent, and 2) T_O contains a smaller number of nodes than T_I.

It is important to write the symbolic expressions in their expression tree form, rather than as strings, because HISS operates on tree structures. An expression tree assigns a node to each operation or variable. Each non-terminal node represents an operation, and each terminal node, or leaf node, represents a variable or a constant. The arguments of an operation are represented as the descendant subtrees of the corresponding node. Compared to string representations, the tree representation naturally ensures that any randomly generated expression in this form is syntactically correct. It also makes working with subexpressions easier: one simply works with subtrees.

Figure 1: The HISS architecture, illustrated on a three-node binary subtree, where i is the parent of j, k, and p is the parent of i. (a) The tree encoder (green) and subtree selector (orange). (b) The tree decoder.

HISS approaches the problem using the reinforcement learning framework, where the action of generating a simplified expression is divided into two consecutive actions. The first action picks a subexpression (or subtree) that can be simplified, and the second action generates the simplified expression for the selected subexpression. Accordingly, HISS contains three modules. The first module is a tree encoder, which computes an embedding for each subtree (including the entire tree) of the input expression. The embeddings are useful both for picking a subtree for simplification and for simplifying a subtree.
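To make the expression-tree representation concrete, here is a minimal Python sketch of such a tree (class and method names are ours, not from the paper): each node holds an operation or a terminal symbol, a subexpression is just a subtree, and card(·) is the node count used later in the reward.

```python
class ExprNode:
    """A node in an expression tree: an operation with child subtrees, or a leaf."""
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

    def card(self):
        """Number of nodes in the subtree rooted here (the paper's card(T))."""
        return 1 + sum(c.card() for c in self.children)

    def subtrees(self):
        """Yield every subtree, root first; these are the subtree selector's candidates."""
        yield self
        for c in self.children:
            yield from c.subtrees()

    def __str__(self):
        if not self.children:
            return str(self.symbol)
        return "(" + (" " + self.symbol + " ").join(str(c) for c in self.children) + ")"

# (v0 * 72) - (v0 * 72): 7 nodes, simplifiable to the single node 0
t = ExprNode("-", [ExprNode("*", [ExprNode("v0"), ExprNode("72")]),
                   ExprNode("*", [ExprNode("v0"), ExprNode("72")])])
```

By construction, any tree built from such nodes is a syntactically valid expression, which is the property the text appeals to.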
The second module is a subtree selector, which selects a subtree for simplification. The third module is a tree decoder with an attention mechanism, which generates a simplified expression based on the input subtree embedding. The subsequent subsections introduce each module in turn.

3.2 THE TREE ENCODER

The tree encoder generates an embedding for every subtree of the input expression. We apply the N-ary Tree-LSTM proposed in Tai et al. (2015), where N is the maximum number of arguments that an operation has. Note that although different operations have different numbers of arguments, for structural uniformity we assume that all operations have N arguments, with the excessive arguments being a NULL symbol.

The tree encoder consists of two layers. The first layer is the embedding layer, a fully-connected layer that converts the one-hot representation of each input symbol to an embedding. The second layer is the tree-LSTM layer, which is almost the same as a regular LSTM, except that the cell information now flows from the children nodes to their parent node. Formally, denote c_i, h_i, and x_i as the cell, output and input of node i, respectively. Then the tree-LSTM encoder performs

[h_i, c_i] = LSTM(x_i, [h_j]_{j∈D(i)}, [c_j]_{j∈D(i)}),    (2)

where D(i) denotes the set of children of node i. Fig. 1(a) plots the architecture of the tree-LSTM encoder (in green). Since each node fuses the information from its children, which in turn fuse the information from their own children, the output h_i summarizes the information of the entire subtree rooted at node i, and can thus be regarded as an embedding for this subtree.

3.3 THE SUBTREE SELECTOR

The subtree selector performs the first action: selecting a subtree for simplification. It takes the output of the tree encoder, {h_i}, as its input, and produces the probability with which each subtree is
It consists of two feed-forward layers followed by a softmax layer across all the nodes in the input tree. Figure 1(a) shows the architecture of the subtree selector (in orange). 3.4 THE TREE DECODER Once a subtree has been selected, and suppose the root node of the selected subtree is node i, the output of the encoder at node i, hi, is then fed into the tree decoder, which generates a simplified version of the subtree. The tree decoder can be regarded as the inverse process of the tree encoder: the latter fuses information from the children to the parents, whereas the former unrolls the information from parents down to the entire N-ary tree. When the parent node outputs a non-operation symbol, the expansion of this branch terminates and no child is further decoded. The tree decoder adopts a novel LSTM architecture with attention, which, compared with the attention LSTM proposed by Chen et al. (2018), is more parameterand computationally-efficient. It consists of two layers. The first layer is a tree LSTM layer, and the second layer is the symbol generation layer with attention. Figure 1(b) illustrates the decoder structure. The decoder shares the same vocabulary with the encoder, and the embedding layer of the decoder shares the parameters with the embedding layer in the encoder. Tree LSTM Layer. The tree LSTM in the decoder needs to accomplish two tasks. First, it needs to extract the information for generating the output for the current node. Second, it needs to split and pass on the information to its children. To better control the information flow, we introduce two tracks of LSTMs for the two different tasks. Formally, denote [h i, c i] as the output and cell of node i, and assume [j1, , j N] are children nodes of i. Also denote yp as the decoder output for node p, which is the parent node of node i (If node i is already the root node of the selected subtree, then yp becomes a special start token). 
Then the first LSTM track extracts the information that generates the current output:

[h̃_i^out, c̃_i^out] = LSTM_out(y_p, h̃_i, c̃_i).    (3)

The second LSTM track splits and passes on the information to the children, i.e. for n ∈ {1, …, N},

[h̃_{i→j_n}, c̃_{i→j_n}] = LSTM_n(y_p, h̃_i, c̃_i).    (4)

Notice that we have appended a subscript to LSTM(·) to emphasize that LSTM functions with different subscripts do not share parameters. Finally, the state for a specific child is derived by linearly projecting the output track and that child's track:

h̃_{j_n} = W_h [h̃_i^out; h̃_{i→j_n}] + b_h,  c̃_{j_n} = W_c [c̃_i^out; c̃_{i→j_n}] + b_c.    (5)

We find that this linear projection is useful for adding additional dependencies between the parent output and the descendants, so that the generated expression is more coherent.

Symbol Generation Layer with Attention. The symbol generation layer takes the output track produced by the tree-LSTM layer, h̃_i^out, as input, and outputs the probability distribution over output symbols for the current node. It adopts an attention mechanism (Bahdanau et al., 2014) to attend to the relevant part of the encoder, so that the input and output expressions have better correspondence. Formally, when generating the output for decoder node i, the attention weight on encoder node j is computed from h̃_i^out and h_j as follows:

e_i(j) = v^T tanh(W_d h̃_i^out + W_e h_j + b_a),  [a_i(1), …, a_i(J)] = softmax([e_i(1), …, e_i(J)]),    (6)

where J is the total number of input nodes at the encoder. Finally, the probability of symbol generation at node i, denoted as p_i, is computed by passing through a linear projection layer h̃_i^out and an attention context vector c_i, which is a linear combination of the encoder embeddings with the attention weights, i.e.

p_i = W_o [h̃_i^out; c_i] + b_o,  where c_i = Σ_{j=1}^{J} a_i(j) h_j.    (7)

4 LEARNING WITH HISS

In this section, we will elaborate on the training and inference schemes of HISS.
In particular, we will introduce several mechanisms to improve the exploration efficiency of HISS.

4.1 TRAINING

We apply the standard REINFORCE framework (Williams, 1992) for training, where the reward function is given by

R(T_I, T_O) = γ^card(T_O) if T_I ≡ T_O,  −β·γ^card(T_O) otherwise,    (8)

where ≡ denotes that the two expressions are equivalent; card(·) denotes the number of nodes in the expression tree, i.e. the length of the expression; and β is a hyperparameter that sets the penalty for not producing an equivalent expression. This reward prioritizes equivalence and, given equivalence, favors shorter expressions. We applied the probabilistic testing scheme proposed in Massalin (1987) to determine equivalence. For each input, multiple outputs are decoded via beam search, and a reward is evaluated on each of them. We introduce the following mechanisms to maintain the efficiency and stability of training.

Curriculum Learning. Since generating the simplified expression is divided into two actions, subtree selection and subtree simplification, directly learning both can lead to very inefficient exploration. Instead, we introduce a two-stage curriculum. The first stage trains only the encoder and decoder on very short expressions (maximum depth less than four); the subtree selector is not trained, and we always feed the entire tree to the decoder for simplification. The second stage trains all the modules on longer expressions.

Subtree Embedding Similarity. In order to guide the encoder toward learning meaningful embeddings, we introduce an additional ℓ2 loss to enforce that equivalent expressions have similar encoder embeddings, i.e. similar h_i's. Specifically, for each input expression T_I, we decode a set of generated expressions S = {T_O} with beam search, and obtain their embeddings {h(T_O)} by feeding them back into the encoder (here we add an argument to h to emphasize that the embedding is a function of the input expression).
Then the ℓ2 loss is expressed as

Σ_{T_O ∈ S} ‖h(T_O) − h(T_I)‖²₂ · (−1)^{1[T_I ≢ T_O]},    (9)

where 1[·] denotes the indicator function, which equals one if the statement in its argument is true, and zero otherwise; equivalent pairs are thus pulled together and non-equivalent pairs pushed apart. Note that this ℓ2 loss applies to the encoder only, and can be optimized by regular gradient descent methods; REINFORCE is not needed.

4.2 INFERENCE

Similar to training, inference is performed by decoding multiple outputs via beam search and selecting the best result as the final output. In order to accelerate the inference process, we introduce an offline procedure. During the first stage of curriculum training, i.e. training on very short expressions, all the simplified equivalence rules discovered are logged. During inference, if the subtree to be fed into the decoder has an exact match in the log, we apply the logged simplification directly, rather than redoing the entire decoding process.

5 EXPERIMENTS

We performed two experiments. The first compares HISS with human-independent naive search algorithms. The second compares HISS with existing human-dependent state-of-the-art systems on benchmark datasets. Throughout all the experiments, we use the same hyperparameter setting. Additional details on the hyperparameters and how they are determined can be found in appendix A.3. Besides the experiments discussed in this section, additional experimental results are reported in appendix B and a set of ablation studies is presented in appendix C.

5.1 COMPARING WITH HUMAN-INDEPENDENT METHODS

Since there are no existing human-independent methods specifically for symbolic superoptimization, we compare against several search algorithms. Due to the search complexity, the evaluation cannot be performed on very long expressions; therefore, this experiment is performed on the Traverse Equivalence dataset.
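The offline rule log described in Section 4.2 can be sketched as a simple exact-match cache; all names here are our own illustrative choices, not the paper's implementation.

```python
class RuleCache:
    """Offline log of simplification rules discovered during stage-one training.

    Keys are canonical (string) forms of subtrees; values are their logged
    simplified forms. At inference time, an exact match skips the decoder.
    """
    def __init__(self):
        self._rules = {}

    def log(self, expr, simplified):
        # Keep only the shortest known simplification for each expression.
        old = self._rules.get(expr)
        if old is None or len(simplified) < len(old):
            self._rules[expr] = simplified

    def lookup(self, expr):
        return self._rules.get(expr)

def simplify_subtree(cache, expr, decode):
    """Use a logged rule on an exact match; otherwise run the (expensive) decoder."""
    hit = cache.lookup(expr)
    return hit if hit is not None else decode(expr)

cache = RuleCache()
cache.log("(v0*72)-(v0*72)", "0")
out = simplify_subtree(cache, "(v0*72)-(v0*72)", decode=lambda e: e)  # cache hit
miss = simplify_subtree(cache, "v0+v1", decode=lambda e: e)           # falls back to decoding
```

Since stage-one expressions are short, the cache stays small while still covering the subtrees most frequently handed to the decoder.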
Figure 2: Comparison with human-independent methods in terms of hit rate (left), expression length reduction (middle), and tree size reduction (right), on the Traverse Equivalence dataset.

Traverse Equivalence Dataset. As a complement to the Halide dataset, we build a dataset by traversing all possible expressions with maximum depth four that consist of operations and symbols drawn from the Halide vocabulary in table 3. Among these expressions, we use the Finger Print method (Jia et al., 2019) to test whether they can be simplified, and discard those that cannot. From the remaining expressions, we sample 900 expressions as the training set, 300 as the validation set and 300 as the test set. The advantage of this dataset is that it is not built from human-predefined equivalence relations. The disadvantage is that it does not contain expressions with maximum depth greater than four, limited by the complexity of the Finger Print method. Additional details of our Traverse Equivalence dataset can be found in appendix A.1.

Training. Since HISS does not operate on very long expressions here, it is only trained with stage one of curriculum learning (the one with no subtree selection). Additional details regarding training can be found in appendix A.4.

Baselines. Two baseline search methods are compared: Monte Carlo Tree Search (MCTS) (Bertsekas, 1995) and Markov Chain Monte Carlo (MCMC) (Schkufza et al., 2013). MCTS decides the expression tree from root to leaves, choosing one symbol from the Halide vocabulary for each node, and adopts the Upper Confidence Bound (Kocsis & Szepesvári, 2006) to balance exploration and exploitation. Similar to Schkufza et al.
(2013), MCMC takes one of four transformations: 1) replace an operator with another random operator, generating or discarding operands if the two operators take different numbers of operands; 2) replace a variable/constant with another random variable/constant; 3) replace a subexpression with a random single variable/constant; 4) replace a variable/constant with a random expression. The probability distribution over transformations is defined as in Schkufza et al. (2013).

Metrics. Three metrics are introduced: 1) hit rate, defined as the percentage of expressions for which the model successfully found an equivalence given the computation budget parameter; 2) expression length reduction, defined as the reduction in the total number of tokens; and 3) tree size reduction, defined as the reduction in the number of nodes in the expression tree.

Results. The performance comparison of the three models is shown in figure 2. The sampling parameter on the horizontal axis refers to the beam size for HISS, the max-trials budget per decoded token for MCTS, and the sampling budget for MCMC. These quantities equivalently define the number of

Figure 3: Comparison with human-dependent methods in terms of expression length reduction (left) and tree size reduction (right), on the Halide dataset.

Table 1: Example simplification rules learned by HISS (left) and their corresponding rules listed in Halide (right). If the learned rule has no corresponding preset rule in Halide, the row is left blank.
| HISS Input | HISS Output | Halide Input | Halide Output |
|---|---|---|---|
| (c0+v0) - (v0+c0) | 0 | | |
| (c0/v0)*v0 | c0 | (y*x)/x | y |
| c0 - c0 | 0 | x - x | 0 |
| c0 ≤ c0 | true | x < x | false |
| c0 == c0 | true | x == x | true |
| min(c0, v0) ≤ v0 | true | x < min(x, y) | false |
| min(c0 + v0, c1) ≤ (c0 + v0) | true | | |
| min(c0, (min(v0, c1) + c2) + c1) ≤ ((c1 + c2) + c1) | true | | |
| (c0 + v0) - c0 | v0 | (x + y) - x | y |
| max(c0, c0) | c0 | max(x, x) | x |
| min(c0, v0 + c1) ≤ (c1 + v0) | true | | |
| c0 ≤ max(v0, c0) | true | max(y, x) < x | false |
| c0 / c0 | 1 | x / x | 1 |
| min(c0 * v0, c1) ≤ (c0 * v0) | true | | |
| c0 + (v0 - c0) | v0 | x + (y - x) | y |
| min(min(c0, v0), c1 - c2) ≤ v0 | true | | |
| min(min(c0, v0), v1) ≤ v0 | true | | |

search attempts per token. As can be seen, HISS is significantly more powerful in finding the simpler equivalent than MCTS and MCMC. MCMC performs almost as well as HISS in terms of hit rate, and both far outperform MCTS; this is because MCTS and HISS adopt top-down decoding in a huge decoding space, while MCMC starts with the input expression and applies local transformations, which makes it much easier to find an equivalence. However, MCMC achieves much worse average length reduction and average tree size reduction than HISS.

5.2 COMPARING WITH HUMAN-DEPENDENT METHODS

In this section, we compare HISS with existing human-dependent state-of-the-art methods on the Halide dataset.

Halide Dataset. The Halide dataset (Chen & Tian, 2018) is the benchmark dataset for symbolic expression simplification. It consists of equivalent expression pairs constructed from human-predefined rules. Since HISS is an unsupervised method, we only use the longer expression from each pair for training. Additional details of the Halide dataset can be found in appendix A.2.

Training & Inference. In this experiment, HISS is trained with both stages of curriculum learning. The first stage is trained on the moderately short expressions from the Traverse Equivalence dataset, with the configuration of section 5.1. The second stage is trained on the Halide training set in an unsupervised manner. More details can be found in appendix A.4.
Since HISS only simplifies one subtree at a time, while actual simplification usually requires sequentially simplifying multiple subtrees, we iteratively invoke the HISS procedure for both training and inference, as in Chen & Tian (2018). Table 2 shows an example of the iterative process. The iterations terminate when 1) the number of iterations reaches 20; 2) the simplification output becomes a single node; or 3) the subtree selector assigns small scores (below 0.05) to all subtrees.

Baselines. Two baselines are included: 1) Halide (Ragan-Kelley et al., 2013), which applies the Halide predefined rules; 2) Z3, the simplification function of the high-performance theorem prover Z3 (De Moura & Bjørner, 2008), which performs transformations using the Halide predefined rules. It is worth mentioning that both baselines have access to the Halide predefined rules that were used to construct the dataset, which gives them an advantage over HISS.

Metrics. Expression length reduction and tree size reduction are used as metrics.

Results. Figure 3 shows the performance of HISS compared with the baselines. As can be seen, HISS outperforms both Halide and Z3 on both metrics. This result is quite non-trivial because the Halide dataset is built exclusively from Halide's ruleset, to which both baseline algorithms have access. It therefore implies that even expressions specifically designed to be simplified by a set of predefined rules can still be further simplified by rules outside that set.

Table 2: Simplification process of HISS (upper) and Halide (lower) on Eq. (10). The subtree selected for simplification in the next iteration is marked with a box, unless the entire tree is selected.
| Step | HISS |
|---|---|
| 0 | ((((144 - (v0*72))/2)*2) + 4) ≤ (((150 - (v0*72))/4)*4) |
| 1 | ((((144 - (v0*72))/2)*2) + 4) ≤ (150 - (v0*72)) |
| 2 | ((144 - (v0*72)) + 4) ≤ (150 - (v0*72)) |
| 3 | (144 - (v0*72)) ≤ (146 - (v0*72)) |
| 4 | ((v0*72) - (v0*72)) ≤ 2 |
| 5 | 0 ≤ 2 |
| 6 | True |

| Step | Halide |
|---|---|
| 0 | ((((144 - (v0*72))/2)*2) + 4) ≤ (((150 - (v0*72))/4)*4) |
| 1 | (((72 - (v0*36))*2) + 4) ≤ (((150 - (v0*72))/4)*4) |
| 2 | (((72 - (v0*36))*2) + 4) ≤ ((37.5 - (v0*18))*4) |
| 3 | !(((37.5 - (v0*18))*4) < (((72 - (v0*36))*2) + 4)) |
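The iterative invocation used to produce traces like the one above can be sketched as a simple outer loop with the paper's three termination criteria (20 iterations, single-node output, all selector scores below 0.05). The helper names and the toy string representation are our own; `select_subtree` and `simplify_step` stand in for the trained selector and decoder.

```python
def iterative_simplify(expr, select_subtree, simplify_step,
                       max_iters=20, score_threshold=0.05):
    """Repeatedly pick a subtree and simplify it, mirroring HISS's outer loop.

    select_subtree(expr) -> (subtree, score); simplify_step(expr, subtree) -> expr.
    Stops after max_iters iterations, when the expression is a single node,
    or when no subtree scores above score_threshold.
    """
    for _ in range(max_iters):
        if is_single_node(expr):
            break
        subtree, score = select_subtree(expr)
        if score < score_threshold:
            break
        expr = simplify_step(expr, subtree)
    return expr

def is_single_node(expr):
    # Toy single-node test for string expressions: no operator symbols left.
    return not any(op in expr for op in "+-*/<>=!")

# Toy demo: a scripted "decoder" shrinks the expression one step at a time.
steps = iter(["(x - x) + 0", "0 + 0", "0"])
result = iterative_simplify(
    "((x - x) + 0) + (x - x)",
    select_subtree=lambda e: (e, 1.0),
    simplify_step=lambda e, s: next(steps),
)
```

The same loop is used in both training and inference; only the quality of the plugged-in selector and decoder changes.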
To illustrate how the completeness of the rules can impact on the simplification performance, table 2 compares the simplification traces of Halide and HISS for the following expression (((((144 - (v0*72))/2)*2) + 4) ((150 - (v0*72))/4)*4) (10) For Halide, each step represents the process of applying one Halide predefined rule to simplify the expression. For HISS, each step represents one simplification iteration. As each step, the subtree that is chosen for simplification for the next step is marked with box, unless the entire expression is chosen. As can be seen, HISS follows some reasonable steps to trim the constants and eliminates v0, but Halide gets stuck after some trivial constant reduction. The reason for this failure is there are no such rules as (((c0 (v0 c1))/c2) c2 7 c0 (v0 c1) or x + y 7 y + x or (x + y) z 7 x z + y z in the Halide ruleset. Of course, one can fix this problem one time by appending the aforementioned rules to the ruleset, but this does not fundamentally solve the problem because one would never be able to exhaustively list all the possible rules needed for simplification in the ruleset. Why Subtree Selector Matters. Table 2 also illustrates why the subtree selector is an integral part of HISS. In many reduction steps, only a subexpression is simplified (as boxed). Without the subtree selector, the decoder of HISS would have to process the entire expression only to simplify the subexpression, which makes it hard to effectively learn via reinforcement learning. For more analysis on the effect of the subtree selector, please refer to appendix C. 6 CONCLUSIONS We presented HISS as a symbolic expression simplification algorithm that is independent of human knowledge. We demonstrated that removing the dependence on humans is advantageous for this task Published as a conference paper at ICLR 2020 because machines can autonomously figure out rewrite rules that humans fail to discover, and thus achieve comparably well simplification results. 
We also showed that we are one step closer to finding an equivalence-preserving embedding for symbolic expressions. Although HISS has achieved promising results, there is still much room for improvement. While HISS adopts several techniques to reduce the complexity of the search space, learning simplification rules on very long expressions remains challenging, which calls for more efficient reinforcement learning algorithms as a future research direction.

7 ACKNOWLEDGEMENT

We thank the anonymous reviewers for their valuable feedback. This paper is supported by NSF grant 1829525.

REFERENCES

David Alvarez-Melis and Tommi S. Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. 2016.

Forough Arabshahi, Sameer Singh, and Animashree Anandkumar. Combining symbolic expressions and black-box function evaluations in neural programs. arXiv preprint arXiv:1801.04342, 2018.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Sebastian Buchwald. Optgen: A generator for local optimizations. In International Conference on Compiler Construction, pp. 171–189. Springer, 2015.

Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.

Cheng-Hao Cai, Yanyan Xu, Dengfeng Ke, and Kaile Su. Learning of human-like algebraic reasoning using deep feedforward neural networks. Biologically Inspired Cognitive Architectures, 25:43–50, 2018.

Xinyun Chen and Yuandong Tian. Learning to progressively plan. arXiv preprint arXiv:1810.00337, 2018.

Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation.
In Advances in Neural Information Processing Systems, pp. 2547–2557, 2018.
Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 337–340. Springer, 2008.
Mehdi Drissi, Olivia Watkins, Aditya Khant, Vivaswat Ojha, Pedro Sandoval, Rakia Segev, Eric Weiner, and Robert Keller. Program language translation using a grammar-driven tree-to-tree model. arXiv preprint arXiv:1807.01784, 2018.
Abhinav Jangda and Greta Yorsh. Unbounded superoptimization. In Proceedings of the 2017 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 78–88. ACM, 2017.
Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. TASO: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 47–62. ACM, 2019.
Rajeev Joshi, Greg Nelson, and Keith Randall. Denali: A goal-directed superoptimizer, volume 37. ACM, 2002.
Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.
Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744, 2016.
Hongbin Liu and Ickjai Lee. End-to-end trajectory transportation mode classification using Bi-LSTM recurrent neural network. In 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), pp. 1–5. IEEE, 2017.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
Henry Massalin. Superoptimizer: A look at the smallest program. In ACM SIGARCH Computer Architecture News, volume 15, pp.
122–126. IEEE Computer Society Press, 1987.
Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013.
Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. In ACM SIGPLAN Notices, volume 48, pp. 305–316. ACM, 2013.
Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
Victor Zhong, Caiming Xiong, and Richard Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.

A EXPERIMENT SETUP

A.1 TRAVERSE EQUIVALENCE DATASET

Table 3 shows the vocabulary of Halide syntax, which is used to construct the traverse equivalence dataset. As there are 14 binary operators, 1 unary operator, and 1 ternary operator, the total number of distinct expressions of depth two exceeds 50k, and the number of distinct expressions of depth three exceeds 1.25e+14. However, the rule of thumb is that simplification is most likely when some variables appear repeatedly in the expression; for example, (x + y) - x ↦ y, whereas (x + z) + y cannot be simplified.
Therefore, we constrain the enumeration of expressions to the 16 operators, 5 constants, and the first three variables (v0, v1, v2). As a result, roughly 3 billion expressions are enumerated. To check whether an expression can be simplified among this plethora of candidates, we adopt the fingerprint method (Jia et al., 2019). The idea is that equivalent expressions must produce the same outputs under the same inputs. Initially, n random value assignments to all variables are generated, and the fingerprint of an expression is defined as the tuple of its n outputs under those assignments. Expressions with identical results on all n (n = 4 in our case) assignments are grouped into a shard. It follows that any pair of equivalent expressions must lie in the same shard, so the shards can be processed in parallel. It is quite possible that the initial assignments lead to an extremely large shard; therefore, when processing each shard, we repeatedly compute new fingerprints until the shard breaks into shards containing fewer than 5,000 expressions each. Then, we use probabilistic testing on each pair of expressions from the same shard to determine whether the pair is equivalent. For each equivalent pair, the longer expression is labeled as reducible. After all pairs are tested, the expressions labeled as reducible are retained, and the rest are removed. Further, if an expression can be simplified, any longer expression containing it can also be simplified; so an expression that contains a simplifiable subexpression does not represent the minimal instance of that simplification and is removed from the dataset as well.
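As an illustration, the sharding step can be sketched as follows. This is a minimal, self-contained sketch; representing expressions as Python callables, and all names below, are our own choices rather than the released implementation:

```python
import random
from collections import defaultdict

def fingerprint(expr_fn, assignments):
    """Fingerprint of an expression: the tuple of its outputs on a fixed
    set of random variable assignments (Jia et al., 2019)."""
    return tuple(expr_fn(a) for a in assignments)

def shard_by_fingerprint(exprs, n=4, seed=0):
    """Group candidate expressions into shards that agree on all n random
    assignments; equivalent expressions always land in the same shard."""
    rng = random.Random(seed)
    assignments = [{v: rng.randint(-100, 100) for v in ("v0", "v1", "v2")}
                   for _ in range(n)]
    shards = defaultdict(list)
    for name, fn in exprs:
        shards[fingerprint(fn, assignments)].append(name)
    return list(shards.values())

# Hypothetical toy candidates: two equivalent forms plus a distinct one.
candidates = [
    ("(v0+v1)-v0", lambda a: (a["v0"] + a["v1"]) - a["v0"]),
    ("v1",         lambda a: a["v1"]),
    ("v0*2",       lambda a: a["v0"] * 2),
]
shards = shard_by_fingerprint(candidates)
```

Since `(v0+v1)-v0` and `v1` compute the same value on every assignment, they are guaranteed to share a shard; non-equivalent expressions are separated with high probability, which is why the subsequent pairwise probabilistic testing is still required.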
After the fingerprint method and the minimality check, we obtain approximately 20,000 equivalent expression pairs, from which 1,500 samples are randomly drawn and further split into training, testing, and validation sets with a ratio of 6:2:2.

A.2 HALIDE DATASET

The Halide dataset (Chen & Tian, 2018) contains around 10,000 training sequences, 1,000 testing sequences, and 1,000 validation sequences, generated and split randomly. The number of words per sequence ranges from 6 to 100, averaging 58. The generated expressions contain many constants beyond {0, 1, 2, True, False}; these constants are renamed to the constant symbols shown in table 3, with identical constant values mapped to the same symbol. There are at most 14 constant symbols and at most 15 variables in a single expression.

A.3 HYPERPARAMETER SETTING AND DETERMINATION

The input to the network is a one-hot encoded sequence with a vocabulary size of 50, which is first encoded by a single fully connected layer with output size 32. The LSTM hidden size is 64 for both the encoder and the decoder, a common setting in previous work (Liu & Lee, 2017), and the number of layers is 1. The output size of the encoder is 64, and the output size of the decoder equals the vocabulary size (50). The subtree selector consists of two feed-forward layers with output sizes of 128 and 1, respectively. The model is trained with the ADAM optimizer with a learning rate of 1e-3. Rather than being tuned on the validation set, this hyperparameter setting is determined by following the common settings in previous work (Liu & Lee, 2017) and by referencing the Halide vocabulary size. The same hyperparameter setting is applied throughout this research project. Finetuning the hyperparameters is expected to have a minor effect on performance compared to refining the major algorithm design, e.g.
introducing the subtree selector and the ℓ2 embedding regularizer, which are discussed in further detail in the ablation studies in appendix C.

Table 3: The vocabulary of the symbols and operators that are used to construct the traverse equivalence dataset.

Type             | Symbols
Operators        | min, max, ≤, ≥, <, >, ==, !=, !, select, +, -, *, /, &&, ||
Constants        | 0, 1, 2, True, False
Variables        | v0, v1, ..., v14
Constant symbols | c0, c1, ..., c13

The penalty for not producing an equivalent expression, β (as in Eq. (8)), is set to 0.1. This is motivated by our observation that, in the Monte Carlo sampling results of a randomly initialized model, nonequivalent expressions outnumber equivalent ones by roughly 10:1. By matching β to this ratio, we balance the reward and penalty and achieve an average reward of around zero, at least in the initial iterations, which we find contributes to more stable REINFORCE training. Note, however, that β is not a crucial parameter, because the baseline removal in REINFORCE automatically balances the reward and penalty; a good choice of β mostly benefits the onset of training, when the baseline estimate is not yet accurate. The weight of the embedding similarity loss term is set to 0.1.

A.4 TRAINING AND INFERENCE

As mentioned, the training for section 5.1 involves only the first stage of the curriculum learning, whereas the training for section 5.2 involves both stages. Below are the details of the training and inference schemes.

Training Iterations. For the stage-one training in both experiments, we apply an identical setting, where the encoder-decoder model is trained for 10 epochs (90,000 steps). The model with the highest hit rate on the validation set is selected for evaluation in section 5.1 as well as for the stage-two training of the curriculum.
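Returning briefly to the reward balance of appendix A.3, the arithmetic behind the choice of β can be sketched in a few lines. Treating the equivalence reward as a unit reward is our simplifying assumption; the exact form of Eq. (8) may differ:

```python
def reward(equivalent, beta=0.1):
    """Sketch of the reward/penalty balance around Eq. (8): a unit reward
    for decoding an equivalent expression (the unit value is an assumption)
    and a fixed penalty -beta otherwise."""
    return 1.0 if equivalent else -beta

# With nonequivalent outcomes outnumbering equivalent ones roughly 10:1,
# beta = 0.1 drives the average reward of a random model toward zero.
outcomes = [True] + [False] * 10
avg = sum(reward(o) for o in outcomes) / len(outcomes)
```

A near-zero average reward at the start of training keeps the REINFORCE gradient from being dominated by a constant offset before the learned baseline becomes accurate.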
The stage-two training of the full HISS simplification pipeline takes two weeks on an RTX 2080 for 40 epochs (400k steps).

Beam Search. The beam search algorithm proceeds as follows. In the beginning, the top k choices for the root node with the highest probability are decoded. Then, for each choice of root node value, the next step decodes the top k choices for each child of the root. Since the decoding processes of different children of the same parent are independent, we take the Cartesian product of the k choices for each node at the current step and keep the top k combinations with the highest probability. Proceeding this way, at step t beam search has decoded the k highest-probability trees up to depth t. Finally, the top s (s ≤ k) expressions are backtracked and used to estimate the expected reward of the model. The probabilities of beam-search-decoded expressions are re-normalized following Bunel et al. (2018). In all the experiments, we set the beam size k = 20 and s = 20.

Constant Folding. Since the Halide dataset contains a large number of constant values, we apply constant folding to both the stage-two training and the inference of HISS, as well as to all other baselines. Specifically, once the expression is rewritten by the neural network in the symbolic domain, we check whether all leaf nodes of a subtree are constants; if so, the subtree is executed and replaced by a single node holding the execution result. Constant folding is applied in both training and inference.

B ADDITIONAL EXPERIMENT RESULTS

B.1 ATTENTION VISUALIZATION

To understand the attention mechanism, we visualize the attention over the input sequence in figure 4. We find that when decoding an operator, the attention tends to be flat (with a few exceptions in the right two figures), because the model needs to understand the overall logic.
This is different from machine translation or summarization, where the output attention for a single word is usually focused on a few input tokens. On the other hand, when decoding a variable, the model attends sharply to the corresponding variable in the input.

Figure 4: Attention weights on the input sequences for each token decoded. The x-axis shows the output sequence, and the y-axis shows the input sequence. The input sequence is encoded from leaves to root, and the output sequence is decoded from root to leaves. Tokens are re-arranged in the natural order for better visualization.

Table 4: Involved simplification rules discovered by HISS.

Input                                        | Output
(!true) < (y 1)                              | !(true y)
(y true)||(y true)                           | true
max(z, 1) (false||x)                         | true
(1 z) (z 1)                                  | 1 z
min(x, true)&&(z/2)                          | x&&z
1 (1 < x)                                    | 1 x
select(z, z, true) == select(z, false, true) | !z
(x y)&&true                                  | y&&x
(y + 1) > (y&&2)                             | y 0

Table 5: Mean and standard deviation of the performance metrics across random initializations. The experiment setting is the same as in section 5.1.

Sampling Parameter          | 5             | 10            | 15            | 20
Hit Rate                    | 0.391 ± 0.013 | 0.391 ± 0.009 | 0.406 ± 0.014 | 0.406 ± 0.014
Expression Length Reduction | 2.250 ± 0.172 | 2.250 ± 0.140 | 2.578 ± 0.242 | 2.578 ± 0.242
Tree Size Reduction         | 1.078 ± 0.077 | 1.078 ± 0.069 | 1.219 ± 0.114 | 1.219 ± 0.114

B.2 EXAMPLES OF INVOLVED SIMPLIFICATION RULES

To further appreciate the ability of HISS to find equivalences, in addition to the rules listed in table 1, table 4 lists some more involved simplification rules discovered by HISS on randomly generated expressions. In fact, it took the authors quite a while to verify the equivalences. These rules are hardly useful in practice, because no human would write code this way, but they vividly illustrate the advantage of HISS in finding powerful simplifications beyond human knowledge.
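Equivalences like those in table 4 can be checked mechanically with the probabilistic testing described in appendix A.1: evaluate both sides on many random assignments and look for a disagreement. A minimal sketch follows; the tested identity and all names here are our own illustration, not part of HISS:

```python
import random

def probably_equivalent(f, g, variables, trials=1000, seed=0):
    """Randomized equivalence test: report False as soon as the two
    expressions disagree on a random assignment, True if they never do."""
    rng = random.Random(seed)
    for _ in range(trials):
        env = {v: rng.randint(-50, 50) for v in variables}
        if f(env) != g(env):
            return False
    return True

# Illustrative identity: min(x, y) + max(x, y) == x + y.
lhs = lambda e: min(e["x"], e["y"]) + max(e["x"], e["y"])
rhs = lambda e: e["x"] + e["y"]
wrong = lambda e: e["x"] - e["y"]
```

A False verdict is definitive, while a True verdict is only probabilistic, which is why appendix A.1 pairs this test with fingerprint sharding to prune the candidate pairs first.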
B.3 ROBUSTNESS AGAINST RANDOM INITIALIZATION

To assess whether the performance of HISS is robust to random model initialization, we perform the experiment of section 5.1 eight times with different random initializations and compute the mean and standard deviation of the three metrics, shown in table 5. The standard deviations are very small compared to the absolute values of the means, which shows that random initialization has a minor influence on the performance of HISS.

C ABLATION STUDIES

In this section, we present a set of ablation studies that investigate how much each major component of HISS contributes to performance. The components of interest are the subtree embedding similarity loss as in Eq. (9), the subtree selector introduced in section 3.3, and the tree-LSTM encoder-decoder architecture of sections 3.2 and 3.4. As an overview, figure 5 compares the performance of HISS and the following variants on the Halide dataset.

Figure 5: Performance of different variants of HISS on the Halide dataset.

No-Embed-Loss: The original HISS trained without the subtree embedding similarity loss.
No-Selector: The original HISS without the subtree selector.
Linear-LSTM: The tree-LSTM structure is replaced with a simple linear LSTM. The embedding loss and the subtree selector are no longer applicable and are therefore also removed.

Other than the aforementioned variations, all experiment settings are identical to the experiment in section 5.2. As can be seen in figure 5, training without the embedding similarity loss or without the subtree selector significantly compromises performance. Furthermore, removing the tree-LSTM structure leads to an almost complete failure.
The following subsections investigate why each of these modules has such a significant impact on performance.

C.1 SUBTREE EMBEDDING SIMILARITY

We would like to investigate the specific effects of introducing the embedding similarity loss. Its direct goal is to cluster together the embeddings of equivalent expressions, so we first check whether this goal is achieved. To evaluate this, in the experiment described in section 5.1, we select the six most-populated subsets of equivalent expressions in the test set and evaluate the similarity of the embeddings computed by HISS in two ways. First, the embeddings are projected to two-dimensional space using t-SNE (Maaten & Hinton, 2008), which yields the scatter plot in figure 6(a-1). Points corresponding to equivalent expressions are shown in the same color. As can be seen, the embeddings of equivalent expressions are highly clustered. Since this result concerns only a low-dimensional projection of the embeddings, we also compare their inter- and intra-subset distances in the original space. The inter-/intra-subset distance of a subset is defined as the Euclidean distance between the centroid of the subset and the samples outside/within the subset. Figure 6(a-2) shows the box plot of these distances. There is a significant difference between the intra- and inter-subset distances; except for the first subset, the quartile intervals are well separated. Figure 6(b-1) and (b-2) show the same plots for the No-Embed-Loss model, i.e. the HISS variant trained without the embedding similarity loss. The scatter points are visibly less well-clustered, and the difference between the inter- and intra-subset distances, though still present, is smaller. Therefore, we can draw two conclusions.
First, even without the embedding similarity loss, HISS is still able to learn embeddings that are somewhat clustered by equivalence, which shows that the goal of finding equivalent expressions roughly aligns with clustering the embeddings by expression equivalence. Second, the embedding similarity loss further improves the clustering.

Figure 6: Evaluation of the similarity of the embeddings of equivalent expressions. The upper plots ((a-1) and (a-2)) are evaluated on the original HISS model; the lower plots ((b-1) and (b-2)) on the No-Embed-Loss model. The left plots ((a-1) and (b-1)) are scatter plots of embeddings projected onto two-dimensional space using t-SNE; points corresponding to equivalent expressions are shown in the same color. The right plots ((a-2) and (b-2)) are box plots of the intra-subset (blue) and inter-subset (orange) distances of the embeddings trained with the ℓ2 loss. The bars in each box mark the 25%, 50%, and 75% quartiles, the whiskers denote 1.5 interquartile ranges (IQR) beyond the quartiles, and the dots represent extreme values.

However, what we are more interested in is why improved clustering of the embeddings leads to a significant performance gain. To answer this question, figure 7 compares the average reward of HISS and of its variant without the embedding similarity loss as a function of the training epoch, which offers an interesting insight. Despite their initial difference, which is due to random initialization, both reward curves reach the same plateau at around epoch seven, where both stagnate for a while. However, with the help of the embedding similarity loss, HISS
is finally able to escape the plateau and reach a higher reward level, whereas the variant without the embedding similarity loss remains trapped. This result suggests that the embedding similarity loss provides an extra training signal that addresses the convergence issue of REINFORCE.

Figure 7: Expected reward of HISS trained with the embedding similarity loss (HISS) and without it (No-Embed-Loss).

Table 6: Simplification traces of HISS with and without the subtree selector on Eq. (11). The subexpressions selected by the subtree selector are boxed, unless the entire expression is chosen.

Step | HISS with the Subtree Selector
0    | ((v1+v2)-7) ≤ (((((max(v1,16)+18)/8)*8)+(v1+v2))-27)
1    | ((v1+v2)-7) ≤ (((max(v1,16)+18)+(v1+v2))-27)
2    | (v1+v2) ≤ (((max(v1,16)+18)+(v1+v2))-20)

Step | HISS without the Subtree Selector
0    | ((v1+v2)-7) ≤ (((((max(v1,16)+18)/8)*8)+(v1+v2))-27)
1    | (v1+v2) ≤ (((((max(v1,16)+18)/8)*8)+(v1+v2))-20)

Table 7: Output sequences of the Linear-LSTM trained on the traverse equivalence dataset.

Linear-LSTM with only the equivalence reward
Input                          | Output
max(v1 + v0, v2 + v0)          | [ , , c8 , 2 , 2 , 1 , v9 , , , c8 , 2 , 2 ]
min(min(v2, v0 + v1), v0 + v3) | [ , , c9 , c9 , c9 , c9 , 2 , 2 , 1 , , , c9 ]
(v0 &&(v2||v1))||v1            | [ , , c8 , 2 , 2 , 1 , , , c9 , c9 , c9 , 2 ]

Linear-LSTM trained with the equivalence and valid-expression rewards
Input                 | Output
min(v2, v0 + v1) - v1 | [ v1 ]
v1||(v0||v1)          | [ c0 ]
v0 - min(v2, v0 + v1) | [ 2 , , v0 ]

C.2 SUBTREE SELECTOR

We have already demonstrated the inner workings of the subtree selector in section 5.3 and shown in figure 5 that the subtree selector is indispensable for the good performance of HISS. Here we would like to explain intuitively why this is the case.
Table 6 compares the simplification traces of HISS with and without the subtree selector on the following expression

((v1+v2)-7) ≤ (((((max(v1,16)+18)/8)*8)+(v1+v2))-27)    (11)

As can be seen, HISS with the subtree selector first cancels the *8 and /8 terms in the subexpression ((max(v1,16)+18)/8)*8 before applying the cancellation rule that merges the constants -7 and -27. By contrast, HISS without the subtree selector is unable to identify the reducible subexpression and only applies the cancellation rule. This shows that the subtree selector helps because it allows a thorough check of each subexpression, and therefore contributes to better simplification; without it, the algorithm is prone to overlooking small reducible subexpressions.

C.3 TREE LSTM VS. LINEAR LSTM

Figure 5 shows the conspicuous disadvantage of the Linear-LSTM model. To illustrate its fundamental issue, we sample some outputs of the model trained on the traverse equivalence dataset, listed in table 7. The model was trained with REINFORCE for 200k steps under two reward settings: 1) the equivalence reward as in Eq. (8); 2) the equivalence reward plus a valid-expression reward, i.e. an additional reward of 0.1 if the decoded sequence is syntactically correct. As can be seen in table 7, when trained with only the equivalence reward, the linear decoder is unable to decode even a valid expression, let alone an expression equivalent to the input. Hence the linear decoder can hardly converge or learn useful information from the reward, which is almost always a negative constant. As a remedy, we can add a valid-expression reward to guide the linear LSTM toward generating valid expressions.
However, as can be observed in table 7, under this additional reward the model overfits by generating only short valid expressions that are not equivalent to the input. This is because short expressions containing only one variable or constant have a higher generation probability than longer valid expressions. During training, whenever the model happens to generate such a single-token expression, it receives the positive valid-expression reward, so the behavior of generating only a single constant or variable is reinforced. Still, the probability that a random variable is equivalent to the input expression is small, so the model can hardly be guided by the equivalence reward and focuses only on generating short but valid expressions. This experiment demonstrates the advantage of using the tree LSTM, which is guaranteed to generate syntactically correct expressions and therefore has a much higher probability of hitting an equivalent expression. As a result, the tree LSTM is much better guided by the equivalence reward.
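The validity guarantee can be seen in a small sketch: a decoder that expands each node according to its operator's arity can only emit well-formed expressions, whereas a flat token sequence offers no such guarantee. The decoder below is a deliberately simplified stand-in for the tree LSTM (random choices in place of learned ones), using a subset of the table 3 vocabulary:

```python
import random

# Operator arities (a subset of the table 3 vocabulary).
ARITY = {"+": 2, "-": 2, "*": 2, "min": 2, "max": 2, "!": 1}
LEAVES = ["v0", "v1", "v2", "0", "1", "2"]

def decode_tree(rng, max_depth=3):
    """Toy stand-in for the tree decoder: every node spawns exactly as many
    children as its operator's arity, so the decoded expression is always
    syntactically well-formed (rendered here in prefix notation)."""
    if max_depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)
    op = rng.choice(sorted(ARITY))
    children = [decode_tree(rng, max_depth - 1) for _ in range(ARITY[op])]
    return "%s(%s)" % (op, ", ".join(children))

rng = random.Random(0)
exprs = [decode_tree(rng) for _ in range(10)]
```

Because every decoded string parses, the equivalence reward fires whenever the tree happens to match the input, while a linear decoder must first stumble upon syntactic validity before any such signal exists.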