Published as a conference paper at ICLR 2022

NEURAL PROGRAM SYNTHESIS WITH QUERY

Di Huang1,2,4, Rui Zhang1,4, Xing Hu1,4, Xishan Zhang1,4, Pengwei Jin1,2,4, Nan Li1,3,4, Zidong Du1,4, Qi Guo1 & Yunji Chen1,2
1SKL of Computer Architecture, Institute of Computing Technology, CAS
2University of Chinese Academy of Sciences
3University of Science and Technology of China
4Cambricon Technologies
Corresponding author. Contact: {huangdi20b, zhangrui}@ict.ac.cn.

Aiming to find a program satisfying the user intent given input-output examples, program synthesis has attracted increasing interest in the area of machine learning. Despite the promising performance of existing methods, most of their success comes from the privileged information of well-designed input-output examples. However, providing such input-output examples is unrealistic because it requires the users to have the ability to describe the underlying program with a few input-output examples under the training distribution. In this work, we propose a query-based framework that trains a query neural network to generate informative input-output examples automatically and interactively from a large query space. The quality of the query depends on the amount of mutual information between the query and the corresponding program, which can guide the optimization of the query framework. To estimate the mutual information more accurately, we introduce the functional space (F-space), which models the relevance between the input-output examples and the programs in a differentiable way. We evaluate the effectiveness and generalization of the proposed query-based framework on the Karel task and the list processing task. Experimental results show that the query-based framework can generate informative input-output examples that match and even outperform well-designed input-output examples.

1 INTRODUCTION

Program synthesis is the task of automatically finding a program that satisfies the user intent expressed in the form of some specification, such as input-output examples (Gulwani et al., 2017). Recently, there has been increasing interest in tackling it with neural networks in various domains, including string manipulation (Gulwani, 2011; Gulwani et al., 2012; Devlin et al., 2017b), list processing (Balog et al., 2017; Zohar & Wolf, 2018), and graphic applications (Ellis et al., 2018; 2019). Despite their promising performance, most of their success relies on well-designed input-output examples, without which the performance of the program synthesis model drops heavily. For example, in Karel (Devlin et al., 2017b; Bunel et al., 2018), the given input-output examples are required to have a high branch coverage ratio on the test program; in list processing, the given input-output examples are guided by constraint propagation (Balog et al., 2017) to ensure their effectiveness on the test program. However, providing such input-output examples is unrealistic because it requires the users to be experienced in programming to describe the underlying program with several input-output examples. Worse, the users must be familiar with the distribution of the training dataset to prevent themselves from providing out-of-distribution examples (Shin et al., 2019). In summary, how to generate informative input-output examples without expert experience remains an important and challenging problem.
In this paper, we propose a query-based framework to automatically and efficiently generate informative input-output examples from a large space, where "informative" means that the underlying program can be easily distinguished with these examples. This framework consists of two parts, the query network and the synthesis network, and each part is trained separately. The query network is trained to generate the input-output examples in an iterative manner, and then the synthesis network is trained to synthesize the program from the input-output examples generated by the query network. The framework has three advantages: (1) There is no demand for well-designed input-output examples in either training or testing, which leads to good generalization. (2) The query network works in an efficient generative manner, which substantially reduces the computation cost when facing problems with large input-output spaces. (3) The query network serves as a plug-and-play module with high scalability, because it is entirely separated from the program synthesis process.

Figure 1: The query-based framework. The query network generates informative input-output examples by interacting with the user, and then the generated examples are sent to the synthesis network to synthesize the underlying program.

To train the query network, the key idea is to model the query process as a decision tree and observe that the informativeness of the input-output examples is associated with the amount of mutual information (i.e., the information gain in the decision tree) between the input-output examples and the programs; the query network can therefore be optimized by maximizing the mutual information. The mutual information can be approximated by the InfoNCE loss (van den Oord et al., 2018), which depends on the relevance between the samples. However, due to the many-to-many relationship between input-output examples and programs, this relevance is difficult to measure directly. To this end, we introduce the functional space (F-space). In F-space, each program can be projected into a vector, and each input-output example corresponds to a set of candidate programs, which can be represented by a Normal distribution (Sun & Nielsen, 2019). Whenever a new example is queried, the new set of candidate programs is produced by the intersection of the two original sets of programs, and the new distribution is produced by the product of the two original distributions. The more examples are queried, the smaller the program set becomes and the lower the entropy of the distribution. With the InfoNCE loss and F-space, the query network is trained to generate input-output examples whose F-space distribution maximizes the probability of the corresponding program and minimizes the probability of the others. Once the query network is trained, the training of the synthesis network is no different from what other researchers have done before, except that the input-output examples are generated by the query network. We evaluate our method on the Karel task and the list processing task, both of which have large input spaces.
Without utilizing well-designed input-output examples in either training or testing, we match and even outperform the state-of-the-art performance on the Karel task and the list processing task, which shows the effectiveness and generalization of our query-based framework.

2 PROBLEM STATEMENT

Intuitively, consider an oracle that contains some kind of symbolic rules (e.g., programs); our goal is to discover the symbolic rules inside this oracle. To do this, traditional program synthesis assumes that the oracle can provide some informative signals (e.g., input-output examples) automatically, based on which the synthesizer can find the rules. However, this is a strong assumption that places high requirements on the oracle and is impractical in many cases. In this work, we consider a query problem where the informative signals are gained actively by querying the oracle, under the much weaker assumption that the behavior of the oracle is deterministic (i.e., the same input always results in the same output). Following this intuition, a reasonable solution for the query problem is to make the programs distinguishable with as few queries as possible. To make this statement concrete and practical, we introduce the following formulations.

First, we define functional equivalence, which is consistent with the definition of the equivalence of functions in mathematics.

Definition 2.1 (Functional equivalence). Let $I$ be the input example domain containing all valid input examples, $O$ be the output example domain containing all possible output examples, and $P$ be the program domain containing all valid programs under the domain-specific language (DSL). Each program $p \in P$ can be seen as a function $p: I \to O$. Two programs $p_i \in P$ and $p_j \in P$ are functionally equivalent if and only if $\forall x \in I,\ p_i(x) = p_j(x)$.

Using the concept of functional equivalence, we can formulate the program synthesis task:

Definition 2.2 (Program synthesis). Suppose there is an underlying program $p \in P$ and $K$ input-output examples ⟦e⟧_K = {(x_k, y_k) | (x_k, y_k) ∈ I × O, k = 1 … K} generated by it. The program synthesis task aims to find a program $\hat{p}$ which is functionally equivalent to $p$ using ⟦e⟧_K.

Following the definitions above, we can define the task of the query.

Definition 2.3 (Query). Given a program $p \in P$ and a set of $K$ historical input-output examples ⟦e⟧_K, the query process first generates a query with the query network $f_q$: $x_{K+1} = f_q(⟦e⟧_K) \in I$. Then, a corresponding response is given by the program: $y_{K+1} = p(x_{K+1}) \in O$. The query and response are added to the history of input-output examples, ⟦e⟧_K ← ⟦e⟧_K ∪ {(x_{K+1}, y_{K+1})}, and this process repeats.

The query process aims to distinguish the underlying program with as few queries as possible, based on which we can define the optimal query strategy.

Definition 2.4 (The optimal query strategy). Given a set of programs $P$ and a target program $p^*$, the optimal query strategy $Q^*$ aims to distinguish $p^*$ by generating as few input-output examples ⟦e⟧_K as possible.
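The query process in Definition 2.3 can be read as a simple interaction loop between the query network and the oracle program. The sketch below illustrates this loop in Python; the query_net object, its generate method, and the oracle callable are hypothetical placeholders for the components described in this paper, not part of any released implementation.

```python
from typing import Callable, List, Tuple

Example = Tuple[object, object]  # an (input, output) pair

def query_loop(query_net, oracle: Callable, start_example: Example,
               num_queries: int = 5) -> List[Example]:
    """Iteratively query a deterministic oracle program (Definition 2.3).

    `query_net.generate` is assumed to map the current example history to
    the next input x_{K+1}; the oracle returns y_{K+1} = p(x_{K+1}).
    """
    history: List[Example] = [start_example]   # the <sos> example
    for _ in range(num_queries):
        x_next = query_net.generate(history)   # x_{K+1} = f_q([[e]]_K)
        y_next = oracle(x_next)                # y_{K+1} = p(x_{K+1})
        history.append((x_next, y_next))
    return history

# The resulting history is then handed to the synthesis network,
# e.g. program = synthesis_net.predict(history).
```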
When the query space is large, it is impractical to find the optimal query strategy by direct search. Thus, we view the query process as the expansion of a decision tree where the leaves are the programs, the nodes are the queries, and the edges are the responses; the generation of queries can then be guided by the information gain, which is also known as the mutual information:
$$⟦e⟧_K^* = \arg\max_{⟦e⟧_K} I(P; ⟦e⟧_K). \qquad (1)$$
The mutual information is hard to calculate due to the large spaces of queries and programs. To this end, we resort to the InfoNCE loss (van den Oord et al., 2018), which estimates a lower bound of the mutual information:
$$\mathcal{L}_{NCE} = -\mathbb{E}\left[\log\frac{\exp(f(⟦e⟧_K, p_n))}{\sum_{i=1}^{N}\exp(f(⟦e⟧_K, p_i))}\right], \qquad (2)$$
$$I(P; ⟦e⟧_K) \ge \log(N) - \mathcal{L}_{NCE}, \qquad (3)$$
where $f(\cdot, \cdot)$ denotes a relevance function. Maximizing the mutual information is thus equivalent to minimizing $\mathcal{L}_{NCE}$. Intuitively, InfoNCE maximizes the relevance between positive pairs and minimizes the relevance between negative pairs. Specifically, for a batch of data $\{(⟦e⟧_{K_n}, p_n)\}_{n=1}^{N}$, we construct positive pairs as $\{(⟦e⟧_{K_i}, p_i)\}$ and negative pairs as $\{(⟦e⟧_{K_i}, p_j) \mid i \neq j\}$. Traditionally, the relevance is defined as the dot product of the sample embeddings, which may be inaccurate given the many-to-many relationship between input-output examples and programs. Next, we therefore introduce the functional space (F-space) to model this relationship properly.
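As a concrete reference, the following sketch shows how the InfoNCE objective of Equation 2 can be computed for a batch of example-set/program embedding pairs. It is a generic PyTorch-style illustration under the assumption that the relevance function f is a dot product; the paper later replaces this relevance with the F-space probability of Equation 10.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(example_emb: torch.Tensor, program_emb: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """InfoNCE loss over a batch (Equation 2).

    example_emb: (N, D) embeddings of the queried example sets [[e]]_K
    program_emb: (N, D) embeddings of the corresponding programs
    Row i of the score matrix pairs example set i with every program;
    the diagonal entries are the positive pairs.
    """
    scores = example_emb @ program_emb.t() / temperature   # (N, N) relevance f(.,.)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)                # -E[log softmax diagonal]

# I(P; [[e]]_K) >= log(N) - loss, so minimizing the loss tightens
# the lower bound on the mutual information (Equation 3).
```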
3.1 F-SPACE

Definition 3.1 (F-space and functional distance). F-space is an $|I|$-dimensional space which consists of all valid programs that can be implemented in the program domain $P$. Each program is represented by its $|I|$ output examples $v = (y_1, y_2, \dots, y_{|I|})$. The distance in F-space is measured by the number of differing output examples: $d(v_i, v_j) = |\mathrm{diff}(v_i, v_j)|$.

Figure 2: An illustration of F-space. Left: the projection of the input-output examples ⟦e⟧_K and the program p. Middle: the subset relationship between ⟦e⟧_{K_i} and ⟦e⟧_{K_j}; note that this relation is reversed in F-space. Right: the union of ⟦e⟧_{K_i} and ⟦e⟧_{K_j}; this operation is also reversed, i.e., a union of example sets corresponds to an intersection in F-space.

Intuitively, F-space $(P, d)$ measures the functional differences between programs. Each program can be represented by a vector in F-space, and if two different programs are represented by the same vector $v = (y_1, y_2, \dots, y_{|I|})$, they produce the same results for all inputs and are considered functionally equivalent, which is consistent with Definition 2.1. In practice, a space with dimension $|I|$ is too large to compute, so we exploit the sparsity of F-space and learn an approximate space of reduced dimension with neural networks.

Regarding input-output examples, representing them by single vectors is not appropriate. Consider a set of input-output examples ⟦e⟧_K = {(x_k, y_k)}_K: we cannot find a single vector that represents it when K < |I|, because more than one program satisfies these K input-output examples. Formally, we summarize the properties of the representation of input-output examples in F-space as follows:
- Each set of input-output examples ⟦e⟧_K = {(x_k, y_k)}_K should be represented by a set of F-space vectors ⟦r⟧ = {v_n}_N.
- For two sets of input-output examples ⟦e⟧_{K_i} and ⟦e⟧_{K_j}, if ⟦e⟧_{K_i} ⊆ ⟦e⟧_{K_j}, then their F-space representations satisfy ⟦r⟧_i ⊇ ⟦r⟧_j.
- For two sets of input-output examples ⟦e⟧_{K_i} and ⟦e⟧_{K_j}, if ⟦e⟧ = ⟦e⟧_{K_i} ∪ ⟦e⟧_{K_j}, then their F-space representations satisfy ⟦r⟧ = ⟦r⟧_i ∩ ⟦r⟧_j. Similarly, if ⟦e⟧ = ⟦e⟧_{K_i} ∩ ⟦e⟧_{K_j}, then ⟦r⟧ = ⟦r⟧_i ∪ ⟦r⟧_j.
- Additionally, this representation should be differentiable so that it can be optimized by neural networks.

To this end, we model the representation of ⟦e⟧_K as a Normal distribution (Ren & Leskovec, 2020; Sun & Nielsen, 2019), where the probability density indicates how likely a program is to be the underlying program to be synthesized given the input-output examples ⟦e⟧_K. We illustrate the projection, the subset relationship, and the union/intersection operations on distributions in Figure 2. Under this representation, the query process becomes a process of reducing the uncertainty of the distribution. To train the query network, we define several neural operators as follows.

Program projection. To project a program p to a vector v in F-space, we use a standard sequence model as our encoder:
$$v = \mathrm{Encoder}_p(p). \qquad (4)$$
Note that program synthesis models differ largely across datasets, so we vary our program encoder according to the specific program synthesis model of each dataset. In practice, our program encoder is the reverse of the program synthesis decoder.

Input-output example projection. Given a single input-output example ⟦e⟧ = {(x, y)}, we project it into a Normal distribution using the same architecture as the corresponding program synthesis model, except that we add an MLP that outputs the two parameters of the Normal distribution, µ and log(σ²):
$$[\mu, \log(\sigma^2)] = \mathrm{MLP}_e(\mathrm{Encoder}_e(⟦e⟧)), \qquad (5)$$
where MLP denotes a multi-layer perceptron.

Input-output examples intersection. Given K input-output examples ⟦e⟧_K = {(x_k, y_k)}_K, each example {(x_k, y_k)} is represented by a Normal distribution Pr_k = N(µ_k, σ_k) using the projector above. The purpose of the intersection is to aggregate these Normal distributions into a new one, which represents ⟦e⟧_K in F-space. Under the assumption of independence, the density of the intersection distribution should be
$$\Pr{}^* = \prod_{k=1}^{K} \Pr{}_k. \qquad (6)$$
Fortunately, the product of independent Normal densities is still a Normal density (up to a scale factor, see Appendix C.1), which means that we can represent it as $[\mu^*, \log(\sigma^{*2})]$. In practice, we use an attention mechanism with another MLP to let the neural network learn the new Normal distribution (Ren & Leskovec, 2020):
$$[\mu^*, \log(\sigma^{*2})] = \sum_{i=1}^{K} w_i [\mu_i, \log(\sigma_i^2)], \qquad (7)$$
$$w_i = \frac{\exp(\mathrm{MLP}_{attention}([\mu_i, \log(\sigma_i^2)]))}{\sum_{j=1}^{K}\exp(\mathrm{MLP}_{attention}([\mu_j, \log(\sigma_j^2)]))}. \qquad (8)$$
This formulation not only keeps the family of distributions closed but also approximates the mode of the effective support of the Normal distributions, which satisfies our requirement on the intersection of distributions (Sun & Nielsen, 2019). A detailed justification is given in Appendix C.1.

Inverse projection to query. Finally, we need to generate a new query from the F-space representation of the input-output examples. Using the projection and intersection operations above, we project the K input-output examples into the representation $[\mu^*, \log(\sigma^{*2})]$. To generate a query $x_{new}$ from this representation, we introduce a decoder whose architecture is similar to $\mathrm{Encoder}_e$ but reversed:
$$x_{new} = \mathrm{Decoder}([\mu^*, \log(\sigma^{*2})]). \qquad (9)$$
With these operators, we can model the relevance between input-output examples and programs as the probability of the program under the F-space distribution, and rewrite the loss in Equation 2 as
$$\mathcal{L}_{NCE} = -\mathbb{E}\left[\log\frac{\exp(\mathcal{N}(p_n; \mu^*, \sigma^*))}{\sum_{i=1}^{N}\exp(\mathcal{N}(p_i; \mu^*, \sigma^*))}\right]. \qquad (10)$$
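To make the intersection operator of Equations 7-8 concrete, the sketch below shows an attention-weighted aggregation of per-example Gaussian parameters, mirroring how K example distributions are folded into one. The module structure and layer sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GaussianIntersection(nn.Module):
    """Aggregate K per-example Normal distributions into one (Eqs. 7-8)."""
    def __init__(self, dim: int):
        super().__init__()
        # scores one [mu, log_var] pair; sizes are illustrative
        self.attn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1))

    def forward(self, mu: torch.Tensor, log_var: torch.Tensor):
        # mu, log_var: (K, dim) parameters of the K example distributions
        params = torch.cat([mu, log_var], dim=-1)          # (K, 2*dim)
        weights = torch.softmax(self.attn(params), dim=0)  # (K, 1), Eq. 8
        mu_star = (weights * mu).sum(dim=0)                # Eq. 7
        log_var_star = (weights * log_var).sum(dim=0)
        return mu_star, log_var_star

# Usage: intersect = GaussianIntersection(dim=256)
#        mu_star, log_var_star = intersect(mu_k, log_var_k)
```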
Next, we introduce how to train the neural network.

3.2 TRAINING

As mentioned above, the query network is optimized by maximizing the mutual information. We observe that taking previous query examples as the condition and only optimizing the query strategy at the current query step is a greedy optimization, which falls into local minima more easily; an example is shown in Appendix C.2. Thus, a recurrent training process is adopted to capture the global mutual information at every step instead of the mutual information conditioned on the previous examples. See Figure 3 and Algorithm 1.

Figure 3: The recurrent training process.

Additionally, as in sequence generation, we need an input-output example to act as the start signal <sos>. However, datasets differ largely in program synthesis, which makes the design of a universal start signal difficult. Thus, we design a specific start signal for each dataset. For example, in Karel, the signal is an empty map with the robot at the center of the map; in list processing, the signal is three lists full of NULL. More training details, such as the network architectures, the processing of datasets, and training tricks, can be found in Appendix A.
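As an illustration of the dataset-specific start signals described above, the snippet below builds a Karel-style <sos> example: an empty grid with the robot placed at the center. The 16-channel cell encoding follows Table 4 in Appendix B.1, but the exact channel layout used here (the first four channels marking the facing direction) is an assumption made for illustration only.

```python
import numpy as np

def karel_start_signal(size: int = 16, facing: int = 0) -> np.ndarray:
    """Build the <sos> query for Karel: an empty map with the robot centered.

    The state is a (16, size, size) grid of per-cell features; channels 0-3
    are assumed to encode the robot's facing direction, the remaining
    channels markers/obstacles, all left empty here.
    """
    grid = np.zeros((16, size, size), dtype=np.float32)
    center = size // 2
    grid[facing, center, center] = 1.0  # robot at the center of the map
    return grid

# In list processing, the start signal is simply three inputs full of NULL,
# e.g. start = [None, None, None].
```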
4 EXPERIMENTS

We study three questions in our experiments: (1) Which metric is reasonable in our query-based framework? (2) How well does the query-based framework perform? (3) How well does the query-based framework generalize across tasks? To answer them, we first compare several program synthesis metrics, and then present our query results on the Karel dataset and the list processing dataset to show the performance and generalization ability of our method.

4.1 METRICS

There are three common metrics in neural program synthesis.
- Semantics: given 5 input-output examples, the predicted program is semantically correct if it satisfies all of them.
- Generalization: given 5 input-output examples and one held-out input-output example, the predicted program is generally correct if it satisfies all 6 examples.
- Exact match: the predicted program matches exactly if it is identical to the ground-truth program.

Among them, exact match is the strictest metric and generalization the second strictest. In practice, exact match has the drawback that the predicted program may differ from the ground-truth program while being functionally equivalent to it. Thus, performance is traditionally measured by generalization on Karel and by semantics on list processing. However, generalization is not appropriate here for two reasons: (1) It can only measure the performance of the program synthesis process, not the query process. In the query process, the network chooses input-output examples by itself, so judging them against a fixed held-out example is meaningless. (2) Even for the program synthesis process without query, generalization is not strict enough: a program that satisfies a small set of input-output examples may fail on a larger set. Thus, the best choice is to use functional equivalence as our metric. Unfortunately, judging functional equivalence is a notoriously difficult problem, which makes it hard to apply in evaluation. To alleviate this problem, we randomly generate 95 held-out input-output examples to achieve a higher branch coverage than a single held-out example, and use generalization on these examples as a proxy of functional equivalence:
- Functional equivalence (proxy): given 5 input-output examples and 95 held-out input-output examples, the predicted program is considered functionally equivalent to the ground-truth program if it satisfies all 100 examples.

To further illustrate the difference between functional equivalence and generalization, we measured the average branch coverage of these metrics on Karel's validation set. Given a set of input-output examples, branch coverage evaluates the percentage of program branches that are covered by the examples.

Table 1: The branch coverage of semantics, generalization, and functional equivalence (FE).

  Metric            Semantics   Generalization   FE
  Branch coverage   86.57%      87.99%           97.58%

The result is shown in Table 1. Functional equivalence outperforms generalization by nearly 10%, which indicates that functional equivalence reflects the correctness of the predicted program much better than generalization does.
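The functional-equivalence proxy above is simply generalization evaluated on a larger held-out set. A minimal sketch of that evaluation is shown below; run_program and the example-sampling routine are hypothetical stand-ins for the task-specific program simulator.

```python
from typing import Callable, List, Tuple

def functional_equivalence_proxy(predicted_prog,
                                 run_program: Callable,
                                 spec_examples: List[Tuple],
                                 held_out_examples: List[Tuple]) -> bool:
    """Proxy metric: the prediction must satisfy the 5 specification
    examples plus 95 randomly generated held-out examples (100 in total)."""
    for x, y in spec_examples + held_out_examples:
        if run_program(predicted_prog, x) != y:
            return False
    return True

# Compared with a single held-out example, the 95 extra examples raise the
# average branch coverage of the check (Table 1: 87.99% -> 97.58%).
```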
4.2 KAREL TASK

Karel is an educational programming language used to control a robot living in a 2D grid world (Pattis et al., 1981). The domain-specific language (DSL) and other details are included in Appendix B.1.

Settings. Following Section 3, we split the training process into the training of the query network and the training of the synthesis network. For the query network, we set Encoder_e the same as that of Bunel et al. (2018) except for the output layer, and Encoder_p to a two-layer LSTM (see details in Appendix A.2). For the synthesis network, all settings stay unchanged from the original program synthesis methods. We report top-1 results with beam size 100 for Bunel et al. (2018) and 64 for Chen et al. (2019), which is the default validation setting in the original code. As discussed in Section 4.1, exact match and functional equivalence are more appropriate than the other metrics for the query task, so we save checkpoints based on the best exact match instead of the best generalization (functional equivalence being too expensive to compute during training), which may cause differences in our baseline reproduction. We generate datasets with 5 input-output examples using three different methods: random selection (random), the baseline models (well-designed), and our method (query). Then, we train the synthesis network on these datasets with two state-of-the-art methods: Bunel et al. (2018) and Chen et al. (2019). Note that there is a slight difference between the program simulators used by the two methods, which give different responses during the query process; we adopt the one from Bunel et al. (2018).

Table 2: The performance of the program synthesis model trained on input-output examples generated by different methods on the Karel task.

                           Bunel et al. (2018)                 Chen et al. (2019)
  Metric                   Random    Well-designed   Query     Random    Well-designed   Query
  Exact match              29.44%    41.16%          41.12%    26.08%    37.36%          38.68%
  Functional equivalence   33.08%    48.52%          46.64%    32.28%    47.60%          48.48%

Dataset performance. Table 2 presents the performance of the trained synthesis networks, from which we conclude that (1) the queried dataset performs well with both training methods, indicating that the query process is fully decoupled from the synthesis process and has high generality; (2) our query method achieves results comparable to both baseline methods and outperforms the random selection method by more than 10%, which shows its effectiveness.

Figure 4: The query performance of different methods (exact match and functional equivalence vs. number of queries) for Query, Random, Well-designed, QBC-crash-aware, and QBC-crash-unaware.

Comparison on query. We also compare our method with query by committee (QBC) (Seung et al., 1992) as another baseline, shown in Figure 4. In QBC, we sample queries based on their diversity: we generate program candidates by beam search and then select the query that results in the most diverse outputs on these program candidates, where diversity is measured by output equivalence. Algorithm details can be found in Appendix B.4. There are two strategies based on QBC: crash-aware, which repeats the sampling if the query crashes, and crash-unaware, which samples queries regardless of crashes (see Appendix B.1 for a detailed explanation of crashes). QBC-crash-unaware performs worse than Random because Random is chosen by filtering out crashed IOs, while QBC-crash-unaware may contain crashes. QBC-crash-aware performs much better than QBC-crash-unaware because it queries the underlying program multiple times to make sure that the query will not result in a crash, which is unfair. Even so, our method still outperforms QBC-crash-aware, which shows its advantage.
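The QBC baseline scores each candidate query by how many distinct outputs it induces across the beam-searched program candidates. The following sketch captures that scoring rule (the crash handling of Appendix B.4 is omitted); candidate sampling and the run_program helper are assumed interfaces, not the paper's code.

```python
from typing import Callable, List

def qbc_select_query(candidate_queries: List,
                     program_candidates: List,
                     run_program: Callable):
    """Pick the query whose outputs are most diverse across the committee
    of beam-searched program candidates (query-by-committee baseline)."""
    best_query, best_score = None, -1
    for q in candidate_queries:
        outputs = [run_program(p, q) for p in program_candidates]
        score = len(set(map(repr, outputs)))   # number of distinct outputs
        if score > best_score:
            best_query, best_score = q, score
    return best_query
```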
Training process. For further study, we plot the training curves in Figure 5. The Query dataset always has a quick start, which indicates that the query method extracts the features of distinguishable programs effectively and makes them easier to synthesize.

Figure 5: The training curves (exact match and functional equivalence vs. number of epochs) of the program synthesis model on Karel for the Well-designed, Query, and Random datasets.

4.3 LIST PROCESSING TASK

To show the generalization ability of the query method, we conduct another experiment on the list processing task. The list processing task takes 1-3 lists or integers as the input example and produces a list or an integer as the output example. More details can be found in Appendix B.2.

Settings. Following PCCoder (Zohar & Wolf, 2018), we generate two datasets: dataset D1 with program length 4 and dataset D2 with program length up to 12. We set the Encoder_e of the query network similar to PCCoder except for an added MLP, and use a single-layer LSTM as the Encoder_p (see Appendix A.2). The synthesis network and the parameters of complete anytime beam search (CAB) (Zhang, 1998) stay the same as in PCCoder, except that the maximum search time is set to 5 seconds instead of 5,000 seconds for practicality. Similar to the Karel task, we generate the dataset with 5 input-output examples using the three different methods. Furthermore, we use two end conditions for CAB: the search ends when all input-output examples are satisfied (searching for semantics), or the search ends when all statements are true (searching for exact match).

Table 3: The performance of the program synthesis model trained on input-output examples generated by different methods on the list processing task.

                                     Searching for semantics            Searching for exact match
  Dataset  Metric                    Random   Well-designed   Query     Random   Well-designed   Query
  D1       Exact match               20.07%   32.81%          22.26%    50.65%   81.21%          81.56%
           Functional equivalence    52.03%   80.26%          68.20%    50.65%   81.21%          81.56%
  D2       Exact match               9.25%    15.72%          10.44%    22.61%   38.51%          38.63%
           Functional equivalence    24.49%   39.88%          32.94%    22.61%   38.51%          38.63%

Performance. The results are presented in Table 3, from which we conclude that: (1) Our query method scores higher than well-designed input-output examples when searching for the exact match and consistently outperforms random selection. (2) When the length of programs increases, the performance decreases considerably. This stems from the original PCCoder algorithm, in which intermediate variables are difficult to retain when the ground-truth program is long. However, the query method degrades more slowly than the others, and the gap between the well-designed and the query datasets closes considerably.

5 RELATED WORK

Programming by examples. Synthesizing a program that satisfies the user intent using provided input-output examples is a challenging problem that has been studied for years (Manna & Waldinger, 1971; Lieberman, 2001; Solar-Lezama et al., 2006; Gulwani, 2011; Gulwani et al., 2012). Recently, with the development of deep learning, more researchers tackle this problem with neural networks in a variety of tasks, including string transformation (Devlin et al., 2017b), list processing (Balog et al., 2017; Zohar & Wolf, 2018), graphic generation (Ellis et al., 2018; Tian et al., 2019), Karel (Devlin et al., 2017a; Bunel et al., 2018), policy abstraction (Sun et al., 2018; Verma et al., 2018), and so on. Additionally, techniques like program debuggers (Balog et al., 2020; Gupta et al., 2020), traces (Chen et al., 2019; Shin et al., 2018; Ellis et al., 2019), and property signatures (Odena & Sutton, 2020; Odena et al., 2021) are also used to improve the performance of program synthesis. However, their promising performance relies on well-designed input-output examples, which is a high requirement for users. If the examples provided are of poor quality, the performance is affected severely. Worse, if out-of-distribution examples are provided, the program synthesis model cannot finish the task as expected (Shin et al., 2019).

Interactive program synthesis. Considering the high requirement on input-output examples, Le et al. (2017) build an abstract framework in which the program synthesis system can interact with users to guide the process of synthesis. Following this framework, multiple interactive synthesis systems have been studied. Among them, Mayer et al. (2015), Wang et al. (2017), and Laich et al. (2020) find counter-examples as queries to the users in a random manner. Although they remove the restrictions on the users, the quality of the input-output examples is not guaranteed, which may get the synthesis process into trouble. Padhi et al. (2018) select more than one query each time and let the users choose which one to answer. This places an additional burden on the users, and the users are unaware of which query will improve the synthesized program most. Most recently, Ji et al. (2020) utilize a minimax branch to select the question whose worst answer gives the best reduction of the program domain.
Theoretically, the performance of the selection is guaranteed. However, this method is based on search and can hardly be applied to tasks with large input-output spaces. Worse, the query process and the synthesis process are bound together, which results in poor scalability. In contrast, our method can be applied to problems with large spaces, and the query process is decoupled from the synthesis process, which makes its application more flexible.

Learning to acquire information. Similar to our work, Pu et al. (2018) and Pu et al. (2017) also study the query problem from an information-theoretic perspective. They show that greedily maximizing the mutual information between the input-output examples and the corresponding program is $1 - \frac{1}{e}$ as good as the optimal solution that considers all examples globally. However, they assume that the space of queries can be enumerated, which limits the application of their query algorithm to complex datasets like Karel. By comparison, our work proposes a more general algorithm that can generate queries in a nearly infinite space. Other related work, including active learning and black-box testing, is discussed in Appendix D.

6 CONCLUSION

In this work, we propose a query-based framework to perform the program synthesis task more realistically. To optimize this framework, we show the correlation between the query strategy and the mutual information. Moreover, we model the relevance between the input-output examples and programs by introducing the F-space, where we represent the input-output examples as distributions over programs. Using these techniques, we conduct a series of experiments that show the effectiveness, generalization, and scalability of our query-based framework. We believe that our method works not only on program synthesis tasks, but also on any task that aims to simulate an underlying oracle, including reverse engineering, symbolic regression, scientific discovery, and so on.

ACKNOWLEDGMENTS

This work is partially supported by the National Key Research and Development Program of China (under Grant 2020AAA0103802), the NSF of China (under Grants 61925208, 62102399, 62002338, 61906179, 61732020, U19B2019), Strategic Priority Research Program of Chinese Academy of Science (XDB32050200), Beijing Academy of Artificial Intelligence (BAAI), Beijing Nova Program of Science and Technology (Z191100001119093), CAS Project for Young Scientists in Basic Research (YSBR-029), Youth Innovation Promotion Association CAS and Xplore Prize.

REFERENCES

D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988. Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, S. Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. In ICLR (Poster). OpenReview.net, 2017. Matej Balog, Rishabh Singh, Petros Maniatis, and Charles Sutton. Neural program synthesis with a differentiable fixer. arXiv, abs/2006.10924, 2020. Rudy Bunel, M. Hausknecht, J. Devlin, Rishabh Singh, and P. Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR (Poster). OpenReview.net, 2018. Xi Chen, Yan Duan, Rein Houthooft, J. Schulman, Ilya Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016. Xinyun Chen, Chang Liu, and D. Song. Execution-guided neural program synthesis. In ICLR, 2019. Ido Dagan and S. Argamon. Committee-based sampling for training probabilistic classifiers. In ICML, 1995. J.
Devlin, Rudy Bunel, Rishabh Singh, M. Hausknecht, and P. Kohli. Neural program metainduction. In NIPS, 2017a. J. Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel rahman Mohamed, and P. Kohli. Robustfill: Neural program learning under noisy i/o. In ICML, 2017b. Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and J. Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Neur IPS, 2018. Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, J. Tenenbaum, and Armando Solar-Lezama. Write, execute, assess: Program synthesis with a repl. In Neur IPS, 2019. S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL 11, 2011. Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. Commun. ACM, 55:97 105, 2012. Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Found. Trends Program. Lang., 4:1 119, 2017. Kavi Gupta, P. E. Christensen, Xinyun Chen, and D. Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. In Neur IPS, 2020. Eric Jang, Shixiang Shane Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR (Poster). Open Review.net, 2017. Ruyi Ji, Jingjing Liang, Yingfei Xiong, Lu Zhang, and Zhenjiang Hu. Question selection for interactive program synthesis. Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, 2020. R. King, Ken E. Whelan, F. M. Jones, Philip G. K. Reiser, Christopher H. Bryant, S. Muggleton, D. Kell, and S. Oliver. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature, 427:247 252, 2004. Published as a conference paper at ICLR 2022 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. V. Krishnamurthy. Algorithms for optimal scheduling and management of hidden markov model sensors. IEEE Trans. Signal Process., 50:1382 1397, 2002. Larissa Laich, Pavol Bielik, and Martin T. Vechev. Guiding program synthesis by learning to generate examples. In ICLR, 2020. Vu Le, Daniel Perelman, Oleksandr Polozov, Mohammad Raza, A. Udupa, and S. Gulwani. Interactive program synthesis. Ar Xiv, abs/1703.03539, 2017. D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In SIGIR 94, 1994. H. Lieberman. Your wish is my command: Programming by example. 2001. Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR (Poster). Open Review.net, 2017. Z. Manna and R. Waldinger. Toward automatic program synthesis. Commun. ACM, 14:151 165, 1971. Mika el Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, B. Zorn, and S. Gulwani. User interaction models for disambiguation in programming by example. Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, 2015. Karl Meinke. Automated black-box testing of functional correctness using function approximation. In ISSTA, pp. 143 153. ACM, 2004. Karl Meinke and Fei Niu. A learning-based approach to unit testing of numerical software. In ICTSS, volume 6435 of Lecture Notes in Computer Science, pp. 221 235. Springer, 2010. Karl Meinke and Muddassar A. Sindhu. Incremental learning-based testing for reactive systems. In TAP@TOOLS, volume 6706 of Lecture Notes in Computer Science, pp. 134 151. Springer, 2011. Augustus Odena and Charles Sutton. 
Learning to represent programs with property signatures. In ICLR. Open Review.net, 2020. Augustus Odena, Kensen Shi, David Bieber, Rishabh Singh, and Charles Sutton. Bustle: Bottom-up program-synthesis through learning-guided exploration. In ICLR. Open Review.net, 2021. Saswat Padhi, Prateek Jain, Daniel Perelman, Oleksandr Polozov, S. Gulwani, and T. Millstein. Flashprofile: a framework for synthesizing data profiles. Proceedings of the ACM on Programming Languages, 2:1 28, 2018. Richard Pattis, J Roberts, and M Stehlik. Karel the robot. A gentele introduction to the Art of Programming, 1981. Yewen Pu, Leslie Pack Kaelbling, and Armando Solar-Lezama. Learning to acquire information. In Gal Elidan, Kristian Kersting, and Alexander T. Ihler (eds.), Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017. AUAI Press, 2017. Yewen Pu, Zachery Miranda, Armando Solar-Lezama, and L. Kaelbling. Selecting representative examples for program synthesis. In ICML, 2018. Hongyu Ren and J. Leskovec. Beta embeddings for multi-hop logical reasoning in knowledge graphs. In Neur IPS, 2020. H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In COLT, pp. 287 294. ACM, 1992. Richard Shin, Illia Polosukhin, and D. Song. Improving neural program synthesis with inferred execution traces. In Neur IPS, 2018. Published as a conference paper at ICLR 2022 Richard Shin, Neel Kant, Kavi Gupta, Christopher M. Bender, Brandon Trabucco, Rishabh Singh, and D. Song. Synthetic datasets for neural program synthesis. In ICLR (Poster). Open Review.net, 2019. Changjian Shui, Fan Zhou, Christian Gagn e, and B. Wang. Deep active learning: Unified and principled method for query and training. In AISTATS, 2020. Armando Solar-Lezama, Liviu Tancau, R. Bod ık, S. Seshia, and V. Saraswat. Combinatorial sketching for finite programs. In ASPLOS XII, 2006. Ke Sun and F. Nielsen. Information-geometric set embeddings (igse): From sets to probability distributions. Ar Xiv, abs/1911.12463, 2019. Shao-Hua Sun, Hyeonwoo Noh, S. Somasundaram, and Joseph J. Lim. Neural program synthesis from diverse demonstration videos. In ICML, 2018. Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, W. Freeman, J. Tenenbaum, and Jiajun Wu. Learning to infer and execute 3d shape programs. In ICLR (Poster). Open Review.net, 2019. A aron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Ar Xiv, abs/1807.03748, 2018. Abhinav Verma, V. Murali, Rishabh Singh, P. Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 5052 5061. PMLR, 2018. Chenglong Wang, Alvin Cheung, and R. Bod ık. Interactive query synthesis from input-output examples. Proceedings of the 2017 ACM International Conference on Management of Data, 2017. Weixiong Zhang. Complete anytime beam search. In AAAI/IAAI, 1998. Amit Zohar and Lior Wolf. Automatic program synthesis of long programs with a learned garbage collector. In Neur IPS, 2018. Published as a conference paper at ICLR 2022 A TRAINING DETAILS A.1 TRAINING ALGORITHM Algorithm 1 Training process 1: function TRAIN( ) 2: Initialize max iterations N and max query times T 3: for i {1 . . . N} do 4: p Underlying programs 5: (x0, y0) < sos > 6: Je K {(x0, y0)} 7: L 0 8: for t {1 . . . 
T} do
9:     v ← Encoder_p(p)                         ▷ Encode program, Equation (4)
10:    µ, log(σ²) ← IO-ENCODER(⟦e⟧)             ▷ Encode input-output examples
11:    x_t ← Decoder(µ, log(σ²))                ▷ Query the next input
12:    y_t ← p(x_t)                             ▷ Get the next output from the oracle
13:    ⟦e⟧ ← ⟦e⟧ ∪ {(x_t, y_t)}
14:    µ*, log(σ*²) ← IO-ENCODER(⟦e⟧)
15:    L ← Loss(v, µ*, log(σ*²)) + L            ▷ InfoNCE loss, Equation (2)
16:   end for
17:   Update parameters w.r.t. L
18: end for
19: end function

1: function IO-ENCODER({x_k, y_k}_K)
2:   for i ∈ {1 … K} do
3:     [µ_i, log(σ²_i)] ← MLP_e(Encoder_e({x_i, y_i}))   ▷ Encode a single example, Equation (5)
4:   end for
5:   for i ∈ {1 … K} do
6:     w_i ← exp(MLP_attention([µ_i, log(σ²_i)])) / Σ_{j=1..K} exp(MLP_attention([µ_j, log(σ²_j)]))   ▷ Calculate attention weights, Equation (8)
7:   end for
8:   [µ, log(σ²)] ← Σ_{i=1..K} w_i [µ_i, log(σ²_i)]       ▷ Intersection, Equation (7)
9:   return µ, log(σ²)
10: end function

A.2 MODEL DETAILS AND HYPERPARAMETERS

Karel. The query encoder is the same as the one used by Bunel et al. (2018), a convolutional neural network (CNN) with residual links. After the query encoder, an MLP projects the embedding into µ and log(σ²). The dimension of µ and σ is set to 256, and thus the hidden size of the MLP is 512. The query decoder is similar to the query encoder except that the number of channels is reversed, and additional batch normalization is added to keep the generation more stable. The program encoder is a two-layer LSTM with a hidden size of 256. Note that in the well-designed dataset, the size of the grid world can vary from 2 to 16. However, for the convenience of training, we fix the size of the query world to 16. Moreover, to guarantee that all queries can be recognized by the program simulator, we split the query into three parts, boundaries (the map size, set to 16×16), agent position (where the agent is and which way it faces), and map state (the placement of markers and obstacles), and generate them as follows:
- Boundaries: the boundaries indicate the map size, fixed to 16×16.
- Agent position: generate a 4×16×16 one-hot vector, where 4 indicates the four facing directions and 16×16 indicates the position.
- Map state: generate a 12-way one-hot vector for each cell of the 16×16 map, covering obstacle, 1-10 markers, and empty grid.
The architecture details are shown in Figure 6.

Figure 6: The architecture of the query network for Karel (zoom in for a better view): (a) ResBlock, (b) ResBNBlock, (c) IO encoder, (d) distribution encoder (MLP_e), (e) encoder, (f) decoder, (g) the full query network for Karel.

Figure 7: The architecture of the query network for list processing (zoom in for a better view): (a) DenseBlock, (b) IO encoder, (c) distribution encoder (MLP_e), (d) encoder, (e) the full query network for list processing. The type filter chooses the query between the list proposition and the int proposition depending on the input types.
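The recurrent training loop of Algorithm 1 can be summarized as the following PyTorch-style sketch: at every query step the current example set is re-encoded and the InfoNCE loss is accumulated, so gradients flow through all query steps rather than only the last one. The module names mirror the operators of Section 3.1 but are otherwise illustrative assumptions; in particular, info_nce_loss_fspace stands for the objective of Equation 10 and is assumed to be defined elsewhere.

```python
import torch

def train_step(program_batch, oracles, program_encoder, io_encoder,
               decoder, start_example, optimizer, num_queries=5):
    """One iteration of the recurrent training process (Algorithm 1, sketch).

    `oracles[i]` executes the i-th underlying program on a query; the
    encoders/decoder correspond to Equations (4), (5)+(7)+(8) and (9).
    """
    histories = [[start_example] for _ in oracles]
    v = program_encoder(program_batch)            # program embeddings, Eq. (4)
    total_loss = 0.0
    for _ in range(num_queries):
        mu, log_var = io_encoder(histories)       # intersected distribution
        queries = decoder(mu, log_var)            # next inputs, Eq. (9)
        for hist, oracle, x in zip(histories, oracles, queries):
            hist.append((x, oracle(x)))           # responses from the oracles
        mu, log_var = io_encoder(histories)       # re-encode with new examples
        total_loss = total_loss + info_nce_loss_fspace(v, mu, log_var)  # Eq. (10)
    optimizer.zero_grad()
    total_loss.backward()                         # gradients span all query steps
    optimizer.step()
    return total_loss.item()
```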
For training, the learning rate of the query network is set to 10^-4 with the Adam optimizer (Kingma & Ba, 2014), while the learning rate of the synthesis network stays the same as in the original methods. The batch size is 128, and the random seed is set to 100.

List processing. The query encoder is the same as the one in PCCoder. The dimension of µ and σ is set to 256 for dataset D1 and 128 for dataset D2, without much tuning. The query decoder is a single-layer linear network with dimension 256 for D1 and 128 for D2. The program encoder is a single-layer LSTM with hidden size 256 for D1 and 128 for D2. The inputs of list processing consist of three types: INT, LIST, and NULL. Thus, each query is NULL, an integer, or a list of integers in the range [-256, 255]. Each integer is represented by a 512-way one-hot vector. For the well-designed input-output examples, the length of a LIST is sampled stochastically with a maximum length of 20. However, to make the query network simpler, we fix the length of queries to 20. The query network generates all three types separately with different sub-networks and chooses among them according to the types of the inputs using a type filter. The details of the query network for list processing are shown in Figure 7. For training, the learning rate of the query network is set to 10^-4 with a 0.1 decay every 40 epochs and the Adam optimizer (Kingma & Ba, 2014). The learning rate of the synthesis network is 10^-3, the same as in the original methods, with a 0.1 decay every 4 epochs. The batch size of the query process is 64, and the batch size of synthesis is 32 for D1 and 100 for D2. The random seed is set to 100.

A.3 OTHER TRAINING TECHNIQUES

Latent code. Subsequent queries easily fall into mode collapse, generating similar queries repetitively. To tackle this problem, we introduce a latent code, like the one in InfoGAN (Chen et al., 2016), as another input. The latent code indicates the current query step, and the encoder at the next query step is asked to decode it accurately. Specifically, given a latent code c indicating the current query step, the query network is supposed to maximize the mutual information between c and the output query x to alleviate the mode collapse problem. To generate x under the condition of c, we concatenate c with the encoded representation µ and log(σ²) and send it to the decoder (Equation (9)), as shown in Figure 6(g) and Figure 7(e). To maximize the mutual information I(c; x), we design a network that classifies c from x, modeling the distribution Q(c|x) (see Chen et al. (2016) for details). This network shares its parameters with Encoder_e (Equation (5)) except for the last layer, and it is optimized with the cross-entropy loss.

Gumbel-Softmax. Note that the queries are discrete, which prevents the query network from being optimized by back-propagation. Thus, we take advantage of the Gumbel-Softmax distribution (Maddison et al., 2017; Jang et al., 2017) to make the query process differentiable.
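Since Karel grids and list elements are one-hot categorical outputs, the decoder can use the straight-through Gumbel-Softmax estimator to keep query generation differentiable. Below is a minimal sketch of that idea using PyTorch's built-in estimator; the logits shape is a made-up example rather than the paper's exact decoder output.

```python
import torch
import torch.nn.functional as F

def sample_discrete_query(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw a differentiable one-hot sample from categorical logits.

    With hard=True the forward pass yields a discrete one-hot query, while
    the backward pass uses the continuous Gumbel-Softmax relaxation, so the
    query network can still be trained by back-propagation.
    """
    return F.gumbel_softmax(logits, tau=tau, hard=True)

# Example: logits for the agent-position part of a Karel query,
# 4 facing directions over a 16x16 grid flattened to 1024 categories.
logits = torch.randn(1, 4 * 16 * 16)
one_hot_query = sample_discrete_query(logits)   # shape (1, 1024), one-hot
```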
Curriculum learning. The query network is trained progressively. At the beginning of training, the query network generates only one query. As training goes on, the limit on the number of queries increases until it reaches five; specifically, this limit increases every two epochs. Curriculum learning helps make the training process more stable.

Kullback-Leibler (KL) divergence: a failed attempt. Similar to the latent code, we tried adding the reciprocal of the KL divergence between different queries on the same program as another loss to make the queries more diverse. However, this loss does not have a significant impact on training most of the time and makes the training process unstable, so it was abandoned in our final version.

B EXPERIMENT DETAILS

B.1 THE KAREL TASK

Karel is an educational programming language, which can be used to control a robot moving in a 2D grid world (Pattis et al., 1981). Figure 8 presents the domain-specific language (DSL) of Karel. Figure 9 shows a Karel program example. Each cell of the grid world is represented as a 16-dimensional vector corresponding to the features described in Table 4 (Bunel et al., 2018).

Figure 8: The DSL of Karel.
  Prog p   := def run(): s
  Stmt s   := while(b): s | repeat(r): s | s1; s2 | a | if(b): s | ifelse(b): s1 else: s2
  Cond b   := frontIsClear() | leftIsClear() | rightIsClear() | markersPresent() | noMarkersPresent() | not b
  Action a := move() | turnRight() | turnLeft() | pickMarker() | putMarker()
  Cste r   := 0 | 1 | ... | 19

Table 4: Representation of a cell in the grid world. Each cell is a 16-dimensional vector whose features are: Hero facing North, Hero facing South, Hero facing West, Hero facing East, Obstacle, Grid boundary, and 1-10 markers.

Figure 9: An example of Karel: the program move(); putMarker(); turnRight(); move() applied to an input grid to produce the output grid.

Handling of crashes. The executor of Karel will return a CRASH result and terminate if the agent:
- picks a marker while no marker is present;
- puts a marker down while the number of markers in the cell exceeds the limit (the limit is set to 10);
- walks into an obstacle or out of the boundaries;
- falls into an infinite loop (a loop with more than 10^5 API calls).
In the early stage of training, the query network generates queries randomly, so programs easily run into these crashed states, making the queries unusable for training. To avoid these situations, we modify the executor so that when a crash happens, the state stays unchanged and the program keeps executing.

Figure 10: The DSL of list processing.
  Basic Function        := (+1) | (-1) | (*2) | (/2) | (*(-1)) | (**2) | (*3) | (/3) | (*4) | (/4) | (>0) | (<0) | (%2==0) | (%2==1)
  First-order Function  := HEAD | LAST | TAKE | DROP | ACCESS | MINIMUM | MAXIMUM | REVERSE | SORT | SUM
  Higher-order Function := MAP | FILTER | COUNT | ZIPWITH | SCANL1

Figure 11: An example of list processing: the program FILTER (<0) | MAP (*4) | SORT | REVERSE maps the input [-17, -3, 4, 11, 0, -5, -9, 13, 6, 6, -8, 11] to the output [-12, -20, -32, -36, -68].

B.2 THE LIST PROCESSING TASK

The list processing task takes 1-3 lists or integers as input examples and produces a list or an integer as the output example. An example is shown in Figure 11. Figure 10 shows the DSL of list processing. Following Zohar & Wolf (2018), we generate two datasets: D1 with program length 4 and D2 with program length up to 12.
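To make the DSL semantics concrete, the snippet below reproduces the Figure 11 example with plain Python list operations; it is an illustrative re-implementation of those four DSL operators, not code from PCCoder.

```python
def filter_lt0(xs):   # FILTER (<0)
    return [x for x in xs if x < 0]

def map_mul4(xs):     # MAP (*4)
    return [x * 4 for x in xs]

def sort_asc(xs):     # SORT
    return sorted(xs)

def reverse(xs):      # REVERSE
    return list(reversed(xs))

xs = [-17, -3, 4, 11, 0, -5, -9, 13, 6, 6, -8, 11]
out = reverse(sort_asc(map_mul4(filter_lt0(xs))))
print(out)  # [-12, -20, -32, -36, -68], matching Figure 11
```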
Handling of the out-of-range problem. Constraint propagation guarantees that the execution of programs never produces intermediate values outside the range [-256, 255]. However, when we query randomly in the early stage of training, the out-of-range problem can easily occur. To handle this problem, we truncate the intermediate values to [-256, 255] while querying to ensure that all queries yield legal responses.

B.3 THE EVOLUTION OF DISTRIBUTION ENTROPY

As a sanity check of the F-space formulation, we study how the entropy of the distribution changes over the query steps. The entropy of a multivariate Normal distribution $\mathcal{N}(x; \mu, \Sigma)$ is given by
$$H = \frac{1}{2}\ln|\Sigma| + \frac{D}{2}(1 + \ln(2\pi)), \qquad (11)$$
where $\Sigma$ denotes the covariance matrix and $D$ denotes the dimension of $x$. In our case, we have assumed the independence of each dimension of $x$, which means that $\Sigma$ is a diagonal matrix. Hence
$$\frac{1}{2}\ln|\Sigma| = \frac{1}{2}\ln\Big(\prod_i \sigma_i^2\Big) = \frac{1}{2}\sum_i \ln(\sigma_i^2), \qquad (12)$$
where $\sigma_i^2$ is the i-th diagonal element of $\Sigma$, i.e., the variance of each dimension. $D$ is constant during querying, so we report the mean of log(σ²) as an equivalent substitute to show how the entropy changes.

Figure 12: The entropy of the distribution decays as the query proceeds (query steps 1-5).

The results are shown in Figure 12. In our experiments, Karel performs best, and this is also reflected in the entropy, which decreases much faster for Karel than for list processing. Conversely, the entropy for list processing decreases slowly, matching its weaker performance in our experiments. This performance may be improved by tuning the query network more carefully.
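Since only the diagonal variances matter here, the entropy (up to the constant D/2·(1 + ln 2π)) can be tracked directly from the log-variances that the IO encoder already produces. A small sketch, assuming log_var is the encoder output for one query step:

```python
import math
import torch

def diag_gaussian_entropy(log_var: torch.Tensor) -> torch.Tensor:
    """Entropy of N(mu, diag(sigma^2)) from log-variances (Eqs. 11-12)."""
    d = log_var.numel()
    return 0.5 * log_var.sum() + 0.5 * d * (1.0 + math.log(2 * math.pi))

# The paper plots mean(log_var) per query step, which differs from the
# entropy only by a positive scale and an additive constant:
def entropy_proxy(log_var: torch.Tensor) -> torch.Tensor:
    return log_var.mean()
```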
Table 5: Statistics on the number of extra oracle queries at each crash-aware QBC query step.

  Query step   1      2      3      4      5
  avg          5.72   2.47   2.96   4.27   4.86
  max          214    200    224    357    226
  min          0      0      0      0      0

B.4 QUERY BY COMMITTEE

In this section, we present the details of our baseline algorithm, query by committee (Seung et al., 1992). Algorithm 2 shows the crash-aware version of the QBC algorithm, which selects the query that results in the most diverse outputs. The crash-unaware version can be obtained simply by removing all CRASH judgments. Note that, compared with the crash-unaware version, the crash-aware version queries the underlying program more times for the CRASH judgment (see Table 5), and thus results in much better performance.

Algorithm 2 Query by committee (QBC): crash-aware
1: function QUERY( )
2:   Initialize trained model M, query pool Q, and max query times T
3:   p ← Underlying program (oracle)
4:   repeat
5:     x_1 ← sample from Q
6:     y_1 ← p(x_1)
7:   until y_1 is not CRASH
8:   ⟦e⟧ ← {(x_1, y_1)}
9:   for t ∈ {2 … T} do
10:    program candidates ⟦p̂⟧ ← M(⟦e⟧)           ▷ Get top-K predictions by beam search
11:    x_t ← SELECT-QUERY(Q, p, ⟦p̂⟧)
12:    y_t ← p(x_t)
13:    ⟦e⟧ ← ⟦e⟧ ∪ {(x_t, y_t)}
14:  end for
15:  return ⟦e⟧
16: end function

1: function SELECT-QUERY(Q, p, ⟦p̂⟧)
2:   repeat
3:     queries ← sample 100 times from Q
4:     score_list ← []                             ▷ Acquisition scores of queries
5:     for q ∈ queries do
6:       s_q ← 0
7:       for p̂ ∈ ⟦p̂⟧ do
8:         ŷ ← p̂(q)
9:         if ŷ is unique then
10:          s_q ← s_q + 1                          ▷ The more diverse the outputs, the better the query
11:        end if
12:      end for
13:      score_list.append((q, s_q))
14:    end for
15:    Sort score_list in descending order of s_q
16:    for q ∈ score_list do                        ▷ Select the highest-scoring query that does not CRASH
17:      y ← p(q)
18:      if y is not CRASH then
19:        return q
20:      end if
21:    end for
22:  until a query is found                         ▷ If all 100 queries result in CRASH, repeat this process
23: end function

C THEOREMS AND PROOFS

C.1 THE PRODUCT OF TWO NORMAL DISTRIBUTIONS

In this section, we show that the product of two Normal densities is a scaled Normal density.

Theorem C.1 (The product of two Normal distributions). Given two Normal densities of the form $p(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, their product is a scaled Normal density of the form $\frac{\alpha}{\sqrt{2\pi}\sigma^*}e^{-\frac{(x-\mu^*)^2}{2\sigma^{*2}}}$, where $\alpha$ is the scale factor.

Proof. Suppose we have two Normal densities $p_a$ and $p_b$:
$$p_a(x) = \frac{1}{\sqrt{2\pi}\sigma_a}e^{-\frac{(x-\mu_a)^2}{2\sigma_a^2}}, \qquad p_b(x) = \frac{1}{\sqrt{2\pi}\sigma_b}e^{-\frac{(x-\mu_b)^2}{2\sigma_b^2}}.$$
The product of $p_a$ and $p_b$ is
$$p_a(x)p_b(x) = \frac{1}{2\pi\sigma_a\sigma_b}e^{-\left(\frac{(x-\mu_a)^2}{2\sigma_a^2} + \frac{(x-\mu_b)^2}{2\sigma_b^2}\right)}.$$
Consider the exponent:
$$\frac{(x-\mu_a)^2}{2\sigma_a^2} + \frac{(x-\mu_b)^2}{2\sigma_b^2}
= \frac{(\sigma_a^2+\sigma_b^2)x^2 - 2(\mu_b\sigma_a^2+\mu_a\sigma_b^2)x + (\mu_a^2\sigma_b^2+\mu_b^2\sigma_a^2)}{2\sigma_a^2\sigma_b^2}
= \frac{x^2 - 2\frac{\mu_b\sigma_a^2+\mu_a\sigma_b^2}{\sigma_a^2+\sigma_b^2}x + \frac{\mu_b^2\sigma_a^2+\mu_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}}{\frac{2\sigma_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}}
= \gamma + \lambda,$$
where
$$\gamma = \frac{\left(x - \frac{\mu_b\sigma_a^2+\mu_a\sigma_b^2}{\sigma_a^2+\sigma_b^2}\right)^2}{\frac{2\sigma_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}}, \qquad
\lambda = \frac{\frac{\mu_b^2\sigma_a^2+\mu_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2} - \left(\frac{\mu_b\sigma_a^2+\mu_a\sigma_b^2}{\sigma_a^2+\sigma_b^2}\right)^2}{\frac{2\sigma_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}}.$$
Simplifying $\lambda$:
$$\lambda = \frac{(\mu_b^2\sigma_a^2+\mu_a^2\sigma_b^2)(\sigma_a^2+\sigma_b^2) - (\mu_b\sigma_a^2+\mu_a\sigma_b^2)^2}{2\sigma_a^2\sigma_b^2(\sigma_a^2+\sigma_b^2)}
= \frac{\sigma_a^2\sigma_b^2(\mu_a^2+\mu_b^2-2\mu_a\mu_b)}{2\sigma_a^2\sigma_b^2(\sigma_a^2+\sigma_b^2)}
= \frac{(\mu_a-\mu_b)^2}{2(\sigma_a^2+\sigma_b^2)}.$$
Finally, we get
$$p_a(x)p_b(x) = \frac{1}{2\pi\sigma_a\sigma_b}e^{-\lambda}e^{-\gamma} = \frac{\alpha}{\sqrt{2\pi}\sigma^*}e^{-\frac{(x-\mu^*)^2}{2\sigma^{*2}}}, \qquad (18)$$
where
$$\mu^* = \frac{\mu_b\sigma_a^2+\mu_a\sigma_b^2}{\sigma_a^2+\sigma_b^2}, \qquad
\sigma^{*2} = \frac{\sigma_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}, \qquad
\alpha = \frac{1}{\sqrt{2\pi(\sigma_a^2+\sigma_b^2)}}e^{-\frac{(\mu_a-\mu_b)^2}{2(\sigma_a^2+\sigma_b^2)}}.$$

Next, we show that approximating the new $\mu^*$ and $\log(\sigma^{*2})$ by weighted sums of $[\mu_a, \mu_b]$ and $[\log(\sigma_a^2), \log(\sigma_b^2)]$ is reasonable. For $\mu^*$, it is obvious that it can be written as a weighted sum:
$$\mu^* = \frac{\mu_b\sigma_a^2+\mu_a\sigma_b^2}{\sigma_a^2+\sigma_b^2} = \frac{\sigma_b^2}{\sigma_a^2+\sigma_b^2}\mu_a + \frac{\sigma_a^2}{\sigma_a^2+\sigma_b^2}\mu_b. \qquad (20)$$
For $\log(\sigma^{*2})$, we can rewrite it as
$$\log(\sigma^{*2}) = \log\left(\frac{\sigma_a^2\sigma_b^2}{\sigma_a^2+\sigma_b^2}\right)
= \log(\sigma_a^2) + \log(\sigma_b^2) - \log(\sigma_a^2+\sigma_b^2)
= (1-h)\log(\sigma_a^2) + (1-k)\log(\sigma_b^2) - \log\left(\frac{1}{\sigma_b^{2k}\sigma_a^{2(h-1)}} + \frac{1}{\sigma_a^{2h}\sigma_b^{2(k-1)}}\right), \qquad (21)$$
where $h$ and $k$ are two coefficients. With suitable learned $h$ and $k$, the last term can be ignored ($\frac{1}{\sigma_b^{2k}\sigma_a^{2(h-1)}} + \frac{1}{\sigma_a^{2h}\sigma_b^{2(k-1)}} \approx 1$), and thus $\log(\sigma^{*2})$ can be approximated by a weighted sum of $[\log(\sigma_a^2), \log(\sigma_b^2)]$.
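A quick numerical check of Theorem C.1, written as a small Python script: it compares the pointwise product of two Normal densities against the scaled Normal given by the closed-form µ*, σ*², and α above.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

mu_a, sigma_a = 0.5, 1.2
mu_b, sigma_b = -1.0, 0.7

# Closed-form parameters of the (scaled) product density
var_star = (sigma_a ** 2 * sigma_b ** 2) / (sigma_a ** 2 + sigma_b ** 2)
mu_star = (mu_b * sigma_a ** 2 + mu_a * sigma_b ** 2) / (sigma_a ** 2 + sigma_b ** 2)
alpha = np.exp(-(mu_a - mu_b) ** 2 / (2 * (sigma_a ** 2 + sigma_b ** 2))) \
        / np.sqrt(2 * np.pi * (sigma_a ** 2 + sigma_b ** 2))

x = np.linspace(-5, 5, 1001)
lhs = normal_pdf(x, mu_a, sigma_a) * normal_pdf(x, mu_b, sigma_b)
rhs = alpha * normal_pdf(x, mu_star, np.sqrt(var_star))
assert np.allclose(lhs, rhs)  # the product is a scaled Normal, as claimed
```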
C.2 THE RECURRENT TRAINING PROCESS

We adopt a recurrent training process to model the mutual information between the input-output examples and the programs more accurately. Traditionally, the next query is obtained by selecting the query with the maximum mutual information conditioned on the former ones:
$$q_k = \arg\max_{x} I(P; (x, y) \mid ⟦e⟧_{k-1}). \qquad (22)$$
However, this greedy strategy fails in some cases. An example is shown in Figure 13. Suppose that, after a series of queries, there remain only 8 candidate programs P = {p_0, ..., p_7}, distributed uniformly, and 4 queries Q = {q_A, ..., q_D}, each with 2 possible responses {✓, ✗}. Our goal is to find the underlying program with as few queries as possible (conditioned on the former queries, which are omitted).

Figure 13: An example: 8 candidate programs p_0-p_7 and 4 queries q_A-q_D, each with two possible responses {✓, ✗}.

According to the definition of the mutual information, $I(X; Y) = \sum_{x \in X}\sum_{y \in Y} P(x, y)\log\frac{P(x, y)}{P(x)P(y)}$, we can calculate the mutual information between the 4 queries and the programs as follows (we write q_{A-D} = ✓/✗ to denote p_i(q_{A-D}) = ✓/✗ for simplicity):
$$I(P; q_A) = \sum_{q_A \in \{✓,✗\}}\sum_{p} P(p, q_A)\log\frac{P(p, q_A)}{P(p)P(q_A)} = 8 \cdot \frac{1}{8}\log_2\frac{1/8}{\frac{1}{8}\cdot\frac{1}{2}} = 1. \qquad (23)$$
Similarly, $I(P; q_B) = I(P; q_C) = I(P; q_D) = 1$. Thus, Equation 22 suggests that the 4 queries share the same priority. When the query process continues, however, this is not the case. For the second query (let $q = (q_A, q_B)$):
$$I(P; q) = \sum_{q \in \{✓,✗\}^2}\sum_{p} P(p, q)\log\frac{P(p, q)}{P(p)P(q)} = 6 \cdot \frac{1}{8}\log_2\frac{8}{3} + 2 \cdot \frac{1}{8}\log_2 8 = 1.81, \qquad (24)$$
where $\{✓,✗\}^2 = \{✓,✗\} \times \{✓,✗\}$. Similarly,
$$I(P; (q_A, q_C)) = I(P; (q_A, q_D)) = I(P; (q_C, q_D)) = 2, \qquad I(P; (q_B, q_C)) = I(P; (q_B, q_D)) = 1.81. \qquad (25)$$
Equations 24 and 25 show that, over a longer horizon, q_B is the worst choice as the first query because it yields the minimum mutual information whichever the second query is. This conclusion is also straightforward without calculation: if we choose q_A, q_C, or q_D as the first query, then we only need to query two more times; if we choose q_B as the first query, then we need 3.25 queries on average, with 4 queries in the worst case. To this end, we adopt a recurrent training process as shown in Figure 3. In the recurrent training process, the gradient can be propagated through multiple query steps, and thus the current query selection can be affected by future queries.
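The toy calculation above can be checked mechanically. The sketch below enumerates response patterns for hypothetical query functions over the 8 programs and computes the mutual information in bits; the specific truth table is an assumption chosen to reproduce the same 1.0 / 1.81 / 2.0 values, since Figure 13's exact pattern is not recoverable from the text.

```python
from collections import Counter
from math import log2

programs = range(8)
queries = {
    "q_A": lambda i: (i >> 0) & 1,        # splits the programs 4/4
    "q_C": lambda i: (i >> 1) & 1,
    "q_D": lambda i: (i >> 2) & 1,
    "q_B": lambda i: int(i in {0, 1, 2, 4}),  # 4/4 alone, 3/1/3/1 jointly
}

def mutual_information(query_names):
    """I(P; responses) in bits for uniformly distributed programs."""
    patterns = Counter(tuple(queries[q](p) for q in query_names) for p in programs)
    n = len(programs)
    # each response-pattern group of size c contributes (c/n) * log2(n/c)
    return sum(c / n * log2(n / c) for c in patterns.values())

print(mutual_information(["q_A"]))          # 1.0
print(mutual_information(["q_A", "q_B"]))   # ~1.81
print(mutual_information(["q_A", "q_C"]))   # 2.0
```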
D ADDITIONAL RELATED WORK

D.1 ACTIVE LEARNING

Active learning is a research domain that aims to reduce the cost of labeling by iteratively selecting the most representative samples from the unlabeled dataset and submitting them to an oracle for labeling during training. According to Shui et al. (2020), active learning can be divided into pool-based sampling (Angluin, 1988; King et al., 2004), which judges whether a sample should be selected for query based on an evaluation of the entire dataset; stream-based sampling (Dagan & Argamon, 1995; Krishnamurthy, 2002), which judges each sample independently; and membership query synthesis (Lewis & Gale, 1994), in which the unlabeled sample can be generated by the learner instead of only being selected from the dataset. Although active learning involves querying an oracle iteratively, which is similar to our framework, there are two main differences. (1) Regarding the final purpose, active learning aims to finish the training stage at low cost, and the inference stage remains the same as in other machine learning tasks, without the query process; our framework aims to find the oracle (i.e., the underlying program) in a symbolic form, and the query process is retained during inference. (2) Regarding the query process, active learning assumes only one oracle (i.e., the same query always gets the same label). In contrast, our framework assumes multiple oracles (i.e., the same query will get different responses if the underlying programs are different).

D.2 AUTOMATED BLACK-BOX TESTING

Black-box testing is a method of software testing that aims to examine the functionality of a black box (such as a piece of program) with a large number of test cases, without perceiving its internal structure. The most prominent automated test-case generation method is learning-based testing (LBT) (Meinke, 2004; Meinke & Niu, 2010; Meinke & Sindhu, 2011). LBT is an iterative approach that generates test cases by interacting with the black box, which sounds similar to the query problem discussed in this paper. However, the setting and the purpose of black-box testing are entirely different from ours. In detail, black-box testing assumes the existence of a target requirement (or target function) and a black-box implementation of this requirement (such as an unknown piece of program). The purpose is to check the black-box implementation to ensure that there is no bug or difference between the target requirement and the black box, by generating test cases (i.e., input-output examples). In contrast, in our setting, the target function does not exist, and all we have is the black box (called the oracle or underlying program in this paper). Our goal is to query this black box and guess which program is hidden inside, based on the experience learned from the training set.

E TWO QUERY EXAMPLES

In this section, we show two query examples that cover all branches of the program while the well-designed dataset fails to do so. Each example consists of the underlying program and the corresponding important queries that improve the branch coverage. See Figure 14 and Figure 15.

Figure 14: A query example of Karel (input, output, and program). The number in a cell denotes the number of markers. Program: if markersPresent: putMarker  if not rightIsClear: putMarker  move  pickMarker  turnRight  repeat R=2: turnLeft  putMarker  putMarker  move  putMarker

Figure 15: Another query example of Karel. The number in a cell denotes the number of markers. Program: repeat R=2: putMarker  turnRight  if frontIsClear: putMarker  putMarker  turnRight  putMarker  while rightIsClear: move  move  pickMarker  turnRight  turnLeft  turnLeft  move  putMarker  turnLeft