# Learning to Infer Program Sketches

Maxwell Nye¹², Luke Hewitt¹²³, Joshua Tenenbaum¹²⁴, Armando Solar-Lezama²

¹MIT Brain and Cognitive Sciences ²MIT CSAIL ³MIT-IBM AI Lab ⁴Center for Brains, Minds and Machines (CBMM). Correspondence to: Maxwell Nye.

## Abstract

Our goal is to build systems which write code automatically from the kinds of specifications humans can most easily provide, such as examples and natural language instruction. The key idea of this work is that a flexible combination of pattern recognition and explicit reasoning can be used to solve these complex programming problems. We propose a method for dynamically integrating these types of information. Our novel intermediate representation and training algorithm allow a program synthesis system to learn, without direct supervision, when to rely on pattern recognition and when to perform symbolic search. Our model matches the memorization and generalization performance of neural synthesis and symbolic search, respectively, and achieves state-of-the-art performance on a dataset of simple English description-to-code programming problems.

## 1. Introduction

An open challenge in AI is to automatically write code from the kinds of specifications humans can easily provide, such as examples or natural language instruction. Such a system must determine both what the task is and how to write the correct code. Consider writing a simple function which maps inputs to outputs:

    [2, 3, 4, 5, 6] -> [2, 4, 6]
    [5, 8, 3, 2, 2, 1, 12] -> [8, 2, 2, 12]

A novice programmer would not recognize any of the program from experience, and would have to reason about the entire program structure from first principles. This reasoning would be done by considering the definitions and syntax of the primitives in the programming language, and finding a way to combine these language constructs to construct an expression with the desired behavior. A moderately experienced programmer might immediately recognize, from learned experience, that because the output list is always a subset of the input list, a filter function is appropriate:

    filter(input, <HOLE>)

where `<HOLE>` is a lambda function which filters elements in the list. The programmer would then have to reason about the correct code for `<HOLE>`. Finally, a programmer very familiar with this domain might immediately recognize both the need for a filter function, as well as the correct semantics for the lambda function, allowing the entire program to be written in one shot:

    filter(input, lambda x: x % 2 == 0)

Depending on the familiarity of the domain and the complexity of the problem, humans use a flexible combination of recognition of learned patterns and explicit reasoning to solve programming problems (Lake et al., 2017). Familiar patterns are used when they exist, and for unfamiliar code elements, explicit reasoning is employed. This flexibility is not unique to input-output examples. For example, a natural language specification could be used to further specify the desired program, e.g., "Find the even values in a list." In this case, the process of writing code is analogous. For example, a programmer might learn that "find X in a list" means filter, and "even" corresponds to the code x % 2 == 0.
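To make this progression concrete, the three stages above can be written out in Python (our illustration of the running example, not the paper's system; the hole is represented here as an unfilled function argument):

```python
# Spec: input-output examples describing the desired behavior.
examples = [([2, 3, 4, 5, 6], [2, 4, 6]),
            ([5, 8, 3, 2, 2, 1, 12], [8, 2, 2, 12])]

# The moderately experienced programmer's sketch, filter(input, <HOLE>):
# the overall structure is fixed, but the lambda is left as a hole.
def sketch(hole):
    return lambda xs: list(filter(hole, xs))

# The expert's one-shot completion of the hole: keep the even values.
program = sketch(lambda x: x % 2 == 0)

# The completed program satisfies every example in the spec.
assert all(program(x) == y for x, y in examples)
```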
For a less familiar task, such as "Find values in the list which are powers of two," a programmer might recognize the need for filter, but would need to reason about how to write a lambda function which classifies powers of two. We propose a system which mimics the human ability to dynamically incorporate pattern recognition and reasoning to solve programming problems from examples or natural language specification. We show that without direct supervision, our model learns to find good intermediates between pattern recognition and symbolic reasoning components, and outperforms existing models on several programming tasks.

Recent work (Murali et al., 2017; Dong & Lapata, 2018) has attempted to combine learned pattern recognition and explicit reasoning using program sketches: schematic outlines of full programs (Solar-Lezama, 2008). In Murali et al. (2017), a neural network is trained to output program sketches when conditioned on a spec, and candidate sketches are converted into full programs using symbolic synthesis techniques, which approximate explicit reasoning from first principles. However, previous systems use static, hand-designed intermediate sketch grammars, which do not allow the system to learn how much to rely on pattern recognition and how much to rely on symbolic search. The neural network is trained to map from spec to a pre-specified sketch: it cannot learn to output a more detailed sketch if the pattern matching task is easy, or a more general sketch if the task is too difficult.

Ideally, a neuro-symbolic synthesis system should dynamically take advantage of the relative strengths of its components. When given an easy or familiar programming task, for example, it should rely on its learned pattern recognition and output a fuller program with a neural network, so that less time is required for synthesis. In contrast, when given a hard task, the system should learn to output a less complete sketch and spend more time filling in the sketch with search techniques. We believe that this flexible integration of neural and symbolic computation, inspired by humans, is necessary for powerful, domain-general intelligence, and for solving difficult programming tasks.

The key idea in this work is to allow a system to learn a suitable intermediate sketch representation between a learned neural proposer and a symbolic search mechanism. Inspired by Murali et al. (2017), our technique comprises a learned neural sketch generator and an enumerative symbolic program synthesizer. In contrast to previous work, however, we use a flexible and domain-general sketch grammar, and a novel self-supervised training objective, which allows the network to learn how much to rely on each component of the system. The result is a flexible, domain-general program synthesis system, which has the ability to learn sophisticated patterns from data, comparably to Devlin et al. (2017), as well as utilize explicit symbolic search for difficult or out-of-sample problems, as in Balog et al. (2016). Without explicit supervision, our model learns good intermediates between neural network and synthesis components. This allows our model to increase data efficiency and generalize better to out-of-sample test tasks compared to RNN-based models.
Concretely, our contributions are as follows:

- We develop a novel neuro-symbolic program synthesis system, which writes programs from input-output examples and natural language specification by learning a suitable intermediate sketch representation between a neural network sketch generator and a symbolic synthesizer.
- We introduce a novel training objective, which we use to train our system to find suitable sketch representations without explicit supervision.
- We validate our system by demonstrating our results in two programming-by-example domains, list processing problems and string transformation problems, and achieve state-of-the-art performance on the AlgoLisp English-to-code test dataset.

## 2. Problem Formulation

Assume that we have a DSL which defines a space of programs, $G$. In addition, we have a set of program specifications, or specs, $\{X_i\}$, which we wish to solve. We assume each spec $X_i$ is satisfied by some true unknown program $F_i$. If our specification contains a set of IO examples $X_i = \{(x_{ij}, y_{ij})\}_{j=1..n}$, then we can say that we have solved a task $X_i$ if we find the true program $F_i$, which must satisfy all of the examples:

$$\forall j : F_i(x_{ij}) = y_{ij}$$

Our goal is to build a system which, given $X_i$, can quickly recover $F_i$. For our purposes, "quickly" is taken to mean that such a solution is found within some threshold time, $\mathrm{Time}(X_i \to F_i) < t$. Formally, then, we wish to maximize the probability that our system solves each problem within this threshold time:

$$\max \; \log P\left[\mathrm{Time}(X_i \to F_i) < t\right] \tag{1}$$

Additionally, for some domains our spec $X$ may contain additional informal information, such as natural language instruction. In this case, we can apply the same formalism, maximizing the probability that the true program $F_i$ is found given the spec $X_i$, within the threshold time.

## 3. Our Approach: Learning to Infer Sketches

### 3.1. System Overview

Our approach, inspired by work such as Murali et al. (2017), is to represent the relationship between program specification and program using an intermediate representation called a program sketch. However, in contrast to previous work, where the division of labor between generating sketches and filling them in is fixed, our approach allows this division of labor to be learned, without additional supervision. We define a sketch simply as a valid program tree in the DSL, where any number of subtrees has been replaced by a special token: `<HOLE>`. Intuitively, this token designates locations in the program tree for which pattern-based recognition is difficult, and more explicit search methods are necessary.

Our system consists of two main components: 1) a sketch generator, and 2) a program synthesizer. The sketch generator is a distribution over program sketches given the spec: $q_\phi(\text{sketch}|X)$. The generator is parametrized by a recurrent neural network, and is trained to assign high probability to sketches which are likely to quickly yield programs satisfying the spec when given to the synthesizer. Details about the learning scheme and architecture will be discussed below. The program synthesizer takes a sketch as a starting point, and performs an explicit symbolic search to fill in the holes in order to find a program which satisfies the spec.

Given a set of test problems, in the form of a set of specs, the system searches for correct programs as follows: the sketch generator, conditioned on the program spec, outputs a distribution over sketches, and a fixed number of candidate sketches $\{s_i\}$ are extracted from the generator.
This set $\{s_i\}$ is then given to the program synthesizer, which searches for full programs maximizing the likelihood of the spec. For each candidate sketch, the synthesizer uses symbolic enumeration techniques to search for full candidate programs which are formed by filling in the holes in the sketch.

Using our approach, our system is able to flexibly learn the optimal amount of detail needed in the sketches, essentially learning how much to rely on each component of the system. Furthermore, due to our domain-general sketch grammar, we are easily able to implement our system in multiple different domains with very little overhead.

### 3.2. Learning to Infer Sketches via Self-supervision

By using sketches as an intermediate representation, we reframe our program synthesis problem (Eq. 1) as follows: learn a sketch generator $q_\phi(s|X)$ which, given a spec $X_i$, produces a "good" sketch $s$ from which the synthesizer can quickly find the solution $F_i$. We may thus wish to maximize the probability that our sketch generator produces a good sketch:

$$\max_\phi \; \log \mathop{P}_{s \sim q_\phi(\cdot|X_i)}\left[\mathrm{Time}(s \to F_i) < t\right] \tag{2}$$

where $\mathrm{Time}(s \to F_i)$ is the time taken for the synthesizer to discover the solution to $X_i$ by filling the holes in sketch $s$, and $t$ is the synthesizer's evaluation budget.

In order to learn a system which is most robust, we make one final modification to Equation (2): at train time we do not necessarily know what the timeout will be during evaluation, so we would like to train a system which is agnostic to the amount of time it would have. Ideally, if a program can be found entirely (or almost entirely) using familiar patterns, then the sketch generator should assign high probability to very complete sketches. However, if a program is unfamiliar or difficult, the sketches it favors should be very general, so that the synthesizer must do most of the computation. To do this, we can train the generator to output sketches which would be suitable for a wide distribution of evaluation budgets. This can be achieved by allowing the budget $t$ to be a random variable, sampled from some distribution $\mathcal{D}_t$. Adding this uncertainty to Equation (2) yields:

$$\max_\phi \; \log \mathop{P}_{\substack{t \sim \mathcal{D}_t \\ s \sim q_\phi(\cdot|X_i)}}\left[\mathrm{Time}(s \to F_i) < t\right] \tag{3}$$

In practice, we can achieve this maximization by self-supervised training. That is, given a dataset of program-spec pairs, for each spec we optimize the generator to produce only the sketches from which we can quickly recover its underlying program. Thus, given training data as $(F, X)$ pairs, our training objective may be written:

$$\text{obj} = \mathop{\mathbb{E}}_{\substack{t \sim \mathcal{D}_t \\ (F,X) \sim G}} \; \log \sum_{s \,:\, \mathrm{Time}(s \to F) < t} q_\phi(s|X) \tag{4}$$

[Figure 1: a spec of input-output examples ([1, 3, -4, 3] -> 3; [-3, 0, 2, -1] -> 2; [7, -4, -5, 2] -> 2) is mapped to a distribution over candidate sketches, e.g., (Count >0 (Map +1 input)) and (Count >0 (Map <HOLE>)), with associated probabilities.] Figure 1. Schematic overview of our model. A program spec (in the form of examples) is fed into a sketch generator, which outputs a distribution over sketches. In our experiments, the neural sketch generator is parametrized by a seq-to-seq RNN with attention. The program sketch is given to a program synthesizer, which searches for full programs which satisfy the spec. Our enumerative synthesizer is guided by a learned recognizer, which is conditioned on the spec and the sketch and predicts the likelihood of using each program token to fill in the sketch.

The sketch generator (the seq-to-seq RNN with attention shown in Figure 1) also has an additional learned LSTM language model, as in Bunel et al. (2018), which reweights the program token probabilities from the seq-to-seq model. This LSTM language model is simply trained to assign high probability to likely sequences of tokens via supervised learning.
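To make the sketch space concrete, the following snippet (our illustration; programs are represented as nested tuples and the hole as a string token, mirroring the example of Figure 1) enumerates every sketch of a program tree, i.e., every way of replacing subtrees with `<HOLE>`. This is the set summed over in Equation (4), and the same enumeration appears as a subroutine in the training procedure of Algorithm 1 below.

```python
from itertools import product

HOLE = "<HOLE>"

def all_sketches(tree):
    """Yield every sketch of `tree`: each subtree may independently be kept
    or replaced by the <HOLE> token."""
    yield HOLE                                  # replace this whole subtree
    if isinstance(tree, tuple):                 # internal node: (op, *children)
        op, *children = tree
        # combine every choice of sketch for each child
        for kids in product(*(all_sketches(c) for c in children)):
            yield (op, *kids)
    else:
        yield tree                              # a leaf may also be kept

# The program from Figure 1: (Count >0 (Map +1 input))
program = ("Count", ">0", ("Map", "+1", "input"))
sketches = set(all_sketches(program))
# Contains the full program, highly specific sketches such as
# ('Count', '>0', ('Map', '+1', '<HOLE>')), and the maximally
# general single-token sketch '<HOLE>'.
assert program in sketches and HOLE in sketches
```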
### 4.2. Synthesis via Enumeration

Our symbolic sketch synthesizer is based on Ellis et al. (2018) and Balog et al. (2016) and has two components: a breadth-first probabilistic enumerator, which enumerates candidate programs from most to least likely, and a neural recognizer, which uses the spec to guide this probabilistic enumeration.

The enumerator, based on Ellis et al. (2018), uses a strongly typed DSL and assumes that all programs are expressions in λ-calculus. Each primitive has an associated production probability. These production probabilities constitute the parameters, $\theta$, for a probabilistic context-free grammar, thus inducing a distribution over programs $p(F|s, \theta)$. Synthesis proceeds by enumerating candidate programs which satisfy a sketch in decreasing probability under this distribution. Enumeration is done in parallel for all the candidate sketches, until a full program is found which satisfies the input-output examples in the spec, or until the enumeration budget is exceeded.

The learned recognizer is inspired by Menon et al. (2013) and the Deepcoder system in Balog et al. (2016). For a given task, an RNN encodes each spec into a latent vector. The latent vectors are averaged, and the result is passed into a feed-forward MLP, which terminates in a softmax layer. The resulting vector is used as the set of production probabilities $\theta$ which the enumerator uses to guide search.

SKETCHADAPT succeeds by exploiting the fundamental difference in search capabilities between its neural and symbolic components. Pure-synthesis approaches can enumerate and check candidate programs extremely quickly: we enumerate roughly $3 \times 10^3$ programs per second, and the fastest enumerator for list processing exceeds $10^6$ programs per second. However, generating expressions larger than a few nodes requires searching an exponentially large space, making enumeration impractical for large programs. Conversely, seq2seq networks (and tree RNNs) require fewer samples to find a solution but take much more time per sample (many milliseconds per candidate in a beam search), so they are restricted to exploring only hundreds of candidates. They therefore succeed when the solution is highly predictable (even if it is long), but fail if even a small portion of the program is too difficult to infer. By flexibly combining these two approaches, our system searches the space of programs more effectively than either approach alone; SKETCHADAPT uses learned patterns to guide a beam search when possible, and fast enumeration for portions of the program which are difficult to recognize. This contrasts with Murali et al. (2017), where the division of labor is fixed and cannot be learned.

### 4.3. Training

The training objective above (Eq. 4) requires that for each training program $F$, we know the set of sketches which can be synthesized into $F$ in less than time $t$ (where the synthesis time is given by $\mathrm{Time}(s \to F)$). A simple way to determine this set would be to simulate synthesis for each candidate sketch, to determine whether synthesis can succeed in less time than $t$. In practice, we do not run synthesis during training of the sketch generator to determine $\mathrm{Time}(s \to F)$. One benefit of the probabilistic enumeration is that it provides an estimate of the enumeration time of a sketch: it is easy to calculate the likelihood $p(F|s, \theta)$ of any full program $F$, given sketch $s$ and production probabilities $\theta = r(X)$ given by the recognizer.
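As a toy illustration of this computation (ours, with made-up production probabilities; in the real system $\theta$ comes from the trained recognizer), $p(F|s, \theta)$ is the product of the production probabilities of the tokens used to fill the sketch's holes, and its inverse bounds the enumeration time, as discussed next:

```python
import math

# Made-up production probabilities for a toy grammar; in SKETCHADAPT
# these are produced by the recognizer, theta = r(X).
theta = {"Count": 0.2, ">0": 0.15, "Map": 0.3, "+1": 0.25, "input": 0.1}

def likelihood(hole_fillings, theta):
    """p(F|s, theta): probability of the productions needed to turn
    sketch s into the full program F."""
    return math.prod(theta[token] for token in hole_fillings)

# Filling (Count >0 (Map +1 <HOLE>)) requires one production:
p_detailed = likelihood(["input"], theta)                             # 0.1
# Filling the maximally general sketch <HOLE> requires all five:
p_general = likelihood(["Count", ">0", "Map", "+1", "input"], theta)  # 0.000225

# More complete sketches leave higher-likelihood completions, so they need
# far less enumeration: ~1/0.1 = 10 vs ~1/0.000225 ≈ 4,444 candidates.
assert p_detailed > p_general
```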
Because we enumerate programs in decreasing order of likelihood, search time (expressed as a number of candidate evaluations) can be upper bounded using the likelihood: $\mathrm{Time}(s \to F) \leq [p(F|s, \theta)]^{-1}$. Since any sketch satisfying $[p(F|s, \theta)]^{-1} < t$ therefore also satisfies $\mathrm{Time}(s \to F) < t$, we can use the inverse likelihood to lower bound Equation (4):

$$\text{obj} \geq \mathop{\mathbb{E}}_{\substack{t \sim \mathcal{D}_t \\ (F,X) \sim G}} \; \log \sum_{s \,:\, [p(F|s,\theta)]^{-1} < t} q_\phi(s|X) \tag{5}$$

The full training and evaluation procedures are summarized in Algorithms 1 and 2.

    Algorithm 1: SKETCHADAPT Training
    Require: sketch generator q_φ(sketch|X); recognizer r_ψ(X, sketch);
             enumerator distribution p(F|θ, sketch); base parameters θ_base

    Train recognizer, r_ψ:
      for (F, X) in Dataset (or sampled from DSL) do
        sample t ~ D_t
        sketches, probs <- list of all possible sketches of F,
                           with probs given by p(F|s, θ_base)
        sketch <- sketch with largest prob s.t. prob < t
        θ <- r_ψ(X, sketch)
        gradient step on ψ to maximize log p(F|θ, sketch)
      end for

    Train sketch generator, q_φ:
      for (F, X) in Dataset (or sampled from DSL) do
        sample t ~ D_t
        θ <- r_ψ(X)
        sketches, probs <- list of all possible sketches of F,
                           with probs given by p(F|s, θ)
        sketch <- sketch with largest prob s.t. prob < t
        gradient step on φ to maximize log q_φ(sketch|X)
      end for

    Algorithm 2: SKETCHADAPT Evaluation
    Require: sketch generator q_φ(sketch|X); recognizer r_ψ(X, sketch);
             enumerator distribution p(F|θ, sketch)

    function synthesizeProgram(X):
      sketches <- beam search q_φ(·|X)
      for sketch in sketches do   {in parallel}
        θ_sketch <- r_ψ(X, sketch)
        while timeout not exceeded do
          F <- next full program from enumerate(sketch, θ_sketch)
          if F satisfies X then return F
        end while
      end for

For evaluation, we compare against two alternate models, which ablate the two components of our system. In the Synthesizer only alternate model, the sketch generator is bypassed and the synthesizer enumerates from the maximally general sketch, a single `<HOLE>` token. This model is comparable to the Deepcoder system in Balog et al. (2016), which was developed to solve the list transformation tasks we examine in subsection 5.1. The Generator only alternate model is a fully seq-to-seq RNN, equivalent in architecture to our sketch generator, trained simply to predict the entire program. This model is comparable to the RobustFill model in Devlin et al. (2017), which was developed to solve the string transformation tasks we examine in subsection 5.2. This model is also comparable to the sequence-to-sequence models in Polosukhin & Skidanov (2018).

### 5.1. List Processing

In our first, small-scale experiment, we examine problems that require an agent to synthesize programs which transform lists of integers. We use the list processing DSL from Balog et al. (2016), which consists of 34 unique primitives.

Table 1. Example list processing programs

| Input | Output | Program |
|---|---|---|
| 1, [-101, 63, 64, 79, 119, 91, -56, 47, -74, -33] | 39 | (MAXIMUM (MAP DIV3 (DROP input0 input1))) |
| 4, [-6, -96, -45, 17, 26, -38, 17, -18, -112, -48] | 8 | |
| 2, [-9, 5, -8, -9, 9, -3, 7, -5, -10, 1] | [100, 16] | (TAKE input0 (MAP SQR (MAP DEC input1))) |
| 3, [-5, -8, 0, 10, 2, -7, -3, -5, 6, 2] | [36, 81, 1] | |

[Figure 2: plots of % of problems solved vs. number of candidates evaluated per problem, for SKETCHADAPT (ours), Synthesizer only (Deepcoder), and Generator only (RobustFill).] Figure 2. Left: Results of model trained on list processing programs of length 3, using a beam size of 100, plotted as a function of the enumeration budget. Right: Generalization results: models trained on list processing programs of length 3, evaluated on programs of length 4.
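To ground the DSL's semantics, here is a minimal Python evaluator for the s-expression programs of Table 1 (our illustration; it implements only the primitives the table uses, and assumes DIV3 is integer division by 3):

```python
def evaluate(expr, env):
    """Evaluate an s-expression program from the list DSL (tiny subset)."""
    if isinstance(expr, str):                 # input variable reference
        return env[expr]
    op, *args = expr
    if op == "DROP":                          # drop the first n elements
        n, xs = evaluate(args[0], env), evaluate(args[1], env)
        return xs[n:]
    if op == "TAKE":                          # keep the first n elements
        n, xs = evaluate(args[0], env), evaluate(args[1], env)
        return xs[:n]
    if op == "MAP":                           # args[0] names a lambda
        fns = {"DIV3": lambda x: x // 3, "SQR": lambda x: x * x,
               "DEC": lambda x: x - 1}
        return [fns[args[0]](x) for x in evaluate(args[1], env)]
    if op == "MAXIMUM":
        return max(evaluate(args[0], env))
    raise ValueError(f"unknown primitive: {op}")

prog = ("MAXIMUM", ("MAP", "DIV3", ("DROP", "input0", "input1")))
env = {"input0": 1,
       "input1": [-101, 63, 64, 79, 119, 91, -56, 47, -74, -33]}
assert evaluate(prog, env) == 39              # first row of Table 1

prog2 = ("TAKE", "input0", ("MAP", "SQR", ("MAP", "DEC", "input1")))
env2 = {"input0": 2, "input1": [-9, 5, -8, -9, 9, -3, 7, -5, -10, 1]}
assert evaluate(prog2, env2) == [100, 16]     # third row of Table 1
```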
The primitives consist of first-order list functions, such as head, last and reverse; higher-order list functions, such as map, filter and zipwith; and lambda functions, such as min, max and negate. Our programs are semantically equivalent, but differ from those in Balog et al. (2016) in that we use no bound variables (apart from input variables), and instead synthesize a single s-expression. As in Balog et al. (2016), the spec for each task is a small number of input-output examples. See Table 1 for sample programs and examples.

Our goal was to determine how well our system could perform in two regimes: within-sample, where the test data is similar to the training data, and out-of-sample, where the test data distribution is different from the training distribution. We trained our model on programs of length 3, and tested its performance on two datasets, one consisting of 100 programs of length 3, and the other of 100 programs of length 4. With these experiments, we could determine how well our system synthesizes easy and familiar programs (length 3), and difficult programs which require generalization (length 4). During both training and evaluation, the models were conditioned on 5 example input-output pairs, which contain integers with magnitudes up to 128.

In Figure 2, we plot the proportion of tasks solved as a function of the number of candidate programs enumerated per task. Although a Generator only RNN model is able to synthesize many length 3 programs, it performs very poorly on the out-of-sample length 4 programs. We also observe that, while the Synthesizer only model can take advantage of a large enumeration budget and solve a higher proportion of out-of-sample tasks than the Generator only RNN, it does not take advantage of learned patterns to synthesize the length 3 programs quickly, due to poor inductive biases. Only our model is able to perform well on both within-sample and out-of-sample tasks. Although the synthesis-only Deepcoder model was developed for the list processing problems, our SKETCHADAPT model requires much less enumeration to achieve high accuracy for the within-sample length 3 programs, and performs comparably for the out-of-sample length 4 programs, far exceeding the Generator only RNN-based model.

### 5.2. String Transformations

In our second test domain, we explored programs which perform string transformations, as in Gulwani (2011). These problems involve finding a program which maps an input string to an output string. Typically, these programs are used to manipulate the syntactic form of the underlying information, with minimal changes to the underlying semantics. Examples include converting a list of "First Name Last Name" to "Last Initial, Firstname". These problems have been studied by Gulwani (2011); Polozov & Gulwani (2015); Devlin et al. (2017) and others. We show that our system is able to accurately recover these programs.

[Figure 3: % of problems solved vs. number of candidates evaluated per problem, for SKETCHADAPT (beam 100 and beam 50, ours), Generator only (beam 100 and beam 50, RobustFill), and Synthesizer only (Deepcoder).] Figure 3. Performance on string editing problems. Although RobustFill was developed for string editing problems, SKETCHADAPT achieves higher accuracy on these tasks.

Table 2. Example string editing programs

| Input | Output | Program |
|---|---|---|
| Madelaine | M- | (concat_list (expr GetUpTo Char) (concat_single (Constant dash))) |
| Olague | O- | |
| 118-980-214 | 214 | (concat_single (expr GetToken Number -1)) |
| 938-242-504 | 504 | |
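For intuition, the two programs in Table 2 can be transcribed into plain Python (our transcription, assuming RobustFill-style semantics for GetUpTo and GetToken; the regular expression for Number tokens is our assumption):

```python
import re

# (concat_list (expr GetUpTo Char) (concat_single (Constant dash))):
# the prefix up to and including the first character, followed by "-".
prog1 = lambda s: s[:1] + "-"

# (concat_single (expr GetToken Number -1)): the last Number token.
prog2 = lambda s: re.findall(r"\d+", s)[-1]

assert prog1("Madelaine") == "M-" and prog1("Olague") == "O-"
assert prog2("118-980-214") == "214" and prog2("938-242-504") == "504"
```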
As our test corpus, we used string editing problems from the SyGuS (Alur et al., 2016) program synthesis competition, and string editing tasks used in Ellis et al. (2018). We excluded tasks requiring multiple input strings or a pre-specified string constant, leaving 48 SyGuS programs and 79 programs from Ellis et al. (2018). Because we had a limited corpus of problems, we trained our system on synthetic data only, sampling all training programs from the DSL. Because our system has no access to the test distribution, this domain allows us to see how well our method is able to solve real-world problems when trained only on a synthetic distribution. Furthermore, the string editing DSL is much larger than the list transformation DSL. This means that enumerative search is both slower and less effective than for the list transformation programs, where a fast enumerator could brute-force the entire search space (Balog et al., 2016). Because of this, the Synthesizer only model is not able to consistently enumerate sufficiently complex programs from scratch.

We trained our model using self-supervision, sampling training programs randomly from the DSL and conditioning the models on 4 examples of input-output pairs, and evaluated on our test corpus. We plot our results in Figure 3. Overall, SKETCHADAPT outperforms the Synthesizer only model, and matches or exceeds the performance of the Generator only RNN model, which is noteworthy given that it is equivalent to RobustFill, which was designed to synthesize string editing programs. We also note that the beam size used in evaluation of the Generator only RNN model has a large impact on performance. However, the performance of our SKETCHADAPT system is less dependent on beam size, suggesting that the system is able to effectively supplement a smaller beam size with enumeration.

### 5.3. AlgoLisp: Description to Programs

Our final evaluation domain is the AlgoLisp DSL and dataset, introduced in Polosukhin & Skidanov (2018). The AlgoLisp dataset consists of programs which manipulate lists of integers and lists of strings. In addition to input-output examples, each specification also contains a natural language description of the desired program. We use this dataset to examine how well our system can take advantage of highly unstructured specification data such as natural language, in addition to input-output examples.

The AlgoLisp problems are very difficult to solve using only examples, due to the very large search space and program complexity (see Table 4: SKETCHADAPT, IO only). However, the natural language description makes it possible, with enough data, to learn a highly accurate semantic parsing scheme and solve many of the test tasks. In addition, because this domain uses real data, and not data generated from self-supervision, we wish to determine how data-efficient our algorithm is.

[Figure 4: % of test programs solved vs. number of training programs used (2000 up to 79214, the full dataset), for our model, Generator only (RobustFill), and Synthesizer only (Deepcoder).] Figure 4. AlgoLisp: varying training data size. We trained our model and baselines on various dataset sizes, and evaluated performance on a held-out test dataset. Our SKETCHADAPT system considerably outperforms baselines in the low-data regime.
Therefore, we train our model on subsets of the data of various sizes to test generalization.

Table 3. Example problems from the AlgoLisp dataset

| Spec | Program |
|---|---|
| Consider an array of numbers, find elements in the given array not divisible by two | (filter a (lambda1 (== (% arg1 2) 1))) |
| You are given an array of numbers, your task is to compute median in the given array with its digits reversed | (reduce (reverse (digits (deref (sort a) (/ (len a) 2)))) 0 (lambda2 (+ (* arg1 10) arg2))) |

Table 4. AlgoLisp results on full dataset

| Model | Full dataset (Dev) | Full dataset (Test) | Filtered¹ (Dev) | Filtered¹ (Test) |
|---|---|---|---|---|
| SKETCHADAPT (Ours) | 88.8 | 90.0 | 95.0 | 95.8 |
| Synthesizer only | 5.2 | 7.3 | 5.6 | 8.0 |
| Generator only | 91.4 | 88.6 | 98.4 | 95.6 |
| SKETCHADAPT, IO only | 4.9 | 8.3 | 5.6 | 8.8 |
| Seq2Tree+Search | 86.1 | 85.8 | - | - |
| SAPS² | 83.0 | 85.2 | 93.2 | 92.0 |

Table 5. AlgoLisp generalization results: trained on 8000 programs, excluding the "odd" concept

| Model | Even | Odd |
|---|---|---|
| SKETCHADAPT (Ours) | 34.4 | 29.8 |
| Synthesizer only | 23.7 | 0.0 |
| Generator only | 4.5 | 1.1 |

Figure 4 and Table 4 depict our main results for this domain, testing all systems with a maximum timeout of 300 seconds per task.¹ When using a beam size of 10 on the full dataset, SKETCHADAPT and the Generator only RNN baseline far exceed previously reported state-of-the-art performance and achieve near-perfect accuracy, whereas the Synthesizer only model is unable to achieve high performance. However, when a smaller number of training programs is used, SKETCHADAPT significantly outperforms the Generator only RNN baseline. These results indicate that the symbolic search allows SKETCHADAPT to perform stronger generalization than pure neural search methods.

Strong generalization to unseen subexpressions: As a final test of generalization, we trained SKETCHADAPT and our baseline models on a random sample of 8000 training programs, excluding all those which contain the function "odd," as expressed by the AlgoLisp subexpression (lambda1 (== (% arg1 2) 1)) (in Python, lambda x: x % 2 == 1). We then evaluated on all 635 test programs containing "odd," as well as the 638 containing "even" (lambda x: x % 2 == 0). As shown in Table 5, the Generator only RNN baseline exhibits only weak generalization, solving novel tasks which require the "even" subexpression but not those which require the previously unseen "odd" subexpression. By contrast, SKETCHADAPT exhibits strong generalization to both "even" and "odd" programs.

¹As in Bednarek et al. (2018), we filter the test and dev datasets for only those tasks for which reference programs satisfy the given specs. The filtered version is also used for Figure 4.
²Introduced in Bednarek et al. (2018).

## 6. Related Work

Our work takes inspiration from the neural program synthesis work of Balog et al. (2016), Devlin et al. (2017) and Murali et al. (2017). Much recent work has focused on learning programs using deep learning, as in Kalyan et al. (2018), Bunel et al. (2018), and Shin et al. (2018), or on combining symbolic and learned components, such as Parisotto et al. (2016), Kalyan et al. (2018), Chen et al. (2017), Zohar & Wolf (2018), and Zhang et al. (2018). Sketches have also been explored for semantic parsing (Dong & Lapata, 2018) and differentiable programming (Bošnjak et al., 2017). We also take inspiration from the programming languages literature, particularly Sketch (Solar-Lezama, 2008) and angelic nondeterminism (Bodik et al., 2010). Other work exploring symbolic synthesis methods includes λ² (Feser et al., 2015) and Schkufza et al. (2016).
Learning programs has also been studied from a Bayesian perspective, as in EC (Dechter et al., 2013), Bayesian Program Learning (Lake et al., 2015), and inference compilation (Le et al., 2016).

## 7. Discussion

We developed a novel neuro-symbolic scheme for synthesizing programs from examples and natural language. Our system, SKETCHADAPT, combines neural networks and symbolic synthesis by learning an intermediate sketch representation, which dynamically adapts its specificity for each task. Empirical results show that SKETCHADAPT recognizes common motifs as effectively as pure RNN approaches, while matching or exceeding the generalization of symbolic synthesis methods. We believe that difficult program synthesis tasks cannot be solved without flexible integration of pattern recognition and explicit reasoning, and this work provides an important step towards this goal. We also hypothesize that learned integration of different forms of computation is necessary not only for writing code, but also for other complex AI tasks, such as high-level planning, rapid language learning, and sophisticated question answering. In future work, we plan to explore the ideas presented here for other difficult AI domains.

## Acknowledgements

The authors would like to thank Kevin Ellis and Lucas Morales for very useful feedback, as well as assistance using the EC codebase. M. N. is supported by an NSF Graduate Fellowship and an MIT BCS Hilibrand Graduate Fellowship. L. H. is supported by the MIT-IBM Watson AI Lab.

## References

Alur, R., Fisman, D., Singh, R., and Solar-Lezama, A. SyGuS-Comp 2016: Results and analysis. arXiv preprint arXiv:1611.07627, 2016.

Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.

Bednarek, J., Piaskowski, K., and Krawiec, K. Ain't nobody got time for coding: Structure-aware program synthesis from natural language. arXiv preprint arXiv:1810.09717, 2018.

Bodik, R., Chandra, S., Galenson, J., Kimelman, D., Tung, N., Barman, S., and Rodarmor, C. Programming with angelic nondeterminism. In ACM SIGPLAN Notices, volume 45, pp. 339–352. ACM, 2010.

Bošnjak, M., Rocktäschel, T., Naradowsky, J., and Riedel, S. Programming with a differentiable Forth interpreter. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 547–556. JMLR.org, 2017.

Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.

Chen, X., Liu, C., and Song, D. Towards synthesizing complex programs from input-output examples. arXiv preprint arXiv:1706.01284, 2017.

Dechter, E., Malmaud, J., Adams, R. P., and Tenenbaum, J. B. Bootstrap learning via modular concept discovery. In IJCAI, pp. 1302–1309, 2013.

Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. RobustFill: Neural program learning under noisy I/O. arXiv preprint arXiv:1703.07469, 2017.

Dong, L. and Lapata, M. Coarse-to-fine decoding for neural semantic parsing. arXiv preprint arXiv:1805.04793, 2018.

Ellis, K., Morales, L., Sablé-Meyer, M., Solar-Lezama, A., and Tenenbaum, J. Learning libraries of subroutines for neurally guided Bayesian program learning. NIPS, 2018.

Feser, J. K., Chaudhuri, S., and Dillig, I. Synthesizing data structure transformations from input-output examples. In ACM SIGPLAN Notices, volume 50, pp. 229–239. ACM, 2015.
Gulwani, S. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, volume 46, pp. 317–330. ACM, 2011.

Kalyan, A., Mohta, A., Polozov, O., Batra, D., Jain, P., and Gulwani, S. Neural-guided deductive search for real-time program synthesis from examples. 2018.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Le, T. A., Baydin, A. G., and Wood, F. Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900, 2016.

Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. A machine learning framework for programming by example. In International Conference on Machine Learning, pp. 187–195, 2013.

Murali, V., Qi, L., Chaudhuri, S., and Jermaine, C. Neural sketch learning for conditional program generation. arXiv preprint arXiv:1703.05698, 2017.

Parisotto, E., Mohamed, A.-r., Singh, R., Li, L., Zhou, D., and Kohli, P. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.

Polosukhin, I. and Skidanov, A. Neural program search: Solving programming tasks from description and examples. arXiv preprint arXiv:1802.04335, 2018.

Polozov, O. and Gulwani, S. FlashMeta: A framework for inductive program synthesis. In ACM SIGPLAN Notices, volume 50, pp. 107–126. ACM, 2015.

Schkufza, E., Sharma, R., and Aiken, A. Stochastic program optimization. Communications of the ACM, 59(2):114–122, 2016.

Shin, R., Polosukhin, I., and Song, D. Improving neural program synthesis with inferred execution traces. In Advances in Neural Information Processing Systems, pp. 8931–8940, 2018.

Solar-Lezama, A. Program synthesis by sketching. PhD thesis, University of California, Berkeley, 2008.

Zhang, L., Rosenblatt, G., Fetaya, E., Liao, R., Byrd, W., Might, M., Urtasun, R., and Zemel, R. Neural guided constraint logic programming for program synthesis. In Advances in Neural Information Processing Systems, pp. 1744–1753, 2018.

Zohar, A. and Wolf, L. Automatic program synthesis of long programs with a learned garbage collector. In Advances in Neural Information Processing Systems, pp. 2098–2107, 2018.