Learning Differentiable Programs with Admissible Neural Heuristics

Ameesh Shah (UC Berkeley) ameesh@berkeley.edu, Eric Zhan (Caltech) ezhan@caltech.edu, Jennifer J. Sun (Caltech) jjsun@caltech.edu, Abhinav Verma (UT Austin) verma@utexas.edu, Yisong Yue (Caltech) yyue@caltech.edu, Swarat Chaudhuri (UT Austin) swarat@cs.utexas.edu

Abstract

We study the problem of learning differentiable functions expressed as programs in a domain-specific language. Such programmatic models can offer benefits such as composability and interpretability; however, learning them requires optimizing over a combinatorial space of program architectures. We frame this optimization problem as a search in a weighted graph whose paths encode top-down derivations of program syntax. Our key innovation is to view various classes of neural networks as continuous relaxations over the space of programs, which can then be used to complete any partial program. This relaxed program is differentiable and can be trained end-to-end, and the resulting training loss is an approximately admissible heuristic that can guide the combinatorial search. We instantiate our approach on top of the A* algorithm and an iteratively deepened branch-and-bound search, and use these algorithms to learn programmatic classifiers in three sequence classification tasks. Our experiments show that the algorithms outperform state-of-the-art methods for program learning, and that they discover programmatic classifiers that yield natural interpretations and achieve competitive accuracy.

1 Introduction

An emerging body of work advocates program synthesis as an approach to machine learning. The methods here learn functions represented as programs in symbolic, domain-specific languages (DSLs) [12, 11, 49, 44, 46, 45].
Such symbolic models have a number of appeals: they can be more interpretable than neural models, they use the inductive bias embodied in the DSL to learn reliably, and they use compositional language primitives to transfer knowledge across tasks. In this paper, we study how to learn differentiable programs, which use structured, symbolic primitives to compose a set of parameterized, differentiable modules. Differentiable programs have recently attracted much interest due to their ability to leverage the complementary advantages of programming language abstractions and differentiable learning. For example, recent work has used such programs to compactly describe modular neural networks that operate over rich, recursive data types [44]. To learn a differentiable program, one needs to induce the program's architecture while simultaneously optimizing the parameters of the program's modules. This co-design task is difficult because the space of architectures is combinatorial and explodes rapidly. Prior work has approached this challenge using methods such as greedy enumeration, Monte Carlo sampling, Monte Carlo tree search, and evolutionary algorithms [46, 44, 10]. However, such approaches can often be expensive, due to not fully exploiting the structure of the underlying combinatorial search problem. In this paper, we show that the differentiability of programs opens up a new line of attack on this search problem. A standard strategy for combinatorial optimization is to exploit (ideally fairly tight) continuous relaxations of the search space [31, 6, 48, 42, 24, 2, 45]. Optimization in the relaxed space is typically easier and can efficiently guide search algorithms towards good or optimal solutions. In the case of program learning, we propose to use various classes of neural networks as relaxations of partial programs.

*Equal contribution.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
We frame our problem as searching a graph, in which nodes encode program architectures with missing expressions, and paths encode top-down program derivations. For each partial architecture u encountered during this search, the relaxation amounts to substituting the unknown part of u with a neural network with free parameters. Because programs are differentiable, this network can be trained on the problem's end-to-end loss. If the space of neural networks is an (approximate) proper relaxation of the space of programs (and training identifies a near-optimum neural network), then the training loss for the relaxation can be viewed as an (approximately) admissible heuristic. We instantiate our approach, called NEAR (abbreviation for Neural Admissible Relaxation), on top of two informed search algorithms: A* and an iteratively deepened depth-first search that uses a heuristic to direct branching as well as branch-and-bound pruning (IDS-BB). We evaluate the algorithms on the task of learning programmatic classifiers in three behavior classification applications. We show that the algorithms substantially outperform state-of-the-art methods for program learning, and can learn classifier programs that bear natural interpretations and are close to neural models in accuracy. To summarize, the paper makes three contributions. First, we identify a tool, heuristics obtained by training neural relaxations of programs, for accelerating combinatorial searches over differentiable programs. So far as we know, this is the first approach to exploit the differentiability of a programming language in program synthesis. Second, we instantiate this idea using two classic search algorithms. Third, we present promising experimental results in three sequence classification applications.

2 Problem Formulation

We view a program in our domain-specific language (DSL) as a pair (α, θ), where α is a discrete (program) architecture and θ is a vector of real-valued parameters.
The architecture α is generated using a context-free grammar [21]. The grammar consists of a set of rules X → σ1 . . . σk, where X is a nonterminal and σ1, . . . , σk are either nonterminals or terminals. A nonterminal stands for a missing subexpression; a terminal is a symbol that can actually appear in a program's code. The grammar starts with an initial nonterminal, then iteratively applies the rules to produce a series of partial architectures: sentences made from one or more nonterminals and zero or more terminals. The process continues until there are no nonterminals left, i.e., we have a complete architecture. The semantics of the architecture α is given by a function [[α]](x, θ), defined by rules that are fixed for the DSL. We require this function to be differentiable in θ. Also, we define a structural cost for architectures. Let each rule r in the DSL grammar have a non-negative real cost s(r). The structural cost of α is s(α) = Σ_{r ∈ R(α)} s(r), where R(α) is the multiset of rules used to create α. Intuitively, architectures with lower structural cost are simpler and more human-interpretable. To define our learning problem, we assume an unknown distribution D(x, y) over inputs x and labels y, and consider the prediction error function ζ(α, θ) = E_{(x,y)∼D}[1([[α]](x, θ) ≠ y)], where 1 is the indicator function. Our goal is to find an architecturally simple program with low prediction error, i.e., to solve the optimization problem:

(α*, θ*) = argmin_{(α,θ)} (s(α) + ζ(α, θ)).   (1)

Program Learning for Sequence Classification. Program learning is applicable in many settings; we specifically study it in the sequence classification context [9]. Now we sketch our DSL for this domain. Like many other DSLs for program synthesis [15, 3, 44], our DSL is purely functional. The language has the following characteristics: Programs in the DSL operate over two data types: real vectors and sequences of real vectors.
We assume a simple type system that makes sure that these types are used consistently.

α ::= x | c | ⊕(α1, . . . , αk) | ⊕θ(α1, . . . , αk) | if α1 then α2 else α3 | sel_S x | map (λx1. α1) x | fold (λx1. α1) c x | mapprefix (λx1. α1) x

Figure 1: Grammar of DSL for sequence classification. Here, x, c, ⊕, and ⊕θ represent inputs, constants, basic algebraic operations, and parameterized library functions, respectively. sel_S returns a vector consisting of a subset S of the dimensions of an input x.

map(if DistAffine_[.0217];−.2785(x) then AccAffine_[−.0007,.0055,.0051,−.0025];3.7426(x) else DistAffine_[−.2143];1.822(x))

Figure 2: Synthesized program classifying a sniff action between two mice in the CRIM13 dataset. DistAffine and AccAffine are functions that first select the parts of the input that represent distance and acceleration measurements, respectively, and then apply affine transformations to the resulting vectors. In the parameters (subscripts) of these functions, the brackets contain the weight vectors for the affine transformation, and the succeeding values are the biases. The program achieves an accuracy of 0.87 (vs. 0.89 for an RNN baseline) and can be interpreted as follows: if the distance between two mice is small, they are doing a sniff (large bias in the else clause). Otherwise, they are doing a sniff if the difference between their accelerations is small.

Programs use a set of fixed algebraic operations ⊕ as well as a library of differentiable, parameterized functions ⊕θ. Because we are motivated by interpretability, the library used in our current implementation only contains affine transformations. In principle, it could be extended to include other kinds of functions as well. Programs use a set of higher-order combinators to recurse over sequences. In particular, we allow the standard map and fold combinators. To compactly express sequence-to-sequence functions, we also allow a special mapprefix combinator. Let g be a function that maps sequences to vectors.
For a sequence x, mapprefix(g, x) equals the sequence ⟨g(x[1:1]), g(x[1:2]), . . . , g(x[1:n])⟩, where x[1:i] is the i-th prefix of x. Programs can use a conditional branching construct. However, to avoid discontinuities, we interpret this construct in terms of a smooth approximation:

[[if α1 > 0 then α2 else α3]](x, (θ1, θ2, θ3)) = σ(β⁻¹ · [[α1]](x, θ1)) · [[α2]](x, θ2) + (1 − σ(β⁻¹ · [[α1]](x, θ1))) · [[α3]](x, θ3).   (2)

Here, σ is the sigmoid function and β is a temperature hyperparameter. As β → 0, this approximation approaches the usual if-then-else construct. Figure 1 summarizes our DSL in the standard Backus-Naur form [47]. Figures 2 and 3 show two programs synthesized by our learning procedure using our DSL with libraries of domain-specific affine transformations (see the supplementary material). Both programs offer an interpretation in their respective domains, while offering respectable performance against an RNN baseline.

3 Program Learning using NEAR

We formulate our program learning problem as a form of graph search. The search derives program architectures top-down: it begins with the empty architecture, generates a series of partial architectures following the DSL grammar, and terminates when a complete architecture is derived. In more detail, we imagine a graph G in which: The node set consists of all partial and complete architectures permissible in the DSL. The source node u0 is the empty architecture. Each complete architecture is a goal node. Edges are directed and capture single-step applications of rules of the DSL. Edges can be divided into: (i) internal edges (u, u′) between partial architectures u and u′, and (ii) goal edges (u, α) between a partial architecture u and a complete architecture α. An internal edge (u, u′) exists if one can obtain u′ by substituting a nonterminal in u following a rule of the DSL. A goal edge (u, α) exists if we can complete u into α by applying a rule of the DSL.
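The mapprefix combinator and the smoothed conditional of Eq. (2) can both be stated concretely. The following is a minimal Python sketch; the names `smooth_if` and `mapprefix` are ours, written for illustration, and are not taken from the NEAR implementation:

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def smooth_if(cond, then_val, else_val, beta=0.05):
    """Smooth approximation of `if cond > 0 then a else b`, as in Eq. (2).
    The gate sigma(cond / beta) saturates as beta -> 0, recovering the hard
    branch while remaining differentiable for any beta > 0."""
    gate = sigmoid(cond / beta)
    return gate * then_val + (1.0 - gate) * else_val

def mapprefix(g, xs):
    """mapprefix(g, x) = <g(x[1:1]), g(x[1:2]), ..., g(x[1:n])>,
    where x[1:i] is the i-th prefix of x."""
    return [g(xs[:i + 1]) for i in range(len(xs))]
```

For example, `mapprefix(sum, [1, 2, 3])` yields the running sums `[1, 3, 6]`, and shrinking `beta` hardens the gate of `smooth_if` toward a discrete branch.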
map(multiply(add(OffenseAffine(x), BallAffine(x)), add(OffenseAffine(x), BallAffine(x))))

Figure 3: Synthesized program classifying the ballhandler for basketball. OffenseAffine() and BallAffine() are parameterized affine transformations over the XY-coordinates of the offensive players and the ball (see the appendix for full parameters). multiply and add are computed element-wise. The program structure can be interpreted as computing the Euclidean norm/distance between the offensive players and the ball, and suggests that this quantity can be important for determining the ballhandler. On a set of learned parameters (not shown), this program achieves an accuracy of 0.905 (vs. 0.945 for an RNN baseline).

The cost of an internal edge (u, u′) is given by the structural cost s(r), where r is the rule used to construct u′ from u. The cost of a goal edge (u, α) is s(r) + ζ(α, θ*), where θ* = argmin_θ ζ(α, θ) and r is the rule used to construct α from u. A path in the graph G is defined as usual, as a sequence of nodes u1, . . . , uk such that there is an edge (ui, ui+1) for each i ∈ {1, . . . , k − 1}. The cost of a path is the sum of the costs of these edges. Our goal is to discover a least-cost path from the source u0 to some goal node α. Then, by construction of our edge costs, α is an optimal solution to our learning problem in Eq. (1).

3.1 Neural Relaxations as Admissible Heuristics

Figure 4: An example of program learning formulated as graph search. Structural costs are in red, heuristic values in black, prediction errors in blue, O refers to a nonterminal in a partial architecture, and the path to a goal node returned by A*-NEAR search is in teal.

The main challenge in our search problem is that our goal edges contain rich cost information, but this information is only accessible when a path has been explored until the end. A heuristic function h(u) that can predict the value of choices made at nodes u encountered early in the search can help with this difficulty.
If such a heuristic is admissible, i.e., underestimates the cost-to-go, it enables the use of informed search strategies such as A* and branch-and-bound while guaranteeing optimal solutions. Our NEAR approach (abbreviation for Neural Admissible Relaxation) uses neural approximations of spaces of programs to construct a heuristic that is ε-close to being admissible. Let a completion of a partial architecture u be a (complete) architecture u[α1, . . . , αk] obtained by replacing the nonterminals in u by suitably typed architectures αi. Let θu be the parameters of u and θ be the parameters of the αi-s. The cost-to-go at u is given by:

J(u) = min_{α1,...,αk, θu, θ} ((s(u[α1, . . . , αk]) − s(u)) + ζ(u[α1, . . . , αk], (θu, θ)))   (3)

where the structural cost s(u) is the sum of the costs of the grammatical rules used to construct u. To compute a heuristic cost h(u) for a partial architecture u encountered during search, we substitute the nonterminals in u with neural networks parameterized by ω. These networks are type-correct; for example, if a nonterminal is supposed to generate subexpressions whose inputs are sequences, then the neural network used in its place is recurrent. We show an example of NEAR used in the program-learning graph search formulation in Figure 4. We view the neurosymbolic programs resulting from this substitution as tuples (u, (θu, ω)). We define a semantics for such programs by extending our DSL's semantics, and lift the function ζ to assign costs ζ(u, (θu, ω)) to such programs. The heuristic cost for u is now given by:

h(u) = min_{ω, θu} ζ(u, (θu, ω)).   (4)

As ζ(u, (θu, ω)) is differentiable in ω and θu, we can compute h(u) using gradient descent.

ε-Admissibility. In practice, the neural networks that we use may only form an approximate relaxation of the space of completions and parameters of architectures; also, the training of these networks may not reach global optima. To account for these errors, we consider an approximate notion of admissibility.
Many such notions have been considered in the past [20, 31, 43]; here, we follow a definition used by Harris [20]. For a fixed constant ε > 0, let an ε-admissible heuristic be a function h_ε(u) over architectures such that h_ε(u) ≤ J(u) + ε for all u. Now consider any completion u[α1, . . . , αk] of an architecture u. As neural networks with adequate capacity are universal function approximators, there exist parameters ω* for our neurosymbolic program such that for all u, α1, . . . , αk, θu, and θ:

ζ(u, (θu, ω*)) ≤ ζ(u[α1, . . . , αk], (θu, θ)) + ε.   (5)

Because edges in our search graph have non-negative costs, s(u) ≤ s(u[α1, . . . , αk]), implying:

h(u) ≤ min_{α1,...,αk, θu, θ} ζ(u[α1, . . . , αk], (θu, θ)) + ε
     ≤ min_{α1,...,αk, θu, θ} (ζ(u[α1, . . . , αk], (θu, θ)) + (s(u[α1, . . . , αk]) − s(u))) + ε
     = J(u) + ε.   (6)

In other words, h(u) is ε-admissible.

Empirical Considerations. We have formulated our learning problem in terms of the true prediction error ζ(α, θ). In practice, we must use statistical estimates of this error. Following standard practice, we use an empirical validation error to choose architectures, and an empirical training error to choose module parameters. This means that in practice, the cost of a goal edge (u, α) in our graph is ζ_val(α, argmin_θ ζ_train(α, θ)). One complication here is that our neural heuristics encode both the completions of an architecture and the parameters of these completions. Training a heuristic on either the training loss or the validation loss will introduce an additional error. Using standard generalization bounds, we can argue that for adequately large training and validation sets, this error is bounded (with probability arbitrarily close to 1) in either case, and that our heuristic is ε-admissible with high probability in spite of this error.
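The relaxation-as-heuristic idea of Section 3.1 can be illustrated with a deliberately tiny stand-in for the neural completion. In the sketch below (an illustrative assumption of ours, not the paper's implementation, which trains type-correct feedforward or recurrent networks), the "hole" in a partial architecture is filled by a one-parameter-per-weight linear model trained end-to-end by stochastic gradient descent, and its final loss plays the role of h(u):

```python
import random

def train_relaxation_loss(data, steps=2000, lr=0.05):
    """Fill a partial architecture's hole with a differentiable stand-in
    (here: a linear model w*x + b), train it end-to-end on the task loss,
    and return the final loss, which serves as the heuristic value h(u)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x   # gradient of 0.5 * err**2 w.r.t. w
        b -= lr * err       # gradient of 0.5 * err**2 w.r.t. b
    # Mean squared error of the trained relaxation over the dataset.
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Toy task: the unknown subexpression must realize y = 2x + 1, which the
# relaxation can represent exactly, so h(u) should approach zero.
random.seed(0)
data = [(k / 10.0, 2 * (k / 10.0) + 1.0) for k in range(-10, 11)]
h_u = train_relaxation_loss(data)
```

When the relaxation is at least as expressive as the best completion, the trained loss underestimates (up to the optimization gap ε) the error achievable by any concrete completion, which is exactly the ε-admissibility property used above.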
3.2 Integrating NEAR with Graph Search Algorithms

Algorithm 1: A* Search
  Input: Graph G with source u0
  S := {u0}; f(u0) := 0
  while S ≠ ∅ do
    v := argmin_{u ∈ S} f(u); S := S \ {v}
    if v is a goal node then
      return v, f(v)
    else
      foreach child u of v do
        compute g(u), h(u), f(u); S := S ∪ {u}

The NEAR approach can be used in conjunction with any heuristic search algorithm [36] over architectures. Specifically, we have integrated NEAR with two classic graph search algorithms: A* [31] (Algorithm 1) and an iteratively deepened depth-first search with branch-and-bound pruning (IDS-BB) (Appendix A). Both algorithms maintain a search frontier by computing an f-score for each node: f(u) = g(u) + h(u), where g(u) is the incurred path cost from the source node u0 to the current node u, and h(u) is a heuristic estimate of the cost-to-go from node u. Additionally, IDS-BB prunes nodes from the frontier that have a higher f-score than the minimum path cost to a goal node found so far.

ε-Optimality. An important property of a search algorithm is optimality: when multiple solutions exist, the algorithm finds an optimal solution. Both A* and IDS-BB are optimal given admissible heuristics. An argument by Harris [20] shows that under heuristics that are ε-admissible in our sense, the algorithms return solutions that are at most an additive constant ε away from the optimal solution. Let C* denote the optimal path cost in our graph G, and let h(u) be an ε-admissible heuristic (Eq. (6)). Suppose IDS-BB or A* returns a goal node αG that does not have the optimal path cost C*. Then there must exist a node uO on the frontier that lies along the optimal path and has yet to be expanded. This lets us establish an upper bound on the path cost of αG:

g(αG) = f(αG) ≤ f(uO) = g(uO) + h(uO) ≤ g(uO) + J(uO) + ε = C* + ε.   (7)

This line of reasoning can also be extended to the branch-and-bound component of the NEAR-guided IDS-BB algorithm.
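Both algorithms rank frontier nodes by f(u) = g(u) + h(u). A minimal, generic A* loop in the spirit of Algorithm 1 can be sketched as follows (the toy graph and function names are ours; in NEAR itself, evaluating h(u) additionally involves training a neural relaxation at each node):

```python
import heapq

def a_star(source, children, is_goal, h):
    """A* over a graph of (partial) architectures.
    children(u) yields (edge_cost, u') pairs; h(u) is the
    (epsilon-)admissible heuristic estimate of the cost-to-go."""
    frontier = [(h(source), 0.0, source)]  # entries are (f, g, node)
    best_g = {source: 0.0}
    while frontier:
        f, g, u = heapq.heappop(frontier)
        if is_goal(u):
            return u, g
        for cost, v in children(u):
            g2 = g + cost
            if g2 < best_g.get(v, float("inf")):
                best_g[v] = g2
                heapq.heappush(frontier, (g2 + h(v), g2, v))
    return None, float("inf")

# Toy search space with an admissible h: the optimal path is
# u0 -> b -> goal2 with total cost 4.
graph = {"u0": [(1.0, "a"), (3.0, "b")],
         "a": [(5.0, "goal1")],
         "b": [(1.0, "goal2")],
         "goal1": [], "goal2": []}
h_vals = {"u0": 2.0, "a": 4.0, "b": 1.0, "goal1": 0.0, "goal2": 0.0}
node, cost = a_star("u0", lambda u: graph[u],
                    lambda u: u.startswith("goal"), lambda u: h_vals[u])
```

IDS-BB differs mainly in that it explores depth-first and additionally discards any node whose f-score exceeds the best goal cost found so far, which is where the ε-bounded pruning argument below applies.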
Consider encountering a goal node during search that sets the branch-and-bound upper threshold to be a cost C. In the remainder of the search, some node up with an f-cost greater than C is pruned, and the optimal path from up to a goal node will not be searched. Assuming the heuristic function h is ε-admissible, we can lower-bound the optimal path cost through up, f*(up), by C − ε as follows:

f*(up) = g(up) + J(up) ≥ g(up) + h(up) − ε = f(up) − ε > C − ε.   (8)

Thus, the IDS-BB algorithm will find goal paths that are at worst an additive factor of ε more costly than any pruned goal path.

4 Experiments

4.1 Datasets for Sequence Classification

For all datasets below, we augment the base DSL in Figure 1 with domain-specific library functions that include 1) learned affine transformations over a subset of features, and 2) sliding-window feature-averaging functions. Full details, such as the structural cost functions used and any pre/post-processing, are provided in the appendix.

CRIM13. The CRIM13 dataset [5] contains trajectories for a pair of mice engaging in social behaviors, annotated for different actions per frame by behavior experts; we aim to learn programs for classifying actions at each frame for fixed-size trajectories. Each frame is represented by a 19-dimensional feature vector: 4 features capture the xy-positions of the mice, and the remaining 15 features are derived from the positions, such as velocities and the distance between the mice. We learn programs for two actions that can be identified from the tracking features: "sniff" and "other" ("other" is used when there is no behavior of interest occurring). We cut every 100 frames as a trajectory, and in total we have 12404 training, 3077 validation, and 2953 test trajectories.

Fly-vs.-Fly. We use the Aggression and Boy-meets-Boy datasets within the Fly-vs.-Fly environment, which tracks a pair of fruit flies and their actions as they interact in different contexts [14].
We aim to learn programs that classify trajectories as one of 7 possible actions displaying aggressive, threatening, and nonthreatening behaviors. The length of trajectories can range from 1 to over 10000 frames, but we segment the data into trajectories with a maximum length of 300 for computational efficiency. The average length of a trajectory in our training set is 42.06 frames. We have 5339 training, 594 validation, and 1048 test trajectories.

Basketball. We use a subset of the basketball dataset from [50] that tracks the movements of professional basketball players. Each trajectory is of length 25 and contains the xy-positions of 5 offensive players, 5 defensive players, and the ball (22 features per frame). We aim to learn programs that can predict which offensive player has the ball (the "ballhandler") or whether the ball is being passed. In total, we have 18,000 trajectories for training, 2801 for validation, and 2693 for test.

4.2 Overview of Baseline Program Learning Strategies

We compare our NEAR-guided graph search algorithms, A*-NEAR and IDS-BB-NEAR, with four baseline program learning strategies: 1) top-down enumeration, 2) Monte-Carlo sampling, 3) Monte-Carlo tree search, and 4) a genetic algorithm. We also compare the performance of these program learning algorithms with an RNN baseline (1-layer LSTM).

Top-down enumeration. We synthesize and evaluate complete programs in order of increasing complexity measured using the structural cost s(α). This strategy is widely employed in program learning contexts [44, 46, 45] and is provably complete. Since our graph G grows infinitely, our implementation is akin to breadth-first search up to a specified depth.

Monte-Carlo (MC) sampling. Starting from the source node u0, we sample complete programs by sampling rules (edges) with probabilities proportional to their structural costs s(r). The next node chosen along a path has the best average performance of samples that descended from that node.
We repeat the procedure until we reach a goal node and return the best program found among all samples.

Monte-Carlo tree search (MCTS). Starting from the source node u0, we traverse the graph until we reach a complete program using the UCT selection criterion [23], where the value of a node is inversely proportional to the cost of its children.² In the backpropagation step we update the value of all nodes along the path. After some iterations, we choose the next node in the path with the highest value. We repeat the procedure until we reach a goal node and return the best program found.

²MCTS with this node value definition will visit shallow programs more frequently than MC sampling.

Table 1: Mean accuracy, F1-score, and program depth of learned programs (3 trials). Programs found using our NEAR algorithms consistently achieve better F1-scores than the baselines and match the RNN's performance more closely. Our algorithms are also able to search for and find programs of much greater depth than the baselines. Experiment hyperparameters are included in the appendix.

              CRIM13-sniff        CRIM13-other        Fly-vs.-Fly         Bball-ballhandler
              Acc   F1    Depth   Acc   F1    Depth   Acc   F1    Depth   Acc   F1    Depth
Enum.         .851  .221  3       .707  .762  2       .819  .863  2       .844  .857  6.3
MC            .843  .281  7       .630  .715  1       .833  .852  4       .841  .853  6
MCTS          .745  .338  8.7     .666  .749  1       .817  .857  4.7     .711  .729  8
Genetic       .829  .181  1.7     .727  .768  3       .850  .868  6       .843  .853  6.7
IDS-BB-NEAR   .829  .446  6       .729  .768  1.3     .876  .892  4       .889  .903  8
A*-NEAR       .821  .369  6       .706  .764  2.7     .872  .885  4       .906  .918  8
RNN           .889  .481  -       .756  .785  -       .963  .964  -       .945  .950  -

Figure 5: Median minimum path cost to a goal node found at a given time, across 3 trials (for trials that terminate first, we extend the plots so the median remains monotonic). A*-NEAR (blue) and IDS-BB-NEAR (green) will often find a goal node with a smaller path cost, or find one of similar performance much faster.

Genetic algorithm. We follow the formulation in Valkov et al. [44].
In our genetic algorithm, crossover, selection, and mutation operations evolve a population of programs over a number of generations until a predetermined number of programs have been trained. The crossover and mutation operations only occur when the resulting program is guaranteed to be type-safe. For all baseline algorithms, as well as A*-NEAR and IDS-BB-NEAR, model parameters (θ) were learned with the training set, whereas program architectures (α) were evaluated using the performance on the validation set. Additionally, all baselines (including the NEAR algorithms) used F1-score error [38] as the evaluation objective by which programs were chosen. To account for class imbalances, F1-scoring is commonly used as an evaluation metric in behavioral classification domains, such as those considered in our work [14, 5]. Our full implementation is available in [39].

4.3 Experimental Results

Performance of learned programs. Table 1 shows the performance results on the test sets of our program learning algorithms, averaged over 3 seeds. The same structural cost function s(α) is used for all algorithms, but can vary across domains (see the appendix). Our NEAR-guided search algorithms consistently outperform the other baselines in F1-score, while accuracy is comparable (note that our error function ζ does not include accuracy). Furthermore, NEAR-guided search algorithms are capable of finding deeper and more complex programs that can offer non-trivial interpretations, such as the ones shown in Figures 2 and 3. Lastly, we verify that our learned programs are comparable with highly expressive RNNs, and see that there is at most a 10% drop in F1-score when using NEAR-guided search algorithms with our DSL.

(a) CRIM13-sniff (b) Bball-ballhandler

Figure 6: As we increase λ in Eq. (9), we observe that A*-NEAR will learn programs with decreasing program depth and also decreasing F1-score. This highlights that we can use λ to control the trade-off between structural cost and performance.
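The trade-off visualized in Figure 6 follows directly from the λ-weighted objective of Eq. (9): scaling the structural term shifts which candidate attains the minimum. A toy illustration (all numbers invented for illustration; `select_program` is our own name, not part of the NEAR implementation):

```python
def select_program(candidates, lam):
    """Pick the argmin of lam * structural_cost + error, mirroring Eq. (9)."""
    return min(candidates, key=lambda c: lam * c["cost"] + c["error"])

# Hypothetical candidates: a deep, accurate program versus a shallow,
# less accurate one (structural costs and errors are made up).
deep = {"name": "deep", "cost": 8.0, "error": 0.05}
shallow = {"name": "shallow", "cost": 2.0, "error": 0.50}

low = select_program([deep, shallow], lam=0.01)["name"]   # error dominates
high = select_program([deep, shallow], lam=1.0)["name"]   # structure dominates
```

With a small λ the deep, accurate program wins (0.13 vs. 0.52), while at λ = 1 the shallow program wins (2.5 vs. 8.05), matching the qualitative trend in Figure 6.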
mapprefix(SlidingWindowAverage(PositionAffine(x)))

Figure 7: Synthesized depth-2 program classifying a sniff action between two mice in the CRIM13 dataset. The sliding window average is over the last 10 frames. The program achieves an F1 score of 0.22 (vs. 0.48 for the RNN baseline). This program is synthesized using λ = 8.

Efficiency of NEAR-guided graph search. Figure 5 tracks the progress of each program learning algorithm during search by following the median best path cost (Eq. (1)) at a given time across 3 independent trials. For times where only 2 trials are active (i.e., one trial had already terminated), we report the average. Algorithms for each domain were run on the same machine to ensure consistency, and each non-NEAR baseline was set up to have at least as much time as our NEAR-guided algorithms for its search procedure (see the appendix). We observe that NEAR-guided search algorithms are able to find low-cost solutions more efficiently than existing baselines, while maintaining an overall shorter running time.

Cost-performance trade-off. We can also consider a modification of our objective in Eq. (1) that allows us to use a hyperparameter λ to control the trade-off between structural cost (a proxy for interpretability) and performance:

(α*, θ*) = argmin_{(α,θ)} (λ · s(α) + ζ(α, θ)).   (9)

To visualize this trade-off, we run A*-NEAR with the modified objective in Eq. (9) for various values of λ. Note that λ = 1 is equivalent to our experiments in Table 1. Figure 6 shows that for the Basketball and CRIM13 datasets, as we increase λ, which puts more weight on the structural cost, the resulting programs found by A*-NEAR search have decreasing F1-scores but are also more shallow. This confirms our expectation that we can control the trade-off between structural cost and performance, which allows users of NEAR-guided search algorithms to adjust to their preferences.
Unlike the other two experimental domains, the most performant programs learned in Fly-vs.-Fly were relatively shallow, so we omitted this domain as the trade-off showed little change in program depth. We illustrate the implications of this trade-off on interpretability using the depth-2 program in Figure 7 and the depth-8 program in Figure 8, both synthesized for the same task of detecting a sniff action in the CRIM13 dataset. The depth-2 program says that a sniff occurs if the intruder mouse is close to the right side of the cage and both mice are near the bottom of the cage, and can be seen to apply a position bias (regarding the location of the action) on the action. This program is simple, due to the large weight on the structural cost, and has a low F1-score. In contrast, the deeper program in Figure 8 has performance comparable to an RNN but is more difficult to interpret. Our interpretation of this program is that it evaluates the likelihood of sniff by applying a position bias, then using the velocity of the mice if the mice are close together and not moving fast, and using the distance between the mice otherwise.

map(add(PositionAffine(x), if add(VelocityAffine(x), DistAffine(x)) then VelocityAffine(x) else DistAffine(x)))

Figure 8: Synthesized depth-8 program classifying a sniff action between two mice in the CRIM13 dataset. The program achieves an F1 score of 0.46 (vs. 0.48 for the RNN baseline). This program is synthesized using λ = 1.

5 Related Work

Neural Program Induction. The literature on neural program induction (NPI) [19, 34, 25, 37] develops methods to learn neural networks that can perform procedural (program-like) tasks, typically using architectures augmented with differentiable memory. Our approach differs from these methods in that its final output is a symbolic program. However, since our heuristics are neural approximations of programs, our work can be seen as repeatedly performing NPI as the program is being produced.
While we have so far used classical feedforward and recurrent architectures to implement our neural heuristics, future work could use richer models from the NPI literature to this end.

DSL-based Program Synthesis. There is a large body of research on the synthesis of programs from DSLs. In many of these methods, the goal is not learning but finding a program that satisfies a hard constraint [1, 41, 32, 15]. However, there is also a growing literature on learning programs from (noisy) data [26, 13, 46, 11, 44, 45]. Of these methods, TERPRET [18] and NEURAL TERPRET [17] allow gradient descent as a mechanism for learning program parameters. However, unlike NEAR, these approaches do not allow a general search over the program architectures permitted by a DSL, and require a detailed hand-written template of the program for even the simplest tasks. While the HOUDINI framework [44] combines gradient-based parameter learning with search over program architectures, this search is not learning-accelerated and uses a simple type-directed enumeration. As reported in our experiments, NEAR outperforms this enumeration-based approach. Many recent methods for program synthesis use statistical models to guide the search over program architectures [3, 7, 12, 8, 11, 30, 16, 29]. In particular, Lee et al. [27] use a probabilistic model to guide an A* search over programs. Most of these models (including the one in Lee et al. [27]) are trained using corpora of synthesis problems and corresponding solutions, which are not available in our setting. There is a category of methods based on reinforcement learning (RL) [16, 4]. Unlike NEAR, these methods do not directly exploit the structure of the search space. Combining them with our approach would be an interesting topic of future work.

Structure Search using Relaxations. Our search problem bears similarities with the problems of searching over neural architectures and over the structure of graphical models.
Prior work has used relaxations to solve these structure-search problems [28, 40, 51, 33, 48]. Specifically, the A* lasso approach for learning sparse Bayesian networks [48] uses a dense network to construct admissible heuristics, and DARTS computes a differentiable relaxation of neural architecture search [28, 40]. The key difference between these efforts and ours is that the design space in our problem is much richer, making the methods in prior work difficult to apply. In particular, DARTS uses a composition of softmaxes over all possible candidate operations between a fixed set of nodes that constitute a neural architecture, and the heuristics in the A* lasso method come from a single, simple function class. In our setting, by contrast, there is no fixed bound on the number of expressions in a program, different sets of operations can be available at different points of synthesis, and the input and output types of the heuristic (and therefore its architecture) can vary based on the part of the program derived so far.

6 Conclusions

We have presented a novel graph search approach to learning differentiable programs. Our method leverages a novel construction of an admissible heuristic, based on neural relaxations, to efficiently search over program architectures. Our experiments show that programs learned using our approach can achieve competitive performance, and that our search-based learning procedure substantially outperforms conventional program learning approaches. There are many directions for future work. One direction is to extend the approach to richer DSLs and neural heuristic architectures, for example, those suited to reinforcement learning [45] and generative modeling [35]. Another is to combine NEAR with classical program synthesis methods based on symbolic reasoning. A third is to integrate NEAR into more complex program search problems, e.g., when there is an initial program as a starting point and the goal is to search for refinements.
A fourth is to more tightly integrate NEAR with real-world applications, so as to evaluate how the interpretability of learned programs affects downstream tasks.

Broader Impact

Programmatic models described using high-level DSLs are a powerful mechanism for summarizing automatically discovered knowledge in a human-interpretable way. Specifically, such models are more interpretable than state-of-the-art neural models while also tending to provide higher performance than shallower linear or decision-tree models. Programmatic models also allow for the natural incorporation of inductive bias and let the user influence the semantic meaning of learned programs. For these reasons, efforts on program learning, such as ours, can lead to more widespread use of machine learning in fields such as healthcare, autonomous driving, and the natural sciences, where safety and accountability are critical and where there is human-held prior knowledge (such as the laws of nature) that can usefully direct the learning process. The flip side is that the bias introduced in program learning can just as easily be exploited by users who desire specific outcomes from the learner. Ultimately, users of program learning methods must ensure that any incorporated inductive bias will not lead to unfair or misleading programs.

Funding Acknowledgment

This work was supported in part by NSF Award #CCF-1918651, NSF Award #CCF-1704883, DARPA PAI, Raytheon, a Rice University Graduate Research Fellowship (for Shah), a JP Morgan Chase Fellowship (for Verma), and NSERC Award #PGSD3-532647-2019 (for Sun).

References

[1] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In Dependable Software Systems Engineering, pages 1–25. 2015.

[2] A. Bagchi and A. Mahanti.
Admissible heuristic search in AND/OR graphs. Theoretical Computer Science, 24(2):207–219, 1983.

[3] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017.

[4] Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.

[5] Xavier P. Burgos-Artizzu, Piotr Dollár, Dayu Lin, David J. Anderson, and Pietro Perona. Social behavior recognition in continuous video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1322–1329. IEEE, 2012.

[6] Eugene Charniak and Saadia Husain. A new admissible heuristic for minimal-cost proofs. Brown University, Department of Computer Science, 1991.

[7] Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.

[8] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017.

[9] Thomas G. Dietterich. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, Ontario, Canada, August 6–9, 2002, Proceedings, pages 15–30, 2002.

[10] Kevin Ellis, Maxwell I. Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. Write, execute, assess: Program synthesis with a REPL. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B.
Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pages 9165–9174, 2019.

[11] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pages 6059–6068, 2018.

[12] Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Sampling for Bayesian program learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 1289–1297, 2016.

[13] Kevin Ellis, Armando Solar-Lezama, and Joshua B. Tenenbaum. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 973–981, 2015.

[14] Eyrun Eyjolfsdottir, Steve Branson, Xavier P. Burgos-Artizzu, Eric D. Hoopfer, Jonathan Schor, David J. Anderson, and Pietro Perona. Detecting social actions of fruit flies. In European Conference on Computer Vision, pages 772–787. Springer, 2014.

[15] John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15–17, 2015, pages 229–239, 2015.

[16] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. CoRR, abs/1804.01118, 2018.

[17] Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differentiable programs with neural libraries.
In International Conference on Machine Learning, pages 1213–1222, 2017.

[18] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.

[19] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[20] Larry R. Harris. The heuristic search under conditions of error. Artificial Intelligence, 5(3):217–234, 1974.

[21] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation, 3rd Edition. Pearson international edition. Addison-Wesley, 2007.

[22] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[23] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

[24] Richard E. Korf. Recent progress in the design and analysis of admissible heuristic functions. In International Symposium on Abstraction, Reformulation, and Approximation, pages 45–55. Springer, 2000.

[25] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.

[26] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[27] Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018, pages 436–449, 2018.

[28] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.

[29] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional program generation. In ICLR, 2018.

[30] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.

[31] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.

[32] Oleksandr Polozov and Sumit Gulwani. FlashMeta: A framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25–30, 2015, pages 107–126, 2015.

[33] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 – February 1, 2019, pages 4780–4789. AAAI Press, 2019.

[34] Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

[35] Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah Goodman. Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks. In Advances in Neural Information Processing Systems, pages 622–630, 2016.

[36] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2002.

[37] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. Meta-learning with memory-augmented neural networks.
In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1842–1850. JMLR.org, 2016.

[38] Yutaka Sasaki. The truth of the F-measure. Teach Tutor Master, 2007.

[39] Ameesh Shah, Eric Zhan, and Jennifer Sun. NEAR code repository. https://github.com/trishullab/near, 2020.

[40] Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018.

[41] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.

[42] David Sontag, Talya Meltzer, Amir Globerson, Tommi S. Jaakkola, and Yair Weiss. Tightening LP relaxations for MAP using message passing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[43] Richard Anthony Valenzano, Shahab Jabbari Arfaee, Jordan Thayer, Roni Stern, and Nathan R. Sturtevant. Using alternative suboptimality bounds in heuristic search. In Twenty-Third International Conference on Automated Planning and Scheduling, 2013.

[44] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles A. Sutton, and Swarat Chaudhuri. HOUDINI: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 8701–8712, 2018.

[45] Abhinav Verma, Hoang Minh Le, Yisong Yue, and Swarat Chaudhuri. Imitation-projected programmatic reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[46] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pages 5052–5061, 2018.

[47] Glynn Winskel. The Formal Semantics of Programming Languages: An Introduction. MIT Press.

[48] Jing Xiang and Seyoung Kim. A* lasso for learning a sparse Bayesian network structure for continuous variables. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 2418–2426, 2013.

[49] Halley Young, Osbert Bastani, and Mayur Naik. Learning neurosymbolic generative models via program synthesis. In International Conference on Machine Learning (ICML), 2019.

[50] Yisong Yue, Patrick Lucey, Peter Carr, Alina Bialkowski, and Iain Matthews. Learning fine-grained spatial models for dynamic sports play prediction. In 2014 IEEE International Conference on Data Mining, pages 670–679. IEEE, 2014.

[51] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017.