Learning Differentiable Programs with Admissible Neural Heuristics

Ameesh Shah (UC Berkeley) ameesh@berkeley.edu, Eric Zhan (Caltech) ezhan@caltech.edu, Jennifer J. Sun (Caltech) jjsun@caltech.edu, Abhinav Verma (UT Austin) verma@utexas.edu, Yisong Yue (Caltech) yyue@caltech.edu, Swarat Chaudhuri (UT Austin) swarat@cs.utexas.edu

Abstract

We study the problem of learning differentiable functions expressed as programs in a domain-specific language. Such programmatic models can offer benefits such as composability and interpretability; however, learning them requires optimizing over a combinatorial space of program architectures. We frame this optimization problem as a search in a weighted graph whose paths encode top-down derivations of program syntax. Our key innovation is to view various classes of neural networks as continuous relaxations over the space of programs, which can then be used to complete any partial program. This relaxed program is differentiable and can be trained end-to-end, and the resulting training loss is an approximately admissible heuristic that can guide the combinatorial search. We instantiate our approach on top of the A* algorithm and an iteratively deepened branch-and-bound search, and use these algorithms to learn programmatic classifiers in three sequence classification tasks. Our experiments show that the algorithms outperform state-of-the-art methods for program learning, and that they discover programmatic classifiers that yield natural interpretations and achieve competitive accuracy.

1 Introduction

An emerging body of work advocates program synthesis as an approach to machine learning. The methods here learn functions represented as programs in symbolic, domain-specific languages (DSLs) [12, 11, 49, 44, 46, 45].
Such symbolic models have a number of appeals: they can be more interpretable than neural models, they use the inductive bias embodied in the DSL to learn reliably, and they use compositional language primitives to transfer knowledge across tasks. In this paper, we study how to learn differentiable programs, which use structured, symbolic primitives to compose a set of parameterized, differentiable modules. Differentiable programs have recently attracted much interest due to their ability to leverage the complementary advantages of programming language abstractions and differentiable learning. For example, recent work has used such programs to compactly describe modular neural networks that operate over rich, recursive data types [44]. To learn a differentiable program, one needs to induce the program's architecture while simultaneously optimizing the parameters of the program's modules. This co-design task is difficult because the space of architectures is combinatorial and explodes rapidly. Prior work has approached this challenge using methods such as greedy enumeration, Monte Carlo sampling, Monte Carlo tree search, and evolutionary algorithms [46, 44, 10]. However, such approaches can often be expensive, due to not fully exploiting the structure of the underlying combinatorial search problem. In this paper, we show that the differentiability of programs opens up a new line of attack on this search problem. A standard strategy for combinatorial optimization is to exploit (ideally fairly tight) continuous relaxations of the search space [31, 6, 48, 42, 24, 2, 45]. Optimization in the relaxed space is typically easier and can efficiently guide search algorithms towards good or optimal solutions. In the case of program learning, we propose to use various classes of neural networks as relaxations of partial programs.

*Equal contribution.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
We frame our problem as searching a graph, in which nodes encode program architectures with missing expressions, and paths encode top-down program derivations. For each partial architecture u encountered during this search, the relaxation amounts to substituting the unknown part of u with a neural network with free parameters. Because programs are differentiable, this network can be trained on the problem's end-to-end loss. If the space of neural networks is an (approximate) proper relaxation of the space of programs (and training identifies a near-optimum neural network), then the training loss for the relaxation can be viewed as an (approximately) admissible heuristic. We instantiate our approach, called NEAR (abbreviation for Neural Admissible Relaxation), on top of two informed search algorithms: A* and an iteratively deepened depth-first search that uses a heuristic to direct branching as well as branch-and-bound pruning (IDS-BB). We evaluate the algorithms on the task of learning programmatic classifiers in three behavior classification applications. We show that the algorithms substantially outperform state-of-the-art methods for program learning, and can learn classifier programs that bear natural interpretations and are close to neural models in accuracy. To summarize, the paper makes three contributions. First, we identify a tool, heuristics obtained by training neural relaxations of programs, for accelerating combinatorial searches over differentiable programs. So far as we know, this is the first approach to exploit the differentiability of a programming language in program synthesis. Second, we instantiate this idea using two classic search algorithms. Third, we present promising experimental results in three sequence classification applications.

2 Problem Formulation

We view a program in our domain-specific language (DSL) as a pair (α, θ), where α is a discrete (program) architecture and θ is a vector of real-valued parameters.
The architecture α is generated using a context-free grammar [21]. The grammar consists of a set of rules X → σ1 . . . σk, where X is a nonterminal and σ1, . . . , σk are either nonterminals or terminals. A nonterminal stands for a missing subexpression; a terminal is a symbol that can actually appear in a program's code. The grammar starts with an initial nonterminal, then iteratively applies the rules to produce a series of partial architectures: sentences made from one or more nonterminals and zero or more terminals. The process continues until there are no nonterminals left, i.e., we have a complete architecture. The semantics of the architecture α is given by a function [[α]](x, θ), defined by rules that are fixed for the DSL. We require this function to be differentiable in θ. Also, we define a structural cost for architectures. Let each rule r in the DSL grammar have a non-negative real cost s(r). The structural cost of α is s(α) = Σ_{r ∈ R(α)} s(r), where R(α) is the multiset of rules used to create α. Intuitively, architectures with lower structural cost are simpler and more human-interpretable. To define our learning problem, we assume an unknown distribution D(x, y) over inputs x and labels y, and consider the prediction error function ζ(α, θ) = E_{(x,y)∼D}[1([[α]](x, θ) ≠ y)], where 1 is the indicator function. Our goal is to find an architecturally simple program with low prediction error, i.e., to solve the optimization problem:

(α*, θ*) = argmin_{(α,θ)} (s(α) + ζ(α, θ)).   (1)

Program Learning for Sequence Classification. Program learning is applicable in many settings; we specifically study it in the sequence classification context [9]. Now we sketch our DSL for this domain. Like many other DSLs for program synthesis [15, 3, 44], our DSL is purely functional. The language has the following characteristics: Programs in the DSL operate over two data types: real vectors and sequences of real vectors.
We assume a simple type system that makes sure that these types are used consistently.

α ::= x | c | ⊕(α1, . . . , αk) | ⊕θ(α1, . . . , αk) | if α1 then α2 else α3 | sel_S x | map (λx1. α1) x | fold (λx1. α1) c x | mapprefix (λx1. α1) x

Figure 1: Grammar of DSL for sequence classification. Here, x, c, ⊕, and ⊕θ represent inputs, constants, basic algebraic operations, and parameterized library functions, respectively. sel_S returns a vector consisting of a subset S of the dimensions of an input x.

map(if DistAffine_[.0217];−.2785(x) then AccAffine_[−.0007,.0055,.0051,−.0025];3.7426(x) else DistAffine_[−.2143];1.822(x))

Figure 2: Synthesized program classifying a sniff action between two mice in the CRIM13 dataset. DistAffine and AccAffine are functions that first select the parts of the input that represent distance and acceleration measurements, respectively, and then apply affine transformations to the resulting vectors. In the parameters (subscripts) of these functions, the brackets contain the weight vectors for the affine transformation, and the succeeding values are the biases. The program achieves an accuracy of 0.87 (vs. 0.89 for an RNN baseline) and can be interpreted as follows: if the distance between two mice is small, they are doing a sniff (large bias in the else clause). Otherwise, they are doing a sniff if the difference between their accelerations is small.

Programs use a set of fixed algebraic operations ⊕ as well as a library of differentiable, parameterized functions ⊕θ. Because we are motivated by interpretability, the library used in our current implementation only contains affine transformations. In principle, it could be extended to include other kinds of functions as well. Programs use a set of higher-order combinators to recurse over sequences. In particular, we allow the standard map and fold combinators. To compactly express sequence-to-sequence functions, we also allow a special mapprefix combinator. Let g be a function that maps sequences to vectors.
For a sequence x, mapprefix(g, x) equals the sequence ⟨g(x[1:1]), g(x[1:2]), . . . , g(x[1:n])⟩, where x[1:i] is the i-th prefix of x. Programs can use a conditional branching construct. However, to avoid discontinuities, we interpret this construct in terms of a smooth approximation:

[[if α1 > 0 then α2 else α3]](x, (θ1, θ2, θ3)) = σ(β⁻¹ · [[α1]](x, θ1)) · [[α2]](x, θ2) + (1 − σ(β⁻¹ · [[α1]](x, θ1))) · [[α3]](x, θ3).   (2)

Here, σ is the sigmoid function and β is a temperature hyperparameter. As β → 0, this approximation approaches the usual if-then-else construct. Figure 1 summarizes our DSL in the standard Backus-Naur form [47]. Figures 2 and 3 show two programs synthesized by our learning procedure using our DSL with libraries of domain-specific affine transformations (see the supplementary material). Both programs offer an interpretation in their respective domains, while offering respectable performance against an RNN baseline.

3 Program Learning using NEAR

We formulate our program learning problem as a form of graph search. The search derives program architectures top-down: it begins with the empty architecture, generates a series of partial architectures following the DSL grammar, and terminates when a complete architecture is derived. In more detail, we imagine a graph G in which: The node set consists of all partial and complete architectures permissible in the DSL. The source node u0 is the empty architecture. Each complete architecture is a goal node. Edges are directed and capture single-step applications of rules of the DSL. Edges can be divided into: (i) internal edges (u, u′) between partial architectures u and u′, and (ii) goal edges (u, α) between a partial architecture u and a complete architecture α. An internal edge (u, u′) exists if one can obtain u′ by substituting a nonterminal in u following a rule of the DSL. A goal edge (u, α) exists if we can complete u into α by applying a rule of the DSL.
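The mapprefix combinator and the smoothed conditional of Eq. (2) can both be stated concretely. The following is a minimal Python sketch; the names `smooth_if` and `mapprefix` are ours, written for illustration, and are not taken from the NEAR implementation:

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def smooth_if(cond, then_val, else_val, beta=0.05):
    """Smooth approximation of `if cond > 0 then a else b`, as in Eq. (2).
    The gate sigma(cond / beta) saturates as beta -> 0, recovering the hard
    branch while remaining differentiable for any beta > 0."""
    gate = sigmoid(cond / beta)
    return gate * then_val + (1.0 - gate) * else_val

def mapprefix(g, xs):
    """mapprefix(g, x) = <g(x[1:1]), g(x[1:2]), ..., g(x[1:n])>,
    where x[1:i] is the i-th prefix of x."""
    return [g(xs[:i + 1]) for i in range(len(xs))]
```

For example, `mapprefix(sum, [1, 2, 3])` yields the running sums `[1, 3, 6]`, and shrinking `beta` hardens the gate of `smooth_if` toward a discrete branch.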
map(multiply(add(OffenseAffine(x), BallAffine(x)), add(OffenseAffine(x), BallAffine(x))))

Figure 3: Synthesized program classifying the ballhandler for basketball. OffenseAffine() and BallAffine() are parameterized affine transformations over the XY-coordinates of the offensive players and the ball (see the appendix for full parameters). multiply and add are computed element-wise. The program structure can be interpreted as computing the Euclidean norm/distance between the offensive players and the ball, and suggests that this quantity can be important for determining the ballhandler. On a set of learned parameters (not shown), this program achieves an accuracy of 0.905 (vs. 0.945 for an RNN baseline).

The cost of an internal edge (u, u′) is given by the structural cost s(r), where r is the rule used to construct u′ from u. The cost of a goal edge (u, α) is s(r) + ζ(α, θ*), where θ* = argmin_θ ζ(α, θ) and r is the rule used to construct α from u. A path in the graph G is defined as usual, as a sequence of nodes u1, . . . , uk such that there is an edge (ui, ui+1) for each i ∈ {1, . . . , k − 1}. The cost of a path is the sum of the costs of these edges. Our goal is to discover a least-cost path from the source u0 to some goal node α. Then, by construction of our edge costs, α is an optimal solution to our learning problem in Eq. (1).

3.1 Neural Relaxations as Admissible Heuristics

Figure 4: An example of program learning formulated as graph search. Structural costs are in red, heuristic values in black, prediction errors in blue, O refers to a nonterminal in a partial architecture, and the path to a goal node returned by A*-NEAR search is in teal.

The main challenge in our search problem is that our goal edges contain rich cost information, but this information is only accessible when a path has been explored until the end. A heuristic function h(u) that can predict the value of choices made at nodes u encountered early in the search can help with this difficulty.
If such a heuristic is admissible, i.e., underestimates the cost-to-go, it enables the use of informed search strategies such as A* and branch-and-bound while guaranteeing optimal solutions. Our NEAR approach (abbreviation for Neural Admissible Relaxation) uses neural approximations of spaces of programs to construct a heuristic that is ε-close to being admissible. Let a completion of a partial architecture u be a (complete) architecture u[α1, . . . , αk] obtained by replacing the nonterminals in u by suitably typed architectures αi. Let θu be the parameters of u and θ be the parameters of the αi-s. The cost-to-go at u is given by:

J(u) = min_{α1,...,αk, θu, θ} ((s(u[α1, . . . , αk]) − s(u)) + ζ(u[α1, . . . , αk], (θu, θ)))   (3)

where the structural cost s(u) is the sum of the costs of the grammatical rules used to construct u. To compute a heuristic cost h(u) for a partial architecture u encountered during search, we substitute the nonterminals in u with neural networks parameterized by ω. These networks are type-correct; for example, if a nonterminal is supposed to generate subexpressions whose inputs are sequences, then the neural network used in its place is recurrent. We show an example of NEAR used in the program-learning graph search formulation in Figure 4. We view the neurosymbolic programs resulting from this substitution as tuples (u, (θu, ω)). We define a semantics for such programs by extending our DSL's semantics, and lift the function ζ to assign costs ζ(u, (θu, ω)) to such programs. The heuristic cost for u is now given by:

h(u) = min_{ω, θu} ζ(u, (θu, ω)).   (4)

As ζ(u, (θu, ω)) is differentiable in ω and θu, we can compute h(u) using gradient descent.

ε-Admissibility. In practice, the neural networks that we use may only form an approximate relaxation of the space of completions and parameters of architectures; also, the training of these networks may not reach global optima. To account for these errors, we consider an approximate notion of admissibility.
Many such notions have been considered in the past [20, 31, 43]; here, we follow a definition used by Harris [20]. For a fixed constant ε > 0, let an ε-admissible heuristic be a function h_ε(u) over architectures such that h_ε(u) ≤ J(u) + ε for all u. Now consider any completion u[α1, . . . , αk] of an architecture u. As neural networks with adequate capacity are universal function approximators, there exist parameters ω* for our neurosymbolic program such that for all u, α1, . . . , αk, θu, and θ:

ζ(u, (θu, ω*)) ≤ ζ(u[α1, . . . , αk], (θu, θ)) + ε.   (5)

Because edges in our search graph have non-negative costs, s(u) ≤ s(u[α1, . . . , αk]), implying:

h(u) ≤ min_{α1,...,αk, θu, θ} ζ(u[α1, . . . , αk], (θu, θ)) + ε
     ≤ min_{α1,...,αk, θu, θ} (ζ(u[α1, . . . , αk], (θu, θ)) + (s(u[α1, . . . , αk]) − s(u))) + ε
     = J(u) + ε.   (6)

In other words, h(u) is ε-admissible.

Empirical Considerations. We have formulated our learning problem in terms of the true prediction error ζ(α, θ). In practice, we must use statistical estimates of this error. Following standard practice, we use an empirical validation error to choose architectures, and an empirical training error to choose module parameters. This means that in practice, the cost of a goal edge (u, α) in our graph is ζ_val(α, argmin_θ ζ_train(α, θ)). One complication here is that our neural heuristics encode both the completions of an architecture and the parameters of these completions. Training a heuristic on either the training loss or the validation loss will introduce an additional error. Using standard generalization bounds, we can argue that for adequately large training and validation sets, this error is bounded (with probability arbitrarily close to 1) in either case, and that our heuristic is ε-admissible with high probability in spite of this error.
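The relaxation-as-heuristic idea of Section 3.1 can be illustrated with a deliberately tiny stand-in for the neural completion. In the sketch below (an illustrative assumption of ours, not the paper's implementation, which trains type-correct feedforward or recurrent networks), the "hole" in a partial architecture is filled by a one-parameter-per-weight linear model trained end-to-end by stochastic gradient descent, and its final loss plays the role of h(u):

```python
import random

def train_relaxation_loss(data, steps=2000, lr=0.05):
    """Fill a partial architecture's hole with a differentiable stand-in
    (here: a linear model w*x + b), train it end-to-end on the task loss,
    and return the final loss, which serves as the heuristic value h(u)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x   # gradient of 0.5 * err**2 w.r.t. w
        b -= lr * err       # gradient of 0.5 * err**2 w.r.t. b
    # Mean squared error of the trained relaxation over the dataset.
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Toy task: the unknown subexpression must realize y = 2x + 1, which the
# relaxation can represent exactly, so h(u) should approach zero.
random.seed(0)
data = [(k / 10.0, 2 * (k / 10.0) + 1.0) for k in range(-10, 11)]
h_u = train_relaxation_loss(data)
```

When the relaxation is at least as expressive as the best completion, the trained loss underestimates (up to the optimization gap ε) the error achievable by any concrete completion, which is exactly the ε-admissibility property used above.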
3.2 Integrating NEAR with Graph Search Algorithms

Algorithm 1: A* Search
  Input: Graph G with source u0
  S := {u0}; f(u0) := 0
  while S ≠ ∅ do
    v := argmin_{u ∈ S} f(u); S := S \ {v}
    if v is a goal node then
      return v, f(v)
    else
      foreach child u of v do
        compute g(u), h(u), f(u); S := S ∪ {u}

The NEAR approach can be used in conjunction with any heuristic search algorithm [36] over architectures. Specifically, we have integrated NEAR with two classic graph search algorithms: A* [31] (Algorithm 1) and an iteratively deepened depth-first search with branch-and-bound pruning (IDS-BB) (Appendix A). Both algorithms maintain a search frontier by computing an f-score for each node: f(u) = g(u) + h(u), where g(u) is the incurred path cost from the source node u0 to the current node u, and h(u) is a heuristic estimate of the cost-to-go from node u. Additionally, IDS-BB prunes nodes from the frontier that have a higher f-score than the minimum path cost to a goal node found so far.

ε-Optimality. An important property of a search algorithm is optimality: when multiple solutions exist, the algorithm finds an optimal solution. Both A* and IDS-BB are optimal given admissible heuristics. An argument by Harris [20] shows that under heuristics that are ε-admissible in our sense, the algorithms return solutions that are at most an additive constant ε away from the optimal solution. Let C* denote the optimal path cost in our graph G, and let h(u) be an ε-admissible heuristic (Eq. (6)). Suppose IDS-BB or A* returns a goal node αG that does not have the optimal path cost C*. Then there must exist a node uO on the frontier that lies along the optimal path and has yet to be expanded. This lets us establish an upper bound on the path cost of αG:

g(αG) = f(αG) ≤ f(uO) = g(uO) + h(uO) ≤ g(uO) + J(uO) + ε = C* + ε.   (7)

This line of reasoning can also be extended to the branch-and-bound component of the NEAR-guided IDS-BB algorithm.
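Both algorithms rank frontier nodes by f(u) = g(u) + h(u). A minimal, generic A* loop in the spirit of Algorithm 1 can be sketched as follows (the toy graph and function names are ours; in NEAR itself, evaluating h(u) additionally involves training a neural relaxation at each node):

```python
import heapq

def a_star(source, children, is_goal, h):
    """A* over a graph of (partial) architectures.
    children(u) yields (edge_cost, u') pairs; h(u) is the
    (epsilon-)admissible heuristic estimate of the cost-to-go."""
    frontier = [(h(source), 0.0, source)]  # entries are (f, g, node)
    best_g = {source: 0.0}
    while frontier:
        f, g, u = heapq.heappop(frontier)
        if is_goal(u):
            return u, g
        for cost, v in children(u):
            g2 = g + cost
            if g2 < best_g.get(v, float("inf")):
                best_g[v] = g2
                heapq.heappush(frontier, (g2 + h(v), g2, v))
    return None, float("inf")

# Toy search space with an admissible h: the optimal path is
# u0 -> b -> goal2 with total cost 4.
graph = {"u0": [(1.0, "a"), (3.0, "b")],
         "a": [(5.0, "goal1")],
         "b": [(1.0, "goal2")],
         "goal1": [], "goal2": []}
h_vals = {"u0": 2.0, "a": 4.0, "b": 1.0, "goal1": 0.0, "goal2": 0.0}
node, cost = a_star("u0", lambda u: graph[u],
                    lambda u: u.startswith("goal"), lambda u: h_vals[u])
```

IDS-BB differs mainly in that it explores depth-first and additionally discards any node whose f-score exceeds the best goal cost found so far, which is where the ε-bounded pruning argument below applies.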
Consider encountering a goal node during search that sets the branch-and-bound upper threshold to be a cost C. In the remainder of the search, some node up with an f-cost greater than C is pruned, and the optimal path from up to a goal node will not be searched. Assuming the heuristic function h is ε-admissible, we can lower-bound the optimal path cost through up, f*(up), by C − ε as follows:

f*(up) = g(up) + J(up) ≥ g(up) + h(up) − ε = f(up) − ε > C − ε.   (8)

Thus, the IDS-BB algorithm will find goal paths that are at worst an additive factor of ε more costly than any pruned goal path.

4 Experiments

4.1 Datasets for Sequence Classification

For all datasets below, we augment the base DSL in Figure 1 with domain-specific library functions that include 1) learned affine transformations over a subset of features, and 2) sliding-window feature-averaging functions. Full details, such as the structural cost functions used and any pre/post-processing, are provided in the appendix.

CRIM13. The CRIM13 dataset [5] contains trajectories for a pair of mice engaging in social behaviors, annotated for different actions per frame by behavior experts; we aim to learn programs for classifying actions at each frame for fixed-size trajectories. Each frame is represented by a 19-dimensional feature vector: 4 features capture the xy-positions of the mice, and the remaining 15 features are derived from the positions, such as velocities and the distance between the mice. We learn programs for two actions that can be identified from the tracking features: "sniff" and "other" ("other" is used when there is no behavior of interest occurring). We cut every 100 frames as a trajectory, and in total we have 12404 training, 3077 validation, and 2953 test trajectories.

Fly-vs.-Fly. We use the Aggression and Boy-meets-Boy datasets within the Fly-vs.-Fly environment, which tracks a pair of fruit flies and their actions as they interact in different contexts [14].
We aim to learn programs that classify trajectories as one of 7 possible actions displaying aggressive, threatening, and nonthreatening behaviors. The length of trajectories can range from 1 to over 10000 frames, but we segment the data into trajectories with a maximum length of 300 for computational efficiency. The average length of a trajectory in our training set is 42.06 frames. We have 5339 training, 594 validation, and 1048 test trajectories.

Basketball. We use a subset of the basketball dataset from [50] that tracks the movements of professional basketball players. Each trajectory is of length 25 and contains the xy-positions of 5 offensive players, 5 defensive players, and the ball (22 features per frame). We aim to learn programs that can predict which offensive player has the ball (the "ballhandler") or whether the ball is being passed. In total, we have 18,000 trajectories for training, 2801 for validation, and 2693 for test.

4.2 Overview of Baseline Program Learning Strategies

We compare our NEAR-guided graph search algorithms, A*-NEAR and IDS-BB-NEAR, with four baseline program learning strategies: 1) top-down enumeration, 2) Monte-Carlo sampling, 3) Monte-Carlo tree search, and 4) a genetic algorithm. We also compare the performance of these program learning algorithms with an RNN baseline (1-layer LSTM).

Top-down enumeration. We synthesize and evaluate complete programs in order of increasing complexity measured using the structural cost s(α). This strategy is widely employed in program learning contexts [44, 46, 45] and is provably complete. Since our graph G grows infinitely, our implementation is akin to breadth-first search up to a specified depth.

Monte-Carlo (MC) sampling. Starting from the source node u0, we sample complete programs by sampling rules (edges) with probabilities proportional to their structural costs s(r). The next node chosen along a path has the best average performance of samples that descended from that node.
We repeat the procedure until we reach a goal node and return the best program found among all samples.

Monte-Carlo tree search (MCTS). Starting from the source node u0, we traverse the graph until we reach a complete program using the UCT selection criterion [23], where the value of a node is inversely proportional to the cost of its children.² In the backpropagation step we update the value of all nodes along the path. After some iterations, we choose the next node in the path with the highest value. We repeat the procedure until we reach a goal node and return the best program found.

²MCTS with this node value definition will visit shallow programs more frequently than MC sampling.

Table 1: Mean accuracy, F1-score, and program depth of learned programs (3 trials). Programs found using our NEAR algorithms consistently achieve better F1-scores than the baselines and match the RNN's performance more closely. Our algorithms are also able to search for and find programs of much greater depth than the baselines. Experiment hyperparameters are included in the appendix.

              CRIM13-sniff        CRIM13-other        Fly-vs.-Fly         Bball-ballhandler
              Acc   F1    Depth   Acc   F1    Depth   Acc   F1    Depth   Acc   F1    Depth
Enum.         .851  .221  3       .707  .762  2       .819  .863  2       .844  .857  6.3
MC            .843  .281  7       .630  .715  1       .833  .852  4       .841  .853  6
MCTS          .745  .338  8.7     .666  .749  1       .817  .857  4.7     .711  .729  8
Genetic       .829  .181  1.7     .727  .768  3       .850  .868  6       .843  .853  6.7
IDS-BB-NEAR   .829  .446  6       .729  .768  1.3     .876  .892  4       .889  .903  8
A*-NEAR       .821  .369  6       .706  .764  2.7     .872  .885  4       .906  .918  8
RNN           .889  .481  -       .756  .785  -       .963  .964  -       .945  .950  -

Figure 5: Median minimum path cost to a goal node found at a given time, across 3 trials (for trials that terminate first, we extend the plots so the median remains monotonic). A*-NEAR (blue) and IDS-BB-NEAR (green) will often find a goal node with a smaller path cost, or find one of similar performance much faster.

Genetic algorithm. We follow the formulation in Valkov et al. [44].
In our genetic algorithm, crossover, selection, and mutation operations evolve a population of programs over a number of generations until a predetermined number of programs have been trained. The crossover and mutation operations only occur when the resulting program is guaranteed to be type-safe. For all baseline algorithms, as well as A*-NEAR and IDS-BB-NEAR, model parameters (θ) were learned with the training set, whereas program architectures (α) were evaluated using the performance on the validation set. Additionally, all baselines (including the NEAR algorithms) used F1-score error [38] as the evaluation objective by which programs were chosen. To account for class imbalances, F1-scoring is commonly used as an evaluation metric in behavioral classification domains, such as those considered in our work [14, 5]. Our full implementation is available in [39].

4.3 Experimental Results

Performance of learned programs. Table 1 shows the performance results on the test sets of our program learning algorithms, averaged over 3 seeds. The same structural cost function s(α) is used for all algorithms, but can vary across domains (see the appendix). Our NEAR-guided search algorithms consistently outperform the other baselines in F1-score, while accuracy is comparable (note that our error function ζ does not include accuracy). Furthermore, NEAR-guided search algorithms are capable of finding deeper and more complex programs that can offer non-trivial interpretations, such as the ones shown in Figures 2 and 3. Lastly, we verify that our learned programs are comparable with highly expressive RNNs, and see that there is at most a 10% drop in F1-score when using NEAR-guided search algorithms with our DSL.

(a) CRIM13-sniff (b) Bball-ballhandler

Figure 6: As we increase λ in Eq. (9), we observe that A*-NEAR will learn programs with decreasing program depth and also decreasing F1-score. This highlights that we can use λ to control the trade-off between structural cost and performance.
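The trade-off visualized in Figure 6 follows directly from the λ-weighted objective of Eq. (9): scaling the structural term shifts which candidate attains the minimum. A toy illustration (all numbers invented for illustration; `select_program` is our own name, not part of the NEAR implementation):

```python
def select_program(candidates, lam):
    """Pick the argmin of lam * structural_cost + error, mirroring Eq. (9)."""
    return min(candidates, key=lambda c: lam * c["cost"] + c["error"])

# Hypothetical candidates: a deep, accurate program versus a shallow,
# less accurate one (structural costs and errors are made up).
deep = {"name": "deep", "cost": 8.0, "error": 0.05}
shallow = {"name": "shallow", "cost": 2.0, "error": 0.50}

low = select_program([deep, shallow], lam=0.01)["name"]   # error dominates
high = select_program([deep, shallow], lam=1.0)["name"]   # structure dominates
```

With a small λ the deep, accurate program wins (0.13 vs. 0.52), while at λ = 1 the shallow program wins (2.5 vs. 8.05), matching the qualitative trend in Figure 6.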
mapprefix(SlidingWindowAverage(PositionAffine(x)))

Figure 7: Synthesized depth-2 program classifying a sniff action between two mice in the CRIM13 dataset. The sliding window average is over the last 10 frames. The program achieves an F1 score of 0.22 (vs. 0.48 for the RNN baseline). This program is synthesized using λ = 8.

Efficiency of NEAR-guided graph search. Figure 5 tracks the progress of each program learning algorithm during search by following the median best path cost (Eq. (1)) at a given time across 3 independent trials. For times where only 2 trials are active (i.e., one trial had already terminated), we report the average. Algorithms for each domain were run on the same machine to ensure consistency, and each non-NEAR baseline was set up to have at least as much time as our NEAR-guided algorithms for its search procedure (see the appendix). We observe that NEAR-guided search algorithms are able to find low-cost solutions more efficiently than existing baselines, while maintaining an overall shorter running time.

Cost-performance trade-off. We can also consider a modification of our objective in Eq. (1) that allows us to use a hyperparameter λ to control the trade-off between structural cost (a proxy for interpretability) and performance:

(α*, θ*) = argmin_{(α,θ)} (λ · s(α) + ζ(α, θ)).   (9)

To visualize this trade-off, we run A*-NEAR with the modified objective in Eq. (9) for various values of λ. Note that λ = 1 is equivalent to our experiments in Table 1. Figure 6 shows that for the Basketball and CRIM13 datasets, as we increase λ, which puts more weight on the structural cost, the resulting programs found by A*-NEAR search have decreasing F1-scores but are also more shallow. This confirms our expectation that we can control the trade-off between structural cost and performance, which allows users of NEAR-guided search algorithms to adjust to their preferences.
Unlike the other two experimental domains, the most performant programs learned in Fly-vs.-Fly were relatively shallow, so we omitted this domain as the trade-off showed little change in program depth. We illustrate the implications of this trade-off on interpretability using the depth-2 program in Figure 7 and the depth-8 program in Figure 8, both synthesized for the same task of detecting a sniff action in the CRIM13 dataset. The depth-2 program says that a sniff occurs if the intruder mouse is close to the right side of the cage and both mice are near the bottom of the cage, and can be seen to apply a position bias (regarding the location of the action) on the action. This program is simple, due to the large weight on the structural cost, and has a low F1-score. In contrast, the deeper program in Figure 8 has performance comparable to an RNN but is more difficult to interpret. Our interpretation of this program is that it evaluates the likelihood of sniff by applying a position bias, then using the velocity of the mice if the mice are close together and not moving fast, and using the distance between the mice otherwise.

map(add(PositionAffine(x), if add(VelocityAffine(x), DistAffine(x)) then VelocityAffine(x) else DistAffine(x)))

Figure 8: Synthesized depth-8 program classifying a sniff action between two mice in the CRIM13 dataset. The program achieves an F1 score of 0.46 (vs. 0.48 for the RNN baseline). This program is synthesized using λ = 1.

5 Related Work

Neural Program Induction. The literature on neural program induction (NPI) [19, 34, 25, 37] develops methods to learn neural networks that can perform procedural (program-like) tasks, typically using architectures augmented with differentiable memory. Our approach differs from these methods in that its final output is a symbolic program. However, since our heuristics are neural approximations of programs, our work can be seen as repeatedly performing NPI as the program is being produced.
While we have so far used classical feedforward and recurrent architectures to implement our neural heuristics, future work could use richer models from the NPI literature to this end.

DSL-based Program Synthesis. There is a large body of research on the synthesis of programs from DSLs. In many of these methods, the goal is not learning but finding a program that satisfies a hard constraint [1, 41, 32, 15]. However, there is also a growing literature on learning programs from (noisy) data [26, 13, 46, 11, 44, 45]. Of these methods, TERPRET [18] and NEURAL TERPRET [17] allow gradient descent as a mechanism for learning program parameters. However, unlike NEAR, these approaches do not allow a general search over the program architectures permitted by a DSL, and require a detailed hand-written template of the program for even the simplest tasks. While the HOUDINI framework [44] combines gradient-based parameter learning with search over program architectures, this search is not learning-accelerated and uses a simple type-directed enumeration. As reported in our experiments, NEAR outperforms this enumeration-based approach. Many recent methods for program synthesis use statistical models to guide the search over program architectures [3, 7, 12, 8, 11, 30, 16, 29]. In particular, Lee et al. [27] use a probabilistic model to guide an A* search over programs. Most of these models (including the one in Lee et al. [27]) are trained using corpora of synthesis problems and corresponding solutions, which are not available in our setting. There is a category of methods based on reinforcement learning (RL) [16, 4]. Unlike NEAR, these methods do not directly exploit the structure of the search space. Combining them with our approach would be an interesting topic of future work.

Structure Search using Relaxations. Our search problem bears similarities with the problems of searching over neural architectures and over the structure of graphical models.
Prior work has used relaxations to solve these structure-search problems [28, 40, 51, 33, 48]. Specifically, the A* lasso approach for learning sparse Bayesian networks [48] uses a dense network to construct admissible heuristics, and DARTS computes a differentiable relaxation of neural architecture search [28, 40]. The key difference between these efforts and ours is that the design space in our problem is much richer, making the methods in prior work difficult to apply. In particular, DARTS uses a composition of softmaxes over all possible candidate operations between a fixed set of nodes that constitute a neural architecture, and the heuristics in the A* lasso method come from a single, simple function class. In our setting, by contrast, there is no fixed bound on the number of expressions in a program, different sets of operations can be available at different points of synthesis, and the input and output types of the heuristic (and therefore its architecture) can vary based on the part of the program derived so far.

6 Conclusions

We have presented a novel graph search approach to learning differentiable programs. Our method leverages a novel construction of an admissible heuristic, based on neural relaxations, to efficiently search over program architectures. Our experiments show that programs learned using our approach can achieve competitive performance, and that our search-based learning procedure substantially outperforms conventional program learning approaches. There are many directions for future work. One direction is to extend the approach to richer DSLs and neural heuristic architectures, for example, those suited to reinforcement learning [45] and generative modeling [35]. Another is to combine NEAR with classical program synthesis methods based on symbolic reasoning. A third is to integrate NEAR into more complex program search problems, e.g., when there is an initial program as a starting point and the goal is to search for refinements.
A fourth is to more tightly integrate NEAR with real-world applications, so as to evaluate how the interpretability of learned programs affects downstream tasks.

Broader Impact

Programmatic models described using high-level DSLs are a powerful mechanism for summarizing automatically discovered knowledge in a human-interpretable way. Specifically, such models are more interpretable than state-of-the-art neural models while also tending to provide higher performance than shallower linear or decision-tree models. Programmatic models also allow for the natural incorporation of inductive bias and let the user influence the semantic meaning of learned programs. For these reasons, efforts on program learning, such as ours, can lead to more widespread use of machine learning in fields such as healthcare, autonomous driving, and the natural sciences, where safety and accountability are critical and where there is human-held prior knowledge (such as the laws of nature) that can usefully direct the learning process. The flip side is that the bias introduced in program learning can just as easily be exploited by users who desire specific outcomes from the learner. Ultimately, users of program learning methods must ensure that any incorporated inductive bias will not lead to unfair or misleading programs.

Funding Acknowledgment

This work was supported in part by NSF Award #CCF-1918651, NSF Award #CCF-1704883, DARPA PAI, Raytheon, a Rice University Graduate Research Fellowship (for Shah), a JP Morgan Chase Fellowship (for Verma), and NSERC Award #PGSD3-532647-2019 (for Sun).

References

[1] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In Dependable Software Systems Engineering, pages 1–25. 2015.

[2] A. Bagchi and A. Mahanti.
Admissible heuristic search in AND/OR graphs. Theoretical Computer Science, 24(2):207–219, 1983.

[3] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings, 2017.

[4] Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276, 2018.

[5] Xavier P. Burgos-Artizzu, Piotr Dollár, Dayu Lin, David J. Anderson, and Pietro Perona. Social behavior recognition in continuous video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1322–1329. IEEE, 2012.

[6] Eugene Charniak and Saadia Husain. A new admissible heuristic for minimal-cost proofs. Brown University, Department of Computer Science, 1991.

[7] Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.

[8] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017.

[9] Thomas G. Dietterich. Machine learning for sequential data: A review. In Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops SSPR 2002 and SPR 2002, Windsor, Ontario, Canada, August 6–9, 2002, Proceedings, pages 15–30, 2002.

[10] Kevin Ellis, Maxwell I. Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. Write, execute, assess: Program synthesis with a REPL. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B.
Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pages 9165–9174, 2019.

[11] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pages 6059–6068, 2018.

[12] Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Sampling for Bayesian program learning. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 1289–1297, 2016.

[13] Kevin Ellis, Armando Solar-Lezama, and Joshua B. Tenenbaum. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 973–981, 2015.

[14] Eyrun Eyjolfsdottir, Steve Branson, Xavier P. Burgos-Artizzu, Eric D. Hoopfer, Jonathan Schor, David J. Anderson, and Pietro Perona. Detecting social actions of fruit flies. In European Conference on Computer Vision, pages 772–787. Springer, 2014.

[15] John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15–17, 2015, pages 229–239, 2015.

[16] Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. CoRR, abs/1804.01118, 2018.

[17] Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differentiable programs with neural libraries.
In International Conference on Machine Learning, pages 1213–1222, 2017.

[18] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428, 2016.

[19] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

[20] Larry R. Harris. The heuristic search under conditions of error. Artificial Intelligence, 5(3):217–234, 1974.

[21] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation, 3rd Edition. Pearson international edition. Addison-Wesley, 2007.

[22] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[23] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

[24] Richard E. Korf. Recent progress in the design and analysis of admissible heuristic functions. In International Symposium on Abstraction, Reformulation, and Approximation, pages 45–55. Springer, 2000.

[25] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.

[26] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[27] Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. Accelerating search-based program synthesis using learned probabilistic models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018, pages 436–449, 2018.

[28] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search.
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019.

[29] Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Neural sketch learning for conditional program generation. In ICLR, 2018.

[30] Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.

[31] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.

[32] Oleksandr Polozov and Sumit Gulwani. FlashMeta: A framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25–30, 2015, pages 107–126, 2015.

[33] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 – February 1, 2019, pages 4780–4789. AAAI Press, 2019.

[34] Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.

[35] Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah Goodman. Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks. In Advances in Neural Information Processing Systems, pages 622–630, 2016.

[36] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2002.

[37] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy P. Lillicrap. Meta-learning with memory-augmented neural networks.
In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1842–1850. JMLR.org, 2016.

[38] Yutaka Sasaki. The truth of the F-measure. Teach Tutor Master, 2007.

[39] Ameesh Shah, Eric Zhan, and Jennifer Sun. NEAR code repository. https://github.com/trishullab/near, 2020.

[40] Richard Shin, Charles Packer, and Dawn Song. Differentiable neural network architecture search. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018.

[41] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.

[42] David Sontag, Talya Meltzer, Amir Globerson, Tommi S. Jaakkola, and Yair Weiss. Tightening LP relaxations for MAP using message passing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[43] Richard Anthony Valenzano, Shahab Jabbari Arfaee, Jordan Thayer, Roni Stern, and Nathan R. Sturtevant. Using alternative suboptimality bounds in heuristic search. In Twenty-Third International Conference on Automated Planning and Scheduling, 2013.

[44] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles A. Sutton, and Swarat Chaudhuri. HOUDINI: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 8701–8712, 2018.

[45] Abhinav Verma, Hoang Minh Le, Yisong Yue, and Swarat Chaudhuri. Imitation-projected programmatic reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[46] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pages 5052–5061, 2018.

[47] Glynn Winskel. The Formal Semantics of Programming Languages: An Introduction. MIT Press.

[48] Jing Xiang and Seyoung Kim. A* lasso for learning a sparse Bayesian network structure for continuous variables. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 2418–2426, 2013.

[49] Halley Young, Osbert Bastani, and Mayur Naik. Learning neurosymbolic generative models via program synthesis. In International Conference on Machine Learning (ICML), 2019.

[50] Yisong Yue, Patrick Lucey, Peter Carr, Alina Bialkowski, and Iain Matthews. Learning fine-grained spatial models for dynamic sports play prediction. In 2014 IEEE International Conference on Data Mining, pages 670–679. IEEE, 2014.

[51] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net, 2017.