# PLANS: Neuro-Symbolic Program Learning from Videos

Raphaël Dang-Nhu
ETH Zürich
dangnhur@ethz.ch

Recent years have seen the rise of statistical program learning based on neural models as an alternative to traditional rule-based systems for programming by example. Rule-based approaches offer correctness guarantees in an unsupervised way, as they inherently capture logical rules, while neural models scale more realistically to raw, high-dimensional input and are resistant to noisy I/O specifications. We introduce PLANS (Program LeArning from Neurally inferred Specifications), a hybrid model for program synthesis from visual observations that gets the best of both worlds, relying on (i) a neural architecture trained to extract abstract, high-level information from each raw individual input, and (ii) a rule-based system using the extracted information as I/O specifications to synthesize a program capturing the different observations. To address the key challenge of making PLANS resistant to noise in the network's output, we introduce a dynamic filtering algorithm for I/O specifications based on selective classification techniques. We obtain state-of-the-art performance at program synthesis from diverse demonstration videos in the Karel and ViZDoom environments, while requiring no ground-truth program for training.

1 Introduction

The problem of automatically generating a program satisfying given specifications is a long-standing challenge of artificial intelligence (Waldinger, Lee, 1969). The sub-area of inductive program synthesis, also known as programming by example (Gulwani, 2011), focuses on specifications formed of examples of desired I/O program behavior. In existing systems, I/O examples are typically instances of simple programming data types such as booleans, integers, floats or strings, or elementary data structures such as lists. Extending programming by example to more complex input domains is a very useful task, as it widens the range of possible real-world applications. Sun et al. (2018) made an important step in this direction by introducing the task of program synthesis from diverse demonstration videos. This is a supervised learning problem in which the input is a set of videos of an agent's behavior provided as raw visual input, and the desired output is a program summarizing the decision-making logic (or policy) of this agent. From the point of view of imitation learning, this task is important because policies represented as programs might generalize more strongly outside the training distribution, require less data to learn, and allow formal verification of safety properties.

Inductive synthesis techniques can be divided into two categories. Traditional symbolic or rule-based approaches (Jha et al., 2010; Alur et al., 2013) typically rely on efficient search through a user-defined domain-specific language, leveraging constraint solvers based on SAT and SMT solving for search-space pruning. Alternatively, recent efforts aim at leveraging statistical learning methods by phrasing inductive synthesis as a supervised learning task (Gaunt et al., 2016; Parisotto et al., 2017). Rule-based approaches have several advantages: they offer guarantees that the generated program is correct (i.e., satisfies the provided specifications). Additionally, they demonstrate better generalization to unseen inputs. The latter was experimentally shown by Gaunt et al.
(2016) in the TerpreT framework, and confirmed by Devlin et al. (2017) on the FlashFill benchmark (Gulwani et al., 2012). However, they are computationally expensive and can suffer from scalability issues. As a result, they are inherently unable to handle raw visual input. Besides, they struggle to deal with incorrect input, as they systematically attempt to satisfy all examples. Statistical approaches (see Allamanis et al. (2018) for a survey) appear as a way to mitigate these problems, by handling images (Ellis et al., 2018), scenes (Liu et al., 2019) or videos (Sun et al., 2018) as inputs, and by learning to be robust against noise (Devlin et al., 2017). But this comes at the cost of generating large datasets labeled with ground-truth programs for training.

We introduce PLANS, a hybrid model for program synthesis from demonstration videos that combines neural and rule-based techniques in order to get the best of both worlds. It has two components. First, a neural architecture is trained to infer a high-level description of each individual behavior from raw visual observation. Second, a rule-based solver uses the information inferred by the network as I/O specifications to synthesize a program capturing the different behaviors. The neural component makes it possible to handle raw visual input by transforming it into abstract specifications, while the rule-based system removes the need for ground-truth programs by inherently capturing logical rules. Therefore, our model uses strictly less supervision than prior work.

We address the key challenge of making PLANS robust to the noise introduced by the imperfect inference of specifications, which has a dramatic impact on the solver's performance if not properly handled. Specifically, we introduce an adaptive filtering algorithm for I/O specifications based on selective classification techniques, in which only high-confidence predictions are passed to the solver. Using this method, PLANS clearly outperforms previous methods (Sun et al., 2018; Duan et al., 2019). Finally, our model maintains a reasonable running time, as it relies on several independent solver calls that can be performed in parallel.

To summarize, we make the following contributions:
- We develop a neural architecture for inferring I/O specifications from videos, and an encoding of the synthesis task into an off-the-shelf rule-based solver (Rosette (Torlak, Bodik, 2013)).
- We address the key problem of making the rule-based solver robust to noise in the network's output by developing an adaptive filtering algorithm for specifications.
- We evaluate PLANS on the Karel and ViZDoom benchmarks and show significant performance improvements compared to state-of-the-art end-to-end neural architectures, despite using strictly less supervision signals.

2 Background

2.1 Motivation

Understanding an agent's decision-making logic is a long-standing task in artificial intelligence, originally motivated by the performance benefits of imitation learning techniques (Schaal, 1997; Ng, Russell, 2000; Ross et al., 2011). In addition, it is related to making algorithms more interpretable to humans. In particular, recent work has focused on interpreting the neural policy representation in modern deep reinforcement learning algorithms (Lipton, 2018; Montavon et al., 2018). In this context, program representation of policies has several advantages.
First, it gives a concise representation that is suitable for intuitive human understanding. Second, it permits leveraging symbolic verification techniques to ensure safety guarantees in the learning process (Verma et al., 2018; Zhu et al., 2019). The core aspect of these programmatic approaches is the ability to synthesize a summarized representation of the different observed behaviors of the agent.

The setting considered here is specific in two ways. First, most existing techniques assume that a high-level description of the actions performed by the agent is always available. Second, the control-flow interpretability of the synthesized program is limited by the complexity of the underlying state-space representation. This can lead to degenerate situations where the execution of the program depends on non-interpretable boolean conditions. For instance, in the case of visual input, this might be a threshold on the value of a single pixel in an $80 \times 80$ image. On the contrary, we assume that ground-truth action sequences are not available at test time, and that the high dimension of the state space requires restricting possible boolean conditions to high-level descriptions of the state. In the context of playing video games, an example of a meaningful condition is a boolean that indicates whether an enemy is present in the environment. These assumptions create a black-box setting where the specifications for synthesis of a program policy are not immediately available and need to be inferred.

2.2 Task description

Here, we describe the task of program synthesis from diverse demonstration videos (Sun et al., 2018).

Agent-environment interaction We consider an environment with observable state $s \in S$, and an agent interacting with this environment in discrete timesteps. We assume that the agent has a fixed set of possible actions $A$, and that the influence of each action on the state of the environment is determined by a deterministic transition function $T : S \times A \to S$. Additionally, we suppose that the agent behaves according to an unknown, deterministic policy $\eta(\mathbf{s}) \in A$ that determines the next action depending on the sequence of previously observed states $\mathbf{s} = (s_1, \dots, s_i)$. Finally, we assume that the agent has control over the end of the interaction by means of a specific action $\mathrm{end} \in A$.

Demonstrations A demonstration $\tau$ is formed of a sequence of states $\mathbf{s} = (s_1, \dots, s_T)$ together with a sequence of actions of the same length $\mathbf{a} = (a_1, \dots, a_T)$, satisfying the following conditions: (i) the actions of the agent follow the policy $\eta$, that is, for all $i$ in $\{1, \dots, T\}$, we have $a_i = \eta((s_1, \dots, s_i))$; (ii) the environment transitions are determined by $T$, i.e., for all $i$ in $\{1, \dots, T-1\}$, we have $s_{i+1} = T(s_i, a_i)$; (iii) the sequence is complete, i.e., the last action $a_T$ is $\mathrm{end}$. Since the transition function and policy are deterministic, the demonstration is uniquely determined by the environment's initial state $s_1$. Therefore, we can generate demonstrations for a policy by sampling random initial states and applying the policy and transition function iteratively. In the case of visual state observations, the state sequence $\mathbf{s}$ can be seen as a video composed of successive frames.
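To make this rollout concrete, the following is a minimal sketch of demonstration generation under the determinism assumptions above; the callables `policy` and `transition` and the `END` sentinel are illustrative names, not identifiers from the benchmark code.

```python
END = "end"  # illustrative sentinel for the terminating action

def generate_demonstration(s1, policy, transition):
    """Roll out one demonstration from initial state s1.

    policy:     callable mapping the tuple of observed states to the
                next action, i.e. a_i = eta((s_1, ..., s_i)).
    transition: deterministic T(s, a) -> next state.
    Returns state and action sequences of equal length, ending in END.
    """
    states, actions = [s1], []
    while True:
        a = policy(tuple(states))
        actions.append(a)
        if a == END:  # the agent controls the end of the interaction
            return states, actions
        states.append(transition(states[-1], a))  # s_{i+1} = T(s_i, a_i)
```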
Perceptions In order to address the interpretability issue arising from the high dimensionality of $S$, we assume the existence of $d$ perception primitives $\gamma_1, \dots, \gamma_d : S \to \{\text{true}, \text{false}\}$. Intuitively, the perceptions $(\gamma_1(s), \dots, \gamma_d(s))$ give a high-level, abstract representation of the state. For instance, in the context of video games, $\gamma_1(s)$ could be true if and only if the agent is facing an enemy in state $s$. The underlying idea is that the agent's behavior $\eta$ should only depend on the compact state representation provided by the perception primitives. Given a demonstration $\tau$, we define $d$ associated perception sequences $p^1, \dots, p^d$ as $p^j = (p^j_1, \dots, p^j_T) = (\gamma_j(s_1), \dots, \gamma_j(s_T))$.

Program synthesis from demonstration videos Sun et al. (2018) proposed the following supervised learning task, where the input is a set of demonstrations generated with an unknown policy $\eta$ and different initial states, and the desired output is a representation of $\eta$. The benchmark datasets are generated such that $\eta$ can always be represented by a program in a given domain-specific language (DSL). The execution of the program specifies which sequence of actions is executed, with different control-flow constructs such as conditional branchings and loops. The boolean conditions used for these constructs are only allowed to depend on the high-level representation $\gamma(s) = (\gamma_1(s), \dots, \gamma_d(s))$ of the state obtained via the perception primitives. The DSL is formally defined in Figure 1.

Supervision signals Sun et al. (2018) make the assumption that neither the action nor the perception sequences are available at test time. They can only be used as a supervision signal in the training phase. PLANS conforms to this assumption, which explains the need for a neural model that learns to infer the I/O specifications. Compared to previous work, the main specificity of PLANS is that it does not require program supervision. That is, the ground-truth program is never accessible to the model.

2.3 Summary of previous approaches

demo2program Sun et al. (2018) designed an end-to-end neural model for the task of program synthesis from demonstration videos. It is based on the following components: (i) a convolutional neural network to encode each of the video frames; (ii) an encoder LSTM layer that takes as input each video as a sequence of frame encodings; (iii) a summarizer LSTM layer that re-encodes the video with aggregated video encodings as initial state; (iv) a relational network module that summarizes the different video encodings; (v) a final decoder LSTM layer that outputs the program in the form of a sequence of code tokens. They demonstrate that the summarizer LSTM layer as well as the relational module are essential to obtain reliable identification of the correct program control-flow.

watch-reason-code Duan et al. (2019) proposed to reduce the memory footprint of the architecture by introducing a deviation-pooling strategy replacing the relational module, and to use multiple decoding layers to refine the generated program, obtaining slight performance improvements.

3 The PLANS model

Figure 2 gives a high-level overview of PLANS. We describe the neural component (Section 3.1), the rule-based solver (Section 3.2) and our adaptive filtering algorithm (Section 3.3).

3.1 Neural inference of specifications

The first component of PLANS is a neural architecture that learns to infer the action and perception sequences from raw visual input. This can be formalized as a sequence-to-sequence learning task where the input is a video demonstration $\tau = (s_1, \dots, s_T)$, and the desired outputs are (i) the ground-truth action sequence $(a_1, \dots, a_T)$ and (ii) the ground-truth perception sequences $(p^1_1, \dots, p^1_T), \dots, (p^d_1, \dots, p^d_T)$. To perform this task, we use a vanilla seq2seq model enhanced with an attention mechanism. In order to interpret visual input, we use a multi-layer convolutional network as a state embedding, i.e., the state $s_i$ is encoded by a convolutional network into a state vector $v_i$ before it is fed to the encoder layer. We use two different decoder layers for actions and perceptions, while the embedding and encoder layers are shared between actions and perceptions. To ensure reproducibility of our results, an extensive description of hyperparameters and the training process can be found in the supplementary material.
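As a rough illustration of this architecture, here is a PyTorch-style sketch with a shared convolutional embedding and encoder and two task-specific decoders. It omits the attention mechanism and autoregressive decoding for brevity; all layer shapes and names are assumptions for illustration, not the authors' exact hyperparameters (which are in their supplementary material).

```python
import torch
import torch.nn as nn

class SpecificationNet(nn.Module):
    """Sketch: shared CNN frame embedding + encoder, with separate
    decoders for action and perception sequences. Attention and
    autoregressive decoding are omitted; all sizes are illustrative."""

    def __init__(self, n_actions, n_perceptions, hidden=256):
        super().__init__()
        self.embed = nn.Sequential(          # per-frame state embedding v_i
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.percep_decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)
        self.percep_head = nn.Linear(hidden, 2 * n_perceptions)  # binary per primitive

    def forward(self, frames):               # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        v = self.embed(frames.flatten(0, 1)).view(B, T, -1)
        enc, state = self.encoder(v)          # shared encoder over v_1..v_T
        act, _ = self.action_decoder(enc, state)
        per, _ = self.percep_decoder(enc, state)
        action_logits = self.action_head(act)                    # (B, T, n_actions)
        percep_logits = self.percep_head(per).view(B, T, -1, 2)  # (B, T, d, 2)
        return action_logits, percep_logits
```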
3.2 Encoding into an inductive synthesis problem

    Program    m := s ; end
    Statement  s := s1 ; s2 | while(b) : s | repeat(r) : s | if(b) : s1 else : s2 | if(b) : s | a
    Repetition r := r ∈ ℕ
    Action     a := a ∈ A
    Condition  b := not b | γi,  i ∈ {1 . . . d}

Figure 1: DSL for representing policies.

Our rule-based synthesizer is implemented in Rosette (Torlak, Bodik, 2013), which itself builds on the Z3 solver (De Moura, Bjørner, 2008). Rosette allows the user to specify a program with holes (or sketch (Solar-Lezama, 2008)) together with a desired behavior; it then automatically fills the holes by encoding the resulting problem into a set of logical constraints that have to be satisfied. A very convenient feature of Rosette is the possibility to specify complex holes that can be filled with expressions from a predefined grammar, which allows encoding the DSL of Figure 1 and restricting the search to valid programs. In the supplementary material, we show how we encoded the different control-flow constructs of the DSL in Rosette.

The task of program synthesis from diverse demonstration videos is naturally phrased as a programming-by-example problem. The I/O specifications contain the perception sequences, which serve as program input, and the corresponding action sequence, which is the desired output. In order to achieve good generalization of the synthesized program to unseen demonstrations, it is crucial to prefer simpler solutions among several programs satisfying the I/O specifications. That is, we aim at finding the solution with the least cost, defined as the number of control-flow branchings (if or while statements). Since Rosette has no native support for cost functions, we propose to bound the number of control-flow statements in the Rosette encoding of the DSL, and to make repeated calls to the solver while increasing this bound, until the resulting set of constraints is satisfiable (we sketch this loop at the end of this section). We give additional details about this procedure in the supplementary material.

Rule-based synthesizers have the advantage of being both sound and complete: any program returned by the solver is guaranteed to satisfy all the provided specifications, and if there exists a satisfying way to fill the holes, the solver will find a solution in finite time (in the case of a bounded number of control-flow statements). These guarantees usually come with a high computational cost. In the case at hand, thanks to the relative simplicity of the DSL, which contains no variable assignments, we did not observe time to be a problem. All solver queries terminate in a few seconds, which is of the same order of magnitude as the time needed by the neural architecture to infer the different specifications. Besides, the different calls to the solver are independent and can be made in parallel. We give precise time measurements in the supplementary material.
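The actual encoding is written in Rosette/Racket (see the supplementary material); the minimum-cost search itself reduces to the following loop, sketched here in Python around a hypothetical wrapper `rosette_synthesize(examples, max_branchings)` that returns a program satisfying all examples, or `None` when the constraints are unsatisfiable under the bound.

```python
def synthesize_min_cost(examples, max_bound=5):
    """Repeatedly query the solver with an increasing bound on the number
    of control-flow branchings (if/while), so that the first program
    found is one of least cost. `rosette_synthesize` is a hypothetical
    wrapper around the Rosette query; `max_bound` is an illustrative cap."""
    for bound in range(max_bound + 1):
        program = rosette_synthesize(examples, max_branchings=bound)
        if program is not None:
            return program
    return None  # no program within the cost budget satisfies all examples
```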
[Figure: three video demonstrations pass through demo encoders and separate action and perception decoders, producing predictions such as (move, move, move, end) with perceptions (True, True, True, False); a low-confidence, erroneous prediction is filtered out, and the remaining sequences reach the rule-based synthesizer, which outputs while(frontIsClear()) : move ; end.]

Figure 2: Overview of PLANS. The video demonstrations are individually fed at pixel level into the neural encoder-decoder architecture, which infers the high-level action and perception sequences. For the sake of clarity, we only show one perception sequence, corresponding to the frontIsClear() perception primitive. The different sequences are then given to a rule-based solver, which generates a program summarizing the different demonstrations. The demonstration with an inconsistent perception sequence is identified thanks to its low prediction confidence, and is not provided to the solver.

3.3 Synthesis from noisy specifications

The ability to generate correct perception-action specifications for the synthesis task is crucial to the performance of PLANS. Indeed, existing rule-based solvers are very sensitive to even small amounts of input noise (Devlin et al., 2017). Consequently, our model can only achieve satisfying performance if all the specifications given to the solver are reliable. While this assumption is met on the Karel benchmark, we observe in the ViZDoom environment that nearly all (above 99%) of the test programs have at least one erroneous predicted action or perception token among the 25 observed demonstrations.

Existing work on rule-based synthesis from noisy data (Raychev et al., 2016) proposes dataset sub-sampling as a way to avoid overfitting to incorrect specifications. This is not directly applicable to our setting. Indeed, it only guarantees to find a program that is close to the correct solution, in the sense that it satisfies as many I/O specifications as possible, with a focus on datasets containing a large number of examples. In contrast, we deal with a fixed and limited number of demonstrations, and aim at identifying exactly the original program.

Instead, we designed a simple yet effective filtering algorithm for demonstrations, based on the principle of selective classification. This method makes use of the neural network's prediction confidence. For a given sequence token, the prediction confidence is defined as the probability of the predicted class in the softmax output layer. The filtering works in two steps. First, we assign to each demonstration $\tau$ an action confidence level $\mathrm{actionconf}(\tau) \in [0, 1]$ and a perception confidence level $\mathrm{perconf}(\tau) \in [0, 1]$ to characterize the global confidence of the whole sequence of predictions. Then, if either one of the two confidence levels is too low, we discard the whole demonstration. We explore two variants of this method: a static filtering method that uses fixed thresholds for the action and perception confidence, and a dynamic method that incrementally increases the thresholds while making repeated calls to the solver, until a valid solution is found. Before we describe these algorithms in detail, let us formally define the action and perception confidence levels.

Definition. Let $\tau$ be a demonstration, for which the neural architecture predicts the action sequence $a = (a_1, \dots, a_T)$ and the perception sequences $p^1 = (p^1_1, \dots, p^1_T), \dots, p^d = (p^d_1, \dots, p^d_T)$. Let us denote $\mathrm{conf}(a_i)$ (resp. $\mathrm{conf}(p^j_i)$) the prediction confidence of action token $a_i$ (resp. perception token $p^j_i$), defined as the probability of $a_i$ (resp. $p^j_i$) in the softmax output layer. We define the action confidence $\mathrm{actionconf}(\tau)$ of the demonstration as the minimum of the action token confidences, that is, $\mathrm{actionconf}(\tau) = \min_i \mathrm{conf}(a_i)$. Similarly, we define the perception confidence $\mathrm{perconf}(\tau)$ of $\tau$ to be the minimum of the perception token confidences, formally $\mathrm{perconf}(\tau) = \min_{i,j} \mathrm{conf}(p^j_i)$. Note that for the perception confidence level, the min is taken both across timesteps (indexed by $i$) and perception primitives (indexed by $j$).
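In code, these two confidence levels amount to taking the maximum over the class axis of the softmax outputs (the probability of the predicted token) followed by a minimum over all remaining axes. A small NumPy sketch, where the array layouts are our own assumption:

```python
import numpy as np

def actionconf(action_probs: np.ndarray) -> float:
    """actionconf(tau) = min_i conf(a_i).
    action_probs: (T, n_actions) softmax outputs, one row per timestep;
    conf(a_i) is the probability of the predicted (argmax) token."""
    return float(action_probs.max(axis=1).min())

def perconf(percep_probs: np.ndarray) -> float:
    """perconf(tau) = min over timesteps i and primitives j of conf(p^j_i).
    percep_probs: (T, d, 2) per-primitive binary softmax outputs."""
    return float(percep_probs.max(axis=2).min())
```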
Static filtering In the static filtering strategy, we fix two thresholds $\epsilon_a$ and $\epsilon_p$ for the action and perception confidence, respectively. Then, we filter out all the I/O examples for which the underlying demonstration $\tau$ verifies $\mathrm{actionconf}(\tau) < \epsilon_a$ or $\mathrm{perconf}(\tau) < \epsilon_p$. $\epsilon_a$ and $\epsilon_p$ are treated as hyperparameters and optimized on the validation dataset. Tuning the thresholds creates a trade-off between the number of remaining demonstrations and their reliability.

Dynamic filtering In practice, we observe the static filtering strategy to be efficient at filtering out incorrect action sequences. However, it is sub-optimal for filtering perception sequences, because the best threshold $\epsilon_p$ varies significantly across programs. To mitigate this problem, we employ a dynamic filtering strategy for perception sequences. We sort the demonstrations by increasing perception confidence, i.e., we have $\tau_1, \dots, \tau_k$ with $\mathrm{perconf}(\tau_1) \leq \dots \leq \mathrm{perconf}(\tau_k)$. Then, we incrementally filter out the demonstrations in increasing order of confidence, while making repeated calls to the program synthesizer, until the resulting set of constraints is satisfiable. This progressively reduces the number of I/O examples, with the hope that incorrect predictions will have low confidence and will be filtered out first. This dynamic filtering strategy is adaptive in the sense that it removes the need for tuning the threshold $\epsilon_p$. It is coupled with the search for the optimal program cost in the following way: in the outer loop, we increment the perception threshold $\epsilon_p$, and in the inner loop, we increment the number of control-flow statements. In the supplementary material, we provide a detailed description of this algorithm. We also experimented with a decremental scheme and obtained similar accuracy.
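Putting the pieces together, here is a sketch of dynamic filtering in terms of the hypothetical helpers introduced above; the authors' exact interleaving of the two loops is detailed in their supplementary material, so this is an approximation.

```python
def plans_dynamic_filtering(demos, max_bound=5):
    """Sketch: drop demonstrations in increasing order of perception
    confidence (outer loop) and increase the control-flow bound (inner
    loop) until the solver's constraints become satisfiable. Assumes
    demonstrations failing the static action threshold eps_a were
    already removed; io_examples() (hypothetical) turns demonstrations
    into perception/action I/O pairs."""
    # Sort so that the least confident perception predictions are dropped first.
    demos = sorted(demos, key=lambda tau: perconf(tau.percep_probs))
    for n_dropped in range(len(demos)):       # n_dropped = 0: keep everything
        examples = io_examples(demos[n_dropped:])
        for bound in range(max_bound + 1):
            program = rosette_synthesize(examples, max_branchings=bound)
            if program is not None:
                return program
    return None  # not satisfiable even with a single demonstration
```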
4 Experimental results

In this section, we report experimental observations concerning the performance of PLANS.

Benchmarks Karel (Pattis, 1981) is an educational programming language that controls a robot navigating through a grid world with walls and markers. ViZDoom (Kempka et al., 2016) is an open-source platform for Doom, a classic first-person shooter game. It allows training bots via reinforcement learning from visual observations.

We use the three evaluation metrics designed by Sun et al. (2018). To measure execution accuracy, we compare whether the predicted and ground-truth programs behave in the same way on a fixed number of previously unobserved initial states. Sequence accuracy measures whether the predicted and ground-truth programs match exactly. Program accuracy is similar to sequence accuracy, but identifies some semantically equivalent programs: e.g., repeat(3) : move and move ; move ; move will be considered equivalent by program accuracy.

Experimental setup We performed all experiments on a machine with 2.00GHz Intel Xeon E5-2650 CPUs and a single GeForce RTX 2080 Ti GPU. We make our implementation public and provide additional details about the duration of the experiments in the supplementary material.

4.1 Comparison with demo2program and watch-reason-code

Results on the main Karel and ViZDoom benchmarks are presented in Table 1. Values for both baselines are directly reported from prior work, and show the best obtained performance. For PLANS, we report the mean and standard deviation over three independent runs. PLANS strongly improves execution accuracy compared to prior work: approximately +15% absolute improvement for Karel, and +10% for ViZDoom. This means that PLANS is significantly better at capturing the different possible behaviors. We also observe improvements in program accuracy, though less significant. We surmise that this is due to imperfections of the program accuracy metric, which captures some but not all semantically equivalent programs. For instance, we observe that if(frontIsClear()) : move else : turn and if(not frontIsClear()) : turn else : move are not recognized as equivalent. An important direction for future work is to improve this metric in the original benchmark released by Sun et al. (2018). Finally, we observe no sequence accuracy improvement on the Karel benchmark. However, we believe that the obtained performance is still very reasonable. Indeed, as opposed to the two baselines, PLANS does not access any ground-truth program during training. Therefore, it cannot be expected to distinguish between semantically equivalent programs.

In the ViZDoom environment, our results confirm that the performance of PLANS heavily relies on the filtering heuristics. Without filtering, the model achieves very poor performance. With static filtering only, the model achieves reasonable accuracy but fails to outperform both baselines. Dynamic filtering yields the best results.

Table 1: Accuracy comparison on the main Karel and ViZDoom benchmarks. Values are in %.

| Model | Karel Execution | Karel Program | Karel Sequence | ViZDoom Execution | ViZDoom Program | ViZDoom Sequence |
|---|---|---|---|---|---|---|
| demo2program | 72.1 | 48.9 | 41.0 | 78.4 | 62.5 | 53.2 |
| watch-reason-code | 74.7 | 51.2 | 43.3 | 68.1 | 63.4 | 55.8 |
| PLANS (none) | 91.6 ± 1.3 | 53.9 ± 1.0 | 34.2 ± 0.5 | 25.8 ± 0.9 | 21.7 ± 0.6 | 19.3 ± 0.6 |
| PLANS (static) | - | - | - | 77.5 ± 1.5 | 57.9 ± 0.9 | 51.2 ± 0.8 |
| PLANS (dynamic) | - | - | - | 88.0 ± 0.6 | 65.5 ± 0.6 | 58.8 ± 0.6 |

Table 2: Comparison on the ViZDoom if-else benchmark. Values are in %.

| Model | Execution | Program | Sequence |
|---|---|---|---|
| demo2program | 89.4 | 69.1 | 58.8 |
| watch-reason-code | 82.1 | 67.8 | 57.7 |
| PLANS (none) | 16.3 ± 1.2 | 13.8 ± 1.0 | 10.2 ± 0.6 |
| PLANS (static) | 77.0 ± 0.6 | 62.4 ± 0.7 | 50.2 ± 0.7 |
| PLANS (dynamic) | 91.4 ± 0.7 | 74.6 ± 0.7 | 62.1 ± 0.7 |

[Figure: execution accuracy (%) as a function of the number of observed demonstrations (1 to 40), for PLANS and demo2program.]

Figure 3: Influence of the number of observed demonstrations on execution accuracy (ViZDoom, main benchmark).

4.2 Additional experiments

Number of observed demonstrations In Figure 3, we show the influence of the number of observed demonstrations on ViZDoom execution accuracy. We observe consistent improvement over demo2program as the number of demonstrations increases. We do not report values for watch-reason-code, as this metric is not given in prior work and no implementation is provided; watch-reason-code is in any case significantly less accurate than demo2program on this metric. This experiment confirms the claim that our model reliably yields superior performance on the ViZDoom benchmark, and generalizes better to unseen initial states.
If-else dataset In order to specifically assess the ability of PLANS to identify control-flow divergences among demonstrations, we consider a specific dataset with programs composed of only one if-else branching statement. Results are shown in Table 2. Contrary to the main experiment, we observe similar improvements on all three metrics. We attribute this observation to the fact that programs in this dataset have fewer semantically equivalent variations, because of their simple form.

5 Related work

Learning graphics programs from images and scenes A related line of work focuses on the inference of graphics programs from visual input (Ellis et al., 2015; Ganin et al., 2018; Liu et al., 2019; Tian et al., 2019). Most notably, Ellis et al. (2018) also use a hybrid architecture composed of a neural network and a rule-based solver. However, this cannot be considered a programming-by-example task, as the specification only relates to a single input image, for which a compact, synthetic representation is desired. In contrast, we address the challenging task of summarizing the decision making in several videos, which requires identifying the control-flow divergences between the different observed behaviors. Besides, they ensure robustness against noise by incrementally adding specifications while rendering the resulting image and measuring its similarity with the input. This approach to identifying wrong specifications is task-specific, as it requires the ability to incrementally render the generated specifications into an image, and does not apply to our setting.

Synthesis of programs from noisy examples Our work confirms the idea that naive rule-based solvers are very vulnerable to noise in specifications (RobustFill, (Devlin et al., 2017)). We propose a general approach to mitigate this problem when the I/O specifications are inferred by a statistical learning system and the noise originates from inaccuracies of this system. We believe that such settings will become increasingly common with the rise of hybrid neuro-symbolic models for machine reasoning. According to the taxonomy of Raychev et al. (2016), our filtering algorithms fall into the category of dataset cleaning methods, which aim at removing incorrect specifications before they are fed to the solver. Other categories include (i) probabilistic program synthesis (Nori et al., 2015), which relaxes the specifications into stochastic constraints, and (ii) genetic programming as a way of exploring large solution spaces while maximizing an objective function (Cramer, 1985; Baker, 1987). We do not compare with the algorithm of Raychev et al. (2016) because it does not build easily on off-the-shelf solvers, and would require an unrealistic re-implementation effort.

Neurally enhanced program synthesis An important line of work aims at using neural architectures to speed up rule-based program synthesis by guiding the search process. DeepCoder (Balog et al., 2017) predicts an order on the program space in order to guide the rule-based solver towards potential solutions. Lee et al. (2018) propose to learn a probabilistic model attributing a likelihood to each program, with the aim of first exploring likely solutions. Ellis et al. (2018) learn a bias-optimal strategy based on Levin search (Levin, 1973), with the goal of efficiently allocating compute time to different parts of the program search space. We did not need to use such approaches, as synthesis was reasonably fast on the Karel and ViZDoom benchmarks.
Programmatic reinforcement learning The idea of representing reinforcement learning policies as programs is investigated in a growing body of work. However, the search space of programs is highly non-smooth, which makes the resulting optimization problem intractable. Different works aim at mitigating this problem by first training a good neural policy via deep reinforcement learning. The policy is then used as an oracle providing I/O examples to generate a programmatic policy with similar behavior (Bastani et al., 2018; Verma et al., 2018, 2019). Despite some high-level similarities, the setting of program synthesis from diverse demonstration videos has some fundamental particularities: (i) we aim at finding a program that exactly matches the different examples, rather than one that merely exhibits similar behavior; (ii) we consider that a high-level description of each video is only available at training time; (iii) we only have access to a fixed, limited number of demonstrations, while the neural policy can be used to generate an arbitrary number of examples.

Black-box imitation learning There exists prior work in the field of imitation learning that does not assume access to the agent's actions. Torabi et al. (2018) develop a method for behavioral cloning from state observations. In (Nair et al., 2017), a robot learns pixel-level inverse dynamics by examining the influence of its own actions, in order to be able to infer the actions of the expert. However, this line of work does not consider program representations of policies.

Selective prediction Our filtering strategies relate to the concept of selective prediction in statistical learning, also known as the reject option. Since the seminal work of Chow (1957), the idea of rejecting certain predictions because of their low confidence has been extensively studied in various settings (Fumera, Roli, 2002; Geifman, El-Yaniv, 2017). Our action and perception confidences can be seen as extensions of the softmax response (Geifman, El-Yaniv, 2019) to sequence prediction tasks. Other works have proposed to integrate the reject option in the learning process (Cortes et al., 2016); a potential direction for future work is to integrate these mechanisms in our neural architecture. In the specific field of statistical program learning, very recent work has proposed to use selective prediction in order to ensure adversarial robustness for code (Bielik, Vechev, 2020).

6 Conclusion

A crucial challenge for intelligent systems is the ability to perform abstract reasoning from raw, unstructured data. To achieve this goal, prior work (Ellis et al., 2018) has evidenced the power of combining formal rule-based techniques with neural architectures. PLANS is the first application of such hybrid systems to the challenging task of identifying an agent's decision-making logic from multiple videos. It sets a new state of the art for this task, with strictly less supervision signals. To efficiently combine neural and rule-based components, we developed an adaptive filtering algorithm for neurally inferred specifications. We believe that this method is quite general and will be applicable to similar hybrid neuro-symbolic systems. For future work, we also aim at developing fine-grained filtering algorithms that only remove tokens instead of full sequences. This might help further improve the sample efficiency of our method.

Acknowledgments We thank Frederik Benzing and Wouter Tonnon for helpful discussions and proofreading.
Funding disclosure The author was funded by the ETH Master Scholarship Programme during his studies.

Broader Impact

The ability to understand agents' decision-making logic from visual input can be applied to real-world observations such as videos of people driving. In this context, a simple example of logic that can be represented by a program is: "if the traffic light is green, move, else stop". Since PLANS requires no program supervision, it extends the applicability of prior work to new settings where no ground-truth programs are available. We believe that PLANS can eventually be applied in all contexts involving interaction between agents (human, animal or robotic) and their environment. In the case of human agents, it is essential to foresee the impact on the privacy of the observed subjects. In order to avoid privacy violations, it must be carefully assessed in each application of PLANS which components of a person's decision-making logic can be interpreted without being intrusive. While the concern of data privacy is not specific to PLANS, it opens up new ethical and legal questions about the exact meaning and status of describing an agent's behavior.

It is also crucial to protect individuals from misinterpretation of their behavior. Prior systems for program synthesis from visual observations essentially have two failure modes. The first corresponds to the case where the behavior in one individual video is not understood correctly, which leads the synthesized program to capture behaviors that have not actually been observed. The second occurs when the different videos are individually correctly interpreted, but the branching between the different behaviors (represented by the program control-flow) is not correctly inferred. PLANS addresses the first problem by leveraging the prediction confidence of its neural component, and the second by using a rule-based system that offers correctness guarantees. These aspects make PLANS more robust than prior work, which implies more reliability for end users.

References

Allamanis Miltiadis, Barr Earl T, Devanbu Premkumar, Sutton Charles. A survey of machine learning for big code and naturalness // ACM Computing Surveys (CSUR). 2018. 51, 4. 1–37.

Alur Rajeev, Bodik Rastislav, Juniwal Garvit, Martin Milo M. K., Raghothaman Mukund, Seshia Sanjit A, Singh Rishabh, Solar-Lezama Armando, Torlak Emina, Udupa Abhishek. Syntax-guided synthesis. 2013.

Baker James E. Reducing bias and inefficiency in the selection algorithm // Proceedings of the Second International Conference on Genetic Algorithms. 1987.

Balog Matej, Gaunt Alexander L, Brockschmidt Marc, Nowozin Sebastian, Tarlow Daniel. DeepCoder: Learning to write programs // International Conference on Learning Representations. 2017.

Bastani Osbert, Pu Yewen, Solar-Lezama Armando. Verifiable reinforcement learning via policy extraction // Advances in Neural Information Processing Systems. 2018.

Bielik Pavol, Vechev Martin. Adversarial robustness for code // International Conference on Machine Learning. 2020.

Chow Chi-Keung. An optimum character recognition system using decision functions // IRE Transactions on Electronic Computers. 1957. 4. 247–254.

Cortes Corinna, DeSalvo Giulia, Mohri Mehryar. Learning with rejection // International Conference on Algorithmic Learning Theory. 2016.

Cramer Nichael Lynn. A representation for the adaptive generation of simple sequential programs // Proceedings of the First International Conference on Genetic Algorithms. 1985. 183–187.
De Moura Leonardo, Bjørner Nikolaj. Z3: An efficient SMT solver // International Conference on Tools and Algorithms for the Construction and Analysis of Systems. 2008.

Devlin Jacob, Uesato Jonathan, Bhupatiraju Surya, Singh Rishabh, Mohamed Abdel-rahman, Kohli Pushmeet. RobustFill: Neural program learning under noisy I/O // International Conference on Machine Learning. 2017.

Duan Xuguang, Wu Qi, Gan Chuang, Zhang Yiwei, Huang Wenbing, Hengel Anton van den, Zhu Wenwu. Watch, Reason and Code: Learning to Represent Videos Using Program // Proceedings of the 27th ACM International Conference on Multimedia. 2019.

Ellis Kevin, Ritchie Daniel, Solar-Lezama Armando, Tenenbaum Josh. Learning to infer graphics programs from hand-drawn images // Advances in Neural Information Processing Systems. 2018.

Ellis Kevin, Solar-Lezama Armando, Tenenbaum Josh. Unsupervised learning by program synthesis // Advances in Neural Information Processing Systems. 2015.

Fumera Giorgio, Roli Fabio. Support vector machines with embedded reject option // International Workshop on Support Vector Machines. 2002.

Ganin Yaroslav, Kulkarni Tejas, Babuschkin Igor, Eslami S. M., Vinyals Oriol. Synthesizing programs for images using reinforced adversarial learning // arXiv preprint arXiv:1804.01118. 2018.

Gaunt Alexander L, Brockschmidt Marc, Singh Rishabh, Kushman Nate, Kohli Pushmeet, Taylor Jonathan, Tarlow Daniel. TerpreT: A probabilistic programming language for program induction // arXiv preprint arXiv:1608.04428. 2016.

Geifman Yonatan, El-Yaniv Ran. Selective classification for deep neural networks // Advances in Neural Information Processing Systems. 2017. 4878–4887.

Geifman Yonatan, El-Yaniv Ran. SelectiveNet: A deep neural network with an integrated reject option // arXiv preprint arXiv:1901.09192. 2019.

Gulwani Sumit. Automating string processing in spreadsheets using input-output examples // ACM SIGPLAN Notices. 2011. 46, 1. 317–330.

Gulwani Sumit, Harris William R, Singh Rishabh. Spreadsheet data manipulation using examples // Communications of the ACM. 2012. 55, 8. 97–105.

Jha Susmit, Gulwani Sumit, Seshia Sanjit A, Tiwari Ashish. Oracle-guided component-based program synthesis // 2010 ACM/IEEE 32nd International Conference on Software Engineering. 2010.

Kempka Michał, Wydmuch Marek, Runc Grzegorz, Toczek Jakub, Jaśkowski Wojciech. ViZDoom: A Doom-based AI research platform for visual reinforcement learning // 2016 IEEE Conference on Computational Intelligence and Games (CIG). 2016.

Lee Woosuk, Heo Kihong, Alur Rajeev, Naik Mayur. Accelerating search-based program synthesis using learned probabilistic models // ACM SIGPLAN Notices. 2018. 53, 4. 436–449.

Levin Leonid Anatolevich. Universal sequential search problems // Problemy peredachi informatsii. 1973. 9, 3. 115–116.

Solar-Lezama Armando. Program synthesis by sketching. 2008.

Lipton Zachary C. The mythos of model interpretability // Queue. 2018. 16, 3. 31–57.

Liu Yunchao, Wu Zheng, Ritchie David A., Freeman William T., Tenenbaum Joshua B., Wu Jiajun. Learning to Describe Scenes with Programs // International Conference on Learning Representations. 2019.

Montavon Grégoire, Samek Wojciech, Müller Klaus-Robert. Methods for interpreting and understanding deep neural networks // Digital Signal Processing. 2018. 73. 1–15.

Nair Ashvin, Chen Dian, Agrawal Pulkit, Isola Phillip, Abbeel Pieter, Malik Jitendra, Levine Sergey. Combining self-supervised learning and imitation for vision-based rope manipulation // 2017 IEEE International Conference on Robotics and Automation (ICRA). 2017.
Ng Andrew Y., Russell Stuart J. Algorithms for inverse reinforcement learning // International Conference on Machine Learning. 2000.

Nori Aditya V, Ozair Sherjil, Rajamani Sriram K, Vijaykeerthy Deepak. Efficient synthesis of probabilistic programs // ACM SIGPLAN Notices. 2015. 50, 6. 208–217.

Parisotto Emilio, Mohamed Abdel-rahman, Singh Rishabh, Li Lihong, Zhou Dengyong, Kohli Pushmeet. Neuro-symbolic program synthesis // International Conference on Learning Representations. 2017.

Pattis Richard E. Karel the Robot: A gentle introduction to the art of programming. 1981.

Raychev Veselin, Bielik Pavol, Vechev Martin, Krause Andreas. Learning programs from noisy data // ACM SIGPLAN Notices. 2016. 51, 1. 761–774.

Ross Stéphane, Gordon Geoffrey, Bagnell Drew. A reduction of imitation learning and structured prediction to no-regret online learning // Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011.

Schaal Stefan. Learning from demonstration // Advances in Neural Information Processing Systems. 1997.

Sun Shao-Hua, Noh Hyeonwoo, Somasundaram Sriram, Lim Joseph. Neural program synthesis from diverse demonstration videos // International Conference on Machine Learning. 2018.

Tian Yonglong, Luo Andrew, Sun Xingyuan, Ellis Kevin, Freeman William T., Tenenbaum Joshua B., Wu Jiajun. Learning to Infer and Execute 3D Shape Programs // International Conference on Learning Representations. 2019.

Torabi Faraz, Warnell Garrett, Stone Peter. Behavioral cloning from observation // Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. 2018.

Torlak Emina, Bodik Rastislav. Growing solver-aided languages with Rosette // Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. 2013.

Verma Abhinav, Le Hoang, Yue Yisong, Chaudhuri Swarat. Imitation-projected programmatic reinforcement learning // Advances in Neural Information Processing Systems. 2019.

Verma Abhinav, Murali Vijayaraghavan, Singh Rishabh, Kohli Pushmeet, Chaudhuri Swarat. Programmatically interpretable reinforcement learning // International Conference on Machine Learning. 2018.

Waldinger Richard J, Lee Richard C. T. PROW: A step toward automatic program writing // Proceedings of the 1st International Joint Conference on Artificial Intelligence. 1969.

Zhu He, Xiong Zikang, Magill Stephen, Jagannathan Suresh. An inductive synthesis framework for verifiable reinforcement learning // Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2019.