# Neural Program Synthesis from Diverse Demonstration Videos

Shao-Hua Sun*1, Hyeonwoo Noh*2, Sriram Somasundaram1, Joseph J. Lim1

*Equal contribution. 1Department of Computer Science, University of Southern California, USA. 2Department of Computer Science and Engineering, POSTECH, Pohang, Korea. Correspondence to: Shao-Hua Sun.

Abstract

Interpreting decision making logic in demonstration videos is key to collaborating with and mimicking humans. To empower machines with this ability, we propose a neural program synthesizer that is able to explicitly synthesize underlying programs from behaviorally diverse and visually complicated demonstration videos. We introduce a summarizer module as part of our model to improve the network's ability to integrate multiple demonstrations varying in behavior. We also employ a multi-task objective to encourage the model to learn meaningful intermediate representations for end-to-end training. We show that our model is able to reliably synthesize underlying programs as well as capture diverse behaviors exhibited in demonstrations. The code is available at https://shaohua0116.github.io/demo2program.

1. Introduction

Imagine you are watching others driving cars. You will easily notice many common behaviors even if you know nothing about driving. For example, cars stop when the traffic light turns red and move again when the light turns green. Cars also slow down when pedestrians are seen jaywalking. Through observation, humans can abstract behaviors and understand the reasoning behind them, especially the structural relationship between actions (e.g. start, slow down, stop) and perception (e.g. light, pedestrian).

Can machines also reason about the decision making logic behind behaviors? There has been tremendous effort and success in understanding behaviors, such as recognizing actions (Simonyan & Zisserman, 2014), describing activities in language (Venugopalan et al., 2015), and predicting future outcomes (Srivastava et al., 2015). Yet, interpreting the reasons behind behaviors is relatively unexplored and is a crucial skill for machines that collaborate with and mimic humans. Hence, our goal is to step towards developing a method that can interpret perception-based decision making logic from diverse behaviors seen in multiple visual demonstrations.

Our insight is to exploit declarative programs, structured in a formal language, as representations of decision making logic. The formal language is composed of action blocks, perception blocks, and control flow (e.g. if/else). Programs written in such a language can explicitly model the connection between an observation (e.g. traffic light, biker) and an action (e.g. stop). An example is shown in Figure 1.¹ Described in a formal language, programs are logically interpretable and executable.

[Figure 1. An illustration of neural program synthesis from demonstrations. Given multiple demonstration videos exhibiting diverse behaviors, our neural program synthesizer learns to produce an interpretable and executable underlying program; in the illustration, divergence occurs based on perception in the second frame. The synthesized program is:

    while frontIsClear():
        move()
    turnRight()
    if thereIsPig():
        attack()
    else:
        if not thereIsWolf():
            spawnPig()
        else:
            giveBone()
]
Thus, the problem of interpreting decision making logic from visual demonstrations can be reduced to extracting an underlying program. In fact, many neural network frameworks have recently been proposed for program induction or synthesis. First, a variety of frameworks (Kaiser & Sutskever, 2016; Reed & De Freitas, 2016; Xu et al., 2018; Devlin et al., 2017a) propose to induce latent representations of underlying programs. While they can be efficient at mimicking desired behaviors, they do not explicitly yield interpretable programs, resulting in inexplicable failure cases. On the other hand, another line of work (Devlin et al., 2017b; Bunel et al., 2018) directly synthesizes programs from input/output pairs, giving full interpretability. While successful, the limited information in input/output pairs restricts applicability to synthesizing programs with rich expressibility. Hence, in this paper, we develop a model that synthesizes programs from visually complex and sequential inputs that demonstrate more branching conditions and long term effects, increasing the complexity of the underlying programs.

¹ The illustrated environment is not tested in our experiments.

To this end, we develop a program synthesizer augmented with a summarizer module that is capable of encoding the interrelationship between multiple demonstrations and summarizing them into compact aggregated representations. In addition, to enable efficient end-to-end training, we introduce auxiliary tasks that encourage the model to learn the knowledge essential to inferring an underlying program. We extensively evaluate our model in two environments: a fully observable, third-person environment (Karel) and a partially observable, egocentric game (ViZDoom). Our experiments in both environments with a variety of settings show the strength of explicitly modeling programs for reasoning about underlying conditions and the necessity of the proposed components (the summarizer module and the auxiliary tasks).

In summary, we introduce a novel problem of program synthesis from diverse demonstration videos and a method to address it. This ability enables machines to explicitly interpret decision making logic and to interact with humans. We also demonstrate that our algorithm can synthesize programs reliably in multiple environments.

2. Related Work

Program Induction. Learning to perform a specific task by inducing a latent representation of the underlying task-specific program is known as program induction. Various approaches have been developed: designing end-to-end differentiable architectures (Graves et al., 2014; 2016; Zaremba & Sutskever, 2015; Kaiser & Sutskever, 2016; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Neelakantan et al., 2015), learning to call subprograms using step-by-step supervision (Reed & De Freitas, 2016; Cai et al., 2017), and few-shot program induction (Devlin et al., 2017a). Contrary to our work, those methods do not return explicit programs.

Program Synthesis. The line of work in program synthesis focuses on explicitly producing programs that are restricted to certain languages. Balog et al. (2017) train a model to predict program attributes and use external search algorithms for inductive program synthesis. Parisotto et al. (2017) and Devlin et al. (2017b) directly synthesize simple string transformation programs.
Bunel et al. (2018) employ reinforcement learning to directly optimize the execution of generated programs. However, those methods are limited to synthesizing programs from input-output pairs, which substantially restricts the expressibility of the programs that can be considered; instead, we address the problem of synthesizing programs from full demonstration videos.

Imitation Learning. The methods concerned with acquiring skills from expert demonstrations, dubbed imitation learning, can be split into behavioral cloning (Pomerleau, 1989; 1991; Ross et al., 2011), which casts the problem as a supervised learning task, and inverse reinforcement learning (Ng et al., 2000), which extracts an estimated reward function from demonstrations. Recently, Duan et al. (2017), Finn et al. (2017), and Xu et al. (2018) have studied the task of mimicking a given few demonstrations. This line of work can be considered program induction, as it imitates demonstrations without explicitly modeling underlying programs. While those methods are able to mimic a given few demonstrations, it is not clear whether they can deal with multiple demonstrations with diverse branching conditions.

3. Problem Overview

In this section, we define our formulation of program synthesis from diverse demonstration videos. We define programs in a domain specific language (DSL) with perception primitives, action primitives, and control flow. Action primitives define the ways agents can interact with an environment, while perception primitives describe how agents can perceive it. Control flow can include if/else statements, while loops, repeat statements, and simple logic operations. An example DSL, based on the language introduced in Pattis (1981), is shown in Figure 2. Note that we focus on perceptions with boolean types in this paper, although a more generic perception type constraint is possible.

Figure 2. Domain specific language for the program representation. The program is composed of domain dependent perception and action primitives and control flow:

    Program    m := def run() : s
    Statement  s := while(b) : (s) | s1; s2 | a | repeat(r) : (s)
                    | if(b) : (s) | ifelse(b) : (s1) else : (s2)
    Repetition r := number of repetitions
    Condition  b := percept | not b
    Perception p := domain dependent perception primitives
    Action     a := domain dependent action primitives

A program η is a deterministic function that outputs an action a ∈ A given a history of states at time step t, H_t = (s_1, s_2, ..., s_t), where s ∈ S is a state of the environment. The generation of an action given the history of states is represented as a_t = η(H_t). In this paper, we focus on programs that can be represented in the DSL by a code C = (w_1, w_2, ..., w_N), which consists of a sequence of tokens w.
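To make the formulation concrete, below is a minimal Python sketch (the names and environment API are hypothetical, not the paper's implementation) of a program η acting on the state history H_t and rolled out to produce a demonstration as formalized in the next paragraph.

```python
from typing import Callable, List, Tuple

State, Action = dict, str
EOF = "EOF"  # end-of-file action terminating a demonstration

def make_program() -> Callable[[List[State]], Action]:
    """An example eta for a Karel-like DSL: while frontIsClear(): move()"""
    def eta(history: List[State]) -> Action:
        s_t = history[-1]  # this program only inspects the latest state
        return "move" if s_t["frontIsClear"] else EOF
    return eta

def rollout(eta, s1: State, transition) -> List[Tuple[State, Action]]:
    """Generate one demonstration tau by iterating a_t = eta(H_t)."""
    history, demo = [s1], []
    while True:
        a = eta(history)                             # a_t = eta(H_t)
        demo.append((history[-1], a))
        if a == EOF:
            return demo
        history.append(transition(history[-1], a))  # s_{t+1} = T(s_t, a_t)
```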
A demonstration τ = ((s_1, a_1), (s_2, a_2), ..., (s_T, a_T)) is a sequence of state-action tuples generated by an underlying program η given an initial state s_1. Given an initial state s_1 and its corresponding state history H_1, the program generates a new action a_1 = η(H_1). The following state s_2 is generated by a state transition function T: s_2 ∼ T(s_1, a_1). The newly sampled state is incorporated into the state history, H_2 = H_1 ∪ (s_2), and this process is iterated until the end-of-file action EOF ∈ A is returned by the program. A set of demonstrations D = {τ_1, τ_2, ..., τ_K} can be generated by running a single program η on different initial states s_1^1, s_1^2, ..., s_1^K, where each initial state is sampled from an initial state distribution (i.e. s_1^k ∼ P_0(s_1)).

While we are interested in inferring a program η from a set of demonstrations D, it is preferable to predict a code C instead, because a code is a more accessible representation that is immediately convertible to a program. Formally, we formulate the problem as sequence prediction, where the input is a set of demonstrations D and the output is a code sequence Ĉ. Note that our objective is not to recover the code exactly, but to generate a code that captures the underlying program, i.e. one that models the diverse behaviors appearing in the demonstrations in an executable form.

4. Approach

Inferring the program behind a set of demonstrations requires (1) interpreting each demonstration video, (2) spotting and summarizing the differences among demonstrations to infer the conditions behind the taken actions, and (3) describing the resulting understanding of the demonstrations in a written language. Based on this intuition, we design a neural architecture composed of three components:

- Demonstration Encoder: receives a demonstration video as input and produces an embedding that captures an agent's actions and perception.
- Summarizer Module: discovers and summarizes where actions diverge between demonstrations and upon which branching conditions subsequent actions are taken.
- Program Decoder: represents the summarized understanding of the demonstrations as a code sequence.

The details of the three main components are described in Section 4.1, and the learning objective of the proposed model is described in Section 4.2. Section 4.3 introduces auxiliary tasks that encourage the model to learn the knowledge essential to inferring a program.

4.1. Model Architecture

Figure 3 illustrates the overall architecture of the proposed model. The details of each component are described in the following sections.

[Figure 3. Model Architecture. The demonstration encoder encodes each of the k demonstrations separately and the summarizer network aggregates them to construct a summary vector. The summary vector is used by the program decoder to produce program tokens sequentially. The encoded demonstrations are also used to decode the action sequences and perception conditions as additional supervision.]

4.1.1. Demonstration Encoder

The demonstration encoder receives a demonstration video as input and produces a latent vector that captures the actions and perception of an agent. At each time step t ∈ [1, T], to interpret the visual input, we employ a stack of convolutional layers to encode a state s_t into a state vector v_state^t = CNN_enc(s_t) ∈ R^d. Since the demonstration encoder needs to handle demonstrations with variable numbers of frames, we employ an LSTM (Long Short-Term Memory) (Hochreiter & Schmidhuber, 1997) to encode the state vectors into a summarized representation:

$$c_{\mathrm{enc}}^{t},\, h_{\mathrm{enc}}^{t} = \mathrm{LSTM}_{\mathrm{enc}}\big(v_{\mathrm{state}}^{t},\, c_{\mathrm{enc}}^{t-1},\, h_{\mathrm{enc}}^{t-1}\big), \tag{1}$$

where c_enc^t and h_enc^t denote the cell state and the hidden state. While the final state tuple (c_enc^T, h_enc^T) encodes the overall idea of the demonstration, the intermediate hidden states {h_enc^1, h_enc^2, ..., h_enc^T} contain a high level understanding of each state; both are used as inputs to the following modules. Note that these operations are applied to all K demonstrations, while the index k is dropped from the equations for simplicity.
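For reference, here is a minimal PyTorch sketch of the demonstration encoder; the frame size, channel widths, and embedding dimension d are placeholder assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class DemoEncoder(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(  # CNN_enc: per-frame state embedding
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d),
        )
        self.lstm = nn.LSTM(d, d, batch_first=True)  # LSTM_enc of Eq. (1)

    def forward(self, frames: torch.Tensor):
        # frames: (T, 3, H, W), one demonstration of variable length T
        v_state = self.cnn(frames)                   # (T, d) state vectors
        out, (h_n, c_n) = self.lstm(v_state.unsqueeze(0))
        h_all = out.squeeze(0)                       # per-step hidden states
        c_T, h_T = c_n[-1, 0], h_n[-1, 0]            # final state tuple
        return h_all, (c_T, h_T)
```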
4.1.2. Summarizer Module

Inferring an underlying program from demonstrations that exhibit different behaviors requires the ability to discover and summarize where actions diverge between demonstrations and upon which branching conditions subsequent actions are taken. The summarizer module first re-encodes each demonstration in the context of all encoded demonstrations to infer branching conditions, and then aggregates all encoded demonstration vectors to obtain a summarized representation. An illustration of the summarizer is shown in Figure 4.

The first summarization is performed by a reviewer module, an LSTM initialized with the average-pooled final state tuples of the demonstration encoder outputs:

$$c_{\mathrm{review}}^{0} = \frac{1}{K}\sum_{k=1}^{K} c_{\mathrm{enc}}^{T,k}, \qquad h_{\mathrm{review}}^{0} = \frac{1}{K}\sum_{k=1}^{K} h_{\mathrm{enc}}^{T,k}, \tag{2}$$

where (c_enc^{T,k}, h_enc^{T,k}) is the final state tuple of the k-th demonstration encoder. The reviewer LSTM then re-encodes the hidden states:

$$c_{\mathrm{review}}^{t,k},\, h_{\mathrm{review}}^{t,k} = \mathrm{LSTM}_{\mathrm{review}}\big(h_{\mathrm{enc}}^{t,k},\, c_{\mathrm{review}}^{t-1,k},\, h_{\mathrm{review}}^{t-1,k}\big), \tag{3}$$

where the final hidden state becomes a demonstration vector v_demo^k = h_review^{T,k} ∈ R^d, which contains the summarized information of a single demonstration.

The final summarization, performed across multiple demonstrations, is carried out by an aggregation module, which takes the K demonstration vectors and aggregates them into a single compact vector representation. To effectively model complex relations between demonstrations, we employ a relational network (RN) module (Santoro et al., 2017). The aggregation process is written as:

$$v_{\mathrm{summary}} = \mathrm{RN}\big(v_{\mathrm{demo}}^{1}, \dots, v_{\mathrm{demo}}^{K}\big) = \frac{1}{K^2}\sum_{i,j} g_{\theta}\big(v_{\mathrm{demo}}^{i}, v_{\mathrm{demo}}^{j}\big), \tag{4}$$

where v_summary ∈ R^d is the summarized demonstration vector and g_θ is an MLP parameterized by θ, jointly trained with the summarizer module.

[Figure 4. Summarizer Module. The demonstration encoder (inner layer) encodes each demonstration starting from a zero state. The summarizer module (outer layer) aggregates the outputs of the demonstration encoder with a relation network to provide context from the other demonstrations.]

We show in Section 5 that employing the summarizer module significantly alleviates the difficulty of handling multiple demonstrations and improves generalization over different numbers of demonstrations.
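The pairwise aggregation of Eq. 4 can be sketched as follows; this is a minimal PyTorch version assuming g_θ is a small two-layer MLP with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class RelationAggregator(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.g_theta = nn.Sequential(  # g_theta: pairwise relation MLP
            nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, v_demo: torch.Tensor) -> torch.Tensor:
        K, d = v_demo.shape                       # (K, d) demonstration vectors
        vi = v_demo.unsqueeze(1).expand(K, K, d)  # enumerate all (i, j) pairs
        vj = v_demo.unsqueeze(0).expand(K, K, d)
        pairs = torch.cat([vi, vj], dim=-1)       # (K, K, 2d)
        # v_summary = (1 / K^2) * sum_{i,j} g_theta(v_i, v_j)
        return self.g_theta(pairs).mean(dim=(0, 1))
```

Averaging over both pair dimensions realizes the 1/K² normalization, which also makes the summary invariant to the ordering of the demonstrations.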
4.1.3. Program Decoder

The program decoder synthesizes a program from the summarized representation of all the demonstrations. We use an LSTM decoder similar to Sutskever et al. (2014) and Vinyals et al. (2015). Initialized with the summary vector v_summary, the LSTM at each time step takes the previous token embedding as input and outputs a probability distribution over the next program token, as in Eq. 5. During training, the previous ground truth token is fed as input; during inference, the token predicted at the previous step is fed as input.

4.2. Learning

The proposed model learns a conditional distribution between a set of demonstrations D and the corresponding code C = (w_1, w_2, ..., w_N). With the LSTM program decoder, this problem becomes autoregressive sequence prediction (Sutskever et al., 2014). Given the demonstrations and the previous code tokens, our model is trained to predict the next ground truth token by minimizing the cross entropy loss:

$$\mathcal{L}_{\mathrm{code}} = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{n=1}^{N} \log p\big(w_{m,n} \mid W_{m,n-1}, D_m\big), \tag{5}$$

where M is the total number of training examples, w_{m,n} is the n-th token of the m-th training example, D_m are the m-th training demonstrations, and W_{m,n} = {w_{m,1}, ..., w_{m,n}} is the history of previous token inputs at time step n.

4.3. Multi-task Objective

To reason about the underlying program of a set of demonstrations, the primary and essential step is recognizing the actions and perceptions occurring at each step of a demonstration. However, it can be difficult to learn meaningful representations solely from the sequence loss on programs when the environment is visually complex. To alleviate this issue, we propose to predict action sequences and perception vectors from the demonstrations as auxiliary tasks. An overview of the auxiliary tasks is illustrated in Figure 3.

Predicting action sequences. Given a demonstration vector v_demo^k encoded by the summarizer, an action decoder LSTM produces a sequence of actions. During training, a sequential cross entropy loss similar to Eq. 5 is optimized:

$$\mathcal{L}_{\mathrm{action}} = -\frac{1}{MKT}\sum_{m=1}^{M}\sum_{k=1}^{K}\sum_{t=1}^{T} \log p\big(a_{m,t}^{k} \mid A_{m,t-1}^{k}, v_{\mathrm{demo}}^{k}\big), \tag{6}$$

where a_{m,t}^k is the t-th action token in the k-th demonstration of the m-th training example, and A_{m,t}^k = {a_{m,1}^k, ..., a_{m,t}^k} is the history of previous actions at time step t.

Predicting perceptions. We denote a perception vector Φ = {φ_1, ..., φ_L} ∈ {0, 1}^L as an L-dimensional binary vector obtained by executing the L perception primitives (e.g. frontIsClear()) on a given state s. We formulate perception vector prediction as a sequential multi-label binary classification problem and optimize the binary cross entropy:

$$\mathcal{L}_{\mathrm{perception}} = -\frac{1}{MKT}\sum_{m=1}^{M}\sum_{k=1}^{K}\sum_{t=1}^{T}\sum_{l=1}^{L} \log p\big(\phi_{m,t,l}^{k} \mid P_{m,t-1}^{k}, v_{\mathrm{demo}}^{k}\big),$$

where P_{m,t}^k = {f(Φ_{m,1}^k), ..., f(Φ_{m,t}^k)} is the history of encoded previous perception vectors and f(·) is an encoding function.

The aggregated multi-task objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{code}} + \alpha\,\mathcal{L}_{\mathrm{action}} + \beta\,\mathcal{L}_{\mathrm{perception}},$$

where α and β are hyperparameters controlling the importance of each loss. We set α = β = 1 to weight the objectives equally in all experiments.
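A compact sketch of the aggregated objective, assuming the decoders' logits and ground truth targets are already computed (the shapes, names, and absence of padding masks are simplifying assumptions):

```python
import torch
import torch.nn.functional as F

def multitask_loss(code_logits, code_targets,       # (M, N, V_code), (M, N)
                   action_logits, action_targets,   # (M, K, T, V_act), (M, K, T)
                   percept_logits, percept_targets,  # (M, K, T, L) each
                   alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # L_code: cross entropy over program tokens (Eq. 5)
    l_code = F.cross_entropy(code_logits.flatten(0, -2), code_targets.flatten())
    # L_action: cross entropy over per-demonstration action sequences (Eq. 6)
    l_action = F.cross_entropy(action_logits.flatten(0, -2), action_targets.flatten())
    # L_perception: multi-label binary cross entropy over perception primitives
    l_percept = F.binary_cross_entropy_with_logits(percept_logits,
                                                   percept_targets.float())
    # L = L_code + alpha * L_action + beta * L_perception
    return l_code + alpha * l_action + beta * l_percept
```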
5. Experiments

We perform experiments in two environments, Karel (Pattis, 1981) and ViZDoom (Kempka et al., 2016). We first describe the experimental setup and then present the results.

5.1. Evaluation Metric

To verify whether a model is able to infer an underlying program η* from a given set of demonstrations D*, we evaluate accuracy based on the synthesized code and the underlying program (sequence accuracy and program accuracy) as well as the execution of the program (execution accuracy).

Sequence accuracy. Comparison in the code space is based on the instantiated code C* of a ground truth program and the synthesized code Ĉ from a program synthesizer. The sequence accuracy counts exact matches of the two code sequences:

$$\mathrm{Acc}_{\mathrm{seq}} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}_{\mathrm{seq}}\big(C_m^{*}, \hat{C}_m\big),$$

where M is the number of testing examples and 1_seq(·, ·) is the indicator function of exact sequence match.

Program accuracy. While the sequence accuracy is simple, it is a pessimistic estimate of program accuracy, since it does not account for program aliasing: different codes can have identical program semantics (e.g. repeat(2):(move()) and move() move()). Therefore, we measure the program accuracy by enumerating variations of codes. Specifically, we exploit the syntax of the DSL to identify variations, e.g. unfolding repeat statements, decomposing an if-else statement into two if statements, etc. Formally, the program accuracy is

$$\mathrm{Acc}_{\mathrm{program}} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}_{\mathrm{prog}}\big(C_m^{*}, \hat{C}_m\big),$$

where 1_prog(C_m^*, Ĉ_m) is an indicator function that returns 1 if any variation of Ĉ_m matches any variation of C_m^*. Note that the program accuracy is only computable when the DSL is relatively simple and some assumptions are made, i.e. termination of loops. The details of computing program accuracy are presented in the supplementary material.

Execution accuracy. To evaluate how well a synthesized program captures the behaviors of an underlying program, we compare the execution results D̂ of the synthesized program code Ĉ against the demonstrations D* generated by the ground truth program η*, where both are generated from the same set of sampled initial states I_K = {s_1^1, ..., s_1^K}. We formally define the execution accuracy as

$$\mathrm{Acc}_{\mathrm{execution}} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}_{\mathrm{execution}}\big(D_m^{*}, \hat{D}_m\big),$$

where 1_execution(D_m^*, D̂_m) is the indicator function of exact demonstration match. Note that as the number of sampled initial states grows infinitely large, the execution accuracy converges to the program accuracy.
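The two directly computable metrics can be sketched as below (assuming codes are token lists and each entry of the demonstration lists contains all rollouts of one program; the helper names are hypothetical):

```python
from typing import List, Sequence

def sequence_accuracy(gt_codes: List[Sequence[str]],
                      pred_codes: List[Sequence[str]]) -> float:
    # Acc_seq: exact token-sequence match between C*_m and C_hat_m
    hits = sum(list(g) == list(p) for g, p in zip(gt_codes, pred_codes))
    return hits / len(gt_codes)

def execution_accuracy(gt_demos: List[list], pred_demos: List[list]) -> float:
    # Acc_execution: counted per program, so all demonstrations generated
    # from the shared initial states must match exactly
    hits = sum(g == p for g, p in zip(gt_demos, pred_demos))
    return hits / len(gt_demos)
```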
5.2. Evaluation Setting

For training and evaluation, we collect M_train training programs and M_test test programs. Each program code C_m^* is randomly sampled from an environment-specific DSL and compiled into an executable form η_m^*. The corresponding demonstrations D_m^* = {τ_1^*, ..., τ_K^*} are generated by running the program on K = K_seen + K_unseen different initial states. The seen demonstrations are used as input to the program synthesizer, and the unseen demonstrations are used for computing execution accuracy. We train our model on the training set Ω_train = {(C_1^*, D_1^*), ..., (C_{M_train}^*, D_{M_train}^*)} and test it on the testing set Ω_test = {(C_1^*, D_1^*), ..., (C_{M_test}^*, D_{M_test}^*)}; Ω_train and Ω_test are disjoint. Both sequence and execution accuracies are used for the evaluation. The training details are described in the supplementary material.

5.3. Baselines

We compare our proposed model (ours) against baselines to evaluate the effectiveness of (1) explicitly modeling the underlying programs and (2) the proposed summarizer module and multi-task objective. To address (1), we design a program induction baseline based on Duan et al. (2017), which bypasses synthesizing programs and directly predicts action sequences. We modified the architecture to incorporate multiple demonstrations as well as pixel inputs; the details are presented in the supplementary material. For a fair comparison with our model, which gets supervision on perception primitives, we feed the perception primitive vector of every frame as an input to the induction baseline. To verify (2), we compose a program synthesis baseline consisting only of a demonstration encoder and a program decoder, without the summarizer module and the multi-task loss. To integrate the demonstration encoder outputs across demonstrations, an average pooling layer is applied.

5.4. Karel

We first focus on a visually simple environment to verify the feasibility of program synthesis from demonstrations. We consider Karel (Pattis, 1981), featuring an agent that navigates through a gridworld with walls and interacts with markers based on the underlying program.

5.4.1. Environment and Dataset

Karel has 5 action primitives for moving and interacting with markers and 5 perception primitives for detecting obstacles and markers. A gridworld of size 8×8 is used for our experiments. To evaluate the generalization of the program synthesizer to novel programs, we randomly generate 35,000 unique programs and split them into a training set of 25,000 programs, a validation set of 5,000 programs, and a testing set of 5,000 programs. The maximum length of the program codes is 43. For each program, 10 seen demonstrations and 5 unseen demonstrations are generated. The maximum length of the demonstrations is 20.

5.4.2. Performance Evaluation

The evaluation results of our proposed model and the baselines are shown in Table 1. The synthesis baseline outperforms the induction baseline on execution accuracy, which shows the advantage of explicitly modeling the underlying programs. The induction baseline often matches some of the K_unseen demonstrations but fails to match all of them from a single program. This observation is supported by the number in parentheses (69.1%), which counts correct individual demonstrations, whereas the execution accuracy counts programs whose demonstrations all match perfectly. This finding has also been reported in Devlin et al. (2017b).

Table 1. Performance evaluation in the Karel environment. The synthesis baseline outperforms the induction baseline, and the summarizer module and the multi-task objective introduce significant improvements.

| Methods | Execution | Program | Sequence |
| --- | --- | --- | --- |
| Induction baseline | 62.8% (69.1%) | - | - |
| Synthesis baseline | 64.1% | 42.4% | 35.7% |
| + summarizer (ours) | 68.6% | 45.3% | 38.3% |
| + multi-task loss (ours-full) | 72.1% | 48.9% | 41.0% |

The proposed model shows consistent improvement over the synthesis baseline on all evaluation metrics. The sequence accuracy of our full model is 41.0%, which is reasonable generalization performance given that none of the test programs are seen during training. We observe that our model often synthesizes programs that do not exactly match the ground truth program but are semantically identical. For example, given a ground truth program repeat(4):(turnLeft; turnLeft; turnLeft), our model predicts repeat(12):(turnLeft). These cases are counted as correct for program accuracy. Note that comparisons based on the execution and sequence accuracies are consistent with the program accuracy, which justifies using them as proxies when the program accuracy is not computable.

Qualitative success and failure cases of the proposed model are shown in Figure 5: (a) shows a correct case, where a single synthesized program generates diverse action sequences; (b) shows a failure case, where part of the ground truth program is not generated because no seen demonstration hits the corresponding branch condition.

[Figure 5. Karel results, showing underlying and synthesized programs with their executions. Seen training examples are on the top row (blue) and unseen testing examples are on the bottom row (green). (a) A successful case with a program sequence match. (b) Because a branch condition is never executed in the seen demonstrations (top images), the synthesized program omits the condition, resulting in an execution mismatch in the lower-right testing image.]
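For intuition, here is a sketch of one such code variation, unfolding repeat statements so that aliased codes such as repeat(4):(turnLeft; turnLeft; turnLeft) and repeat(12):(turnLeft) compare equal after expansion; the flat token format is our own simplification, not the paper's actual checker.

```python
from typing import List

def unfold_repeats(tokens: List[str]) -> List[str]:
    """Expand `repeat n ( body )` into n copies of the body, recursively."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "repeat":
            n = int(tokens[i + 1])        # repetition count; tokens[i+2] == "("
            depth, j = 1, i + 3
            while depth:                  # scan to the matching ")"
                depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                j += 1
            out.extend(unfold_repeats(tokens[i + 3:j - 1]) * n)
            i = j
        else:
            out.append(tokens[i])
            i += 1
    return out

# Both aliased codes expand to twelve `turnLeft` tokens:
assert unfold_repeats("repeat 4 ( turnLeft turnLeft turnLeft )".split()) == \
       unfold_repeats("repeat 12 ( turnLeft )".split())
```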
5.4.3. Effect of Summarizer

To verify the effectiveness of the proposed summarizer module, we conduct experiments where models are trained on varying numbers of demonstrations and compare the execution accuracies in Table 2. As the number of demonstrations increases, both models enjoy a performance gain due to the extra available information. However, the gap between our proposed model and the synthesis baseline also grows, which demonstrates the effectiveness of the summarizer module.

Table 2. Effect of the summarizer module, measured by execution accuracy with k seen demonstrations. Employing the proposed summarizer module brings a larger improvement over the synthesis baseline as the number of seen demonstrations increases.

| Methods | k=3 | k=5 | k=10 |
| --- | --- | --- | --- |
| Synthesis baseline | 58.5% | 60.1% | 64.1% |
| + summarizer (ours) | 60.6% | 63.1% | 68.6% |
| Improvement | 2.1% | 3.0% | 4.5% |

5.5. ViZDoom

Doom is a 3D first-person shooter game where a player can move in a continuous space and interact with monsters, items, and weapons. We use ViZDoom (Kempka et al., 2016), an open-source Doom-based AI platform, for our experiments. ViZDoom's increased visual complexity and richer DSL test the limits of models in state comprehension, demonstration summarization, and program synthesis.

5.5.1. Environment and Dataset

The ViZDoom environment has 7 action primitives, including diverse motions and an attack, as well as 6 perception primitives checking the existence of different monsters and whether they are targeted. Each state is represented by an image of 120×160×3 pixels. For each demonstration, the initial state is sampled by randomly spawning different types of monsters and ammo in different locations and placing the agent randomly. To ensure that a program's behavior always yields the same execution, we control the environment to be deterministic. We generate 80,000 training programs and 8,000 testing programs. To encourage diverse behaviors in the generated programs, we give a higher sampling rate to the perception primitives that have higher entropy over K different initial states. We use 25 seen demonstrations for program synthesis and 10 unseen demonstrations for the execution accuracy measure. The maximum length of the programs is 32 and the maximum length of the demonstrations is 20.
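One plausible realization of this entropy-based sampling, sketched under our own assumptions since the paper does not spell out the exact procedure:

```python
import math
from typing import List

def bernoulli_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def primitive_weights(percepts: List[List[bool]]) -> List[float]:
    """percepts[k][l]: truth value of perception primitive l in initial state k.

    Primitives whose values vary more across the K initial states (higher
    entropy) receive a higher sampling rate during program generation.
    """
    K, L = len(percepts), len(percepts[0])
    weights = [bernoulli_entropy(sum(s[l] for s in percepts) / K)
               for l in range(L)]
    total = sum(weights) or 1.0
    return [w / total for w in weights]
```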
5.5.2. Performance Evaluation

Table 3 shows the results in the ViZDoom environment. The synthesis baseline outperforms the induction baseline in terms of execution accuracy, which shows the strength of program synthesis for understanding diverse demonstrations. In addition, the proposed summarizer module and multi-task objective bring improvements on all evaluation metrics. We also found that the syntax of the synthesized programs is about 99.9% accurate, which indicates that the program synthesizer correctly learns the syntax of the DSL.

Table 3. Performance evaluation in the ViZDoom environment. The proposed model outperforms the induction and synthesis baselines significantly as the environment becomes more visually complex.

| Methods | Execution | Program | Sequence |
| --- | --- | --- | --- |
| Induction baseline | 35.1% (60.6%) | - | - |
| Synthesis baseline | 48.2% | 39.9% | 33.1% |
| Ours-full | 78.4% | 62.5% | 53.2% |

Figure 6 shows a qualitative result: the generated program successfully covers the different conditional behaviors in the demonstrations. In this example, the synthesized program does not match the underlying program in the code space, while matching it in the program space.

[Figure 6. ViZDoom results. Annotations below the frames show the perceived conditions and actions; Hell Knight, Revenant, and Demon monsters are white, black, and pink, respectively. The model correctly perceives the conditions and actions and synthesizes a precise program. The underlying program is

    if inTargetHellKnight():
        attack()
    if inTargetDemon():
        attack()
    else:
        moveRight()

and the synthesized program is

    if inTargetHellKnight():
        attack()
    if not inTargetDemon():
        moveRight()
    else:
        attack()

Note that the synthesized and underlying programs are semantically identical.]

5.5.3. Analysis

To verify the importance of inferring underlying conditions, we perform an evaluation using only programs containing a single if-else statement with two branching consequences. This setting is sufficiently simple to isolate the other factors that might affect the evaluation result. For this experiment, we use 25 seen demonstrations to understand a behavior and 10 unseen demonstrations for testing. The results are shown in Table 4. The induction baseline has difficulty inferring the underlying condition and fails to match all unseen demonstrations most of the time. In addition, our proposed model outperforms the synthesis baseline, which demonstrates the effectiveness of the summarizer module and the multi-task objective.

Table 4. If-else experiment in the ViZDoom environment. A single if-else statement with two branching consequences is used to evaluate the ability to infer underlying conditions.

| Methods | Execution | Program | Sequence |
| --- | --- | --- | --- |
| Induction baseline | 26.5% (83.1%) | - | - |
| Synthesis baseline | 59.9% | 44.4% | 36.1% |
| Ours-full | 89.4% | 69.1% | 58.8% |

Figure 7 illustrates how models trained with a fixed number (25) of seen demonstrations generalize to fewer or more seen demonstrations at test time. Our model and the synthesis baseline are able to leverage more seen demonstrations to synthesize more accurate programs, while achieving reasonable performance when fewer demonstrations are given. In contrast, the induction baseline cannot exploit more than 10 demonstrations well.

[Figure 7. Generalization over different numbers of seen demonstrations K_seen. The baseline models and our model, all trained with 25 seen demonstrations, are evaluated with fewer or more seen demonstrations.]

5.5.4. Debugging the Synthesized Program

One of the intriguing properties of program synthesis is that synthesized programs can be interpreted and edited by humans. This makes it possible to debug a synthesized program and fix minor mistakes to correct its behavior. To verify this idea, we use the edit distance between the synthesized program and the ground truth program as the minimum number of token edits required to obtain an exactly correct program. In this setting, we found that fixing at most 2 program tokens provides a 4.9% improvement in sequence accuracy and a 4.1% improvement in execution accuracy.
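This debugging measurement can be sketched as a token-level Levenshtein distance (a standard dynamic program; under this metric, the result above corresponds to counting predictions within distance 2 of the ground truth):

```python
from typing import Sequence

def token_edit_distance(pred: Sequence[str], gt: Sequence[str]) -> int:
    """Minimum number of token insertions/deletions/substitutions."""
    m, n = len(pred), len(gt)
    dp = list(range(n + 1))        # dp[j] = distance(pred[:0], gt[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i     # prev holds distance(pred[:i-1], gt[:j-1])
        for j in range(1, n + 1):
            cur = dp[j]
            if pred[i - 1] == gt[j - 1]:
                dp[j] = prev
            else:
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[n]
```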
6. Conclusion

We propose the task of synthesizing a program from diverse demonstration videos. To address it, we introduce a model augmented with a summarizer module to deal with branching conditions and a multi-task objective to induce meaningful latent representations. Our method is evaluated in a fully observable, third-person environment (Karel) and a partially observable, egocentric game (ViZDoom). The experiments demonstrate that the proposed model is able to reliably infer underlying programs and achieves satisfactory performance.

Acknowledgments

We thank the anonymous ICML reviewers for insightful comments. This project was supported by the Center for Superintelligence, Kakao Brain, and SKT.

References

Balog, Matej, Gaunt, Alexander L., Brockschmidt, Marc, Nowozin, Sebastian, and Tarlow, Daniel. DeepCoder: Learning to write programs. In International Conference on Learning Representations, 2017.

Bunel, Rudy R., Hausknecht, Matthew, Devlin, Jacob, Singh, Rishabh, and Kohli, Pushmeet. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018.

Cai, Jonathon, Shin, Richard, and Song, Dawn. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations, 2017.

Devlin, Jacob, Bunel, Rudy R., Singh, Rishabh, Hausknecht, Matthew, and Kohli, Pushmeet. Neural program meta-induction. In Neural Information Processing Systems, 2017a.

Devlin, Jacob, Uesato, Jonathan, Bhupatiraju, Surya, Singh, Rishabh, Mohamed, Abdel-rahman, and Kohli, Pushmeet. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning, 2017b.

Duan, Yan, Andrychowicz, Marcin, Stadie, Bradly, Ho, Jonathan, Schneider, Jonas, Sutskever, Ilya, Abbeel, Pieter, and Zaremba, Wojciech. One-shot imitation learning. In Neural Information Processing Systems, 2017.

Finn, Chelsea, Yu, Tianhe, Zhang, Tianhao, Abbeel, Pieter, and Levine, Sergey. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, 2017.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to transduce with unbounded memory. In Neural Information Processing Systems, 2015.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring algorithmic patterns with stack-augmented recurrent nets. In Neural Information Processing Systems, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. In International Conference on Learning Representations, 2016.

Kempka, Michał, Wydmuch, Marek, Runc, Grzegorz, Toczek, Jakub, and Jaśkowski, Wojciech. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games, 2016.

Neelakantan, Arvind, Le, Quoc V., and Sutskever, Ilya. Neural programmer: Inducing latent programs with gradient descent. In International Conference on Learning Representations, 2015.

Ng, Andrew Y., Russell, Stuart J., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, 2000.

Parisotto, Emilio, Mohamed, Abdel-rahman, Singh, Rishabh, Li, Lihong, Zhou, Dengyong, and Kohli, Pushmeet. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.
Pattis, Richard E. Karel the Robot: A Gentle Introduction to the Art of Programming. John Wiley & Sons, Inc., 1981.

Pomerleau, Dean A. ALVINN: An autonomous land vehicle in a neural network. In Neural Information Processing Systems, 1989.

Pomerleau, Dean A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991.

Reed, Scott and De Freitas, Nando. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.

Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.

Santoro, Adam, Raposo, David, Barrett, David G., Malinowski, Mateusz, Pascanu, Razvan, Battaglia, Peter, and Lillicrap, Tim. A simple neural network module for relational reasoning. In Neural Information Processing Systems, 2017.

Simonyan, Karen and Zisserman, Andrew. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems, 2014.

Srivastava, Nitish, Mansimov, Elman, and Salakhutdinov, Ruslan. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.

Venugopalan, Subhashini, Rohrbach, Marcus, Donahue, Jeffrey, Mooney, Raymond, Darrell, Trevor, and Saenko, Kate. Sequence to sequence - video to text. In International Conference on Computer Vision, 2015.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In Conference on Computer Vision and Pattern Recognition, 2015.

Xu, Danfei, Nair, Suraj, Zhu, Yuke, Gao, Julian, Garg, Animesh, Fei-Fei, Li, and Savarese, Silvio. Neural task programming: Learning to generalize across hierarchical tasks. In International Conference on Robotics and Automation, 2018.

Zaremba, Wojciech and Sutskever, Ilya. Reinforcement learning neural Turing machines - revised. arXiv preprint arXiv:1505.00521, 2015.