Learning Neurosymbolic Generative Models via Program Synthesis

Halley Young 1, Osbert Bastani 1, Mayur Naik 1

1 University of Pennsylvania, USA. Correspondence to: Halley Young.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Generative models have become significantly more powerful in recent years. However, these models continue to have difficulty capturing global structure in data. For example, images of buildings typically contain spatial patterns such as windows repeating at regular intervals, but state-of-the-art models have difficulty generating these patterns. We propose to address this problem by incorporating programs representing global structure into generative models (e.g., a 2D for-loop may represent a repeating pattern of windows), along with a framework for learning these models by leveraging program synthesis to obtain training data. On both synthetic and real-world data, we demonstrate that our approach substantially outperforms the state-of-the-art at both generating and completing images with global structure.

1. Introduction

There has been much interest recently in generative models, following the introduction of both variational autoencoders (VAEs) (Kingma & Welling, 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). These models have successfully been applied to a range of tasks, including image generation (Radford et al., 2015), image completion (Iizuka et al., 2017), texture synthesis (Jetchev et al., 2017; Xian et al., 2018), sketch generation (Ha & Eck, 2017), and music generation (Dieleman et al., 2018).

Despite their successes, generative models still have difficulty capturing global structure. For example, consider the image completion task in Figure 1. The original image (left) is of a building, for which the global structure is a 2D repeating pattern of windows. Given a partial image (middle left), the goal is to predict the completion of the image. As can be seen, a state-of-the-art image completion algorithm has trouble reconstructing the original image (right) (Iizuka et al., 2017). Real-world data often contains such global structure, including repetitions, reflectional or rotational symmetry, or even more complex patterns.

In recent years, program synthesis (Solar-Lezama et al., 2006) has emerged as a promising approach to capturing patterns in data (Ellis et al., 2015; 2018; Valkov et al., 2018). The idea is that simple programs can capture global structure that evades state-of-the-art deep neural networks. A key benefit of using program synthesis is that we can design the space of programs to capture different kinds of structure (e.g., repeating patterns (Ellis et al., 2018), symmetries, or spatial structure (Deng et al., 2018)), depending on the application domain. The challenge is that, for the most part, existing approaches have synthesized programs that operate directly over raw data. Since programs have difficulty operating over perceptual data, existing approaches have largely been limited to very simple data, e.g., detecting 2D repeating patterns of simple shapes (Ellis et al., 2018).

We propose to address these shortcomings by synthesizing programs that represent the underlying structure of high-dimensional data.
In particular, we decompose programs into two parts: (i) a sketch s ∈ S that represents the skeletal structure of the program (Solar-Lezama et al., 2006), with holes that are left unimplemented, and (ii) components c ∈ C that can be used to fill these holes. We consider perceptual components, i.e., holes in the sketch are filled with raw perceptual data. For example, a program whose sketch is a for-loop that draws a sub-image at regularly spaced grid positions, and whose hole is filled with a window sub-image taken from the given partial image, represents part of the structure in the original image x in Figure 1 (left): the for-loop is the sketch, and the window sub-image is the perceptual component. Together, we call such a program a neurosymbolic program.

Building on these ideas, we propose an approach called Synthesis-guided Generative Models (SGM) that combines neurosymbolic programs representing global structure with state-of-the-art deep generative models. By incorporating programmatic structure, SGM substantially improves the quality of these models. As can be seen, the completion produced using SGM (middle right of Figure 1) substantially outperforms the state-of-the-art.

Figure 1. The task is to complete the partial image x_part (middle left) into an image that is close to the original image x (left). By incorporating programmatic structure, our approach (middle right) substantially outperforms the state-of-the-art (Iizuka et al., 2017) (right).

SGM can be used for both generation and completion. The generation pipeline is shown in Figure 2. At a high level, SGM for generation operates in two phases. First, it generates a program that represents the global structure in the image to be generated; in particular, it generates a program P = (s, c) representing the latent global structure in the image (left in Figure 2), where s is a sketch and c is a perceptual component. Second, our algorithm executes P to obtain a structure rendering x_struct representing the program as an image (middle of Figure 2), and then uses a deep generative model to complete x_struct into a full image (right of Figure 2). The structure in x_struct helps guide the deep generative model towards images that preserve the global structure. The image completion pipeline (see Figure 3) is similar.

Training these models end-to-end is challenging, since a priori, ground truth global structure is unavailable. To address this shortcoming, we leverage domain-specific program synthesis algorithms to produce examples of programs that represent global structure of the training data. In particular, we propose a synthesis algorithm tailored to the image domain, which extracts programs with nested for-loops that can represent multiple 2D repeating patterns in images. Then, we use these example programs as supervised training data.

Our programs can capture rich spatial structure in the training data. For example, in Figure 2, the program structure encodes a repeating pattern of 0's and 2's on the whole image, and a separate repeating pattern of 3's on the right-hand side of the image. Furthermore, in Figure 1, the generated image captures the idea that the repeating pattern of windows does not extend to the bottom portion of the image.
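To make this representation concrete, the following is a minimal Python sketch, ours rather than the paper's implementation, of a neurosymbolic program as a list of (sketch, component) pairs, together with its execution into a structure rendering; the for-loop form follows the grammar in Section 3, and the class names and example parameters are purely illustrative.

```python
import numpy as np

class Sketch:
    """A 2D for-loop skeleton: draw(a*i + b, a2*j + b2, ??) for i in 1..n, j in 1..n2.
    The hole ?? is later filled by a perceptual component (an M x M sub-image)."""
    def __init__(self, n, a, b, n2, a2, b2):
        self.n, self.a, self.b = n, a, b        # rows a*i + b, i = 1..n
        self.n2, self.a2, self.b2 = n2, a2, b2  # columns a2*j + b2, j = 1..n2

    def cover(self):
        """Grid cells (t, u) that the loop draws on when executed."""
        return {(self.a * i + self.b, self.a2 * j + self.b2)
                for i in range(1, self.n + 1) for j in range(1, self.n2 + 1)}

def render(program, N, M):
    """Execute a program (list of (sketch, component) pairs) into an NM x NM structure rendering."""
    canvas = np.zeros((N * M, N * M, 3))
    for sketch, component in program:
        for t, u in sketch.cover():  # grid cells are 1-indexed
            canvas[(t - 1) * M : t * M, (u - 1) * M : u * M] = component
    return canvas

# Illustrative example: a "window" repeated every other grid cell, as in a building facade.
window = 0.8 * np.ones((16, 16, 3))
program = [(Sketch(n=4, a=2, b=1, n2=4, a2=2, b2=1), window)]
x_struct = render(program, N=9, M=16)  # 144 x 144 x 3 structure rendering
```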
Contributions. We propose an architecture of generative models that incorporates programmatic structure, as well as an algorithm for training these models (Section 2). Our learning algorithm depends on a domain-specific program synthesis algorithm for extracting global structure from the training data; we propose such an algorithm for the image domain (Section 3). Finally, we evaluate our approach on synthetic data and on a real-world dataset of building facades (Tyleček & Šára, 2013), both on the task of generation from scratch and on generation from a partial image. We show that our approach substantially outperforms several state-of-the-art deep generative models (Section 4).

Related work. There has been growing interest in applying program synthesis to machine learning, e.g., for small data (Liang et al., 2010), interpretability (Wang & Rudin, 2015; Verma et al., 2018), safety (Bastani et al., 2018), and lifelong learning (Valkov et al., 2018). Most relevantly, there has been interest in using programs to capture structure that deep learning models have difficulty representing (Lake et al., 2015; Ellis et al., 2015; 2018; Pu et al., 2018). For instance, Ellis et al. (2015) proposes an unsupervised learning algorithm for capturing repeating patterns in simple line drawings; however, not only are their domains simple, but they can only handle a very small amount of noise. Similarly, Ellis et al. (2018) captures 2D repeating patterns of simple circles and polygons; however, rather than synthesizing programs with perceptual components, they learn a simple mapping from images to symbols as a preprocessing step. The closest work we are aware of is Valkov et al. (2018), which synthesizes programs with neural components (i.e., components implemented as neural networks); however, their application is to lifelong learning, not generation, and to learning with supervision (labels) rather than to unsupervised learning of structure.

There has been related work on synthesizing probabilistic programs (Hwang et al., 2011; Perov & Wood, 2014), including applications to learning structured ranking functions (Nori et al., 2015) and to learning design patterns (Talton et al., 2012). More recently, DeepProbLog (Manhaeve et al., 2018) has extended the probabilistic logic programming language ProbLog (De Raedt et al., 2007) to include learned neural components. Additionally, there has been work extending neural module networks (Andreas et al., 2016) to generative models (Deng et al., 2018). These algorithms essentially learn a collection of neural components that can be composed together based on hierarchical structure. However, they require that the structure be available (albeit in natural language form) both for training the model and for generating new images. Finally, there has been work incorporating spatial structure into models for generating textures (Jetchev et al., 2017); however, their work only handles a single infinite repeating 2D pattern.
In contrast, we can capture a rich variety of spatial patterns parameterized by a space of programs; e.g., the image in Figure 1 generated by our approach contains different repeating patterns in different parts of the image.

2. Generative Models with Latent Structure

We describe our proposed architecture for generative models that incorporate programmatic structure. For most of this section, we focus on generation; we discuss how we adapt these techniques to image completion at the end. We illustrate our generation pipeline in Figure 2.

Figure 2. SGM generation pipeline: (i) Sample a latent vector z ~ p(z), and sample a program P = (s, c) ~ p_φ(s, c | z) (left: a single sampled for-loop). (ii) Execute P to obtain a structure rendering (middle left: the rendering of the single for-loop shown on the left; middle right: the structure rendering x_struct). (iii) Sample a completion x̂ ~ p_θ(x | s, c) of x_struct into a full image (right).

Let p_{θ,φ}(x) be a distribution over a space X with unknown parameters θ, φ that we want to estimate. We study the setting where x is generated based on some latent structure, which consists of a program sketch s ∈ S and a perceptual component c ∈ C, and where the structure is in turn generated conditioned on a latent vector z ∈ Z, i.e.,

$$p_{\theta,\phi}(x) = \int p_\theta(x \mid s, c)\, p_\phi(s, c \mid z)\, p(z)\, ds\, dc\, dz.$$

Figure 2 shows an example of a sampled program P = (s, c) ~ p_φ(s, c | z) (left) and a sampled completion x̂ ~ p_θ(x | s, c) (right). To sample a completion, our model executes P to obtain a structure rendering x_struct = eval(P) (middle), and then uses p_θ(x | s, c) = p_θ(x | x_struct). We now describe our algorithm for learning the parameters θ, φ of p_{θ,φ}, followed by a description of our choices of architecture for p_φ(s, c | z) and p_θ(x | s, c).

Learning algorithm. Given training data {x^(i)}_{i=1}^n ⊆ X, where x^(i) ~ p_{θ,φ}(x), the maximum likelihood estimate is

$$\theta^*_{\mathrm{MLE}}, \phi^*_{\mathrm{MLE}} = \arg\max_{\theta,\phi} \sum_{i=1}^n \log p_{\theta,\phi}(x^{(i)}).$$

Since log p_{θ,φ}(x) is intractable to optimize, we use an approach based on the variational autoencoder (VAE). In particular, we use a variational distribution

$$q_{\tilde\phi}(s, c, z \mid x) = q_{\tilde\phi}(z \mid s, c)\, q(s, c \mid x),$$

which has parameters φ̃. Then, we optimize φ̃ while simultaneously optimizing θ, φ. Using q_φ̃(s, c, z | x), the evidence lower bound on the log-likelihood is

$$\log p_{\theta,\phi}(x) \ge \mathbb{E}_{q(s,c,z \mid x)}[\log p_\theta(x \mid s, c)] - D_{\mathrm{KL}}(q(s, c, z \mid x) \,\|\, p_\phi(s, c \mid z)\, p(z))$$
$$= \mathbb{E}_{q(s,c \mid x)}[\log p_\theta(x \mid s, c)] + \mathbb{E}_{q(s,c \mid x),\, q_{\tilde\phi}(z \mid s,c)}[\log p_\phi(s, c \mid z)] - \mathbb{E}_{q(s,c \mid x)}[D_{\mathrm{KL}}(q_{\tilde\phi}(z \mid s, c) \,\|\, p(z))] + H(q(s, c \mid x)), \tag{1}$$

where D_KL is the KL divergence and H is information entropy. Thus, we can approximate θ*, φ* by optimizing the lower bound (1) instead of log p_{θ,φ}(x). However, (1) remains intractable since we are integrating over all program sketches s ∈ S and perceptual components c ∈ C. As we describe next, our approach is to synthesize a single point estimate s_x ∈ S and c_x ∈ C for each x ∈ X.

Synthesizing structure. For a given x ∈ X, we use program synthesis to infer a single likely choice s_x ∈ S and c_x ∈ C of the latent structure. The program synthesis algorithm must be tailored to a specific domain; we propose an algorithm for inferring for-loop structure in images in Section 3. Then, we use these point estimates in place of the integrals over S and C, i.e., we assume that q(s, c | x) = δ(s − s_x) δ(c − c_x), where δ is the Dirac delta function. Plugging into (1) gives

$$\log p_{\theta,\phi}(x) \ge \log p_\theta(x \mid s_x, c_x) + \mathbb{E}_{q_{\tilde\phi}(z \mid s_x, c_x)}[\log p_\phi(s_x, c_x \mid z)] - D_{\mathrm{KL}}(q_{\tilde\phi}(z \mid s_x, c_x) \,\|\, p(z)), \tag{2}$$
where we have dropped the degenerate terms log δ(s − s_x) and log δ(c − c_x) (which are constant with respect to the parameters θ, φ, φ̃). As a consequence, (1) decomposes into two parts that can be straightforwardly optimized, i.e.,

$$\log p_{\theta,\phi}(x) \ge L(\theta; x) + L(\phi, \tilde\phi; x)$$
$$L(\theta; x) = \log p_\theta(x \mid s_x, c_x)$$
$$L(\phi, \tilde\phi; x) = \mathbb{E}_{q_{\tilde\phi}(z \mid s_x, c_x)}[\log p_\phi(s_x, c_x \mid z)] - D_{\mathrm{KL}}(q_{\tilde\phi}(z \mid s_x, c_x) \,\|\, p(z)),$$

where we can optimize θ independently from φ, φ̃:

$$\theta^* = \arg\max_\theta \sum_{i=1}^n L(\theta; x^{(i)}), \qquad \phi^*, \tilde\phi^* = \arg\max_{\phi, \tilde\phi} \sum_{i=1}^n L(\phi, \tilde\phi; x^{(i)}).$$

Latent structure VAE. Note that L(φ, φ̃; x) is exactly the objective of a VAE, where q_φ̃(z | s, c) is the encoder and p_φ(s, c | z) is the decoder; i.e., learning the distribution over latent structure is equivalent to learning the parameters of a VAE. The architecture of this VAE depends on the representation of s and c. In the case of for-loop structure in images, we use a sequence-to-sequence VAE.

Generating data with structure. The term L(θ; x) corresponds to learning a probability distribution (conditioned on the latent structure s and c); e.g., we can estimate this distribution using another VAE. As before, the architecture of this VAE depends on the representation of s and c. Rather than directly predicting x based on s and c, we can leverage the program structure more directly by first executing the program P = (s, c) to obtain its output x_struct = eval(P), which we call a structure rendering. In particular, x_struct is a more direct representation of the global structure represented by P, so it is often more suitable to use as input to a neural network. The middle of Figure 2 shows an example of a structure rendering for the program on the left. Then, we can train a model p_θ(x | s, c) = p_θ(x | x_struct). In the case of images, we use a VAE with convolutional layers for the encoder and transpose convolutional layers for the decoder. Furthermore, instead of estimating the entire distribution p_θ(x | s, c), we also consider two non-probabilistic approaches that directly predict x from x_struct, which is an image completion problem. We can solve this problem using GLCIC, a state-of-the-art image completion model (Iizuka et al., 2017). We can also use CycleGAN (Zhu et al., 2017), which solves the more general problem of mapping a training set of structure renderings {x_struct} to a training set of completed images {x}.

Image completion. In image completion, we are given a set of training pairs (x_part, x*), and the goal is to learn a model that predicts the complete image x* given a partial image x_part. Compared to generation, our likelihood is now conditioned on x_part, i.e., p_{θ,φ}(x | x_part). We now describe how we modify each of our two models p_θ(x | s, c) and p_φ(s, c | z) to incorporate this extra information. First, the programmatic structure is no longer fully latent, since we can observe the structure in x_part. In particular, we leverage our program synthesis algorithm to help perform completion. We first synthesize programs P and P_part representing the global structure in x and x_part, respectively. Then, we train a model f that predicts P given P_part, i.e., it extrapolates P_part to a program P̂ = f(P_part) representing the structure of the full image. Thus, unlike generation, where we sample a program P̂ = (s, c) ~ p_φ(s, c | z), we use the extrapolated program P̂ = f(P_part). Second, when we execute P̂ = (s, c) to obtain a structure rendering x_struct, we render it on top of the given partial image x_part. Finally, we sample a completion x̂ ~ p_θ(x | x_struct) as before. Our image completion pipeline is shown in Figure 3.

Figure 3. SGM image completion pipeline: (i) Given a partial image x_part (top left), use our program synthesis algorithm (Section 3) to synthesize a program P_part representing the structure in the partial image (top middle). (ii) Extrapolate P_part to a program P̂ = f(P_part) representing the structure of the full image. (iii) Execute P̂ to obtain a rendering of the program structure x̂_struct (bottom left). (iv) Complete x̂_struct into an image x̂ (bottom middle), which resembles the original image x (bottom right).
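To illustrate what extrapolation means here: in the paper the model f is learned from pairs of partial and full programs, so the rule below is only an illustrative stand-in, not the paper's method. It simply stretches each for-loop in the partial program so that it covers as much of the N × N grid as the grammar's constraints a·n + b ≤ N and a'·n' + b' ≤ N allow; sketches are represented as plain parameter tuples for brevity.

```python
# Illustrative stand-in only: the paper *trains* a model f to extrapolate P_part to P-hat.
# This naive rule just extends each loop's trip counts to fill the N x N grid.

def extrapolate(partial_program, N):
    """partial_program: list of ((n, a, b, n2, a2, b2), component) pairs synthesized from x_part."""
    full_program = []
    for (n, a, b, n2, a2, b2), component in partial_program:
        n_max = (N - b) // a      # largest n with a * n + b <= N
        n2_max = (N - b2) // a2   # largest n2 with a2 * n2 + b2 <= N
        full_program.append(((n_max, a, b, n2_max, a2, b2), component))
    return full_program

# A 3 x 2 window pattern found in the partial image, stretched to fill a 9 x 9 grid:
P_part = [((3, 2, 1, 2, 2, 1), "window")]
P_hat = extrapolate(P_part, N=9)  # -> [((4, 2, 1, 4, 2, 1), "window")]
```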
3. Synthesizing Programmatic Structure

Image representation. Since the images we work with are very high dimensional, for tractability, we assume that each image x ∈ R^{NM×NM} is divided into a grid containing N rows and N columns, where each grid cell has size M × M pixels (here, M ∈ ℕ is a hyperparameter). For example, this grid structure is apparent in Figure 3 (top right), where N = 15, M = 17 and N = 9, M = 16 for the facades and synthetic datasets, respectively. For t, u ∈ [N] = {1, ..., N}, we let x_tu ∈ R^{M×M} denote the sub-image at position (t, u) in the N × N grid.

Program grammar. Given this structure, we consider programs that draw 2D repeating patterns of M × M sub-images on the grid. More precisely, we consider programs P = ((s_1, c_1), ..., (s_k, c_k)) ∈ (S × C)^k that are length-k lists of pairs consisting of a sketch s ∈ S and a perceptual component c ∈ C; here, k ∈ ℕ is a hyperparameter. (So far, we have assumed that a program is a single pair P = (s, c), but the generalization to a list of pairs is straightforward.) A sketch s ∈ S has the form

    for (i, j) ∈ {1, ..., n} × {1, ..., n'} do
        draw(a·i + b, a'·j + b', ??)

where n, a, b, n', a', b' ∈ ℕ are undetermined parameters that must satisfy a·n + b ≤ N and a'·n' + b' ≤ N, and where ?? is a hole to be filled by a perceptual component, which is an M × M sub-image c ∈ R^{M×M} (for colored images, c ∈ R^{M×M×3}). Then, upon executing the (i, j) iteration of the for-loop, the program renders sub-image c at position (t, u) = (a·i + b, a'·j + b') in the N × N grid. Figure 3 (top middle) shows an example of a sketch s whose hole is filled with a sub-image c, and Figure 3 (bottom left) shows the image rendered upon executing P = (s, c). Figure 2 shows another such example.

Program synthesis problem. Given a training image x ∈ R^{NM×NM}, our program synthesis algorithm outputs the parameters n_h, a_h, b_h, n'_h, a'_h, b'_h of each sketch s_h in the program (for h ∈ [k]), along with a perceptual component c_h to fill the hole in sketch s_h. Together, these parameters define a program P = ((s_1, c_1), ..., (s_k, c_k)). The goal is to synthesize a program that faithfully represents the global structure in x. We capture this structure using a boolean tensor B(x) ∈ {0, 1}^{N×N×N×N}, where

$$B(x)_{t,u,t',u'} = \begin{cases} 1 & \text{if } d(x_{tu}, x_{t'u'}) \le \epsilon \\ 0 & \text{otherwise,} \end{cases}$$

where ε ∈ ℝ₊ is a hyperparameter and d(I, I') is a distance metric on the space of sub-images. In our implementation, we use a weighted sum of the earth mover's distance between the color histograms of I and I', and the number of SIFT correspondences between I and I'.

Additionally, we associate a boolean tensor with a given program P = ((s_1, c_1), ..., (s_k, c_k)). First, for a sketch s ∈ S with parameters a, b, n, a', b', n', we define

$$\mathrm{cover}(s) = \{(a \cdot i + b,\ a' \cdot j + b') \mid i \in [n],\ j \in [n']\},$$

i.e., the set of grid cells where sketch s renders a sub-image upon execution. Then, we have

$$B(s)_{t,u,t',u'} = \begin{cases} 1 & \text{if } (t,u), (t',u') \in \mathrm{cover}(s) \\ 0 & \text{otherwise,} \end{cases}$$

i.e., B(s)_{t,u,t',u'} indicates whether the sketch s renders a sub-image at both of the grid cells (t, u) and (t', u'). Then, B(P) = B(s_1) ∨ ... ∨ B(s_k), where the disjunction of boolean tensors is defined elementwise. Intuitively, B(P) identifies the set of pairs of grid cells (t, u) and (t', u') that are equal in the image rendered upon executing each pair (s, c) in P. (Note that the covers of different sketches in P can overlap; ignoring this overlap does not significantly impact our results.)
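As a concrete (and simplified) illustration of these definitions, the sketch below builds B(x) using a plain per-cell mean-squared distance in place of the paper's weighted color-histogram/SIFT metric, computes cover(s) and B(P) for a program whose sketches are parameter tuples, and evaluates the objective ℓ(P; x) from equation (3) below; the distance metric, threshold, and names are stand-ins.

```python
import numpy as np

def subimage(x, t, u, M):
    """The M x M cell at 1-indexed grid position (t, u)."""
    return x[(t - 1) * M : t * M, (u - 1) * M : u * M]

def B_image(x, N, M, eps):
    """B(x)[t-1, u-1, t2-1, u2-1] = 1 iff cells (t, u) and (t2, u2) are similar.
    Uses mean squared distance as a stand-in for the paper's histogram/SIFT metric."""
    B = np.zeros((N, N, N, N), dtype=bool)
    for t in range(1, N + 1):
        for u in range(1, N + 1):
            for t2 in range(1, N + 1):
                for u2 in range(1, N + 1):
                    d = np.mean((subimage(x, t, u, M) - subimage(x, t2, u2, M)) ** 2)
                    B[t - 1, u - 1, t2 - 1, u2 - 1] = d <= eps
    return B

def cover(sketch):
    """Grid cells drawn by a sketch given as a parameter tuple (n, a, b, n2, a2, b2)."""
    n, a, b, n2, a2, b2 = sketch
    return {(a * i + b, a2 * j + b2) for i in range(1, n + 1) for j in range(1, n2 + 1)}

def B_program(program, N):
    """B(P) = B(s_1) v ... v B(s_k), computed elementwise."""
    B = np.zeros((N, N, N, N), dtype=bool)
    for sketch, _component in program:
        cells = cover(sketch)
        for (t, u) in cells:
            for (t2, u2) in cells:
                B[t - 1, u - 1, t2 - 1, u2 - 1] = True
    return B

def objective(program, Bx, N, lam):
    """l(P; x) = |B(x) ^ B(P)|_1 + lambda * |~B(x) ^ ~B(P)|_1  -- equation (3)."""
    Bp = B_program(program, N)
    return np.sum(Bx & Bp) + lam * np.sum(~Bx & ~Bp)
```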
Finally, our program synthesis algorithm aims to solve the following optimization problem:

$$P^* = \arg\max_P\ \ell(P; x), \qquad \ell(P; x) = \|B(x) \wedge B(P)\|_1 + \lambda\, \|\neg B(x) \wedge \neg B(P)\|_1, \tag{3}$$

where ∧ and ¬ are applied elementwise, and λ ∈ ℝ₊ is a hyperparameter. In other words, the objective of (3) counts the number of true positives (i.e., entries where B(P) = B(x) = 1) and the number of true negatives (i.e., entries where B(P) = B(x) = 0), and computes their weighted sum. Thus, the objective of (3) measures how well P represents the global structure of x.

For tractability, we restrict the search space in (3) to programs of the form P = ((s_1, c_1), ..., (s_k, c_k)) ∈ (S × Ĉ)^k, where Ĉ = {x_tu | t, u ∈ [N]}. In other words, rather than searching over all possible sub-images c ∈ R^{M×M}, we only search over the sub-images that actually occur in the training image x. This may lead to a slightly sub-optimal solution, for example, in cases where the optimal sub-image to be rendered is in fact an interpolation between two similar but distinct sub-images in the training image. However, we found that in practice this simplifying assumption still produced viable results.

Program synthesis algorithm. Exactly optimizing (3) is in general an NP-complete problem. Thus, our program synthesis algorithm uses a greedy heuristic. In particular, we initialize the program to P = ∅. Then, on each iteration, we enumerate all pairs (s, c) ∈ S × Ĉ and determine the pair (s_h, c_h) that most increases the objective in (3), where Ĉ is the set of all sub-images x_tu for t, u ∈ [N]. Finally, we add (s_h, c_h) to P. We show the full algorithm in Algorithm 1.

Algorithm 1. Synthesizes a program P representing the global structure of a given image x ∈ R^{NM×NM}.
    Input: image x ∈ R^{NM×NM}
    Ĉ ← {x_tu | t, u ∈ [N]}
    P ← ∅
    for h ∈ {1, ..., k} do
        (s_h, c_h) ← arg max_{(s, c) ∈ S × Ĉ} ℓ(P ∪ {(s, c)}; x)
        P ← P ∪ {(s_h, c_h)}
    end for
    Output: P

Theorem 3.1. If λ = 0, then ℓ(P̂; x) ≥ (1 − e⁻¹) ℓ(P*; x), where P̂ is the program returned by Algorithm 1 and P* solves (3).

Proof. If λ = 0, then optimizing ℓ(P; x) is equivalent to set cover, where the items are the tuples {(t, u, t', u') ∈ [N]⁴ | B(x)_{t,u,t',u'} = 1} and the sets are the pairs (s, c) ∈ S × Ĉ. The theorem follows from (Hochbaum, 1997).

In general, the objective in (3) is not submodular, but we find that the greedy heuristic still works well in practice.
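Below is a compact Python rendering of the greedy loop in Algorithm 1, reusing the subimage, B_image, and objective helpers from the earlier sketch; the brute-force enumeration of candidate sketch parameters is a naive stand-in that would need to be restricted or pruned in practice.

```python
def enumerate_sketches(N):
    """All for-loop parameter tuples (n, a, b, n2, a2, b2) with a*n + b <= N and a2*n2 + b2 <= N."""
    for a in range(1, N + 1):
        for b in range(N):
            for n in range(1, (N - b) // a + 1):
                for a2 in range(1, N + 1):
                    for b2 in range(N):
                        for n2 in range(1, (N - b2) // a2 + 1):
                            yield (n, a, b, n2, a2, b2)

def synthesize(x, N, M, k, eps, lam):
    """Greedy synthesis (Algorithm 1): repeatedly add the (sketch, component) pair
    that most increases the objective l(P u {(s, c)}; x)."""
    Bx = B_image(x, N, M, eps)
    components = [subimage(x, t, u, M) for t in range(1, N + 1) for u in range(1, N + 1)]
    P = []
    for _ in range(k):
        best = max(((s, c) for s in enumerate_sketches(N) for c in components),
                   key=lambda sc: objective(P + [sc], Bx, N, lam))
        P.append(best)
    return P
```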
4. Experiments

We perform two experiments: one for generation from scratch and one for image completion. We find substantial improvement in both tasks. Details on neural network architectures are in Appendix A, and additional examples for image completion are in Appendix B.

4.1. Datasets

Synthetic dataset. We developed a synthetic dataset based on MNIST. Each image consists of a 9 × 9 grid, where each grid cell is 16 × 16 pixels. Each grid cell is either filled with a colored MNIST digit or a solid color background. The program structure is a 2D repeating pattern of an MNIST digit; to add natural noise, each iteration of the for-loop in a sketch s_h renders a different MNIST digit, but with the same MNIST label and color. Additionally, we chose the program structure to contain correlations characteristic of real-world images, e.g., correlations between different parts of the program, correlations between the program and the background, and noise in renderings of the same component. Examples are shown in Figure 4. We give details of how we constructed this dataset in Appendix A. This dataset contains 10,000 training and 500 test images.

Facades dataset. Our second dataset consists of 1855 images (1755 training, 100 testing) of building facades (Tyleček & Šára, 2013). (We chose a large training set since this dataset is so small.) These images were all scaled to a size of 256 × 256 × 3 pixels, and were divided into a 15 × 15 grid of cells, each of size 17 or 18 pixels. These images contain repeating patterns of objects such as windows and doors.

4.2. Generation from Scratch

Experimental setup. First, we evaluate our approach SGM applied to generation from scratch. We focus on the synthetic dataset; we found that our facades dataset was too small to produce meaningful results. For the first stage of SGM (i.e., generating the program P = (s, c)), we use an LSTM architecture for p_φ(s, c | z) (the decoder) and a feedforward architecture for q_φ̃(z | s, c) (the encoder). As described in Section 2, we use Algorithm 1 to synthesize a program P_x = (s_x, c_x) representing each training image x ∈ X_train. Then, we train p_φ and q_φ̃ on the training set of programs {P_x | x ∈ X_train}. For the second stage of SGM (i.e., completing the structure rendering x_struct into an image x), we use a variational encoder-decoder (VED)

$$p_\theta(x \mid s, c) = \int p_\theta(x \mid w)\, q(w \mid x_{\mathrm{struct}})\, dw,$$

where q(w | x_struct) encodes a structure rendering x_struct into a latent vector w, and p_θ(x | w) decodes the latent vector to a whole image; we train the encoder and decoder using the reconstruction error ‖x̂ − x‖. Additionally, we trained a CycleGAN model to map structure renderings to complete images, by giving the CycleGAN model unaligned pairs of x_struct and x as training data. We compare our VED model to a VAE (Kingma & Welling, 2014), and compare our CycleGAN model to a Spatial GAN (Jetchev et al., 2017).
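For concreteness, here is a minimal PyTorch sketch of a variational encoder-decoder of this shape (a convolutional encoder and a transpose-convolutional decoder); the layer sizes, latent dimension, and training details are illustrative guesses, not the architecture from the paper's Appendix A.

```python
import torch
import torch.nn as nn

class VED(nn.Module):
    """Minimal variational encoder-decoder: q(w | x_struct) -> w -> p(x | w).
    Sized for 144 x 144 x 3 images (the synthetic dataset); all hyperparameters are illustrative."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(                                    # 3 x 144 x 144
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),     # 32 x 72 x 72
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64 x 36 x 36
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),   # 128 x 18 x 18
            nn.Flatten())
        self.mu = nn.Linear(128 * 18 * 18, latent_dim)
        self.logvar = nn.Linear(128 * 18 * 18, latent_dim)
        self.fc = nn.Linear(latent_dim, 128 * 18 * 18)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 x 36 x 36
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 32 x 72 x 72
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())  # 3 x 144 x 144

    def forward(self, x_struct):
        h = self.enc(x_struct)
        mu, logvar = self.mu(h), self.logvar(h)
        w = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample of w
        x_hat = self.dec(self.fc(w).view(-1, 128, 18, 18))
        return x_hat, mu, logvar

# One illustrative training step: reconstruct the true image from its structure rendering.
model = VED()
x_struct = torch.rand(1, 3, 144, 144)   # rendering of the synthesized program (placeholder data)
x_true = torch.rand(1, 3, 144, 144)     # corresponding training image (placeholder data)
x_hat, mu, logvar = model(x_struct)
loss = ((x_hat - x_true) ** 2).mean()   # reconstruction term; a KL term would complete the VED loss
loss.backward()
```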
Results. We measure performance for SGM with the VED and the baseline VAE using the variational lower bound on the negative log-likelihood (NLL) (Zhao et al., 2017) on a held-out test set. For our approach, we use the lower bound (2), which is the sum of the NLLs of the first and second stages; we report these NLLs separately as well. (Technically, log p_θ(x | s_x, c_x) is lower bounded by the loss of the variational encoder-decoder.) For SGM with CycleGAN and the baseline Spatial GAN, we use Fréchet inception distance (Heusel et al., 2017). Figure 4 shows examples of generated images. Table 1 shows these metrics for both our approach and the baselines.

Table 1. Performance of our approach SGM versus the baseline (BL) for generation from scratch. We report Fréchet inception distance for the GAN-based models and negative log-likelihood for the VAE-based models.

    Model                               Score
    SGM (CycleGAN)                      85.51
    BL (Spatial GAN)                    258.68
    SGM (VED)                           59414.7
    BL (VAE)                            60368.4
    SGM (VED, stage 1: p_φ(s, c | z))   32.0
    SGM (VED, stage 2: p_θ(x | s, c))   59382.6

Figure 4. Examples of synthetic images generated using our approach, SGM (with VED and CycleGAN), and the baselines (a VAE and a Spatial GAN); panels show original images, SGM (CycleGAN), baseline (Spatial GAN), and baseline (VAE). Images in different rows are unrelated since the task is generation from scratch.

Discussion. The models based on our approach quantitatively improve over the respective baselines. The examples of images generated using our approach with VED completion appear to contain more structure than those generated using the baseline VAE. Similarly, the images generated using our approach with CycleGAN clearly capture more complex structure than the unbounded 2D repeating texture patterns captured by Spatial GAN.

4.3. Image Completion

Experimental setup. Second, we evaluated our approach SGM for image completion, on both our synthetic dataset and the facades dataset. For this task, we compare using three image completion models: GLCIC (Iizuka et al., 2017), CycleGAN (Zhu et al., 2017), and the VED architecture described in Section 4.2. GLCIC is a state-of-the-art image completion model. CycleGAN is a generic image-to-image transformer; it uses unpaired training data, but we found that for our task, it outperforms approaches such as Pix2Pix (Isola et al., 2017) that take paired training data. For each model, we trained two versions:

- Our approach (SGM): As described in Section 2 (for image completion), given a partial image x_part, we use Algorithm 1 to synthesize a program P_part. We extrapolate P_part to P̂ = f(P_part), and execute P̂ to obtain a structure rendering x_struct. Finally, we train the image completion model (GLCIC, CycleGAN, or VED) to complete x_struct to the original image x*.

- Baseline: Given a partial image x_part, we train the image completion model (GLCIC, CycleGAN, or VED) to directly complete x_part to the original image x*.

Results. As in Section 4.2, we measure performance using Fréchet inception distance for GLCIC and CycleGAN, and negative log-likelihood (NLL) for the VED, reported on a held-out test set. We show these results in Table 2. We show examples of completed images using GLCIC in Figure 5, and additional examples of completed images, including those completed using CycleGAN and VED, in Appendix B.

Table 2. Performance of our approach SGM versus the baseline (BL) for image completion. We report Fréchet inception distance for the GAN-based models and negative log-likelihood (NLL) for the VED.

                  Synthetic               Facades
    Model         SGM        BL           SGM       BL
    GLCIC         106.8      163.66       141.8     195.9
    CycleGAN      91.8       218.7        124.4     251.4
    VED           44570.4    52442.9      8755.4    8636.3

Figure 5. Examples of images generated using our approach (SGM) and the baseline, using GLCIC for image completion; panels show the original image, the SGM completion, and the baseline completion for the synthetic and facades datasets.

Discussion. Our approach SGM outperforms the baseline in every case except the VED on the facades dataset. We believe this last result is because both VEDs failed to learn any meaningful structure (see Figure 7 in Appendix B). A key reason why the baselines perform so poorly on the facades dataset is that the dataset is very small. Nevertheless, SGM substantially outperforms the baselines even on the larger synthetic dataset. Finally, generative models such as GLCIC are known to perform poorly away from the edges of the given partial image (Iizuka et al., 2017). A benefit of our approach is that it provides global context for models such as GLCIC that are good at performing local completion.

5. Conclusion

We have proposed a new approach to generation that incorporates programmatic structure into state-of-the-art deep learning models. In our experiments, we have demonstrated the promise of our approach to improve generation of high-dimensional data with global structure that current state-of-the-art deep generative models have difficulty capturing. There are a number of directions for future work that could improve the quality of the images generated using our approach. Most importantly, we have relied on a relatively simple grammar of programs.
Designing more expressive program grammars that can more accurately capture global structure could substantially improve our results. Examples of possible extensions include if-then-else statements and variable grids. Furthermore, it may be useful to incorporate spatial transformations so we can capture patterns that are distorted due to camera projection. Correspondingly, more sophisticated synthesis algorithms may be needed for these domains. In particular, learning-based program synthesizers may be necessary to infer more complex global structure. Devising new learning algorithms, e.g., based on reinforcement learning, would be needed to learn these synthesizers in conjunction with the parameters of the SGM model.

Acknowledgements

We thank the anonymous reviewers for insightful feedback. This research was supported by NSF awards #1737858 and #1836936.

References

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 39-48, 2016.

Bastani, O., Pu, Y., and Solar-Lezama, A. Verifiable reinforcement learning via policy extraction. In NIPS, 2018.

De Raedt, L., Kimmig, A., and Toivonen, H. ProbLog: A probabilistic Prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 2468-2473, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625673.

Deng, Z., Chen, J., Fu, Y., and Mori, G. Probabilistic neural programmed networks for scene generation. In Advances in Neural Information Processing Systems, pp. 4032-4042, 2018.

Dieleman, S., van den Oord, A., and Simonyan, K. The challenge of realistic music generation: modelling raw audio at scale. In Advances in Neural Information Processing Systems, pp. 8000-8010, 2018.

Ellis, K., Solar-Lezama, A., and Tenenbaum, J. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems, pp. 973-981, 2015.

Ellis, K., Ritchie, D., Solar-Lezama, A., and Tenenbaum, J. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pp. 6060-6069, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Ha, D. and Eck, D. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477, 2017.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626-6637, 2017.

Hochbaum, D. S. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation Algorithms for NP-Hard Problems, pp. 94-143, 1997.

Hwang, I., Stuhlmüller, A., and Goodman, N. D. Inducing probabilistic programs by Bayesian program merging. arXiv preprint arXiv:1110.5667, 2011.

Iizuka, S., Simo-Serra, E., and Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph., 2017.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967-5976. IEEE, 2017.
Jetchev, N., Bergmann, U., and Vollgraf, R. Texture synthesis with spatial generative adversarial networks. 2017.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.

Liang, P., Jordan, M. I., and Klein, D. Learning programs: A hierarchical Bayesian approach. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 639-646, 2010.

Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., and De Raedt, L. DeepProbLog: Neural probabilistic logic programming. CoRR, abs/1805.10872, 2018. URL http://arxiv.org/abs/1805.10872.

Nori, A. V., Ozair, S., Rajamani, S. K., and Vijaykeerthy, D. Efficient synthesis of probabilistic programs. In PLDI, volume 50, pp. 208-217. ACM, 2015.

Perov, Y. N. and Wood, F. Learning probabilistic programs. In NIPS Probabilistic Programming Workshop, 2014.

Pu, Y., Miranda, Z., Solar-Lezama, A., and Kaelbling, L. Selecting representative examples for program synthesis. In International Conference on Machine Learning, pp. 4158-4167, 2018.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2015.

Solar-Lezama, A., Tancau, L., Bodik, R., Seshia, S., and Saraswat, V. Combinatorial sketching for finite programs. In ASPLOS, volume 41, pp. 404-415. ACM, 2006.

Talton, J., Yang, L., Kumar, R., Lim, M., Goodman, N., and Měch, R. Learning design patterns with Bayesian grammar induction. In Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology, pp. 63-74. ACM, 2012.

Tyleček, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrücken, Germany, 2013.

Valkov, L., Chaudhari, D., Srivastava, A., Sutton, C., and Chaudhuri, S. Houdini: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pp. 8701-8712, 2018.

Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In ICML, 2018.

Wang, F. and Rudin, C. Falling rule lists. In Artificial Intelligence and Statistics, pp. 1013-1022, 2015.

Xian, W., Sangkloy, P., Agrawal, V., Raj, A., Lu, J., Fang, C., Yu, F., and Hays, J. TextureGAN: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8456-8465, 2018.

Zhao, S., Song, J., and Ermon, S. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

Zhu, J.-Y., Park, T., and Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.