# Multi-object Generation with Amortized Structural Regularization

Kun Xu, Chongxuan Li, Jun Zhu, Bo Zhang
Dept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China
{kunxu.thu, chongxuanli1991}@gmail.com, {dcszj, dcszb}@tsinghua.edu.cn

## Abstract

Deep generative models (DGMs) have shown promise in image generation. However, most existing methods learn a model by simply optimizing a divergence between the marginal distributions of the model and the data, and often fail to capture the rich structures in an image, such as the attributes of objects and the relationships among them. Human knowledge is a crucial element for DGMs to infer these structures, especially in unsupervised learning. In this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints. We derive a lower bound of the regularized log-likelihood in PR and adopt the amortized inference technique to jointly and efficiently optimize the generative model and an auxiliary recognition model for inference. Empirical results show that ASR outperforms the DGM baselines in terms of inference performance and sample quality.

## 1 Introduction

*Figure 1: An illustration of the overlapping problem. The first bounding box is in red, and the second one is in green; the overlapping area is in purple. Defining the prior distribution in an auto-regressive manner is still challenging, since some locations are invalid even for the first bounding box, as shown in the right panel.*

Deep generative models (DGMs) [19, 26, 10] have made significant progress in image generation, which largely promotes downstream applications, especially in unsupervised learning [5, 7] and semi-supervised learning [20, 6].
In most real-world settings, visual data is presented as a scene of multiple objects with complicated relationships among them. However, most existing methods [19, 10] lack a mechanism to capture the underlying structures in images, including the regularities (e.g., size and shape) of an object and the relationships among objects. This is because they adopt a single feature vector to represent the whole image and consequently focus on generating images with a single main object [17], which largely impedes DGMs from generalizing to complex scene images. How to solve this problem in an unsupervised manner remains largely open.

*Corresponding author. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.*

The key to addressing the problem is to model the structures explicitly. Existing work attempts to solve the problem via structured DGMs [8, 24], where a structured prior distribution over latent variables encodes the structural information of images and regularizes the model's behavior under the framework of maximum likelihood estimation (MLE). However, such methods have two potential limitations. First, merely maximizing the data log-likelihood of such models often fails to capture the structures in an unsupervised manner [21]. Maximizing the marginal likelihood does not necessarily encourage the model to capture reasonable structures, because the latent structures are integrated out. Besides, the optimization process often gets stuck in local optima because of the highly non-linear functions defined by neural networks, which may also result in undesirable behavior. Second, it is generally challenging to design a prior distribution that is both flexible and computationally tractable. Consider the case where we want to uniformly sample several 20×20 bounding boxes in a 50×50 image without overlap. It is difficult to define a tractable prior distribution, as shown in Fig. 1.
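To see why the non-overlap constraint is hard to encode as a tractable prior, consider a naive rejection sampler for the 20×20-boxes-in-a-50×50-image example above. This is a hypothetical illustration, not code from the paper: the effective distribution over each later box depends on all earlier placements, and for some placements of the first box no valid position for the second box exists at all.

```python
import random

def sample_boxes(n_boxes, box=20, canvas=50, max_tries=10000):
    """Rejection sampling of non-overlapping axis-aligned boxes.

    Each box's top-left corner is drawn uniformly; a draw is rejected
    whenever it overlaps a previously placed box. The conditional
    distribution of later boxes therefore has no simple closed form,
    and sampling can fail outright: e.g., a first 20x20 box centered
    in a 50x50 canvas leaves no legal slot for a second box.
    """
    placed = []
    for _ in range(max_tries):
        if len(placed) == n_boxes:
            return placed
        x = random.randint(0, canvas - box)
        y = random.randint(0, canvas - box)
        # Two equal-size boxes overlap iff they are close on BOTH axes.
        if all(abs(x - px) >= box or abs(y - py) >= box for px, py in placed):
            placed.append((x, y))
    return None  # gave up: the partial placement left no valid slot
```

The possibility of returning `None` is exactly the pathology the figure illustrates: the "prior" implied by this procedure is neither a simple product distribution nor guaranteed to be normalizable per step.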
Though it is feasible to set the prior probability to zero whenever the prior knowledge is violated, using indicator functions, this introduces other challenges, such as non-convexity and non-differentiability, into the optimization problem.

In contrast, posterior regularization (PR) [9] and its generalization to Bayesian inference, regularized Bayesian inference (RegBayes) [30], provide a general framework to embed human knowledge into generative models, which directly regularizes the posterior distribution instead of designing proper prior distributions. In PR and RegBayes, a valid posterior set is defined according to the human knowledge, and the KL-divergence between the true posterior and the valid set (see the formal definition in Sec. 2.2) is minimized to regularize the behavior of structured DGMs. However, the valid set consists of sample-specific variational distributions. Therefore, the number of parameters in the variational distribution grows linearly with the number of training samples, and an inner loop is required to accurately approximate the regularized posterior [28]. This computational issue makes it non-trivial to apply PR directly to large-scale datasets and DGMs.

In this paper, we propose a flexible amortized structural regularization (ASR) framework to improve the performance of structured generative models based on PR. ASR is a general framework that properly incorporates structural knowledge into DGMs by extending PR to the amortized setting; its objective function is the log-likelihood of the training data along with a regularization term over the posterior distribution. The regularization term helps the model capture reasonable structures of an image and avoid unsatisfactory behavior that violates the constraints. We derive a lower bound of the regularized log-likelihood and use an amortized recognition model to approximate the constrained posterior distribution.
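For concreteness, the standard PR formulation [9] can be sketched as follows; the constraint features $\phi$ and slack $\xi$ below are generic notation introduced here for illustration, and the exact form used in this paper (Sec. 2.2) may differ:

$$\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z)\,\|\,p_\theta(z \mid x)\big), \qquad \mathcal{Q} = \big\{\, q : \mathbb{E}_{q}[\phi(x, z)] \le \xi \,\big\},$$

where $\mathcal{Q}$ is the valid posterior set induced by the knowledge constraints. Because $q$ is a separate variational distribution per training sample $x$, the parameter count grows with the dataset, which is the scalability issue that ASR's amortized recognition model is designed to remove.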
By relaxing the constraints into a penalty term, ASR can be optimized efficiently using gradient-based methods. We apply ASR to the state-of-the-art structured generative models [8] for multi-object image generation tasks. Empirical results demonstrate the effectiveness of the proposed method: both the inference and generative performance are improved with the help of human knowledge.

## 2 Preliminary

### 2.1 Iterative generative models for multiple objects

Attend-Infer-Repeat (AIR) [8] is a structured latent variable model that decomposes an image into several objects. The attributes of an object (i.e., appearance, location, and scale) are represented by a set of random variables $z = \{z^{\text{app}}, z^{\text{loc}}, z^{\text{scale}}\}$. The generative process starts by sampling the number of objects $n \sim p(n)$, and then $n$ sets of latent variables are sampled independently as $z_i \sim p(z)$. The final image is composed by adding these objects onto an empty canvas. Specifically, the joint distribution and its marginal over the observed data can be formulated as follows:

$$p(x, z, n) = p(n) \prod_{i=1}^{n} p(z_i)\, p(x \mid z, n), \qquad p(x) = \sum_{n} \int_{z} p(x, z, n)\, dz.$$

The conditional distribution $p(x \mid z, n)$ is usually formulated as a multivariate Gaussian distribution with mean $\mu = \sum_{i=1}^{n} f_{\text{dec}}(z_i)$, or a Bernoulli distribution with probability $p = \sum_{i=1}^{n} f_{\text{dec}}(z_i)$ for the pixels of an image, where $f_{\text{dec}}$ is a decoder network that maps the latent variables to the image space.

In an unsupervised manner, AIR can efficiently infer the number of objects, as well as the latent variables of each object, using amortized variational inference. The latent variables are inferred iteratively, and the number of objects $n$ is represented by $z_{\text{pres}}$: an $(n+1)$-dimensional binary vector with $n$ ones followed by a zero. The $i$-th element of $z_{\text{pres}}$ denotes whether the inference process is terminated at step $i$. The inference model can then be formulated as follows:

q(z, n|x) = q(zn+1 pres = 0|x, z
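The generative process above (sample $n$, sample each $z_i$ independently, sum the decoded objects into the pixel mean) can be sketched as a toy ancestral sampler. This is illustrative only: the Gaussian-blob `f_dec`, the uniform choices for $p(n)$ and $p(z)$, and the canvas size are assumptions standing in for AIR's learned neural decoder and priors, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_dec(z, canvas=50):
    """Hypothetical decoder: paints one object as a soft Gaussian blob.

    Stands in for AIR's neural decoder; z packs (location_y, location_x, scale).
    """
    ys, xs = np.mgrid[0:canvas, 0:canvas]
    cy, cx, s = z
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * s ** 2))

def generate(canvas=50, max_objects=3):
    """Ancestral sampling from the AIR-style generative process (sketch).

    n ~ p(n) (here uniform over {1..max_objects}); z_i ~ p(z) i.i.d.
    (here a uniform box over locations/scales); the pixel mean is the
    sum of decoded objects, mu = sum_i f_dec(z_i), as in the text.
    """
    n = int(rng.integers(1, max_objects + 1))
    zs = [(rng.uniform(10, 40), rng.uniform(10, 40), rng.uniform(2, 5))
          for _ in range(n)]
    mu = sum(f_dec(z, canvas) for z in zs)
    # Gaussian observation noise around the composed mean, clipped to [0, 1].
    x = np.clip(mu + 0.05 * rng.standard_normal((canvas, canvas)), 0, 1)
    return x, zs, n
```

Note that inference runs in the opposite direction: given only `x`, AIR must recover `n` and each `z_i`, which is what the iterative recognition model $q(z, n \mid x)$ is for.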