{"title": "Multi-objects Generation with Amortized Structural Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 6619, "page_last": 6629, "abstract": "Deep generative models (DGMs) have shown promise in image generation. However, most of the existing methods learn a model by simply optimizing a divergence between the marginal distributions of the model and the data, and often fail to capture rich structures, such as attributes of objects and their relationships, in an image.\nHuman knowledge is a crucial element to the success of DGMs to infer these structures, especially in unsupervised learning.\nIn this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints.\nWe derive a lower bound of the regularized log-likelihood in PR and adopt the amortized inference technique to jointly optimize the generative model and an auxiliary recognition model for inference efficiently.\nEmpirical results show that ASR outperforms the DGM baselines in terms of inference performance and sample quality.", "full_text": "Multi-object Generation with Amortized Structural\n\nRegularization\n\nKun Xu, Chongxuan Li, Jun Zhu\u2217, Bo Zhang\n\nDept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center,\nState Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China\n{kunxu.thu, chongxuanli1991}@gmail.com, {dcszj, dcszb}@tsinghua.edu.cn\n\nAbstract\n\nDeep generative models (DGMs) have shown promise in image generation. How-\never, most of the existing methods learn a model by simply optimizing a divergence\nbetween the marginal distributions of the model and the data, and often fail to\ncapture rich structures, such as attributes of objects and their relationships, in an im-\nage. Human knowledge is a crucial element to the success of DGMs to infer these\nstructures, especially in unsupervised learning. 
In this paper, we propose amortized\nstructural regularization (ASR), which adopts posterior regularization (PR) to\nembed human knowledge into DGMs via a set of structural constraints. We derive\na lower bound of the regularized log-likelihood in PR and adopt the amortized\ninference technique to jointly optimize the generative model and an auxiliary recognition\nmodel for inference efficiently. Empirical results show that ASR outperforms\nthe DGM baselines in terms of inference performance and sample quality.\n\n1 Introduction\n\nDeep generative models (DGMs) [19, 26, 10] have\nmade significant progress in image generation, which\nlargely promotes the downstream applications, especially\nin unsupervised learning [5, 7] and semi-supervised\nlearning [20, 6]. In most real-world\nsettings, visual data is often presented as a scene\nof multiple objects with complicated relationships\namong them. However, most of the existing methods\n[19, 10] lack a mechanism to capture the underlying\nstructures in images, including regularities\n(e.g., size, shape) of an object and the relationships\namong objects. This is because they adopt a single\nfeature vector to represent the whole image and consequently\nfocus on generating images with a single main\nobject [17]. This largely impedes DGMs from generalizing to\ncomplex scene images. How to solve the problem in\nan unsupervised manner is still largely open.\nThe key to addressing the problem is to model the structures explicitly. Existing work attempts to solve\nthe problem via structured DGMs [8, 24], where a structured prior distribution over latent variables\nis used to encode the structural information of images and regularize the model behavior under the\nframework of maximum likelihood estimation (MLE). However, there are two potential limitations of\nsuch methods. 
First, merely maximizing data\u2019s log-likelihood of such models often fails to capture the\nstructures in an unsupervised manner [21]. Maximizing the marginal likelihood does not necessarily\nencourage the model to capture the reasonable structures because the latent structures are integrated\nout. Besides, the optimizing process often gets stuck in local optima because of the highly non-linear\nfunctions defined by neural networks, which may also result in undesirable behavior. Second, it is\ngenerally challenging to design a proper prior distribution which is both flexible and computationally\ntractable. Consider the case where we want to uniformly sample several 20 \u00d7 20 bounding boxes in\na 50 \u00d7 50 image without overlap. It is difficult to define a tractable prior distribution, as shown in\nFig. 1. Though it is feasible to set the probability to zero when the prior knowledge is violated using\nindicator functions, other challenges like non-convexity and non-differentiability will be introduced\nto the optimization problem.\n\nFigure 1: An illustration of the overlapping\nproblem. The first bounding box is in red,\nand the second one is in green. The overlapping\narea is in purple. Defining the prior\ndistribution in the auto-regressive manner\nis still challenging since some locations are\nnot valid even for the first bounding box as\nshown in the right panel.\n\n\u2217Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn contrast, the posterior regularization (PR) [9] and its generalized version in Bayesian inference,\ni.e., regularized Bayesian inference (RegBayes) [30], provide a general framework to embed human\nknowledge in generative models, which directly regularizes the posterior distribution instead of\ndesigning proper prior distributions. 
In PR and RegBayes, a valid posterior set is de\ufb01ned according\nto the human knowledge, and the KL-divergence between the true posterior and the valid set (see the\nformal de\ufb01nition in Sec. 2.2) is minimized to regularize the behavior of structured DGMs. However,\nthe valid set consists of sample-speci\ufb01c variational distributions. Therefore, the number of parameters\nin the variational distribution grows linearly with the number of training samples, and it requires an\ninner loop for accurately approximating the regularized posterior [28]. The above computational\nissue makes it non-trivial to apply PR to large-scale datasets and DGMs directly.\nIn this paper, we propose a \ufb02exible amortized structural regularization (ASR) framework to improve\nthe performance of structured generative models based on PR. ASR is a general framework to\nproperly incorporate structural knowledge into DGMs by extending PR to the amortized setting, and\nits objective function is denoted as the log-likelihood of the training data along with a regularization\nterm over the posterior distribution. The regularization term can help the model to capture reasonable\nstructures of an image, and to avoid unsatisfactory behavior that violates the constraints. We derive a\nlower bound of the regularized log-likelihood and use an amortized recognition model to approximate\nthe constrained posterior distribution. By slacking the constraints as a penalty term, ASR can be\noptimized ef\ufb01ciently using gradient-based methods. We apply ASR to the state-of-the-art structured\ngenerative models [8] for the multi-object image generation tasks. 
Empirical results demonstrate\nthe effectiveness of our proposed method, and both the inference and generative performance are\nimproved with the help of human knowledge.\n\n2 Preliminary\n\n2.1 Iterative generative models for multiple objects\nAttend-Infer-Repeat (AIR) [8] is a structured latent variable model, which decomposes an image\ninto several objects. The attributes of objects (i.e., appearance, location, and scale) are represented\nby a set of random variables z = {z_app, z_loc, z_scale}. The generative process starts from sampling\nthe number of objects n \u223c p(n), and then n sets of latent variables are sampled independently as\nz^i \u223c p(z). The final image is composed by adding these objects to an empty canvas. Specifically,\nthe joint distribution and its marginal over the observed data can be formulated as follows:\n\np(x, z, n) = p(n) \u220f_{i=1:n} p(z^i) p(x|z, n), p(x) = \u2211_n \u222b_z p(x, z, n) dz.\n\nThe conditional distribution p(x|z, n) is usually formulated as a multivariate Gaussian distribution\nwith mean \u00b5 = \u2211_{i=1:n} f_dec(z^i), or a Bernoulli distribution with probability p = \u2211_{i=1:n} f_dec(z^i)\nfor pixels in images. f_dec is a decoder network that maps the latent variables to the image space.\nIn an unsupervised manner, AIR can infer the number of objects, as well as the latent variables\nfor each object efficiently using amortized variational inference. The latent variables are inferred\niteratively, and the number of objects n is represented by z_pres: an (n + 1)-dimensional binary vector\nwith n ones followed by a zero. The i-th element of z_pres denotes whether the inference process is\nterminated or not. 
Then the inference model can be formulated as follows:\n\nq(z, n|x) = q(z_pres^{n+1} = 0 | x, z^{<n}) \u220f_{i=1:n} q(z^i | x, z^{<i}) q(z_pres^i = 1 | x, z^{<i}). (1)\n\nThe inference model iteratively infers the latent variable z^i of the i-th object conditioning on the\npreviously inferred latent variables z^{<i} and the input image x until z_pres^{n+1} = 0.\nBy explicitly modeling the location and appearance of each object, AIR is capable of modeling an\nimage with structural information, rather than a simple feature vector. It is worth noting that the\nprior distributions over the number of steps n and the latent variables z^i are pre-defined and cannot be learned from data. In the\nfollowing, we modify the original AIR by introducing a parametric prior to capture the dependency\namong objects. Details are given in Sec. 3.1.\n\n2.2 Posterior regularization for structured generative model\nPosterior regularization (PR) [9, 30] provides a principled approach to regularize latent variable\nmodels with a set of structural constraints. There are some cases where designing a prior distribution\nfor the prior knowledge is intractable, whereas the knowledge can be easily presented as a set of constraints [30].\nIn these cases, PR is more flexible than designing proper prior distributions.\nSpecifically, a latent variable model is denoted as p(X, Z; \u03b8) = p(Z; \u03b8)p(X|Z; \u03b8) where X is the\ntraining data and Z is the corresponding latent variable. \u03b8 denotes the parameters of p, and takes values\nin \u0398, which is generally R^{|\u0398|}, where |\u0398| denotes the dimension of the parameter space. PR proposes\nto regularize the posterior distribution with certain constraints under the framework of maximum\nlikelihood estimation (MLE). 
Generally, the constraints are defined as the expectation of certain\nstatistics \u03c8(X, Z) \u2208 R^d, and they form a set of valid posterior distributions Q as follows:\n\nQ = {q(Z) | E_{q(Z)}[\u03c8(X, Z)] \u2264 0}, (2)\n\nwhere d is the number of constraints, and 0 is a d-dimensional zero vector. To regularize the posterior\ndistribution p(Z|X; \u03b8) to be close to Q, PR proposes to add a regularization term \u2126(p(Z|X; \u03b8)) to\nthe MLE objective. The optimization problem and regularization are given by:\n\nmax_\u03b8 J(\u03b8) = log \u222b_Z p(X, Z; \u03b8) dZ \u2212 \u2126(p(Z|X; \u03b8)). (3)\n\n\u2126(p(Z|X; \u03b8)) = KL(Q || p(Z|X; \u03b8)) = min_{q \u2208 Q} KL(q(Z) || p(Z|X; \u03b8)). (4)\n\nThe regularization term is the KL divergence between Q and p(Z|X; \u03b8) as defined in Eqn. (4). When\nthe regularization term is convex, a closed-form solution can be found based on convex analysis [3].\nTherefore, the EM algorithm [28] can be applied to optimize the regularized likelihood J(\u03b8) [9].\nHowever, EM is largely limited when we extend PR to DGMs because of the high non-linearity\nintroduced by neural networks. We therefore propose our method by introducing amortized variational\ninference to efficiently solve the problem.\n\n3 Method\nIn this section, we first define a variant of AIR which uses a parametric prior distribution to capture\nthe dependency among objects. Then we give a formal definition of the amortized structural regularization\n(ASR) framework. We mainly follow the notation in Sec. 2, and we abuse the notation when\nsymbols share the same role in PR and ASR. We illustrate our proposed framework in Fig. 2.\n\n3.1 Generative & inference model\nThe prior distribution in the vanilla AIR is fixed, and the latent variables of objects are sampled\nindependently. Therefore, the structures, i.e., the attributes and their dependency, cannot be captured\nby the generative model. 
We propose to modify the generative model by using a learnable prior.\nSpecifically, an auxiliary variable z_pres is used to model the number of objects by denoting whether\nthe generation process is terminated at step t (i.e., z_pres^t = 0) or not (i.e., z_pres^t = 1). Besides, the\nattributes (i.e., latent variables of each object) are sampled conditioned on previously sampled latent\nvariables. Formally, the joint distribution is defined as follows:\n\np(x, z, n; \u03b8) = p(z_pres^{n+1} = 0 | z^{\u2264n}; \u03b8) ( \u220f_{t=1}^{n} p(z_pres^t = 1 | z^{<t}; \u03b8) p(z^t | z^{<t}; \u03b8) ) p(x | z, z_pres; \u03b8), (5)\n\nwhere \u03b8 denotes the parameters for both the prior distribution and the conditional distribution, and\nwe set z_pres^0 = 1 and z^0 = 0. In the following, we omit \u03b8 for simplicity. Following AIR, the\nconditional distribution p(x|z, z_pres) is defined as p(x|z, z_pres) = p(x | \u2211_{i=1:n} f_dec(z^i)). We use a\nrecurrent neural network (RNN) [11] to model the dependency among the latent variables z, z_pres,\nand use a feed-forward neural network as the decoder to map the latent variables to the image space.\n\nFigure 2: The proposed framework. The blue arrows denote the generative and inference network in\nAIR. The red arrows highlight the difference between ASR and AIR. The red arrows in the generative\nmodel represent the dependency among the latent variables in the generative model. A regularization\nterm is introduced to regularize the generative model, and we use the overlapping term as an example.\n\nThe latent variable z consists of three parts: z = {z_app, z_loc, z_scale}, which represent the appearances,\nlocations, and scales respectively. The distribution of z^t conditioned on previous z^{<t} is given by:\n\np(z^t | z^{<t}; \u03b8) = p(z_loc^t | z^{<t}) p(z_scale^t | z^{<t}, z_loc^t) p(z_app),\n\nwhere the current scale and location are sampled conditionally on previously sampled results. 
Since we\nonly consider the spatial relation among the objects, the dependency among their appearances\nis ignored and the appearance variables are independently sampled from a simple prior distribution.\nThe inference model is defined mainly following AIR, and is given by:\n\nq(z, n | x; \u03c6) = q(z_pres^{n+1} = 0 | z^{\u2264n}, x; \u03c6) \u220f_{t=1}^{n} q(z_pres^t = 1 | z^{<t}, x; \u03c6) q(z^t | z^{<t}, x; \u03c6), (6)\n\nwhere \u03c6 \u2208 \u03a6 denotes the parameters and \u03a6 denotes the parameter space of \u03c6. Similar to the\ngenerative process, the variational posterior distribution q(z^t | z^{<t}, x; \u03c6) is given by:\n\nq(z^t | z^{<t}, x) = q(z_loc^t | z^{<t}) q(z_scale^t | z^{<t}, z_loc^t) q(z_app | z_loc^t, z_scale^t).\n\nThe generative model defined in Eqn. (5) is powerful enough to capture complex structures. However,\ndirectly optimizing the marginal log-likelihood (or its lower bound) of training data often gets stuck at\ncertain local optima, where the model fails to capture the structures. This phenomenon emerges in\nthe baselines as reported in both previous work [21] and our experiments. See details in Sec. 6.1.\n\n3.2 Amortized structural regularization\nIn original PR, a set of statistics \u03c8 is used to define the valid set Q in Eqn. (2). In ASR, we generalize\nthe constraints as a functional F that maps a distribution defined over the latent space to R^d, with d\ndenoting the number of constraints. The resulting valid set Q is given by:\n\nQ = {q(Z) | F(q(Z)) \u2264 0}, (7)\n\nwhere 0 is a d-dimensional zero vector. To train the DGMs using gradient-based methods efficiently,\nwe require that the functional F is differentiable w.r.t. q.\nMotivated by PR, ASR regularizes the posterior distribution p(Z|X; \u03b8) to be close to the valid set Q,\nby minimizing a regularization term \u2126(p(Z|X; \u03b8)) along with maximizing the likelihood of training\ndata. 
The objective function is given by:\n\nmax_\u03b8 J(\u03b8) = log \u222b_Z p(X, Z; \u03b8) dZ \u2212 \u2126(p(Z|X; \u03b8)). (8)\n\nThe definition of the regularization term \u2126 follows original PR as in Eqn. (4). Note that\nKL(q(Z)||p(Z|X; \u03b8)) \u2265 \u2126(p(Z|X; \u03b8)) for all q(Z) \u2208 Q. This enables us to obtain a lower bound of\nJ(\u03b8) by substituting KL(q(Z)||p(Z|X; \u03b8)) for \u2126(p(Z|X; \u03b8)), which is given by:\n\nJ(\u03b8) \u2265 log \u222b_Z p(X, Z; \u03b8) dZ \u2212 KL(q(Z) || p(Z|X; \u03b8)) = J'(\u03b8, q). (9)\n\nFollowing variational inference, the lower bound J' can be formulated as the evidence lower\nbound (ELBO), and Problem (8) is converted into a constrained optimization problem as follows:\n\nmax_{\u03b8, q(Z) \u2208 Q} J'(\u03b8, q) = log p(X) \u2212 E_{q(Z)} log [q(Z) / p(Z|X; \u03b8)] = E_{q(Z)} log [p(X, Z; \u03b8) / q(Z)].\n\nMotivated by amortized variational inference [19], we introduce a recognition model q(Z|X; \u03c6) to\napproximate the variational distribution q, where \u03c6 denotes the parameters of the recognition model.\nTherefore, the lower bound can be optimized w.r.t. 
\u03b8 and \u03c6 jointly, which is given by:\n\nmax_{\u03b8 \u2208 \u0398, \u03c6 \u2208 \u03a6, q(Z|X; \u03c6) \u2208 Q} E_{q(Z|X; \u03c6)} log [p(X, Z; \u03b8) / q(Z|X; \u03c6)]. (10)\n\nWe abuse the notation J'(\u03b8, \u03c6) to denote the amortized version of the lower bound.\nProblem (10) is a constrained optimization problem. In order to efficiently solve Problem (10), we\npropose to relax the constraints into a penalty, and add it to the objective function J'(\u03b8, \u03c6) as:\n\nmax_{\u03b8 \u2208 \u0398, \u03c6 \u2208 \u03a6} J'(\u03b8, \u03c6) \u2212 R(q(Z|X; \u03c6)), (11)\n\nwhere R(q) = \u2211_{i=1:d} \u03bb_i max{F_i(q), 0}, and \u03bb_i is the coefficient for the i-th constraint of F(q). For\nsufficiently large \u03bb, Problem (11) is equivalent to Problem (10), and we treat \u03bb as a hyperparameter.\nThe training procedure is described in Appendix A.\nIt is worth noting that we implicitly add another regularization to the generative model when defining\nq using a parametric model: the posterior distribution p(Z|X; \u03b8) can be represented by q(Z|X; \u03c6).\nThis regularization term has the same effect as in VAE [19, 27], which is introduced to make the\noptimization process more efficient. In contrast, it is the penalty term R(q(Z|X; \u03c6)) that embeds\nhuman knowledge into DGMs and regularizes DGMs for desirable behavior.\n\n4 Application on multi-object generation\nIn the following, we give two examples of applying ASR to image generation with multiple objects.\nIn this section, we mainly focus on regularizing the number of objects and the spatial relationships\namong them. Therefore, the functional F in Eqn. (7) is defined over q(z_pres, z_loc, z_scale).\n\n4.1 ASR regularization on the number of objects\nIn this setting, we consider the case where each image contains a certain number of objects. 
For\nexample, each image has either 2 or 4 objects, and images of each number of objects appear\nwith the same frequency. We define the possible numbers of objects as L \u228a [K], where [K] =\n{0, 1, \u00b7\u00b7\u00b7 , K \u2212 1} is the set of all non-negative integers less than K, and K is the largest number\nof objects we consider. Since we use z_pres to denote the number of objects, an image x with n\nobjects is equivalent to the corresponding latent variable z_pres|x = u_n with probability one, where\nu_n is an (n + 1)-dimensional binary vector with n ones followed by a zero. We further denote q_i as\nq_i(z_pres = u_j) = 1(i = j), where 1 is the indicator function. The valid posterior is given by\nV_{z_pres} = {q_i}_{i \u2208 L}. According to ASR, we regularize our variational posterior q(Z|X; \u03c6) in the valid\nposterior set V_{z_pres}. Besides, we also regularize the marginal distribution to q_uni(z) = (1/|L|) \u2211_{i \u2208 L} q_i,\nwhich is a uniform distribution over V_{z_pres}. The valid posterior set is given by:\n\nQ_num = {q(Z|X) | q(Z|X = x) \u2208 V_{z_pres} \u2200 x \u2208 D, E_{p(X)} q(Z|X) = q_uni(Z)},\n\nwhere D denotes the set of all training samples. As the constraints are defined in equality form,\nwe reformulate them in inequality form; the valid set and the regularization term R_num are given by:\n\nQ_num = {q(Z|X) | min_{q_i \u2208 V_{z_pres}} KL(q_i || q(Z|X)) \u2264 0, KL(q_uni(Z) || E_{p(X)} q(Z|X)) \u2264 0},\n\nR_num(q(Z|X)) = \u03bb_1^num min_{q_i \u2208 V_{z_pres}} KL(q_i || q(Z|X)) + \u03bb_2^num KL(q_uni(Z) || E_{p(X)} q(Z|X)).\n\n\u03bb_1^num and \u03bb_2^num are the hyper-parameters to balance the penalty term and the log-likelihood.\n\n4.2 ASR regularization on overlap\nIn this setting, we focus on the overlap problem, and we introduce several regularization terms to\nreduce the overlap among objects, which is defined over the location of bounding boxes. 
The location\nof a bounding box is determined by its center z_loc = (z_x, z_y) and scale z_scale, and the functional F^o\nis defined over these latent variables.\nThe first set of regularization terms directly penalizes the overlap. Given the centers and scales of the\ni-th and j-th bounding boxes, they do not overlap if and only if at least one of the following constraints\nis satisfied, i.e., the boxes are separated along the x-axis or the y-axis: (z_scale^i + z_scale^j)/2 \u2212 |z_x^i \u2212 z_x^j| \u2264 0, (z_scale^i + z_scale^j)/2 \u2212 |z_y^i \u2212 z_y^j| \u2264 0. These constraints have a\nstraightforward explanation and are illustrated in Fig. 2.\nIn the following, we denote \u2113(x) = max{x, 0} for simplicity, and we define the functional F^o as:\n\nF_1^o(q) = E_{q(z)} \u2211_{i,j<n, i\u2260j} \u2113( (z_scale^i + z_scale^j)/2 \u2212 max{|z_x^i \u2212 z_x^j|, |z_y^i \u2212 z_y^j|} ) \u2264 0,\n\nwhich regularizes each pair of the bounding boxes to reduce overlap.\nSimply regularizing the overlap by minimizing F_1^o usually results in inferred bounding\nboxes of very different sizes: a big bounding box that covers the whole image, and several bounding\nboxes of extremely small size that lie beside the boundary of the image, or outside the image. To\novercome this issue, we add another two sets of regularization terms, where the first constrains the\nbounding boxes to stay within the image, and the second constrains the bounding boxes to be of similar\nsize. 
The first set of regularization terms is formulated as the following four constraints:\n\nF_2^o(q) = E_{q(z)} \u2211_{i=1:n} \u2113(z_scale^i/2 \u2212 z_x^i) \u2264 0, F_3^o(q) = E_{q(z)} \u2211_{i=1:n} \u2113(z_x^i + z_scale^i/2 \u2212 S) \u2264 0,\n\nF_4^o(q) = E_{q(z)} \u2211_{i=1:n} \u2113(z_scale^i/2 \u2212 z_y^i) \u2264 0, F_5^o(q) = E_{q(z)} \u2211_{i=1:n} \u2113(z_y^i + z_scale^i/2 \u2212 S) \u2264 0,\n\nand the second set of regularization terms is given by:\n\nF_6^o(q) = E_{q(z)} \u2211_{i=1:n} [\u2113(c_min \u2212 z_scale^i) + \u2113(z_scale^i \u2212 c_max)] \u2264 0,\n\nF_7^o(q) = E_{q(z)} \u2211_{i,j<n} \u2113(|z_scale^i \u2212 z_scale^j| \u2212 \u03b5) \u2264 0,\n\nwhere S denotes the size of the final image, c_min/c_max denotes the possible minimum/maximum\nsize of an object, and \u03b5 denotes the allowed perturbation of the size of objects. Therefore, the regularization\nfor reducing overlap is given by:\n\nR^o(q) = \u2211_{i=1:7} \u03bb_i^o F_i^o(q). (12)\n\n5 Related work\nRecently, several methods [8, 12, 16, 29, 24] introduce structural information into deep generative\nmodels. Eslami et al. [8] propose the Attend-Infer-Repeat (AIR), which defines an iterative generative\nprocess to compose an image with multiple objects. Greff et al. [12] further generalize this method\nto more complicated images, by jointly modeling the background and objects using masks. Li et al.\n[24] use graphical networks to model the latent structures of an image, and generalize probabilistic\ngraphical models to the context of implicit generative models. Johnson et al. [16] introduce the scene\ngraph as conditional information to generate scene images. Xu et al. 
[29] use the and-or graph to\nmodel the latent structures and use a refinement network to map the structures to the image space.\nTo embed prior knowledge into structured generative models, posterior regularization (PR) [9]\nprovides a flexible framework to regularize a model w.r.t. a set of structural constraints. Zhu et al. [30]\ngeneralize this framework to Bayesian inference and apply it in the non-parametric setting. Shu\net al. [27] propose to regularize the smoothness of the inference model to improve the generalization\non both inference and generation, and refer to it as amortized inference regularization. Li et al. [23]\npropose to regularize the latent space of a latent variable model with large-margin in the context of\namortized variational inference, which can also be considered as a special case of PR. Bilen et al. [2]\napply PR to object detection in a discriminative manner and improve the detection accuracy.\n\n(a) The reconstruction of AIR-13.\n\n(b) The reconstruction of AIR-pPrior-13.\n\n(c) The reconstruction of AIR-ASR-13.\n\nFigure 3: The reconstruction results of Multi-MNIST on 1 or 3 objects.\n\n6 Experiments\n\nIn this section, we present the empirical results of ASR on two datasets: Multi-MNIST [8] and\nMulti-dSprites [12], which are the multi-object versions of MNIST [22] and dSprites [13]. We use\nAIR-pPrior to denote the variant of AIR proposed in this paper, and AIR-ASR to denote the regularized\nAIR-pPrior using ASR.\nWe implement our model using the TensorFlow [1] library. In our experiments, the RNNs in both\nthe generative model and recognition model are LSTM [14] with 256 hidden units. A variational\nauto-encoder [19] is used to encode and decode the appearance latent variables, and both the encoder\nand decoder are implemented as a two-layer MLP with 512 and 256 units. We use the Adam\noptimizer [18] with a learning rate of 0.001, \u03b2_1 = 0.9, and \u03b2_2 = 0.999. 
We train models for 300\nepochs with a batch size of 64. Our code is attached in the supplementary materials for reproducibility.\nIn this paper, we use four metrics for quantitative evaluation: negative ELBO (nELBO), squared\nerror (SE), inference accuracy (ACC) and mean intersection over union (mIoU). The nELBO is\nan upper bound of the negative log-likelihood, where a lower value indicates a better approximation\nof the data distribution. The SE is the squared error between the original image and its reconstruction,\nand it is summed over pixels. The ACC is defined as 1(num_inf == num_gt), where num_inf and\nnum_gt are the number of objects inferred by the recognition model and the ground truth respectively.\nThis evaluation metric demonstrates whether the inference model can correctly infer the exact\nnumber of objects in an image. Besides, we also use another evaluation metric, mIoU, to evaluate\nthe accuracy of the inferred location for each object. The mIoU of a single image is defined as\nmax_\u03c0 \u2211_{i=1:min{num_inf, num_gt}} IoU(z^{\u03c0_i}, gt_i) / max{num_inf, num_gt}, where \u03c0 is a permutation\nof {1, 2, \u00b7\u00b7\u00b7 , num_inf} and gt_i is the ground truth location for the i-th object.\n\nTable 1: Results on regularization on the number of objects. The numbers following the model name\ndenote the possible numbers of objects for a certain image. 
Results are averaged over 3 runs.\n\nMethods | nELBO | ACC | SE | mIoU\nAIR-13 | 404.41 \u00b1 4.58 | 0.81 \u00b1 0.23 | 31.94 \u00b1 4.68 | 0.61 \u00b1 0.13\nAIR-pPrior-13 | 405.21 \u00b1 1.17 | 0.48 \u00b1 0.00 | 49.42 \u00b1 0.24 | 0.43 \u00b1 0.01\nAIR-ASR-13 | 360.20 \u00b1 19.67 | 0.96 \u00b1 0.00 | 28.84 \u00b1 1.11 | 0.61 \u00b1 0.00\nAIR-14 | 543.44 \u00b1 54.71 | 0.48 \u00b1 0.03 | 52.77 \u00b1 4.92 | 0.43 \u00b1 0.07\nAIR-pPrior-14 | 519.06 \u00b1 5.47 | 0.50 \u00b1 0.00 | 68.72 \u00b1 0.55 | 0.43 \u00b1 0.00\nAIR-ASR-14 | 441.54 \u00b1 30.97 | 0.96 \u00b1 0.01 | 41.05 \u00b1 7.11 | 0.55 \u00b1 0.08\nAIR-24 | 639.49 \u00b1 23.13 | 0.55 \u00b1 0.09 | 57.69 \u00b1 4.88 | 0.46 \u00b1 0.06\nAIR-pPrior-24 | 643.28 \u00b1 8.67 | 0.00 \u00b1 0.00 | 83.35 \u00b1 0.44 | 0.10 \u00b1 0.00\nAIR-ASR-24 | 495.73 \u00b1 35.80 | 0.98 \u00b1 0.01 | 48.54 \u00b1 5.60 | 0.54 \u00b1 0.08\n\n6.1 ASR regularization on the number of objects\nWhen regularizing the number of objects, we consider three settings on Multi-MNIST: 1 or 3\nobjects, 1 or 4 objects, and 2 or 4 objects. We synthesize 40000 training samples, with 20000\nimages for each number of objects. 2000 images are used as the test data to evaluate the performance\nof inference. In this setting, we evaluate our methods with \u03bb_1^num, \u03bb_2^num \u2208 {1, 10, 100}, and we\nfinally set \u03bb_1^num = 10 and \u03bb_2^num = 100.\nAs illustrated in Fig. 3, AIR-pPrior simply treats the whole image as a single object, and fails to\nidentify the objects in an image. With a powerful decoder network, the generative model tends to\nignore the latent structures. ASR can successfully regularize the model towards proper behavior.\n\n(a) The reconstruction of AIR-3.\n\n(b) The reconstruction of AIR-pPrior-3.\n\n(c) The reconstruction of AIR-ASR-3.\n\nFigure 4: The reconstruction results of Multi-MNIST on 3 objects. There is no overlap among\nobjects in the training data. 
ASR can successfully infer the underlying structures, and improve the\nreconstruction results.\n\n(a) The generative results of AIR-3.\n\n(b) The generative results of AIR-pPrior-3.\n\n(c) The generative results of AIR-ASR-3.\n\nFigure 5: The generative results of Multi-dSprites on 3 objects without overlap.\n\nIn AIR-ASR, the inference model can successfully identify each object, and the generative model\nlearns the underlying structures. The original AIR has a better performance compared to AIR-pPrior,\nas the prior distribution can partly regularize the generative model. However, the original AIR still\ntreats two objects close to each other as one object. The performance of these three models on the\nother two settings shares the same property, i.e., the original AIR tends to merge objects and AIR-pPrior\ngets stuck at a local optimum. The other reconstruction results are illustrated in the Appendix.\nTable 1 presents the quantitative results. AIR-ASR matches or outperforms its baseline and the original AIR on\nall the evaluation metrics, which demonstrates the effectiveness of our proposed method. Specifically,\nASR can significantly regularize the model in terms of the inference steps and achieves an accuracy\nof at least 96% in all three settings. It is worth noting that introducing a proper regularization does not\nhurt the ELBO, which is the objective function of AIR and AIR-pPrior: in Table 1, AIR-ASR attains the lowest nELBO in all three settings. The main reason is that ASR\ncan encourage the model to avoid the unsatisfactory behavior which violates the structural constraints.\nDuring the training process, all three models suffer from severe instability. As a result,\nthe nELBO has a large variance. The results largely depend on the initialization and the randomness\nin the training process. We try to reduce the effect of randomness by fixing the initialization and\naveraging our results over multiple runs.\n\nTable 2: Experimental results on regularization over overlap. 
Results are averaged over 3 runs.\n\nMethods | Multi-MNIST nELBO | Multi-MNIST SE | Multi-MNIST mIoU | Multi-dSprites nELBO | Multi-dSprites SE | Multi-dSprites mIoU\nAIR | 328.5 \u00b1 17.1 | 37.5 \u00b1 3.8 | 0.25 \u00b1 0.03 | 341.5 \u00b1 76.5 | 34.8 \u00b1 8.9 | 0.13 \u00b1 0.05\nAIR-pPrior | 306.6 \u00b1 58.8 | 41.5 \u00b1 15.4 | 0.35 \u00b1 0.10 | 274.3 \u00b1 64.4 | 29.3 \u00b1 12.1 | 0.21 \u00b1 0.13\nAIR-ASR | 337.3 \u00b1 55.1 | 36.5 \u00b1 3.9 | 0.67 \u00b1 0.05 | 271.8 \u00b1 18.8 | 20.9 \u00b1 2.1 | 0.61 \u00b1 0.03\n\n6.2 ASR regularization on the overlap\nWhen regularizing the overlap, we evaluate models on both Multi-MNIST and Multi-dSprites data.\nWe use 20000 images with three non-overlapping objects as training data and use 1000 images to\nevaluate performance. Since the number of objects is fixed, we simply set both the generative and\ninference steps to 3 for fair comparison. We search the hyper-parameters \u03bb^o_{i=1:7} in {1, 10, 20, 100},\nand we set \u03bb^o_1 \u223c \u03bb^o_5 to 1, \u03bb^o_6 to 20, and \u03bb^o_7 to 10.\nThe reconstruction of Multi-MNIST and generative results of Multi-dSprites are demonstrated in Fig. 4\nand Fig. 5 respectively. In Fig. 4, the original AIR still merges two objects as one, and it cannot\ncapture the non-overlapping structures. AIR-pPrior has a similar performance. In contrast, AIR-ASR\nsignificantly outperforms its baselines, and infers the location of bounding boxes without overlap.\nIn terms of generative results, the sample quality of AIR-ASR surpasses AIR\u2019s and AIR-pPrior\u2019s:\nAIR-ASR can generate multiple objects without overlap whereas its baselines cannot. This\ndemonstrates that ASR can embed human knowledge into DGMs.\nTable 2 presents the quantitative results. AIR-ASR surpasses its baselines significantly in terms\nof mIoU, which indicates that DGMs successfully capture the non-overlapping structures with\nASR. 
It is worth noting that for the Multi-MNIST setting, the nELBO of AIR-pPrior is better than\nAIR-ASR\u2019s. However, AIR-ASR still surpasses AIR-pPrior in terms of the SE and the mIoU, which\nindicates that AIR-ASR gives better reconstruction results and identifies the location of objects more\naccurately. These results also support the claim that simply optimizing the marginal log-likelihood\ncannot guarantee that the generative model captures the underlying distribution.\n\n7 Conclusion\n\nWe present ASR, a framework that embeds human knowledge to improve the inference and generative\nperformance of structured DGMs for multi-object generation. ASR encodes human knowledge as a\nset of structural constraints, and the framework can be optimized efficiently. We use the number of\nobjects and the spatial relationships among them as two examples to demonstrate the effectiveness of\nour proposed method. On the Multi-MNIST and Multi-dSprites datasets, ASR significantly improves over its\nbaselines and successfully captures the underlying structures of the training data.\nASR is a general framework for incorporating structural knowledge into DGMs, as long as the\nknowledge can be represented quantitatively, and it can be applied to a wide range of structured DGMs.\nIn this paper, we only consider the cases with hard constraints on synthetic datasets. For one thing, it\nhas been shown that PR can be extended to \u201cselectively\u201d incorporate uncertain knowledge (e.g., with noise)\nrepresented by the general language of first-order logic [25], where highly uncertain knowledge is\ndropped according to the faithfulness of fitting the given data. Further, Hu et al. [15] extend PR\nto learnable constraints using differentiable neural networks. ASR extends PR to an amortized\nversion for structured generation, thereby inheriting the generality in a principled manner. 
For another,\nsignificant progress has recently been made in structured generative models [12, 4] for more realistic\nmulti-object images. Given the theoretical generality and the practical progress, ASR can be\napplied to more complicated applications, which we leave as future work.\n\nAcknowledgements\n\nThis work was supported by the National Key Research and Development Program of China (No.\n2017YFA0700904), NSFC Projects (Nos. 61620106010, 61621136008), Beijing NSF Project (No.\nL172037), Beijing Academy of Artificial Intelligence (BAAI), Tiangong Institute for Intelligent\nComputing, the JP Morgan Faculty Research Program and the NVIDIA NVAIL Program with\nGPU/DGX Acceleration. C. Li was supported by the Chinese postdoctoral innovative talent support\nprogram and Shuimu Tsinghua Scholar.\n\nReferences\n[1] Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu\nDevin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for\nlarge-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and\nImplementation (OSDI 16), pages 265\u2013283, 2016.\n\n[2] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with\nposterior regularization. In British Machine Vision Conference, volume 3, 2014.\n\n[3] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press,\n2004.\n\n[4] Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt\nBotvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation.\narXiv preprint arXiv:1901.11390, 2019.\n\n[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. 
In\nAdvances in Neural Information Processing Systems, pages 2172\u20132180, 2016.\n\n[6] Chongxuan Li, Taufik Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In\nAdvances in Neural Information Processing Systems, pages 4088\u20134098, 2017.\n\n[7] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni,\nKai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture\nvariational autoencoders. arXiv preprint arXiv:1611.02648, 2016.\n\n[8] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E\nHinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. In\nAdvances in Neural Information Processing Systems, pages 3225\u20133233, 2016.\n\n[9] Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured\nlatent variable models. Journal of Machine Learning Research, 11(Jul):2001\u20132049, 2010.\n\n[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil\nOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural\nInformation Processing Systems, pages 2672\u20132680, 2014.\n\n[11] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep\nrecurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and\nSignal Processing, pages 6645\u20136649. IEEE, 2013.\n\n[12] Klaus Greff, Rapha\u00ebl Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel\nZoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation\nlearning with iterative variational inference. arXiv preprint arXiv:1903.00450, 2019.\n\n[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick,\nShakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a\nconstrained variational framework. 
In International Conference on Learning Representations,\nvolume 3, 2017.\n\n[14] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):\n1735\u20131780, 1997.\n\n[15] Zhiting Hu, Zichao Yang, Ruslan R Salakhutdinov, Lianhui Qin, Xiaodan Liang, Haoye\nDong, and Eric P Xing. Deep generative models with learnable knowledge constraints. In\nAdvances in Neural Information Processing Systems, pages 10501\u201310512, 2018.\n\n[16] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n1219\u20131228, 2018.\n\n[17] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for\nimproved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.\n\n[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[19] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint\narXiv:1312.6114, 2013.\n\n[20] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised\nlearning with deep generative models. In Advances in Neural Information Processing Systems,\npages 3581\u20133589, 2014.\n\n[21] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer,\nrepeat: Generative modelling of moving objects. In Advances in Neural Information Processing\nSystems, pages 8606\u20138616, 2018.\n\n[22] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[23] Chongxuan Li, Jun Zhu, Tianlin Shi, and Bo Zhang. Max-margin deep generative models. In\nAdvances in Neural Information Processing Systems, pages 1837\u20131845, 2015.\n\n[24] Chongxuan Li, Max Welling, Jun Zhu, and Bo Zhang. 
Graphical generative adversarial networks.\narXiv preprint arXiv:1804.03429, 2018.\n\n[25] Shike Mei, Jun Zhu, and Jerry Zhu. Robust regbayes: Selectively incorporating first-order logic\ndomain knowledge into bayesian models. In International Conference on Machine Learning,\npages 253\u2013261, 2014.\n\n[26] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural\nnetworks. arXiv preprint arXiv:1601.06759, 2016.\n\n[27] Rui Shu, Hung H Bui, Shengjia Zhao, Mykel J Kochenderfer, and Stefano Ermon. Amortized\ninference regularization. In Advances in Neural Information Processing Systems, pages 4393\u2013\n4402, 2018.\n\n[28] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and\nvariational inference. Foundations and Trends\u00ae in Machine Learning, 1(1\u20132):1\u2013305, 2008.\n\n[29] Kun Xu, Haoyu Liang, Jun Zhu, Hang Su, and Bo Zhang. Deep structured generative models.\narXiv preprint arXiv:1807.03877, 2018.\n\n[30] Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and\napplications to infinite latent svms. The Journal of Machine Learning Research, 15(1):1799\u2013\n1847, 2014.\n", "award": [], "sourceid": 3589, "authors": [{"given_name": "Taufik", "family_name": "Xu", "institution": "Tsinghua University"}, {"given_name": "Chongxuan", "family_name": "LI", "institution": "Tsinghua University"}, {"given_name": "Jun", "family_name": "Zhu", "institution": "Tsinghua University"}, {"given_name": "Bo", "family_name": "Zhang", "institution": "Tsinghua University"}]}