{"title": "Max-Margin Deep Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1837, "page_last": 1845, "abstract": "Deep generative models (DGMs) are effective on learning multilayered representations of complex data and performing inference of input data by exploring the generative ability. However, little work has been done on examining or empowering the discriminative ability of DGMs on making accurate predictions. This paper presents max-margin deep generative models (mmDGMs), which explore the strongly discriminative principle of max-margin learning to improve the discriminative power of DGMs, while retaining the generative capability. We develop an efficient doubly stochastic subgradient algorithm for the piecewise linear objective. Empirical results on MNIST and SVHN datasets demonstrate that (1) max-margin learning can significantly improve the prediction performance of DGMs and meanwhile retain the generative ability; and (2) mmDGMs are competitive to the state-of-the-art fully discriminative networks by employing deep convolutional neural networks (CNNs) as both recognition and generative models.", "full_text": "Max-Margin Deep Generative Models\n\nChongxuan Li\u2020, Jun Zhu\u2020, Tianlin Shi\u2021, Bo Zhang\u2020\n\n\u2020Dept. of Comp. Sci. & Tech., State Key Lab of Intell. Tech. & Sys., TNList Lab,\n\nCenter for Bio-Inspired Computing Research, Tsinghua University, Beijing, 100084, China\n{licx14@mails., dcszj@, dcszb@}tsinghua.edu.cn; stl501@gmail.com\n\n\u2021Dept. of Comp. Sci., Stanford University, Stanford, CA 94305, USA\n\nAbstract\n\nDeep generative models (DGMs) are effective on learning multilayered represen-\ntations of complex data and performing inference of input data by exploring the\ngenerative ability. However, little work has been done on examining or empower-\ning the discriminative ability of DGMs on making accurate predictions. 
This pa-\nper presents max-margin deep generative models (mmDGMs), which explore the\nstrongly discriminative principle of max-margin learning to improve the discrim-\ninative power of DGMs, while retaining the generative capability. We develop an\nef\ufb01cient doubly stochastic subgradient algorithm for the piecewise linear objec-\ntive. Empirical results on MNIST and SVHN datasets demonstrate that (1) max-\nmargin learning can signi\ufb01cantly improve the prediction performance of DGMs\nand meanwhile retain the generative ability; and (2) mmDGMs are competitive to\nthe state-of-the-art fully discriminative networks by employing deep convolutional\nneural networks (CNNs) as both recognition and generative models.\n\n1\n\nIntroduction\n\nMax-margin learning has been effective on learning discriminative models, with many examples\nsuch as univariate-output support vector machines (SVMs) [5] and multivariate-output max-margin\nMarkov networks (or structured SVMs) [30, 1, 31]. However, the ever-increasing size of complex\ndata makes it hard to construct such a fully discriminative model, which has only single layer of\nadjustable weights, due to the facts that: (1) the manually constructed features may not well capture\nthe underlying high-order statistics; and (2) a fully discriminative approach cannot reconstruct the\ninput data when noise or missing values are present.\nTo address the \ufb01rst challenge, previous work has considered incorporating latent variables into\na max-margin model, including partially observed maximum entropy discrimination Markov net-\nworks [37], structured latent SVMs [32] and max-margin min-entropy models [20]. All this work\nhas primarily focused on a shallow structure of latent variables. To improve the \ufb02exibility, learn-\ning SVMs with a deep latent structure has been presented in [29]. However, these methods do not\naddress the second challenge, which requires a generative model to describe the inputs. 
The re-\ncent work on learning max-margin generative models includes max-margin Harmoniums [4], max-\nmargin topic models [34, 35], and nonparametric Bayesian latent SVMs [36] which can infer the\ndimension of latent features from data. However, these methods only consider the shallow structure\nof latent variables, which may not be \ufb02exible enough to describe complex data.\nMuch work has been done on learning generative models with a deep structure of nonlinear hidden\nvariables, including deep belief networks [25, 16, 23], autoregressive models [13, 9], and stochastic\nvariations of neural networks [3]. For such models, inference is a challenging problem, but for-\ntunately there exists much recent progress on stochastic variational inference algorithms [12, 24].\nHowever, the primary focus of deep generative models (DGMs) has been on unsupervised learning,\n\n1\n\n\fwith the goals of learning latent representations and generating input samples. Though the latent\nrepresentations can be used with a downstream classi\ufb01er to make predictions, it is often bene\ufb01cial\nto learn a joint model that considers both input and response variables. One recent attempt is the\nconditional generative models [11], which treat labels as conditions of a DGM to describe input\ndata. This conditional DGM is learned in a semi-supervised setting, which is not exclusive to ours.\nIn this paper, we revisit the max-margin principle and present a max-margin deep generative model\n(mmDGM), which learns multi-layer representations that are good for both classi\ufb01cation and in-\nput inference. Our mmDGM conjoins the \ufb02exibility of DGMs on describing input data and the\nstrong discriminative ability of max-margin learning on making accurate predictions. We formulate\nmmDGM as solving a variational inference problem of a DGM regularized by a set of max-margin\nposterior constraints, which bias the model to learn representations that are good for prediction. 
We define the max-margin posterior constraints as a linear functional of the target variational distribution of the latent representations. Then, we develop a doubly stochastic subgradient descent algorithm, which generalizes the Pegasos algorithm [28] to consider nontrivial latent variables. For the variational distribution, we build a recognition model to capture the nonlinearity, similarly to [12, 24].\nWe consider two types of networks used as our recognition and generative models: multilayer perceptrons (MLPs) as in [12, 24] and convolutional neural networks (CNNs) [14]. Though CNNs have shown promising results in various domains, especially for image classification, little work has been done to take advantage of CNNs to generate images. The recent work [6] presents a type of CNN to map manual features, including class labels, to RGB chair images by applying unpooling, convolution and rectification sequentially; but it is a deterministic mapping and there is no random generation. Generative Adversarial Nets [7] employ a single such layer together with MLPs in a minimax two-player game framework, with the primary goal of generating images. We propose to stack this structure to form a highly nontrivial deep generative network that generates images from latent variables learned automatically by a recognition model using a standard CNN. We present the detailed network structures in the experiments section. Empirical results on the MNIST [14] and SVHN [22] datasets demonstrate that mmDGM can significantly improve the prediction performance, which is competitive to the state-of-the-art methods [33, 17, 8, 15], while retaining the capability of generating input samples and completing their missing values.\n\n2 Basics of Deep Generative Models\nWe start from a general setting, where we have N i.i.d. data X = {x_n}_{n=1}^N.
A deep generative model (DGM) assumes that each x_n \u2208 R^D is generated from a vector of latent variables z_n \u2208 R^K, which itself follows some distribution. The joint probability of a DGM is as follows:\n\np(X, Z|\u03b1, \u03b2) = \prod_{n=1}^N p(z_n|\u03b1) p(x_n|z_n, \u03b2),   (1)\n\nwhere p(z_n|\u03b1) is the prior of the latent variables and p(x_n|z_n, \u03b2) is the likelihood model for generating observations. For notation simplicity, we define \u03b8 = (\u03b1, \u03b2). Depending on the structure of z, various DGMs have been developed, such as the deep belief networks [25, 16], deep sigmoid networks [21], deep latent Gaussian models [24], and deep autoregressive models [9]. In this paper, we focus on the directed DGMs, which can easily be sampled from via an ancestral sampler.\nHowever, in most cases learning DGMs is challenging due to the intractability of posterior inference. The state-of-the-art methods resort to stochastic variational methods under the maximum likelihood estimation (MLE) framework, \u02c6\u03b8 = argmax_\u03b8 log p(X|\u03b8). Specifically, let q(Z) be the variational distribution that approximates the true posterior p(Z|X, \u03b8). A variational upper bound of the per-sample negative log-likelihood (NLL) \u2212log p(x_n|\u03b1, \u03b2) is:\n\nL(\u03b8, q(z_n); x_n) := KL(q(z_n)||p(z_n|\u03b1)) \u2212 E_{q(z_n)}[log p(x_n|z_n, \u03b2)],   (2)\n\nwhere KL(q||p) is the Kullback-Leibler (KL) divergence between distributions q and p. Then, L(\u03b8, q(Z); X) := \sum_n L(\u03b8, q(z_n); x_n) upper bounds the full negative log-likelihood \u2212log p(X|\u03b8).\nIt is important to notice that if we do not make any restricting assumption on the variational distribution q, the bound is tight by simply setting q(Z) = p(Z|X, \u03b8). That is, the MLE is equivalent to solving the variational problem: min_{\u03b8, q(Z)} L(\u03b8, q(Z); X).
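As a concrete illustration of the per-sample bound in Eq. (2), the sketch below estimates it for a standard-normal prior and a diagonal-Gaussian q, computing the KL term analytically and the expected log-likelihood by Monte Carlo. The Bernoulli pixel likelihood and the `decode` callback are our illustrative assumptions, not the paper's model:

```python
import numpy as np

def neg_elbo(x, mu, log_var, decode, n_samples=10, rng=None):
    """Estimate L(theta, q(z_n); x_n) = KL(q(z)||p(z)) - E_q[log p(x|z)]
    with p(z) = N(0, I) and q(z) = N(mu, diag(exp(log_var)))."""
    rng = rng or np.random.default_rng(0)
    # Analytic KL between a diagonal Gaussian and the standard normal.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # Monte Carlo estimate of the expected log-likelihood under q.
    log_lik = 0.0
    for _ in range(n_samples):
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        p = decode(z)  # assumed Bernoulli means in (0, 1) for each pixel
        log_lik += np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return kl - log_lik / n_samples
```

Minimizing this quantity over the model and variational parameters is exactly the variational problem stated above.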
However, since the true posterior is intractable except in a handful of special cases, we must resort to approximation methods. One common assumption is that the variational distribution is of some parametric form, q_\u03c6(Z), and then we optimize the variational bound w.r.t. the variational parameters \u03c6. For DGMs, another challenge arises in that the variational bound is often intractable to compute analytically. To address this challenge, the early work further bounds the intractable parts with tractable ones by introducing more variational parameters [26]. However, this technique increases the gap between the bound being optimized and the log-likelihood, potentially resulting in poorer estimates. Much recent progress [12, 24, 21] has been made on hybrid Monte Carlo and variational methods, which approximate the intractable expectations and their gradients over the parameters (\u03b8, \u03c6) via some unbiased Monte Carlo estimates. Furthermore, to handle large-scale datasets, stochastic optimization of the variational objective can be used with a suitable learning rate annealing scheme. It is important to notice that variance reduction is a key part of these methods in order to have fast and stable convergence.\nMost work on directed DGMs has focused on the generative capability of inferring the observations, such as filling in missing values [12, 24, 21], while little work has been done on investigating the predictive power, except the semi-supervised DGMs [11], which build a DGM conditioned on the class labels and learn the parameters via MLE.
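The unbiased Monte Carlo gradient estimates mentioned above can be illustrated with the reparameterization idea: writing z = \u03bc + \u03c3\u03b5 with \u03b5 ~ N(0, 1) turns the gradient of an expectation into an expectation of a gradient. A minimal one-dimensional sketch (all names are ours, supplied for illustration only):

```python
import numpy as np

def reparam_grad_mu(grad_f, mu, sigma, n_samples=1000, rng=None):
    """Unbiased estimate of d/dmu E_{z ~ N(mu, sigma^2)}[f(z)]:
    with z = mu + sigma * eps, the gradient equals E[f'(z)]."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    return np.mean(grad_f(z))
```

For f(z) = z^2 the true gradient is 2\u03bc, which the estimator recovers up to Monte Carlo noise; averaging over more samples is the simplest form of the variance reduction the text refers to.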
Below, we present max-margin deep generative\nmodels, which explore the discriminative max-margin principle to improve the predictive ability of\nthe latent representations, while retaining the generative capability.\n\n3 Max-margin Deep Generative Models\nWe consider supervised learning, where the training data is a pair (x, y) with input features x \u2208 RD\nand the ground truth label y. Without loss of generality, we consider the multi-class classi\ufb01cation,\nwhere y \u2208 C = {1, . . . , M}. A max-margin deep generative model (mmDGM) consists of two\ncomponents: (1) a deep generative model to describe input features; and (2) a max-margin classi\ufb01er\nto consider supervision. For the generative model, we can in theory adopt any DGM that de\ufb01nes a\njoint distribution over (X, Z) as in Eq. (1). For the max-margin classi\ufb01er, instead of \ufb01tting the input\nfeatures into a conventional SVM, we de\ufb01ne the linear classi\ufb01er on the latent representations, whose\nlearning will be regularized by the supervision signal as we shall see. Speci\ufb01cally, if the latent\nrepresentation z is given, we de\ufb01ne the latent discriminant function F (y, z, \u03b7; x) = \u03b7(cid:62)f (y, z),\nwhere f (y, z) is an M K-dimensional vector that concatenates M subvectors, with the yth being z\nand all others being zero, and \u03b7 is the corresponding weight vector.\nWe consider the case that \u03b7 is a random vector, following some prior distribution p0(\u03b7). Then\nour goal is to infer the posterior distribution p(\u03b7, Z|X, Y), which is typically approximated by a\nvariational distribution q(\u03b7, Z) for computational tractability. Notice that this posterior is different\nfrom the one in the vanilla DGM. We expect that the supervision information will bias the learned\nrepresentations to be more powerful on predicting the labels at testing. 
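The feature map f(y, z) defined above is simple to realize in code; the following helper (our own sketch, not from the paper) builds the MK-dimensional block vector whose y-th subvector is z:

```python
import numpy as np

def feature_map(y, z, n_classes):
    """Build f(y, z): M stacked subvectors of length K, with the y-th
    subvector equal to z and all others zero, so eta^T f(y, z) selects
    the y-th block of classifier weights."""
    K = z.shape[0]
    f = np.zeros(n_classes * K)
    f[y * K:(y + 1) * K] = z
    return f
```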
To account for the uncertainty\nof (\u03b7, Z), we take the expectation and de\ufb01ne the discriminant function F (y; x) = Eq\nand the \ufb01nal prediction rule that maps inputs to outputs is:\n\n(cid:2)\u03b7(cid:62)f (y, z)(cid:3) ,\n\n\u02c6y = argmax\n\ny\u2208C\n\nF (y; x).\n\n(3)\n\nNote that different from the conditional DGM [11], which puts the class labels upstream, the above\nclassi\ufb01er is a downstream model, in the sense that the supervision signal is determined by condi-\ntioning on the latent representations.\n\n3.1 The Learning Problem\nWe want to jointly learn the parameters \u03b8 and infer the posterior distribution q(\u03b7, Z). Based on the\nequivalent variational formulation of MLE, we de\ufb01ne the joint learning problem as solving:\n\n(4)\n\nN(cid:88)\n\nn=1\n\nL(\u03b8, q(\u03b7, Z); X) + C\n\n(cid:26)Eq[\u03b7(cid:62)\u2206fn(y)] \u2265 \u2206ln(y) \u2212 \u03ben\n\n\u03ben\n\nmin\n\n\u03b8,q(\u03b7,Z),\u03be\n\n\u2200n, y \u2208 C, s.t. :\n\n\u03ben \u2265 0,\n\nwhere \u2206fn(y) = f (yn, zn) \u2212 f (y, zn) is the difference of the feature vectors; \u2206ln(y) is the loss\nfunction that measures the cost to predict y if the true label is yn; and C is a nonnegative regular-\nization parameter balancing the two components. In the objective, the variational bound is de\ufb01ned\n\n3\n\n\fmin\n\nas L(\u03b8, q(\u03b7, Z); X) = KL(q(\u03b7, Z)||p0(\u03b7, Z|\u03b1)) \u2212 Eq [log p(X|Z, \u03b2)], and the margin constraints\nare from the classi\ufb01er (3). 
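Because f(y, z) is block-sparse, the prediction rule (3) reduces to block-wise dot products once the expectations are plugged in. A minimal sketch, with \u03bb standing for the posterior mean of \u03b7 and `z_mean` for E_q[z] (names are our assumptions):

```python
import numpy as np

def predict(lam, z_mean, n_classes):
    """Prediction rule (3) with expectations plugged in: eta^T f(y, z)
    reduces to a dot product between the y-th block of lam and z."""
    K = z_mean.shape[0]
    scores = [lam[y * K:(y + 1) * K] @ z_mean for y in range(n_classes)]
    return int(np.argmax(scores))
```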
If we ignore the constraints (e.g., setting C at 0), the solution of q(\u03b7, Z)\nwill be exactly the Bayesian posterior, and the problem is equivalent to do MLE for \u03b8.\nBy absorbing the slack variables, we can rewrite the problem in an unconstrained form:\n\nL(\u03b8, q(\u03b7, Z); X) + CR(q(\u03b7, Z; X)),\n\nwhere the hinge loss is: R(q(\u03b7, Z); X) = (cid:80)N\nror of classi\ufb01er (3), that is, R(q(\u03b7, Z); X) \u2265(cid:80)\n\n(5)\nn=1 maxy\u2208C(\u2206ln(y) \u2212 Eq[\u03b7(cid:62)\u2206fn(y)]). Due to the\nconvexity of max function, it is easy to verify that the hinge loss is an upper bound of the training er-\nn \u2206ln(\u02c6yn). Furthermore, the hinge loss is a convex\nfunctional over the variational distribution because of the linearity of the expectation operator. These\nproperties render the hinge loss as a good surrogate to optimize over. Previous work has explored\nthis idea to learn discriminative topic models [34], but with a restriction on the shallow structure of\nhidden variables. Our work presents a signi\ufb01cant extension to learn deep generative models, which\npose new challenges on the learning and inference.\n\n\u03b8,q(\u03b7,Z)\n\n3.2 The Doubly Stochastic Subgradient Algorithm\nThe variational formulation of problem (5) naturally suggests that we can develop a variational\nalgorithm to address the intractability of the true posterior. We now present a new algorithm to\nsolve problem (5). Our method is a doubly stochastic generalization of the Pegasos (i.e., Primal\nEstimated sub-GrAdient SOlver for SVM) algorithm [28] for the classic SVMs with fully observed\ninput features, with the new extension of dealing with a highly nontrivial structure of latent variables.\nFirst, we make the structured mean-\ufb01eld (SMF) assumption that q(\u03b7, Z) = q(\u03b7)q\u03c6(Z). 
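The per-sample hinge loss in problem (5) can be evaluated directly once the expectation over z is replaced by a sample or by E_q[z]; the sketch below (our own code, with a 0/`cost` choice of \u2206l_n standing in for a generic loss) makes the max over labels explicit:

```python
import numpy as np

def hinge_loss(lam, z, y_true, n_classes, cost=1.0):
    """Per-sample hinge loss max_y (Delta l_n(y) - lam^T Delta f_n(y)),
    with Delta l_n(y) = 0 at y = y_true and `cost` otherwise; the term
    at y = y_true guarantees the loss is nonnegative."""
    K = z.shape[0]
    scores = np.array([lam[y * K:(y + 1) * K] @ z for y in range(n_classes)])
    margins = np.where(np.arange(n_classes) == y_true, 0.0, cost)
    return float(np.max(margins - (scores[y_true] - scores)))
```

A confidently correct prediction (true-class score exceeding every other score by at least `cost`) yields zero loss, matching the upper-bound property noted above.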
Under the assumption, we have the discriminant function as E_q[\u03b7^T \u2206f_n(y)] = E_{q(\u03b7)}[\u03b7]^T E_{q_\u03c6(z_n)}[\u2206f_n(y)]. Moreover, we can solve for the optimal solution of q(\u03b7) in some analytical form. In fact, by the calculus of variations, we can show that given the other parts the solution is q(\u03b7) \u221d p_0(\u03b7) exp(\u03b7^T \sum_{n,y} \u03c9_n^y E_{q_\u03c6}[\u2206f_n(y)]), where \u03c9 are the Lagrange multipliers (see [34] for details). If the prior is normal, p_0(\u03b7) = N(0, \u03c3^2 I), we have the normal posterior q(\u03b7) = N(\u03bb, \u03c3^2 I), where \u03bb = \u03c3^2 \sum_{n,y} \u03c9_n^y E_{q_\u03c6}[\u2206f_n(y)]. Therefore, even though we did not make a parametric-form assumption on q(\u03b7), the above results show that the optimal posterior distribution of \u03b7 is Gaussian. Since we only use the expectation in the optimization problem and in prediction, we can directly solve for the mean parameter \u03bb instead of q(\u03b7). Further, in this case we can verify that KL(q(\u03b7)||p_0(\u03b7)) = ||\u03bb||^2 / (2\u03c3^2), and then the equivalent objective function in terms of \u03bb can be written as:\n\nmin_{\u03b8,\u03c6,\u03bb} L(\u03b8, \u03c6; X) + ||\u03bb||^2 / (2\u03c3^2) + C R(\u03bb, \u03c6; X),   (6)\n\nwhere R(\u03bb, \u03c6; X) = \sum_{n=1}^N \u2113(\u03bb, \u03c6; x_n) is the total hinge loss, and the per-sample hinge loss is \u2113(\u03bb, \u03c6; x_n) = max_{y \u2208 C}(\u2206l_n(y) \u2212 \u03bb^T E_{q_\u03c6}[\u2206f_n(y)]). Below, we present a doubly stochastic subgradient descent algorithm to solve this problem.\nThe first stochasticity arises from a stochastic estimate of the objective by random mini-batches. Specifically, batch learning needs to scan the full dataset to compute subgradients, which is often too expensive for large-scale datasets. One effective technique is to do stochastic subgradient descent [28], where at each iteration we randomly draw a mini-batch of the training data and then do the variational updates over the small mini-batch. Formally, given a mini-batch of size m, we get an unbiased estimate of the objective:\n\n\u02dcL_m := (N/m) \sum_{n=1}^m L(\u03b8, \u03c6; x_n) + ||\u03bb||^2 / (2\u03c3^2) + (NC/m) \sum_{n=1}^m \u2113(\u03bb, \u03c6; x_n).\n\nThe second stochasticity arises from a stochastic estimate of the per-sample variational bound and its subgradient, whose intractability calls for another Monte Carlo estimator. Formally, let z_n^l \u223c q_\u03c6(z|x_n, y_n), l = 1, . . . , L, be a set of samples from the variational distribution, where we explicitly put the conditions. Then estimates of the per-sample variational bound and the per-sample hinge loss are:\n\n\u02dcL(\u03b8, \u03c6; x_n) = (1/L) \sum_l (log p(x_n, z_n^l|\u03b2) \u2212 log q_\u03c6(z_n^l));   \u02dc\u2113(\u03bb, \u03c6; x_n) = max_y (\u2206l_n(y) \u2212 (1/L) \sum_l \u03bb^T \u2206f_n(y, z_n^l)),\n\nwhere \u2206f_n(y, z_n^l) = f(y_n, z_n^l) \u2212 f(y, z_n^l). Note that \u02dcL is an unbiased estimate of L, while \u02dc\u2113 is a biased estimate of \u2113. Nevertheless, we can still show that \u02dc\u2113 is an upper-bound estimate of \u2113 under expectation. Furthermore, this biasedness does not affect our estimate of the gradient. In fact, by using the equality \u2207_\u03c6 q_\u03c6(z) = q_\u03c6(z) \u2207_\u03c6 log q_\u03c6(z), we can construct an unbiased Monte Carlo estimate of \u2207_\u03c6(L(\u03b8, \u03c6; x_n) + C \u2113(\u03bb, \u03c6; x_n)) as:\n\ng_\u03c6 = (1/L) \sum_{l=1}^L (log p(z_n^l, x_n) \u2212 log q_\u03c6(z_n^l) + C \u03bb^T \u2206f_n(\u02dcy_n, z_n^l)) \u2207_\u03c6 log q_\u03c6(z_n^l),   (7)\n\nwhere the last term roots from the hinge loss with the loss-augmented prediction \u02dcy_n = argmax_y (\u2206l_n(y) + (1/L) \sum_l \u03bb^T f(y, z_n^l)). For \u03b8 and \u03bb, the estimates of the gradient \u2207_\u03b8 L(\u03b8, \u03c6; x_n) and the subgradient \u2207_\u03bb \u2113(\u03bb, \u03c6; x_n) are easier:\n\ng_\u03b8 = (1/L) \sum_l \u2207_\u03b8 log p(x_n, z_n^l|\u03b8),   g_\u03bb = (1/L) \sum_l (f(\u02dcy_n, z_n^l) \u2212 f(y_n, z_n^l)).\n\nNotice that the sampling and the gradient \u2207_\u03c6 log q_\u03c6(z_n^l) only depend on the variational distribution, not the underlying model.\n\nAlgorithm 1 Doubly Stochastic Subgradient Algorithm\nInitialize \u03b8, \u03bb, and \u03c6\nrepeat\n  draw a random mini-batch of m data points\n  draw random samples from the noise distribution p(\u03b5)\n  compute the subgradient g = \u2207_{\u03b8,\u03bb,\u03c6} \u02dcL(\u03b8, \u03bb, \u03c6; X_m, \u03b5)\n  update the parameters (\u03b8, \u03bb, \u03c6) using the subgradient g\nuntil converged\nreturn \u03b8, \u03bb, and \u03c6\n\nThe above estimates consider the general case where the variational bound is intractable. In some cases, we can compute the KL-divergence term analytically, e.g., when the prior and the variational distribution are both Gaussian. In such cases, we only need to estimate the remaining intractable part by sampling, which often reduces the variance [12]. Similarly, we could use the expectation of the features directly in the computation of the subgradients (e.g., g_\u03b8 and g_\u03bb), if it can be computed analytically, instead of sampling, which again can lead to variance reduction.\nWith the above estimates of subgradients, we can use stochastic optimization methods such as SGD [28] and AdaM [10] to update the parameters, as outlined in Alg. 1.
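To make the doubly stochastic updates concrete, the toy sketch below runs them for the classifier mean \u03bb alone, with \u03b8 and \u03c6 frozen: the random mini-batch is the first source of stochasticity, and a small perturbation of each feature stands in for sampling z^l ~ q_\u03c6(z|x) (the second source). All names, constants, and the folded scaling of the regularizer are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def train_lambda(Z, labels, n_classes, C=1.0, sigma2=1.0,
                 epochs=20, batch=16, lr=0.1, rng=None):
    """Toy doubly stochastic subgradient descent on lam: per mini-batch,
    take the loss-augmented prediction for each point and follow the
    hinge subgradient g_lam plus the gradient of the ||lam||^2 term."""
    rng = rng or np.random.default_rng(0)
    N, K = Z.shape
    lam = np.zeros(n_classes * K)
    for _ in range(epochs):
        order = rng.permutation(N)
        for s in range(0, N, batch):
            idx = order[s:s + batch]
            g = lam / (sigma2 * N)  # regularizer gradient, rescaled for the toy
            for n in idx:
                z = Z[n] + 0.01 * rng.standard_normal(K)  # stand-in for z ~ q_phi
                scores = lam.reshape(n_classes, K) @ z
                margins = (np.arange(n_classes) != labels[n]).astype(float)
                y_t = int(np.argmax(margins + scores))  # loss-augmented prediction
                if y_t != labels[n]:
                    # subgradient f(y_t, z) - f(y_n, z), block-sparse
                    g[y_t * K:(y_t + 1) * K] += C * z / len(idx)
                    g[labels[n] * K:(labels[n] + 1) * K] -= C * z / len(idx)
            lam -= lr * g
    return lam
```

On a linearly separable toy set the learned \u03bb recovers the block-wise classifier; in the full model the same loop would also update \u03b8 and \u03c6 with g_\u03b8 and g_\u03c6.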
Overall, our algorithm is\na doubly stochastic generalization of Pegasos to deal with the highly nontrivial latent variables.\nNow, the remaining question is how to de\ufb01ne an appropriate variational distribution q\u03c6(z) to obtain\na robust estimate of the subgradients as well as the objective. Two types of methods have been devel-\noped for unsupervised DGMs, namely, variance reduction [21] and auto-encoding variational Bayes\n(AVB) [12]. Though both methods can be used for our models, we focus on the AVB approach. For\ncontinuous variables Z, under certain mild conditions we can reparameterize the variational distri-\nbution q\u03c6(z) using some simple variables \u0001. Speci\ufb01cally, we can draw samples \u0001 from some simple\ndistribution p(\u0001) and do the transformation z = g\u03c6(\u0001, x, y) to get the sample of the distribution\nq(z|x, y). We refer the readers to [12] for more details. In our experiments, we consider the special\nGaussian case, where we assume that the variational distribution is a multivariate Gaussian with a\ndiagonal covariance matrix:\n\nq\u03c6(z|x, y) = N (\u00b5(x, y; \u03c6), \u03c32(x, y; \u03c6)),\n\n(8)\nwhose mean and variance are functions of the input data. This de\ufb01nes our recognition model. Then,\nthe reparameterization trick is as follows: we \ufb01rst draw standard normal variables \u0001l \u223c N (0, I) and\nn = \u00b5(xn, yn; \u03c6) + \u03c3(xn, yn; \u03c6) (cid:12) \u0001l to get a sample. For simplicity,\nthen do the transformation zl\nwe assume that both the mean and variance are function of x only. However, it is worth to emphasize\nthat although the recognition model is unsupervised, the parameters \u03c6 are learned in a supervised\nmanner because the subgradient (7) depends on the hinge loss. Further details of the experimental\nsettings are presented in Sec. 
4.1.\n\n4 Experiments\nWe now present experimental results on the widely adopted MNIST [14] and SVHN [22] datasets.\nThough mmDGMs are applicable to any DGMs that de\ufb01ne a joint distribution of X and Z, we\n\n5\n\n\fconcentrate on the Variational Auto-encoder (VA) [12], which is unsupervised. We denote our\nmmDGM with VA by MMVA. In our experiments, we consider two types of recognition models:\nmultiple layer perceptrons (MLPs) and convolutional neural networks (CNNs). We implement all\nexperiments based on Theano [2]. 1\n\n4.1 Architectures and Settings\nIn the MLP case, we follow the settings in [11] to compare both generative and discriminative\ncapacity of VA and MMVA. In the CNN case, we use standard convolutional nets [14] with convo-\nlution and max-pooling operation as the recognition model to obtain more competitive classi\ufb01cation\nresults. For the generative model, we use unconvnets [6] with a \u201csymmetric\u201d structure as the recog-\nnition model, to reconstruct the input images approximately. More speci\ufb01cally, the top-down gen-\nerative model has the same structure as the bottom-up recognition model but replacing max-pooling\nwith unpooling operation [6] and applies unpooling, convolution and recti\ufb01cation in order. The total\nnumber of parameters in the convolutional network is comparable with previous work [8, 17, 15].\nFor simplicity, we do not involve mlpconv layers [17, 15] and contrast normalization layers in our\nrecognition model, but they are not exclusive to our model. We illustrate details of the network\narchitectures in appendix A.\nIn both settings, the mean and variance of the latent z are transformed from the last layer of the\nrecognition model through a linear operation. It should be noticed that we could use not only the\nexpectation of z but also the activation of any layer in the recognition model as features. 
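The linear read-out of the mean and variance from the last recognition-model layer, followed by the reparameterized draw of z as in Eq. (8), can be sketched as follows (the weight matrices and names are assumed placeholders, not the paper's architecture):

```python
import numpy as np

def recognition_head(h, W_mu, b_mu, W_ls, b_ls, rng=None):
    """Map the last-layer activation h of the recognition network to the
    mean and log-variance of q_phi(z|x) via linear operations, then draw
    z with the reparameterization z = mu + sigma * eps."""
    rng = rng or np.random.default_rng(0)
    mu = W_mu @ h + b_mu
    log_var = W_ls @ h + b_ls
    eps = rng.standard_normal(mu.shape)
    return mu, log_var, mu + np.exp(0.5 * log_var) * eps
```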
The only\ntheoretical difference is from where we add a hinge loss regularization to the gradient and back-\npropagate it to previous layers. In all of the experiments, the mean of z has the same nonlinearity\nbut typically much lower dimension than the activation of the last layer in the recognition model,\nand hence often leads to a worse performance. In the MLP case, we concatenate the activations of\n2 layers as the features used in the supervised tasks. In the CNN case, we use the activations of the\nlast layer as the features. We use AdaM [10] to optimize parameters in all of the models. Although it\nis an adaptive gradient-based optimization method, we decay the global learning rate by factor three\nperiodically after suf\ufb01cient number of epochs to ensure a stable convergence.\nWe denote our mmDGM with MLPs by MMVA. To perform classi\ufb01cation using VA, we \ufb01rst learn\nthe feature representations by VA, and then build a linear SVM classi\ufb01er on these features using the\nPegasos stochastic subgradient algorithm [28]. This baseline will be denoted by VA+Pegasos. The\ncorresponding models with CNNs are denoted by CMMVA and CVA+Pegasos respectively.\n\n4.2 Results on the MNIST dataset\nWe present both the prediction performance and the results on generating samples of MMVA and\nVA+Pegasos with both kinds of recognition models on the MNIST [14] dataset, which consists of\nimages of 10 different classes (0 to 9) of size 28\u00d728 with 50,000 training samples, 10,000 validating\nsamples and 10,000 testing samples.\nTable 1: Error rates (%) on MNIST dataset.\n4.2.1 Predictive Performance\nIn the MLP case, we only use 50,000 train-\ning data, and the parameters for classi\ufb01cation are\noptimized according to the validation set. 
We choose C = 15 for MMVA and initialize it with an unsupervised pre-training procedure for classification. The first three rows in Table 1 compare VA+Pegasos, VA+Class-conditionVA and MMVA, where VA+Class-conditionVA refers to the best fully supervised model in [11]. Our model outperforms the baseline significantly. We further use the t-SNE algorithm [19] to embed the features learned by VA and MMVA on a 2D plane, which again demonstrates the stronger discriminative ability of MMVA (see Appendix B for details).\n\nTable 1: Error rates (%) on the MNIST dataset.\nMODEL | ERROR RATE\nVA+Pegasos | 1.04\nVA+Class-conditionVA | 0.96\nMMVA | 0.90\nCVA+Pegasos | 1.35\nCMMVA | 0.45\nStochastic Pooling [33] | 0.47\nNetwork in Network [17] | 0.47\nMaxout Network [8] | 0.45\nDSN [15] | 0.39\n\nIn the CNN case, we use 60,000 training data. Table 2 shows the effect of C on the classification error rate and the variational lower bound. Typically, as C gets larger, CMMVA learns more discriminative features but gives a worse estimate of the data likelihood. However, if C is too small, the supervision is not enough to lead to predictive features. Nevertheless, C = 10^3 is quite a good trade-off between the classification performance and the generative performance, and this is the default setting of CMMVA on MNIST throughout this paper. In this setting, the classification performance of our CMMVA model is comparable to the recent state-of-the-art fully discriminative networks (without data augmentation), shown in the last four rows of Table 1.\n\nFigure 1: (a-b): randomly generated images by VA and MMVA, 3000 epochs; (c-d): randomly generated images by CVA and CMMVA, 600 epochs.\n\n1 The source code is available at https://github.com/zhenxuan00/mmdgm.\n\n4.2.2 Generative Performance\nWe further investigate the generative capability of MMVA on generating samples. Fig.
1 illustrates the images randomly sampled from the VA and MMVA models, where we output the expectation of the gray value at each pixel to get a smooth visualization. We do not pre-train our model in any of these settings when generating data, which shows that MMVA (CMMVA) retains the generative capability of DGMs.\n\nTable 2: Effects of C on the MNIST dataset with a CNN recognition model.\nC | ERROR RATE (%) | LOWER BOUND\n0 | 1.35 | \u221293.17\n1 | 1.86 | \u221295.86\n10 | 0.88 | \u221295.90\n10^2 | 0.54 | \u221296.35\n10^3 | 0.45 | \u221299.62\n10^4 | 0.43 | \u2212112.12\n\n4.3 Results on the SVHN (Street View House Numbers) dataset\nSVHN [22] is a large dataset consisting of color images of size 32 \u00d7 32. The task is to recognize the center digits in natural scene images, which is significantly harder than the classification of hand-written digits. We follow the work [27, 8] to split the dataset into 598,388 training data, 6,000 validating data and 26,032 testing data, and preprocess the data by Local Contrast Normalization (LCN).\nWe only consider the CNN recognition model here. The network structure is similar to that in MNIST. We set C = 10^4 for our CMMVA model on SVHN by default.\nTable 3 shows the predictive performance. In this more challenging problem, we observe a larger improvement by CMMVA as compared to CVA+Pegasos, suggesting that DGMs benefit a lot from max-margin learning on image classification. We also compare CMMVA with state-of-the-art results. To the best of our knowledge, there is no competitive generative model for classifying digits on the SVHN dataset with full labels.\n\nTable 3: Error rates (%) on the SVHN dataset.\nMODEL | ERROR RATE\nCVA+Pegasos | 25.3\nCMMVA | 3.09\nCNN [27] | 4.9\nStochastic Pooling [33] | 2.80\nMaxout Network [8] | 2.47\nNetwork in Network [17] | 2.35\nDSN [15] | 1.92\n\nWe further compare the generative capability of CMMVA and CVA to examine the benefits of jointly training DGMs and max-margin classifiers. Though CVA gives a tighter lower bound of the data likelihood and reconstructs data more elaborately, it fails to learn the pattern of digits in this complex scenario and could not generate meaningful images. Visualization of random samples from CVA and CMMVA is shown in Fig. 2. In this scenario, the hinge loss regularization on the recognition model is useful for generating the main objects to be classified in images.\n\nFigure 2: (a): training data after LCN preprocessing; (b): random samples from CVA; (c-d): random samples from CMMVA when C = 10^3 and C = 10^4 respectively.\n\n4.4 Missing Data Imputation and Classification\nFinally, we test all models on the task of missing data imputation. For MNIST, we consider two types of missing values [18]: (1) Rand-Drop: each pixel is missing randomly with a pre-fixed probability; and (2) Rect: a rectangle located at the center of the image is missing. Given the perturbed images, we uniformly initialize the missing values between 0 and 1, and then iteratively do the following steps: (1) use the recognition model to sample the hidden variables; (2) predict the missing values to generate images; and (3) use the refined images as the input of the next round. For SVHN, we follow the same procedure as on MNIST but initialize the missing values with Gaussian random variables, as the input distribution changes. Visualization results on MNIST and SVHN are presented in Appendix C and Appendix D, respectively.\nIntuitively, generative models with CNNs could be more powerful at learning patterns and high-level structures, while generative models with MLPs lean more towards reconstructing the pixels in detail. This conforms to the MSE results shown in Table 4: CVA and CMMVA outperform VA and MMVA with a missing rectangle, while VA and MMVA outperform CVA and CMMVA with random missing values.
Compared with the baseline, mmDGMs also make more accurate completions when large patches are missing. All of the models infer missing values for 100 iterations. We also compare the classification performance of CVA, CNN and CMMVA with Rect missing values in the testing procedure in Appendix E; CMMVA outperforms both CVA and CNN.
Overall, mmDGMs have comparable capability of inferring missing values and prefer to learn high-level patterns instead of local details.

Table 4: MSE on MNIST data with missing values in the testing procedure.
NOISE TYPE          VA      MMVA    CVA     CMMVA
RAND-DROP (0.2)   0.0109  0.0110  0.0111  0.0147
RAND-DROP (0.4)   0.0127  0.0127  0.0127  0.0161
RAND-DROP (0.6)   0.0168  0.0165  0.0175  0.0203
RAND-DROP (0.8)   0.0379  0.0358  0.0453  0.0449
RECT (6 × 6)      0.0637  0.0645  0.0585  0.0597
RECT (8 × 8)      0.0850  0.0841  0.0754  0.0724
RECT (10 × 10)    0.1100  0.1079  0.0978  0.0884
RECT (12 × 12)    0.1450  0.1342  0.1299  0.1090

5 Conclusions
We propose max-margin deep generative models (mmDGMs), which conjoin the predictive power of the max-margin principle and the generative ability of deep generative models. We develop a doubly stochastic subgradient algorithm to learn all parameters jointly and consider two types of recognition models, with MLPs and CNNs respectively. In both cases, we present extensive results to demonstrate that mmDGMs can significantly improve the prediction performance of deep generative models, while retaining the strong generative ability to generate input samples as well as complete missing values. In fact, by employing CNNs in both recognition and generative models, we achieve low error rates on the MNIST and SVHN datasets, which are competitive with the state-of-the-art fully discriminative networks.

Acknowledgments
The work was supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), National NSF of China (Nos.
61322308, 61332007), Tsinghua TNList Lab Big Data Initiative, and Tsinghua Initiative Scientific Research Program (Nos. 20121088071, 20141080934).

References
[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.
[2] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2012.
[3] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In ICML, 2014.
[4] N. Chen, J. Zhu, F. Sun, and E. P. Xing. Large-margin predictive latent subspace learning for multi-view data analysis. IEEE Trans. on PAMI, 34(12):2365-2378, 2012.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[6] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. arXiv:1411.5928, 2014.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[8] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[9] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. In ICML, 2014.
[10] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[11] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[13] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
[14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[15] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[16] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[18] R. J. Little and D. B. Rubin. Statistical analysis with missing data. Wiley, 1987.
[19] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579-2605, 2008.
[20] K. Miller, M. P. Kumar, B. Packer, D. Goodman, and D. Koller. Max-margin min-entropy models. In AISTATS, 2012.
[21] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
[22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[23] M. Ranzato, J. Susskind, V. Mnih, and G. E. Hinton. On deep generative models with applications to recognition. In CVPR, 2011.
[24] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[25] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
[26] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of AI Research, 4:61-76, 1996.
[27] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
[28] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 2011.
[29] Y. Tang. Deep learning using linear support vector machines.
In Challenges in Representation Learning Workshop, ICML, 2013.
[30] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[31] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[32] C. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.
[33] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
[34] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. JMLR, 13:2237-2278, 2012.
[35] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation. JMLR, 15:1073-1110, 2014.
[36] J. Zhu, N. Chen, and E. P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. JMLR, 15:1799-1847, 2014.
[37] J. Zhu, E. P. Xing, and B. Zhang. Partially observed maximum entropy discrimination Markov networks. In NIPS, 2008.