{"title": "Conditional Generative Moment-Matching Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2928, "page_last": 2936, "abstract": "Maximum mean discrepancy (MMD) has been successfully applied to learn deep generative models for characterizing a joint distribution of variables via kernel mean embedding. In this paper, we present conditional generative moment-matching networks (CGMMN), which learn a conditional distribution given some input variables based on a conditional maximum mean discrepancy (CMMD) criterion. The learning is performed by stochastic gradient descent with the gradient calculated by back-propagation. We evaluate CGMMN on a wide range of tasks, including predictive modeling, contextual generation, and Bayesian dark knowledge, which distills knowledge from a Bayesian model by learning a relatively small CGMMN student network. Our results demonstrate competitive performance in all the tasks.", "full_text": "Conditional Generative Moment-Matching Networks\n\nYong Ren, Jialian Li, Yucen Luo, Jun Zhu\u2217\n\n{renyong15, luoyc15, jl12}@mails.tsinghua.edu.cn; dcszj@tsinghua.edu.cn\n\nDept. of Comp. Sci. & Tech., TNList Lab; Center for Bio-Inspired Computing Research\n\nState Key Lab for Intell. Tech. & Systems, Tsinghua University, Beijing, China\n\nAbstract\n\nMaximum mean discrepancy (MMD) has been successfully applied to learn deep\ngenerative models for characterizing a joint distribution of variables via kernel\nmean embedding.\nIn this paper, we present conditional generative moment-\nmatching networks (CGMMN), which learn a conditional distribution given some\ninput variables based on a conditional maximum mean discrepancy (CMMD) cri-\nterion. The learning is performed by stochastic gradient descent with the gradi-\nent calculated by back-propagation. 
We evaluate CGMMN on a wide range of tasks, including predictive modeling, contextual generation, and Bayesian dark knowledge, which distills knowledge from a Bayesian model by learning a relatively small CGMMN student network. Our results demonstrate competitive performance in all the tasks.
1 Introduction
Deep generative models (DGMs) characterize the distribution of observations with a multilayered structure of hidden variables under nonlinear transformations. Among various deep learning methods, DGMs are a natural choice for tasks that require probabilistic reasoning and uncertainty estimation, such as image generation [1], multimodal learning [30], and missing data imputation. Recently, their predictive power, which was often shown to be inferior to that of pure recognition networks (e.g., deep convolutional networks), has also been significantly improved by employing discriminative max-margin learning [18].
For the arguably more challenging unsupervised learning, [5] presents the generative adversarial network (GAN), which adopts a game-theoretical min-max optimization formalism. GAN has been extended with success to various tasks [21, 1]. However, the min-max formalism is often hard to solve. The recent work [19, 3] presents generative moment matching networks (GMMN), which have a simpler objective function than GAN while retaining the advantages of deep learning. GMMN defines a generative model by drawing samples from some simple distribution (e.g., uniform) and passing them through a parametric deep network. To learn the parameters, GMMN adopts maximum mean discrepancy (MMD) [7], a moment matching criterion where kernel mean embedding techniques are used to avoid unnecessary assumptions about the distributions. Back-propagation can be used to calculate the gradient as long as the kernel function is smooth.
A GMMN network estimates the joint distribution of a set of variables.
However, we are more interested in a conditional distribution in many cases, including (1) predictive modeling: compared to a generative model that defines the joint distribution p(x, y) of input data x and response variable y, a conditional model p(y|x) is often more direct, avoids unnecessary assumptions on modeling x, and leads to better performance with fewer training examples [23, 16]; (2) contextual generation: in some cases, we are interested in generating samples based on some context, such as class labels [21], visual attributes [32] or the input information in cross-modal generation (e.g., from image to text [31] or vice versa [2]); and (3) building large networks: conditional distributions are essential building blocks of a large generative probabilistic model. One recent relevant work [1] provides a good example of stacking multiple conditional GAN networks [21] in a Laplacian pyramid structure to generate natural images.
*Corresponding author
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
In this paper, we present conditional generative moment-matching networks (CGMMN) to learn a flexible conditional distribution when some input variables are given. CGMMN largely extends the capability of GMMN to address a wide range of application problems as mentioned above, while keeping the training process simple. Specifically, CGMMN admits a simple generative process, which draws a sample from a simple distribution and then passes the sample as well as the given conditional variables through a deep network to generate a target sample. To learn the parameters, we develop conditional maximum mean discrepancy (CMMD), which measures the Hilbert-Schmidt norm (generalized Frobenius norm) between the kernel mean embedding of an empirical conditional distribution and that of our generative model.
Thanks to the simplicity of the conditional generative model, we can easily draw a set of samples to estimate the kernel mean embedding as well as the CMMD objective. Then, optimizing the objective can be efficiently implemented via back-propagation. We evaluate CGMMN in a wide range of tasks, including predictive modeling, contextual generation, and Bayesian dark knowledge [15], an interesting case of distilling dark knowledge from Bayesian models. Our results on various datasets demonstrate that CGMMN can obtain competitive performance in all these tasks.
2 Preliminary
In this section, we briefly review some preliminary knowledge, including maximum mean discrepancy (MMD) and kernel embedding of conditional distributions.
2.1 Hilbert Space Embedding
We begin by providing an overview of Hilbert space embedding, where we represent distributions by elements in a reproducing kernel Hilbert space (RKHS). An RKHS F on X with kernel k is a Hilbert space of functions f : X → R. Its inner product ⟨·,·⟩_F satisfies the reproducing property: ⟨f(·), k(x,·)⟩_F = f(x). Kernel functions are not restricted to R^d; they can also be defined on graphs, time series and structured objects [11]. We usually view φ(x) := k(x,·) as a (usually infinite-dimensional) feature map of x. The most interesting part is that we can embed a distribution by taking the expectation of its feature map:
µ_X := E_X[φ(X)] = ∫_Ω φ(X) dP(X).
If E_X[k(X, X)] < ∞, µ_X is guaranteed to be an element in the RKHS. This kind of kernel mean embedding provides another perspective on manipulating distributions whose parametric forms are not assumed, as long as we can draw samples from them.
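To make the mean-embedding idea concrete, the following is a minimal numpy sketch. The RBF kernel and its bandwidth are illustrative assumptions; by the reproducing property, evaluating the embedding at a point t reduces to averaging kernel values, µ_X(t) = (1/N) Σ_i k(x_i, t).

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def empirical_mean_embedding(samples, t, gamma=1.0):
    """Evaluate the empirical kernel mean embedding mu_X at point t:
    mu_X(t) = <mu_X, phi(t)> = (1/N) * sum_i k(x_i, t)."""
    return float(np.mean([rbf(x, t, gamma) for x in samples]))

# For a point mass at x, the embedding evaluated at x itself is k(x, x) = 1.
x = np.array([0.5, -1.0])
assert abs(empirical_mean_embedding([x], x) - 1.0) < 1e-12
```

Note that no parametric form of the distribution is assumed; the embedding is estimated purely from samples, which is exactly the property the MMD and CMMD criteria below exploit.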
This technique has been widely applied in many tasks, including feature extraction, density estimation and two-sample tests [27, 7].
2.2 Maximum Mean Discrepancy
Let X = {x_i}_{i=1}^N and Y = {y_j}_{j=1}^M be sets of samples from distributions P_X and P_Y, respectively. Maximum Mean Discrepancy (MMD), also known as the kernel two-sample test [7], is a frequentist estimator that answers the question whether P_X = P_Y based on the observed samples. The basic idea behind MMD is that if the generating distributions are identical, all the statistics are the same. Formally, MMD defines the following difference measure:
MMD[K, P_X, P_Y] := sup_{f ∈ K} (E_X[f(X)] − E_Y[f(Y)]),
where K is a class of functions. [7] found that the class of functions in a universal RKHS F is rich enough to distinguish any two distributions, and MMD can be expressed as the difference of their mean embeddings. Here, universality requires that k(·,·) is continuous and F is dense in C(X) with respect to the L∞ norm, where C(X) is the space of bounded continuous functions on X. We summarize the result in the following theorem:
Theorem 1 [7] Let K be a unit ball in a universal RKHS F, defined on the compact metric space X, with an associated continuous kernel k(·,·). When the mean embeddings µ_p, µ_q ∈ F, the MMD objective function can be expressed as MMD[K, p, q] = ‖µ_p − µ_q‖²_F. Besides, MMD[K, p, q] = 0 if and only if p = q.
In practice, an estimate of the MMD objective compares the square difference between the empirical kernel mean embeddings:
L̂²_MMD = ‖ (1/N) Σ_{i=1}^N φ(x_i) − (1/M) Σ_{j=1}^M φ(y_j) ‖²_F,
which can be easily evaluated by expanding the square and using the associated kernel k(·,·). Asymptotically, L̂²_MMD is an unbiased estimator.
2.3 Kernel Embedding of Conditional Distributions
The kernel embedding of a conditional distribution P(Y|X) is defined as µ_{Y|x} := E_{Y|x}[φ(Y)] = ∫_Ω φ(y) dP(y|x). Unlike the embedding of a single distribution, the embedding of a conditional distribution is not a single element in the RKHS, but sweeps out a family of points in the RKHS, each indexed by a fixed value of x. Formally, the embedding of a conditional distribution is represented as an operator C_{Y|X}, which satisfies the following properties:
1. µ_{Y|x} = C_{Y|X} φ(x);   2. E_{Y|x}[g(Y)|x] = ⟨g, µ_{Y|x}⟩_G,   (1)
where G is the RKHS corresponding to Y. [29] found that such an operator exists under some assumptions, using the technique of the cross-covariance operator C_{XY} : G → F:
C_{XY} := E_{XY}[φ(X) ⊗ φ(Y)] − µ_X ⊗ µ_Y,
where ⊗ is the tensor product. An interesting property is that C_{XY} can also be viewed as an element in the tensor product space G ⊗ F. The result is summarized as follows.
Theorem 2 [29] Assuming that E_{Y|X}[g(Y)|X] ∈ F, the embedding of conditional distributions C_{Y|X} defined as C_{Y|X} := C_{YX} C_{XX}^{−1} satisfies properties 1 and 2.
Given a dataset D_XY = {(x_i, y_i)}_{i=1}^N of size N drawn i.i.d. from P(X, Y), we can estimate the conditional embedding operator as Ĉ_{Y|X} = Φ(K + λI)^{−1}Υ^⊤, where Φ = (φ(y_1), ..., φ(y_N)), Υ = (φ(x_1), ..., φ(x_N)), K = Υ^⊤Υ and λ serves as a regularization parameter. The estimator is an element in the tensor product space F ⊗ G and satisfies properties 1 and 2 asymptotically. When the domain of X is finite, we can also estimate C_{XX}^{−1} and C_{YX} directly (see Appendix A.2.2 for more details).
3 Conditional Generative Moment-Matching Networks
We now present CGMMN, including a conditional maximum mean discrepancy criterion as the training objective, a deep generative architecture and a learning algorithm.
3.1 Conditional Maximum Mean Discrepancy
Given conditional distributions P_{Y|X} and P_{Z|X}, we aim to test whether they are the same in the sense that, when X = x is fixed, whether P_{Y|x} = P_{Z|x} holds or not. When the domain of X is finite, a straightforward solution is to test whether P_{Y|x} = P_{Z|x} for each x separately by using MMD. However, this is impossible when X is continuous. Even in the finite case, as the separate tests do not share statistics, we may need an extremely large amount of training data to test a different model for each single value of x. Below, we present a conditional maximum mean discrepancy criterion, which avoids the above issues.
Recall the definition of the kernel mean embedding of conditional distributions. When X = x is fixed, we have the kernel mean embedding µ_{Y|x} = C_{Y|X} φ(x). As a result, if we have C_{Y|X} = C_{Z|X}, then µ_{Y|x} = µ_{Z|x} is also satisfied for every fixed x.
By virtue of Theorem 1, P_{Y|x} = P_{Z|x} then follows, as the following theorem states.
Theorem 3 Assume that F is a universal RKHS with an associated kernel k(·,·), E_{Y|X}[g(Y)|X] ∈ F, E_{Z|X}[g(Z)|X] ∈ F and C_{Y|X}, C_{Z|X} ∈ F ⊗ G. If the embeddings of the conditional distributions satisfy C_{Y|X} = C_{Z|X}, then P_{Y|X} = P_{Z|X} in the sense that for every fixed x we have P_{Y|x} = P_{Z|x}.
The above theorem gives us a sufficient condition to guarantee that two conditional distributions are the same. We use the operators to measure the difference of two conditional distributions, and we call it conditional maximum mean discrepancy (CMMD), which is defined as follows:
L²_CMMD = ‖C_{Y|X} − C_{Z|X}‖²_{F⊗G}.   (2)
Suppose we have two sample sets D^s_XY = {(x^s_i, y^s_i)}_{i=1}^N and D^d_XY = {(x^d_i, y^d_i)}_{i=1}^M. As in MMD, in practice we compare the square difference between the empirical estimates of the conditional embedding operators:
L̂²_CMMD = ‖Ĉ^d_{Y|X} − Ĉ^s_{Y|X}‖²_{F⊗G},
where the superscripts s and d denote the two sets of samples, respectively. For notational clarity, we define K̃ = K + λI. Then, using kernel tricks, we can compute the difference purely in terms of kernel gram matrices:
L̂²_CMMD = ‖Φ_d(K_d + λI)^{−1}Υ_d^⊤ − Φ_s(K_s + λI)^{−1}Υ_s^⊤‖²_{F⊗G}
  = Tr(K_d K̃_d^{−1} L_d K̃_d^{−1}) + Tr(K_s K̃_s^{−1} L_s K̃_s^{−1}) − 2 · Tr(K_sd K̃_d^{−1} L_ds K̃_s^{−1}),
where Φ_d := (φ(y^d_1), ..., φ(y^d_M)) and Υ_d := (φ(x^d_1), ..., φ(x^d_M)) are implicitly formed feature matrices, and Φ_s and Υ_s are defined similarly for the dataset D^s_XY. K_d = Υ_d^⊤Υ_d and K_s = Υ_s^⊤Υ_s are the gram matrices for the input variables, while L_d = Φ_d^⊤Φ_d and L_s = Φ_s^⊤Φ_s are the gram matrices for the output variables. Finally, K_sd = Υ_s^⊤Υ_d and L_ds = Φ_d^⊤Φ_s are the gram matrices between the two datasets on the input and output variables, respectively.
It is worth mentioning that we have assumed the conditional mean embedding operator C_{Y|X} ∈ F ⊗ G so that the CMMD objective is well-defined, which needs some smoothness assumptions such that C_{XX}^{−3/2} C_{XY} is Hilbert-Schmidt [8]. In practice, the assumptions may not hold; however, the empirical estimator Φ(K + λI)^{−1}Υ^⊤ is always an element in the tensor product space, which gives a well-justified approximation (i.e., the Hilbert-Schmidt norm exists) for practical use [29].
Remark 1 Taking a close look at the objectives of MMD and CMMD, we can find some interesting connections. Suppose N = M. Omitting the constant scalar, the objective function of MMD can be rewritten as
L̂²_MMD = Tr(L_d · 1) + Tr(L_s · 1) − 2 · Tr(L_ds · 1),
where 1 is the matrix with all entries equal to 1. The objective function of CMMD can be expressed as
L̂²_CMMD = Tr(L_d · C_1) + Tr(L_s · C_2) − 2 · Tr(L_ds · C_3),
where C_1, C_2, C_3 are some matrices based on the conditional variables x in both datasets. The difference is that instead of putting uniform weights on the gram matrices as in MMD, CMMD applies non-uniform weights, reflecting the influence of the conditional variables. Similar observations have been shown in [29] for the conditional mean operator, where the estimated conditional embedding µ_{Y|x} is a non-uniform weighted combination of φ(x_i).
3.2 CGMMN Nets
We now present a conditional DGM and train it by the CMMD criterion.
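As a concrete illustration, the empirical CMMD above can be computed directly from the four gram matrices. The following is a minimal numpy sketch of the trace expansion; the RBF kernel, bandwidth and regularization value are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def rbf_gram(A, B, gamma=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = (A**2).sum(1)[:, None] - 2.0 * A @ B.T + (B**2).sum(1)[None, :]
    return np.exp(-gamma * d2)

def cmmd2(Xd, Yd, Xs, Ys, lam=0.1, gamma=1.0):
    """Empirical squared CMMD between the conditional embedding operators
    estimated from (Xd, Yd) and (Xs, Ys):
      Tr(Kd Kd~^-1 Ld Kd~^-1) + Tr(Ks Ks~^-1 Ls Ks~^-1)
        - 2 Tr(Ksd Kd~^-1 Lds Ks~^-1),   with K~ = K + lam * I."""
    Kd, Ks = rbf_gram(Xd, Xd, gamma), rbf_gram(Xs, Xs, gamma)
    Ld, Ls = rbf_gram(Yd, Yd, gamma), rbf_gram(Ys, Ys, gamma)
    Ksd, Lds = rbf_gram(Xs, Xd, gamma), rbf_gram(Yd, Ys, gamma)
    Kd_inv = np.linalg.inv(Kd + lam * np.eye(len(Xd)))
    Ks_inv = np.linalg.inv(Ks + lam * np.eye(len(Xs)))
    return (np.trace(Kd @ Kd_inv @ Ld @ Kd_inv)
            + np.trace(Ks @ Ks_inv @ Ls @ Ks_inv)
            - 2.0 * np.trace(Ksd @ Kd_inv @ Lds @ Ks_inv))

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
Y = np.sin(X)                       # a fixed conditional relation y = sin(x)
# identical conditional samples give (numerically) zero CMMD
assert abs(cmmd2(X, Y, X, Y)) < 1e-8
```

Note how only the gram matrices over the output variables are touched by the model; the inverse matrices over the conditional variables act as the non-uniform weights discussed in Remark 1.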
One desirable property of the DGM is that we can easily draw samples from it to estimate the CMMD objective. Below, we present such a network that takes both the given conditional variables and an extra set of random variables as inputs, and then passes them through a deep neural network with nonlinear transformations to produce samples of the target variables.
Specifically, our network is built on the fact that for any distribution P on sample space K and any continuous distribution Q on L that are regular enough, there is a function G : L → K such that G(x) ∼ P, where x ∼ Q [12]. This fact has recently been explored by [3, 19] to define a deep generative model and estimate the parameters by the MMD criterion. For a conditional model, we would like the function G to depend on the given values of the input variables. This can be fulfilled via a process as illustrated in Fig. 1, where the inputs of a deep neural network (DNN) consist of two parts: the input variables x and an extra set of stochastic variables H ∈ R^d that follow some continuous distribution. For simplicity, we put a uniform prior on each hidden unit, p(h) = ∏_{i=1}^d U(h_i), where U(h) = I(0 ≤ h ≤ 1) is a uniform distribution on [0, 1] and I(·) is the indicator function that equals 1 if the predicate holds and 0 otherwise. After passing both x and h through the DNN, we get a sample from the conditional distribution P(Y|x): y = f(x, h|w), where f denotes the deterministic mapping function represented by the network with parameters w. By default, we concatenate x and h and feed x̃ = (x, h) into the network. In this case, we have y = f(x̃|w).
Due to the flexibility and rich capability of DNNs in fitting nonlinear functions, this generative process can characterize various conditional distributions well.
For example, a simple network can consist of multi-layer perceptrons (MLPs) activated by some nonlinear function such as the rectified linear unit (ReLU) [22]. Of course, the hidden layers are not restricted to MLPs, as long as they support gradient propagation; we also use convolutional neural networks (CNNs) as hidden layers [25] in our experiments. It is worth mentioning that there exist other ways to combine the conditional variables x with the auxiliary variables H. For example, we can add a corruption noise to the conditional variables x to produce the input of the network, e.g., define x̃ = x + h, where h may follow a Gaussian distribution N(0, ηI) in this case.
Figure 1: An example architecture of CGMMN networks.
With the above generative process, we can train the network by optimizing the CMMD objective with proper regularization. Specifically, let D^s_XY = {(x^s_i, y^s_i)}_{i=1}^N denote the given training dataset. To estimate the CMMD objective, we draw a set of samples from the above generative model, where the conditional variables can be set by sampling from the training set with or without small perturbation (more details are in the experimental section). Thanks to its simplicity, the sampling procedure can be easily performed. Precisely, we provide each x in the training dataset to the generator to get a new sample, and we denote D^d_XY = {(x^d_i, y^d_i)}_{i=1}^N as the generated samples. Then, we can optimize the CMMD objective in Eq. (2) by gradient descent.
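A minimal forward pass of this generator might look as follows. All layer sizes are hypothetical, and a one-hidden-layer ReLU MLP stands in for the DNN; the point is only the mechanics of sampling h ~ U[0,1]^d, concatenating x̃ = (x, h), and applying the deterministic map y = f(x̃|w).

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_X, DIM_H, DIM_HIDDEN, DIM_Y = 4, 5, 16, 3   # hypothetical sizes
W1 = 0.1 * rng.normal(size=(DIM_X + DIM_H, DIM_HIDDEN))
W2 = 0.1 * rng.normal(size=(DIM_HIDDEN, DIM_Y))

def generate(x, n_samples=1):
    """Draw samples y ~ P(Y|x): sample h ~ U[0,1]^DIM_H, concatenate
    x_tilde = (x, h), and apply the deterministic map y = f(x_tilde | w)."""
    xs = np.tile(x, (n_samples, 1))
    h = rng.uniform(size=(n_samples, DIM_H))          # the stochastic input H
    x_tilde = np.concatenate([xs, h], axis=1)         # x_tilde = (x, h)
    hidden = np.maximum(x_tilde @ W1, 0.0)            # ReLU hidden layer
    return hidden @ W2

ys = generate(np.zeros(DIM_X), n_samples=10)
assert ys.shape == (10, DIM_Y)
```

Repeated calls with the same x give different y because only h is resampled, which is exactly what makes the conditional distribution nondegenerate.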
See more details in Appendix A.1.
Algorithm 1 Stochastic gradient descent for CGMMN
1: Input: dataset D = {(x_i, y_i)}_{i=1}^N
2: Output: learned parameters w
3: Randomly divide the training dataset D into minibatches
4: while stopping criterion not met do
5:   Draw a minibatch B from D;
6:   For each x ∈ B, generate a y, and set B′ to contain all the generated (x, y) pairs;
7:   Compute the gradient ∂L̂²_CMMD/∂w on B and B′;
8:   Update w using the gradient with a proper regularizer.
9: end while
Note that the inverse matrices K̃_d^{−1} and K̃_s^{−1} in the CMMD objective are independent of the model parameters, which implies that we are not restricted to differentiable kernels on the conditional variables x. Since the computation cost of the kernel gram matrix grows cubically with the sample size, we present a minibatch version of the algorithm in Alg. 1; some discussion can be found in Appendix A.2.1.
4 Experiments
We now present a diverse range of applications to evaluate our model, including predictive modeling, contextual generation and an interesting case of Bayesian dark knowledge [15]. Our results demonstrate that CGMMN is competitive in all the tasks.
4.1 Predictive Performance
4.1.1 Results on MNIST dataset
We first present the prediction performance on the widely used MNIST dataset, which consists of images in 10 classes. Each image is of size 28 × 28 and the gray-scale is normalized to be in the range [0, 1]. The whole dataset is divided into 3 parts with 50,000 training examples, 10,000 validation examples and 10,000 testing examples.
For the prediction task, the conditional variables are the images x ∈ [0, 1]^{28×28}, and the generated sample is a class label, which is represented as a vector y ∈ R^10_+ where each y_i denotes the confidence that x is in class i. We consider two types of architectures in CGMMN: MLP and CNN.
Table 1: Error rates (%) on MNIST dataset
Model                        Error Rate
VA+Pegasos [18]              1.04
MMVA [18]                    0.90
CGMMN                        0.97
CVA+Pegasos [18]             1.35
CGMMN-CNN                    0.47
Stochastic Pooling [33]      0.47
Network in Network [20]      0.47
Maxout Network [6]           0.45
CMMVA [18]                   0.45
DSN [17]                     0.39
We compare our model, denoted as CGMMN in the MLP case and CGMMN-CNN in the CNN case, with the Variational Auto-encoder (VA) [14], which is an unsupervised DGM learned by stochastic variational methods. To use VA for classification, a subsequent classifier is built: we first learn feature representations by VA and then learn a linear SVM on these features using the Pegasos algorithm [26]. We also compare with max-margin DGMs (denoted as MMVA with MLPs as hidden layers and CMMVA in the CNN case) [18], which are state-of-the-art DGMs for prediction, and several other strong baselines, including Stochastic Pooling [33], Network in Network [20], Maxout Network [6] and Deeply-Supervised Nets (DSN) [17].
In the MLP case, the model architecture is shown in Fig. 1 with a uniform distribution for hidden variables of dimension 5. Note that since we do not need much randomness for the prediction task, this low-dimensional hidden space is sufficient; in fact, we did not observe much difference with a higher dimension (e.g., 20 or 50), which simply makes the training slower. The MLP has 3 hidden layers with hidden unit numbers (500, 200, 100) and the ReLU activation function. A minibatch size of 500 is adopted. In the CNN case, we use the same architecture as [18], where there are 32 feature maps in the first two convolutional layers and 64 feature maps in the last three hidden layers. An MLP of 500 hidden units is adopted at the end of the convolutional layers.
The ReLU activation function is used in the convolutional layers and the sigmoid function in the last layer. We do not pre-train our model, and a minibatch size of 500 is adopted as well. The total number of parameters in the network is comparable with the competitors [18, 17, 20, 6].
In both settings, we use AdaM [13] to optimize the parameters. After training, we simply draw a sample from our model conditioned on the input image and choose the index of the maximum element of y as its prediction. Table 1 shows the results. We can see that CGMMN-CNN is competitive with various state-of-the-art competitors that do not use data augmentation or multiple model voting (e.g., CMMVA). DSN benefits from using more supervision signal in every hidden layer and outperforms the other competitors.
4.1.2 Results on SVHN dataset
We then report the prediction performance on the Street View House Numbers (SVHN) dataset. SVHN is a large dataset consisting of color images of size 32 × 32 in 10 classes. The dataset consists of 598,388 training examples, 6,000 validation examples and 26,032 testing examples. The task is significantly harder than classifying hand-written digits. Following [25, 18], we preprocess the data by Local Contrast Normalization (LCN). The architecture of our network is similar to that in MNIST and we only use CNNs as middle layers here. A minibatch size of 300 is used and the other settings are the same as in the MNIST experiments.
Table 2: Error rates (%) on SVHN dataset
Model                        Error Rate
CVA+Pegasos [18]             25.3
CGMMN-CNN                    3.13
CNN [25]                     4.9
CMMVA [18]                   3.09
Stochastic Pooling [33]      2.80
Network in Network [20]      2.47
Maxout Network [6]           2.35
DSN [17]                     1.92
Table 2 shows the results. Though there is a gap between our CGMMN and some discriminative deep networks such as DSN, our results are comparable with those of CMMVA, which is the state-of-the-art DGM for prediction. CGMMN is compatible with various network architectures and we expect to get better results with more sophisticated structures.
4.2 Generative Performance
4.2.1 Results on MNIST dataset
We first test the generative performance on the widely used MNIST dataset. For the generation task, the conditional variables are the image labels. Since y takes a finite number of values, as mentioned in Sec. 2.3, we estimate C_YX and C_XX^{−1} directly and combine them as the estimate of C_{Y|X} (see Appendix A.2.2 for practical details).
The architecture is the same as before but with the positions of x and y exchanged. For the input layer, besides the label information y as conditional variables (represented by a one-hot vector of dimension 10), we further draw a sample from a uniform distribution of dimension 20, which is sufficiently large. Overall, the network is a 5-layer MLP with input dimension 30, middle layer hidden unit numbers (64, 256, 256, 512), and an output layer of dimension 28 × 28, which represents the image in pixels. A minibatch of size 200 is adopted.
Figure 2: Samples in (a) are from the MNIST dataset; (b) are generated randomly from our CGMMN network; (c) are generated randomly from CGMMN conditioned on label y = 0. Both (b) and (c) are generated after running 500 epochs.
Figure 3: CGMMN samples and their nearest neighbours in the MNIST dataset. The first row is our generated samples.
Fig. 2 shows some samples generated using our CGMMN, where in (b) the conditional variable y is randomly chosen from the 10 possible values, and in (c) y is pre-fixed at class 0. As we can see, when conditioned on label 0, almost all the generated samples are really in that class.
As in [19], we investigate whether the models learn to merely copy the data.
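A pixel-space copy check of this kind can be sketched as a nearest-neighbour lookup under Euclidean distance; a minimal numpy version (with tiny made-up vectors standing in for flattened images) is:

```python
import numpy as np

def nearest_neighbors(generated, train):
    """Index of each generated sample's nearest training image under
    Euclidean pixel-wise distance (rows are flattened images)."""
    # pairwise squared distances, shape (n_generated, n_train)
    d2 = ((generated[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
gen = np.array([[0.9, 1.1], [0.1, -0.1]])
assert nearest_neighbors(gen, train).tolist() == [1, 0]
```

If the model were merely memorizing, the retrieved neighbours would be near-identical to the generated samples; visual inspection of these pairs is what Fig. 3 reports.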
We visualize the nearest neighbors in the MNIST dataset of several samples generated by CGMMN, in terms of Euclidean pixel-wise distance [5], in Fig. 3. As we can see, by this metric, the samples are not merely copies.
As also discussed in [19], real-world data can be complicated and high-dimensional, and autoencoders can be good at representing data in a code space that captures enough statistical information to reliably reconstruct the data. For example, visual data, while represented in a high dimension, often exist on a low-dimensional manifold. Thus it is beneficial to combine autoencoders with our CGMMN models to generate smoother images, in contrast to Fig. 2, where there is some noise in the generated samples. Precisely, we first learn an autoencoder and produce code representations of the training data, then freeze the autoencoder weights and learn a CGMMN to minimize the CMMD objective between the codes generated by our CGMMN and the training data codes. The generation results are shown in Fig. 4. Compared to Fig. 2, the samples are clearer.
4.2.2 Results on Yale Face dataset
We now show the generation results on the Extended Yale Face dataset [9], which contains 2,414 grayscale images of 38 individuals, each of dimension 32 × 32. There are about 64 images per subject, one per different facial expression or configuration. A smaller version of the dataset consists of 165 images of 15 individuals, and the corresponding generation results can be found in Appendix A.4.2.
We adopt the same architecture as in the first MNIST generation experiment, which is a 5-layer MLP with an input dimension of 50 (12 hidden variables and 38 dimensions for the conditional variables, i.e., labels) and middle layer hidden unit numbers (64, 256, 256, 512). A minibatch size of 400 is adopted. The other settings are the same as in the MNIST experiment. The overall generation results are shown in Fig.
5, where diverse images are generated for the different individuals. Again, as shown in Appendix A.4.1, the generated samples are not merely copies of the training data.
Figure 4: Samples generated by CGMMN+Autoencoder, where the architecture follows from [19].
Figure 5: CGMMN generated samples for the Extended Yale Face Dataset. Columns are conditioned on different individuals.
4.3 Distill Bayesian Models
Our final experiment is to apply CGMMN to distill knowledge from Bayesian models by learning a conditional distribution model for efficient prediction. Specifically, let θ denote the random variables. A Bayesian model first computes the posterior distribution given the training set D = {(x_i, y_i)}_{i=1}^N as p(θ|D). In the prediction stage, given a new input x, a response sample y is generated via the probability p(y|x, D) = ∫ p(y|x, θ) p(θ|D) dθ. This procedure usually involves a complicated integral and is thus time consuming. [15] show that we can learn a relatively simple student network to distill knowledge from the teacher network (i.e., the Bayesian model) and approximately represent the predictive distribution p(y|x, D) of the teacher network.
Our CGMMN provides a new solution to build such a student network for Bayesian dark knowledge. To learn CGMMN, we need two datasets to estimate the CMMD objective: one is generated by the teacher network and the other is generated by CGMMN. The former sampled dataset serves as the training dataset for our CGMMN, while the latter is generated during its training process.
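The teacher-side dataset described above can be sketched as follows. A toy Bayesian linear model stands in for the teacher, and all distributions, sizes and noise scales are illustrative assumptions: the point is only that one samples θ ~ p(θ|D), then y ~ p(y|x, θ), at inputs perturbed slightly around the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Bayesian "teacher": posterior samples of a scalar weight
# (stand-ins for draws of theta ~ p(theta | D)).
theta_samples = rng.normal(loc=2.0, scale=0.1, size=100)

def teacher_predict(x):
    """Sample y ~ p(y | x, D) by first sampling theta from the posterior,
    then sampling from the likelihood p(y | x, theta)."""
    theta = rng.choice(theta_samples)
    return theta * x + rng.normal(scale=0.05)

# Build the training set for the student CGMMN: perturb inputs slightly
# ("sample near the training data") before querying the teacher.
x_train = rng.uniform(-1, 1, size=500)
x_near = x_train + rng.normal(scale=0.01, size=500)
distill_set = [(x, teacher_predict(x)) for x in x_near]
assert len(distill_set) == 500
```

The student CGMMN is then trained on `distill_set` with the CMMD objective exactly as in Algorithm 1, with the x values acting as the conditional variables.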
For high-dimensional data, adopting the same strategy as [15], we sample "near" the training data to generate the former dataset (i.e., we perturb the inputs in the training set slightly before sending them to the teacher network to sample y).
Due to the space limitation, we test our model on a regression problem on the Boston housing dataset, which was also used in [15, 10], while deferring the other results on a synthetic dataset to Appendix A.3. The dataset consists of 506 data points, each of dimension 13. We first train a PBP model [10], which is a scalable method for posterior inference in Bayesian neural networks, as the teacher, and then distill it using our CGMMN model. We test whether the distilled model degrades the prediction performance.
We distill the PBP model [10] using an MLP network with three hidden layers and (100, 50, 50) hidden units in the middle layers. We draw N = 3,000 sample pairs {(x_i, y_i)}_{i=1}^N from the PBP network, where the x_i are the input variables that serve as conditional variables in our model. For a fair comparison, x_i is generated by adding noise to the training data, to avoid fitting the testing data directly. We evaluate the prediction performance on the original testing data by the root mean square error (RMSE). Table 3 shows the results. We can see that the distilled model does not harm the prediction performance. It is worth mentioning that we are not restricted to distilling knowledge from PBP; in fact, any Bayesian model can be distilled using CGMMN.
5 Conclusions and Discussions
We present conditional generative moment-matching networks (CGMMN), a flexible framework to represent conditional distributions. CGMMN largely extends the ability of previous DGMs based on maximum mean discrepancy (MMD) while keeping the training process simple as well, which is done by back-propagation.
Experimental results on various tasks, including predictive modeling, data generation, and Bayesian dark knowledge, demonstrate competitive performance.
Conditional modeling has been practiced as a natural step towards improving the discriminative ability of a statistical model and/or relaxing unnecessary assumptions on the conditioning variables. For deep learning models, sum-product networks (SPNs) [24] provide exact inference on DGMs and their conditional extension [4] improves the discriminative ability; the recent work [21] presents a conditional version of generative adversarial networks (GAN) [5] with wider applicability. Besides, the recently proposed conditional variational autoencoder [28] also works well on structured prediction. Our work fills the research void to significantly improve the applicability of moment-matching networks.

Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China Projects (Nos. 61620106010, 61322308, 61332007), the Youth Top-notch Talent Support Program, and the Collaborative Projects with Tencent and Intel.

Table 3: Distilling results on the Boston Housing dataset; the error is measured by RMSE.

PBP prediction    Distilled by CGMMN
2.574 ± 0.089     2.580 ± 0.093

References

[1] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS, 2015.
[2] A. Dosovitskiy, J. Springenberg, M. Tatarchenko, and T. Brox. Learning to generate chairs, tables and cars with convolutional networks. arXiv:1411.5928, 2015.
[3] G. Dziugaite, D. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. UAI, 2015.
[4] R. Gens and P. Domingos. Discriminative learning of sum-product networks. NIPS, 2012.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014.
[6] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. ICML, 2013.
[7] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. JMLR, 2008.
[8] S. Grunewalder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil. Conditional mean embeddings as regressors. ICML, 2012.
[9] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang. Face recognition using Laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(3):328–340, 2005.
[10] J. Hernandez-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. ICML, 2015.
[11] T. Hofmann, B. Scholkopf, and A. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008.
[12] O. Kallenberg. Foundations of Modern Probability. New York: Springer, 2002.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[14] D. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014.
[15] A. Korattikara, V. Rathod, K. Murphy, and M. Welling. Bayesian dark knowledge. NIPS, 2015.
[16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001.
[17] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. AISTATS, 2015.
[18] C. Li, J. Zhu, T. Shi, and B. Zhang. Max-margin deep generative models. NIPS, 2015.
[19] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. ICML, 2015.
[20] M. Lin, Q. Chen, and S. Yan. Network in network. ICLR, 2014.
[21] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784v1, 2014.
[22] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. ICML, 2010.
[23] A. Ng and M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS, 2001.
[24] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. UAI, 2011.
[25] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. ICPR, 2012.
[26] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 2011.
[27] A. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. International Conference on Algorithmic Learning Theory, 2007.
[28] K. Sohn, X. Yan, and H. Lee. Learning structured output representation using deep conditional generative models. NIPS, 2015.
[29] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. ICML, 2009.
[30] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. NIPS, 2012.
[31] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555v2, 2015.
[32] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. arXiv:1512.00570, 2015.
[33] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. ICLR, 2013.