{"title": "Factoring Variations in Natural Images with Deep Gaussian Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3518, "page_last": 3526, "abstract": "Generative models can be seen as the swiss army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods, is that they often are harder to scale. As a result there are some easier, more scalable shallow methods, such as the Gaussian Mixture Model and the Student-t Mixture Model, that remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, that is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a Deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. 
In our density estimation experiments we show that deeper GMM architectures generalize better than shallower ones, with results in the same ballpark as the state of the art.", "full_text": "Factoring Variations in Natural Images with Deep Gaussian Mixture Models\n\nAäron van den Oord, Benjamin Schrauwen\n\nElectronics and Information Systems department (ELIS), Ghent University\n\n{aaron.vandenoord, benjamin.schrauwen}@ugent.be\n\nAbstract\n\nGenerative models can be seen as the Swiss Army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods is that they are often harder to scale. As a result, some easier, more scalable shallow methods, such as the Gaussian Mixture Model and the Student-t Mixture Model, remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, which is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a Deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. 
In our density estimation experiments we show that deeper GMM architectures generalize better than shallower ones, with results in the same ballpark as the state of the art.\n\n1 Introduction\n\nThere has been an increasing interest in generative models for unsupervised learning, with many applications in image processing [1, 2], natural language processing [3, 4], vision [5] and audio [6]. Generative models can be seen as the Swiss Army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their recent results in supervised learning [7]. Although not a universal recipe for success, the merits of deep learning are well-established [8]. Because of their multilayered nature, these methods provide ways to efficiently represent increasingly complex relationships as the number of layers increases. “Shallow” methods will often require a very large number of units to represent the same functions, and may therefore overfit more.\nLooking at real-valued data, one of the current problems with deep unsupervised learning methods is that they are often hard to scale to large datasets. This is especially a problem for unsupervised learning, because there is usually a lot of data available, as it does not have to be labeled (e.g. images, videos, text). As a result there are some easier, more scalable shallow methods, such as the Gaussian Mixture Model (GMM) and the Student-t Mixture Model (STM), that remain surprisingly competitive [2]. Of course, the disadvantage of these mixture models is that they have less representational power than deep models.\nIn this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model (Deep GMM). 
The Deep GMM is a straightforward but powerful generalization of Gaussian Mixture Models to multiple layers. It is constructed by stacking multiple GMM layers on top of each other, which is similar to many other Deep Learning techniques. Although for every deep GMM one could construct a shallow GMM with the same density function, it would require an exponential number of mixture components to do so.\n\n[Figure 1: (a) Gaussian, (b) GMM, (c) Deep GMM.] Visualizations of a Gaussian, GMM and Deep GMM distribution. Note that these are not graphical models. This visualization describes the connectivity of the linear transformations that make up the multimodal structure of a deep GMM. The sampling process for the deep GMM is shown in red. Every time a sample is drawn, it is first drawn from a standard normal distribution and then transformed with all the transformations on a randomly sampled path. In the example it is first transformed with A_{1,3}, then with A_{2,1} and finally with A_{3,2}. Every path results in differently correlated normal random variables. The deep GMM shown has 3 · 2 · 3 = 18 possible paths. For each square transformation matrix A_{i,j} there is a corresponding bias term b_{i,j} (not shown here).\n\nThe multilayer architecture of the Deep GMM gives rise to a specific kind of parameter tying. 
The parameterization is most interpretable in the case of images: the layers in the architecture are able to efficiently factorize the different variations that are present in natural images, such as changes in brightness, contrast and color, and even translations or rotations of the objects in the image. Because each of these variations will affect the image separately, a traditional mixture model would need an exponential number of components to model each combination of variations, whereas a Deep GMM can factor these variations and model them individually.\nThe proposed training algorithm for the Deep GMM is based on the most popular principle for training GMMs: Expectation Maximization (EM). Although stochastic gradient descent (SGD) is also a possible option, we suggest the use of EM, as it is inherently more parallelizable. As we will show later, both the Expectation and the Maximization steps can easily be distributed on multiple computation units or machines, with only limited communication between compute nodes. Although there has been a lot of effort in scaling up SGD for deep networks [9], the Deep GMM is parallelizable by design.\nThe remainder of this paper is organized as follows. We start by introducing the design of deep GMMs before explaining the EM algorithm for training them. Next, we discuss the experiments where we examine the density estimation performance of the deep GMM, as a function of the number of layers, and in comparison with other methods. 
We conclude in Section 5, where we also discuss some unsolved problems for future work.\n\n2 Stacking Gaussian Mixture layers\n\nDeep GMMs are best introduced by looking at some special cases: the multivariate normal distribution and the Gaussian Mixture Model.\nOne way to define a multivariate normal variable x is as a standard normal variable z ~ N(0, I_n) that has been transformed with a certain linear transformation: x = Az + b, so that\n\np(x) = N(x | b, AA^T).\n\nThis is visualized in Figure 1(a). The same interpretation can be applied to Gaussian Mixture Models, see Figure 1(b). A transformation is chosen from a set of (square) transformations A_i, i = 1 ... N (each having a bias term b_i) with probabilities π_i, i = 1 ... N, such that the resulting distribution becomes:\n\np(x) = Σ_{i=1}^{N} π_i N(x | b_i, A_i A_i^T).\n\nWith this in mind, it is easy to generalize GMMs in a multi-layered fashion. Instead of sampling one transformation from a set, we can sample a path of transformations in a network of k layers, see Figure 1(c). The standard normal variable z is now successively transformed with a transformation from each layer of the network. Let Γ be the set of all possible paths through the network. Each path p = (p_1, p_2, ..., p_k) ∈ Γ has a probability π_p of being sampled, with\n\nΣ_{p ∈ Γ} π_p = Σ_{p_1, p_2, ..., p_k} π_{(p_1, p_2, ..., p_k)} = 1.\n\nHere N_j is the number of components in layer j. The density function of x is:\n\np(x) = Σ_{p ∈ Γ} π_p N(x | μ_p, Ω_p Ω_p^T),    (1)\n\nwith\n\nμ_p = b_{k,p_k} + A_{k,p_k} (... (b_{2,p_2} + A_{2,p_2} b_{1,p_1}))    (2)\n\nΩ_p = Π_{j=k}^{1} A_{j,p_j}.    (3)\n\nHere A_{m,n} and b_{m,n} are the n’th transformation matrix and bias of the m’th layer. Notice that one can also factorize π_p as follows: π_{(p_1, p_2, ..., p_k)} = π_{p_1} π_{p_2} ... π_{p_k}, so that each layer has its own set of parameters associated with it. 
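As a concrete illustration of this generative process and of the density in Equations (1)-(3), here is a minimal NumPy sketch. It is not the authors' code: the layer sizes, the uniform path probabilities and all parameter values are made up for illustration, and the density sums over all paths explicitly, so it is only feasible for small networks.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, layer_sizes = 2, [3, 2, 3]  # data dimensionality and N_j per layer (illustrative)

# Random square transformations A_{j,i} and biases b_{j,i} for every layer.
A = [rng.normal(size=(N, d, d)) for N in layer_sizes]
b = [rng.normal(size=(N, d)) for N in layer_sizes]
n_paths = int(np.prod(layer_sizes))  # 3 * 2 * 3 = 18 possible paths

def sample(n):
    """Draw n samples: z ~ N(0, I_n), then transform along a randomly sampled path."""
    x = rng.normal(size=(n, d))
    for A_j, b_j in zip(A, b):
        i = rng.integers(len(A_j), size=n)  # assumed uniform pi over components
        x = np.einsum('nij,nj->ni', A_j[i], x) + b_j[i]
    return x

def gauss_logpdf(x, mu, S):
    """log N(x | mu, S) for each row of x."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(S)
    maha = np.einsum('ni,ni->n', diff @ np.linalg.inv(S), diff)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

def log_density(x):
    """log p(x) via Equation (1): log-sum over all paths of pi_p N(x | mu_p, Omega_p Omega_p^T)."""
    terms = []
    for path in product(*(range(N) for N in layer_sizes)):
        mu, Om = np.zeros(d), np.eye(d)
        for j, i in enumerate(path):  # build mu_p and Omega_p as in Equations (2)-(3)
            mu = A[j][i] @ mu + b[j][i]
            Om = A[j][i] @ Om
        terms.append(np.log(1.0 / n_paths) + gauss_logpdf(x, mu, Om @ Om.T))
    return np.logaddexp.reduce(terms, axis=0)
```

For a one-layer network this reduces to an ordinary GMM; with the three layers above it evaluates all 18 path components.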
In our experiments, however, this made very little difference in log-likelihood. This would mainly be useful for very large networks.\nThe GMM is a special case of the deep GMM having only one layer. Moreover, each deep GMM can be constructed by a GMM with Π_{j=1}^{k} N_j components, where every path in the network represents one component in the GMM. The parameters of these components are tied to each other in the way the deep GMM is defined. Because of this tying, the number of parameters to train is proportional to Σ_{j=1}^{k} N_j. Still, the density estimator is quite expressive as it can represent a large number of Gaussian mixture components. This is often the case with deep learning methods: shallow architectures can often theoretically learn the same functions, but will require a much larger number of parameters [8]. When the kind of compound functions that a deep learning method is able to model are appropriate for the type of data, their performance will often be better than that of their shallow equivalents, because of the smaller risk of overfitting.\nIn the case of images, but also for other types of data, we can imagine why this network structure might be useful. A lot of images share the same variations such as rotations, translations, brightness changes, etc. These deformations can be represented by a linear transformation in the pixel space. When learning a deep GMM, the model may pick up on these variations in the data that are shared amongst images by factoring and describing them with the transformations in the network.\nThe hypothesis of this paper is that Deep GMMs overfit less than normal GMMs as the complexity of their density functions increases, because the parameter tying of the Deep GMM will force it to learn more useful functions. Note that this is one of the reasons why other deep learning methods are so successful. 
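As a small worked example of this trade-off (the layer sizes here are made up): a three-layer deep GMM with five components per layer represents 5 · 5 · 5 = 125 Gaussian mixture components while training only 5 + 5 + 5 = 15 transformations.

```python
import math

layer_sizes = [5, 5, 5]              # N_j per layer (illustrative)
components = math.prod(layer_sizes)  # equivalent shallow GMM size: 125 components
transformations = sum(layer_sizes)   # transformation matrices A_{j,i} to train: 15
```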
The only difference is that the parameter tying in deep GMMs is more explicit and interpretable.\nA closely related method is the deep mixture of factor analyzers (DMFA) model [10], which is an extension of the Mixture of Factor Analyzers (MFA) model [11]. The DMFA model has a tree structure in which every node is a factor analyzer that inherits the low-dimensional latent factors from its parent. Training is performed layer by layer, where the dataset is hierarchically clustered and the children of each node are trained as an MFA on a different subset of the data using the MFA EM algorithm. The parent nodes are kept constant when training their children. The main difference with the proposed method is that in the Deep GMM the nodes of each layer are connected to all nodes of the layer above. The layers are trained jointly and the higher level nodes will adapt to the lower level nodes.\n\n3 Training deep GMMs with EM\n\nThe algorithm we propose for training Deep GMMs is based on Expectation Maximization (EM). The optimization is similar to that of a GMM: in the E-step we will compute the posterior probabilities γ_np that a path p was responsible for generating x_n, also called the responsibilities. In the maximization step, the parameters of the model will be optimized given those responsibilities.\n\n3.1 Expectation\n\nFrom Equation 1 we get the log-likelihood given the data:\n\nΣ_n log p(x_n) = Σ_n log [ Σ_{p ∈ Γ} π_p N(x_n | μ_p, Ω_p Ω_p^T) ].\n\nThis is the global objective for the Deep GMM to optimize. When taking the derivative with respect to a parameter θ we get:\n\n∇_θ Σ_n log p(x_n) = Σ_{n,p} ( π_p N(x_n | μ_p, Ω_p Ω_p^T) / Σ_q π_q N(x_n | μ_q, Ω_q Ω_q^T) ) [ ∇_θ log N(x_n | μ_p, Ω_p Ω_p^T) ] = Σ_{n,p} γ_np ∇_θ log N(x_n | μ_p, Ω_p Ω_p^T),\n\nwith\n\nγ_np = π_p N(x_n | μ_p, Ω_p Ω_p^T) / Σ_{q ∈ Γ} π_q N(x_n | μ_q, Ω_q Ω_q^T),\n\nthe equation for the responsibilities. 
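Concretely, the soft responsibilities, and the hard-assignment variant used further below, can be computed with a numerically stable log-sum-exp over paths. The sketch below is ours (not the authors' code) and assumes the per-path log-densities log N(x_n | μ_p, Ω_p Ω_p^T) have already been computed as a matrix:

```python
import numpy as np

def responsibilities(log_pi, log_N, hard=False):
    """gamma[n, p]: posterior probability that path p generated x_n.

    log_pi : (P,)    log prior path probabilities log pi_p
    log_N  : (n, P)  log N(x_n | mu_p, Omega_p Omega_p^T) per datapoint and path
    """
    log_joint = log_pi[None, :] + log_N
    if hard:  # hard-EM: a single 1 at the most likely path of each datapoint
        gamma = np.zeros_like(log_joint)
        gamma[np.arange(log_joint.shape[0]), log_joint.argmax(axis=1)] = 1.0
        return gamma
    # soft responsibilities: normalize in log-space for numerical stability
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return np.exp(log_joint - log_norm)
```

In both variants each row of the returned matrix sums to one.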
Although γ_np generally depends on the parameter θ, in the EM algorithm the responsibilities are assumed to remain constant when optimizing the model parameters in the M-step.\nThe E-step is very similar to that of a standard GMM, but instead of computing the responsibilities γ_nk for every component k, one needs to compute them for every path p = (p_1, p_2, ..., p_k) ∈ Γ. This is because every path represents a Gaussian mixture component in the equivalent shallow GMM. Because γ_np needs to be computed for each datapoint independently, the E-step is very easy to parallelize. Often a simple way to increase the speed of convergence and to reduce computation time is to use an EM variant with “hard” assignments. Here only one of the responsibilities of each datapoint is set to 1:\n\nγ_np = 1 if p = argmax_q π_q N(x_n | μ_q, Ω_q Ω_q^T), and 0 otherwise.    (4)\n\nHeuristic\n\nBecause the number of paths is the product of the number of components per layer (Π_{j=1}^{k} N_j), computing the responsibilities can become intractable for big Deep GMM networks. However, when using the hard-EM variant (Eq. 4), this problem reduces to finding the best path for each datapoint, for which we can use efficient heuristics. Here we introduce such a heuristic that does not hurt the performance significantly, while allowing us to train much larger networks.\nWe optimize the path p = (p_1, p_2, ..., p_k), which is a multivariate discrete variable, with a coordinate ascent algorithm. This means we change the parameters p_i layer per layer, while keeping the parameter values of the other layers constant. After we have changed all the variables one time (one pass), we can repeat.\n\n[Figure 2: (a) Iterations, (b) Reinitializations, (c) Switch rate during training.] Visualizations for the introduced E-step heuristic. (a): The average log-likelihood of the best-path search with the heuristic as a function of the number of iterations (passes) and (b): as a function of the number of repeats with a different initialization. Plot (c) shows the percentage of data points that switch to a better path found with a different initialization as a function of the number of EM iterations during training.\n\nThe heuristic described above only requires Σ_{j=1}^{k} N_j path evaluations per pass. In Figure 2 we compare the heuristic with the full search. On the left we see that after 3 passes the heuristic converges to a local optimum. In the middle we see that when repeating the heuristic algorithm a couple of times with different random initializations, and keeping the best path after each iteration, the log-likelihood converges to the optimum.\nIn our experiments we initialized the heuristic with the optimal path from the previous E-step (warm start) and performed the heuristic algorithm for 1 pass. Subsequently we ran the algorithm a second time with a random initialization for two passes, for the possibility of finding a better optimum for each datapoint. Each E-step thus required 3 (Σ_{j=1}^{k} N_j) path evaluations. In Figure 2(c) we show an example of the percentage of data points (called the switch-rate) that had a better optimum with this second initialization for each EM iteration. We can see from this Figure that the switch-rate quickly becomes very small, which means that using the responsibilities from the previous E-step is an efficient initialization for the current one. Although the number of path evaluations with the heuristic is substantially smaller than with the full search, we saw in our experiments that the performance of the resulting trained Deep GMMs was ultimately similar.\n\n3.2 Maximization\n\nIn the maximization step, the parameters are updated to maximize the log-likelihood of the data, given the responsibilities. 
Although standard optimization techniques for training deep networks can be used (such as SGD), Deep GMMs have some interesting properties that allow us to train them more efficiently. Because these properties are not obvious at first sight, we will derive the objective and gradient for the transformation matrices A_{i,j} in a Deep GMM. After that we will discuss various ways of optimizing them. For convenience, the derivations in this section are based on the hard-EM variant and omit the bias-term parameters. Equations without these simplifications can be obtained in a similar manner.\nIn the hard-EM variant, it is assumed that each datapoint in the dataset was generated by a path p, for which γ_{n,p} = 1. The likelihood of x given the parameters of the transformations on this path is\n\np(x) = |A_{1,p_1}^{-1}| ... |A_{k,p_k}^{-1}| N(A_{1,p_1}^{-1} ... A_{k,p_k}^{-1} x | 0, I_n),    (5)\n\nwhere we use |·| to denote the absolute value of the determinant. Now let’s rewrite:\n\nz = A_{i+1,p_{i+1}}^{-1} ... A_{k,p_k}^{-1} x    (6)\n\nQ = A_{i,p_i}^{-1}    (7)\n\nR_p = A_{1,p_1}^{-1} ... A_{i-1,p_{i-1}}^{-1},    (8)\n\nso that we get (omitting the constant term w.r.t. Q):\n\nlog p(x) ∝ log |Q| + log N(R_p Q z | 0, I_n).    (9)\n\n[Figure 3: diagram of N(0, I_n), the transformations R_1, ..., R_m (“folded” version of all the layers above the current layer), the current layer Q, and z.] Optimization of a transformation Q in a Deep GMM. We can rewrite all the possible paths in the above layers by “folding” them into one layer, which is convenient for deriving the objective and gradient equations of Q.\n\nFigure 3 gives a visual overview. We have “folded” the layers above the current layer into one. This means that each path p through the network above the current layer is equivalent to a transformation R_p in the folded version. The transformation matrix for which we will derive the objective and gradient is called Q. 
The average log-likelihood of all the data points that are generated by paths that pass through Q is:\n\n(1/N) Σ_i log p(x_i) ∝ log |Q| + (1/N) Σ_p Σ_{i ∈ p} log N(R_p Q z_i | 0, I)    (10)\n\n= log |Q| − (1/2) Σ_p π_p Tr[Λ_p Q^T Ω_p Q],    (11)\n\nwhere π_p = N_p / N, Λ_p = (1/N_p) Σ_{i ∈ p} z_i z_i^T and Ω_p = R_p^T R_p. For the gradient we get:\n\n(1/N) ∇_Q Σ_i log p(x_i) = Q^{−T} − Σ_p π_p Ω_p Q Λ_p.    (12)\n\nOptimization\n\nNotice how in Equation 11 the summation over the data points has been converted to a summation over covariance matrices: one for each path.^1 If the number of paths is small enough, this means we can use full gradient updates instead of mini-batched updates (e.g. SGD). The computation of the covariance matrices is fairly efficient and can be done in parallel. This formulation also allows us to use more advanced optimization methods, such as L-BFGS-B [12].\nIn the setup described above, we need to keep the transformation R_p constant while optimizing Q. This is why in each M-step the Deep GMM is optimized layer-wise from top to bottom, updating one layer at a time. It is possible to go over this process multiple times for each M-step. It is important to note that this way the optimization of Q does not depend on any other parameters in the same layer. So for each layer, the optimization of the different nodes can be done in parallel on multiple cores or machines. Moreover, nodes in the same layer do not share data points when using the EM variant with hard assignments. Another advantage is that this method is easy to control, as there are no learning rates or other optimization parameters to be tuned when using L-BFGS-B “out of the box”. A disadvantage is that one needs to sum over all possible paths above the current node in the gradient computation. 
For deeper networks, this may become problematic when optimizing the lower-level nodes.\nAlternatively, one can also evaluate (11) using Kronecker products as\n\n... = log |Q| − (1/2) vec(Q)^T ( Σ_p π_p (Ω_p ⊗ Λ_p) ) vec(Q)    (13)\n\nand Equation 12 as\n\n... = Q^{−T} − mat( ( Σ_p π_p (Ω_p ⊗ Λ_p) ) vec(Q) ).    (14)\n\nHere vec is the vectorization operator and mat its inverse. With these formulations we don’t have to loop over the number of paths anymore during the optimization. This makes the inner optimization with L-BFGS-B even faster. We only have to construct Σ_p π_p (Ω_p ⊗ Λ_p) once, which is also easy to parallelize. These equations thus allow us to train even bigger Deep GMM architectures. A disadvantage, however, is that it requires the dimensionality of the data to be small enough to efficiently construct the Kronecker products.\n\n^1 Actually we only need to sum over the number of possible transformations R_p above the node Q.\n\nWhen the aforementioned formulations are intractable because there are too many layers in the Deep GMM and the data dimensionality is too high, we can also optimize the parameters using backpropagation with a minibatch algorithm, such as Stochastic Gradient Descent (SGD). This approach works for much deeper networks, because we don’t need to sum over the number of paths. From Equation 9 we see that this is basically the same as minimizing the L2 norm of R_p Q z, with log |Q| as a regularization term. Disadvantages include the use of learning rates and other parameters such as momentum, which requires more engineering and fine-tuning.\nThe most naive way to optimize the deep GMM with SGD is by simultaneously optimizing all parameters, as is common in neural networks. 
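The equivalence between the trace form (11)-(12) and the Kronecker form (13)-(14) is easy to check numerically. The sketch below uses made-up random Ω_p and Λ_p; it takes NumPy's row-major flattening as the vec operator, which is the convention under which the (Ω_p ⊗ Λ_p) ordering holds:

```python
import numpy as np

rng = np.random.default_rng(1)
d, P = 4, 3  # data dimensionality and number of folded transformations (illustrative)
Q = rng.normal(size=(d, d))
pi = np.full(P, 1.0 / P)
Om = [r.T @ r for r in rng.normal(size=(P, d, d))]         # Omega_p = R_p^T R_p
Lam = [z @ z.T / 10 for z in rng.normal(size=(P, d, 10))]  # Lambda_p = (1/N_p) sum z z^T

# Trace form, Equations (11) and (12).
obj_tr = np.linalg.slogdet(Q)[1] - 0.5 * sum(
    w * np.trace(L @ Q.T @ O @ Q) for w, O, L in zip(pi, Om, Lam))
grad_tr = np.linalg.inv(Q).T - sum(
    w * O @ Q @ L for w, O, L in zip(pi, Om, Lam))

# Kronecker form, Equations (13) and (14): M is built once and reused in every
# inner optimization iteration, so we no longer loop over the paths.
M = sum(w * np.kron(O, L) for w, O, L in zip(pi, Om, Lam))
vecQ = Q.reshape(-1)  # row-major vec
obj_kr = np.linalg.slogdet(Q)[1] - 0.5 * vecQ @ M @ vecQ
grad_kr = np.linalg.inv(Q).T - (M @ vecQ).reshape(d, d)
```

Both pairs agree to numerical precision, which is a useful sanity check before plugging the Kronecker form into an optimizer.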
When doing this it is important that the parameters of all nodes have converged enough in each M-step, otherwise nodes that are not optimized enough may have very low responsibilities in the following E-step(s). This results in whole parts of the network becoming unused, which is the equivalent of empty clusters during GMM or k-means training. An alternative way of using SGD is again by optimizing the Deep GMM layer by layer. This has the advantage that we have more control over the optimization, which prevents the aforementioned problem of unused paths. But more importantly, we can now again parallelize over the number of nodes per layer.\n\n4 Experiments and Results\n\nFor our experiments we used the Berkeley Segmentation Dataset (BSDS300) [13], a commonly used benchmark for density modeling of image patches, and the tiny images dataset [14]. For BSDS300 we follow the same setup as Uria et al. [15], which is best practice for this dataset. 8 by 8 grayscale patches are drawn from images of the dataset. The train and test sets consist of 200 and 100 images respectively. Because each pixel is quantized, it can only contain integer values between 0 and 255. To make the integer pixel values continuous, uniform noise (between 0 and 1) is added. Afterwards, the images are divided by 256 so that the pixel values lie in the range [0, 1]. Next, the patches are preprocessed by removing the mean pixel value of every image patch. Because this reduces the implicit dimensionality of the data, the last pixel value is removed. This results in the data points having 63 dimensions. For the tiny images dataset we rescale the images to 8 by 8 and then follow the same setup. This way we also have low-resolution image data to evaluate on.\nIn all the experiments described in this section, we used the following setup for training Deep GMMs. We used the hard-EM variant, with the aforementioned heuristic in the E-step. 
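The patch preprocessing described above (dequantization, rescaling, per-patch mean removal, dropping the last pixel) can be sketched as follows; the function name and explicit rng argument are ours, not from the paper:

```python
import numpy as np

def preprocess_patches(patches, rng=None):
    """patches: (n, 64) array of integer 8x8 grayscale patches with values in 0..255."""
    rng = rng if rng is not None else np.random.default_rng()
    x = patches + rng.uniform(0.0, 1.0, size=patches.shape)  # dequantize with uniform noise
    x = x / 256.0                                            # pixel values now in [0, 1]
    x = x - x.mean(axis=1, keepdims=True)                    # remove per-patch mean
    return x[:, :-1]                                         # drop last pixel: 63 dimensions
```

Applied to an (n, 64) batch this returns an (n, 63) array of mean-removed patches.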
For each M-step we used L-BFGS-B for 1000 iterations, using equations (13) and (14) for the objective and gradient. The total number of iterations for EM was fixed to 100, although fewer iterations were usually sufficient. The only hyperparameters were the number of components for each layer, which were optimized on a validation set.\nBecause GMMs are in theory able to represent the same probability density functions as a Deep GMM, we first need to assess whether using multiple layers with a deep GMM improves performance. The results of a GMM (one layer) and Deep GMMs with two or three layers are given in Figure 4(a). As we increase the complexity and number of parameters of the model by changing the number of components in the top layer, a plateau is reached and the models ultimately start overfitting. For the deep GMMs, the number of components in the other layers was kept constant (5 components). The Deep GMMs seem to generalize better. Although they have a similar number of parameters, they are able to model more complex relationships without overfitting. We also tried this experiment on a more difficult dataset by using highly downscaled images from the tiny images dataset, see Figure 4(b).\n\n[Figure 4: (a) BSDS300, (b) Tiny Images.] Performance of the Deep GMM for different numbers of layers, and the GMM (one layer). All models were trained on the same dataset of 500,000 examples. For comparison we varied the number of components in the top layer.\n\nBecause there are fewer correlations between the pixels of a downscaled image than between those of an image patch, the average log-likelihood values are lower. Overall we can see that the Deep GMM performs well on both low- and high-resolution natural images.\nNext we compare the deep GMM with other published methods on this task. Results are shown in Table 1. 
The first method is the RNADE model, a new deep density estimation technique which is an extension of the NADE model for real-valued data [16, 15]. EoRNADE, which stands for ensemble of RNADE models, is currently the state of the art. We also report the log-likelihood results of two mixture models: the GMM and the Student-t Mixture Model, from [2]. Overall we see that the Deep GMM has a strong performance. It scores better than other single models (RNADE, STM), but not as well as the ensemble of RNADE models.\n\nModel                                | Average log-likelihood\nRNADE: 1hl, 2hl, 3hl, 4hl, 5hl, 6hl  | 143.2, 149.2, 152.0, 153.6, 154.7, 155.2\nEoRNADE (6hl)                        | 157.0\nGMM                                  | 153.7\nSTM                                  | 155.3\nDeep GMM - 3 layers                  | 156.2\n\nTable 1: Density estimation results on image patch modeling using the BSDS300 dataset. Higher log-likelihood values are better. “hl” stands for the number of hidden layers in the RNADE models.\n\n5 Conclusion\n\nIn this work we introduced the deep Gaussian Mixture Model: a novel density estimation technique for modeling real-valued data. We showed that the Deep GMM is on par with the current state of the art in image patch modeling, and surpasses other mixture models. We conclude that the Deep GMM is a viable and scalable alternative for unsupervised learning. The deep GMM tackles unsupervised learning from a different angle than other recent deep unsupervised learning techniques [17, 18, 19], which makes it very interesting for future research.\nIn follow-up work, we would like to make Deep GMMs suitable for larger images and other high-dimensional data. Locally connected filters, such as convolutions, would be useful for this. We would also like to extend our method to modeling discrete data. Deep GMMs are currently only designed for continuous real-valued data, but our approach of reparametrizing the model into layers of successive transformations can also be applied to other types of mixture distributions. 
We would also like to compare this extension to other discrete density estimators such as Restricted Boltzmann Machines, Deep Belief Networks and the NADE model [15].\n\nReferences\n\n[1] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In International Conference on Computer Vision, 2011.\n\n[2] Aäron van den Oord and Benjamin Schrauwen. The student-t mixture model as a natural image patch prior with application to image compression. Journal of Machine Learning Research, 2014.\n\n[3] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. Neural probabilistic language models. In Innovations in Machine Learning. Springer, 2006.\n\n[4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013.\n\n[5] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, 2013.\n\n[6] Razvan Pascanu, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In Proceedings of the International Conference on Learning Representations, 2013.\n\n[7] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.\n\n[8] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.\n\n[9] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. In Proceedings of the International Conference on Learning Representations, 2014.\n\n[10] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. 
Deep mixtures of factor analysers. In International Conference on Machine Learning, 2012.\n\n[11] Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, University of Toronto, 1996.\n\n[12] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 1995.\n\n[13] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the International Conference on Computer Vision. IEEE, 2001.\n\n[14] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.\n\n[15] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In Proceedings of the International Conference on Machine Learning, 2013.\n\n[16] Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.\n\n[17] Karol Gregor, Andriy Mnih, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2013.\n\n[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-propagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning, 2014.\n\n[19] Yoshua Bengio, Eric Thibodeau-Laufer, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. 
In International Conference on Machine Learning, 2013.", "award": [], "sourceid": 1850, "authors": [{"given_name": "Aaron", "family_name": "van den Oord", "institution": "Ghent University"}, {"given_name": "Benjamin", "family_name": "Schrauwen", "institution": "Ghent University"}]}