{"title": "Implicit Mixtures of Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1145, "page_last": 1152, "abstract": "We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures three-way interactions among visible units, hidden units, and a single hidden multinomial unit that represents the cluster labels. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data.", "full_text": "Implicit Mixtures of Restricted Boltzmann Machines\n\nDepartment of Computer Science, University of Toronto\n\n10 King\u2019s College Road, Toronto, M5S 3G5 Canada\n\nVinod Nair and Geoffrey Hinton\n\n{vnair,hinton}@cs.toronto.edu\n\nAbstract\n\nWe present a mixture model whose components are Restricted Boltzmann Ma-\nchines (RBMs). This possibility has not been considered before because com-\nputing the partition function of an RBM is intractable, which appears to make\nlearning a mixture of RBMs intractable as well. Surprisingly, when formulated as\na third-order Boltzmann machine, such a mixture model can be learned tractably\nusing contrastive divergence. 
The energy function of the model captures three-\nway interactions among visible units, hidden units, and a single hidden discrete\nvariable that represents the cluster label. The distinguishing feature of this model\nis that, unlike other mixture models, the mixing proportions are not explicitly\nparameterized. Instead, they are de\ufb01ned implicitly via the energy function and\ndepend on all the parameters in the model. We present results for the MNIST and\nNORB datasets showing that the implicit mixture of RBMs learns clusters that\nre\ufb02ect the class structure in the data.\n\n1 Introduction\n\nA typical mixture model is composed of a number of separately parameterized density models each\nof which has two important properties:\n\n1. There is an ef\ufb01cient way to compute the probability density (or mass) of a datapoint under\n\neach model.\n\n2. There is an ef\ufb01cient way to change the parameters of each model so as to maximize or\n\nincrease the sum of the log probabilities it assigns to a set of datapoints.\n\nThe mixture is created by assigning a mixing proportion to each of the component models and\nit is typically \ufb01tted by using the EM algorithm that alternates between two steps. The E-step uses\nproperty 1 to compute the posterior probability that each datapoint came from each of the component\nmodels. The posterior is also called the \u201cresponsibility\u201d of each model for a datapoint. The M-step\nuses property 2 to update the parameters of each model to raise the responsibility-weighted sum of\nthe log probabilities it assigns to the datapoints. The M-step also changes the mixing proportions of\nthe component models to match the proportion of the training data that they are responsible for.\n\nRestricted Boltzmann Machines [5] model binary data-vectors using binary latent variables. 
They are considerably more powerful than mixtures of multivariate Bernoulli models1 because they allow many of the latent variables to be on simultaneously, so the number of alternative latent state vectors is exponential in the number of latent variables, rather than linear in this number as it is for a mixture of Bernoullis. An RBM with N hidden units can be viewed as a mixture of 2^N Bernoulli models, one per latent state vector, with a lot of parameter sharing between the 2^N component models and with the 2^N mixing proportions being implicitly determined by the same parameters.\n\n1A multivariate Bernoulli model consists of a set of probabilities, one per component of the binary data vector.\n\nFigure 1: (a) Schematic representation of an RBM, (b) an implicit mixture of RBMs as a third-order Boltzmann machine, (c) schematic representation of an implicit mixture.\n\nIt can also be viewed as a product of N \u201cuni-Bernoulli\u201d models (plus one Bernoulli model that is implemented by the visible biases). A uni-Bernoulli model is a mixture of a uniform and a Bernoulli. The weights of a hidden unit define the ith probability in its Bernoulli model as p_i = \u03c3(w_i), and the bias, b, of a hidden unit defines the mixing proportion of the Bernoulli in its uni-Bernoulli as \u03c3(b), where \u03c3(x) = (1 + exp(\u2212x))^{\u22121}.\n\nThe modeling power of an RBM can always be increased by increasing the number of hidden units [10] or by adding extra hidden layers [12], but for datasets that contain several distinctly different types of data, such as images of different object classes, it would be more appropriate to use a mixture of RBMs. 
The mixture could be used to model the raw data or some preprocessed representation that has already extracted features that are shared by different classes. Unfortunately, RBMs cannot easily be used as the components of mixture models because they lack property 1: it is easy to compute the unnormalized density that an RBM assigns to a datapoint, but the normalization term is exponentially expensive to compute exactly, and even approximating it is extremely time-consuming [11]. There is also no efficient way to modify the parameters of an RBM so that the log probability of the data is guaranteed to increase, but there are good approximate methods [5], so this is not the main problem. This paper describes a way of fitting a mixture of RBMs without explicitly computing the partition function of each RBM.\n\n2 The model\n\nWe start with the energy function for a Restricted Boltzmann Machine (RBM) and then modify it to define the implicit mixture of RBMs. To simplify the description, we assume that the visible and hidden variables of the RBM are binary. The formulation below can be easily adapted to other types of variables (e.g., see [13]).\n\nThe energy function for an RBM is\n\nE(v, h) = \u2212\u2211_{i,j} W^R_{ij} v_i h_j,  (1)\n\nwhere v is a vector of visible (observed) variables, h is a vector of hidden variables, and W^R is a matrix of parameters that capture pairwise interactions between the visible and hidden variables. Now consider extending this model by including a discrete variable z with K possible states, represented as a K-dimensional binary vector with 1-of-K activation. Defining the energy function in terms of three-way interactions among the components of v, h, and z gives\n\nE(v, h, z) = \u2212\u2211_{i,j,k} W^I_{ijk} v_i h_j z_k,  (2)\n\nwhere W^I is a 3D tensor of parameters. 
Each slice of this tensor along the z-dimension is a matrix that corresponds to the parameters of one of the K component RBMs. The joint distribution for the mixture model is\n\nP(v, h, z) = exp(\u2212E(v, h, z)) / Z_I,  (3)\n\nwhere\n\nZ_I = \u2211_{u,g,y} exp(\u2212E(u, g, y))  (4)\n\nis the partition function of the implicit mixture model. Re-writing the joint distribution in the usual mixture model form gives\n\nP(v) = \u2211_{h,z} P(v, h, z) = \u2211_{k=1}^{K} \u2211_h P(v, h|z_k = 1) P(z_k = 1).  (5)\n\nEquation 5 defines the implicit mixture of RBMs. P(v, h|z_k = 1) is the kth component RBM's distribution, with W^R being the kth slice of W^I. Unlike in a typical mixture model, the mixing proportion P(z_k = 1) is not a separate parameter in our model. Instead, it is implicitly defined via the energy function in equation 2. Changing the bias of the kth unit in z changes the mixing proportion of the kth RBM, but all of the weights of all the RBMs also influence it. Figure 1 gives a visual description of the implicit mixture model's structure.\n\n3 Learning\n\nGiven a set of N training cases {v_1, ..., v_N}, we want to learn the parameters of the implicit mixture model by maximizing the log likelihood L = \u2211_{n=1}^{N} log P(v_n) with respect to W^I. We use gradient-based optimization to do this. The expression for the gradient is\n\n\u2202L/\u2202W^I = N \u27e8\u2202E(v, h, z)/\u2202W^I\u27e9_{P(v,h,z)} \u2212 \u2211_{n=1}^{N} \u27e8\u2202E(v_n, h, z)/\u2202W^I\u27e9_{P(h,z|v_n)},  (6)\n\nwhere \u27e8\u00b7\u27e9_{P(\u00b7)} denotes an expectation with respect to the distribution P(\u00b7). The two expectations in equation 6 can be estimated by sample means if unbiased samples can be generated from the corresponding distributions. The conditional distribution P(h, z|v_n) is easy to sample from, but sampling the joint distribution P(v, h, z) requires prolonged Gibbs sampling and is intractable in practice. 
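For concreteness, the three-way energy function in equation 2 amounts to a single tensor contraction. The sketch below (illustrative sizes and random parameters, not the authors' code) also checks that a 1-of-K setting of z reduces the energy to the ordinary RBM energy of the selected slice of W^I:

```python
import numpy as np

# Minimal sketch of the three-way energy in equation 2, for binary v, h
# and a 1-of-K vector z.  Sizes and parameters are illustrative only.
rng = np.random.default_rng(0)
I, J, K = 6, 4, 3                      # visible units, hidden units, components
W = rng.normal(size=(I, J, K))         # stand-in for the parameter tensor W^I

def energy(v, h, z):
    """E(v, h, z) = -sum_{i,j,k} W_{ijk} v_i h_j z_k."""
    return -np.einsum('ijk,i,j,k->', W, v, h, z)

v = rng.integers(0, 2, size=I)
h = rng.integers(0, 2, size=J)
z = np.eye(K)[1]                       # z_2 = 1 selects component RBM 2

# With 1-of-K activation, the energy equals the selected slice's RBM
# energy -v^T W[:, :, k] h, matching equation 1 for that component.
assert np.isclose(energy(v, h, z), -v @ W[:, :, 1] @ h)
```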
We get around this problem by using the contrastive divergence (CD) learning algorithm [5], which has been found to be effective for training a variety of energy-based models (e.g., [8], [9], [13], [4]).\n\nSampling the conditional distributions: We now describe how to sample the conditional distributions P(h, z|v) and P(v|h, z), which are the main operations required for CD learning. The second case is easy: given z_k = 1, we select the kth component RBM of the mixture model and then sample from its conditional distribution P_k(v|h). The bipartite structure of the RBM makes this distribution factorial, so the ith visible unit is drawn independently of the other units from the Bernoulli distribution\n\nP(v_i = 1|h, z_k = 1) = 1 / (1 + exp(\u2212\u2211_j W^I_{ijk} h_j)).  (7)\n\nSampling P(h, z|v) is done in two steps. First, the K-way discrete distribution P(z|v) is computed (see below) and sampled. Then, given z_k = 1, we select the kth component RBM and sample from its conditional distribution P_k(h|v). Again, this distribution is factorial, and the jth hidden unit is drawn from the Bernoulli distribution\n\nP(h_j = 1|v, z_k = 1) = 1 / (1 + exp(\u2212\u2211_i W^I_{ijk} v_i)).  (8)\n\nTo compute P(z|v) we first note that\n\nP(z_k = 1|v) \u221d exp(\u2212F(v, z_k = 1)),  (9)\n\nwhere the free energy F(v, z_k = 1) is given by\n\nF(v, z_k = 1) = \u2212\u2211_j log(1 + exp(\u2211_i W^I_{ijk} v_i)).  (10)\n\nIf the number of possible states of z is small enough, then it is practical to compute the quantity F(v, z_k = 1) for every k by brute force, so we can compute\n\nP(z_k = 1|v) = exp(\u2212F(v, z_k = 1)) / \u2211_l exp(\u2212F(v, z_l = 1)).  (11)\n\nEquation 11 defines the responsibility of the kth component RBM for the data vector v.\n\nContrastive divergence learning: Below is a summary of the steps in CD learning for the implicit mixture model.\n\n1. 
For a training vector v+, pick a component RBM by sampling the responsibilities P(z_k = 1|v+). Let l be the index of the selected RBM.\n\n2. Sample h+ \u223c P_l(h|v+).\n\n3. Compute the outer product D+_l = v+ (h+)^T.\n\n4. Sample v\u2212 \u223c P_l(v|h+).\n\n5. Pick a component RBM by sampling the responsibilities P(z_k = 1|v\u2212). Let m be the index of the selected RBM.\n\n6. Sample h\u2212 \u223c P_m(h|v\u2212).\n\n7. Compute the outer product D\u2212_m = v\u2212 (h\u2212)^T.\n\nRepeating the above steps for a mini-batch of N_b training cases results in two sets of outer products for each component k in the mixture model: S+_k = {D+_{k1}, ..., D+_{kM}} and S\u2212_k = {D\u2212_{k1}, ..., D\u2212_{kL}}. Then the approximate likelihood gradient (averaged over the mini-batch) for the kth component RBM is\n\n\u2202L/\u2202W^I_k \u2248 (1/N_b) (\u2211_{i=1}^{M} D+_{ki} \u2212 \u2211_{j=1}^{L} D\u2212_{kj}).  (12)\n\nNote that to compute the outer products D+ and D\u2212 for a given training vector, the component RBMs are selected through two separate stochastic picks. Therefore the sets S+_k and S\u2212_k need not be of the same size, because the choice of the mixture component can be different for v+ and v\u2212.\n\nScaling free energies with a temperature parameter: In practice, the above learning algorithm causes all the training cases to be captured by a single component RBM, and the other components to be left unused. This is because free energy is an unnormalized quantity that can have very different numerical scales across the RBMs. One RBM may happen to produce much smaller free energies than the rest because of random differences in the initial parameter values, and thus end up with high responsibilities for most training cases. Even if all the component RBMs are initialized to the exact same initial parameter values, the problem can still arise after a few noisy weight updates. 
The solution is to use a temperature parameter T when computing the responsibilities:\n\nP(z_k = 1|v) = exp(\u2212F(v, z_k = 1)/T) / \u2211_l exp(\u2212F(v, z_l = 1)/T).  (13)\n\nBy choosing a large enough T, we can make sure that random scale differences in the free energies do not lead to the above collapse problem. One possibility is to start with a large T and then gradually anneal it as learning progresses. In our experiments we found that using a constant T works just as well as annealing, so we keep it fixed.\n\n4 Results\n\nWe apply the implicit mixture of RBMs to two datasets, MNIST [1] and NORB [7]. MNIST is a set of handwritten digit images belonging to ten different classes (the digits 0 to 9). NORB contains stereo-pair images of 3D toy objects taken under different lighting conditions and viewpoints. There are five classes of objects in this set (human, car, plane, truck and animal). We use MNIST mainly as a sanity check, and most of our results are for the much more difficult NORB dataset.\n\nFigure 2: Features of the mixture model with five component RBMs trained on all ten classes of MNIST images.\n\nEvaluation method: Since computing the exact partition function of an RBM is intractable, it is not possible to directly evaluate the quality of our mixture model's fit to the data, e.g., by computing the log probability of a test set under the model. Recently it was shown that Annealed Importance Sampling can be used to tractably approximate the partition function of an RBM [11]. While this is an attractive option to consider in future work, for this paper we use the computationally cheaper approach of evaluating the model by using it in a classification task. 
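The responsibility computation of equations 10 and 13 can be sketched as a per-component free energy followed by a temperature-scaled softmax. The parameters below are random stand-ins, not a trained model:

```python
import numpy as np

# Sketch of equations 10 and 13: free energies for every component RBM,
# then temperature-scaled responsibilities.  W and v are random stand-ins.
rng = np.random.default_rng(0)
I, J, K = 6, 4, 3
W = rng.normal(size=(I, J, K))
v = rng.integers(0, 2, size=I)

def free_energy(v, W):
    """F(v, z_k=1) = -sum_j log(1 + exp(sum_i W_ijk v_i)), for all k at once."""
    a = np.einsum('ijk,i->jk', W, v)       # total input to each hidden unit, per k
    return -np.sum(np.log1p(np.exp(a)), axis=0)

def responsibilities(v, W, T=1.0):
    """P(z_k = 1 | v) with temperature T (equation 13); T=1 recovers eq. 11."""
    f = -free_energy(v, W) / T
    f -= f.max()                           # subtract the max for numerical stability
    p = np.exp(f)
    return p / p.sum()

p = responsibilities(v, W, T=10.0)         # a large T flattens the distribution
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```

These responsibility vectors are also what the classifier-based evaluation described above takes as input.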
Classification accuracy is then used as an indirect quantitative measure of how good the model is.\n\nA reasonable evaluation criterion for a mixture modelling algorithm is that it should be able to find clusters that are mostly 'pure' with respect to class labels. That is, the set of data vectors that a particular mixture component has high responsibilities for should have the same class label. So it should be possible to accurately predict the class label of a given data vector from the responsibilities of the different mixture components for that vector. Once a mixture model is fully trained, we evaluate it by training a classifier that takes as input the responsibilities of the mixture components for a data vector and predicts its class label. The goodness of the mixture model is measured by the test set prediction accuracy of this classifier.\n\n4.1 Results for MNIST\n\nBefore attempting to learn a good mixture model of the whole MNIST dataset, we tried two simpler modeling tasks. First, we fitted an implicit mixture of two RBMs with 100 hidden units each to an unlabelled dataset consisting of 4,000 twos and 4,000 threes. As we hoped, almost all of the twos were modelled by one RBM and almost all of the threes by the other. On 2042 held-out test cases, there were only 24 errors when an image was assigned the label of the most probable RBM. This compares very favorably with logistic regression, which needs 8000 labels in addition to the images and gives 36 errors on the test set even when using a penalty on the squared weights whose magnitude is set using a validation set. 
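The evaluation protocol described above can be sketched as follows: freeze the mixture model, then train a softmax (logistic regression) classifier on the responsibility vectors. The responsibilities and labels below are synthetic stand-ins for a trained implicit mixture, constructed so that each class "owns" two mixture components:

```python
import numpy as np

# Sketch of the evaluation protocol: the mixture model is frozen; only a
# logistic regression classifier on responsibility vectors is trained.
# R and y are synthetic stand-ins, not real model outputs.
rng = np.random.default_rng(0)
N, K, C = 400, 4, 2                          # examples, components, classes
y = rng.integers(0, C, size=N)               # class labels
comp = 2 * y + rng.integers(0, 2, size=N)    # class c owns components 2c, 2c+1
R = np.eye(K)[comp] + 0.1 * rng.random((N, K))   # responsibility-like features

# Multinomial logistic regression trained by batch gradient descent.
Wc, b = np.zeros((K, C)), np.zeros(C)
onehot = np.eye(C)[y]
for _ in range(500):
    logits = R @ Wc + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    g = (p - onehot) / N                     # gradient of mean cross-entropy
    Wc -= 1.0 * (R.T @ g)
    b -= 1.0 * g.sum(axis=0)

acc = np.mean((R @ Wc + b).argmax(axis=1) == y)
assert acc > 0.95   # pure clusters => responsibilities predict the label
```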
Logistic regression also gives a good indication of the performance that could be expected from fitting a mixture of two Gaussians with a shared covariance matrix, because logistic regression is equivalent to fitting such a mixture discriminatively.\n\nWe then tried fitting an implicit mixture model with only five component RBMs, each with 25 hidden units, to the entire training set. We purposely make the model very small so that it is possible to visually inspect the features and the responsibilities of the component RBMs and understand what each component is modelling. This is meant to qualitatively confirm that the algorithm can learn a sensible clustering of the MNIST data. (Of course, the model will have poor classification accuracy, as there are more classes than clusters, so it will merge multiple classes into a single cluster.) The features of the component RBMs are shown in figure 2 (top row). The plots in the bottom row show the fraction of training images for each of the ten classes that are hard-assigned to each component. The learning algorithm has produced a sensible mixture model in that visually similar digit classes are combined under the same mixture component. For example, ones and eights require many similar features, so they are captured with a single RBM (leftmost in fig. 2). Similarly, images of fours, sevens, and nines are all visually similar, and they are modelled together by one RBM (middle of fig. 2).\n\nWe have also trained larger models with many more mixture components. As the number of components increases, we expect the model to partition the image space more finely, with the different components specializing on various sub-classes of digits. 
If they specialize in a way that respects the class boundaries, then their responsibilities for a data vector will become a better predictor of its class label.\n\nThe component RBMs use binary units in both the visible and hidden layers. The image dimensionality is 784 (28 \u00d7 28 pixels). We have tried various settings for the number of mixture components (from 20 to 120, in steps of 20) and for a component's hidden layer size (50, 100, 200, 500). Classification accuracy increases with the number of components up to 80 components; additional components give slightly worse results. The hidden layer size is set to 100, but 200 and 500 also produce similar accuracies. Out of the 60,000 training images in MNIST, we use 50,000 to train the mixture model and the classifier, and the remaining 10,000 as a validation set for early stopping. The final models are then tested on a separate test set of 10,000 images.\n\nOnce the mixture model is trained, we train a logistic regression classifier to predict the class label from the responsibilities2. It has as many inputs as there are mixture components, and a ten-way softmax over the class labels at the output. With 80 components, there are only 80 \u00b7 10 + 10 = 810 parameters in the classifier (including the 10 output biases). In our experiments, classification accuracy is consistently and significantly higher when unnormalized responsibilities are used as the classifier input, instead of the actual posterior probabilities of the mixture components given a data vector. 
These unnormalized values have no proper probabilistic interpretation, but nevertheless they allow for better classification, so we use them in all our experiments.\n\nTable 1: MNIST test set error rates.\n\nLogistic regression classifier input | % Test error\nUnnormalized responsibilities | 3.36%\nPixels | 7.28%\n\nTable 1 shows the classification error rate of the resulting classifier on the MNIST test set. As a simple baseline comparison, we train a logistic regression classifier that predicts the class label from the raw pixels. This classifier has 784 \u00b7 10 + 10 = 7850 parameters, and yet the mixture-based classifier has less than half the error rate. The unnormalized responsibilities therefore contain a significant amount of information about the class labels of the images, which indicates that the implicit mixture model has learned clusters that mostly agree with the class boundaries, even though it is not given any class information during training.\n\n4.2 Results for NORB\n\nNORB is a much more difficult dataset than MNIST because the images are of very different classes of 3D objects (instead of 2D patterns) shown from different viewpoints and under various lighting conditions. The pixels are also no longer binary-valued, but instead span the grayscale range [0, 255], so binary units are no longer appropriate for the visible layer of the component RBMs. Gaussian visible units have previously been shown to be effective for modelling grayscale images [6], and therefore we use them here. See [6] for details about Gaussian units. As in that paper, the variance of the units is fixed to 1, and only their means are learned.\n\nLearning an RBM with Gaussian visible units can be slow, as it may require a much greater number of weight updates than an equivalent RBM with binary visible units. 
This problem becomes even worse in our case since a large number of RBMs have to be trained simultaneously. We avoid it by first training a single RBM with Gaussian visible units and binary hidden units on the raw pixel data, and then treating the activities of its hidden layer as pre-processed data to which the implicit mixture model is applied. Since the hidden layer activities of the pre-processing RBM are binary, the mixture model can now be trained efficiently with binary units in the visible layer3. Once trained, the low-level RBM acts as a fixed pre-processing step that converts the raw grayscale images into binary vectors. Its parameters are not modified further when training the mixture model. Figure 3 shows the components of the complete model.\n\n2Note that the mixture model parameters are kept fixed when training the classifier, so the learning of the mixture model is entirely unsupervised.\n\n3We actually use the real-valued probabilities of the hidden units as the data, and we also use real-valued probabilities for the reconstructions. On other tasks, the learning gives similar results using binary values sampled from these real-valued probabilities, but is slower.\n\nFigure 3: Implicit mixture model used for MNORB.\n\nA difficulty with training the implicit mixture model (or any other mixture model) on NORB is that the 'natural' clusters in the dataset correspond to the six lighting conditions instead of the five object classes. The objects themselves are small (in terms of area) relative to the background, while lighting affects the entire image. Any clustering signal provided by the object classes will be weak compared to the effect of large lighting changes. 
So we simplify the dataset slightly by normalizing the lighting variations across images. Each image is multiplied by a scalar such that all images have the same average pixel value. This significantly reduces the interference of the lighting on the mixture learning4. Finally, to speed up experiments, we subsample the images from 96 \u00d7 96 to 32 \u00d7 32 and use only one image of the stereo pair. We refer to this dataset as 'Modified NORB' or 'MNORB'. It contains 24,300 training images and an equal number of test images. From the training set, 4,300 are set aside as a validation set for early stopping.\n\nWe use 2000 binary hidden units for the preprocessing RBM, so the input dimensionality of the implicit mixture model is 2000. We have tried many different settings for the number of mixture components and the hidden layer size of the components. The best classification results are given by 100 components, each with 500 hidden units. This model has about 100 \u00b7 500 \u00b7 2000 = 10^8 parameters, and takes about 10 days to train on an Intel Xeon 3GHz processor.\n\nTable 2 shows the test set error rates for a logistic regression classifier trained on various input representations. Mixture of Factor Analyzers (MFA) [3] is similar to the implicit mixture of RBMs in that it also learns a clustering while simultaneously learning a latent representation per cluster component. But it is a directed model based on linear-Gaussian representations, and it can be learned tractably by maximizing likelihood with EM. We train MFA on the raw pixel data of MNORB. The MFA model that gives the best classification accuracy (shown in table 2) has 100 component Factor Analyzers with 100 factors each. (Note that simply making the number of learnable parameters equal is not enough to match the capacities of the different models, because RBMs use binary latent representations, while FAs use continuous representations. 
So we cannot strictly control for capacity when comparing these models.)\n\nA mixture of multivariate Bernoulli distributions (see e.g. section 9.3.3 of [2]) is similar to an implicit mixture model whose component RBMs have no hidden units and only visible biases as trainable parameters. The differences are that a Bernoulli mixture is a directed model, it has explicitly parameterized mixing proportions, and maximum likelihood learning with EM is tractable. We train this model with 100 components on the activation probabilities of the preprocessing RBM's hidden units. The classification error rate for this model is shown in table 2.\n\n4The normalization does not completely remove lighting information from the data. A logistic regression classifier can still predict the lighting label with 18% test set error when trained and tested on normalized images, compared to 8% error for unnormalized images.\n\nTable 2: MNORB test set error rates for a logistic regression classifier with different types of input representations.\n\nLogistic regression classifier input | % Test error\nUnnormalized responsibilities computed by the implicit mixture of RBMs | 14.65%\nProbabilities computed by the transformation W_ij in fig 3 (i.e. the pre-processed representation) | 16.07%\nRaw pixels | 20.60%\nUnnormalized responsibilities of an MFA model trained on the pre-processed representation in fig 3 | 22.65%\nUnnormalized responsibilities of an MFA model trained on raw pixels | 24.57%\nUnnormalized responsibilities of a Mixture of Bernoullis model trained on the pre-processed representation in fig 3 | 28.53%\n\nThese results show that the implicit mixture of RBMs has learned clusters that reflect the class structure in the data. By the classification accuracy criterion, the implicit mixture is also better than MFA. 
The results also confirm that the lack of explicitly parameterized mixing proportions does not prevent the implicit mixture model from discovering interesting cluster structure in the data.\n\n5 Conclusions\n\nWe have presented a tractable formulation of a mixture of RBMs. That such a formulation is even possible is a surprising discovery. The key insight here is that the mixture model can be cast as a third-order Boltzmann machine, provided we are willing to abandon explicitly parameterized mixing proportions. Then it can be learned tractably using contrastive divergence. As future work, it would be interesting to explore whether these ideas can be extended to modelling time-series data.\n\nReferences\n\n[1] MNIST database, http://yann.lecun.com/exdb/mnist/.\n\n[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n\n[3] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dept. of Computer Science, University of Toronto, 1996.\n\n[4] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In CVPR, pages 695\u2013702, 2004.\n\n[5] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711\u20131800, 2002.\n\n[6] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504\u2013507, 2006.\n\n[7] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, Washington, D.C., 2004.\n\n[8] S. Roth and M. J. Black. Fields of experts: A framework for learning image priors. In CVPR, pages 860\u2013867, 2005.\n\n[9] S. Roth and M. J. Black. Steerable random fields. In ICCV, 2007.\n\n[10] N. Le Roux and Y. Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, to appear.\n\n[11] R. 
Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML, Helsinki, 2008.\n\n[12] I. Sutskever and G. E. Hinton. Deep narrow sigmoid belief networks are universal approximators. Neural Computation, to appear.\n\n[13] M. Welling, M. Rosen-Zvi, and G. E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, 2005.\n", "award": [], "sourceid": 935, "authors": [{"given_name": "Vinod", "family_name": "Nair", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}