{"title": "Multi-Prediction Deep Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 548, "page_last": 556, "abstract": "We introduce the Multi-Prediction Deep Boltzmann Machine (MP-DBM). The MP-DBM can be seen as a single probabilistic model trained to maximize a variational approximation to the generalized pseudolikelihood, or as a family of recurrent nets that share parameters and approximately solve different inference problems. Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.", "full_text": "Multi-Prediction Deep Boltzmann Machines\n\nIan J. Goodfellow, Mehdi Mirza, Aaron Courville, Yoshua Bengio\n\nD\u00b4epartement d\u2019informatique et de recherche op\u00b4erationnelle\n\nUniversit\u00b4e de Montr\u00b4eal\nMontr\u00b4eal, QC H3C 3J7\n\n{goodfeli,mirzamom,courvila}@iro.umontreal.ca,\n\nYoshua.Bengio@umontreal.ca\n\nAbstract\n\nWe introduce the multi-prediction deep Boltzmann machine (MP-DBM). The MP-\nDBM can be seen as a single probabilistic model trained to maximize a variational\napproximation to the generalized pseudolikelihood, or as a family of recurrent nets\nthat share parameters and approximately solve different inference problems. 
Prior methods of training DBMs either do not perform well on classification tasks or require an initial learning pass that trains the DBM greedily, one layer at a time. The MP-DBM does not require greedy layerwise pretraining, and outperforms the standard DBM at classification, classification with missing inputs, and mean field prediction tasks.1

1 Introduction

A deep Boltzmann machine (DBM) [18] is a structured probabilistic model consisting of many layers of random variables, most of which are latent. DBMs are well established as generative models and as feature learning algorithms for classifiers.
Exact inference in a DBM is intractable. DBMs are usually used as feature learners, where the mean field expectations of the hidden units are used as input features to a separate classifier, such as an MLP or logistic regression. To some extent, this erodes the utility of the DBM as a probabilistic model: it can generate good samples, and provides good features for deterministic models, but it has not proven especially useful for solving inference problems such as predicting class labels given input features or completing missing input features.
Another drawback to the DBM is the complexity of training it. Typically it is trained in a greedy, layerwise fashion, by training a stack of RBMs. Training each RBM to model samples from the previous RBM's posterior distribution increases a variational lower bound on the likelihood of the DBM, and serves as a good way to initialize the joint model. Training the DBM from a random initialization generally does not work.
It can be difficult for practitioners to tell whether a given lower layer RBM is a good starting point to build a larger model.
We propose a new way of training deep Boltzmann machines called multi-prediction training (MPT). MPT uses the mean field equations for the DBM to induce recurrent nets that are then trained to solve different inference tasks. The resulting trained MP-DBM model can be viewed either as a single probabilistic model trained with a variational criterion, or as a family of recurrent nets that solve related inference tasks.
We find empirically that the MP-DBM does not require greedy layerwise training, so its performance on the final task can be monitored from the start. This makes it more suitable than the DBM for practitioners who do not have extensive experience with layerwise pretraining techniques or Markov chains. Anyone with experience minimizing non-convex functions should find MP-DBM training familiar and straightforward. Moreover, we show that inference in the MP-DBM is useful: the MP-DBM does not need an extra classifier built on top of its learned features to obtain good inference accuracy. We show that it outperforms the DBM at solving a variety of inference tasks including classification, classification with missing inputs, and prediction of randomly selected subsets of variables. Specifically, we use the MP-DBM to outperform the classification results reported for the standard DBM by Salakhutdinov and Hinton [18] on both the MNIST handwritten character dataset [14] and the NORB object recognition dataset [13].

1Code and hyperparameters available at http://www-etud.iro.umontreal.ca/~goodfeli/mp_dbm.html

2 Review of deep Boltzmann machines

Typically, a DBM contains a set of D input features v that are called the visible units because they are always observed during both training and evaluation.
When a class label is present the DBM\ntypically represents it with a discrete-valued label unit y. The unit y is observed (on examples for\nwhich it is available) during training, but typically is not available at test time. The DBM also\ncontains several latent variables that are never observed. These hidden units are usually organized\ninto L layers h(i) of size Ni, i \u2208 {1, . . . , L}, with each unit in a layer conditionally independent of\nthe other units in the layer given the neighboring layers.\nThe DBM is trained to maximize the mean \ufb01eld lower bound on log P (v, y). Unfortunately, training\nthe entire model simultaneously does not seem to be feasible. See [8] for an example of a DBM that\nhas failed to learn using the naive training algorithm. Salakhutdinov and Hinton [18] found that for\ntheir joint training procedure to work, the DBM must \ufb01rst be initialized by training one layer at a\ntime. After each layer is trained as an RBM, the RBMs can be modi\ufb01ed slightly, assembled into a\nDBM, and the DBM may be trained with PCD [22, 21] and mean \ufb01eld. In order to achieve good\nclassi\ufb01cation results, an MLP designed speci\ufb01cally to predict y from v must be trained on top of the\nDBM model. Simply running mean \ufb01eld inference to predict y given v in the DBM model does not\nwork nearly as well. 
See \ufb01gure 1 for a graphical description of the training procedure used by [18].\nThe standard approach to training a DBM requires training L + 2 different models using L + 2\ndifferent objective functions, and does not yield a single model that excels at answering all queries.\nOur proposed approach requires training only one model with only one objective function, and the\nresulting model outperforms previous approaches at answering many kinds of queries (classi\ufb01cation,\nclassi\ufb01cation with missing inputs, predicting arbitrary subsets of variables given the complementary\nsubset).\n\n3 Motivation\n\nThere are numerous reasons to prefer a single-model, single-training stage approach to deep Boltz-\nmann machine learning:\n\n1. Optimization As a greedy optimization procedure, layerwise training may be suboptimal.\nSmall-scale experimental work has demonstrated this to be the case for deep belief net-\nworks [1].\nIn general, for layerwise training to be optimal, the training procedure for each layer must\ntake into account the in\ufb02uence that the deeper layers will provide. The layerwise initializa-\ntion procedure simply does not attempt to be optimal.\nThe procedures used by Le Roux and Bengio [12], Arnold and Ollivier [1] make an opti-\nmistic assumption that the deeper layers will be able to implement the best possible prior\non the current layer\u2019s hidden units. This approach is not immediately applicable to Boltz-\nmann machines because it is speci\ufb01ed in terms of learning the parameters of P (h(i\u22121)|h(i))\nassuming that the parameters of the P (h(i)) will be set optimally later. 
In a DBM the symmetrical nature of the interactions between units means that these two distributions share parameters, so it is not possible to set the parameters of the one distribution, leave them fixed for the remainder of learning, and then set the parameters of the other distribution. Moreover, model architectures incorporating design features such as sparse connections, pooling, or factored multilinear interactions make it difficult to predict how best to structure one layer's hidden units in order for the next layer to make good use of them.

2. Probabilistic modeling Using multiple models and having some models specialized for exactly one task (like predicting y from v) loses some of the benefit of probabilistic modeling. If we have one model that excels at all tasks, we can use inference in this model to answer arbitrary queries, perform classification with missing inputs, and so on. The standard DBM training procedure gives this up by training a rich probabilistic model and then using it as just a feature extractor for an MLP.

3. Simplicity Needing to implement multiple models and training stages makes the cost of developing software with DBMs greater, and makes using them more cumbersome. Beyond the software engineering considerations, it can be difficult to monitor training and tell what kind of results during layerwise RBM pretraining will correspond to good DBM classification accuracy later. Our joint training procedure allows the user to monitor the model's ability of interest (usually ability to classify y given v) from the very start of training.

4 Methods

We now describe the new methods proposed in this paper, and some pre-existing methods that we compare against.

4.1 Multi-prediction Training

Our proposed approach is to directly train the DBM to be good at solving all possible variational inference problems.
We call this multi-prediction training because the procedure involves training the model to predict any subset of variables given the complement of that subset of variables.
Let O be a vector containing all variables that are observed during training. For a purely unsupervised learning task, O is just v itself. In the supervised setting, O = [v, y]T. Note that y won't be observed at test time, only training time. Let D be the training set, i.e. a collection of values of O. Let S be a sequence of subsets of the possible indices of O. Let Qi be the variational (e.g., mean-field) approximation to the joint of OSi and h given O−Si:

Qi(OSi, h) = argmin_Q DKL(Q(OSi, h) ‖ P(OSi, h | O−Si)).

In all of the experiments presented in this paper, Q is constrained to be factorial, though one could design model families for which it makes sense to use richer structure in Q. Note that there is not an explicit formula for Q; Q must be computed by an iterative optimization process. In order to accomplish this minimization, we run the mean field fixed point equations to convergence.
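As a concrete illustration, the mean field fixed point equations for a small DBM can be unrolled as a fixed-depth recurrent computation. The sketch below is a minimal, assumption-laden illustration for a two-hidden-layer binary DBM (biases omitted, a single example, hypothetical weight shapes), not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_q(v, mask, W1, W2, n_iters=10):
    """Unrolled mean field for a DBM with layers v - h1 - h2.

    v    : observed visible vector (entries where mask is True are
           prediction targets; their observed values are ignored)
    mask : boolean vector, True where the variable belongs to S_i
    W1   : (n_v, n_h1) weights; W2 : (n_h1, n_h2) weights
    Biases are omitted for brevity; the real model includes them.
    Returns factorial mean-field means (q_v, q_h1, q_h2).
    """
    q_v = np.where(mask, 0.5, v)             # init targets at 0.5, clamp the rest
    q_h1 = np.full(W1.shape[1], 0.5)
    q_h2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):                 # each pass = one recurrent-net step
        q_h1 = sigmoid(q_v @ W1 + q_h2 @ W2.T)         # h1 sees v and h2
        q_h2 = sigmoid(q_h1 @ W2)                      # h2 sees h1
        q_v = np.where(mask, sigmoid(q_h1 @ W1.T), v)  # update targets only
    return q_v, q_h1, q_h2
```

Backpropagating −log Q through these updates, treating the loop as a fixed-depth recurrent net, is what yields the MP training gradient described below.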
Because each fixed point update uses the output of a previous fixed point update as input, this optimization procedure can be viewed as a recurrent neural network. (To simplify implementation, we don't explicitly test for convergence, but run the recurrent net for a pre-specified number of iterations that is chosen to be high enough that the net usually converges.)
We train the MP-DBM by using minibatch stochastic gradient descent on the multi-prediction (MP) objective function

J(D, θ) = − ∑_{O∈D} ∑_i log Qi(OSi).

In other words, the criterion for a single example O is a sum of several terms, with term i measuring the model's ability to predict (through a variational approximation) a subset of the variables in the training set, OSi, given the remainder of the observed variables, O−Si.
During SGD training, we sample minibatches of values of O and Si. Sampling O just means drawing an example from the training set. Sampling an Si uniformly simply requires sampling one bit (1 with probability 0.5) for each variable, to determine whether that variable should be an input to the inference procedure or a prediction target. To compute the gradient, we simply backprop the error derivatives of J through the recurrent net defining Q.
See Fig. 2 for a graphical description of this training procedure, and Fig. 3 for an example of the inference procedure run on MNIST digits.

Figure 1: The training procedure used by Salakhutdinov and Hinton [18] on MNIST. a) Train an RBM to maximize log P(v) using CD. b) Train another RBM to maximize log P(h(1), y) where h(1) is drawn from the first RBM's posterior. c) Stitch the two RBMs into one DBM. Train the DBM to maximize log P(v, y). d) Delete y from the model (don't marginalize it out, just remove the layer from the model). Make an MLP with inputs v and the mean field expectations of h(1) and h(2).
Fix the DBM parameters. Initialize the MLP parameters based on the DBM parameters. Train the MLP parameters to predict y.

Figure 3: Mean field inference applied to MNIST digits. Within each pair of rows, the upper row shows pixels and the lower row shows class labels. The first column shows a complete, labeled example. The second column shows information to be masked out, using red pixels to indicate information that is removed. The subsequent columns show steps of mean field. The images show the pixels being filled back in by the mean field inference, and the blue bars show the probability of the correct class under the mean field posterior.

Figure 2: Multi-prediction training: This diagram shows the neural nets instantiated to do multi-prediction training on one minibatch of data. The three rows show three different examples. Black circles represent variables the net is allowed to observe. Blue circles represent prediction targets. Green arrows represent computational dependencies. Each column shows a single mean field fixed point update. Each mean field iteration consists of two fixed point updates. Here we show only one iteration to save space, but in a real application MP training should be run with 5-15 iterations.

Figure 4: Multi-inference trick: When estimating y given v, a mean field iteration consists of first applying a mean field update to h(1) and y, then applying one to h(2). To use the multi-inference trick, start the iteration by computing r as the mean field update v would receive if it were not observed.
Then use 0.5(r + v) in place of v and run a regular mean field iteration.

Figure 5: Samples generated by alternately sampling Si uniformly and sampling O−Si from Qi(O−Si).

This training procedure is similar to one introduced by Brakel et al. [6] for time-series models. The primary difference is that we use log Q as the loss function, while Brakel et al. [6] apply hard-coded loss functions such as mean squared error to the predictions of the missing values.

4.2 The Multi-Inference Trick

Mean field inference can be expensive due to needing to run the fixed point equations several times in order to reach convergence. In order to reduce this computational expense, it is possible to train using fewer mean field iterations than required to reach convergence. In this case, we are no longer necessarily minimizing J as written, but rather doing partial training of a large number of fixed-iteration recurrent nets that solve related problems.
We can approximately take the geometric mean over all predicted distributions Q (for different subsets Si) and renormalize in order to combine the predictions of all of these recurrent nets. This way, imperfections in the training procedure are averaged out, and we are able to solve inference tasks even if the corresponding recurrent net was never sampled during MP training.
In order to approximate this average efficiently, we simply take the geometric mean at each step of inference, instead of attempting to take the correct geometric mean of the entire inference process. See Fig. 4 for a graphical depiction of the method. This is the same type of approximation used to take the average over several MLP predictions when using dropout [10]. Here, the averaging rule is slightly different.
In dropout, the different MLPs we average over either include or exclude each variable. To take the geometric mean over a unit hj that receives input from vi, we average together the contribution viWij from the model that contains vi and the contribution 0 from the model that does not. The final contribution from vi is 0.5viWij, so the dropout model averaging rule is to run an MLP with the weights divided by 2.
For the multi-inference trick, each recurrent net we average over solves a different inference problem. In half of the problems, vi is observed, and contributes viWij to hj's total input. In the other half of the problems, vi is inferred. In contrast to dropout, vi is never completely absent. If we represent the mean field estimate of vi with ri, then in this case that unit contributes riWij to hj's total input. To run multi-inference, we thus replace references to v with 0.5(v + r), where r is updated at each mean field iteration. The main benefit to this approach is that it gives a good way to incorporate information from many recurrent nets trained in slightly different ways. If the recurrent net corresponding to the desired inference task is somewhat suboptimal due to not having been sampled enough during training, its defects can often be remedied by averaging its predictions with those of other similar recurrent nets. The multi-inference trick can also be understood as including an input denoising step built into the inference. In practice, multi-inference mostly seems to be beneficial if the network was trained without letting mean field run to convergence. When the model was trained with converged mean field, each recurrent net is just solving an optimization problem in a graphical model, and it doesn't matter whether every recurrent net has been individually trained.
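The multi-inference trick can be sketched in code. The version below is a hypothetical implementation for a v - h1 - h2 - y architecture with biases omitted and weight shapes assumed; it is an illustration of the averaging rule, not the exact code used in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_inference_predict(v, W1, W2, W3, n_iters=10):
    """Predict y given v with the multi-inference trick in a DBM with
    layers v - h1 - h2 - y (biases omitted, shapes assumed).

    Each iteration first computes r, the mean-field update v would
    receive if it were unobserved, then runs a regular mean-field
    iteration with 0.5 * (r + v) in place of v.
    """
    q_h1 = np.full(W1.shape[1], 0.5)
    q_h2 = np.full(W2.shape[1], 0.5)
    q_y = np.full(W3.shape[1], 1.0 / W3.shape[1])
    for _ in range(n_iters):
        r = sigmoid(q_h1 @ W1.T)                   # reconstruction of v
        v_eff = 0.5 * (r + v)                      # average observed and inferred input
        q_h1 = sigmoid(v_eff @ W1 + q_h2 @ W2.T)   # update h1 ...
        q_y = softmax(q_h2 @ W3)                   # ... and y together
        q_h2 = sigmoid(q_h1 @ W2 + q_y @ W3.T)     # then update h2
    return q_y
```

The only change relative to a regular mean field iteration is the two lines computing r and v_eff, which implement the 0.5(v + r) substitution.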
The multi-inference trick is mostly useful as a cheap alternative when getting the absolute best possible test set accuracy is not as important as fast training and evaluation.

4.3 Justification and advantages

In the case where we run the recurrent net for predicting Q to convergence, the multi-prediction training algorithm follows the gradient of the objective function J. This can be viewed as a mean field approximation to the generalized pseudolikelihood.
While both pseudolikelihood and likelihood are asymptotically consistent estimators, their behavior in the limited data case is different. Maximum likelihood should be better if the overall goal is to draw realistic samples from the model, but generalized pseudolikelihood can often be better for training a model to answer queries conditioning on sets similar to the Si used during training.
Note that our variational approximation is not quite the same as the way variational approximations are usually applied. We use variational inference to ensure that the distributions we shape using backprop are as close as possible to the true conditionals. This is different from the usual approach to variational learning, where Q is used to define a lower bound on the log likelihood and variational inference is used to make the bound as tight as possible.
In the case where the recurrent net is not trained to convergence, there is an alternate way to justify MP training. Rather than doing variational learning on a single probabilistic model, the MP procedure trains a family of recurrent nets to solve related prediction problems by running for some fixed number of iterations.
Each recurrent net is trained only on a subset of the data (and most re-\ncurrent nets are never trained at all, but only work because they share parameters with the others).\nIn this case, the multi-inference trick allows us to justify MP training as approximately training an\nensemble of recurrent nets using bagging.\nStoyanov et al. [20] have observed that a training strategy similar to MPT (but lacking the multi-\ninference trick) is useful because it trains the model to work well with the inference approximations\nit will be evaluated with at test time. We \ufb01nd these properties to be useful as well. The choice of this\ntype of variational learning combined with the underlying generalized pseudolikelihood objective\nmakes an MP-DBM very well suited for solving approximate inference problems but not very well\nsuited for sampling.\nOur primary design consideration when developing multi-prediction training was ensuring that the\nlearning rule was state-free. PCD training uses persistent Markov chains to estimate the gradient.\nThese Markov chains are used to approximately sample from the model, and only sample from\napproximately the right distribution if the model parameters evolve slowly. The MP training rule\ndoes not make any reference to earlier training steps, and can be computed with no burn in. This\nmeans that the accuracy of the MP gradient is not dependent on properties of the training algorithm\nsuch as the learning rate which can easily break PCD for many choices of the hyperparameters.\nAnother bene\ufb01t of MP is that it is easy to obtain an unbiased estimate of the MP objective from\na small number of samples of v and i. This is in contrast to the log likelihood, which requires\nestimating the log partition function. The best known method for doing so is AIS, which is relatively\nexpensive [16]. 
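Such a cheap unbiased estimate of J can be sketched as follows, with a hypothetical `infer_marginals` routine standing in for the unrolled mean-field net, and binary variables assumed for simplicity:

```python
import numpy as np

def mp_objective_estimate(batch, infer_marginals, rng):
    """Unbiased minibatch estimate of the MP objective
    J = -sum_O sum_i log Q_i(O_{S_i}), estimated by sampling one
    mask S_i per example.

    batch           : (n, d) array of binary observed vectors O
    infer_marginals : hypothetical inference routine mapping (O, mask)
                      to factorial marginals q with q[j] = Q(O_j = 1)
    """
    total = 0.0
    for O in batch:
        mask = rng.random(O.shape[0]) < 0.5       # each variable in S_i w.p. 0.5
        q = infer_marginals(O, mask)
        p_target = np.where(O[mask] > 0.5, q[mask], 1.0 - q[mask])
        total -= np.log(p_target).sum()           # -log Q_i(O_{S_i})
    return total / len(batch)
```

Note that, unlike a PCD gradient estimate, nothing here depends on persistent state or burn-in: each estimate is computed fresh from sampled (O, Si) pairs.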
Cheap estimates of the objective function enable early stopping based on the MP objective (though we generally use early stopping based on classification accuracy) and optimization based on line searches (though we do not explore that possibility in this paper).

4.4 Regularization

In order to obtain good generalization performance, Salakhutdinov and Hinton [18] regularized both the weights and the activations of the network.
Salakhutdinov and Hinton [18] regularize the weights using an L2 penalty. We find that for joint training, it is critically important to not do this (on the MNIST dataset, we were not able to find any MP-DBM hyperparameter configuration involving weight decay that performs as well as layerwise DBMs, but without weight decay MP-DBMs outperform DBMs). When the second layer weights are not trained well enough for them to be useful for modeling the data, the weight decay term will drive them to become very small, and they will never have an opportunity to recover. It is much better to use constraints on the norms of the columns of the weight vectors as done by Srebro and Shraibman [19].
Salakhutdinov and Hinton [18] regularize the activities of the hidden units with a somewhat complicated sparsity penalty. See http://www.mit.edu/~rsalakhu/DBM.html for details. We use max(|E_{h∼Q}[h] − t| − λ, 0) and backpropagate this through the entire inference graph. t and λ are hyperparameters.

4.5 Related work: centering

Montavon and Müller [15] showed that an alternative, "centered" representation of the DBM results in successful generative training without a greedy layerwise pretraining step. However, centered DBMs have never been shown to have good classification performance. We therefore evaluate the classification performance of centering in this work. We consider two methods of variational PCD training.
In one, we use Rao-Blackwellization [5, 11, 17] of the negative phase particles to reduce the variance of the negative phase. In the other variant ("centering+"), we use a special negative phase that Salakhutdinov and Hinton [18] found useful. This negative phase uses a small amount of mean field, which reduces the variance further but introduces some bias, and has better symmetry with the positive phase. See http://www.mit.edu/~rsalakhu/DBM.html for details.

Figure 6: Quantitative results on MNIST: (a) Cross-validation, (b) Missing inputs, (c) General queries. (a) During cross-validation, MP training performs well for most hyperparameters, while both centering and centering with the special negative phase do not perform as well and only perform well for a few hyperparameter values. Note that the vertical axis is on a log scale. (b) Generic inference tasks: When classifying with missing inputs, the MP-DBM outperforms the other DBMs for most amounts of missing inputs. (c) When using approximate inference to resolve general queries, the standard DBM, centered DBM, and MP-DBM all perform about the same when asked to predict a small number of variables. For larger queries, the MP-DBM performs the best.

4.6 Sampling, and a connection to GSNs

The focus of this paper is solving inference problems, not generating samples, so we do not investigate the sampling properties of MP-DBMs extensively. However, it is interesting to note that an MP-DBM can be viewed as a collection of dependency networks [9] with shared parameters. Dependency networks are a special case of generative stochastic networks or GSNs (Bengio et al. [3], section 3.4). This means that the MP-DBM is associated with a distribution arising out of the Markov chain in which at each step one samples an Si uniformly and then samples O from Qi(O). Example samples are shown in figure 5.
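The Markov chain just described can be sketched as follows, with a hypothetical `infer_marginals` routine standing in for the trained mean-field net over binary variables:

```python
import numpy as np

def mp_dbm_sampling_chain(O_init, infer_marginals, n_steps, rng):
    """GSN-style chain associated with the MP-DBM: at each step sample
    a subset S_i uniformly, run approximate inference for Q_i, then
    resample the selected variables from their factorial marginals.

    infer_marginals : hypothetical routine mapping (O, mask) to
                      factorial marginals q with q[j] = Q(O_j = 1)
    """
    O = O_init.copy()
    for _ in range(n_steps):
        mask = rng.random(O.shape[0]) < 0.5                 # draw S_i
        q = infer_marginals(O, mask)
        draws = (rng.random(O.shape[0]) < q).astype(float)  # Bernoulli sample
        O = np.where(mask, draws, O)                        # resample only S_i
    return O
```

Under this view, each chain step is one dependency-network-style resampling move, with the mean-field net supplying the conditional distributions.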
Furthermore, it means that if MPT is a consistent estimator of the conditional distributions, then MPT is a consistent estimator of the probability distribution defined by the stationary distribution of this Markov chain. Samples drawn by Gibbs sampling in the DBM model do not look as good (probably because the variational approximation is too damaging). This suggests that the perspective of the MP-DBM as a GSN merits further investigation.

5 Experiments

5.1 MNIST experiments

In order to compare MP training and centering to standard DBM performance, we cross-validated each of the new methods by running 25 training experiments for each of three conditions: centered DBMs, centered DBMs with the special negative phase ("Centering+"), and MP training.
All three conditions visited exactly the same set of 25 hyperparameter values for the momentum schedule, sparsity regularization hyperparameters, weight and bias initialization hyperparameters, weight norm constraint values, and number of mean field iterations. The centered DBMs also required one additional hyperparameter, the number of Gibbs steps to run for variational PCD. We used different values of the learning rate for the different conditions, because the different conditions require different ranges of learning rate to perform well. We use the same size of model, minibatch and negative chain collection as Salakhutdinov and Hinton [18], with 500 hidden units in the first layer, 1,000 hidden units in the second, 100 examples per minibatch, and 100 negative chains. The energy function for this model is

E(v, h, y) = −vT W(1)h(1) − h(1)T W(2)h(2) − h(2)T W(3)y − vT b(0) − h(1)T b(1) − h(2)T b(2) − yT b(3).

See Fig. 6a for the results of cross-validation. On the validation set, MP training consistently performs better and is much less sensitive to hyperparameters than the other methods.
This is likely because the state-free nature of the learning rule makes it perform better with settings of the learning rate and momentum schedule that result in the model distribution changing too fast for a method based on Markov chains to keep up.
When we add an MLP classifier (as shown in Fig. 1d), the best "Centering+" DBM obtains a classification error of 1.22% on the test set. The best MP-DBM obtains a classification error of 0.88%. This compares to 0.95% obtained by Salakhutdinov and Hinton [18].
If instead of adding an MLP to the model, we simply train a larger MP-DBM with twice as many hidden units in each layer, and apply the multi-inference trick, we obtain a classification error rate of 0.91%. In other words, we are able to classify nearly as well using a single large DBM and a generic inference procedure, rather than using a DBM followed by an entirely separate MLP model specialized for classification.
The original DBM was motivated primarily as a generative model with a high AIS score and as a means of initializing a classifier. Here we explore some more uses of the DBM as a generative model. Fig. 6b shows an evaluation of various DBMs' ability to classify with missing inputs. Fig. 6c shows an evaluation of their ability to resolve queries about random subsets of variables.
In both cases we find that the MP-DBM performs the best for most amounts of missing inputs.

5.2 NORB experiments

NORB consists of 96×96 binocular greyscale images of objects from five different categories, under a variety of pose and lighting conditions. Salakhutdinov and Hinton [18] preprocessed the images by resampling them with bigger pixels near the border of the image, yielding an input vector of size 8,976. We used this preprocessing as well. Salakhutdinov and Hinton [18] then trained an RBM with 4,000 binary hidden units and Gaussian visible units to preprocess the data into an all-binary representation, and trained a DBM with two hidden layers of 4,000 units each on this representation. Since the goal of this work is to provide a single unified model and training algorithm, we do not train a separate Gaussian RBM. Instead we train a single MP-DBM with Gaussian visible units and three hidden layers of 4,000 units each. The energy function for this model is

E(v, h, y) = −(v − µ)T βW(1)h(1) − h(1)T W(2)h(2) − h(2)T W(3)h(3) − h(3)T W(4)y + (1/2)(v − µ)T β(v − µ) − h(1)T b(1) − h(2)T b(2) − h(3)T b(3) − yT b(4),

where µ is a learned vector of visible unit means and β is a learned diagonal precision matrix.
By adding an MLP on top of the MP-DBM, following the same architecture as Salakhutdinov and Hinton [18], we were able to obtain a test set error of 10.6%. This is a slight improvement over the standard DBM's 10.8%.
On MNIST we were able to outperform the DBM without using the MLP classifier because we were able to train a larger MP-DBM. On NORB, the model size used by Salakhutdinov and Hinton [18] is already as large as we are able to fit on most of our graphics cards, so we were not able to do the same for this dataset.
It is possible to do better on NORB using convolution or synthetic transformations of the training data. We did not evaluate the effect of these techniques on the MP-DBM because our present goal is not to obtain state-of-the-art object recognition performance but only to verify that our joint training procedure works as well as the layerwise training procedure for DBMs. There is no public demo code available for the standard DBM on this dataset, and we were not able to reproduce the standard DBM results (layerwise DBM training requires significant experience and intuition). We therefore can't compare the MP-DBM to the original DBM in terms of answering general queries or classification with missing inputs on this dataset.

6 Conclusion

This paper has demonstrated that MP training and the multi-inference trick provide a means of training a single model, with a single stage of training, that matches the performance of standard DBMs but still works as a general probabilistic model, capable of handling missing inputs and answering general queries. We have verified that MP training outperforms the standard training procedure at classification on the MNIST and NORB datasets where the original DBM was first applied. We have shown that MP training works well with binary, Gaussian, and softmax units, as well as architectures with either two or three hidden layers. In future work, we hope to apply the MP-DBM to more practical applications, and explore techniques, such as dropout, that could improve its performance further.

Acknowledgments

We would like to thank the developers of Theano [4, 2] and Pylearn2 [7]. We would also like to thank NSERC, Compute Canada, and Calcul Québec for providing computational resources.

References

[1] Arnold, L. and Ollivier, Y. (2012). Layer-wise learning of deep generative models.
Technical report, arXiv:1212.1524.

[2] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[3] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2013). Deep generative stochastic networks trainable by backprop. Technical Report arXiv:1306.1091, Universite de Montreal.

[4] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral presentation.

[5] Blackwell, D. (1947). Conditional Expectation and Unbiased Sequential Estimation. Ann. Math. Statist., 18, 105-110.

[6] Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for time-series imputation. Journal of Machine Learning Research, 14, 2771-2797.

[7] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013a). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[8] Goodfellow, I. J., Courville, A., and Bengio, Y. (2013b). Scaling up spike-and-slab models for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1902-1914.

[9] Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49-75.

[10] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.

[11] Kolmogorov, A.
(1953). Unbiased Estimates. American Mathematical Society translations. American Mathematical Society.

[12] Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631-1649.

[13] LeCun, Y., Huang, F.-J., and Bottou, L. (2004). Learning methods for generic object recognition with invariance to pose and lighting. In CVPR'2004, pages 97-104.

[14] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

[15] Montavon, G. and Müller, K.-R. (2012). Learning feature hierarchies with centered deep Boltzmann machines. CoRR, abs/1203.4416.

[16] Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125-139.

[17] Rao, C. R. (1973). Linear Statistical Inference and its Applications. J. Wiley and Sons, New York, 2nd edition.

[18] Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8.

[19] Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545-560. Springer-Verlag.

[20] Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS'2011.

[21] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML'2008, pages 1064-1071.

[22] Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates.
Stochastics and Stochastic Reports, 65(3), 177-228.