{"title": "Iterative Refinement of the Approximate Posterior for Directed Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4691, "page_last": 4699, "abstract": "Variational methods that rely on a recognition network to approximate the posterior of directed graphical models offer better inference and learning than previous methods. Recent advances that exploit the capacity and flexibility in this approach have expanded what kinds of models can be trained. However, as a proposal for the posterior, the capacity of the recognition network is limited, which can constrain the representational power of the generative model and increase the variance of Monte Carlo estimates. To address these issues, we introduce an iterative refinement procedure for improving the approximate posterior of the recognition network and show that training with the refined posterior is competitive with state-of-the-art methods. The advantages of refinement are further evident in an increased effective sample size, which implies a lower variance of gradient estimates.", "full_text": "Iterative Refinement of the Approximate Posterior for Directed Belief Networks\n\nR Devon Hjelm\n\nUniversity of New Mexico and the Mind Research Network\n\ndhjelm@mrn.org\n\nKyunghyun Cho\n\nCourant Institute & Center for Data Science, New York University\n\nkyunghyun.cho@nyu.edu\n\nJunyoung Chung\n\nUniversity of Montreal\n\njunyoung.chung@umontreal.ca\n\nRuss Salakhutdinov\n\nCarnegie Mellon University\n\nrsalakhu@cs.toronto.edu\n\nVince Calhoun\n\nUniversity of New Mexico and the Mind Research Network\n\nvcalhoun@mrn.org\n\nNebojsa Jojic\n\nMicrosoft Research\n\njojic@microsoft.com\n\nAbstract\n\nVariational methods that rely on a recognition network to approximate the posterior\nof directed graphical models offer better inference and learning than previous\nmethods. 
Recent advances that exploit the capacity and \ufb02exibility in this approach\nhave expanded what kinds of models can be trained. However, as a proposal for the\nposterior, the capacity of the recognition network is limited, which can constrain the\nrepresentational power of the generative model and increase the variance of Monte\nCarlo estimates. To address these issues, we introduce an iterative re\ufb01nement\nprocedure for improving the approximate posterior of the recognition network and\nshow that training with the re\ufb01ned posterior is competitive with state-of-the-art\nmethods. The advantages of re\ufb01nement are further evident in an increased effective\nsample size, which implies a lower variance of gradient estimates.\n\n1\n\nIntroduction\n\nVariational methods have surpassed traditional methods such as Markov chain Monte Carlo [MCMC,\n15] and mean-\ufb01eld coordinate ascent [25] as the de-facto standard approach for training directed\ngraphical models. Helmholtz machines [3] are a type of directed graphical model that approximate\nthe posterior distribution with a recognition network that provides fast inference as well as \ufb02exible\nlearning which scales well to large datasets. Many recent signi\ufb01cant advances in training Helmholtz\nmachines come as estimators for the gradient of the objective w.r.t. the approximate posterior. The\nmost successful of these methods, variational autoencoders [VAE, 12], relies on a re-parameterization\nof the latent variables to pass the learning signal to the recognition network. 
This type of parameterization, however, is not available with discrete units, and the naive Monte Carlo estimate of the\ngradient has too high variance to be practical [3, 12].\nHowever, good estimators are available through importance sampling [1], input-dependent baselines\n[13], a combination of baselines and importance sampling [14], and parametric Taylor expansions [9].\nEach of these methods strives to be a lower-variance and unbiased gradient estimator. However, the\nreliance on the recognition network means that the quality of learning is bounded by the capacity of\nthe recognition network, which in turn raises the variance.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe demonstrate a reduction in the variance of Monte Carlo based estimators by iteratively refining\nthe approximate posterior provided by the recognition network. The complete learning algorithm\nfollows expectation-maximization [EM, 4, 16], where in the E-step the variational parameters of\nthe approximate posterior are initialized using the recognition network, then iteratively refined. The\nrefinement procedure provides an asymptotically-unbiased estimate of the variational lowerbound,\nwhich is tight w.r.t. the true posterior and can be used to easily train both the recognition network and\ngenerative model during the M-step. The variance-reducing refinement is available to any directed\ngraphical model and can give a more accurate estimate of the log-likelihood of the model.\nFor the iterative refinement step, we use adaptive importance sampling [AIS, 17]. We demonstrate that the\nproposed refinement procedure is effective for training directed belief networks, providing better\nor competitive estimates of the log-likelihood. 
We also demonstrate that the improved posterior from\nrefinement can improve inference and accuracy of evaluation for models trained by other methods.\n\n2 Directed Belief Networks and Variational Inference\n\nA directed belief network is a generative directed graphical model consisting of a conditional density\np(x|h) and a prior p(h), such that the joint density can be expressed as p(x, h) = p(x|h)p(h). In\nparticular, the joint density factorizes into a hierarchy of conditional densities and a prior:\n$p(x, h) = p(x|h_1) p(h_L) \\prod_{l=1}^{L-1} p(h_l|h_{l+1})$, where $p(h_l|h_{l+1})$ is the conditional density at the l-th layer and\n$p(h_L)$ is a prior distribution of the top layer. Sampling from the model can be done simply via\nancestral-sampling, first sampling from the prior, then subsequently sampling from each layer until\nreaching the observation, x. This latent variable structure can improve model capacity, but inference\ncan still be intractable, as is the case in sigmoid belief networks [SBN, 15], deep belief networks\n[DBN, 11], deep autoregressive networks [DARN, 7], and other models in which each of the\nconditional distributions involves complex nonlinear functions.\n\n2.1 Variational Lowerbound of Directed Belief Network\n\nThe objective we consider is the likelihood function, p(x; \u03c6), where \u03c6 represents the parameters of the\ngenerative model (e.g. a directed belief network). Estimating the likelihood function given the joint\ndistribution, p(x, h; \u03c6), above is not generally possible as it requires intractable marginalization over\nh. Instead, we introduce an approximate posterior, q(h|x), as a proposal distribution. 
In this case,\nthe log-likelihood can be bounded from below\u2217:\n\n$$\\log p(x) = \\log \\sum_h p(x, h) \\geq \\sum_h q(h|x) \\log \\frac{p(x, h)}{q(h|x)} = \\mathbb{E}_{q(h|x)}\\left[\\log \\frac{p(x, h)}{q(h|x)}\\right] := \\mathcal{L}_1, \\tag{1}$$\n\nwhere we introduce the subscript in the lowerbound to make the connection to importance sampling\nlater. The bound is tight (i.e., L1 = log p(x)) when the KL divergence between the approximate and\ntrue posterior is zero (i.e., DKL(q(h|x)||p(h|x)) = 0). The gradients of the lowerbound w.r.t. the\ngenerative model can be approximated using the Monte Carlo approximation of the expectation:\n\n$$\\nabla_{\\phi} \\mathcal{L}_1 \\approx \\frac{1}{K} \\sum_{k=1}^{K} \\nabla_{\\phi} \\log p(x, h^{(k)}; \\phi), \\quad h^{(k)} \\sim q(h|x). \\tag{2}$$\n\nThe success of variational inference depends on the choice of approximate posterior, as a poor choice can\nresult in a looser variational bound. A deep feed-forward recognition network parameterized by \u03c8 has\nbecome a popular choice, such that q(h|x) = q(h|x; \u03c8), as it offers fast and flexible data-dependent\ninference [see, e.g., 22, 12, 13, 20]. Generally known as a \u201cHelmholtz machine\u201d [3], these approaches\noften require additional tricks to train, as the naive Monte Carlo gradient of the lowerbound w.r.t.\nthe variational parameters has high variance. In addition, the variational lowerbound in Eq. (1) is\nconstrained by the assumptions implicit in the choice of approximate posterior, as the approximate\nposterior must be within the capacity of the recognition network and factorial.\n\n\u2217 For clarity of presentation, we will often omit dependence on parameters \u03c6 of the generative model, so that\np(x, h) = p(x, h; \u03c6)\n\nFigure 1: Iterative refinement for variational inference. An initial estimate of the variational parameters is\nmade through a recognition network. The variational parameters are then updated iteratively, maximizing the\nlowerbound. 
The final approximate posterior is used to train the generative model by sampling. The recognition\nnetwork parameters are updated using the KL divergence between the refined posterior qk and the output of the\nrecognition network q0.\n\n2.2 Importance Sampled Variational Lowerbound\n\nThese assumptions can be relaxed by using an unbiased K-sampled importance weighted estimate of\nthe likelihood function (see [2] for details):\n\n$$\\mathcal{L}_1 \\leq \\mathcal{L}_K = \\frac{1}{K} \\sum_{k=1}^{K} \\frac{p(x, h^{(k)})}{q(h^{(k)}|x)} = \\frac{1}{K} \\sum_{k=1}^{K} w^{(k)} \\leq p(x), \\tag{3}$$\n\nwhere h(k) \u223c q(h|x) and w(k) are the importance weights. This lowerbound is tighter than the\nsingle-sample version provided in Eq. (1) and is an asymptotically unbiased estimate of the likelihood\nas K \u2192 \u221e.\nThe gradient of the lowerbound w.r.t. the model parameters \u03c6 is simple and can be estimated as:\n\n$$\\nabla_{\\phi} \\mathcal{L}_K = \\sum_{k=1}^{K} \\tilde{w}^{(k)} \\nabla_{\\phi} \\log p(x, h^{(k)}; \\phi), \\quad \\text{where } \\tilde{w}^{(k)} = \\frac{w^{(k)}}{\\sum_{k'=1}^{K} w^{(k')}}. \\tag{4}$$\n\nThe estimator in Eq. (3) can reduce the variance of the gradients, \u2207\u03c8LK, but in general additional\nvariance reduction is needed [14]. Alternatively, importance sampling yields an estimate of the\ninclusive KL divergence, DKL(p(h|x)||q(h|x)), which can be used for training parameters \u03c8 of\nthe recognition network [1]. However, it is well known that importance sampling can yield heavily-\nskewed distributions over the importance weights [5], so that only a small number of the samples will\neffectively have non-zero weight. This is consequential not only in training, but also for evaluating\nmodels when using Eq. 
(3) to estimate test log-probabilities, which requires drawing a very large\nnumber of samples (N \u2265 100,000 in the literature for models trained on MNIST [7]).\nThe effective sample size, ne, of importance-weighted estimates increases with the quality of the\napproximate posterior and is optimal when the approximate posterior matches the true posterior:\n\n$$n_e = \\frac{\\left(\\sum_{k=1}^{K} w^{(k)}\\right)^2}{\\sum_{k=1}^{K} (w^{(k)})^2} \\leq \\frac{\\left(\\sum_{k=1}^{K} p(x, h^{(k)})/p(h^{(k)}|x)\\right)^2}{\\sum_{k=1}^{K} \\left(p(x, h^{(k)})/p(h^{(k)}|x)\\right)^2} \\leq \\frac{(K p(x))^2}{K p(x)^2} = K. \\tag{5}$$\n\nConversely, importance sampling from a poorer approximate posterior will have a lower effective\nsample size, resulting in higher variance of the gradient estimates.\nIn order to improve the\neffectiveness of importance sampling, we need a method for improving the approximate posterior\nfrom those provided by the recognition network.\n\n3 Iterative Refinement for Variational Inference (IRVI)\n\nTo address the above issues, iterative refinement for variational inference (IRVI) uses the recognition\nnetwork as a preliminary guess of the posterior, then refines the posterior through iterative updates of\nthe variational parameters. For the refinement step, IRVI uses a stochastic transition operator, g(.),\nthat maximizes the variational lowerbound.\n\nAn overview of IRVI is available in Figure 1. For the expectation (E)-step, we feed the observation x\nthrough the recognition network to get the initial parameters, \u00b50, of the approximate posterior,\nq0(h|x; \u03c8). We then refine \u00b50 by applying T updates to the variational parameters, \u00b5t+1 = g(\u00b5t, x),\niterating through T parameterizations \u00b51, . . . 
, \u00b5T of the approximate posterior qt(h|x).\nWith the final set of parameters, \u00b5T , the gradient estimate of the recognition parameters \u03c8 in the\nmaximization (M)-step is taken w.r.t. the negative exclusive KL divergence:\n\n$$-\\nabla_{\\psi} D_{KL}(q_T(h|x) \\,||\\, q_0(h|x; \\psi)) \\approx \\frac{1}{K} \\sum_{k=1}^{K} \\nabla_{\\psi} \\log q_0(h^{(k)}|x; \\psi), \\tag{6}$$\n\nwhere h(k) \u223c qT (h|x). Similarly, the gradients w.r.t. the parameters of the generative model \u03c6\nfollow Eqs. (2) or (4) using samples from the refined posterior qT (h|x). As an alternative to Eq. (6),\nwe can maximize the negative inclusive KL divergence using the refined approximate posterior:\n\n$$-\\nabla_{\\psi} D_{KL}(p(h|x) \\,||\\, q_0(h|x; \\psi)) \\approx \\sum_{k=1}^{K} \\tilde{w}^{(k)} \\nabla_{\\psi} \\log q_0(h^{(k)}|x; \\psi). \\tag{7}$$\n\nThe form of the IRVI transition operator, g(\u00b5t, x), depends on the problem. In the case of continuous\nvariables, we can make use of the VAE re-parameterization with the gradient of the lowerbound in\nEq. (1) for our refinement step (see supplementary material). However, as this is not available with\ndiscrete units, we take a different approach that relies on adaptive importance sampling.\n\n3.1 Adaptive Importance Refinement (AIR)\n\nAdaptive importance sampling [AIS, 17] provides a general approach for iteratively refining the\nvariational parameters. For Bernoulli distributions, we observe that the mean parameter of the true\nposterior, $\\hat{\\mu}$, can be written as the expected value of the latent variables:\n\n$$\\hat{\\mu} = \\mathbb{E}_{p(h|x)}[h] = \\sum_h h \\, p(h|x) = \\frac{1}{p(x)} \\sum_h q(h|x) \\frac{p(x, h)}{q(h|x)} h \\approx \\sum_{k=1}^{K} \\tilde{w}^{(k)} h^{(k)}. \\tag{8}$$\n\nAs the initial estimator typically has high variance, AIS iteratively moves \u00b5t toward $\\hat{\\mu}$ by applying\nEq. (8) until a stopping criterion is met. 
While using the update $g(\\mu_t, x, \\gamma) = \\sum_{k=1}^{K} \\tilde{w}^{(k)} h^{(k)}$ works in\nprinciple, a convex combination of the importance sample estimate of the current step and the\nparameters from the previous step tends to be more stable:\n\n$$h^{(k)} \\sim \\text{Bernoulli}(\\mu_t); \\quad \\mu_{t+1} = g(\\mu_t, x, \\gamma) = (1 - \\gamma)\\mu_t + \\gamma \\sum_{k=1}^{K} \\tilde{w}^{(k)} h^{(k)}. \\tag{9}$$\n\nHere, \u03b3 is the inference rate and (1 \u2212 \u03b3) can be thought of as the adaptive \u201cdamping\u201d rate. This\napproach, which we call adaptive importance refinement (AIR), should work with any discrete\nparametric distribution. Although AIR is applicable with continuous Gaussian variables, which\nmodel second-order statistics, we leave adapting AIR to continuous latent variables for future work.\n\n3.2 Algorithm and Complexity\n\nThe general AIR algorithm follows Algorithm 1 with gradient variations following Eqs. (2), (4),\n(6), and (7). While iterative refinement may reduce the variance of stochastic gradient estimates\nand speed up learning, it comes at a computational cost, as each update is T times more expensive\nthan fixed approximations. However, in addition to potential learning benefits, AIR can also\nimprove the approximate posterior of an already trained directed belief network at test time,\nindependent of how the model was trained. 
Our implementation following Algorithm 1 is available at\nhttps://github.com/rdevon/IRVI.\n\nAlgorithm 1 AIR\nRequire: A generative model p(x, h; \u03c6) = p(x|h; \u03c6)p(h; \u03c6) and a recognition network \u00b50 = f (x; \u03c8)\nRequire: A transition operator g(\u00b5, x, \u03b3) and inference rate \u03b3.\n  Compute \u00b50 = f (x; \u03c8) for q0(h|x; \u03c8)\n  for t=1:T do\n    Draw K samples h(k) \u223c qt(h|x) and compute normalized importance weights $\\tilde{w}^{(k)}$\n    $\\mu_t = (1 - \\gamma)\\mu_{t-1} + \\gamma \\sum_{k=1}^{K} \\tilde{w}^{(k)} h^{(k)}$\n  end for\n  if reweight then\n    $\\Delta\\phi \\propto \\sum_{k=1}^{K} \\tilde{w}^{(k)} \\nabla_{\\phi} \\log p(x, h^{(k)}; \\phi)$\n  else\n    $\\Delta\\phi \\propto \\frac{1}{K} \\sum_{k=1}^{K} \\nabla_{\\phi} \\log p(x, h^{(k)}; \\phi)$\n  end if\n  if inclusive KL Divergence then\n    $\\Delta\\psi \\propto \\sum_{k=1}^{K} \\tilde{w}^{(k)} \\nabla_{\\psi} \\log q_0(h^{(k)}|x; \\psi)$\n  else\n    $\\Delta\\psi \\propto \\frac{1}{K} \\sum_{k=1}^{K} \\nabla_{\\psi} \\log q_0(h^{(k)}|x; \\psi)$\n  end if\n\n4 Related Work\n\nAdaptive importance refinement (AIR) trades computation for expressiveness and is similar in\nthis regard to the refinement procedure of hybrid MCMC for variational inference [HVI, 24] and\nnormalizing flows for VAE [NF, 21]. HVI has a complexity similar to that of AIR, as it requires re-estimating\nthe lowerbound at every step. While NF can be less expensive than AIR, both HVI and NF rely on\nthe VAE re-parameterization to work, and thus cannot be applied to discrete variables. Sequential\nimportance sampling [SIS, 5] can offer a better refinement step than AIS but typically requires\nresampling to control variance. 
While parametric versions exist that could be applicable to training\ndirected graphical models with discrete units [8, 18], their applicability as a general refinement\nprocedure is limited as the refinement parameters need to be learned.\nImportance sampling is central to reweighted wake-sleep [RWS, 1], importance-weighted autoencoders\n[IWAE, 2], variational inference for Monte Carlo objectives [VIMCO, 14], and recent work on\nstochastic feed-forward networks [SFFN, 26, 19]. While each of these methods is competitive, they\nrely on importance samples from the recognition network and do not offer the low-variance estimates\navailable from AIR. Neural variational inference and learning [NVIL, 13] is a single-sample and\nbiased version of VIMCO, which is greatly outperformed by techniques that use importance sampling.\nBoth NVIL and VIMCO reduce the variance of the Monte Carlo estimates of gradients by using an\ninput-dependent baseline, but this approach does not necessarily provide a better posterior and cannot\nbe used to give better estimates of the likelihood function or expectations.\nFinally, IRVI is meant to be a general approach to refining the approximate posterior. IRVI is not\nlimited to the refinement step provided by AIR, and many different types of refinement steps are\navailable to improve the posterior for models above (see supplementary material for the continuous\ncase). SIS and sequential importance resampling [SIR, 6] can be used as an alternative to AIR and\nmay provide a better refinement step for IRVI.\n\n5 Experiments\n\nWe evaluate iterative refinement for variational inference (IRVI) using adaptive importance refinement\n(AIR) for both training and evaluating directed belief networks. We train and test on the following\nbenchmarks: the binarized MNIST handwritten digit dataset [23] and the Caltech-101 Silhouettes\ndataset. 
We centered the MNIST and Caltech datasets by subtracting the mean-image over the\ntraining set when used as input to the recognition network. We also train additional models using the\nre-weighted wake-sleep algorithm [RWS, 1], the state of the art for many configurations of directed\nbelief networks with discrete variables on these datasets, for comparison and to demonstrate improving\nthe approximate posteriors with refinement. With our experiments, we show that 1) IRVI can train\na variety of directed models as well as or better than existing methods, 2) refinement\nimproves the approximate posterior and can be applied to models trained by other algorithms, and 3)\nIRVI can be used to improve a model with a relatively simple approximate posterior.\n\nFigure 2: The log-likelihood (left) and normalized effective sample size (right) with epochs in log-scale on the\ntraining set for AIR with 5 and 20 refinement steps (vanilla AIR), reweighted AIR with 5 and 20 refinement\nsteps, reweighted AIR with inclusive KL objective and 5 or 20 refinement steps, and reweighted wake-sleep\n(RWS), all with a single stochastic latent layer. All models were evaluated with 100 posterior samples, their\nrespective number of refinement steps for the effective sample size (ESS), and with 20 refinement steps of AIR\nfor the log-likelihood. Despite longer wall-clock time per epoch,\n\nModels were trained using the RMSprop algorithm [10] with a batch size of 100 and early stopping\nby recorded best variational lower bound on the validation dataset. For AIR, 20 \u201cinference steps\u201d\n(K = 20), 20 adaptive samples (M = 20), and an adaptive damping rate, (1 \u2212 \u03b3), of 0.9 were used\nduring inference, chosen from validation in initial experiments. 20 posterior samples (N = 20) were\nused for model parameter updates for both AIR and RWS. 
All models were trained for 500 epochs\nand were fine-tuned for an additional 500 with a decaying learning rate and SGD.\nWe use a generative model with either a) a factorized Bernoulli prior as with sigmoid belief networks\n(SBNs) or b) an autoregressive prior, as in published MNIST results with deep autoregressive networks\n[DARN, 7]:\n\n$$\\text{a) } p(h) = \\prod_i p(h_i); \\; P(h_i = 1) = \\sigma(b_i), \\qquad \\text{b) } P(h_i = 1) = \\sigma\\Big(\\sum_{j=0}^{i-1} (W_r^{i, j < i} h_{j < i}) + b_i\\Big), \\tag{10}$$\n\nwhere \u03c3 is the sigmoid (\u03c3(x) = 1/(1 + exp(\u2212x))) function, Wr is a lower-triangular square matrix,\nand b is the bias vector.\nFor our experiments, we use conditional and approximate posterior densities that follow Bernoulli\ndistributions:\n\n$$P(h_{i,l} = 1 | h_{l+1}) = \\sigma(W_l^{i,:} \\cdot h_{l+1} + b_{i,l}), \\tag{11}$$\n\nwhere Wl is a weight matrix between the l and l + 1 layers. As in Gregor et al. [7] with MNIST, we\ndo not use autoregression on the observations, x, and use a fully factorized approximate posterior.\n\n5.1 Variance Reduction and Choosing the AIR Objective\n\nThe effective sample size (ESS) in Eq. (5) is a good indicator of the variance of the gradient estimate. In\nFig. 2 (right), we observe that the ESS improves as we take more AIR steps when training a deep\nbelief network (AIR(5) vs AIR(20)). When the approximate posterior is not refined (RWS), the ESS\nstays low throughout training, eventually resulting in a worse model. This improved ESS reveals\nitself as faster convergence in terms of the exact log-likelihood in the left panel of Fig. 2 (see the\nprogress of each curve until 100 epochs. See also supplementary materials for wall-clock time.)\nThis faster convergence does not guarantee a good final log-likelihood, as the latter depends on the\ntightness of the lowerbound rather than the variance of its estimate. This is most apparent when\ncomparing AIR(5), AIR+RW(5) and AIR+RW+IKL(5). AIR(5) has a low variance (high ESS) but\ncomputes the gradient of a looser lowerbound from Eq. 
(2), while the other two compute the gradient\nof a tighter lowerbound from Eq. (4). This results in AIR(5) converging faster than the other two,\nwhile the final log-likelihood estimates are better for the other two.\nHowever, we observe that the final log-likelihood estimates are comparable across all three variants\n(AIR, AIR+RW and AIR+RW+IKL) when a sufficient number of AIR steps are taken so that L1 is\nsufficiently tight. When 20 steps were taken, we observe that AIR(20) converges faster and\nachieves a better log-likelihood compared to AIR+RW(20) and AIR+RW+IKL(20). Based on these\nobservations, we use vanilla AIR (subsequently just \u201cAIR\u201d) in our following experiments.\n\nTable 1: Results for adaptive importance sampling iterative refinement (AIR), reweighted wake-sleep (RWS),\nand RWS with refinement with AIR at test (RWS+) for a variety of model configurations. Additional results are shown for sigmoid\nbelief networks (SBNs) trained with neural variational inference and learning (NVIL) from \u2020Mnih and Gregor\n[13] and variational inference for Monte Carlo objectives (VIMCO) from \u00a7Mnih and Rezende [14]. AIR is\ntrained with 20 inference steps and adaptive samples (K = 20, M = 20) in training (*3 layer SBN was trained\nwith 50 steps with an inference rate of 0.05). 
NVIL DARN results are from fDARN and VIMCO was trained\nusing 50 posterior samples (as opposed to 20 with AIR and RWS).\n\n                 MNIST                                        Caltech-101 Silhouettes\nModel            RWS      RWS+     AIR      NVIL\u2020    VIMCO\u00a7   RWS      RWS+     AIR\nSBN 200          102.51   102.00   100.92   113.1    \u2013        121.38   118.63   116.61\nSBN 200-200      93.82    92.83    92.90    99.8     \u2013        112.86   107.20   106.94\nSBN 200-200-200  92.00    91.02    92.56\u2217   96.7     90.9\u00a7    110.57   104.54   104.36\nDARN 200         86.91    86.21    85.89    92.5\u2020    \u2013        113.69   109.73   109.76\nDARN 500         85.40    84.71    85.46    90.7\u2020    \u2013        \u2013        \u2013        \u2013\n\n5.2 Training and Density Estimation\n\nWe evaluate AIR for training SBNs with one, two, and three layers of 200 hidden units and DARN\nwith 200 and 500 hidden units, comparing against our implementation of RWS. All models were\ntested using 100,000 posterior samples to estimate the lowerbounds and average test log-probabilities.\nWhen training SBNs with AIR and RWS, we used a completely deterministic network for the\napproximate posterior. For example, for a 2-layer SBN, the approximate posterior factors into the\napproximate posteriors for the top and the bottom hidden layers, and the initial variational parameters\nof the top layer, $\\mu_0^{(2)}$, are a function of the initial variational parameters of the first layer, $\\mu_0^{(1)}$:\n\n$$q_0(h_1, h_2|x) = q_0(h_1|x; \\mu_0^{(1)}) \\, q(h_2|x; \\mu_0^{(2)}); \\quad \\mu_0^{(1)} = f_1(x; \\psi_1); \\quad \\mu_0^{(2)} = f_2(\\mu_0^{(1)}; \\psi_2). \\tag{12}$$\n\nFor DARN, we trained two different configurations on MNIST: one with 500 stochastic units and an\nadditional hyperbolic tangent deterministic layer with 500 units in both the generative and recognition\nnetworks, and another with 200 stochastic units with a 500 hyperbolic tangent deterministic layer in\nthe generative network only. 
We used DARN with 200 units with the Caltech-101 silhouettes dataset.\nThe results of our experiments with the MNIST and Caltech-101 silhouettes datasets trained with\nAIR, RWS, and RWS refined at test with AIR (RWS+) are in Table 1. Refinement at test (RWS+)\nalways improves the results for RWS. As our unrefined results are comparable to those found in\nBornschein and Bengio [1], the improved results indicate many evaluations of Helmholtz machines in\nthe literature could benefit from refinement with AIR to improve evaluation accuracy. For most model\nconfigurations, AIR and RWS perform comparably, though RWS appears to do better in the average\ntest log-probability estimates for some configurations of MNIST. RWS+ performs comparably with\nvariational inference for Monte Carlo objectives [VIMCO, 14], despite the reported VIMCO results\nrelying on more posterior samples in training. Finally, AIR results approach SOTA with Caltech-101\nsilhouettes with 3-layer SBNs against neural autoregressive distribution estimator [NADE, 1].\nWe also tested our log-probability estimates against the exact log-probability (by marginalizing\nover the joint) of smaller single-layer SBNs with 20 stochastic units. The exact log-probability was\n\u2212127.474 and our estimate with the unrefined approximate posterior was \u2212127.51 and \u2212127.48 with 100\nrefinement steps. Overall, this result is consistent with those of Table 1, that iterative refinement\nimproves the accuracy of log-probability estimates.\n\n5.3 Posterior Improvement\n\nIn order to visualize the improvements due to refinement and to demonstrate AIR as a general means\nof improvement for directed models at test, we generate N samples from the approximate posterior\nwithout (h \u223c q0(h|x; \u03c8)) and with refinement (h \u223c qT (h|x)), from a single-layer SBN with 20\nstochastic units originally trained with RWS. 
We then use the samples from the approximate posterior\nto compute the expected conditional probability or average reconstruction: $\\frac{1}{N} \\sum_{n=1}^{N} p(x|h^{(n)})$. We\nused a restricted model with a lower number of stochastic units to demonstrate that refinement also\nworks well with simple models, where the recognition network is more likely to \u201caverage\u201d over latent\nconfigurations, giving a misleading evaluation of the model\u2019s generative capability.\n\nFigure 3: Top: Average reconstructions, $\\frac{1}{N} \\sum_{n=1}^{N} p(x|h^{(n)})$, for h(n) sampled from the output of the\nrecognition network, q0(h|x) (middle row) against those sampled from the refined posterior, qT (h|x) (bottom\nrow) for T = 20 with a model trained on MNIST. Top row is ground truth. Among the digits whose reconstruction\nchanges the most, many changes correctly reveal the identity of the digit. Bottom: Average reconstructions\nfor a single-layer model with 200 units trained on Caltech-101 silhouettes. Instead of using the posterior from the\nrecognition network, we derived a simpler version, setting 80% of the variational parameters from the recognition\nnetwork to 0.5, then applied iterative refinement.\n\nWe also refine the approximate posterior of a simplified version of the recognition network of a\nsingle-layer SBN with 200 units trained with RWS. We simplified the approximate posterior by first\ncomputing \u00b50 = f (x; \u03c8), then randomly setting 80% of the variational parameters to 0.5.\nFig. 3 shows improvement from refinement for 25 digits from the MNIST test dataset, where the\nsamples chosen were those for which the expected reconstruction error of the original test sample\nimproved the most. The digits generated from the refined posterior are of higher quality, and in\nmany cases the correct digit class is revealed. 
This shows that, in many cases where the recognition\nnetwork indicates that the generative model cannot model a test sample correctly, re\ufb01nement can\nmore accurately reveal the model\u2019s capacity. With the simpli\ufb01ed approximate posterior, re\ufb01nement is\nable to retrieve most of the shape of images from the Caltech-101 silhouettes, despite only starting\nwith 20% of the original parameters from the recognition network. This indicates that the work of\ninference need not all be done via a complex recognition network: iterative re\ufb01nement can be used to\naid in inference with a relatively simple approximate posterior.\n\n6 Conclusion\n\nWe have introduced iterative re\ufb01nement for variational inference (IRVI), a simple, yet effective and\n\ufb02exible approach for training and evaluating directed belief networks that works by improving the\napproximate posterior from a recognition network. We demonstrated IRVI using adaptive importance\nre\ufb01nement (AIR), which uses importance sampling at each iterative step, and showed that AIR can\nbe used to provide low-variance gradients to ef\ufb01ciently train deep directed graphical models. AIR\ncan also be used to more accurately reveal the generative model\u2019s capacity, which is evident when\nthe approximate posterior is of poor quality. The improved approximate posterior provided by AIR\nshows an increased effective samples size, which is a consequence of a better approximation of the\ntrue posterior and improves the accuracy of the test log-probability estimates.\n\n7 Acknowledgements\n\nThis work was supported by Microsoft Research to RDH under NJ; NIH P20GM103472, R01 grant\nREB020407, and NSF grant 1539067 to VDC; and ONR grant N000141512791 and ADeLAIDE\ngrant FA8750-16C-0130-001 to RS. 
KC was supported in part by Facebook, Google (Google Faculty\nAward 2016) and NVidia (GPU Center of Excellence 2015-2016), and RDH was supported in part by\nPIBBS.\n\nReferences\n[1] J\u00f6rg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.\n\n[2] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint\n\narXiv:1509.00519, 2015.\n\n8\n\n\f[3] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural\n\ncomputation, 7(5):889\u2013904, 1995.\n\n[4] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the\n\nem algorithm. Journal of the royal statistical society. Series B (methodological), 1977.\n\n[5] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods.\n\nIn Sequential Monte Carlo methods in practice, pages 3\u201314. Springer, 2001.\n\n[6] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-gaussian\nIn Radar and Signal Processing, IEE Proceedings F, volume 140, pages\n\nbayesian state estimation.\n107\u2013113. IET, 1993.\n\n[7] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive\n\nnetworks. arXiv preprint arXiv:1310.8499, 2013.\n\n[8] Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential monte carlo. In\n\nAdvances in Neural Information Processing Systems, pages 2611\u20132619, 2015.\n\n[9] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. Muprop: Unbiased backpropagation for\n\nstochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.\n\n[10] Geoffrey Hinton. Neural networks for machine learning. Coursera, video lectures, 2012.\n\n[11] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 
A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[12] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[13] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1791\u20131799, 2014.\n\n[14] Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.\n\n[15] Radford M Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1), 1992.\n\n[16] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355\u2013368. Springer, 1998.\n\n[17] Man-Suk Oh and James O Berger. Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation, 41(3-4):143\u2013168, 1992.\n\n[18] Brooks Paige and Frank Wood. Inference networks for sequential Monte Carlo in graphical models. arXiv preprint arXiv:1602.06701, 2016.\n\n[19] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.\n\n[20] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278\u20131286, 2014.\n\n[21] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.\n\n[22] Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. 
In International Conference on Artificial Intelligence and Statistics, pages 693\u2013700, 2010.\n\n[23] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872\u2013879. ACM, 2008.\n\n[24] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218\u20131226. JMLR Workshop and Conference Proceedings, 2015. URL http://jmlr.org/proceedings/papers/v37/salimans15.pdf.\n\n[25] Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4(1):61\u201376, 1996.\n\n[26] Yichuan Tang and Ruslan R Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, pages 530\u2013538, 2013.\n", "award": [], "sourceid": 2348, "authors": [{"given_name": "Devon", "family_name": "Hjelm", "institution": "University of New Mexico"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}, {"given_name": "Kyunghyun", "family_name": "Cho", "institution": "University of Montreal"}, {"given_name": "Nebojsa", "family_name": "Jojic", "institution": "Microsoft Research"}, {"given_name": "Vince", "family_name": "Calhoun", "institution": "Mind Research Network"}, {"given_name": "Junyoung", "family_name": "Chung", "institution": "University of Montreal"}]}