Review for NeurIPS paper: Analytical Probability Distributions and Exact Expectation-Maximization for Deep Generative Networks

NeurIPS 2020

Analytical Probability Distributions and Exact Expectation-Maximization for Deep Generative Networks

Review 1

Summary and Contributions: The paper focuses on learning parametric distributions with latent variables via EM method, where the decoder is piece-wise affine function. They propose to compute the posterior distribution explicitly, which makes E-step exact. For piece-wise affine decoder and Gaussian prior, latent space is partitioned into polytopes and authors enumerate all the polytopes and do marginalization over them explicitly. M-step is also performed analytically due to usage of Gaussian distributions for the decoder and prior. Authors illustrate effectiveness of proposed exact EM learning on toy examples and on MNIST, as compared to vanilla VAE.

Strengths: The fact that for small networks and low-dimensional latent space the authors were able to do EM learning exactly is interesting. The way authors solve the computational problem of marginalizing the latent variables is new to me.

Weaknesses: The main weakness is very limited applicability of the proposed method: it doesn't seem like a good fit for modern machine learning given the size limitations. Given this the claims of the paper are exaggerated. Aside from interesting computational approach the impact of the method is low. Empirical evaluation of the computational effort of the new algorithm is missing.

Correctness: The claims of the paper are exaggerated since the method only applies to small problems while the claims are general. Empirical methodology is missing analysis of computational cost.

Clarity: The paper is well-written, with extensive derivations.

Relation to Prior Work: Unlike the prior work that analyzed partitioning of latent space by piece-wise affine functions, the current work explicitly performs marginalization over it, this is implied in the text.

Reproducibility: Yes

Additional Feedback: - Given explicit marginalization of latent variables, why not minimize log-likelihood directly rather then using EM? I am probably missing something, so would appreciate a clarification of this. - Adding cost analysis of the algorithm would be helpful, as well as discussion of scaling it up to realistic problems. - in comparison to VAE in Sec 4.3 it is unclear if the authors compute log-likelihood or ELBO (both terms are used in a confusing way) - curves in most plots are not labelled *************post rebuttal*********************************** I'd like to thank the authors for their response and addressing the above points. I have updated my score. I am still not convinced of the usefulness of the approach beyond toy examples.

Review 2

Summary and Contributions: Deep generative models (DGMs), specifically variational autoencoders (VAEs), currently rely on variational inference and stochastic optimization of a lower bound to maximize likelihood since the analytic likelihood cannot be computed in general. This paper shows that in fact the likelihood can be computed analytically and maximized with analytic expectation maximization (EM) updates when the network uses affine piecewise nonlinearities like ReLU and leaky-ReLU. The key insight is that these networks induces a partition of the latent space that can be handled tractably when the prior and likelihood are both Gaussian. This paper analytically derives the posterior distribution, the marginal distribution, the expectation of the complete likelihood (for the E step), and the updates to the parameters (for the M step). These novel derivations allows the authors to perform EM on DGMs for the first time. Empirically, the paper shows that learning via EM produces generative models with higher likelihood and higher sample quality than maximizing the evidence lower bound (ELBO). ********************* Post rebuttal I thank the authors for their rebuttal. Please include the discussions that were promised in the rebuttal. I will be staying with my score of 8. ********************* *********************

Strengths: 1. This paper is the first to compute the likelihood and posterior distributions involved in the probabilistic model of a variational autoencoder, showing that they are mixtures of truncated Gaussians for deep models that use affine piecewise nonlinearities. Being able to compute these terms enables applications like model comparison 2. This paper is the first to analytically perform EM with variational autoencoders. This is desirable because EM is guaranteed to converge to a local optimum and usually requires fewer iterations than gradient-based optimization. 3. The paper's derivations enables for the first time a direct comparison of the true likelihood to the ELBO 4. The analytic form of the posterior derived in the paper strongly motivates using multimodal posterior approximations 5. The paper shows that despite using a 1 dimensional latent space with a 2 hidden layer neural network, training VAEs with EM on MNIST results in higher likelihood models with more diverse samples than maximizing the ELBO. Additionally EM-learning is shown to converge to a better optimum in fewer updates than variational inference for MNIST and 2 other toy datasets.

Weaknesses: # Weaknesses 1. The proposed algorithm to compute the terms required for EM is computationally intensive and scales poorly with the latent space dimensionality and model size. This limits the empirical evaluations to only small neural networks and 1 dimensional latent spaces. 2. The analytical computations are only possible when working with Gaussian priors and likelihoods for the VAE. It also may not be possible to analytically compute the terms derived in the paper when the nonlinearities are not affine. 3. The derivation requires assuming that the likelihood of the data conditioned on latent has a constant covariance with respect to the latent. However, this is not a huge problem because in practice, the covariance is often made constant for optimization stability.

Correctness: The claims and methods are correct.

Clarity: The paper is very well-written and clear. The authors did well to present their work in a mostly self contained way.

Relation to Prior Work: The authors did a very good job discussing prior work, including how their methods differ and relate to previous contributions.

Reproducibility: Yes

Additional Feedback: Line 38: "descend" -> "descent" Line 54: "akin" -> "akin to" Line 89: "circulent" -> "circulant" Line 89: ", and, " -> "and, " Line 100: "D_\ell" -> "D^\ell" Line 124: "Maximization" -> "Maximizing" Line 168: "p(t=r)" -> "Pr(t=r)" Line 201: "hte" -> "the" Line 207: "q^\ell" -> "q^\ell (z)" Line 246: "b ased" -> "based" Section 4.3: May want to explicitly mention that you also performed experiment on MNIST, but left it in the appendix q^{all} is sometimes used instead of q in the appendix Line 476: "speicific" -> "specific" Appendix E: Although you describe the algorithm in previous sections, you may also want to explain what "reduce", "SearchRegion", "flip" mean or reference the section where they are explained. Line 596: "we an" -> "with an" Line 599: "read" -> "real"

Review 3

Summary and Contributions: The authors propose an expectation-maximisation (EM) approach to learn deep latent variable models (DLVMs) with decoders using ReLU nonlinearities. To this end, they leverage the fact that ReLU networks are piecewise affine, leading to the DLVM being effectively a Gaussian mixture model akin to mixtures of probabilistic PCA (PPCA).

Strengths: The approach is quite fresh, and the paper is certainly thought-provoking, by taking an inference approach somewhat orthogonal to the mainstream one. The authors do not try to hide that the EM approach is still quite impractical in realistic settings.

Weaknesses: - As acknowledged by the authors, there are important computational limitations that prevent the EM approach from being use to train really powerful DLVMs. - Similar insight was present in a paper presented at last year's ICML, which limits the novelty and "thought-provokingness" of the paper (see "prior work" box). - I think that the experiments are quite limited, in particular VAEs with more expressive posteriors could be compared to as well (e.g. IWAE, normalising flow posterior).

Correctness: All theoretical claims look sound.

Clarity: The paper is extremely well written, and should be enjoyed by both neophytes and VAE aficionados.

Relation to Prior Work: Very similar insights were present in the following paper: Variational Laplace Autoencoders, Park et al., ICML 2019 In particular, Park et al. (2019) also leverage the link between ReLU-based decoders and PPCA. This makes some of the novelty claims of lines 52-53 wrong. A detailed discussion on the links between these two papers is really needed. Less importantly, a few additional papers on links between mixture models and deep latent variable models and traditional mixture models: Leveraging the Exact Likelihood of Deep Latent Variable Models, Mattei and Frellsen, NeurIPS 2019 Taming VAEs, Rezende and Viola, arXiv:1810.00597 Finally, the idea of using EM to train deep latent variable models was already present in Bishop et al. (1998), although in a very different form: The Generative Topographic Mapping, Bishop et al., Neural Computation (1998)

Reproducibility: Yes

Additional Feedback: Isn't Lemma 2 true even without the ReLU assumption? Your proof technique does not work in that case, but it seems to be true under probably weaker assumptions. ----- POST-REBUTTAL EDIT ----- I have read other reviews and the authors's rebuttal. I appreciate that the authors discussed the links with the ICML paper on Laplacian VAEs that I mentioned. I strongly encourage them to write a detailed discussion on this (possibly in the appendix) to the final version. Assuming that they will add such a discussion, I am willing to raise my score to 7. I also suggest that the authors add short discussions on the other papers I mentioned, in particular Bishop (1998).

Review 4

Summary and Contributions: The paper introduces a method for exactly computing the marginal, posterior and conditional distributions of a Deep Generative Network that uses Continuous Piece Affine (not that strong of an constraint). They also derive an Expectation Maximization algorithm, allowing gradient free training and which doesn't rely on a variational bound. This allows them to highlight the shortcomings of VAE training for DGNs . They use this algorithm to demonstrate some of the instability of the VAE training for a DGN is because of the variational approximation isn't sufficiently well capturing the posterior.

Strengths: Providing ground truth checking of methods is often a vital step in understanding what price you pay for using different approximations and approximate inference schemes. This makes it a relevant and potentially significant result for the community that will potentially inspire future work and improvements. These improvements could both be to to their method in turns, probably trading off computation speed in exchange for accuracy, and potential improvements to the variational inference from better understand its short-comings.

Weaknesses: The principal weakness is that the method is very impractical for applying for higher dimensional problems as the scaling will be very poor. This is not explored nor is the scaling discussed in any detail apart from a very briefly qualitative mention in the appendix.

Correctness: The paper appears methodologically sound and correct. The experiments are very limited and synthetic, but they do clearly demonstrate their method working and the short-comings of AVI when applied to the same problems. Figure 4 clearly shows that EM training ends up with a superior NLL to the VAE.

Clarity: Paper is clear and well written. It clearly goes explains each concept in turn. The figures are useful for understanding the different parts of the model. Minor corrections Line 201 - typo

Relation to Prior Work: It appears well cited and how it differs from previous work is clearly explained.

Reproducibility: Yes

Additional Feedback: While computational complexity is not the main aim of the paper, it would be informative to provide the number of regions found in the posterior for the synthetic examples used in the paper, and the computational time required for their method and the VAE approximation. -- Post rebuttal: Thank you to the authors for adding citations for to address the scalability of the method with number of layer/topologies and adding computation times for each step to the appendix. Additionally, approximate estimates of how quickly the number of regions grows in a "real" network would help in evaluating the network.