NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona

### Reviewer 1

#### Summary

The paper proposes to replace standard Monte Carlo methods for ABC with a method based on Bayesian density estimation. Although density estimation has been used before within the context of ABC, this approach allows for the direct replacement of sample-based approximations of the posterior with an analytic approximation. The novelty is similar in scope to that of ABC with variational inference [1], but the approach discussed here is quite different.

[1] Tran, Nott and Kohn, 2015, "Variational Bayes with Intractable Likelihood"

#### Qualitative Assessment

The paper is very well written and easy to understand. The level of technical writing is sufficient for the expert, while eliding unnecessary details. While the paper builds heavily upon previous work, the key idea of Proposition 1 is used elegantly throughout, both to choose the proposal prior and to estimate the posterior approximation. In addition, there are other moderately novel but useful contributions sprinkled throughout the paper, such as the extension of MDN to SVI. However, the authors should also discuss other work on the use of SVI with ABC, such as [1].

The paper lacks a firm theoretical underpinning, apart from the asymptotic motivation that Proposition 1 provides for the proposed algorithm. However, I believe that this is more than sufficient for this type of paper, and I do not count it as a negative, especially given the NIPS format [I doubt that an explanation of the model, the experiments, as well as heavy theory could fit in the eight pages provided].

The experimental results are a good mix of simple examples and larger datasets, and are clearly presented. I also like how the authors disentangle the effect of selecting the proposal distribution from the posterior estimation. The plots try to take effective sample size into account, but I am not sure that this is the best metric; after all, samples are purely computational beasts in this setting. Wouldn't it make more sense to measure actual CPU time?

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 2

#### Summary

In this paper the authors present an alternative approach to Approximate Bayesian Computation for (of course) models with intractable likelihood, but from which mock datasets may be readily simulated under arbitrary parameters (within the prior support, etc.). The approach presented makes use of a flexible parametric modelling tool—the Mixture Density Network—to approximate the Bayesian conditional density on the parameter space given the (mock) data; in this way the authors bring about a potentially powerful synthesis of ideas from machine learning and statistical theory.

#### Qualitative Assessment

I believe this may be an outstanding paper, as the approach suggested is well motivated and clearly explained; and my impression from the numerous worked examples is that it will very likely have an impact on the application of likelihood-free methods, especially (but not exclusively) for problems in which the mock data simulations are costly, so that efficiency of the sampler or posterior approximation scheme is at a premium (e.g. weather simulations, cosmological simulations, individual-based simulation models for epidemiology). It is worth noting here the parallel development within the statistics community of random forest methods for epsilon-free ABC inference targeting models for the conditional density (Marin et al., arXiv:1605.05537), which highlights the enthusiasm for innovations in this direction.

I have a concern with the authors' proof of Proposition 1, in that the term 'sufficiently flexible' is not explicitly described but should be, in which case sufficient conditions on the posterior for use of the MDN model could be easily identified. Naturally these will be rather restrictive, so interest turns to understanding and identifying circumstances where the approximation may be considered adequate or otherwise, and empirical metrics by which the user might be guided in this decision.

Minor notes:
- The comparison to existing work in Section 4 is well done (e.g. identification of regression adjustment as a development in a similar direction); perhaps, though, it is worth noting that the 'earliest ABC work' of Diggle & Gratton (1984) was to develop a kernel-based estimator of the likelihood.
- In the introduction it is mentioned that "it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses"; a number of recent studies in scalable Bayesian methods have been directed towards this problem (e.g. Zheng, Kim, Ho & Xing 2014; Scott et al. 2013; Minsker et al. 2014).

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 3

#### Summary

The authors propose to approximate the posterior of intractable models using a density estimator based on a neural network. The main advantage, relative to ABC methods, is that it is not necessary to choose a tolerance. The innovative part is that they model the posterior directly, while a more common approach is to approximate/estimate the intractable likelihood. Hence, Proposition 1 is the main result of the paper, in my opinion. Starting from Proposition 1, several conditional density estimators could be used, and the authors use a Mixture Density Network. They then describe how the proposal prior and the posterior density are estimated, using Algorithms 1 and 2, respectively. They illustrate the method with several simple examples, two of which have intractable likelihoods.
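For concreteness, the conditional density estimator discussed here outputs mixture parameters from which a log-density can be read off. Below is a minimal, hypothetical sketch of the Gaussian-mixture output head of such an estimator (the function name and signature are illustrative, not the authors' code; in the actual method the logits, means and sigmas would be produced by the network conditioned on x):

```python
import numpy as np

def mdn_log_density(theta, logits, means, sigmas):
    """Log-density of a 1-D Gaussian mixture, as produced by the output
    head of a mixture density network for a given conditioning input x.
    logits: unnormalised mixture weights; means, sigmas: per-component."""
    log_w = logits - np.logaddexp.reduce(logits)           # normalised log-weights
    log_comp = (-0.5 * ((theta - means) / sigmas) ** 2
                - np.log(sigmas) - 0.5 * np.log(2 * np.pi))
    return np.logaddexp.reduce(log_w + log_comp)           # log sum_k w_k N(theta; m_k, s_k)
```

With a single standard-normal component, `mdn_log_density(0.0, np.zeros(1), np.zeros(1), np.ones(1))` recovers `-0.5 * log(2 * pi)`, the standard normal log-density at zero.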

#### Qualitative Assessment

The most original part of the paper is Proposition 1, which is quite interesting. However, I have some doubts regarding the assumptions leading to formula (2). As explained in the appendix, this formula holds if q_phi is flexible enough that the KL divergence is zero. Now, in a realistic example with finite sample size, q_phi can't be very complex, otherwise it would over-fit. Hence, (2) holds only approximately.

The examples are a bit disappointing. In particular, tolerance-based ABC methods suffer in high dimensions, hence I would have expected to see at least one relatively high-dimensional example (say 20d). It is not clear to me that the MDN estimator would scale well as the number of model parameters or of summary statistics increases. The practical utility of the method depends quite a lot on how it scales, and at the moment this is not evident. My understanding is that the complexity of the MDN fit depends on the hyper-parameter lambda and on the number of components. The number of components was chosen manually, but the value of lambda is never reported. How was this chosen?

I have some further comments, section by section:

- Sec 2.3
  1. Is a proper prior required?
  2. In Algorithm 1, how is convergence assessed? The algorithm seems to be stochastic.
- Sec 2.4
  1. The authors say: "If we take p̃(θ) to be the actual prior, then q_φ(θ | x) will learn the posterior for all x". Is this really true? Depending on the prior, the model might learn the posterior for values of x very different from x_0, but probably not "for all x". Maybe it is also worth pointing out that q_φ(θ | x) needs to be modelled close to x_0 because the posterior is modelled non-parametrically. If, for instance, a linear regression model were used, the variance of the estimator would be much reduced by choosing points x very far from x_0.
  2. Why do the authors use one Gaussian component for the proposal prior and several for the posterior? Is sampling from an MDN with multiple components expensive? If the same number of components were used, it might be possible to unify Algorithms 1 and 2: that is, repeat Algorithm 2 several times, using the estimated posterior at the i-th iteration as the proposal prior for the next one.
  3. It is not clear to me how the MDN is initialized at each iteration in Algorithm 1. The authors say that initializing from the previous iteration allows them to keep N small. Hence, I think that by "initializing" they don't simply mean giving a good starting point to the optimizer, but something related to recycling all the simulations obtained so far. Either way, at the moment it is not quite clear what happens.
- Sec 2.5
  1. It is not clear to me why MDN-SVI avoids overfitting. Whether it overfits or not probably depends on the hyperparameter lambda. How is this chosen at the moment? I guess not by cross-validation, given that the authors say that no validation set is needed.
- Sec 3.1
  1. The differences between the densities in the left plot of Figure 1 are barely visible. Maybe plot log-densities?
  2. What value of lambda was used to obtain these results? This is not reported, and the same holds for the remaining examples.
- Sec 3.2
  1. Is formula (5) correct? x is a vector, but its mean and variance are scalars.
  2. In Figure 2: maybe it is worth explaining why, for ABC, the largest number of simulations does not correspond to the smallest KL divergence. I guess that this is because epsilon is too small and the rejection rate is high.
- Sec 3.3
  1. The authors say that "in all cases the MDNs chose to use only one component and switch the rest off, which is consistent with our observation about the near-Gaussianity of the posterior". Does this happen for any value of lambda?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

This paper proposes a method for parameter inference. The paper sets up the problem where we have a set of observed variables, x, and a set of underlying parameters theta. We assume that we can sample from p(x|theta) but that we don't have an explicit form for it. The goal is to recover the parameter posterior p(theta|x). We assume we have a prior distribution p(theta) over the parameters theta.

The paper explains that most of the usual methods to solve this kind of problem replace p(x=x0|theta) by p(||x-x0|| < epsilon | theta) and use a sampling method, such as MCMC. However, these only approximate the true distribution as epsilon goes to 0, while at the same time the computational cost grows to infinity.

The proposed method is to directly train a neural network to learn p(theta|x) (renormalized by a known ratio of pt(theta) over p(theta), explained later). The network produces the parameters of a mixture of Gaussians. The training points are drawn by the following procedure: choose a distribution pt(theta) to sample from; sample a batch of N points from pt(theta); run them through the simulator to get the corresponding points x; train the network to predict p(theta|x) from the input x. The selection of pt is important for convergence speed, and a method is proposed: start with the prior p(theta), and as the neural network is trained, use the current model to refine pt.

Results are shown on multiple datasets; the method seems to work well and to converge better than MCMC and simple rejection methods.
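The training-data generation loop described above can be sketched in a few lines. The simulator below is a stand-in Gaussian toy (all names are hypothetical illustrations, not the authors' code), since the models of interest have intractable likelihoods:

```python
import numpy as np

def simulate(theta, rng):
    # stand-in for an intractable simulator: returns mock data given theta
    return theta + rng.normal(scale=0.5)

def make_training_pairs(sample_proposal, n, rng):
    """Draw theta ~ pt(theta), run the simulator on each draw, and return
    the (theta, x) pairs on which the network q_phi(theta | x) is trained."""
    thetas = np.array([sample_proposal(rng) for _ in range(n)])
    xs = np.array([simulate(t, rng) for t in thetas])
    return thetas, xs

rng = np.random.default_rng(0)
# proposal pt(theta): here simply uniform on [-1, 1]
thetas, xs = make_training_pairs(lambda r: r.uniform(-1.0, 1.0), 200, rng)
```

In the full method this loop is wrapped in the proposal-refinement scheme: pt starts at the prior and is replaced by the current model's approximation as training proceeds.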

#### Qualitative Assessment

The paper is clear and the method looks sound. Several related works are presented towards the end of the paper (why not at the beginning, as in most papers?). The differences between the current method and these are explained, but no direct comparisons are shown with most of the related methods. It would be nice to include these on at least one problem.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 5

#### Summary

The paper is on likelihood-free inference, that is, on parametric inference for models where the likelihood function is too expensive to evaluate. It is proposed to obtain an approximation of the posterior distribution of the parameters by approximating the conditional distribution of parameters given data with a Gaussian mixture model (a mixture density network). The authors see the main advantages over standard approximate Bayesian computation (ABC) in that:

- their approach returns a "parametric approximation to the exact posterior", as opposed to returning samples from an approximate posterior (line 49),
- their approach is computationally more efficient (line 55).

The paper contains a short theoretical part where the approach is shown to yield the correct posterior in the limit of infinitely many simulations, if the mixture model can represent any density. The approach is verified on two toy models where the true posterior is known and on two more demanding models with intractable likelihoods.
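The "parametric approximation to the exact posterior" is obtained by reweighting the learned density by the prior-to-proposal ratio (the ratio Reviewer 4 also mentions). When all three factors are Gaussian, this correction has a closed form; a minimal sketch under that assumption (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def reweight_gaussian(m_q, s_q, m_prior, s_prior, m_prop, s_prop):
    """Mean and std of q(theta) * prior(theta) / proposal(theta) when all
    three factors are 1-D Gaussians. Assumes the resulting precision stays
    positive, i.e. the proposal is broad relative to the learned density."""
    prec = 1.0 / s_q**2 + 1.0 / s_prior**2 - 1.0 / s_prop**2
    if prec <= 0:
        raise ValueError("proposal too narrow: correction is not a proper density")
    mean = (m_q / s_q**2 + m_prior / s_prior**2 - m_prop / s_prop**2) / prec
    return mean, 1.0 / np.sqrt(prec)
```

When the proposal equals the prior, the ratio cancels and the learned density is returned unchanged, which matches the intuition that no correction is needed if training points were drawn from the prior itself.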