{"title": "Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2080, "page_last": 2088, "abstract": "We describe an embarrassingly parallel, anytime Monte Carlo method for likelihood-free models.  The algorithm starts with the view that the stochasticity of the pseudo-samples generated by the simulator can be controlled externally by a vector of random numbers u, in such a way that the outcome, knowing u, is deterministic.  For each instantiation of u we run an optimization procedure to minimize the distance between summary statistics of the simulator and the data. After reweighing these samples using the prior and the Jacobian (accounting for the change of volume in transforming from the space of summary statistics to the space of parameters) we show that this weighted ensemble represents a Monte Carlo estimate of the posterior distribution. The procedure can be run embarrassingly parallel (each node handling one sample) and anytime (by allocating resources to the worst performing sample). The procedure is validated on six experiments.", "full_text": "Optimization Monte Carlo: Ef\ufb01cient and\n\nEmbarrassingly Parallel Likelihood-Free Inference\n\nEdward Meeds\n\nInformatics Institute\n\nUniversity of Amsterdam\ntmeeds@gmail.com\n\nMax Welling\u2217\n\nwelling.max@gmail.com\n\nInformatics Institute\n\nUniversity of Amsterdam\n\nAbstract\n\nWe describe an embarrassingly parallel, anytime Monte Carlo method for\nlikelihood-free models. The algorithm starts with the view that the stochastic-\nity of the pseudo-samples generated by the simulator can be controlled externally\nby a vector of random numbers u, in such a way that the outcome, knowing u,\nis deterministic. 
For each instantiation of u we run an optimization procedure to\nminimize the distance between summary statistics of the simulator and the data.\nAfter reweighing these samples using the prior and the Jacobian (accounting for\nthe change of volume in transforming from the space of summary statistics to the\nspace of parameters) we show that this weighted ensemble represents a Monte\nCarlo estimate of the posterior distribution. The procedure can be run\nembarrassingly parallel (each node handling one sample) and anytime (by allocating\nresources to the worst performing sample). The procedure is validated on six\nexperiments.\n\n1 Introduction\nComputationally demanding simulators are used across the full spectrum of scientific and industrial\napplications, whether one studies embryonic morphogenesis in biology, tumor growth in cancer\nresearch, colliding galaxies in astronomy, weather forecasting in meteorology, climate change in\nenvironmental science, earthquakes in seismology, market movement in economics, turbulence\nin physics, brain functioning in neuroscience, or fabrication processes in industry. Approximate\nBayesian computation (ABC) forms a large class of algorithms that aim to sample from the posterior\ndistribution over parameters for these likelihood-free (a.k.a. simulator-based) models. Likelihood-free\ninference, however, is notoriously inefficient in terms of the number of simulation calls per\nindependent sample. Further, as with regular Bayesian inference algorithms, care must be taken so that\nposterior sampling targets the correct distribution.\nThe simplest ABC algorithm, ABC rejection sampling, can be fully parallelized by running independent\nprocesses with no communication or synchronization requirements; that is, it is an embarrassingly\nparallel algorithm. Unfortunately, because it is also the most inefficient ABC algorithm, the benefits of this\ntitle are limited. 
There has been considerable progress in distributed MCMC algorithms aimed at\nlarge-scale data problems [2, 1]. Recently, a sequential Monte Carlo (SMC) algorithm called \u201cthe\nparticle cascade\u201d was introduced that emits streams of samples asynchronously with minimal mem-\nory management and communication [17]. In this paper we present an alternative embarrassingly\nparallel sampling approach: each processor works independently, at full capacity, and will inde\ufb01-\nnitely emit independent samples. The main trick is to pull random number generation outside of the\nsimulator and treat the simulator as a deterministic piece of code. We then minimize the difference\n\u2217Donald Bren School of Information and Computer Sciences University of California, Irvine, and Canadian\n\nInstitute for Advanced Research.\n\n1\n\n\fbetween observations and the simulator output over its input parameters and weight the \ufb01nal (opti-\nmized) parameter value with the prior and the (inverse of the) Jacobian. We show that the resulting\nweighted ensemble represents a Monte Carlo estimate of the posterior. Moreover, we argue that the\nerror of this procedure is O(\u0001) if the optimization gets \u0001-close to the optimal value. This \u201cOpti-\nmization Monte Carlo\u201d (OMC) has several advantages: 1) it can be run embarrassingly parallel, 2)\nthe procedure generates independent samples and 3) the core procedure is now optimization rather\nthan MCMC. Indeed, optimization as part of a likelihood-free inference procedure has recently been\nproposed [12]; using a probabilistic model of the mapping from parameters to differences between\nobservations and simulator outputs, they apply \u201cBayesian Optimization\u201d (e.g. [13, 21]) to ef\ufb01ciently\nperform posterior inference. Note also that since random numbers have been separated out from the\nsimulator, powerful tools such as \u201cautomatic differentiation\u201d (e.g. 
[14]) are within reach to assist\nwith the optimization. In practice we find that OMC uses far fewer simulations per sample than\nalternative ABC algorithms.\nThe approach of controlling randomness as part of an inference procedure is also found in a related\nclass of parameter estimation algorithms called indirect inference [11]. Connections between ABC\nand indirect inference have been made previously by [7] as a novel way of creating summary\nstatistics. An indirect inference perspective led to an independently developed version of OMC called the\n\u201creverse sampler\u201d [9, 10].\nIn Section 2 we briefly introduce ABC and present it from a novel viewpoint in terms of random\nnumbers. In Section 3 we derive ABC through optimization from a geometric point of view, then\nproceed to generalize it to higher dimensions. We show in Section 4 extensive evidence of the\ncorrectness and efficiency of our approach. In Section 5 we describe the outlook for\noptimization-based ABC.\n\n2 ABC Sampling Algorithms\nThe primary interest in ABC is the posterior of simulator parameters \u03b8 given a vector of (statistics\nof) observations y, p(\u03b8|y). The likelihood p(y|\u03b8) is generally not available in ABC. Instead we can\nuse the simulator as a generator of pseudo-samples x that reside in the same space as y. By treating\nx as auxiliary variables, we can continue with the Bayesian treatment:\n\n$$p(\\theta|y) = \\frac{p(\\theta)\\,p(y|\\theta)}{p(y)} \\approx \\frac{p(\\theta)\\int p_\\epsilon(y|x)\\,p(x|\\theta)\\,dx}{\\int p(\\theta)\\int p_\\epsilon(y|x)\\,p(x|\\theta)\\,dx\\,d\\theta} \\tag{1}$$\n\nOf particular importance is the choice of kernel measuring the discrepancy between observations y\nand pseudo-data x. 
Popular choices for kernels are the Gaussian kernel and the uniform \u03b5-tube/ball.\nThe bandwidth parameter \u03b5 (which may be a vector accounting for the relative importance of each\nstatistic) plays a critical role: small \u03b5 produces more accurate posteriors, but is more computationally\ndemanding, whereas large \u03b5 induces larger error but is cheaper.\nWe focus our attention on population-based ABC samplers, which include rejection sampling,\nimportance sampling (IS), sequential Monte Carlo (SMC) [6, 20] and population Monte Carlo [3]. In\nrejection sampling, we draw parameters from the prior \u03b8 \u223c p(\u03b8), then run a simulation at those\nparameters x \u223c p(x|\u03b8); if the discrepancy \u03c1(x, y) < \u03b5, then the particle is accepted, otherwise it\nis rejected. This is repeated until n particles are accepted. Importance sampling generalizes\nrejection sampling using a proposal distribution q_\u03c6(\u03b8) instead of the prior, and produces samples with\nweights w_i \u221d p(\u03b8_i)/q_\u03c6(\u03b8_i). SMC extends IS to multiple rounds with decreasing \u03b5, adapting its\nparticles after each round, such that each new population improves the approximation to the posterior.\nOur algorithm has similar qualities to SMC since we generate a population of n weighted particles,\nbut differs significantly since our particles are produced by independent optimization procedures,\nmaking it completely parallel.\n\n3 A Parallel and Efficient ABC Sampling Algorithm\nInherent in our assumptions about the simulator is that internally there are calls to a random number\ngenerator which produces the stochasticity of the pseudo-samples. We will assume for the moment\nthat this can be represented by a vector of uniform random numbers u which, if known, would\nmake the simulator deterministic. 
More concretely, we assume that any simulation output x can\nbe represented as a deterministic function of parameters \u03b8 and a vector of random numbers u,\n\n2\n\n\f(a) D\u03b8 = Dy\n\n(b) D\u03b8 < Dy\n\nFigure 1: Illustration of OMC geometry. (a) Dashed lines indicate contours f (\u03b8, u) over \u03b8 for several u. For\nthree values of u, their initial and optimal \u03b8 positions are shown (solid blue/white circles). Within the grey\nacceptance region, the Jacobian, indicated by the blue diagonal line, describes the relative change in volume\ninduced in f (\u03b8, u) from a small change in \u03b8. Corresponding weights \u221d 1/|J| are shown as vertical stems. (b)\nWhen D\u03b8 < Dy, here 1 < 2, the change in volume is proportional to the length of the line segment inside the\nellipsoid (|JT J|1/2). The orange line indicates the projection of the observation onto the contour of f (\u03b8, u) (in\nthis case, identical to the optimal).\n\ni.e. x = f (\u03b8, u). This assumption has been used previously in ABC, \ufb01rst in \u201ccoupled ABC\u201d\n[16] and also in an application of Hamiltonian dynamics to ABC [15]. We do not make any further\nassumptions regarding u or p(u), though for some problems their dimension and distribution may be\nknown a priori. 
In these cases it may be worth employing Sobol or other low-discrepancy sequences\nto further improve the accuracy of any Monte Carlo estimates.\nWe will first derive a dual representation for the ABC likelihood function p_\u03b5(y|\u03b8) (see also [16]),\n\n$$p_\\epsilon(y|\\theta) = \\int p_\\epsilon(y|x)\\,p(x|\\theta)\\,dx = \\iint p_\\epsilon(y|x)\\,I[x = f(\\theta, u)]\\,p(u)\\,dx\\,du \\tag{2}$$\n\n$$\\phantom{p_\\epsilon(y|\\theta)} = \\int p_\\epsilon(y|f(\\theta, u))\\,p(u)\\,du \\tag{3}$$\n\nleading to the following Monte Carlo approximation of the ABC posterior,\n\n$$p_\\epsilon(\\theta|y) \\propto p(\\theta)\\int p(u)\\,p_\\epsilon(y|f(\\theta, u))\\,du \\approx \\frac{1}{n}\\sum_i p_\\epsilon(y|f(\\theta, u_i))\\,p(\\theta), \\qquad u_i \\sim p(u) \\tag{4}$$\n\nSince p_\u03b5 is a kernel that only accepts arguments y and f(\u03b8, u_i) that are \u03b5-close to each other (for\nvalues of \u03b5 that are as small as possible), Equation 4 tells us that we should first sample values for\nu from p(u) and then, for each such sample u_i, find the value \u03b8_i^o that results in y = f(\u03b8_i^o, u_i). In\npractice we want to drive these values as close to each other as possible through optimization and\naccept an O(\u03b5) error if the remaining distance is still O(\u03b5). Note that apart from sampling the values\nfor u this procedure is deterministic and can be executed completely in parallel, i.e. without any\ncommunication. In the following we will assume a single observation vector y, but the approach is\nequally applicable to a dataset of N cases.\n3.1 The case D\u03b8 = Dy\nWe will first study the case when the number of parameters \u03b8 is equal to the number of summary\nstatistics y. To understand the derivation it helps to look at Figure 1a, which illustrates the derivation\nfor the one-dimensional case. In the following we use the abbreviation f_i(\u03b8) for\nf(\u03b8, u_i). 
The general idea is that we want to write the approximation to the posterior as a mixture\nof small uniform balls (or delta peaks in the limit):\n\n$$p(\\theta|y) \\approx \\frac{1}{n}\\sum_i p_\\epsilon(y|f(\\theta, u_i))\\,p(\\theta) \\approx \\frac{1}{n}\\sum_i w_i\\,U_\\epsilon(\\theta|\\theta_i^*)\\,p(\\theta) \\tag{5}$$\n\nwith w_i some weights that we will derive shortly. Then, if we make \u03b5 small enough we can replace\nany average of a sufficiently smooth function h(\u03b8) w.r.t. this approximate posterior simply by\nevaluating h(\u03b8) at some arbitrarily chosen points inside these balls (for instance we can take the center\nof the ball \u03b8_i^*),\n\n$$\\int h(\\theta)\\,p(\\theta|y)\\,d\\theta \\approx \\frac{1}{n}\\sum_i h(\\theta_i^*)\\,w_i\\,p(\\theta_i^*) \\tag{6}$$\n\nTo derive this expression we first assume that:\n\n$$p_\\epsilon(y|f_i(\\theta)) = C(\\epsilon)\\,I[\\,\\|y - f_i(\\theta)\\|^2 \\le \\epsilon^2\\,] \\tag{7}$$\n\ni.e. a ball of radius \u03b5. C(\u03b5) is the normalizer, which is immaterial because it cancels in the posterior.\nFor small enough \u03b5 we claim that we can linearize f_i(\u03b8) around \u03b8_i^o:\n\n$$\\hat{f}_i(\\theta) = f_i(\\theta_i^o) + J_i^o(\\theta - \\theta_i^o) + R_i, \\qquad R_i = O(\\|\\theta - \\theta_i^o\\|^2) \\tag{8}$$\n\nwhere J_i^o is the Jacobian matrix with columns \u2202f_i(\u03b8_i^o)/\u2202\u03b8_d. We take \u03b8_i^o to be the end result of our\noptimization procedure for sample u_i. Using this we thus get,\n\n$$\\|y - f_i(\\theta)\\|^2 \\approx \\|(y - f_i(\\theta_i^o)) - J_i^o(\\theta - \\theta_i^o) - R_i\\|^2 \\tag{9}$$\n\nWe first note that since we assume that our optimization has ended up somewhere inside the ball\ndefined by ||y \u2212 f_i(\u03b8)||^2 \u2264 \u03b5^2 we can assume that ||y \u2212 f_i(\u03b8_i^o)|| = O(\u03b5). Also, since we only\nconsider values for \u03b8 that satisfy ||y \u2212 f_i(\u03b8)||^2 \u2264 \u03b5^2, and furthermore assume that the function\nf_i(\u03b8) is Lipschitz continuous in \u03b8, it follows that ||\u03b8 \u2212 \u03b8_i^o|| = O(\u03b5) as well. All of this implies\nthat we can safely ignore the remaining term R_i (which is of order O(||\u03b8 \u2212 \u03b8_i^o||^2) = O(\u03b5^2)) if we\nrestrict ourselves to the volume inside the ball.\nThe next step is to view the term I[||y \u2212 f_i(\u03b8)||^2 \u2264 \u03b5^2] as a distribution in \u03b8. With the Taylor\nexpansion this results in,\n\n$$I\\left[(\\theta - \\theta_i^o - J_i^{o,-1}(y - f_i(\\theta_i^o)))^T J_i^{oT} J_i^o\\,(\\theta - \\theta_i^o - J_i^{o,-1}(y - f_i(\\theta_i^o))) \\le \\epsilon^2\\right] \\tag{10}$$\n\nThis represents an ellipse in \u03b8-space with a centroid \u03b8_i^* and volume V_i given by\n\n$$\\theta_i^* = \\theta_i^o + J_i^{o,-1}(y - f_i(\\theta_i^o)), \\qquad V_i = \\frac{\\gamma}{\\sqrt{\\det(J_i^{oT} J_i^o)}} \\tag{11}$$\n\nwith \u03b3 a constant independent of i. We can approximate the posterior now as,\n\n$$p(\\theta|y) \\approx \\frac{1}{\\kappa}\\sum_i \\frac{U_\\epsilon(\\theta|\\theta_i^*)\\,p(\\theta)}{\\sqrt{\\det(J_i^{oT} J_i^o)}} \\approx \\frac{1}{\\kappa}\\sum_i \\frac{p(\\theta_i^*)}{\\sqrt{\\det(J_i^{oT} J_i^o)}}\\,\\delta(\\theta - \\theta_i^*) \\tag{12}$$\n\nwhere in the last step we have sent \u03b5 \u2192 0. Finally, we can compute the constant \u03ba through\nnormalization, \u03ba = \u03a3_i p(\u03b8_i^*) det(J_i^{oT} J_i^o)^{\u22121/2}. The whole procedure is accurate up to errors of the\norder O(\u03b5^2), and it is assumed that the optimization procedure delivers a solution that is located\nwithin the epsilon ball. If one of the optimizations for a certain sample u_i did not end up within\nthe epsilon ball there can be two reasons: 1) the optimization did not converge to the optimal value\nfor \u03b8, or 2) for this value of u there is no solution for which f(\u03b8, u) can get within a distance \u03b5\nfrom the observation y. If we interpret \u03b5 as our uncertainty in the observation y, and we assume that\nour optimization succeeded in finding the best possible value for \u03b8, then we should simply reject\nthis sample \u03b8_i. However, it is hard to detect if our optimization succeeded and we may therefore\nsometimes reject samples that should not have been rejected. Thus, one should be careful not to\ncreate a bias against samples u_i for which the optimization is difficult. This situation is similar to a\nsampler that will not mix to remote local optima in the posterior distribution.\n\n3.2 The case D\u03b8 < Dy\nThis is the overdetermined case and here the situation as depicted in Figure 1b is typical: the\nmanifold that f(\u03b8, u_i) traces out as we vary \u03b8 forms a lower-dimensional surface in the Dy-dimensional\nenveloping space. This manifold may or may not intersect with the sphere centered at the\nobservation y (or ellipsoid, for the general case of a vector \u03b5 instead of a scalar \u03b5). Assume that the manifold does intersect the\nepsilon ball but not y. Since we trust our observation up to distance \u03b5, we may simply choose to\npick the closest point \u03b8_i^* to y on the manifold, which is given by,\n\n$$\\theta_i^* = \\theta_i^o + J_i^{o\\dagger}(y - f_i(\\theta_i^o)), \\qquad J_i^{o\\dagger} = (J_i^{oT} J_i^o)^{-1} J_i^{oT} \\tag{13}$$\n\nwhere J_i^{o\u2020} is the pseudo-inverse. We can now define our ellipse around this point, shifting the\ncenter of the ball from y to f_i(\u03b8_i^*) (which do not coincide in this case). The uniform distribution\non the ellipse in \u03b8-space is now defined in the D\u03b8-dimensional manifold and has volume V_i =\n\u03b3 det(J_i^{oT} J_i^o)^{\u22121/2}. So once again we arrive at almost the same equation as before (Eq. 12) but with\nthe slightly different definition of the point \u03b8_i^* given by Eq. 13. Crucially, since ||y \u2212 f_i(\u03b8_i^*)|| \u2264 \u03b5^2\nand if we assume that our optimization succeeded, we will only make mistakes of order O(\u03b5^2).\n\n3.3 The case D\u03b8 > Dy\nThis is the underdetermined case in which it is typical that entire manifolds (e.g. hyperplanes) may\nbe a solution to ||y \u2212 f_i(\u03b8_i^*)|| = 0. In this case we cannot approximate the posterior with a\nmixture of point masses and thus the procedure does not apply. However, the case D\u03b8 > Dy is less\ninteresting than the other ones above as we expect to have more summary statistics than parameters\nfor most problems.\n\n4 Experiments\nThe goal of these experiments is to demonstrate 1) the correctness of OMC and 2) the relative\nefficiency of OMC in relation to two sequential MC algorithms, SMC (aka population MC [3]) and\nadaptive weighted SMC [5]. To demonstrate correctness, we show histograms of weighted samples\nalong with the true posterior (when known) and, for three experiments, the exact OMC weighted\nsamples (when the exact Jacobian and optimal \u03b8 are known). To demonstrate efficiency, we compute\nthe mean simulations per sample (SS), i.e. the number of simulations required to reach an \u03b5 threshold,\nand the effective sample size (ESS), defined as 1/(w^T w). Additionally, we may measure ESS/n, the\nfraction of effective samples in the population. 
ESS is a good way of detecting whether the posterior\nis dominated by a few particles and/or how many particles achieve discrepancy less than epsilon.\nThere are several algorithmic options for OMC. The most obvious is to spawn independent pro-\ncesses, draw u for each, and optimize until \u0001 is reached (or a max nbr of simulations run), then\ncompute Jacobians and particle weights. Variations could include keeping a sorted list of discrepan-\ncies and allocating computational resources to the worst particle. However, to compare OMC with\nSMC, in this paper we use a sequential version of OMC that mimics the epsilon rounds of SMC.\nEach simulator uses different optimization procedures, including Newton\u2019s method for smooth sim-\nulators, and random walk optimization for others; Jacobians were computed using one-sided \ufb01nite\ndifferences. To limit computational expense we placed a max of 1000 simulations per sample per\nround for all algorithms. Unless otherwise noted, we used n = 5000 and repeated runs 5 times; lack\nof error bars indicate very low deviations across runs. We also break some of the notational conven-\ntion used thus far so that we can specify exactly how the random numbers translate into pseudo-data\nand the pseudo-data into statistics. This is clari\ufb01ed for each example. Results are explained in\nFigures 2 to 4.\n4.1 Normal with Unknown Mean and Known Variance\n\n\u221a\n\nThe simplest example is the inference of the mean \u03b8 of a univariate normal distribution with known\nvariance \u03c32. The prior distribution \u03c0(\u03b8) is normal with mean \u03b80 and variance k\u03c32, where k > 0 is\na factor relating the dispersions of \u03b8 and the data yn. The simulator can generate data according to\nthe normal distribution, or deterministically if the random effects rum are known:\n\n=\u21d2 xm = \u03b8 + rum\n\n(cid:80)\nm xm. 
Therefore we have f (\u03b8, u) = \u03b8 + R(u) where R(u) =(cid:80) rum /M\n\n\u03c0(xm|\u03b8) = N (xm|\u03b8, \u03c32)\n(14)\n\u22121(2um\u2212 1) (using the inverse CDF). A suf\ufb01cient statistic for this problem is\nwhere rum = \u03c3\n2 erf\nthe average s(x) = 1\nM\nIn our experiment we set M = 2 and y = 0. The exact\n(the average of the random effects).\nJacobian and \u03b8o\ni can be computed for this problem: for a draw ui, Ji = 1; if s(y) is the mean of the\ni = s(y) \u2212 R(ui). Therefore the exact\nobservations y, then by setting f (\u03b8o\nweights are wi \u221d \u03c0(\u03b8o\ni ), allowing us to compare directly with an exact posterior based on our dual\nrepresentation by u (shown by orange circles in Figure 2 top-left). We used Newton\u2019s method to\noptimize each particle.\n\ni , ui) = s(y) we \ufb01nd \u03b8o\n\n5\n\n\fFigure 2: Left: Inference of unknown mean. For \u0001 0.1, OMC uses 3.7 SS; AW/SMC uses 20/20 SS; at \u0001 0.01,\nOMC uses 4 SS (only 0.3 SS more), and SMC jumps to 110 SS. For all algorithms and \u0001 values, ESS/n=1.\nRight: Inference for mixture of normals. Similar results for OMC; at \u0001 0.025 AW/SMC had 40/50 SS and at\n\u0001 0.01 has 105/120 SS. The ESS/n remained at 1 for OMC, but decreased to 0.06/0.47 (AW/SMC) at \u0001 0.025,\nand 0.35 for both at \u0001 0.01. Not only does the ESS remain high for OMC, but it also represents the tails of the\ndistribution well, even at low \u0001.\n\n4.2 Normal Mixture\nA standard illustrative ABC problem is the inference of the mean \u03b8 of a mixture of two normals\n[19, 3, 5]: p(x|\u03b8) = \u03c1 N (\u03b8, \u03c32\n2), with \u03c0(\u03b8) = U(\u03b8a, \u03b8b) where hyperparameters\n2 = 1/100, \u03b8a = \u221210, \u03b8b = 10, and a single observation scalar y = 0.\nare \u03c1 = 1/2, \u03c32\nFor this problem M = 1 so we drop the subscript m. The true posterior is simply p(\u03b8|y = 0) \u221d\n\u03c1 N (\u03b8, \u03c32\n2), \u03b8 \u2208 {\u221210, 10}. 
In this problem there are two random numbers u1\nand u2, one for selecting the mixture component and the other for the random innovation; further,\nthe statistic is the identity, i.e. s(x) = x:\n\n1) + (1 \u2212 \u03c1) N (\u03b8, \u03c32\n\n1) + (1\u2212 \u03c1) N (\u03b8, \u03c32\n\n1 = 1, \u03c32\n\n2 erf(2u2 \u2212 1)) + [u1 \u2265 \u03c1](\u03b8 + \u03c32\n\n2 erf(2u2 \u2212 1))\n\n\u221a\n\n\u221a\n\nx = [u1 < \u03c1](\u03b8 + \u03c31\n\n\u221a\n\n\u221a\n\n= \u03b8 +\n\n\u03c3[u1\u2265\u03c1]\n2\nwhere R(u) =\n. As with the previous example, the Jacobian is 1\ni = y \u2212 R(ui) is known exactly. This problem is notable for causing performance issues in\nand \u03b8o\nABC-MCMC [19] and its dif\ufb01culty in targeting the tails of the posterior [3]; this is not the case for\nOMC.\n\n2 erf(2u2 \u2212 1)\u03c3[u1<\u03c1]\n\u03c3[u1\u2265\u03c1]\n\n2 erf(2u2 \u2212 1)\u03c3[u1<\u03c1]\n\n= \u03b8 + R(u)\n\n1\n\n1\n\n2\n\n(15)\n(16)\n\n4.3 Exponential with Unknown Rate\nIn this example, the goal is to infer the rate \u03b8 of an exponential distribution, with a gamma prior\np(\u03b8) = Gamma(\u03b8|\u03b1, \u03b2), based on M draws from Exp(\u03b8):\n\np(xm|\u03b8) = Exp(xm|\u03b8) = \u03b8 exp(\u2212\u03b8xm) =\u21d2 xm = \u2212 1\n\u03b8\n\nproblem is the average s(x) = (cid:80)\n\n(17)\nwhere rum = \u2212 ln(1 \u2212 um) (the inverse CDF of the exponential). A suf\ufb01cient statistic for this\nm xm/M. Again, we have exact expressions for the Jacobian\ni = R(ui)/s(y). 
We used M = 2,\n\ni , using f (\u03b8, ui) = R(ui)/\u03b8, Ji = \u2212R(ui)/\u03b82 and \u03b8o\n\nln(1 \u2212 um) =\n\nrum\n\n1\n\u03b8\n\nand \u03b8o\ns(y) = 10 in our experiments.\n\n4.4 Linked Mean and Variance of Normal\nIn this example we link together the mean and variance of the data generating function as follows:\n(18)\n\np(xm|\u03b8) = N (xm|\u03b8, \u03b82)\n\n=\u21d2 xm = \u03b8 + \u03b8\n\n\u22121(2um \u2212 1) = \u03b8rum\n\n2 erf\n\n\u221a\n\n6\n\n\fFigure 3: Left: Inference of rate of exponential. A similar result wrt SS occurs for this experiment: at \u0001 1,\nOMC had 15 v 45/50 for AW/SMC; at \u0001 0.01, SS was 28 OMC v 220 AW/SMC. ESS/n dropping with below\n1: OMC drops at \u0001 1 to 0.71 v 0.97 for SMC; at \u0001 0.1 ESS/n remains the same. Right: Inference of linked\nnormal. ESS/n drops signi\ufb01cantly for OMC: at \u0001 0.25 to 0.32 and at \u0001 0.1 to 0.13, while it remains high\nfor SMC (0.91 to 0.83). This is the result the inability of every ui to achieve \u03c1 < \u0001, whereas for SMC, the\nalgorithm allows them to \u201cdrop\u201d their random numbers and effectively switch to another. This was veri\ufb01ed\nby running an expensive \ufb01ne-grained optimization, resulting in 32.9% and 13.6% optimized particles having\n\u03c1 under \u0001 0.25/0.1. Taking this inef\ufb01ciency into account, OMC still requires 130 simulations per effective\nsample v 165 for SMC (ie 17/0.13 and 136/0.83).\n\n\u221a\n\n\u22121(2um \u2212 1). We put a positive constraint on \u03b8: p(\u03b8) = U(0, 10). We used\n\n2 erf\n\nwhere rum = 1 +\n2 statistics, the mean and variance of M draws from the simulator:\n=\u21d2 f1(\u03b8, u) = \u03b8R(u)\n=\u21d2 f2(\u03b8, u) = \u03b82V (u)\n\n(xm \u2212 s1(x))2\n\ns1(x) =\n\nxm\n\n1\nM\n1\nM\n\n(cid:88)\nwhere V (u) = (cid:80)\n\ns2(x) =\n\nm\n\n/M \u2212 R(u)2 and R(u) = (cid:80)\n\nm r2\num\n\n[R(u), 2\u03b8V (u)]T . 
In our experiments M = 10, s(y) = [2.7, 12.8].\n\n\u2202f1(\u03b8, u)\n\n\u2202\u03b8\n\n\u2202f2(\u03b8, u)\n\n\u2202\u03b8\n\n= R(u)\n\n= 2\u03b8V (u)\n\n(19)\n\n(20)\n\nm rum /M; the exact Jacobian is therefore\n\n4.5 Lotka-Volterra\nThe simplest Lotka-Volterra model explains predator-prey populations over time, controlled by a set\nof stochastic differential equations:\n\n= \u03b81x1 \u2212 \u03b82x1x2 + r1\n\ndx1\ndt\n\ndx2\ndt\n\n= \u2212\u03b82x2 \u2212 \u03b83x1x2 + r2\n\n(21)\nwhere x1 and x2 are the prey and predator population sizes, respectively. Gaussian noise r \u223c\nN (0, 102) is added at each full time-step. Lognormal priors are placed over \u03b8. The simulator\nruns for T = 50 time steps, with constant initial populations of 100 for both prey and predator.\nThere is therefore P = 2T outputs (prey and predator populations concatenated), which we use\nas the statistics. To run a deterministic simulation, we draw ui \u223c \u03c0(u) where the dimension of\nu is P . Half of the random variables are used for r1 and the other half for r2. In other words,\n\u22121(2ust \u2212 1), where s \u2208 {1, 2} for the appropriate population. The Jacobian is a\nrust = 10\n100\u00d73 matrix that can be computed using one-sided \ufb01nite-differences using 3 forward simulations.\n\n2 erf\n\n\u221a\n\n4.6 Bayesian Inference of the M/G/1 Queue Model\nBayesian inference of the M/G/1 queuing model is challenging, requiring ABC algorithms [4, 8] or\nsophisticated MCMC-based procedures [18]. Though simple to simulate, the output can be quite\n\n7\n\n\fFigure 4: Top: Lotka-Volterra. Bottom: M/G/1 Queue. The left plots shows the posterior mean \u00b11 std errors\nof the posterior predictive distribution (sorted for M/G/1). Simulations per sample and the posterior of \u03b81 are\nshown in the other plots. For L-V, at \u0001 3, the SS for OMC were 15 v 116/159 for AW/SMC, and increased\nat \u0001 2 to 23 v 279/371. 
However, the ESS/n was lower for OMC: at \u0001 3 it was 0.25 and down to 0.1 at \u0001 2,\nwhereas ESS/n stayed around 0.9 for AW/SMC. This is again due to the optimal discrepancy for some u being\ngreater than \u0001; however, the samples that remain are independent samples. For M/G/1, the results are similar,\nbut the ESS/n is lower than the number of discrepancies satisfying \u0001 1, 9% v 12%, indicating that the volume\nof the Jacobians is having a large effect on the variance of the weights. Future work will explore this further.\n\nnoisy. In the M/G/1 queuing model, a single server processes arriving customers, which are then\nserved within a random time. Customer m arrives at time wm \u223c Exp(\u03b83) after customer m\u2212 1, and\nis served in sm \u223c U(\u03b81, \u03b82) service time. Both wm and sm are unobserved; only the inter-departure\ntimes xm are observed. Following [18], we write the simulation algorithm in terms of arrival times\nvm. To simplify the updates, we keep track of the departure times dm. Initially, d0 = 0 and v0 = 0,\nfollowed by updates for m \u2265 1:\n\nxm = sm + max(0, vm \u2212 dm\u22121)\n\ndm = dm\u22121 + xm\n\nvm = vm\u22121 + wm\n\n(22)\nAfter trying several optimization procedures, we found the most reliable optimizer was simply a\nrandom walk. The random sources in the problem used for Wm (there are M) and for Um (there are\nM), therefore u is dimension 2M. Typical statistics for this problem are quantiles of x and/or the\nminimum and maximum values; in other words, the vector x is sorted then evenly spaced values for\nthe statistics functions f (we used 3 quantiles). The Jacobian is an M \u00d73 matrix. 
In our experiments\n\u03b8\u2217 = [1.0, 5.0, 0.2].\n\n5 Conclusion\nWe have presented Optimization Monte Carlo, a likelihood-free algorithm that, by controlling the\nsimulator randomness, transforms traditional ABC inference into a set of optimization procedures.\nBy using OMC, scientists can focus attention on finding a useful optimization procedure for their\nsimulator, and then use OMC in parallel to generate samples independently. We have shown that\nOMC can also be very efficient, though this will depend on the quality of the optimization\nprocedure applied to each problem. In our experiments, the simulators were cheap to run, allowing\nJacobian computations using finite differences. We note that for high-dimensional input spaces and\nexpensive simulators, this may be infeasible; solutions include randomized gradient estimates [22]\nor automatic differentiation (AD) libraries (e.g. [14]). Future work will include incorporating AD,\nimproving efficiency using Sobol numbers (when the size of u is known), incorporating Bayesian\noptimization, adding partial communication between processes, and inference for expensive simulators\nusing gradient-based optimization.\n\nAcknowledgments\n\nWe thank the anonymous reviewers for the many useful comments that improved this manuscript.\nMW acknowledges support from Facebook, Google, and Yahoo.\n\nReferences\n[1] Ahn, S., Korattikara, A., Liu, N., Rajan, S., and Welling, M. (2015). Large scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In KDD.\n\n[2] Ahn, S., Shahbaba, B., and Welling, M. (2014). Distributed stochastic gradient MCMC. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1044\u20131052.\n\n[3] Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. (2009). Adaptive approximate Bayesian computation. Biometrika, 96(4):983\u2013990.\n\n[4] Blum, M. G. and Fran\u00e7ois, O. (2010). 
Non-linear regression models for approximate Bayesian computa-\n\ntion. Statistics and Computing, 20(1):63\u201373.\n\n[5] Bonassi, F. V. and West, M. (2015). Sequential Monte Carlo with adaptive weights for approximate\n\nBayesian computation. Bayesian Analysis, 10(1).\n\n[6] Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal\n\nStatistical Society: Series B (Statistical Methodology), 68(3):411\u2013436.\n\n[7] Drovandi, C. C., Pettitt, A. N., and Faddy, M. J. (2011). Approximate Bayesian computation using indirect\n\ninference. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):317\u2013337.\n\n[8] Fearnhead, P. and Prangle, D. (2012). Constructing summary statistics for approximate Bayesian compu-\ntation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series\nB (Statistical Methodology), 74(3):419\u2013474.\n\n[9] Forneron, J.-J. and Ng, S. (2015a). The ABC of simulation estimation with auxiliary statistics. arXiv\n\npreprint arXiv:1501.01265v2.\n\n[10] Forneron, J.-J. and Ng, S. (2015b). A likelihood-free reverse sampler of the posterior distribution. arXiv\n\npreprint arXiv:1506.04017v1.\n\n[11] Gourieroux, C., Monfort, A., and Renault, E. (1993). Indirect inference. Journal of applied econometrics,\n\n8(S1):S85\u2013S118.\n\n[12] Gutmann, M. U. and Corander, J. (2015). Bayesian optimization for likelihood-free inference of\nsimulator-based statistical models. Journal of Machine Learning Research, preprint arXiv:1501.03291.\nIn press.\n\n[13] Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Ef\ufb01cient global optimization of expensive black-box\n\nfunctions. Journal of Global optimization, 13(4):455\u2013492.\n\n[14] Maclaurin, D. and Duvenaud, D. (2015). Autograd. github.com/HIPS/autograd.\n[15] Meeds, E., Leenders, R., and Welling, M. (2015). Hamiltonian ABC. Uncertainty in AI, 31.\n[16] Neal, P. (2012). 
Ef\ufb01cient likelihood-free Bayesian computation for household epidemics. Statistical\n\nComputing, 22:1239\u20131256.\n\n[17] Paige, B., Wood, F., Doucet, A., and Teh, Y. W. (2014). Asynchronous anytime Sequential Monte Carlo.\n\nIn Advances in Neural Information Processing Systems, pages 3410\u20133418.\n\n[18] Shestopaloff, A. Y. and Neal, R. M. (2013). On Bayesian inference for the M/G/1 queue with ef\ufb01cient\n\nMCMC sampling. Technical Report, Dept. of Statistics, University of Toronto.\n\n[19] Sisson, S., Fan, Y., and Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings\n\nof the National Academy of Sciences, 104(6).\n\n[20] Sisson, S., Fan, Y., and Tanaka, M. M. (2009). Sequential Monte Carlo without likelihoods: Errata.\n\nProceedings of the National Academy of Sciences, 106(16).\n\n[21] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning\n\nalgorithms. Advances in Neural Information Processing Systems 25.\n\n[22] Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient\n\napproximation. Automatic Control, IEEE Transactions on, 37(3):332\u2013341.\n\n9\n\n\f", "award": [], "sourceid": 1252, "authors": [{"given_name": "Ted", "family_name": "Meeds", "institution": "U. Amsterdam"}, {"given_name": "Max", "family_name": "Welling", "institution": "University of Amsterdam"}]}