{"title": "Asynchronous Anytime Sequential Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 3410, "page_last": 3418, "abstract": "We introduce a new sequential Monte Carlo algorithm we call the particle cascade. The particle cascade is an asynchronous, anytime alternative to traditional sequential Monte Carlo algorithms that is amenable to parallel and distributed implementations. It uses no barrier synchronizations which leads to improved particle throughput and memory efficiency. It is an anytime algorithm in the sense that it can be run forever to emit an unbounded number of particles while keeping within a fixed memory budget. We prove that the particle cascade provides an unbiased marginal likelihood estimator which can be straightforwardly plugged into existing pseudo-marginal methods.", "full_text": "Asynchronous Anytime Sequential Monte Carlo\n\nBrooks Paige\n\nFrank Wood\nDepartment of Engineering Science\n\nUniversity of Oxford\n\nOxford, UK\n\n{brooks,fwood}@robots.ox.ac.uk\n\n{doucet,y.w.teh}@stats.ox.ac.uk\n\nArnaud Doucet\n\nYee Whye Teh\n\nDepartment of Statistics\n\nUniversity of Oxford\n\nOxford, UK\n\nAbstract\n\nWe introduce a new sequential Monte Carlo algorithm we call the particle cas-\ncade. The particle cascade is an asynchronous, anytime alternative to traditional\nsequential Monte Carlo algorithms that is amenable to parallel and distributed\nimplementations.\nIt uses no barrier synchronizations which leads to improved\nparticle throughput and memory ef\ufb01ciency. It is an anytime algorithm in the sense\nthat it can be run forever to emit an unbounded number of particles while keeping\nwithin a \ufb01xed memory budget. 
We prove that the particle cascade provides an un-\nbiased marginal likelihood estimator which can be straightforwardly plugged into\nexisting pseudo-marginal methods.\n\n1\n\nIntroduction\n\nSequential Monte Carlo (SMC) inference techniques require blocking barrier synchronizations at\nresampling steps which limit parallel throughput and are costly in terms of memory. We introduce\na new asynchronous anytime sequential Monte Carlo algorithm that has statistical ef\ufb01ciency com-\npetitive with standard SMC algorithms and has suf\ufb01ciently higher particle throughput such that it is\non balance more ef\ufb01cient per unit computation time. Our approach uses locally-computed decision\nrules for each particle that do not require block synchronization of all particles, instead only sharing\nof summary statistics with particles that follow. In our algorithm each resampling point acts as a\nqueue rather than a barrier: each particle chooses the number of its own offspring by comparing its\nown weight to the weights of particles which previously reached the queue, blocking only to update\nsummary statistics before proceeding.\nAn anytime algorithm is an algorithm that can be run continuously, generating progressively better\nsolutions when afforded additional computation time. Traditional particle-based inference algo-\nrithms are not anytime in nature; all particles need to be propagated in lock-step to completion in\norder to compute expectations. Once a particle set runs to termination, inference cannot straight-\nforwardly be continued by simply doing more computation. The na\u00a8\u0131ve strategy of running SMC\nagain and merging the resulting sets of particles is suboptimal due to bias (see [13] for explana-\ntion). Particle Markov chain Monte Carlo methods (i.e. 
particle Metropolis Hastings and iterated\nconditional sequential Monte Carlo (iCSMC) [1]) for correctly merging particle sets produced by\nadditional SMC runs are closer to anytime in nature but suffer from burstiness as big sets of particles\nare computed then emitted at once and, fundamentally, the inner-SMC loop of such algorithms still\nsuffers the kind of excessive synchronization performance penalty that the particle cascade directly\navoids. Our asynchronous SMC algorithm, the particle cascade, is anytime in nature. The particle\ncascade can be run inde\ufb01nitely, without resorting to merging of particle sets.\n\n1.1 Related work\n\nOur algorithm shares a super\ufb01cial similarity to Bernoulli branching numbers [5] and other search\nand exploration methods used for particle \ufb01ltering, where each particle samples some number of\n\n1\n\n\fchildren to propagate to the next observation. Like the particle cascade, the total number of particles\nwhich exist at each generation is allowed to gradually increase and decrease. However, computing\nbranching correction numbers is generally a synchronous operation, requiring all particle weights\nto be known in order to choose an appropriate number of offspring; nor are these methods anytime.\nSequentially interacting Markov chain Monte Carlo [2, 9] is an anytime algorithm, which although\nconceptually similar to SMC has different synchronization properties.\nParallelizing the resampling step of sequential Monte Carlo methods has drawn increasing recent\ninterest as the effort progresses to scale up algorithms to take advantage of high-performance com-\nputing systems and GPUs. Removing the global collective resampling operation [10] is a particular\nfocus for improving performance.\nRunning arbitrarily many particles within a \ufb01xed memory budget can also be addressed by tracking\nrandom number seeds used to generate proposals, allowing particular particles to be deterministi-\ncally \u201creplayed\u201d [7]. 
However, this approach is not asynchronous nor anytime.

2 Background

We begin by briefly reviewing sequential Monte Carlo as generally formulated on state-space models. Suppose we have a non-Markovian dynamical system with latent random variables X_0, ..., X_N and observed random variables Y_0, ..., Y_N described by the joint density

p(x_n | x_{0:n-1}, y_{0:n-1}) = f(x_n | x_{0:n-1}),
p(y_n | x_{0:n}, y_{0:n-1}) = g(y_n | x_{0:n}),   (1)

where X_0 is drawn from some initial distribution \mu(\cdot), and f and g are conditional densities. Given observed values Y_{0:N} = y_{0:N}, the posterior distribution p(x_{0:n} | y_{0:n}) is approximated by a weighted set of K particles, with each particle k denoted X_{0:n}^k for k = 1, ..., K. Particles are propagated forward from proposal densities q(x_n | x_{0:n-1}) and re-weighted at each n = 1, ..., N:

X_n^k | X_{0:n-1}^k ~ q(x_n | X_{0:n-1}^k),   (2)
w_n^k = g(y_n | X_{0:n}^k) f(X_n^k | X_{0:n-1}^k) / q(X_n^k | X_{0:n-1}^k),   (3)
W_n^k = W_{n-1}^k w_n^k,   (4)

where w_n^k is the weight associated with observation y_n and W_n^k is the unnormalized weight of particle k after observation n. It is assumed that exact evaluation of p(x_{0:N} | y_{0:N}) is intractable and that the likelihoods g(y_n | X_{0:n}^k) can be evaluated pointwise. In many complex dynamical systems, or in black-box simulation models, evaluation of f(X_n^k | X_{0:n-1}^k) may be prohibitively costly or even impossible.
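When the proposal is chosen as the transition density itself (the bootstrap setting discussed next), the recursion of Eqs. 2–4 reduces to simulating forward and accumulating observation log-likelihoods. A minimal sketch, where `simulate` and `loglik` are hypothetical model-specific stand-ins for f and g, not part of the paper's interface:

```python
import math

def sis_bootstrap(simulate, loglik, y, K):
    """Sequential importance sampling with the bootstrap proposal q = f.

    simulate(x_hist) draws x_n given the particle's history (the transition
    density f of Eq. 1); loglik(x_hist, y_n) evaluates log g(y_n | x_{0:n}).
    Returns trajectories, log unnormalized weights log W_n^k (Eq. 4), and the
    log marginal likelihood estimate log[(1/K) sum_k W_N^k].
    """
    paths = [[] for _ in range(K)]
    logW = [0.0] * K
    for y_n in y:
        for k in range(K):
            paths[k].append(simulate(paths[k]))
            # with q = f the incremental weight (Eq. 3) is w_n^k = g(y_n | x_{0:n}^k)
            logW[k] += loglik(paths[k], y_n)
    m = max(logW)  # log-sum-exp for numerical stability
    log_Z = m + math.log(sum(math.exp(lw - m) for lw in logW)) - math.log(K)
    return paths, logW, log_Z
```

Working in log space avoids the underflow that raw products of likelihoods would cause for even moderately long observation sequences.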
As long as one is capable of simulating from the system, the proposal distribution can be chosen as q(\cdot) \equiv f(\cdot), in which case the particle weights are simply w_n^k = g(y_n | X_{0:n}^k), eliminating the need to compute the densities f(\cdot).

The normalized particle weights \bar{\omega}_n^k = W_n^k / \sum_{j=1}^K W_n^j are used to approximate the posterior

\hat{p}(x_{0:n} | y_{0:n}) \approx \sum_{k=1}^K \bar{\omega}_n^k \delta_{X_{0:n}^k}(x_{0:n}).   (5)

In the very simple sequential importance sampling setup described here, the marginal likelihood can be estimated by \hat{p}(y_{0:n}) = (1/K) \sum_{k=1}^K W_n^k.

2.1 Resampling and degeneracy

The algorithm described above suffers from a degeneracy problem wherein most of the normalized weights \bar{\omega}_n^1, ..., \bar{\omega}_n^K become very close to zero for even moderately large n. Traditionally this is combated by introducing a resampling step: as we progress from n to n + 1, particles with high weights are duplicated and particles with low weights are discarded, preventing all the probability mass in our approximation to the posterior from accumulating on a single particle. A resampling scheme is an algorithm for selecting the number of offspring particles M_{n+1}^k that each particle k will produce after stage n. Many different schemes for resampling particles exist; see [6] for an overview. Resampling changes the weights of particles: as the system progresses from n to n + 1, each of the M_{n+1}^k children is assigned a new weight V_{n+1}^k, replacing the previous weight W_n^k prior to resampling. Most resampling schemes generate an unweighted set of particles with V_{n+1}^k = 1 for all particles. When a resampling step is added at every n, the marginal likelihood can be estimated by \hat{p}(y_{0:n}) = \prod_{i=0}^n [(1/K) \sum_{k=1}^K w_i^k]; this estimate of the marginal likelihood is unbiased [8].

2.2 Synchronization and limitations

Our goal is to scale up to very large numbers of particles, using a parallel computing architecture where each particle is simulated as a separate process or thread. In order to resample at each n we must compute the normalized weights \bar{\omega}_n^k, requiring us to wait until all individual particles have both finished forward simulation and computed their individual weight W_n^k before the normalization and resampling required for any to proceed. While the forward simulation itself is trivially parallelizable, the weight normalization and resampling step is a synchronous, collective operation. In practice this can lead to significant underuse of computing resources in a multiprocessor environment, hindering our ability to scale up to large numbers of particles.

Memory limitations on finite computing hardware also limit the number of simultaneous particles we are capable of running in practice. All particles must move through the system together, simultaneously; if the total memory requirements of the particles is greater than the available system RAM, then a substantial overhead will be incurred from swapping memory contents to disk.

3 The Particle Cascade

The particle cascade algorithm we introduce addresses both these limitations: it does not require synchronization, and keeps only a bounded number of particles alive in the system at any given time. Instead of resampling, we will consider particle branching, where each particle may produce 0 or more offspring. These branching events happen asynchronously and mutually exclusively, i.e.
they are processed one at a time.

3.1 Local branching decisions

At each stage n of sequential Monte Carlo, particles process observation y_n. Without loss of generality, we can define an ordering on the particles 1, 2, ... in the order they arrive at y_n. We keep track of the running average weight \bar{W}_n^k of the first k particles to arrive at observation y_n in an online manner:

\bar{W}_n^k = W_n^k   for k = 1,   (6)
\bar{W}_n^k = ((k - 1)/k) \bar{W}_n^{k-1} + (1/k) W_n^k   for k = 2, 3, ....   (7)

The number of children of particle k depends on the weight W_n^k of particle k relative to those of other particles. Particles with higher relative weight are more likely to be located in a high posterior probability part of the space, and should be allowed to spawn more child particles.

In our online asynchronous particle system we do not have access to the weights of future particles when processing particle k. Instead we will compare W_n^k to the current average weight \bar{W}_n^k among particles processed thus far. Specifically, the number of children, which we denote by M_{n+1}^k, will depend on the ratio

R_n^k = W_n^k / \bar{W}_n^k.   (8)

Each child of particle k will be assigned a weight V_{n+1}^k such that the total weight of all children M_{n+1}^k V_{n+1}^k has expectation W_n^k.

There is a great deal of flexibility available in designing a scheme for choosing the number of child particles; we need only be careful to set V_{n+1}^k appropriately. Informally, we would like M_{n+1}^k to be large when R_n^k is large. If M_{n+1}^k is sampled in such a way that E[M_{n+1}^k] = R_n^k, then we set the outgoing weight V_{n+1}^k = \bar{W}_n^k. Alternatively, if we are using a scheme which deterministically guarantees M_{n+1}^k > 0, then we set V_{n+1}^k = W_n^k / M_{n+1}^k.

A simple approach would be to sample M_{n+1}^k independently conditioned on the weights. In such schemes we could draw each M_{n+1}^k from some simple distribution, e.g. a Poisson distribution with mean R_n^k, or a discrete distribution over the integers {\lfloor R_n^k \rfloor, \lceil R_n^k \rceil}. However, one issue that arises in such approaches, where the number of children for each particle is conditionally independent, is that the variance of the total number of particles at each generation can grow faster than desirable. Suppose we start the system with K_0 particles. The number of particles at subsequent stages n is given recursively as K_n = \sum_{k=1}^{K_{n-1}} M_n^k. We would like to avoid situations in which the number of particles becomes too large, or collapses to 1.

Instead, we will allow M_n^k to depend on the number of children of previous particles at n, in such a way that we can stabilize the total number of particles in each generation. Suppose that we wish for the number of particles to be stabilized around K_0. After k - 1 particles have been processed, we expect the total number of children produced at that point to be approximately k - 1, so that if the number is less than k - 1 we should allow particle k to produce more children, and vice versa. Similarly, if we already currently have more than K_0 children, we should allow particle k to produce fewer children.

We use a simple scheme which satisfies these criteria, where the number of particles is chosen at random when R_n^k < 1, and set deterministically when R_n^k \geq 1:

(M_{n+1}^k, V_{n+1}^k) =
  (0, 0)                                            w.p. 1 - R_n^k,  if R_n^k < 1;
  (1, \bar{W}_n^k)                                  w.p. R_n^k,      if R_n^k < 1;
  (\lfloor R_n^k \rfloor, W_n^k / \lfloor R_n^k \rfloor)   if R_n^k \geq 1 and \sum_{j=1}^{k-1} M_{n+1}^j > min(K_0, k - 1);
  (\lceil R_n^k \rceil, W_n^k / \lceil R_n^k \rceil)       if R_n^k \geq 1 and \sum_{j=1}^{k-1} M_{n+1}^j \leq min(K_0, k - 1).   (9)

As the number of particles becomes large, the estimated average weight closely approximates the true average weight. Were we to replace the deterministic rounding with a Bernoulli(R_n^k - \lfloor R_n^k \rfloor) choice between {\lfloor R_n^k \rfloor, \lceil R_n^k \rceil}, then this decision rule defines the same distribution on the number of offspring particles M_{n+1}^k as the well-known systematic resampling procedure [3, 10].

Note the anytime nature of this algorithm — any given particle passing through the system needs only the running average \bar{W}_n^k and the preceding child particle counts \sum_{j=1}^{k-1} M_{n+1}^j in order to make local branching decisions, not the previous particles themselves. Thus it is possible to run this algorithm for some fixed number of initial particles K_0, inspect the output of the completed particles which have left the system, and decide whether to continue by initializing additional particles.

3.2 Computing expectations and marginal likelihoods

Samples drawn from the particle cascade can be used to compute expectations in the same manner as usual; that is, given some function \varphi(\cdot), we normalize weights \bar{\omega}_n^k = W_n^k / \sum_{j=1}^{K_n} W_n^j and approximate the posterior expectation by E[\varphi(X_{0:n}) | y_{0:n}] \approx \sum_{k=1}^{K_n} \bar{\omega}_n^k \varphi(X_{0:n}^k).

We can also use the particle cascade to define an estimator of the marginal likelihood p(y_{0:n}),

\hat{p}(y_{0:n}) = (1/K_0) \sum_{k=1}^{K_n} W_n^k.   (10)

The form of this estimate is fairly distinct from the standard SMC estimators in
Section 2. One can think of \hat{p}(y_{0:n}) as \hat{p}(y_{0:n}) = \hat{p}(y_0) \prod_{i=1}^n \hat{p}(y_i | y_{0:i-1}), where

\hat{p}(y_0) = (1/K_0) \sum_{k=1}^{K_0} W_0^k,    \hat{p}(y_n | y_{0:n-1}) = (\sum_{k=1}^{K_n} W_n^k) / (\sum_{k=1}^{K_{n-1}} W_{n-1}^k)   for n \geq 1.   (11)

Note that the incrementally updated running averages \bar{W}_n^k are very directly tied to the marginal likelihood estimate; that is, \hat{p}(y_{0:n}) = (K_n / K_0) \bar{W}_n^{K_n}.

3.3 Theoretical properties, unbiasedness, and consistency

Under weak assumptions we can show that the marginal likelihood estimator \hat{p}(y_{0:n}) defined in Eq. 10 is unbiased, and that both its variance and the L2 errors of estimates of reasonable posterior expectations decrease in the number of particle initializations as 1/K_0. Note that because the cascade is an anytime algorithm, K_0 may be increased simply, without restarting inference. Detailed proofs are given in the supplemental material; statements of the results are provided here.

Denote by B(E) the space of bounded real-valued functions on a space E, and suppose each X_n is an X-valued random variable. Assume the Bernoulli(R_n^k - \lfloor R_n^k \rfloor) version of the resampling rule in Eq. 9, and further assume that g(y_n | \cdot, y_{0:n-1}) : X^{n+1} \to R is in B(X^{n+1}) and strictly positive. Finally assume that the ordering in which particles arrive at each n is a random permutation of the particle index set, conditions which we state precisely in the supplemental material. Then the following propositions hold:

Proposition 1 (Unbiasedness of marginal likelihood estimate) For any K_0 \geq 1 and n \geq 0,

E[\hat{p}(y_{0:n})] = p(y_{0:n}).   (12)

Proposition 2 (Variance of marginal likelihood estimate) For any n \geq 0, there exists a constant a_n < \infty such that for any K_0 \geq 1,

V[\hat{p}(y_{0:n})] \leq a_n / K_0.   (13)

Proposition 3 (L2 error bounds) For any n \geq 0, there exists a constant a_n < \infty such that for any K_0 \geq 1 and any \psi_n \in B(X^{n+1}),

E[ ( \sum_{k=1}^{K_n} \bar{\omega}_n^k \psi_n(X_{0:n}^k) - \int p(dx_{0:n} | y_{0:n}) \psi_n(x_{0:n}) )^2 ] \leq (a_n / K_0) \|\psi_n\|^2.   (14)

Additional results and proofs can be found in the supplemental material.

4 Active bounding of memory usage

In an idealized computational environment, with infinite available memory, our implementation of the particle cascade could begin by launching (a very large number) K_0 particles simultaneously which then gradually propagate forward through the system. In practice, only some finite number of particles, probably much smaller than K_0, can be simultaneously simulated efficiently. Furthermore, the initial particles are not truly launched all at once, but rather in a sequence, introducing a dependency in the order in which particles arrive at each observation n.

Our implementation of the particle cascade addresses these issues by explicitly injecting randomness into the execution order of particles, and by imposing a machine-dependent hard cap on the number of simultaneous extant processes.
This permits us to run our particle \ufb01lter system inde\ufb01nitely, for\narbitrarily large and, in fact, growing initial particle counts K0, on \ufb01xed commodity hardware.\nEach particle in our implementation runs as an independent operating system process [12]. In order\nto ef\ufb01ciently run a large number of particles, we impose a hard limit \u03c1 on the total number of\nparticles which can simultaneously exist in the particle system; most of these will generally be\nsleeping processes. The ideal choice for this number will vary based on hardware capabilities, but\nin general should be made as large as possible.\nScheduling across particles is managed via a global \ufb01rst-in random-out process queue of length\n\u03c1; this can equivalently be conceptualized as a random-weight priority queue. Each particle corre-\nsponds to a single live process, augmented by a single additional control process which is responsible\nonly for spawning additional initial particles (i.e. incrementing the initial particle count K0). When\nany particle k arrives at any likelihood evaluation n, it computes its target number of child parti-\ncles M k\nn+1 = 0 it immediately terminates; otherwise\nit enters the queue. Once this particle either enters the queue or terminates, some other process\n\nn+1 and outgoing particle weight V k\n\nn+1. If M k\n\n5\n\n\fFigure 1: All results are reported over multiple independent replications, shown here as independent\nlines. (top) Convergence of estimates to ground truth vs. number of particles, shown as (left) MSE\nof marginal probabilities of being in each state for every observation n in the HMM, and (right)\nMSE of the latent expected position in the linear Gaussian state space model. 
(bottom) Convergence\nof marginal likelihood estimates to the ground truth value (marked by a red dashed line), for (left)\nthe HMM, and (right) the linear Gaussian model.\n\ncontinues execution \u2014 this process is chosen uniformly at random, and as such may be a sleeping\nparticle at any stage n < N, or it may instead be the control process which then launches a new\nparticle. At any given time, there are some number of particles K\u03c1 < \u03c1 currently in the queue, and\nso the probability of resuming any particular individual particle, or of launching a new particle, is\n1/(K\u03c1 + 1). If the particle released from the queue has exactly one child to spawn, it advances to\nthe next observation and repeats the resampling process. If, however, a particle has more than one\nchild particle to spawn, rather than launching all child particles at once it launches a single particle to\nsimulate forward, decrements the total number of particles left to launch by one, and itself re-enters\nthe queue. The system is initialized by seeding the system with a number of initial particles \u03c10 < \u03c1\nat n = 0, creating \u03c10 active initial processes. The ideal choice for the process count constraint \u03c1\nmay vary across operating systems and hardware.\nIn the event that the process count is fully saturated (i.e. the process queue is full), then we forcibly\nprevent particles from duplicating themselves and creating new children. 
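The first-in random-out process queue described above can be modeled with a simple data structure. The actual implementation schedules operating-system processes [12]; the sketch below, with hypothetical names of our own choosing, captures only the scheduling behavior: entries join at the tail, the next entry to resume is chosen uniformly at random, and a hard capacity rho is enforced:

```python
import random

class RandomOutQueue:
    """First-in random-out queue with hard capacity rho (Section 4 sketch).

    Sleeping particles (and a sentinel for the control process that launches
    new initial particles) enter at the tail; pop() resumes one entry chosen
    uniformly at random, so each of the K_rho queued entries has probability
    1/K_rho of running next.
    """

    def __init__(self, rho):
        self.rho = rho
        self.items = []

    def full(self):
        return len(self.items) >= self.rho

    def push(self, entry):
        # when saturated, the caller should collapse children into a single
        # virtual particle rather than enqueue a new process
        if self.full():
            raise RuntimeError("queue saturated")
        self.items.append(entry)

    def pop(self):
        # uniform choice over currently queued entries
        i = random.randrange(len(self.items))
        return self.items.pop(i)
```

This can equivalently be conceptualized as a random-weight priority queue, as the text notes.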
If we release a particle from the queue which seeks to launch m > 1 additional particles when the queue is full, we instead collapse all the remaining particles into a single particle; this single particle represents a virtual set of particles, but does not create a new process and requires no additional CPU or memory resources. We keep track of a particle count multiplier C_n^k that we propagate forward along with the particle. All particles are initialized with C_0^k = 1, and then when a particle collapse takes place, we update their multiplier at n + 1 to m C_n^k. This affects the way in which running weight averages are computed; suppose a new particle k arrives with multiplier C_n^k and weight W_n^k. We incorporate all these values into the average weight immediately, and update \bar{W}_n^k taking into account the multiplicity, with

\bar{W}_n^k = ((k - 1)/(k + C_n^k - 1)) \bar{W}_n^{k-1} + (C_n^k/(k + C_n^k - 1)) W_n^k   for k = 2, 3, ....   (15)

This does not affect the computation of the ratio R_n^k. We preserve the particle multiplier until we reach the final n = N; then, after all forward simulation is complete, we re-incorporate the particle multiplicity when reporting the final particle weight W_N^k = C_N^k V_N^k w_N^k.

5 Experiments

We report experiments on performing inference in two simple state space models, each with N = 50 observations, in order to demonstrate the overall validity and utility of the particle cascade algorithm.

Figure 2: (top) Comparative convergence rates between SMC alternatives including our new algorithm, and (bottom) estimation of marginal likelihood, by time. Results are shown for (left) the hidden Markov model, and (right) the linear Gaussian state space model.

The first is a hidden Markov model (HMM) with 10 latent discrete states, each with an associated Gaussian emission distribution; the second a one-dimensional linear Gaussian model. Note that using these models means that we can compute posterior marginals at each n and the marginal likelihood Z = p(y_{0:N}) exactly.

These experiments are not designed to stress-test the particle cascade; rather, they are designed to show that performance of the particle cascade closely approximates that of fully synchronous SMC algorithms, even in a small-data small-complexity regime where we expect their performance to be very good. In addition to comparing to standard SMC, we also compare to a worst-case particle filter in which we never resample, instead propagating particles forward deterministically with a single child particle at every n.
While the statistical (per-sample) efficiency of this approach is quite poor, it is fully parallelizable with no blocking operations in the algorithm at all, and thus provides a ceiling estimate of the raw sampling speed attainable in our overall implementation.

We also benchmark against what we believe to be the most practically competitive similar approach, iterated conditional SMC [1]. Iterated conditional SMC corresponds to the particle Gibbs algorithm in the case where parameter values are known; by using a particle filter sweep as a step within a larger MCMC algorithm, iCSMC provides a statistically valid approach to sampling from a posterior distribution by repeatedly running sequential Monte Carlo sweeps, each with a fixed number of particles. One downside to iCSMC is that it does not provide an estimate of the marginal likelihood. In all benchmarks, we propose from the prior distribution, with q(x_n | \cdot) \equiv f(x_n | x_{0:n-1}); the SMC and iCSMC benchmarks use a multinomial resampling scheme.

Figure 3: Average time to draw a single complete particle on a variety of machine architectures. Queueing rather than blocking at each observation improves performance, and appears to improve relative performance even more as the available compute resources increase. Note that this plot shows only average time per sample, not a measure of statistical efficiency. The high speed of the non-resampling algorithm is not sufficient to make it competitive with the other approaches.

On both these models we see the statistical efficiency of the particle cascade is approximately in line with synchronous SMC, slightly outperforming the iCSMC algorithm and significantly outperforming the fully parallelized non-resampling approach. This suggests that the approximations made by computing weights at each n based on only the previously observed particles, and the total particle count limit imposed by \rho, do not have an adverse effect on overall performance. In Fig. 1 we plot convergence per particle to the true posterior distribution, as well as convergence in our estimate of the normalizing constant.

5.1 Performance and scalability

Although values will be implementation-dependent, we are ultimately interested not in per-sample efficiency but rather in our rate of convergence over time. We record wall clock time for each algorithm for both of these models; the results for convergence of our estimates of values and marginal likelihood are shown in Fig. 2. These particular experiments were all run on Amazon EC2, in an 8-core environment with Intel Xeon E5-2680 v2 processors. The particle cascade provides a much faster and more accurate estimate of the marginal likelihood than the competing methods, in both models. Convergence in estimates of values is quick as well, faster than the iCSMC approach. We note that for very small numbers of particles, running a simple particle filter is faster than the particle cascade, despite the blocking nature of the resampling step.
This is due to the overhead incurred by the particle cascade in sending an initial flurry of \rho_0 particles into the system before we see any particles progress to the end; this initial speed advantage diminishes as the number of samples increases. Furthermore, in stark contrast to the simple SMC method, there are no barriers to drawing more samples from the particle cascade indefinitely. On this fixed hardware environment, our implementation of SMC, which aggressively parallelizes all forward particle simulations, exhibits a dramatic loss of performance as the number of particles increases from 10^4 to 10^5, to the point where simultaneously running 10^5 particles is simply not possible in a feasible amount of time.

We are also interested in how the particle cascade scales up to larger hardware, or down to smaller hardware. A comparison across five hardware configurations is shown in Fig. 3.

6 Discussion

The particle cascade has broad applicability to all SMC and particle filtering inference applications. For example, constructing an appropriate sequence of densities for SMC is possible in arbitrary probabilistic graphical models, including undirected graphical models; see e.g. the sequential decomposition approach of [11]. We are particularly motivated by the SMC-based probabilistic programming systems that have recently appeared in the literature [14, 12]. Both suggested that the primary performance bottleneck in their inference algorithms was barrier synchronization, something we have done away with entirely. What is more, while particle MCMC methods are particularly appropriate when there is a clear boundary that can be exploited between parameters of interest and nuisance state variables, in probabilistic programming in particular, parameter values must be generated as part of the state trajectory itself, leaving no explicitly denominated latent parameter variables per se. The particle cascade is particularly relevant in such situations.

Finally, as the particle cascade yields an unbiased estimate of the marginal likelihood, it can be plugged directly into PIMH, SMC^2 [4], and other existing pseudo-marginal methods.

Acknowledgments

Yee Whye Teh's research leading to these results has received funding from EPSRC (grant EP/K009362/1) and the ERC under the EU's FP7 Programme (grant agreement no. 617411). Arnaud Doucet's research is partially funded by EPSRC (grants EP/K009850/1 and EP/K000276/1). Frank Wood is supported under DARPA PPAML through the U.S. AFRL under Cooperative Agreement number FA8750-14-2-0004. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, the U.S. Air Force Research Laboratory or the U.S. Government.

References

[1] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.

[2] Anthony Brockwell, Pierre Del Moral, and Arnaud Doucet. Sequentially interacting Markov chain Monte Carlo methods. Annals of Statistics, 38(6):3387–3411, 2010.

[3] James Carpenter, Peter Clifford, and Paul Fearnhead. An improved particle filter for non-linear problems. IEE Proceedings - Radar, Sonar and Navigation, 146(1):2–7, Feb 1999.

[4] Nicolas Chopin, Pierre E. Jacob, and Omiros Papaspiliopoulos. SMC^2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.

[5] D. Crisan, P. Del Moral, and T. Lyons.
Discrete filtering using branching and interacting particle systems. Markov Process. Related Fields, 5(3):293–318, 1999.

[6] Randal Douc, Olivier Cappé, and Eric Moulines. Comparison of resampling schemes for particle filtering. In 4th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 64–69, 2005.

[7] Seong-Hwan Jun and Alexandre Bouchard-Côté. Memory (and time) efficient sequential Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[8] Pierre Del Moral. Feynman-Kac Formulae – Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. Springer, 2004.

[9] Pierre Del Moral and Arnaud Doucet. Interacting Markov chain Monte Carlo methods for solving nonlinear measure-valued equations. Annals of Applied Probability, 20(2):593–639, 2010.

[10] Lawrence M. Murray, Anthony Lee, and Pierre E. Jacob. Parallel resampling in the particle filter. arXiv preprint arXiv:1301.4019, 2014.

[11] Christian A. Naesseth, Fredrik Lindsten, and Thomas B. Schön. Sequential Monte Carlo for Graphical Models. In Advances in Neural Information Processing Systems 27. 2014.

[12] Brooks Paige and Frank Wood. A compilation target for probabilistic programming languages. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[13] Nick Whiteley, Anthony Lee, and Kari Heine. On the role of interaction in sequential Monte Carlo algorithms. arXiv preprint arXiv:1309.2918, 2013.

[14] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka.
A new approach to prob-\nIn Proceedings of the 17th International conference on\n\nabilistic programming inference.\nArti\ufb01cial Intelligence and Statistics, 2014.\n\n9\n\n\f", "award": [], "sourceid": 1769, "authors": [{"given_name": "Brooks", "family_name": "Paige", "institution": "University of Oxford"}, {"given_name": "Frank", "family_name": "Wood", "institution": "University of Oxford"}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": "University of Oxford"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford"}]}