{"title": "An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": "We propose an efficient sequential Monte Carlo inference scheme for the recently proposed coalescent clustering model (Teh et al, 2008). Our algorithm has a quadratic runtime while those in (Teh et al, 2008) is cubic. In experiments, we were surprised to find that in addition to being more efficient, it is also a better sequential Monte Carlo sampler than the best in (Teh et al, 2008), when measured in terms of variance of estimated likelihood and effective sample size.", "full_text": "An Ef\ufb01cient Sequential Monte Carlo Algorithm for\n\nCoalescent Clustering\n\nDilan G\u00a8or\u00a8ur\nGatsby Unit\n\nUniversity College London\ndilan@gatsby.ucl.ac.uk\n\nYee Whye Teh\nGatsby Unit\n\nUniversity College London\nywteh@gatsby.ucl.ac.uk\n\nAbstract\n\nWe propose an ef\ufb01cient sequential Monte Carlo inference scheme for the recently\nproposed coalescent clustering model [1]. Our algorithm has a quadratic runtime\nwhile those in [1] is cubic.\nIn experiments, we were surprised to \ufb01nd that in\naddition to being more ef\ufb01cient, it is also a better sequential Monte Carlo sampler\nthan the best in [1], when measured in terms of variance of estimated likelihood\nand effective sample size.\n\n1 Introduction\n\nAlgorithms for automatically discovering hierarchical structure from data play an important role\nin machine learning. In many cases the data itself has an underlying hierarchical structure whose\ndiscovery is of interest, examples include phylogenies in biology, object taxonomies in vision or\ncognition, and parse trees in linguistics. 
In other cases, even when the data is not hierarchically structured, such structures are still useful simply as a statistical tool to efficiently pool information across the data at different scales; this is the starting point of hierarchical modelling in statistics. Many hierarchical clustering algorithms have been proposed in the past for discovering hierarchies. In this paper we are interested in a Bayesian approach to hierarchical clustering [2, 3, 1]. This is mainly due to the appeal of the Bayesian approach being able to capture uncertainty in learned structures in a coherent manner. Unfortunately, inference in Bayesian models of hierarchical clustering is often complex to implement, and computationally expensive as well.\nIn this paper we build upon the work of [1], who proposed a Bayesian hierarchical clustering model based on Kingman's coalescent [4, 5]. [1] proposed both greedy and sequential Monte Carlo (SMC) based agglomerative clustering algorithms for inferring hierarchical clusterings, which are simpler to implement than Markov chain Monte Carlo methods. The algorithms work by starting with each data item in its own cluster, and iteratively merging pairs of clusters until all clusters have been merged. The SMC-based algorithm has computational cost O(n^3) per particle, where n is the number of data items.\nWe propose a new SMC-based algorithm for inference in the coalescent clustering of [1]. The algorithm is based upon a different perspective on Kingman's coalescent than that in [1], in which the computations required to consider whether to merge each pair of clusters at one iteration are not discarded in subsequent iterations. 
This improves the computational cost to O(n^2) per particle, allowing this algorithm to be applied to larger datasets. In experiments we show that our new algorithm achieves improved costs without sacrificing accuracy or reliability.\nKingman's coalescent originated in the population genetics literature, and there has been significant interest there in inference, including Markov chain Monte Carlo based approaches [6] and SMC approaches [7, 8]. The SMC approaches have an interesting relationship to our algorithm and to that of [1]. While ours and that of [1] integrate out the mutations on the coalescent tree and sample the coalescent times, [7, 8] integrate out the coalescent times and sample mutations instead. Because of this difference, ours and that of [1] will be more efficient on higher dimensional data, as well as in other cases where the state space is too large and sampling mutations would be inefficient.\nIn the next section, we review Kingman's coalescent and the existing SMC algorithms for inference on this model. In Section 3, we describe a cheaper SMC algorithm. We compare our method with that of [1] in Section 4 and conclude with a discussion in Section 5.\n\n2 Hierarchical Clustering using Kingman's Coalescent\n\nKingman's coalescent [4, 5] describes the family relationship between a set of haploid individuals by constructing the genealogy backwards in time. Ancestral lines coalesce when the individuals share a common ancestor, and the genealogy is a binary tree rooted at the common ancestor of all the individuals under consideration. We briefly review the coalescent and the associated clustering model as presented in [1] before presenting a different formulation more suitable for our proposed algorithm.\nLet π be the genealogy of n individuals. 
There are n − 1 coalescent events in π; we order these events with i = 1 being the most recent one, and i = n − 1 being the last event, when all ancestral lines have coalesced. Event i occurs at time Ti < 0 in the past, and involves the coalescing of two ancestors, denoted ρli and ρri, into one denoted ρi. Let Ai be the set of ancestors right after coalescent event i, and A0 be the full set of individuals at the present time T0 = 0. To draw a sample π from Kingman's coalescent we sample the coalescent events one at a time starting from the present. At iteration i we pick the pair of individuals ρli, ρri uniformly at random from the n − i + 1 individuals available in Ai−1, pick a waiting time δi ∼ Exp(C(n−i+1, 2)) from an exponential distribution with rate C(n−i+1, 2) = (n−i+1)(n−i)/2 equal to the number of pairs available, and set Ai = Ai−1 − {ρli, ρri} + {ρi}, Ti = Ti−1 − δi. The probability of π is thus:\n\np(π) = ∏_{i=1}^{n−1} exp( −C(n−i+1, 2) δi )   (1)\n\nThe coalescent can be used as a prior over binary trees in a model where we have a tree-structured likelihood function for observations at the leaves. Let θi be the subtree rooted at ρi and xi be the observations at the leaves of θi. [1] showed that by propagating messages up the tree the likelihood function can be written in a sequential form:\n\np(x | π) = Z0(x) ∏_{i=1}^{n−1} Zρi(xi | θi)   (2)\n\nwhere Zρi is a function only of the coalescent times associated with ρli, ρri, ρi and of the local messages sent from ρli, ρri to ρi, and Z0(x) is an easily computed normalization constant in eq. (2). Each function has the form (see [1] for further details):\n\nZρi(xi | θi) = ∫ p0(yi) ∏_{c=li,ri} ∫ p(yc | yi, θi) Mρc(yc) dyc dyi   (3)\n\nwhere Mρc is the message from child ρc to ρi. The posterior is proportional to the product of eq. (1) and eq. (2), and our aim is to compute this posterior efficiently. For this purpose, we give a different perspective on constructing the coalescent in the following, and describe our sequential Monte Carlo algorithm in Section 3.\n\n2.1 A regenerative race process\n\nIn this section we describe a different formulation of the coalescent based on the fact that each stage of the coalescent can be interpreted as a race between the C(n−i+1, 2) pairs of individuals to coalesce. Each pair proposes a coalescent time; the pair with the most recent coalescent time “wins” the race and gets to coalesce, at which point the next stage starts with C(n−i, 2) pairs in the race. Naïvely this race process would require a total of O(n^3) pairs to propose coalescent times. We show that using the regenerative (memoryless) property of exponential distributions allows us to reduce this to O(n^2).\n\nAlgorithm 1 A regenerative race process for constructing the coalescent\n\ninputs: number of individuals n\nset starting time T0 = 0 and A0 the set of n individuals\nfor all pairs of existing individuals ρl, ρr ∈ A0 do\n  propose coalescent time tlr using eq. (4)\nend for\nfor all coalescence events i = 1 : n − 1 do\n  find the pair to coalesce (ρli, ρri) using eq. (5)\n  set coalescent time Ti = tliri and update Ai = Ai−1 − {ρli, ρri} + {ρi}\n  remove pairs with ρl ∈ {ρli, ρri}, ρr ∈ Ai−1\\{ρli, ρri}\n  for all new pairs with ρl = ρi, ρr ∈ Ai\\{ρi} do\n    propose coalescent time using eq. 
(4)\n  end for\nend for\n\nThe same idea will allow us to reduce the computational cost of our SMC algorithm from O(n^3) to O(n^2).\n\nAt stage i of the coalescent we have n − i + 1 individuals in Ai−1, and C(n−i+1, 2) pairs in the race to coalesce. Each pair ρl, ρr ∈ Ai−1, ρl ≠ ρr proposes a coalescent time\n\ntlr | Ti−1 ∼ Ti−1 − Exp(1)   (4)\n\nthat is, by subtracting from the last coalescent time a waiting time drawn from an exponential distribution of rate 1. The pair ρli, ρri with the most recent coalescent time wins the race:\n\n(ρli, ρri) = argmax_{(ρl,ρr)} tlr,   ρl, ρr ∈ Ai−1, ρl ≠ ρr   (5)\n\nand coalesces into a new individual ρi at time Ti = tliri. At this point stage i + 1 of the race begins, with some pairs dropping out of the race (specifically those with one half of the pair being either ρli or ρri) and new ones entering (specifically those formed by pairing the new individual ρi with an existing one). Among the pairs (ρl, ρr) that neither dropped out nor just entered the race, consider the distribution of tlr conditioned on the fact that tlr < Ti (since (ρl, ρr) did not win the race at stage i). Using the memoryless property of the exponential distribution, we see that tlr | Ti ∼ Ti − Exp(1), thus eq. (4) still holds and we need not redraw tlr for the stage i + 1 race. In other words, once tlr is drawn, it can be reused for subsequent stages of the race until it either wins a stage or drops out. The generative process is summarized in Algorithm 1.\nWe obtain the probability of the coalescent π as a product over the i = 1, . . . , n − 1 stages of the race, of the probability of each event “ρli, ρri wins stage i and coalesces at time Ti” given more recent stages. 
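As a concrete illustration, the regenerative race of Algorithm 1 can be sketched in a few lines of Python. This is a sketch, not the authors' code: function and variable names are invented here, and the "remove pairs" step of Algorithm 1 is implemented lazily, by discarding stale heap entries when they are popped.

```python
import heapq
import math
import random

def sample_coalescent(n, rng=random.random):
    """Sketch of Algorithm 1, the regenerative race process.

    Each pair draws its coalescent time once via eq. (4),
    t_lr = T_entry - Exp(1); by the memoryless property the draw is
    reused across stages until the pair wins or drops out.  A heap
    keyed on -t finds each stage's winner; entries whose members
    have already coalesced are skipped when popped."""
    T = 0.0                      # present time, T_0 = 0
    alive = set(range(n))        # individuals still in the race
    next_id = n                  # label for the next merged ancestor
    heap = []                    # entries (-t_lr, rho_l, rho_r)

    def propose(l, r, t_entry):
        # eq. (4): subtract an Exp(1) waiting time from the entry time
        t = t_entry - (-math.log(1.0 - rng()))
        heapq.heappush(heap, (-t, l, r))

    for l in alive:
        for r in alive:
            if l < r:
                propose(l, r, T)

    events = []                  # (T_i, rho_li, rho_ri, rho_i)
    while len(alive) > 1:
        # eq. (5): the most recent proposed time among surviving pairs
        # wins; stale entries (a member already coalesced) are dropped
        while True:
            neg_t, l, r = heapq.heappop(heap)
            if l in alive and r in alive:
                break
        T = -neg_t
        alive -= {l, r}
        rho = next_id
        next_id += 1
        events.append((T, l, r, rho))
        for other in alive:      # only the newly created pairs propose
            propose(rho, other, T)
        alive.add(rho)
    return events
```

Only the n(n−1)/2 initial pairs plus the pairs created by the n − 1 merges ever propose a time, which is the O(n^2) total proposals claimed above.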
The probability at stage i is simply the probability that tliri = Ti, and that all other proposed coalescent times tlr < Ti, conditioned on the fact that the proposed coalescent times tlr for all pairs at stage i are all less than Ti−1. This gives:\n\np(π) = ∏_{i=1}^{n−1} [ p(tliri = Ti | tliri < Ti−1) ∏_{(ρl,ρr) ≠ (ρli,ρri)} p(tlr < Ti | tlr < Ti−1) ]   (6)\n\n= ∏_{i=1}^{n−1} [ ( p(tliri = Ti) / p(tliri < Ti−1) ) ∏_{(ρl,ρr) ≠ (ρli,ρri)} ( p(tlr < Ti) / p(tlr < Ti−1) ) ]   (7)\n\nwhere the second product runs over all pairs in stage i except the winning pair. Each pair that participated in the race has corresponding terms in eq. (7), starting at the stage when the pair entered the race, and ending with the stage when the pair either dropped out or won the stage. As these terms cancel, eq. (7) simplifies to\n\np(π) = ∏_{i=1}^{n−1} p(tliri = Ti) ∏_{ρl ∈ {ρli,ρri}, ρr ∈ Ai−1\\{ρli,ρri}} p(tlr < Ti)   (8)\n\nwhere the second product runs only over those pairs that dropped out after stage i. The first term is the probability of pair (ρli, ρri) coalescing at time Ti given its entrance time, and the second term is the probability of pair (ρl, ρr) dropping out of the race at time Ti given its entrance time. We can verify that this expression equals eq. (1) by plugging in the probabilities for exponential distributions. Finally, multiplying the prior eq. (8) and the likelihood eq. 
(2) we have,\n\np(x, π) = Z0(x) ∏_{i=1}^{n−1} [ Zρi(xi | θi) p(tliri = Ti) ∏_{ρl ∈ {ρli,ρri}, ρr ∈ Ai−1\\{ρli,ρri}} p(tlr < Ti) ]   (9)\n\n3 Efficient SMC Inference on the Coalescent\n\nOur sequential Monte Carlo algorithm for posterior inference is directly inspired by the regenerative race process described above. In fact the algorithm is structurally exactly as in Algorithm 1, but with each pair ρl, ρr proposing a coalescent time from a proposal distribution tlr ∼ Qlr instead of from eq. (4). The idea is that the proposal distribution Qlr is constructed taking into account the observed data, so that Algorithm 1 produces better approximate samples from the posterior.\nThe overall probability of proposing π under the SMC algorithm can be computed similarly to eq. (6)-(8), and is\n\nq(π) = ∏_{i=1}^{n−1} [ qliri(tliri = Ti) ∏_{ρl ∈ {ρli,ρri}, ρr ∈ Ai−1\\{ρli,ρri}} qlr(tlr < Ti) ]   (10)\n\nwhere qlr is the density of Qlr. As both eq. (9) and eq. (10) can be computed sequentially, the weight w associated with each sample π can be computed “on the fly” as the coalescent tree is constructed:\n\nw0 = Z0(x),   wi = wi−1 · ( Zρi(xi | θi) p(tliri = Ti) / qliri(tliri = Ti) ) ∏_{ρl ∈ {ρli,ρri}, ρr ∈ Ai−1\\{ρli,ρri}} p(tlr < Ti) / qlr(tlr < Ti)   (11)\n\nFinally we address the choice of proposal distribution Qlr to use. [1] noted that Zρi(xi | θi) acts as a “local likelihood” term in eq. (9). We make use of this observation and use eq. (4) as a “local prior”, i.e. 
the following density for the proposal distribution Qlr:\n\nqlr(tlr) ∝ Zρlr(xlr | tlr, ρl, ρr, θi−1) p(tlr | Tc(lr))   (12)\n\nwhere ρlr is a hypothetical individual resulting from coalescing ρl and ρr, Tc(lr) denotes the time when the pair (ρl, ρr) enters the race, xlr are the data under ρl and ρr, and p(tlr | Tc(lr)) = e^{tlr − Tc(lr)} I(tlr < Tc(lr)) is simply an exponential density with rate 1 that has been shifted and reflected. I(·) is an indicator function returning 1 if its argument is true, and 0 otherwise.\nThe proposal distribution in [1] also has a form similar to eq. (12), but with the exponential rate being C(n−i+1, 2) instead, if the proposal was made in stage i of the race. This dependence means that at each stage of the race the coalescent time proposal distribution needs to be recomputed for each pair, leading to an O(n^3) computation time. On the other hand, similar to the prior process, we need to propose a coalescent time for each pair only once, when the pair is first created. This results in O(n^2) computational complexity per particle¹.\nNote that it may not always be possible (or efficient) to compute the normalizing constant of the density in eq. (12) (even if we can sample from it efficiently). This means that the weight updates of eq. (11) cannot be computed. In that case, we can use an approximation Z̃ρlr to Zρlr instead. In the following subsection we describe the independent-sites parent-independent model we used in the experiments, and how to construct Z̃ρlr.\n\n¹Technically the time cost is O(n^2(m + log n)), where n is the number of individuals, and m is the cost of sampling from and evaluating eq. (12). 
The additional log n factor comes about because a priority queue needs to be maintained to determine the winner of each stage efficiently, but this is negligible compared to m.\n\n3.1 Independent-Sites Parent-Independent Likelihood Model\n\nIn our experiments we have only considered coalescent clustering of discrete data, though our approach can be applied more generally. Say each data item consists of a D dimensional vector where each entry can take on one of K values. We use the independent-sites parent-independent mutation model over multinomial vectors in [1] as our likelihood model. Specifically, this model assumes that each point on the tree is associated with a D dimensional multinomial vector, and each entry of this vector on each branch of the tree evolves independently (thus independent-sites), forward in time, with mutations occurring at rate λd on entry d. When a mutation occurs, a new value for the entry is drawn from a distribution φd, independently of the previous value at that entry (thus parent-independent). When a coalescent event is encountered, the mutation process evolves independently down both branches.\nSome calculations show that the transition probability matrix of the mutation process associated with entry d on a branch of length t is e^{−λd t} IK + (1 − e^{−λd t}) 1K φd^⊤, where IK is the identity matrix, 1K is a vector of 1's, and we have implicitly represented the multinomial distribution φd as a vector of probabilities. The message for entry d from node ρi on the tree to its parent is a vector M^d_ρi = [M^{d1}_ρi, . . . , M^{dK}_ρi]^⊤, normalized so that φd^⊤ M^d_ρi = 1. The local likelihood term is then:\n\nZ^d_ρlr(xlr | tlr, ρl, ρr, θi−1) = 1 − e^{λd(2tlr − tl − tr)} ( 1 − Σ_{k=1}^{K} φdk M^{dk}_ρl M^{dk}_ρr )   (13)\n\nThe logarithm of the proposal density is then:\n\nlog qlr(tlr) = constant + (tlr − Tc(lr)) + Σ_{d=1}^{D} log Z^d_ρlr(xlr | tlr, ρl, ρr, θi−1)   (14)\n\nThis is not of standard form, and we use an approximation log q̃lr(tlr) instead. Specifically, we use a piecewise linear log q̃lr(tlr), which can easily be sampled from, and for which the normalization term is easy to compute.\nThe approximation is constructed as follows. Note that log Z^d_ρlr(xlr | tlr, ρl, ρr, θi−1), as a function of tlr, is concave if the term inside the parentheses in eq. (13) is positive, convex if negative, and constant if zero. Thus eq. (14) is a sum of linear, concave and convex terms. Using the upper and lower envelopes developed for adaptive rejection sampling [9], we can construct piecewise linear upper and lower envelopes for log qlr(tlr) by upper and lower bounding the concave and convex parts separately. The upper and lower envelopes give exact bounds on the approximation error introduced, and we can efficiently improve the envelopes until a desired approximation error is achieved. Finally, we use the upper bound as our approximate log q̃lr(tlr). Note that the same issue arises in the proposal distribution for SMC-PostPost, and we used the same piecewise linear approximation. The details of this algorithm can be found in [10].\n\n4 Experiments\n\nThe improved computational cost of inference makes it possible to do Bayesian inference for coalescent models on larger datasets. The SMC samplers converge to the exact solution in the limit of infinitely many particles. 
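As a concrete sketch of the per-site local likelihood of eq. (13), which both samplers evaluate inside their proposal distributions, the computation for one entry d might look as follows. This is illustrative code under assumed names (`lam` for λd, `phi` for φd, `M_l`/`M_r` for the child messages), not the authors' implementation.

```python
import math

def local_likelihood(t_lr, t_l, t_r, lam, phi, M_l, M_r):
    """Eq. (13) for one site: the local likelihood of coalescing two
    nodes with messages M_l, M_r and times t_l, t_r at the earlier
    time t_lr, under the independent-sites parent-independent
    mutation model with rate lam and mutation distribution phi."""
    # term inside the parentheses of eq. (13)
    inner = 1.0 - sum(p * a * b for p, a, b in zip(phi, M_l, M_r))
    # branch lengths (t_l - t_lr) and (t_r - t_lr) give the exponent
    return 1.0 - math.exp(lam * (2.0 * t_lr - t_l - t_r)) * inner
```

As tlr recedes into the past the exponential factor vanishes and the term tends to 1; the sign of `inner` determines whether log Z is concave, convex, or constant in tlr, which is what the piecewise linear envelope construction exploits.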
However, it is not enough to be more efficient per particle; the crucial point is how efficient the algorithm is overall. An important question is how many particles we need in practice. To address this question, we compared the performance of our algorithm SMC1 to SMC-PostPost on the synthetic data shown in Figure 1². There are 15 binary 12-dimensional vectors in the dataset. There is overlap between the features of the data points, but the data does not obey a tree structure, which results in a multimodal posterior. Both SMC1 and SMC-PostPost recover the structure with only a few particles. However, there is room for improvement, as the variance in the likelihood obtained from multiple runs decreases with an increasing number of particles. Since both SMC algorithms are exact in the limit, the values should converge as we add more particles. We can check convergence by observing the variance of the likelihood estimates over multiple runs.\n\n²The comparison is done in the importance sampling setting, i.e. without using resampling, for comparison of the proposal distributions.\n\nFigure 1: Synthetic data features are shown on the left; each data point is a binary column vector. A sample tree from the SMC1 algorithm demonstrates that the algorithm can capture the similarity structure. The true covariance of the data (a) and the distance on the tree learned by the SMC1 algorithm averaged over particles (b) are shown, showing that the overall structure was correctly captured. The results obtained from SMC-PostPost were very similar to SMC1 and therefore are not shown here.\n\nThe variance should shrink as we increase the number of particles. Figure 2 shows the change in the estimated likelihood as a function of the number of particles. 
From this figure, we can conclude that the computationally cheaper algorithm SMC1 is also more efficient in the number of particles, as it gives more accurate answers with fewer particles.\n\nFigure 2: The change in the likelihood (left) and the effective sample size (right) as a function of the number of particles for SMC1 (solid) and SMC-PostPost (dashed). The mean estimates of both algorithms are very close, with SMC1 having a much tighter variance. The variances of both algorithms shrink and the effective sample size increases as the number of particles increases.\n\nA quantity of interest in genealogical studies is the time to the most recent common ancestor (MRCA), which is the time of the last coalescence event. Although this quantity has no physical interpretation for hierarchical clustering, it gives us an indication of the variance of the particles. We can observe the variation in the time to the MRCA to assess convergence. Similar to the variance behaviour of the likelihood, with a small number of particles SMC-PostPost has higher variance than SMC1. However, with more particles, the results of the two algorithms almost overlap. The mean time for each step of coalescence, together with its variance, for 7250 particles is depicted in Figure 3 for both algorithms. It is interesting that the first few coalescence times of SMC1 are shorter than those for SMC-PostPost. The distribution of the particle weights is important for the efficiency of the importance sampler. Ideally, the weights would be uniform, such that each particle contributes equally to the posterior estimation. 
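For reference, the effective sample size reported in these comparisons is standardly estimated from the importance weights as ESS = (Σi wi)² / Σi wi²; a minimal sketch (a standard estimator, not code from the paper):

```python
def effective_sample_size(weights):
    """ESS = (sum w)^2 / (sum w^2).  Equals the number of particles
    when all weights are equal, and approaches 1 when a single
    particle's weight dominates."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)
```

With N equal weights the ESS is N; with one dominant weight it is close to 1, which is the degenerate regime discussed below.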
If only a few particles come from a high probability region, the weights of those particles will be much larger than the rest, resulting in a low effective sample size. We will discuss this point more in the next section. Here, we note that for the synthetic dataset the effective sample size of SMC-PostPost is very poor, and that of SMC1 is much higher; see Figure 2.\n\nFigure 3: Times for each coalescence step averaged over 7250 particles. Note that both algorithms converge to almost the same distribution when given enough resources. There is a slight difference in the mean coalescence times. It is interesting that the SMC1 algorithm proposes shorter times for the initial coalescence events.\n\n5 Discussion\n\nWe described an efficient sequential Monte Carlo algorithm for inference in hierarchical clustering models that use Kingman's coalescent as a prior. Our method makes use of a regenerative perspective to construct the coalescent tree. Using this construction, we achieve quadratic run time per particle. By employing a tight upper bound on the local likelihood term, the proposed algorithm is applicable to general data generation processes.\nWe also applied our algorithm to inferring the structure of the phylolinguistic data used in [1]. We used the same Indo-European subset of the data, with the same subset of features, that is, 44 languages with 100 binary features. Three example trees with the largest weights out of 7750 samples are depicted in Figure 4. Unfortunately, on this dataset, the effective sample size of both algorithms is close to one. 
A usual method to circumvent the low effective sample size problem in sequential Monte Carlo algorithms is resampling: detecting, from the partial samples, the particles that will not contribute much to the posterior, pruning them away, and multiplying the promising samples. There are two stages to resampling: deciding at what point to prune away samples, and selecting which samples to prune away. As shown by [11], different problems may require different resampling algorithms. We tried resampling using Algorithm 5.1 of [12]; however, this yielded only a small improvement in the final performance of both algorithms on this data set.\nNote that both algorithms use “local likelihoods” for calculating the weights; the weights are therefore not fully informative about the actual likelihood of the partial sample. Furthermore, in the recursive calculation of the weights in SMC1, to save computation we include the effect of a pair only when it either coalesces or ceases to exist. The partial weights are therefore even less informative about the state of the sample, and the effective sample size cannot fully indicate whether the current sample is good or not. In fact, we observed oscillations in the effective sample size calculated on the weights along the iterations, i.e. starting off with a high value, decreasing to virtually 1, and increasing again before termination, which also indicates that it is not clear which of the particles will eventually be more effective. An open question is how to incorporate a resampling algorithm to improve efficiency.\n\nReferences\n\n[1] Y. W. Teh, H. Daumé III, and D. M. Roy. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20, 2008.\n\n
(a),(b) Samples from a run with 7750 particles\nwithout resampling. (c) Sample from a run with resampling. The values above the trees are nor-\nmalized weights. Note that the weight of (a) is almost one, which means that the contribution from\nthe rest of the particles is in\ufb01nitesimal although the tree structure in (b) also seem to capture the\nsimilarities between languages.\n\n[2] R. M. Neal. De\ufb01ning priors for distributions using Dirichlet diffusion trees. Technical Report\n\n0104, Department of Statistics, University of Toronto, 2001.\n\n[3] C. K. I. Williams. A MCMC approach to hierarchical mixture modelling.\n\nNeural Information Processing Systems, volume 12, 2000.\n\nIn Advances in\n\n[4] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability,\n\n19:27\u201343, 1982. Essays in Statistical Science.\n\n[5] J. F. C. Kingman. The coalescent. Stochastic Processes and their Applications, 13:235\u2013248,\n\n1982.\n\n[6] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach.\n\nJournal of Molecular Evolution, 17:368\u2013376, 1981.\n\n[7] R. C. Grif\ufb01ths and S. Tavare. Simulating probability distributions in the coalescent. Theoretical\n\nPopulation Biology, 46:131\u2013159, 1994.\n\n[8] M. Stephens and P. Donnelly. Inference in molecular population genetics. Journal of the Royal\n\nStatistical Society, 62:605\u2013655, 2000.\n\n[9] W.R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics,\n\n41:337\u2013348, 1992.\n\n[10] D. G\u00a8or\u00a8ur and Y.W. Teh. Concave convex adaptive rejection sampling. Technical report, Gatsby\n\nComputational Neuroscience Unit, 2008.\n\n[11] Y. Chen, J. Xie, and J. Liu. Stopping-time resampling for sequential monte carlo methods.\n\nJournal of the Royal Statistical Society, 67, 2005.\n\n[12] P. Fearnhead. Sequential Monte Carlo Method in Filter Theory. 
PhD thesis, Merton College, University of Oxford, 1998.\n\n[Figure 4 appears here: three coalescent trees over the 44 WALS Indo-European languages, with leaves labelled by family and language (Romance, Germanic, Slavic, Baltic, Celtic, Indic, Iranian, Armenian, Albanian, Greek). Normalized weights: (a) 0.998921, no resampling; (b) 0.000379939, no resampling; (c) 0.0151504, with resampling.]", "award": [], "sourceid": 1020, "authors": [{"given_name": "Dilan", "family_name": "Gorur", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}]}