{"title": "Memoized Online Variational Inference for Dirichlet Process Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1133, "page_last": 1141, "abstract": "Variational inference algorithms provide the most effective framework for large-scale training  of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima.  We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient.  Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples.  Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization.  Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed.  Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.", "full_text": "Memoized Online Variational Inference for\n\nDirichlet Process Mixture Models\n\nMichael C. Hughes and Erik B. Sudderth\n\nDepartment of Computer Science, Brown University, Providence, RI 02912\n\nmhughes@cs.brown.edu, sudderth@cs.brown.edu\n\nAbstract\n\nVariational inference algorithms provide the most effective framework for large-\nscale training of Bayesian nonparametric models. Stochastic online approaches\nare promising, but are sensitive to the chosen learning rate and often converge\nto poor local optima. We present a new algorithm, memoized online variational\ninference, which scales to very large (yet \ufb01nite) datasets while avoiding the com-\nplexities of stochastic gradient. 
Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.\n\n1 Introduction\n\nBayesian nonparametric methods provide a flexible framework for unsupervised modeling of structured data like text documents, time series, and images. They are especially promising for large datasets, as their nonparametric priors should allow complexity to grow smoothly as more data is seen. Unfortunately, contemporary inference algorithms do not live up to this promise, scaling poorly and yielding solutions that represent poor local optima of the true posterior. In this paper, we propose new scalable algorithms capable of escaping local optima. Our focus is on clustering data via the Dirichlet process (DP) mixture model, but our methods are much more widely applicable.\n\nStochastic online variational inference is a promising general-purpose approach to Bayesian nonparametric learning from streaming data [1]. While individual steps of stochastic optimization algorithms are by design scalable, they are extremely vulnerable to local optima for non-convex unsupervised learning problems, frequently yielding poor solutions (see Fig. 2). While taking the best of multiple runs is possible, this is unreliable, expensive, and ineffective in more complex structured models. 
Furthermore, the noisy gradient step size (or learning rate) is an external parameter which must be fine-tuned for best performance, often requiring an expensive validation procedure. Recent work has proposed methods for automatically adapting learning rates [2], but these algorithms\u2019 progress on the overall variational objective remains local and non-monotonic.\n\nIn this paper, we present an alternative algorithm, memoized online variational inference, which avoids noisy gradient steps and learning rates altogether. Our method is useful when all data may not fit in memory, but we can afford multiple full passes through the data by processing successive batches. The algorithm visits each batch in turn and updates a cached set of sufficient statistics which accurately reflect the entire dataset. This allows rapid and noise-free updates to global parameters at every step, quickly propagating information and speeding convergence. Our memoized approach is generally applicable in any case where batch or stochastic online methods are useful, including topic models [1] and relational models [3], though we do not explore these here.\n\nWe further develop a principled framework for escaping local optima in the online setting, by integrating birth and merge moves within our algorithm\u2019s coordinate ascent steps. Most existing mean-field algorithms impose a restrictive fixed truncation in the number of components, which is hard to set a priori on big datasets: either it is too small and inexpressive, or too large and computationally inefficient. Our birth and merge moves, together with a nested variational approximation to the posterior, enable adaptive creation and pruning of clusters on-the-fly. Because these moves are validated by an exactly tracked global variational objective, we avoid potential instabilities of stochastic online split-merge proposals [4]. 
The structure of our moves is very different from split-merge MCMC methods [5, 6]; applications of these algorithms have been limited to hundreds of data points, while our experiments show scaling of memoized split-merge proposals to millions of examples.\n\nWe review the Dirichlet process mixture model and variational inference in Sec. 2, outline our novel memoized algorithm in Sec. 3, and evaluate on clustering and denoising applications in Sec. 4.\n\n2 Variational inference for Dirichlet process mixture models\n\nThe Dirichlet process (DP) provides a nonparametric prior for partitioning exchangeable datasets into discrete clusters [7]. An instantiation G of a DP is an infinite collection of atoms, each of which represents one mixture component. Component k has mixture weight w_k sampled as follows:\n\nG \sim \mathrm{DP}(\alpha_0 H), \quad G \triangleq \sum_{k=1}^{\infty} w_k \delta_{\phi_k}, \quad v_k \sim \mathrm{Beta}(1, \alpha_0), \quad w_k = v_k \prod_{\ell=1}^{k-1} (1 - v_\ell). \quad (1)\n\nThis stick-breaking process provides mixture weights and parameters. Each data item n chooses an assignment z_n \sim \mathrm{Cat}(w), and then draws observations x_n \sim F(\phi_{z_n}). The data-generating parameter \phi_k is drawn from a base measure H with natural parameters \lambda_0. We assume both H and F belong to exponential families with log-normalizers a and sufficient statistics t:\n\np(\phi_k \mid \lambda_0) = \exp\{\lambda_0^T t_0(\phi_k) - a_0(\lambda_0)\}, \qquad p(x_n \mid \phi_k) = \exp\{\phi_k^T t(x_n) - a(\phi_k)\}. \quad (2)\n\nFor simplicity, we assume unit reference measures. The goal of inference is to recover stick-breaking proportions v_k and data-generating parameters \phi_k for each global mixture component k, as well as discrete cluster assignments z = \{z_n\}_{n=1}^{N} for each observation. 
The joint distribution is\n\np(x, z, \phi, v) = \prod_{n=1}^{N} F(x_n \mid \phi_{z_n}) \mathrm{Cat}(z_n \mid w(v)) \prod_{k=1}^{\infty} \mathrm{Beta}(v_k \mid 1, \alpha_0) H(\phi_k \mid \lambda_0). \quad (3)\n\nWhile our algorithms are directly applicable to any DP mixture of exponential families, our experiments focus on D-dimensional real-valued data x_n, for which we take F to be Gaussian. For some data, we consider full-mean, full-covariance analysis (where H is normal-Wishart), while other applications consider zero-mean, full-covariance analysis (where H is Wishart).\n\n2.1 Mean-field variational inference for DP mixture models\n\nTo approximate the full (but intractable) posterior over variables z, v, \phi, we consider a fully-factorized variational distribution q, with individual factors from appropriate exponential families:1\n\nq(z, v, \phi) = \prod_{n=1}^{N} q(z_n \mid \hat{r}_n) \prod_{k=1}^{K} q(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0}) q(\phi_k \mid \hat{\lambda}_k), \quad (4)\n\nq(z_n) = \mathrm{Cat}(z_n \mid \hat{r}_{n1}, \ldots, \hat{r}_{nK}), \quad q(v_k) = \mathrm{Beta}(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0}), \quad q(\phi_k) = H(\phi_k \mid \hat{\lambda}_k). \quad (5)\n\nTo tractably handle the infinite set of components available under the DP prior, we truncate the discrete assignment factor to enforce q(z_n = k) = 0 for k > K. This forces all data to be explained by only the first K components, inducing conditional independence between observed data and any global parameters v_k, \phi_k with index k > K. Inference may thus focus exclusively on a finite set of K components, while reasonably approximating the true infinite posterior for large K.\n\n1To ease notation, we mark variables with hats to distinguish parameters \hat{\theta} of variational factors q from parameters \theta of the generative model p. 
In this way, \theta_k and \hat{\theta}_k always have equal dimension.\n\nCrucially, our truncation is nested: any learned q with truncation K can be represented exactly under truncation K+1 by setting the final component to have zero mass. This truncation, previously advocated by [8, 4], has considerable advantages over non-nested direct truncation of the stick-breaking process [7], which places artificially large mass on the final component. It is more efficient and broadly applicable than an alternative truncation which sets the stick-breaking \u201ctail\u201d to its prior [9].\n\nVariational algorithms optimize the parameters of q to minimize the KL divergence from the true, intractable posterior [7]. The optimal q maximizes the evidence lower bound (ELBO) objective L:\n\n\log p(x \mid \alpha_0, \lambda_0) \geq \mathcal{L}(q) \triangleq \mathbb{E}_q\big[ \log p(x, v, z, \phi \mid \alpha_0, \lambda_0) - \log q(v, z, \phi) \big]. \quad (6)\n\nFor DP mixtures of exponential family distributions, L(q) has a simple form. For each component k, we store its expected mass \hat{N}_k and expected sufficient statistic s_k(x):\n\n\hat{N}_k \triangleq \mathbb{E}_q\Big[\sum_{n=1}^{N} z_{nk}\Big] = \sum_{n=1}^{N} \hat{r}_{nk}, \qquad s_k(x) \triangleq \mathbb{E}_q\Big[\sum_{n=1}^{N} z_{nk} t(x_n)\Big] = \sum_{n=1}^{N} \hat{r}_{nk} t(x_n). \quad (7)\n\nAll but one term in L(q) can then be written using only these summaries and expectations of the global parameters v, \phi:\n\n\mathcal{L}(q) = \sum_{k=1}^{K} \Big( \mathbb{E}_q[\phi_k]^T s_k(x) - \hat{N}_k \mathbb{E}_q[a(\phi_k)] + \hat{N}_k \mathbb{E}_q[\log w_k(v)] - \sum_{n=1}^{N} \hat{r}_{nk} \log \hat{r}_{nk} + \mathbb{E}_q\Big[\log \frac{\mathrm{Beta}(v_k \mid 1, \alpha_0)}{q(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0})}\Big] + \mathbb{E}_q\Big[\log \frac{H(\phi_k \mid \lambda_0)}{q(\phi_k \mid \hat{\lambda}_k)}\Big] \Big). \quad (8)\n\nExcluding the entropy term -\sum_{n,k} \hat{r}_{nk} \log \hat{r}_{nk}, which we discuss later, this bound is a simple linear function of the summaries \hat{N}_k, s_k(x). Given precomputed entropies and summaries, evaluation of L(q) can be done in time independent of the data size N. 
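As a concrete illustration, the per-component summaries in Eq. (7) reduce to two array reductions. This is a minimal sketch, not the authors' released code, assuming the identity sufficient statistic t(x) = x; a Gaussian model would additionally accumulate outer products x_n x_n^T:

```python
import numpy as np

def summarize(resp, X):
    """Compute expected counts and expected sufficient statistics, as in Eq. (7).

    resp : (N, K) responsibilities r_nk from q(z_n)
    X    : (N, D) data, with t(x) = x assumed for illustration
    Returns (Nk, sk) with shapes (K,) and (K, D).
    """
    Nk = resp.sum(axis=0)   # expected mass: N_k = sum_n r_nk
    sk = resp.T @ X         # expected statistic: s_k(x) = sum_n r_nk t(x_n)
    return Nk, sk
```

Because the ELBO in Eq. (8) is linear in these summaries, it can be evaluated from (Nk, sk) alone, without revisiting the N data points.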
We next review variational algorithms for optimizing q via coordinate ascent, iteratively updating individual factors of q. We describe algorithms in terms of two updates [1]: global parameters (stick-breaking proportions v_k and data-generating parameters \phi_k), and local parameters (assignments of data to components z_n).\n\n2.2 Full-dataset variational inference\n\nStandard full-dataset variational inference [7] updates local factors q(z_n \mid \hat{r}_n) for all observations n = 1, \ldots, N by visiting each item n and computing the fraction \hat{r}_{nk} explained by component k:\n\n\tilde{r}_{nk} = \exp\big( \mathbb{E}_q[\log w_k(v)] + \mathbb{E}_q[\log p(x_n \mid \phi_k)] \big), \qquad \hat{r}_{nk} = \frac{\tilde{r}_{nk}}{\sum_{\ell=1}^{K} \tilde{r}_{n\ell}}. \quad (9)\n\nNext, we update global factors q(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0}), q(\phi_k \mid \hat{\lambda}_k) for each component k. After computing summary statistics \hat{N}_k, s_k(x) given the new \hat{r}_{nk} via Eq. (7), the update equations become\n\n\hat{\alpha}_{k1} = \alpha_1 + \hat{N}_k, \qquad \hat{\alpha}_{k0} = \alpha_0 + \sum_{\ell=k+1}^{K} \hat{N}_\ell, \qquad \hat{\lambda}_k = \lambda_0 + s_k(x). \quad (10)\n\nWhile simple and guaranteed to converge, this approach scales poorly to big datasets. Because global parameters are updated only after a full pass through the data, information propagates slowly.\n\n2.3 Stochastic online variational inference\n\nStochastic online (SO) variational inference scales to huge datasets [1]. Instead of analyzing all data at once, SO processes only a subset (\u201cbatch\u201d) B_t at each iteration t. These subsets are assumed sampled uniformly at random from a larger (but fixed size N) corpus. Given a batch, SO first updates local factors q(z_n) for n \in B_t via Eq. (9). It then updates global factors via a noisy gradient step, using sufficient statistics of q(z_n) from only the current batch. 
These steps optimize a noisy function, which in expectation (with respect to batch sampling) converges to the true objective (6).\n\nNatural gradient steps are computationally tractable for exponential family models, involving nearly the same computations as the full-dataset updates [1]. For example, to update the variational parameter \hat{\lambda}_k from (5) at iteration t, we first compute the global update given only data in the current batch, amplified to be at full-dataset scale: \hat{\lambda}_k^* = \lambda_0 + \frac{N}{|B_t|} s_k(B_t). Then, we interpolate between this and the previous global parameters to arrive at the final result: \hat{\lambda}_k^{(t)} \leftarrow \rho_t \hat{\lambda}_k^* + (1 - \rho_t) \hat{\lambda}_k^{(t-1)}. The learning rate \rho_t controls how \u201cforgetful\u201d the algorithm is of previous values; if it decays at appropriate rates, stochastic inference provably converges to a local optimum of the global objective L(q) [1].\n\nThis online approach has clear computational advantages and can sometimes yield higher quality solutions than the full-data algorithm, since it conveys information between local and global parameters more frequently. However, performance is extremely sensitive to the learning rate decay schedule and choice of batch size, as we demonstrate in later experiments.\n\n3 Memoized online variational inference\n\nGeneralizing previous incremental variants of the expectation maximization (EM) algorithm [10], we now develop our memoized online variational inference algorithm. We divide the data into B fixed batches \{B_b\}_{b=1}^{B}. For each batch, we maintain memoized sufficient statistics S_k^b = [\hat{N}_k(B_b), s_k(B_b)] for each component k. We also track the full-dataset statistics S_k^0 = [\hat{N}_k, s_k(x)]. These compact summary statistics allow guarantees of correct full-dataset analysis while processing only one small batch at a time. 
Our approach hinges on the fact that these sufficient statistics are additive: summaries of an entire dataset can be written exactly as the addition of summaries of distinct batches. Note that our memoization of deterministic analyses of batches of data is distinct from the stochastic memoization, or \u201clazy\u201d instantiation, of random variables in some Monte Carlo methods [11, 12].\n\nMemoized inference proceeds by visiting (in random order) each distinct batch once in a full pass through the data, incrementally updating the local and global parameters related to that batch b. First, we update local parameters for the current batch (q(z_n \mid \hat{r}_n) for n \in B_b) via Eq. (9). Next, we update cached global sufficient statistics for each component: we subtract the old (cached) summary of batch b, compute a new batch-level summary, and add the result to the full-dataset summary:\n\nS_k^0 \leftarrow S_k^0 - S_k^b, \qquad S_k^b \leftarrow \Big[ \sum_{n \in B_b} \hat{r}_{nk}, \; \sum_{n \in B_b} \hat{r}_{nk} t(x_n) \Big], \qquad S_k^0 \leftarrow S_k^0 + S_k^b. \quad (11)\n\nFinally, given the new full-dataset summary S_k^0, we update global parameters exactly as in Eq. (10). Unlike stochastic online algorithms, memoized inference is guaranteed to improve the full-dataset ELBO at every step. Correctness follows immediately from the arguments in [10]. By construction, each local or global step will result in a new q that strictly increases the objective L(q).\n\nIn the limit where B = 1, memoized inference reduces to standard full-dataset updates. However, given many batches it is far more scalable, while maintaining all guarantees of batch inference. Furthermore, it generally converges faster than the full-dataset algorithm due to frequently interleaving global and local updates. 
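The subtract-recompute-add pattern of Eq. (11) can be sketched in a few lines. This is an illustrative simplification, assuming the identity sufficient statistic t(x) = x; the class and method names are hypothetical, not taken from the released code:

```python
import numpy as np

class MemoizedStats:
    """Cached per-batch and full-dataset summaries, updated as in Eq. (11).

    Each row of a summary holds [N_k, s_k] for one component;
    t(x) = x is assumed purely for illustration.
    """

    def __init__(self, K, D, n_batches):
        self.full = np.zeros((K, 1 + D))               # S^0: full-dataset summary
        self.batch = np.zeros((n_batches, K, 1 + D))   # S^b: one cached summary per batch

    def update_batch(self, b, resp, Xb):
        """Visit batch b: recompute its summary, swap it into the full-dataset total."""
        new = np.hstack([resp.sum(axis=0)[:, None], resp.T @ Xb])
        self.full += new - self.batch[b]   # subtract stale S^b, add fresh S^b
        self.batch[b] = new
        return self.full
```

Because `full` always equals the exact sum of the current per-batch summaries, the global update of Eq. (10) sees noise-free full-dataset statistics after every batch visit.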
Provided we can store memoized sufficient statistics for each batch (not each observation), memoized inference has the same computational complexity as stochastic methods while avoiding noise and sensitivity to learning rates. Recent analysis of convex optimization algorithms [13] demonstrated theoretical and practical advantages for methods that use cached full-dataset summaries to update parameters, as we do, instead of stochastic current-batch-only updates.\n\nThis memoized algorithm can compute the full-dataset objective L(q) exactly at any point (after visiting all items once). To do so efficiently, we need to compute and store the assignment entropy H_k^b = -\sum_{n \in B_b} \hat{r}_{nk} \log \hat{r}_{nk} after visiting each batch b. We also need to track the full-data entropy H_k^0 = \sum_{b=1}^{B} H_k^b, which is additive just like the sufficient statistics and incrementally updated after each batch. Given both H_k^0 and S_k^0, evaluation of the full-dataset ELBO in Eq. (8) is exact and rapid.\n\n3.1 Birth moves to escape local optima\n\nWe now propose additional birth moves that, when interleaved with conventional coordinate ascent parameter updates, can add useful new components to the model and escape local optima. Previous methods [14, 9, 4] for changing variational truncations create just one extra component via a \u201csplit\u201d move that is highly-specialized to particular likelihoods. Wang and Blei [15] explore truncation levels via a local collapsed Gibbs sampler, but samplers are slow to make large changes. 
In contrast, our births add many components at once and apply to any exponential family mixture model.\n\n[Figure 1 appears here. Panel annotations: subsample data explained by component 1; learn fresh DP-GMM on subsample via VB; add fresh components to expand original model; 1) create new components, 2) adopt in one pass thru data, 3) merge to remove redundancy; batches not yet updated on this pass do not use any new components.]\n\nFigure 1: One pass through a toy dataset for memoized learning with birth and merge moves (MO-BM), showing creation (left) and adoption (right) of new components. Left: Scatter plot of 2D observed data, and a subsample targeted via the first mixture component. Elliptical contours show component covariance matrices. Right: Bar plots of memoized counts \hat{N}_k^b for each batch. Not shown: Memoized sufficient statistics s_k^b.\n\nCreating new components in the online setting is challenging. Each small batch may not have enough examples of a missing component to inspire a good proposal, even if that component is well-supported by the full dataset. We thus advocate birth moves that happen in three phases (collection, creation, and adoption) over two passes of the data. The first pass collects a targeted data sample more likely to yield informative proposals than a small, predefined batch. The second pass, shown in Fig. 1, creates new components and then updates every batch with the expanded model. Successive births are interleaved; at each pass there are proposals both active and in preparation. We sketch out each step of the algorithm below. 
For complete details, see the supplement.\n\nCollection During pass 1, we collect a targeted subsample x\u2032 of the data, of size at most N\u2032 = 10^4. This subsample targets a single component k\u2032. When visiting each batch, we copy data x_n into x\u2032 if \hat{r}_{nk\u2032} > \tau (we set \tau = 0.1). This threshold test ensures the subsample contains related data, but also promotes diversity by considering data explained partially by other components k \neq k\u2032. Targeted samples vary from iteration to iteration because batches are visited in distinct, random orders.\n\nCreation Before pass 2, we create new components by fitting a DP mixture model with K\u2032 (we take K\u2032 = 10) components to x\u2032, running variational inference for a limited budget of iterations. Taking advantage of our nested truncation, we expand our current model to include all K + K\u2032 components, as shown in Fig. 1. Unlike previous work [9, 4], we do not immediately assess the change in ELBO produced by these new components, and always accept them. We rely on subsequent merge moves (Sec. 3.2) to remove unneeded components.\n\nAdoption During pass 2, we visit each batch and perform local and global parameter updates for the expanded (K + K\u2032)-component mixture. These updates use expanded global summaries S^0 that include summaries S\u2032 from the targeted analysis of x\u2032. This results in two interpretations of the subset x\u2032: assignment to original components (mostly k\u2032) and assignment to brand-new components. If the new components are favored, they will gain mass via new assignments made at each batch. After the pass, we subtract away S\u2032 to yield both S^0 and global parameters exactly consistent with the data x. 
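The collection phase alone amounts to a thresholded filter over batches. The sketch below is a hypothetical helper (the function name and the `resp_fn` callback are ours, not the paper's), assuming responsibilities can be recomputed for each batch under the current model:

```python
import numpy as np

def collect_subsample(batches, resp_fn, k_target, tau=0.1, max_n=10000):
    """Collection phase of a birth move: keep items x_n with r_{n,k'} > tau.

    batches  : list of (N_b, D) data arrays
    resp_fn  : callback returning the (N_b, K) responsibilities of a batch
    k_target : index k' of the component targeted by this birth
    """
    kept = []
    for Xb in batches:
        resp = resp_fn(Xb)
        kept.append(Xb[resp[:, k_target] > tau])
    return np.vstack(kept)[:max_n]   # cap the subsample at size N'
```

The creation phase would then fit a small DP-GMM to the returned subsample, and the adoption pass would proceed with the expanded K + K' model as described above.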
Any nearly-empty new component will likely be pruned away by later merges.\n\nBy adding many components at once, our birth move allows rapid escape from poor local optima. Alone, births may sometimes cause a slight ELBO decrease by adding unnecessary components. However, in practice merge moves reliably reject poor births and restore original configurations. In Sec. 4, births are so effective that runs started at K = 1 recover necessary components on-the-fly.\n\n3.2 Merge moves that optimize the full data objective\n\nThe computational cost of inference grows with the number of components K. To keep K small, we develop merge moves that replace two components with a single merged one. Merge moves were first explored for batch variational methods [16, 14]. For hierarchical DP topic models, stochastic\n\n[Figure 2 appears here. Panels: data (5x5 patches), worst MO-BM, worst MO, worst Full, best SOb.]\n\nFigure 2: Comparison of full-data, stochastic (SO), and memoized (MO) on toy data with K = 8 true components (Sec. 4.1). Top: Trace of ELBO during training across 10 runs. SO compared with learning rates a, b, c. Bottom Left: Example patch generated by each component. 
Bottom: Covariance matrix and weights w_k found by one run of each method, aligned to true components. \u201cX\u201d: no comparable component found.\n\nvariational inference methods have been augmented to evaluate merge proposals based on noisy, single-batch estimates of the ELBO [4]. This can result in accepted merges that decrease the full-data objective (see Sec. 4.1 for an empirical illustration). In contrast, our algorithm accurately computes the full ELBO for each merge proposal, ensuring only useful merges are accepted.\n\nGiven two components k_a, k_b to merge, we form a candidate q\u2032 with K - 1 components, where merged component k_m takes over all assignments to k_a, k_b: \hat{r}_{n k_m} = \hat{r}_{n k_a} + \hat{r}_{n k_b}. Instead of computing new assignments explicitly, additivity allows direct construction of merged global sufficient statistics: S_{k_m}^0 = S_{k_a}^0 + S_{k_b}^0. Merged global parameters follow from Eq. (10).\n\nOur merge move has three steps: select components, form the candidate configuration q\u2032, and accept q\u2032 if the ELBO improves. Selecting k_a, k_b to merge at random is unlikely to yield an improved configuration. After choosing k_a at random, we select k_b using a ratio of marginal likelihoods M which compares the merged and separated configurations, easily computed with cached summaries:\n\np(k_b \mid k_a) \propto \frac{M(S_{k_a} + S_{k_b})}{M(S_{k_a}) M(S_{k_b})}, \qquad M(S_k) = \exp\big( a_0(\lambda_0 + s_k(x)) \big). \quad (12)\n\nOur memoized approach allows exact evaluation of the full-data ELBO to compare the existing q to merge candidate q\u2032. As shown in Eq. (8), evaluating L(q\u2032) is a linear function of merged sufficient statistics, except for the assignment entropy term: H_{ab} = -\sum_{n=1}^{N} (\hat{r}_{n k_a} + \hat{r}_{n k_b}) \log(\hat{r}_{n k_a} + \hat{r}_{n k_b}). We compute this term in advance for all possible merge pairs. This requires storing one set of K(K-1)/2 scalars, one per candidate pair, for each batch. 
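A sketch of this per-batch pairwise entropy precomputation (our illustrative code, not the released implementation):

```python
import numpy as np

def merge_entropy_table(resp):
    """For one batch, precompute H_ab = -sum_n (r_na + r_nb) log(r_na + r_nb)
    for every candidate merge pair (ka, kb).

    resp : (N, K) responsibilities for the batch.
    Returns a (K, K) array whose upper triangle holds the K(K-1)/2 scalars.
    """
    N, K = resp.shape
    H = np.zeros((K, K))
    for ka in range(K):
        for kb in range(ka + 1, K):
            r = resp[:, ka] + resp[:, kb]
            # clip to avoid log(0) when a pair has zero combined mass on an item
            H[ka, kb] = -np.sum(r * np.log(np.maximum(r, 1e-300)))
    return H
```

Summed over batches, these tables supply the only non-linear term of the merged ELBO, so every candidate merge can be scored exactly from cached quantities.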
This modest precomputation allows rapid and exact merge moves, which improve model quality and speed up post-merge iterations.\n\nIn one pass of the data, our algorithm performs a birth, memoized ascent steps for all batches, and several merges after the final batch. After a few passes, it recovers high-quality, compact structure.\n\n4 Experimental results\n\nWe now compare algorithms for learning DP-Gaussian mixture models (DP-GMM), using our own implementations of full-dataset, stochastic online (SO), and memoized online (MO) inference, as well as our new birth-merge memoized algorithm (MO-BM). Code is available online. To examine SO\u2019s sensitivity to learning rate, we use a recommended [1] decay schedule \rho_t = (t + d)^{-\kappa} with three diverse settings: a) \kappa = 0.5, d = 10; b) \kappa = 0.5, d = 100; and c) \kappa = 0.9, d = 10.\n\n4.1 Toy data: How reliably do algorithms escape local optima?\n\nWe first study N = 100000 synthetic image patches generated by a zero-mean GMM with 8 equally-common components. Each one is defined by a 25 \u00d7 25 covariance matrix producing 5 \u00d7 5 patches with a strong edge. We investigate whether algorithms recover the true K = 8 structure. Each fixed-truncation method runs from 10 fixed random initializations with K = 25, while MO-BM starts at K = 1. 
Online methods traverse 100 batches (1000 examples per batch).\n\n[Figure 3 appears here. Top panels: smart (k-means++) initialization vs. random initialization.]\n\nFigure 3: MNIST. Top: Comparison of final ELBO for multiple runs of each method, varying initialization and number of batches. Stochastic online (SO) compared at learning rates a, b, c. Bottom left: Visualization of cluster means for MO-BM\u2019s best run. Bottom center: Evaluation of cluster alignment to true digit label. Bottom right: Growth in truncation-level K as more data visited with MO-BM.\n\nFig. 2 traces the training-set ELBO as more data arrives for each algorithm and shows estimated covariance matrices for the top 8 components for select runs. Even the best runs of SO do not recover ideal structure. In contrast, all 10 runs of our birth-merge algorithm find all 8 components, despite initialization at K = 1. 
The ELBO trace plots show this method escaping local optima, with slight drops indicating addition of new components followed by rapid increases as these are adopted. They further suggest that our fixed-truncation memoized method competes favorably with full-data inference, often converging to similar or better solutions after fewer passes through the data.\n\nThe fact that our MO-BM algorithm only performs merges that improve the full-data ELBO is crucial. Fig. 2 shows trace plots of GreedyMerge, a memoized online variant that instead uses only the current-batch ELBO to assess a proposed merge, as done in [4]. Given small batches (1000 examples each), there is not always enough data to warrant many distinct 25 \u00d7 25 covariance components. Thus, this method favors merges that in fact remove vital structure. All 5 runs of this GreedyMerge algorithm ruinously accept merges that decrease the full objective, consistently collapsing down to just one component. Our memoized approach ensures merges are always globally beneficial.\n\n4.2 MNIST digit clustering\n\nWe now compare algorithms for clustering N = 60000 MNIST images of handwritten digits 0-9. We preprocess as in [9], projecting each image down to D = 50 dimensions via PCA. Here, we also compare to Kurihara\u2019s public implementation of variational inference with split moves [9]. MO-BM and Kurihara start at K = 1, while other methods are given 10 runs from two K = 100 initialization routines: random and smart (based on k-means++ [17]). For online methods, we compare 20 and 100 batches, and three learning rates. All runs complete 200 passes through the full dataset.\n\nThe final ELBO values for every run of each method are shown in Fig. 3. SO\u2019s performance varies dramatically across initialization, learning rate, and number of batches. Under random initialization, SO reaches especially poor local optima (note lower y-axis scale). 
In contrast, our memoized\napproach consistently delivers solutions on par with full inference, with no apparent sensitivity to\nthe number of batches. With births and merges enabled, MO-BM expands from K = 1 to over 80\ncomponents, \ufb01nding better solutions than every smart K = 100 initialization. MO-BM even outper-\nforms Kurihara\u2019s of\ufb02ine split algorithm, yielding 30-40 more components and higher ELBO values.\nAltogether, Fig. 3 exposes SO\u2019s extreme sensitivity, validates MO as a more reliable alternative, and\nshows that our birth-merge algorithm is more effective at avoiding local optima.\n\nFig. 3 also shows cluster means learned by the best MO-BM run, covering many styles of each digit.\nWe further compute a hard segmentation of the data using the q(z) from smart initialization runs.\nEach DP-GMM cluster is aligned to one digit by majority vote of its members. A plot of alignment\naccuracy in Fig. 3 shows our MO-BM consistently among the best, with SO lagging signi\ufb01cantly.\n\n7\n\n\f7\n\n0\n1\nx\n \n \n \n\ne\nc\nn\ne\nd\nv\ne\n\ni\n\n \n\ng\no\n\nl\n\n\u22121.55\n\u22121.56\n\u22121.57\n\u22121.58\n\u22121.59\n\u22121.6\n\u22121.61\n\u22121.62\n\n \n\n5\n\n \n\nSOa K=100\nSOb K=100\nFull K=100\nMO K=100\nMO\u2212BM K=1\n\n15\n\n10\n40\nnum. passes thru data (N=108754)\n\n30\n\n25\n\n20\n\n35\n\n45\n\n50\n\nFigure 4: SUN-397 tiny images. Left: ELBO during training. Right: Visualization of 10 of 28 learned clusters\nfor best MO-BM run. Each column shows two images from the top 3 categories aligned to one cluster.\n\n8\n\n0\n1\nx\n \n \n \n\ne\nc\nn\ne\nd\nv\ne\n\ni\n\n \n\ng\no\n\nl\n\n4.45\n\n4.4\n\n4.35\n\n4.3\n\n4.25\n\n \n\n10\n\n \n\nMO\u2212BM K=1\nMO K=100\nSOa K=100\n\n20\n80\nnum. passes thru data (N=1880200)\n\n70\n\n50\n\n40\n\n30\n\n60\n\n90 100\n\nK\n \ns\nt\n\nn\ne\nn\no\np\nm\no\nc\n \n.\n\nm\nu\nn\n\n300\n250\n200\n150\n100\n50\n0\n \n0\n\nMO\u2212BM K=1\nMO K=100\nSOa K=100\n\n10\n\n20\n80\nnum. 
Figure 5: 8 × 8 image patches. Left: ELBO during training, N = 1.88 million. Center: Effective truncation-level K during training, N = 1.88 million. Right: ELBO during training, N = 8.64 million.

4.3 Tiny image clustering

We next learn a full-mean DP-GMM for tiny, 32 × 32 images from the SUN-397 scene categories dataset [18]. We preprocess all 108754 color images via PCA, projecting each example down to D = 50 dimensions. We start MO-BM at K = 1, while other methods have fixed K = 100. Fig. 4 plots the training ELBO as more data is seen. Our MO-BM runs surpass all other algorithms.

To verify quality, Fig. 4 shows images from the 3 most-related scene categories for each of several clusters found by MO-BM. For each learned cluster k, we rank all 397 categories to find those with the largest fraction of members assigned to k via r̂·k. The result is quite sensible, with clusters for tall free-standing objects, swimming pools and lakes, doorways, and waterfalls.

4.4 Image patch modeling

Our last experiment applies a zero-mean, full-covariance DP-GMM to learn the covariance structures of natural image patches, inspired by [19, 20]. We compare online algorithms on N = 1.88 million 8 × 8 patches, a dense subsampling of all patches from 200 images of the Berkeley Segmentation dataset. Fig. 5 shows that our birth-merge memoized algorithm started at K = 1 can consistently add useful components and reach better solutions than alternatives.
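The dense patch preprocessing described above can be sketched as follows; this is a minimal reconstruction in which the function name and the per-patch mean removal are assumptions matching the zero-mean modeling choice, not the exact pipeline of [19].

```python
import numpy as np

def dense_zero_mean_patches(img, p=8):
    """Densely extract every p x p patch from a 2-D grayscale image,
    flatten each to a length p*p vector, and subtract each patch's mean
    (its DC component) to match the zero-mean DP-GMM assumption."""
    H, W = img.shape
    X = np.stack([img[i:i + p, j:j + p].ravel()
                  for i in range(H - p + 1)
                  for j in range(W - p + 1)])
    return X - X.mean(axis=1, keepdims=True)

img = np.arange(100, dtype=float).reshape(10, 10)
X = dense_zero_mean_patches(img, p=8)
# (10 - 8 + 1)^2 = 9 patches, each a 64-dimensional zero-mean vector
```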
We also examined a much bigger dataset of N = 8.64 million patches, and still see advantages for our MO-BM.

Finally, we perform denoising on 30 heldout images, using code from [19]. Our best MO-BM run on the 1.88 million patch dataset achieves a PSNR of 28.537 dB, within 0.05 dB of the PSNR achieved by [19]'s publicly-released GMM with K = 200 trained on a similar corpus. This performance is visually indistinguishable, highlighting the practical value of our new algorithm.

5 Conclusions

Our novel memoized online variational algorithm avoids the noisiness and sensitivity inherent in stochastic methods. Our birth and merge moves successfully escape local optima. These innovations are applicable to common nonparametric models beyond the Dirichlet process.

Acknowledgments This research was supported in part by ONR Award No. N00014-13-1-0644. M. Hughes was supported in part by an NSF Graduate Research Fellowship under Grant No. DGE0228243.

References

[1] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14:1303-1347, 2013.

[2] R. Ranganath, C. Wang, D. Blei, and E. Xing. An adaptive learning rate for stochastic variational inference. In ICML, 2013.

[3] P. Gopalan, D. M. Mimno, S. Gerrish, M. J. Freedman, and D. M. Blei. Scalable inference of overlapping communities. In NIPS, 2012.

[4] M. Bryant and E. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In NIPS, 2012.

[5] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158-182, 2004.

[6] D. B. Dahl. Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Submitted to Journal of Computational and Graphical Statistics, 2005.

[7] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixture models. Bayesian Analysis, 1(1):121-144, 2006.

[8] Y. W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS, 2008.

[9] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational Dirichlet process mixtures. In NIPS, 2006.

[10] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 1999.

[11] O. Papaspiliopoulos and G. O. Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, 95(1):169-186, 2008.

[12] N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. Tenenbaum. Church: A language for generative models. In Uncertainty in Artificial Intelligence, 2008.

[13] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, 2012.

[14] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(1):1223-1241, 2002.

[15] C. Wang and D. Blei. Truncation-free stochastic variational inference for Bayesian nonparametric models. In NIPS, 2012.

[16] N. Ueda, R. Nakano, Z. Ghahramani, and G. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9):2109-2128, 2000.

[17] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.

[18] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

[19] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In ICCV, 2011.

[20] D. Zoran and Y. Weiss. Natural images, Gaussian mixtures and dead leaves. In NIPS, 2012.