{"title": "Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "We develop a sequential low-complexity inference procedure for Dirichlet process mixtures of Gaussians for online clustering and parameter estimation when the number of clusters are unknown a-priori. We present an easily computable, closed form parametric expression for the conditional likelihood, in which hyperparameters are recursively updated as a function of the streaming data assuming conjugate priors. Motivated by large-sample asymptotics, we propose a noveladaptive low-complexity design for the Dirichlet process concentration parameter and show that the number of classes grow at most at a logarithmic rate. We further prove that in the large-sample limit, the conditional likelihood and datapredictive distribution become asymptotically Gaussian. We demonstrate through experiments on synthetic and real data sets that our approach is superior to otheronline state-of-the-art methods.", "full_text": "Adaptive Low-Complexity Sequential Inference for\n\nDirichlet Process Mixture Models\n\nTheodoros Tsiligkaridis, Keith W. Forsythe\n\nMassachusetts Institute of Technology, Lincoln Laboratory\n\nLexington, MA 02421 USA\n\nttsili@ll.mit.edu, forsythe@ll.mit.edu\n\nAbstract\n\nWe develop a sequential low-complexity inference procedure for Dirichlet pro-\ncess mixtures of Gaussians for online clustering and parameter estimation when\nthe number of clusters are unknown a-priori. We present an easily computable,\nclosed form parametric expression for the conditional likelihood, in which hyper-\nparameters are recursively updated as a function of the streaming data assuming\nconjugate priors. 
Motivated by large-sample asymptotics, we propose a novel\nadaptive low-complexity design for the Dirichlet process concentration parame-\nter and show that the number of classes grows at most at a logarithmic rate. We\nfurther prove that in the large-sample limit, the conditional likelihood and data\npredictive distribution become asymptotically Gaussian. We demonstrate through\nexperiments on synthetic and real data sets that our approach is superior to other\nonline state-of-the-art methods.\n\n1\n\nIntroduction\n\nDirichlet process mixture models (DPMM) have been widely used for clustering data Neal (1992);\nRasmussen (2000). Traditional \ufb01nite mixture models often suffer from over\ufb01tting or under\ufb01tting\nof data due to possible mismatch between the model complexity and amount of data. Thus, model\nselection or model averaging is required to \ufb01nd the correct number of clusters or the model with\nthe appropriate complexity. This requires signi\ufb01cant computation for high-dimensional data sets or\nlarge samples. Bayesian nonparametric modeling is an alternative to parametric modeling;\nan example is the DPMM, which can automatically infer the number of clusters from the data via\nBayesian inference techniques.\nThe use of Markov chain Monte Carlo (MCMC) methods for Dirichlet process mixtures has made\ninference tractable Neal (2000). However, these methods can exhibit slow convergence and their\nconvergence can be tough to detect. Alternatives include variational methods Blei & Jordan (2006),\nwhich are deterministic algorithms that convert inference to optimization. These approaches can\nrequire signi\ufb01cant computational effort even for moderate-sized data sets. For large-scale data sets\nand low-latency applications with streaming data, there is a need for inference algorithms that are\nmuch faster and do not require multiple passes through the data. 
In this work, we focus on low-\ncomplexity algorithms that adapt to each sample as it arrives, making them highly scalable. An\nonline algorithm for learning DPMM\u2019s based on a sequential variational approximation (SVA) was\nproposed in Lin (2013), and the authors in Wang & Dunson (2011) recently proposed a sequential\nmaximum a-posteriori (MAP) estimator for the class labels given streaming data. The algorithm is\ncalled sequential updating and greedy search (SUGS) and each iteration is composed of a greedy\nselection step and a posterior update step.\nThe choice of concentration parameter \u03b1 is critical for DPMM\u2019s as it controls the number of clus-\nters Antoniak (1974). While most fast DPMM algorithms use a \ufb01xed \u03b1 Fearnhead (2004); Daume\n(2007); Kurihara et al. (2006), imposing a prior distribution on \u03b1 and sampling from it provides more\n\ufb02exibility, but this approach still heavily relies on experimentation and prior knowledge. Thus, many\nfast inference methods for Dirichlet process mixture models have been proposed that can adapt \u03b1\nto the data, including the works of Escobar & West (1995), where learning of \u03b1 is incorporated in the\nGibbs sampling analysis, and Blei & Jordan (2006), where a Gamma prior is used in a conjugate manner\ndirectly in the variational inference algorithm. Wang & Dunson (2011) also account for model un-\ncertainty on the concentration parameter \u03b1 in a Bayesian manner directly in the sequential inference\nprocedure. This approach can be computationally expensive, as discretization of the domain of \u03b1 is\nneeded, and its stability highly depends on the initial distribution on \u03b1 and on the range of values of\n\u03b1. 
To the best of our knowledge, we are the \ufb01rst to analytically study the evolution and stability of\nthe adapted sequence of \u03b1\u2019s in the online learning setting.\nIn this paper, we propose an adaptive non-Bayesian approach for adapting \u03b1 motivated by large-\nsample asymptotics, and call the resulting algorithm ASUGS (Adaptive SUGS). While the basic\nidea behind ASUGS is directly related to the greedy approach of SUGS, the main contribution is\na novel low-complexity stable method for choosing the concentration parameter adaptively as new\ndata arrive, which greatly improves the clustering performance. We derive an upper bound on the\nnumber of classes, logarithmic in the number of samples, and further prove that the sequence of\nconcentration parameters that results from this adaptive design is almost bounded. We \ufb01nally prove,\nthat the conditional likelihood, which is the primary tool used for Bayesian-based online clustering,\nis asymptotically Gaussian in the large-sample limit, implying that the clustering part of ASUGS\nasymptotically behaves as a Gaussian classi\ufb01er. Experiments show that our method outperforms\nother state-of-the-art methods for online learning of DPMM\u2019s.\nThe paper is organized as follows. In Section 2, we review the sequential inference framework for\nDPMM\u2019s that we will build upon, introduce notation and propose our adaptive modi\ufb01cation. In\nSection 3, the probabilistic data model is given and sequential inference steps are shown. Section\n4 contains the growth rate analysis of the number of classes and the adaptively-designed concentra-\ntion parameters, and Section 5 contains the Gaussian large-sample approximation to the conditional\nlikelihood. Experimental results are shown in Section 6 and we conclude in Section 7.\n\n2 Sequential Inference Framework for DPMM\n\nHere, we review the SUGS framework of Wang & Dunson (2011) for online clustering. 
Here, the\nnonparametric nature of the Dirichlet process manifests itself as modeling mixture models with\ncountably in\ufb01nite components. Let the observations be given by yi \u2208 Rd, and let \u03b3i denote\nthe class label of the ith observation (a latent variable). We de\ufb01ne the available information at\ntime i as y(i) = {y1, . . . , yi} and \u03b3(i\u22121) = {\u03b31, . . . , \u03b3i\u22121}. The online sequential updating and\ngreedy search (SUGS) algorithm is summarized next for completeness. Set \u03b31 = 1 and calculate\n\u03c0(\u03b81|y1, \u03b31). For i \u2265 2,\n\n1. Choose the best class label for yi: \u03b3i \u2208 arg max1\u2264h\u2264ki\u22121+1 P (\u03b3i = h|y(i), \u03b3(i\u22121)).\n2. Update the posterior distribution using yi, \u03b3i: \u03c0(\u03b8\u03b3i|y(i), \u03b3(i)) \u221d f (yi|\u03b8\u03b3i)\u03c0(\u03b8\u03b3i|y(i\u22121), \u03b3(i\u22121)).\n\nwhere \u03b8h are the parameters of class h, f (yi|\u03b8h) is the observation density conditioned on class\nh and ki\u22121 is the number of classes created at time i \u2212 1. The algorithm sequentially allocates\nobservations yi to classes based on maximizing the conditional posterior probability.\nTo calculate the posterior probability P (\u03b3i = h|y(i), \u03b3(i\u22121)), de\ufb01ne the variables:\n\nLi,h(yi) def= P (yi|\u03b3i = h, y(i\u22121), \u03b3(i\u22121)),   \u03c0i,h(\u03b1) def= P (\u03b3i = h|\u03b1, y(i\u22121), \u03b3(i\u22121)).\n\nFrom Bayes\u2019 rule, P (\u03b3i = h|y(i), \u03b3(i\u22121)) \u221d Li,h(yi)\u03c0i,h(\u03b1) for h = 1, . . . , ki\u22121 + 1. Here, \u03b1 is\nconsidered \ufb01xed at this iteration, and is not updated in a fully Bayesian manner.\nAccording to the Dirichlet process prediction, the predictive probability of assigning observation yi\nto a class h is:\n\n\u03c0i,h(\u03b1) = mi\u22121(h)/(i\u22121+\u03b1) for h = 1, . . . , ki\u22121, and \u03c0i,h(\u03b1) = \u03b1/(i\u22121+\u03b1) for h = ki\u22121 + 1.   (1)\n\n\fAlgorithm 1 Adaptive Sequential Updating and Greedy Search (ASUGS)\n\nInput: streaming data {yi}\u221ei=1, rate parameter \u03bb > 0.\nSet \u03b31 = 1 and k1 = 1. Calculate \u03c0(\u03b81|y1, \u03b31).\nfor i \u2265 2 do\n  (a) Update concentration parameter: \u03b1i\u22121 = ki\u22121/(\u03bb + log(i\u22121)).\n  (b) Choose best label for yi: \u03b3i \u223c {q(i)h} = {Li,h(yi)\u03c0i,h(\u03b1i\u22121) / \u2211h\u2032 Li,h\u2032(yi)\u03c0i,h\u2032(\u03b1i\u22121)}.\n  (c) Update posterior distribution: \u03c0(\u03b8\u03b3i|y(i), \u03b3(i)) \u221d f (yi|\u03b8\u03b3i)\u03c0(\u03b8\u03b3i|y(i\u22121), \u03b3(i\u22121)).\nend for\n\nwhere mi\u22121(h) = \u2211i\u22121l=1 I(\u03b3l = h) counts the number of observations labeled as class h at time\ni \u2212 1, and \u03b1 > 0 is the concentration parameter.\n\n2.1 Adaptation of Concentration Parameter \u03b1\n\nIt is well known that the concentration parameter \u03b1 has a strong in\ufb02uence on the growth of the num-\nber of classes Antoniak (1974). Our experiments show that in this sequential framework, the choice\nof \u03b1 is even more critical. Choosing a \ufb01xed \u03b1 as in the online SVA algorithm of Lin (2013) requires\ncross-validation, which is computationally prohibitive for large-scale data sets. Furthermore, in the\nstreaming data setting where no estimate on the data complexity exists, it is impractical to perform\ncross-validation. Although the parameter \u03b1 is handled with a fully Bayesian treatment in Wang &\nDunson (2011), a pre-speci\ufb01ed grid of possible values \u03b1 can take, say {\u03b1l}Ll=1, along with the prior\ndistribution over them, needs to be chosen in advance. Storage and updating of a matrix of size\n(ki\u22121 + 1) \u00d7 L and further marginalization is needed to compute P (\u03b3i = h|y(i), \u03b3(i\u22121)) at each\niteration i. 
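For concreteness, the per-iteration logic of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function name asugs_step, the label convention 1..k, and the sentinel h = 0 for the innovation-class likelihood are our choices, and the conditional likelihood is left abstract.

```python
import math

def asugs_step(i, y_i, counts, lam, cond_likelihood):
    """One ASUGS iteration (sketch of Algorithm 1, steps (a)-(b)).

    counts: dict mapping class labels 1..k to occupancy m_{i-1}(h); i-1 points total.
    cond_likelihood(y, h): stand-in for the conditional likelihood L_{i,h}(y);
    h == 0 denotes the innovation (new) class, evaluated under the prior.
    """
    k = len(counts)                          # k_{i-1}: classes created so far
    alpha = k / (lam + math.log(i - 1))      # (a) adaptive concentration parameter
    n = i - 1
    # (b) unnormalized posterior q_h proportional to L_{i,h}(y_i) * pi_{i,h}(alpha), Eq. (1)
    q = {h: cond_likelihood(y_i, h) * counts[h] / (n + alpha) for h in counts}
    q[k + 1] = cond_likelihood(y_i, 0) * alpha / (n + alpha)   # innovation class
    gamma_i = max(q, key=q.get)   # greedy variant; Algorithm 1 samples from q/sum(q)
    counts[gamma_i] = counts.get(gamma_i, 0) + 1
    # step (c), the conjugate hyperparameter update for class gamma_i, is omitted here
    return gamma_i, alpha
```

With a likelihood that strongly favors the innovation class, the sketch opens a new cluster; otherwise it reinforces an existing one.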
Thus, we propose an alternative data-driven method for choosing \u03b1 that works well in\npractice, is simple to compute and has theoretical guarantees.\nThe idea is to start with a prior distribution on \u03b1 that favors small \u03b1 and shape it into a posterior\ndistribution using the data. De\ufb01ne pi(\u03b1) = p(\u03b1|y(i), \u03b3(i)) as the posterior distribution formed at\ntime i, which will be used in ASUGS at time i + 1. Let p1(\u03b1) \u2261 p1(\u03b1|y(1), \u03b3(1)) denote the prior\nfor \u03b1, e.g., an exponential distribution p1(\u03b1) = \u03bbe\u2212\u03bb\u03b1. The dependence on y(i) and \u03b3(i) is trivial\nonly at this \ufb01rst step. Then, by Bayes\u2019 rule, pi(\u03b1) \u221d p(yi, \u03b3i|y(i\u22121), \u03b3(i\u22121), \u03b1)p(\u03b1|y(i\u22121), \u03b3(i\u22121)) \u221d\npi\u22121(\u03b1)\u03c0i,\u03b3i (\u03b1) where \u03c0i,\u03b3i (\u03b1) is given in (1). Once this update is made after the selection of \u03b3i, the\n\u03b1 to be used in the next selection step is the mean of the distribution pi(\u03b1), i.e., \u03b1i = E[\u03b1|y(i), \u03b3(i)].\nAs will be shown in Section 5, the distribution pi(\u03b1) can be approximated by a Gamma distribution\nwith shape parameter ki and rate parameter \u03bb + log i. Under this approximation, we have \u03b1i =\nki/(\u03bb + log i), only requiring storage and update of one scalar parameter ki at each iteration i.\nThe ASUGS algorithm is summarized in Algorithm 1. The selection step may be implemented\nby sampling the probability mass function {q(i)h}. The posterior update step can be ef\ufb01ciently per-\nformed by updating the hyperparameters as a function of the streaming data for the case of conjugate\ndistributions. Section 3 derives these updates for the case of multivariate Gaussian observations and\nconjugate priors for the parameters.\n\n3 Sequential Inference under Unknown Mean & Unknown Covariance\n\nWe consider the general case of an unknown mean and covariance for each class. 
The probabilistic\nmodel for the parameters of each class is given as:\n\n\u00b5|T \u223c N (\u00b7|\u00b50, coT),\n\nT \u223c W(\u00b7|\u03b40, V0)\n\nyi|\u00b5, T \u223c N (\u00b7|\u00b5, T),\n\n(2)\nwhere N (\u00b7|\u00b5, T) denotes the multivariate normal distribution with mean \u00b5 and precision matrix\nT, and W(\u00b7|\u03b4, V) is the Wishart distribution with 2\u03b4 degrees of freedom and scale matrix V. The\nparameters \u03b8 = (\u00b5, T) \u2208 Rd\u00d7 Sd\n++ follow a normal-Wishart joint distribution. The model (2) leads\nto closed-form expressions for Li,h(yi)\u2019s due to conjugacy Tzikas et al. (2008).\nTo calculate the class posteriors, the conditional likelihoods of yi given assignment to class h and\nthe previous class assignments need to be calculated \ufb01rst. The conditional likelihood of yi given\n\n3\n\n\fassignment to class h and the history (y(i\u22121), \u03b3(i\u22121)) is given by:\n\n(3)\nDue to the conjugacy of the distributions, the posterior \u03c0(\u03b8h|y(i\u22121), \u03b3(i\u22121)) always has the form:\n\nLi,h(yi) =\n\nf (yi|\u03b8h)\u03c0(\u03b8h|y(i\u22121), \u03b3(i\u22121))d\u03b8h\n\n(cid:90)\n\n\u03c0(\u03b8h|y(i\u22121), \u03b3(i\u22121)) = N (\u00b5h|\u00b5(i\u22121)\n, c(i\u22121)\n\n, V(i\u22121)\n\n, \u03b4(i\u22121)\n\nh\n\n, c(i\u22121)\nh Th)W(Th|\u03b4(i\u22121)\n\nh\n\n, V(i\u22121)\n\n)\n\nh\n\nh\n\nh\n\nh\n\nh\n\nyi +\n\n\u00b5(i)\n\u03b3i\n\nh , V(i)\n\nh ). The matrix \u03a3(i)\n\n=\n= c(i\u22121)\n\nc(i\u22121)\n1 + c(i\u22121)\n\nh )\u22121\nh := (V(i)\n2\u03b4(i)\nh\n\nwhere \u00b5(i\u22121)\nare hyperparameters that can be recursively computed as new\nsamples come in. The form of this recursive computation of the hyperparameters is derived in\nAppendix A. For ease of interpretation and numerical stability, we de\ufb01ne \u03a3(i)\nas the\ninverse of the mean of the Wishart distribution W(\u00b7|\u03b4(i)\nh has the natural\ninterpretation as the covariance matrix of class h at iteration i. 
Once the \u03b3ith component is chosen,\nthe parameter updates for the \u03b3ith class become:\n\u00b5(i\u22121)\n\n1 + c(i\u22121)\n\u03b3i\n+ 1\n2\u03b4(i\u22121)\n1 + 2\u03b4(i\u22121)\n1\n(7)\n2\nh } will remain positive\nIf the starting matrix \u03a3(0)\nh\nde\ufb01nite. Let us return to the calculation of the conditional likelihood (3). By iterated integration, it\nfollows that:\nLi,h(yi) \u221d\n\nis positive de\ufb01nite, then all the matrices {\u03a3(i)\n\nc(i\u22121)\n1 + c(i\u22121)\n\n)(yi \u2212 \u00b5(i\u22121)\n\n(yi \u2212 \u00b5(i\u22121)\n\n1 + 2\u03b4(i\u22121)\n\n(cid:33)d/2\n\n= \u03b4(i\u22121)\n\n\u03a3(i\u22121)\n\n)\u22121/2\n\n(cid:32)\n\n\u03a3(i)\n\u03b3i\n\n\u03b4(i)\n\u03b3i\n\nc(i)\n\u03b3i\n\n(4)\n\n(8)\n\n(5)\n\n(6)\n\n)T\n\n+\n\n=\n\n\u03b3i\n\n+\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n1\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n\u03b3i\n\n1\n\n\u03b3i\n\n) det(\u03a3(i\u22121)\n)T (\u03a3(i\u22121)\n\nh\n\nh\n\n)\u22121(yi \u2212 \u00b5(i\u22121)\n\n)\n\nh\n\n(cid:19)\u03b4(i\u22121)\n\nh\n\n+ 1\n2\n\nr(i\u22121)\n2\u03b4(i\u22121)\n\nh\n\nh\n\n(cid:18)\n\nh\n\n\u03c1d(\u03b4(i\u22121)\n(yi \u2212 \u00b5(i\u22121)\ndef= c(i\u22121)\n1+c(i\u22121)\n\nh\n\nh\n\n1 + r(i\u22121)\n2\u03b4(i\u22121)\n\nh\n\nh\n\nand r(i\u22121)\n\nwhere \u03c1d(a) def= \u0393(a+ 1\n2 )\n. A detailed mathematical derivation of this\n\u0393(a+ 1\u2212d\n2 )\nconditional likelihood is included in Appendix B. 
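In code, the recursions above amount to running means and covariances. The following sketch reflects our reading of updates (4)-(7); the function name and the list-based linear algebra are ours, and a practical implementation would use a matrix library.

```python
def nw_update(y, mu, c, delta, Sigma):
    """One-sample normal-Wishart hyperparameter update (our reading of Eqs. (4)-(7)).

    y: new observation (list of length d); mu: current mean hyperparameter;
    c: precision-scaling count; delta: half the Wishart degrees of freedom;
    Sigma: current class covariance estimate (d x d, nested lists).
    """
    d = len(y)
    w = c / (1.0 + c)  # weight r = c/(1+c) on the new residual outer product
    mu_new = [(c * mu[j] + y[j]) / (1.0 + c) for j in range(d)]   # mean update
    Sigma_new = [[(2.0 * delta * Sigma[j][k]
                   + w * (y[j] - mu[j]) * (y[k] - mu[k])) / (1.0 + 2.0 * delta)
                  for k in range(d)] for j in range(d)]           # covariance update
    return mu_new, c + 1.0, delta + 0.5, Sigma_new                # count/dof updates
```

Feeding the same point repeatedly drives mu toward that point at the usual 1/n rate, and Sigma stays positive definite whenever the starting matrix is, matching the remark above.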
We remark that for the new class h = ki\u22121 + 1,\nLi,ki\u22121+1 has the form (8) with the initial choice of hyperparameters r(0), \u03b4(0), \u00b5(0), \u03a3(0).\n\n4 Growth Rate Analysis of Number of Classes & Stability\n\nIn this section, we derive a model for the posterior distribution pn(\u03b1) using large-sample approxi-\nmations, which will allow us to derive growth rates on the number of classes and the sequence of\nconcentration parameters, showing that the number of classes grows as E[kn] = O(log^(1+\u03b5) n) for \u03b5\narbitrarily small under certain mild conditions.\nThe probability density of the \u03b1 parameter is updated at the jth step in the following fashion:\n\npj+1(\u03b1) \u221d pj(\u03b1) \u00b7 { \u03b1/(j+\u03b1), innovation class chosen; 1/(j+\u03b1), otherwise },\n\nwhere only the \u03b1-dependent factors in the update are shown. The \u03b1-independent factors are absorbed\nby the normalization to a probability density. Choosing the innovation class pushes mass toward\nin\ufb01nity while choosing any other class pushes mass toward zero. Thus there is a possibility that\nthe innovation probability grows in an undesired manner. We assess the growth of the number of\ninnovations rn def= kn \u2212 1 under simple assumptions on some likelihood functions that appear\nnaturally in the ASUGS algorithm.\nAssuming that the initial distribution of \u03b1 is p1(\u03b1) = \u03bbe\u2212\u03bb\u03b1, the distribution used at step n + 1 is\nproportional to \u03b1rn \u220fn\u22121j=1 (1 + \u03b1/j)\u22121 e\u2212\u03bb\u03b1. We make use of the limiting relation\n\n\fTheorem 1. The following asymptotic behavior holds: limn\u2192\u221e [log \u220fn\u22121j=1 (1 + \u03b1/j)] / (\u03b1 log n) = 1.\n\nProof. See Appendix C.\n\nUsing Theorem 1, a large-sample model for pn(\u03b1) is \u03b1rn e\u2212(\u03bb+log n)\u03b1, suitably normalized. 
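Theorem 1 is straightforward to check numerically; a small verification sketch (the helper name is ours):

```python
import math

def theorem1_ratio(alpha, n):
    """Ratio of log prod_{j=1}^{n-1}(1 + alpha/j) to alpha*log(n); Theorem 1 says -> 1."""
    s = sum(math.log1p(alpha / j) for j in range(1, n))
    return s / (alpha * math.log(n))
```

The ratio approaches 1 as n grows, which is why the large-sample model \u03b1^rn e^\u2212(\u03bb+log n)\u03b1 above has a Gamma shape.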
Recog-\nnizing this as the Gamma distribution with shape parameter rn + 1 and rate parameter \u03bb + log n, its\nmean is given by \u03b1n = rn+1\n\u03bb+log n. We use the mean in this form to choose class membership in Alg. 1.\nThis asymptotic approximation leads to a very simple scalar update of the concentration parameter;\nthere is no need for discretization for tracking the evolution of continuous probability distributions\non \u03b1. In our experiments, this approximation is very accurate.\nRecall that the innovation class is labeled K+ = kn\u22121 + 1 at the nth step. The modeled updates\nrandomly select a previous class or innovation (new class) by sampling from the probability distri-\nbution {q(n)\nmn(k) , where mn(k)\nrepresents the number of members in class k at time n.\nWe assume the data follows the Gaussian mixture distribution:\n\nk=1. Note that n \u2212 1 =(cid:80)\nk = P (\u03b3n = k|y(n), \u03b3(n\u22121))}K+\nK(cid:88)\n\nk(cid:54)=K+\n\npT (y) def=\n\n\u03c0hN (y|\u00b5h, \u03a3h)\n\n(9)\n\nh=1\n\n(cid:88)\n\nk(cid:54)=K+\n\n(cid:33)\n\nwhere \u03c0h are the prior probabilities, and \u00b5h, \u03a3h are the parameters of the Gaussian clusters.\nDe\ufb01ne the mixture-model probability density function, which plays the role of the predictive distri-\nbution:\n\n\u02dcLn,K+(y) def=\n\nmn\u22121(k)\nn \u2212 1\n\nLn,k(y),\n\n(10)\n\nso that the probabilities of choosing a previous class or an innovation (using Equ. (1)) are propor-\nLn,K+ (yn), respec-\n\nLn,k(yn) = (n\u22121)\nn\u22121+\u03b1n\u22121\n\n\u02dcLn,K+(yn) and\n\nmn\u22121(k)\nn\u22121+\u03b1n\u22121\n\nn\u22121+\u03b1n\u22121\n\nk(cid:54)=K+\n\n\u03b1n\u22121\n\ntively. 
If \u03c4n\u22121 denotes the innovation probability at step n, then we have\n\ntional to(cid:80)\n\n(cid:32)\n\n\u03c1n\u22121\n\n\u03b1n\u22121Ln,K+(yn)\nn \u2212 1 + \u03b1n\u22121\n\n, \u03c1n\u22121\n\n(n \u2212 1) \u02dcLn,K+(yn)\n\nn \u2212 1 + \u03b1n\u22121\n\n= (\u03c4n\u22121, 1 \u2212 \u03c4n\u22121)\n\n(11)\n\nfor some positive proportionality factor \u03c1n\u22121.\nDe\ufb01ne the likelihood ratio (LR) at the beginning of stage n as 1:\n\nln(y) def=\n\nLn,K+(y)\n\u02dcLn,K+(y)\n\n(12)\n\nConceptually, the mixture (10) represents a modeled distribution \ufb01tting the currently observed data.\nIf all \u201cmodes\u201d of the data have been observed, it is reasonable to expect that \u02dcLn,K+ is a good model\nfor future observations. The LR ln(yn) is not large when the future observations are well-modeled\nby (10). In fact, we expect \u02dcLn,K+ \u2192 pT as n \u2192 \u221e, as discussed in Section 5.\nLemma 1. The following bound holds: \u03c4n\u22121 = ln(yn)\u03b1n\u22121\n\n(cid:16) ln(yn)\u03b1n\u22121\n\n\u2264 min\n\n(cid:17)\n\n, 1\n\n.\n\nn\u22121+ln(yn)\u03b1n\u22121\n\nn\u22121\n\nProof. The result follows directly from (11) after a simple calculation.\n\nThe innovation random variable rn is described by the random process associated with the proba-\nbilities of transition\n\nP (rn+1 = k|rn) =\n\nk = rn + 1\n\n1 \u2212 \u03c4n, k = rn\n\n.\n\n(13)\n\n(cid:26) \u03c4n,\n\n1Here, L0(\u00b7) def= Ln,K+(\u00b7) is independent of n and only depends on the initial choice of hyperparameters\n\nas discussed in Sec. 3.\n\n5\n\n\fdef= min( rn+1\nan\n\nThe expectation of rn is majorized by the expectation of a similar random process, \u00afrn, based on the\ntransition probability \u03c3n\n, 1) instead of \u03c4n as Appendix D shows, where the random\nsequence {an} is given by ln+1(yn+1)\u22121n(\u03bb + log n). The latter can be described as a modi\ufb01cation\nof a Polya urn process with selection probability \u03c3n. 
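A toy mean-field simulation (ours, not from the paper) makes the urn dynamics concrete: take a perfectly matched predictive model, i.e. ln \u2261 1, and replace the random transitions (13) by their expectations.

```python
import math

def expected_innovations(n_steps, lam=1.0):
    """Mean-field recursion for E[r_n] under l_n = 1 (toy model of (13)).

    alpha_n uses the Gamma-mean approximation (r_n + 1)/(lam + log n); each step
    adds the expected innovation probability tau_n = alpha_n/(n + alpha_n).
    """
    r = 0.0
    for n in range(1, n_steps + 1):
        alpha = (r + 1.0) / (lam + math.log(n))
        r += alpha / (n + alpha)
    return r
```

In this toy model the innovation count grows roughly like log n, consistent with the logarithmic rates established in Theorem 2 below.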
The asymptotic behavior of rn and related\nvariables is described in the following theorem.\nTheorem 2. Let \u03c4n be a sequence of real-valued random variables 0 \u2264 \u03c4n \u2264 1 satisfying \u03c4n \u2264 (rn + 1)/an\nfor n \u2265 N, where an = ln+1(yn+1)\u22121n(\u03bb + log n), and where the nonnegative, integer-valued\nrandom variables rn evolve according to (13). Assume the following for n \u2265 N:\n\n1. ln(yn) \u2264 \u03b6 (a.s.)\n2. D(pT \u2225 \u02dcLn,K+) \u2264 \u03b4 (a.s.)\n\nwhere D(p \u2225 q) is the Kullback-Leibler divergence between distributions p(\u00b7) and q(\u00b7). Then, as\nn \u2192 \u221e,\n\nrn = OP (log^(1+\u03b6\u221a(\u03b4/2)) n),   \u03b1n = OP (log^(\u03b6\u221a(\u03b4/2)) n)   (14)\n\nProof. See Appendix E.\n\nTheorem 2 bounds the growth rate of the mean of the number of class innovations and the concen-\ntration parameter \u03b1n in terms of the sample size n and parameter \u03b6. The bounded LR and bounded\nKL divergence conditions of Thm. 2 manifest themselves in the rate exponents of (14). The ex-\nperiments section shows that both of the conditions of Thm. 2 hold for all iterations n \u2265 N for\nsome N \u2208 N. In fact, assuming the correct clustering, the mixture distribution \u02dcLn,kn\u22121+1 converges\nto the true mixture distribution pT , implying that the number of class innovations grows at most\nas O(log^(1+\u03b5) n) and the sequence of concentration parameters is O(log^\u03b5 n), where \u03b5 > 0 can be\narbitrarily small.\n\n5 Asymptotic Normality of Conditional Likelihood\n\nIn this section, we derive an asymptotic expression for the conditional likelihood (8) in order to gain\ninsight into the steady-state of the algorithm.\nWe let \u03c0h denote the true prior probability of class h. Using the bounds of the Gamma function\nin Theorem 1.6 from Batir (2008), it follows that lima\u2192\u221e \u03c1d(a)/(e\u2212d/2(a\u22121/2)d/2) = 1. 
Under normal\nconvergence conditions of the algorithm (with the pruning and merging steps included), all classes\nh = 1, . . . , K will be correctly identi\ufb01ed and populated with approximately ni\u22121(h) \u2248 \u03c0h(i \u2212 1)\nobservations at time i \u2212 1. Thus, the conditional class prior for each class h converges to \u03c0h as\ni\u2192\u221e\u2212\u2192 \u03c0h. According\ni \u2192 \u221e, in virtue of (14), \u03c0i,h(\u03b1i\u22121) = ni\u22121(h)\ni\u22121+\u03b1i\u22121\nto (5), we expect r(i\u22121)\nh \u2192 1 as i \u2192 \u221e since c(i\u22121)\n\u223c\n\u03c0h(i \u2212 1) as i \u2192 \u221e according to (7). Also, from before, \u03c1d(\u03b4(i\u22121)\n\u2212 1/2)d/2 \u223c\nh \u2192 \u03a3h as i \u2192 \u221e.\ne\u2212d/2(\u03c0h\nThis follows from the strong law of large numbers, as the updates are recursive implementations\nof the sample mean and sample covariance matrix. Thus, the large-sample approximation to the\n(cid:17)\u2212 i\u22121\nconditional likelihood becomes:\n\n\u221a\n\u03c0h\n1+ OP (log\u03b6\ni\u22121\n\u223c \u03c0h(i \u2212 1). Also, we expect 2\u03b4(i\u22121)\n\n2 )d/2. The parameter updates (4)-(7) imply \u00b5(i)\n\nh \u2192 \u00b5h and \u03a3(i)\n\n) \u223c e\u2212d/2(\u03b4(i\u22121)\n\n2 \u2212 1\ni\u22121\n\n\u03b4/2(i\u22121))\n\n(cid:16)\n\n=\n\nh\n\nh\n\nh\n\nh\n\n1 + \u03c0\n\n\u22121\nh\n\ni\u22121 (yi \u2212 \u00b5(i\u22121)\n\n)T (\u03a3(i\u22121)\nlimi\u2192\u221e det(\u03a3(i\u22121)\n\nh\n\nh\n\n)\u22121(yi \u2212 \u00b5(i\u22121)\n)1/2\n\nh\n\n)\n\nh\n\n\u22121\nh\n\n2\u03c0\n\nLi,h(yi)\n\ni\u2192\u221e\u221d limi\u2192\u221e\ni\u2192\u221e\u221d e\u2212 1\n\n\u221a\n\n2 (yi\u2212\u00b5h)T \u03a3\n\n\u22121\nh (yi\u2212\u00b5h)\n\ndet \u03a3h\n\nu )u = ec. The conditional likelihood (15) corresponds to the multivari-\nwhere we used limu\u2192\u221e(1+ c\nate Gaussian distribution with mean \u00b5h and covariance matrix \u03a3h. 
A similar asymptotic normality\n\n6\n\n(15)\n\n\fresult was recently obtained in Tsiligkaridis & Forsythe (2015) for Gaussian observations with a von\nh \u2192 \u03a3h, Ln,h(y) \u2192 N (y|\u00b5h, \u03a3h)\nMises prior. The asymptotics mn\u22121(h)\nas n \u2192 \u221e imply that the mixture distribution \u02dcLn,K+ in (10) converges to the true Gaussian mixture\ndistribution pT of (9). Thus, for any small \u03b4, we expect D(pT (cid:107) \u02dcLn,K+) \u2264 \u03b4 for all n \u2265 N,\nvalidating the assumption of Theorem 2.\n\nn\u22121 \u2192 \u03c0h, \u00b5(n)\n\nh \u2192 \u00b5h, \u03a3(n)\n\n6 Experiments\n\nWe apply the ASUGS learning algorithm to a synthetic 16-class example and to a real data set, to\nverify the stability and accuracy of our method. The experiments show the value of adaptation of\nthe Dirichlet concentration parameter for online clustering and parameter estimation.\nSince it is possible that multiple clusters are similar and classes might be created due to outliers, or\ndue to the particular ordering of the streaming data sequence, we add the pruning and merging step\nin the ASUGS algorithm as done in Lin (2013). We compare ASUGS and ASUGS-PM with SUGS,\nSUGS-PM, SVA and SVA-PM proposed in Lin (2013), since it was shown in Lin (2013) that SVA\nand SVA-PM outperform the block-based methods that perform iterative updates over the entire data\nset including Collapsed Gibbs Sampling, MCMC with Split-Merge and Truncation-Free Variational\nInference.\n\n6.1 Synthetic Data set\n\nWe consider learning the parameters of a 16-class Gaussian mixture each with equal variance of\n\u03c32 = 0.025. The training set was made up of 500 iid samples, and the test set was made up of\n1000 iid samples. The clustering results are shown in Fig. 1(a), showing that the ASUGS-based ap-\nproaches are more stable than SVA-based algorithms. ASUGS-PM performs best and identi\ufb01es the\ncorrect number of clusters, and their parameters. Fig. 
1(b) shows the data log-likelihood on the test\nset (averaged over 100 Monte Carlo trials), and the mean and variance of the number of classes at each it-\neration. The ASUGS-based approaches achieve a higher log-likelihood than SVA-based approaches\nasymptotically. Fig. 2 provides some numerical veri\ufb01cation for the assumptions of Theorem 2.\nAs expected, the predictive likelihood \u02dcLi,K+ (10) converges to the true mixture distribution pT (9),\nand the likelihood ratio li(yi) is bounded after enough samples are processed.\n\nFigure 1: (a) Clustering performance of SVA, SVA-PM, ASUGS and ASUGS-PM on synthetic data\nset. ASUGS-PM identi\ufb01es the 16 clusters correctly. (b) Joint log-likelihood on synthetic data, mean\nand variance of number of classes as a function of iteration. The likelihood values were evaluated on\na held-out set of 1000 samples. ASUGS-PM achieves the highest log-likelihood and has the lowest\nasymptotic variance on the number of classes.\n\n6.2 Real Data Set\n\nWe applied the online nonparametric Bayesian methods for clustering image data. We used the\nMNIST data set, which consists of 60,000 training samples and 10,000 test samples. Each sample\n\n\f
This training\nset contains data from all 10 digits with an approximately uniform proportion. Fig. 3 shows the\npredictive log-likelihood over the test set, and the mean images for clusters obtained using ASUGS-\nPM and SVA-PM, respectively. We note that ASUGS-PM achieves higher log-likelihood values and\n\ufb01nds all digits correctly using only 23 clusters, while SVA-PM \ufb01nds some digits using 56 clusters.\n\n(a)\n\n(b)\n\n(c)\n\nFigure 3: Predictive log-likelihood (a) on test set, mean images for clusters found using ASUGS-PM\n(b) and SVA-PM (c) on MNIST data set.\n\n6.3 Discussion\n\nAlthough both SVA and ASUGS methods have similar computational complexity and use decisions\nand information obtained from processing previous samples in order to decide on class innova-\ntions, the mechanics of these methods are quite different. ASUGS uses an adaptive \u03b1 motivated\nby asymptotic theory, while SVA uses a \ufb01xed \u03b1. Furthermore, SVA updates the parameters of all\nthe components at each iteration (in a weighted fashion) while ASUGS only updates the parameters\nof the most-likely cluster, thus minimizing leakage to unrelated components. The \u03bb parameter of\nASUGS does not affect performance as much as the threshold parameter \u0001 of SVA does, which often\nleads to instability requiring lots of pruning and merging steps and increasing latency. This is crit-\nical for large data sets or streaming applications, because cross-validation would be required to set\n\u0001 appropriately. We observe higher log-likelihoods and better numerical stability for ASUGS-based\nmethods in comparison to SVA. The mathematical formulation of ASUGS allows for theoretical\nguarantees (Theorem 2), and asymptotically normal predictive distribution.\n\n7 Conclusion\n\nWe developed a fast online clustering and parameter estimation algorithm for Dirichlet process mix-\ntures of Gaussians, capable of learning in a single data pass. 
Motivated by large-sample asymptotics,\nwe proposed a novel low-complexity data-driven adaptive design for the concentration parameter\nand showed it leads to logarithmic growth rates on the number of classes. Through experiments on\nsynthetic and real data sets, we showed that our method achieves better performance than, and is as fast as, other\nstate-of-the-art online DPMM learning methods.\n\n\fReferences\nAntoniak, C. E. Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric\n\nProblems. The Annals of Statistics, 2(6):1152\u20131174, 1974.\n\nBatir, N. Inequalities for the Gamma Function. Archiv der Mathematik, 91(6):554\u2013563, 2008.\nBlei, D. M. and Jordan, M. I. Variational Inference for Dirichlet Process Mixtures. Bayesian Anal-\n\nysis, 1(1):121\u2013144, 2006.\n\nDaume, H. Fast Search for Dirichlet Process Mixture Models. In Conference on Arti\ufb01cial Intelli-\n\ngence and Statistics, 2007.\n\nEscobar, M. D. and West, M. Bayesian Density Estimation and Inference using Mixtures. Journal\n\nof the American Statistical Association, 90(430):577\u2013588, June 1995.\n\nFearnhead, P. Particle Filters for Mixture Models with an Unknown Number of Components. Statis-\n\ntics and Computing, 14:11\u201321, 2004.\n\nKurihara, K., Welling, M., and Vlassis, N. Accelerated Variational Dirichlet Mixture Models. In\n\nAdvances in Neural Information Processing Systems (NIPS), 2006.\n\nLin, Dahua. Online learning of nonparametric mixture models via sequential variational approxi-\nmation. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (eds.),\nAdvances in Neural Information Processing Systems 26, pp. 395\u2013403. Curran Associates, Inc.,\n2013.\n\nNeal, R. M. 
Bayesian Mixture Modeling. In Proceedings of the Workshop on Maximum Entropy\n\nand Bayesian Methods of Statistical Analysis, volume 11, pp. 197\u2013211, 1992.\n\nNeal, R. M. Markov chain sampling methods for Dirichlet process mixture models. Journal of\n\nComputational and Graphical Statistics, 9(2):249\u2013265, June 2000.\n\nRasmussen, C. E. The in\ufb01nite gaussian mixture model. In Advances in Neural Information Process-\n\ning Systems 12, pp. 554\u2013560. MIT Press, 2000.\n\nTsiligkaridis, T. and Forsythe, K. W. A Sequential Bayesian Inference Framework for Blind Fre-\nquency Offset Estimation. In Proceedings of IEEE International Workshop on Machine Learning\nfor Signal Processing, Boston, MA, September 2015.\n\nTzikas, D. G., Likas, A. C., and Galatsanos, N. P. The Variational Approximation for Bayesian\n\nInference. IEEE Signal Processing Magazine, pp. 131\u2013146, November 2008.\n\nWang, L. and Dunson, D. B. Fast Bayesian Inference in Dirichlet Process Mixture Models. Journal\n\nof Computational and Graphical Statistics, 20(1):196\u2013216, 2011.\n", "award": [], "sourceid": 16, "authors": [{"given_name": "Theodoros", "family_name": "Tsiligkaridis", "institution": "MIT Lincoln Laboratory"}, {"given_name": "Keith", "family_name": "Forsythe", "institution": "MIT Lincoln Laboratory"}]}