{"title": "Variational Inference for the Nested Chinese Restaurant Process", "book": "Advances in Neural Information Processing Systems", "page_first": 1990, "page_last": 1998, "abstract": "The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efficiently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well.", "full_text": "Variational Inference for the\n\nNested Chinese Restaurant Process\n\nChong Wang\n\nComputer Science Department\n\nPrinceton University\n\nDavid M. Blei\n\nComputer Science Department\n\nPrinceton University\n\nchongw@cs.princeton.edu\n\nblei@cs.princeton.edu\n\nAbstract\n\nThe nested Chinese restaurant process (nCRP) is a powerful nonparametric\nBayesian model for learning tree-based hierarchies from data. Since its poste-\nrior distribution is intractable, current inference methods have all relied on MCMC\nsampling. In this paper, we develop an alternative inference technique based\non variational methods. To employ variational methods, we derive a tree-based\nstick-breaking construction of the nCRP mixture model, and a novel variational\nalgorithm that ef\ufb01ciently explores a posterior over a large set of combinatorial\nstructures. 
We demonstrate the use of this approach for text and handwritten digit\nmodeling, where we show we can adapt the nCRP to continuous data as well.\n\n1 Introduction\n\nFor many application areas, such as text analysis and image analysis, learning a tree-based hierarchy\nis an appealing approach to illuminate the internal structure of the data. In such settings, however,\nthe combinatorial space of tree structures makes model selection unusually daunting. Traditional\ntechniques, such as cross-validation, require us to enumerate all possible model structures; this kind\nof methodology quickly becomes infeasible in the face of the set of all trees.\nThe nested Chinese restaurant process (nCRP) [1] addresses this problem by specifying a generative\nprobabilistic model for tree structures. This model can then be used to discover structure from data\nusing Bayesian posterior computation. The nCRP has been applied to several problems, such as\nfitting hierarchical topic models [1] and discovering taxonomies of images [2, 3].\nThe nCRP is based on the Chinese restaurant process (CRP) [4], which is closely linked to the\nDirichlet process in its application to mixture models [5]. Because the nCRP is a complicated Bayesian\nnonparametric model, posterior inference in an nCRP-based model is intractable, and previous approaches all rely\non Gibbs sampling [1, 2, 3]. While powerful and flexible, Gibbs sampling can be slow to converge and\nits convergence is difficult to assess [6, 7]. Here, we develop an alternative for posterior inference\nfor nCRP-based models.\nOur solution is to use the optimization-based variational methods [8]. The idea behind variational\nmethods is to posit a simple distribution over the latent variables, and then to fit this distribution to\nbe close to the posterior of interest. 
Variational methods have been successfully applied to several\nBayesian nonparametric models, such as Dirichlet process (DP) mixtures [9, 10, 11], hierarchical\nDirichlet processes (HDP) [12], Pitman-Yor processes [13] and Indian buffet processes (IBP) [14].\nThe work presented here is unique in that our optimization of the variational distribution searches the\ncombinatorial space of trees. Similar to Gibbs sampling, our method includes an exploration of a\nlatent structure associated with the free parameters in addition to their values. First, we describe the\ntree-based stick-breaking construction of the nCRP, which is needed for variational inference. Second,\nwe develop our variational inference algorithm, which explores the infinite tree space associated with\nthe nCRP. Finally, we study the performance of our algorithm on discrete and continuous data sets.\n\n2 Nested Chinese restaurant process mixtures\n\nThe nested Chinese restaurant process (nCRP) is a distribution over hierarchical partitions [1]. It\ngeneralizes the Chinese restaurant process (CRP), which is a distribution over partitions. The CRP\ncan be described by the following metaphor. Imagine a restaurant with an infinite number of tables,\nand imagine customers entering the restaurant in sequence. The dth customer sits at a table according\nto the following distribution,\n\n    p(c_d = k | c_{1:(d\u22121)}) \u221d m_k   if k is a previously occupied table,\n    p(c_d = k | c_{1:(d\u22121)}) \u221d \u03b3     if k is a new table,        (1)\n\nwhere m_k is the number of previous customers sitting at table k and \u03b3 is a positive scalar. After D\ncustomers have sat down, their seating plan describes a partition of D items.\nIn the nested CRP, imagine now that tables are organized in a hierarchy: there is one table at the first\nlevel; it is associated with an infinite number of tables at the second level; each second-level table\nis associated with an infinite number of tables at the third level; and so on until the Lth level. 
Each\ncustomer enters at the first level and comes out at the Lth level, generating a path with L tables as\nshe sits in each restaurant. Moving from a table at level \u2113 to one of its subtables at level \u2113 + 1, the\ncustomer chooses the subtable according to the CRP in Equation 1. (This description is slightly different from the\nmetaphor in [1], but leads to the same distribution.)\nThe nCRP mixture model can be derived by analogy to the CRP mixture model [15]. (From now\non, we will use the term \u201cnodes\u201d instead of \u201ctables.\u201d) Each node is associated with a parameter w,\nwhere w \u223c G_0 and G_0 is called the base distribution. Each data point is drawn by first choosing a\npath in the tree according to the nCRP, and then choosing its value from a distribution that depends\non the parameters in that path. An additional hidden variable x represents other latent quantities\nthat can be used in this distribution. This is a generalization of the model described in [1]. For data\nD = {t_n}_{n=1}^{N}, the nCRP mixture assumes that the nth data point t_n is drawn as follows:\n\n1. Draw a path c_n | c_{1:(n\u22121)} \u223c nCRP(\u03b3, c_{1:(n\u22121)}), which contains L nodes from the tree.\n2. Draw a latent variable x_n \u223c p(x_n | \u03bb).\n3. Draw an observation t_n \u223c p(t_n | W_{c_n}, x_n, \u03c4).\n\nThe parameters \u03bb and \u03c4 are associated with the latent variables x and the data generating distribution,\nrespectively. Note that W_{c_n} contains the parameters w_i selected by the path c_n. Specific applications of the\nnCRP mixture depend on the particular forms of p(w), p(x) and p(t | W_c, x).\nThe corresponding posterior of the latent variables decomposes the data into a collection of paths, and\nprovides distributions of the parameters attached to each node in those paths. 
Even though the nCRP\nassumes an \u201cinfinite\u201d tree, the paths associated with the data will only populate a portion of that tree.\nThrough this posterior, the nCRP mixture can be used as a flexible tree-based mixture model that\ndoes not assume a particular tree structure in advance of the data.\n\nHierarchical topic models. The nCRP mixture described above includes the hierarchical topic\nmodel of [1] as a special case. In that model, observed data are documents, i.e., a list of N words\nfrom a fixed vocabulary. The nodes of the tree are associated with distributions over words (\u201ctopics\u201d),\nand each document is associated with both a path in the tree and with a vector of proportions over its\nlevels. Given a path, a document is generated by repeatedly generating level assignments from the\nproportions and then words from the corresponding topics. In the notation above, p(w) is a Dirichlet\ndistribution over the vocabulary simplex, p(x) is a joint distribution of level proportions (from a\nDirichlet) and level assignments (N draws from the proportions), and p(t | W_c, x) are the N draws\nfrom the topics (for each word) associated with x.\n\nFigure 1: Left. A possible tree structure in a 3-level nCRP. Right. The tree-based stick-breaking\nconstruction of a 3-level nCRP.\n\nTree-based hierarchical component analysis. For continuous data, if p(w), p(x) and p(t | W_c, x)\nare appropriate Gaussian distributions, we obtain hierarchical component analysis, a generalization\nof probabilistic principal component analysis (PPCA) [16, 17]. In this model, w is the component\nparameter for the node it belongs to. Each path c can be thought of as a PPCA model with factor loading\nW_c specified by that path. Then each data point chooses a path (and thus the PPCA model specified by that\npath) and draws the factors x. This model can also be thought of as an infinite mixture of PPCA models,\nwhere each PPCA can share components. 
In addition, we can incorporate the general exponential\nfamily PCA [18, 19] into the nCRP framework.\u00b9\n\n2.1 Tree-based stick-breaking construction\n\nCRP mixtures can be equivalently formulated using the Dirichlet process (DP) as a distribution over\nthe distribution of each data point\u2019s random parameter [21, 4]. An advantage of expressing the CRP\nmixture with a DP is that the draw from the DP can be explicitly represented using the stick-breaking\nconstruction [22]. The DP bundles the scaling parameter \u03b3 and base distribution G_0. A draw from a\nDP(\u03b3, G_0) is described as\n\n    v_i \u223c Beta(1, \u03b3),   \u03c0_i = v_i \u220f_{j=1}^{i\u22121} (1 \u2212 v_j),   w_i \u223c G_0,   i \u2208 {1, 2, \u00b7\u00b7\u00b7},   G = \u2211_{i=1}^{\u221e} \u03c0_i \u03b4_{w_i},\n\nwhere \u03c0 are the stick lengths, and \u2211_{i=1}^{\u221e} \u03c0_i = 1 almost surely. This representation also illuminates\nthe discreteness of a distribution drawn from a DP.\nFor the nCRP, we develop a similar stick-breaking construction. At the first level, the root node\u2019s\nstick length is \u03c0_1 = v_1 \u2261 1. For all the nodes at the second level, their stick lengths are constructed\nas for the DP, i.e., \u03c0_{1i} = \u03c0_1 v_{1i} \u220f_{j=1}^{i\u22121} (1 \u2212 v_{1j}) for i \u2208 {1, 2, \u00b7\u00b7\u00b7} and \u2211_{i=1}^{\u221e} \u03c0_{1i} = \u03c0_1 = 1.\nThe stick-breaking construction is then applied to each of these stick segments at the second\nlevel. For example, the \u03c0_{11} portion of the stick is divided up into an infinite number of pieces\naccording to the stick-breaking process. For the segment \u03c0_{1k}, the stick lengths of its children are\n\u03c0_{1ki} = \u03c0_{1k} v_{1ki} \u220f_{j=1}^{i\u22121} (1 \u2212 v_{1kj}) for i \u2208 {1, 2, \u00b7\u00b7\u00b7} and \u2211_{i=1}^{\u221e} \u03c0_{1ki} = \u03c0_{1k}. The whole process\ncontinues for L levels. This construction is best understood by Figure 1 (Right).\nAlthough this stick represents an infinite tree, the nodes are countable and each node is uniquely\nidentified by a sequence of at most L numbers. We will denote all Beta draws as V, each of which is an\nindependent draw from Beta(1, \u03b3) (except for the root v_1, which is equal to one).\nThe tree-based stick-breaking construction lets us calculate the conditional probability of a path given\nV. Let the path c = [1, c_2, \u00b7\u00b7\u00b7, c_L]; then\n\n    p(c | V) = \u03c0_{1,c_2,\u00b7\u00b7\u00b7,c_L} = \u220f_{\u2113=1}^{L} v_{1,c_2,\u00b7\u00b7\u00b7,c_\u2113} \u220f_{j=1}^{c_\u2113\u22121} (1 \u2212 v_{1,c_2,\u00b7\u00b7\u00b7,j}).        (2)\n\nBy integrating out V in Equation 2, we recover the nCRP. Given Equation 2, the joint probability of\na data set under the nCRP mixture is\n\n    p(t_{1:N}, x_{1:N}, c_{1:N}, V, W) = p(V) p(W) \u220f_{n=1}^{N} p(c_n | V) p(x_n) p(t_n | W_{c_n}, x_n).        (3)\n\nThis representation is the basis for variational inference.\n\n\u00b9 We note that Bach and Jordan [20] studied tree-dependent component analysis, a generalization of\nindependent component analysis where the components are organized in a tree. This model expresses a different\nphilosophy: Their tree reflects the actual conditional dependencies among the components. Data are not\ngenerated by choosing a path first, but by a linear transformation of all components in the tree.\n\n3 Variational inference for the nCRP mixture\n\nThe central computational problem in Bayesian modeling is posterior inference: Given data, what is\nthe conditional distribution of the latent variables in the model? In the nCRP mixture, these latent\nvariables provide the tree structure and node parameters.\nPosterior inference in an nCRP mixture has previously relied on Gibbs sampling, in which we sample\nfrom a Markov chain whose stationary distribution is the posterior [1, 2, 3]. 
Variational inference\nprovides an alternative methodology: Posit a simple (e.g., factorized) family of distributions over\nthe latent variables indexed by free parameters (called \u201cvariational parameters\u201d). Then fit those\nparameters to be close in KL divergence to the true posterior of interest [8, 23].\nVariational inference for Bayesian nonparametric models uses a truncated stick-breaking representation\nin the variational distribution [9] \u2013 free variational parameters are allowed only up to the\ntruncation level. If the truncation is too large, the variational algorithm will still isolate only a subset\nof components; if the truncation is too small, methods have been developed to expand the truncated\nstick as part of the variational algorithm [10]. In the nCRP mixture, however, the challenge is that the\ntree structure is too large even to effectively truncate. We will address this by defining search criteria\nfor adaptively adjusting the structure of the variational distribution, searching over the set of trees to\nbest accommodate the data.\n\n3.1 Variational inference based on the tree-based stick-breaking construction\n\nWe first address the problem of variational inference with a truncated tree of fixed structure. Suppose\nthat we have a truncated tree T and let M_T be the set of all nodes in T. Our family of variational\ndistributions is defined as follows,\n\n    q(W, V, x_{1:N}, c_{1:N}) = \u220f_{i\u2208M_T} q(w_i) q(v_i) \u220f_{i\u2209M_T} p(w_i) p(v_i) \u220f_{n=1}^{N} q(c_n) q(x_n),        (4)\n\nwhere: (1) Distributions p(w_i) and p(v_i) for i \u2209 M_T are the prior distributions, containing\nno variational parameters; (2) Distributions q(w_i) and q(v_i) for i \u2208 M_T contain the variational\nparameters that we want to optimize for the truncated tree T; (3) Distribution q(c_n) is the variational\nmultinomial distribution over all the possible paths, not just those in the truncated tree T. 
Note that\nthere are an infinite number of paths. We will address this issue below; (4) Distribution q(x_n) is the\nvariational distribution for the latent variable x_n and it is in the same family of distributions as p(x_n).\nIn summary, this family of distributions retains the infinite tree structure. Moreover, this family is\nnested [10, 11]: If a truncated tree T_1 is a subtree of a truncated tree T_2 then variational distributions\ndefined over T_1 are a special case of those defined over T_2. Theoretically, the solution found using\nT_2 is at least as good as the one found using T_1. This allows us to use greedy search to find a better\ntree structure.\nWith the variational distributions (Equation 4) and the joint distributions (Equation 3), we turn to the\ndetails of posterior inference. Equivalent to minimizing KL is tightening the bound on the likelihood\nof the observations D = {t_n}_{n=1}^{N} given by Jensen\u2019s inequality [8],\n\n    log p(t_{1:N}) \u2265 E_q[log p(t_{1:N}, V, W, x_{1:N}, c_{1:N})] \u2212 E_q[log q(V, W, x_{1:N}, c_{1:N})]\n        = \u2211_{i\u2208M_T} E_q[log (p(w_i) p(v_i) / (q(w_i) q(v_i)))] + \u2211_{n=1}^{N} E_q[log (p(x_n) / q(x_n))]\n          + \u2211_{n=1}^{N} E_q[log (p(t_n | x_n, W_{c_n}) p(c_n | V) / q(c_n))]\n        \u225c L(q).        (5)\n\nWe optimize L(q) using coordinate ascent. First we isolate the terms that only contain q(c_n),\n\n    L(q(c_n)) = E_q[log p(t_n | x_n, W_{c_n}) p(c_n | V)] \u2212 E_q[log q(c_n)].        (6)\n\nThen we find the optimal solution for q(c_n) by setting the gradient to zero:\n\n    q(c_n = c) \u221d S_{n,c} \u225c exp{E_q[log p(c_n = c | V)] + E_q[log p(t_n | x_n, W_c)]}.        (7)\n\nSince q(c_n = c) takes infinitely many values, running coordinate ascent over q(c_n = c) directly is difficult. We\nplug the optimal q(c_n) (Equation 7) into Equation 6 to obtain the lower bound\n\n    L(q(c_n)) = log \u2211_c S_{n,c}.        (8)\n\nTwo issues arise: 1) the variational distribution q(c_n) has an infinite number of values, and we need\nto find an efficient way to manipulate this; 2) the lower bound log \u2211_c S_{n,c} (Equation 8) contains an\ninfinite sum, which poses a problem in evaluation. In the appendix, we show that all the operations\ncan be done only via the truncated tree T. We summarize the results as follows. Let c\u0304 be a path in\nT, either an inner path (a path ending at an inner node) or a full path (a path ending at a leaf node).\nNote that the inner path is only defined for the truncated tree T. The number of such c\u0304 is finite. In the\nnCRP tree, denote child(c\u0304) as the set of all full paths that are not in T but include c\u0304 as a sub path.\nAs a special case, if c\u0304 is a full path, child(c\u0304) just contains itself. As shown in the appendix, we can\ncompute these quantities efficiently:\n\n    q(c_n = c\u0304) \u225c \u2211_{c: c\u2208child(c\u0304)} q(c_n = c)   and   S_{n,c\u0304} \u225c \u2211_{c: c\u2208child(c\u0304)} S_{n,c}.        (9)\n\nConsequently, iterating over the truncated tree T using c\u0304 is the same as iterating over all the full paths in\nthe nCRP tree. These quantities are all we need for variational inference.\nNext, we move to optimize q(v_i | a_i, b_i) for i \u2208 M_T, where a_i and b_i are variational parameters for the\nBeta distribution q(v_i). Let the path containing v_i be [1, c_2, \u00b7\u00b7\u00b7, c_{\u2113_0}], where \u2113_0 \u2264 L. 
We isolate the\nterm that only contains v_i from the lower bound (Equation 5),\n\n    L(q(v_i)) = E_q[log p(v_i) \u2212 log q(v_i)] + \u2211_{n=1}^{N} \u2211_c q(c_n = c) log p(c_n = c | V).        (10)\n\nAfter plugging Equation 2 into Equation 10 and setting the gradient to zero, we obtain the optimal q(v_i),\n\n    q(v_i) \u221d v_i^{a_i^\u2217 \u2212 1} (1 \u2212 v_i)^{b_i^\u2217 \u2212 1},\n    a_i^\u2217 = 1 + \u2211_{n=1}^{N} \u2211_{c_{\u2113_0+1},\u00b7\u00b7\u00b7,c_L} q(c_n = [1, c_2, \u00b7\u00b7\u00b7, c_{\u2113_0}, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L]),\n    b_i^\u2217 = \u03b3 + \u2211_{n=1}^{N} \u2211_{j,c_{\u2113_0+1},\u00b7\u00b7\u00b7,c_L: j>c_{\u2113_0}} q(c_n = [1, c_2, \u00b7\u00b7\u00b7, c_{\u2113_0\u22121}, j, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L]),        (11)\n\nwhere the infinite sums involved can be computed using Equation 9.\nThe variational update functions for W and x depend on the actual distributions we use, and deriving\nthem is straightforward. If they include an infinite sum then we apply similar techniques as we did\nfor q(v_i).\n\n3.2 Refining the tree structure during variational inference\n\nSince our variational distribution is nested, a larger truncated tree will always (theoretically) achieve\na lower bound at least as tight as a smaller truncated tree. This allows us to search the infinite tree\nspace until a certain criterion is satisfied (e.g., relative change of the lower bound). To this end,\nwe present several heuristics to guide the search. All these operations are performed on the truncated\ntree T.\n\nGrow. This operation is similar to what Gibbs sampling does in searching the tree space. We\nimplement two heuristics: 1) Randomly choose several data points, and for each of them sample\na path c\u0304 according to q(c_n = c\u0304). If it is an inner path, expand it to a full path; 2) For every inner\npath in T, first compute the quantity g(c\u0304) = \u2211_{n=1}^{N} q(c_n = c\u0304). Then sample an inner path (say c\u0304^\u2217)\naccording to g(c\u0304), and expand it to a full path.\n\nPrune. If a certain path gets very little probability assignment from all data points, we eliminate\nthis path \u2013 for path c, the criterion is \u2211_{n=1}^{N} q(c_n = c) < \u03b4, where \u03b4 is a small number (we use\n\u03b4 = 10^{\u22126}). This mimics Gibbs sampling in the sense that for the nCRP (or CRP), if a certain path (table)\ngets no assignments in the sampling process, it will never get any assignment again according to\nEquation 1.\n\nMerge. If paths i and j give almost equal posterior distributions, merging these two paths is\nemployed [24]. The measure is J(i, j) = P_i^T P_j / (|P_i| |P_j|), where P_i = [q(c_1 = i), \u00b7\u00b7\u00b7, q(c_N = i)]^T.\nWe use 0.95 as the threshold in our experiments.\nIn theory, Prune and Merge may decrease the lower bound. Empirically, we found that even when\nthey do, the effect is negligible (while the size of the tree is reduced). For continuous data settings, we\nadditionally implement the Split method used in [24].\n\n4 Experiments\n\nIn this section, we demonstrate variational inference for the nCRP. We analyze both discrete and\ncontinuous data using the two applications discussed in Section 2.\n\nPer-word test set likelihood\nMethod | JACM | Psy. Review | PNAS\nGibbs sampling | \u22125.3922 \u00b1 0.0052 | \u22125.7834 \u00b1 0.0149 | \u22126.4961 \u00b1 0.0068\nVar. inference | \u22125.4331 \u00b1 0.0100 | \u22125.8430 \u00b1 0.0153 | \u22126.5736 \u00b1 0.0050\nVar. inference (G) | \u22125.4495 \u00b1 0.0118 | \u22125.8593 \u00b1 0.0157 | \u22126.5996 \u00b1 0.0153\n\nTable 1: Test set likelihood comparison on three datasets. Var. inference (G): variational inference\ninitialized from the initialization of Gibbs sampling. 
Variational inference can give competitive\nperformance on test set likelihood.\n\n4.1 Hierarchical topic modeling\n\nFor discrete data, we compare variational inference with Gibbs sampling for hierarchical\ntopic modeling. Three corpora are used in the experiments: (1) JACM: a collection of 536 abstracts\nfrom the Journal of the ACM from years 1987 to 2004 with a vocabulary size of 1,539 and around\n68K words; (2) Psy. Review: a collection of 1,272 psychology abstracts from Psychological Review\nfrom years 1967 to 2003, with a vocabulary size of 1,971 and around 137K words; (3) PNAS: a\ncollection of 5,000 abstracts from the Proceedings of the National Academy of Sciences from years\n1991 to 2001, with a vocabulary size of 7,762 and around 895K words. Terms occurring in\nfewer than 5 documents were removed.\nLocal maxima can be a problem for both Gibbs sampling and variational inference. To avoid them in\nGibbs sampling, we randomly restart the sampler 200 times and take the trajectory with the highest\naverage posterior likelihood. We run the Gibbs sampling for 10,000 iterations and collect the results\nfor post analysis. For variational inference, we use two types of initialization: 1) similar to Gibbs\nsampling, we gradually add data points during the variational inference as well \u2013 add a new path for\neach document in the initialization; 2) we initialize the variational inference from the initialization\nfor Gibbs sampling \u2013 using the MAP estimate from one Gibbs sample. We set L = 3 for all the\nexperiments and use the same hyperparameters in both algorithms. Specifically, the stick-breaking\nprior parameter \u03b3 is set to 1.0; the symmetric Dirichlet prior parameter for the topics is set to 1.0; the\nprior for level proportions is skewed to favor high levels (50, 20, 10). (This is suggested in [1].) We\nrun the variational inference until the relative change of log-likelihood is less than 0.001.\n\nPer-word test set likelihood. 
We use test set likelihood as a measure of performance. The procedure\nis to divide the corpus into a training set D_train and a test set D_test, and approximate the likelihood\nof D_test given D_train. We use the same method as Teh et al. [12] to approximate it. Specifically, we\nuse posterior means \u03b8\u0302 and \u03b2\u0302 to represent the estimated topic mixture proportions over L levels and\ntopic multinomial parameters. For the variational method, we use\n\n    p({t_1, \u00b7\u00b7\u00b7, t_N}_test) = \u220f_{n=1}^{N} \u2211_c q(c_n = c) \u220f_j \u2211_\u2113 \u03b8\u0302_{n,\u2113} \u03b2\u0302_{c_\u2113,t_{nj}},\n\nwhere \u03b8\u0302 and \u03b2\u0302 are estimated using mean values from the variational distributions. For Gibbs sampling,\nwe use S samples and compute\n\n    p({t_1, \u00b7\u00b7\u00b7, t_N}_test) = \u220f_{n=1}^{N} (1/S) \u2211_{s=1}^{S} \u2211_c \u03b4_{c_n^s} \u220f_j \u2211_\u2113 \u03b8\u0302^s_{n,\u2113} \u03b2\u0302^s_{c_\u2113,t_{nj}},\n\nwhere \u03b8\u0302^s and \u03b2\u0302^s are estimated using sample s [25, 12]. We use 30 samples collected at a lag of\n10 after a 200-sample burn-in for a document in the test set. In effect, (1/S) \u2211_{s=1}^{S} \u03b4_{c_n^s} gives the\nempirical estimate of p(c_n), whereas in variational inference we approximate it using q(c_n). Table 1\nshows the test likelihood comparison using five-fold cross validation. This shows our model can give\ncompetitive performance in terms of the test set likelihood. This discrepancy is similar to that in [12]\nwhen variational inference is compared with collapsed Gibbs sampling for HDP.\n\nTopic visualizations. Figures 2 and 3 show the tree-based topic visualizations from JACM and\nPNAS datasets. 
These are quite similar to those obtained by Gibbs sampling (see [1]).\n\nFigure 2: A sub network discovered on JACM dataset, each topic represented by top 5 terms. The\nwhole tree has 30 nodes, with an average branching factor of 2.64.\n\nFigure 3: A sub network discovered on PNAS dataset, each topic represented by top 5 terms. The\nwhole tree has 45 nodes, with an average branching factor of 2.93.\n\n4.2 Modeling handwritten digits using hierarchical component analysis\n\nFor continuous data, we use the hierarchical component analysis for modeling handwritten digits\n(http://archive.ics.uci.edu/ml). This dataset contains 3823 handwritten digits as a training set and\n1797 as a testing set. Each digit contains 64 integer attributes, ranging from 0 to 16. As described in\nSection 2, we use PPCA [16] as the basic model for each path. We use a global mean parameter \u00b5 for\nall paths, although a model with an individual mean parameter for each path can be similarly derived.\nWe put broad priors over the parameters similar to those in variational Bayesian PCA [17]. The\nstick-breaking prior parameter \u03b3 is set to 1.0; for each node, w \u223c N(0, 10^3); \u00b5 \u223c N(0, 10^3);\nthe inverse of the variance for the noise model in PPCA is \u03c4 and \u03c4 \u223c Gamma(10^{\u22123}, 10^{\u22123}). Again,\nwe run the variational inference until the relative change of log-likelihood is less than 0.001.\nWe compare the reconstruction error with PCA. To compute the reconstruction error for our model,\nwe first select the path for each data point using its MAP estimate, c\u0302_n = arg max_c q(c_n = c).\nThen we use an approach similar to [26, 24] to reconstruct t_n,\n\n    t\u0302_n = W_{c\u0302_n} (W_{c\u0302_n}^T W_{c\u0302_n})^{\u22121} W_{c\u0302_n}^T (t_n \u2212 \u00b5\u0302) + \u00b5\u0302.\n\nWe test our model using depth L = 2, 3, 4, 5. All of our models run within 2 minutes. The\nreconstruction errors for both the training and testing set are shown in Table 2. 
Our model gives lower\nreconstruction errors than PCA.\n\nReconstruction error on handwritten digits\n#Depth | HCA (tr) | PCA (tr) | HCA (te) | PCA (te)\n2(9) | 631.6 | 878.5 | 699.4 | 863.0\n3(14) | 559.8 | 727.7 | 585.6 | 722.3\n4(18) | 463.4 | 633.0 | 506.1 | 621.0\n5(22) | 384.8 | 564.2 | 461.8 | 553.0\n\nTable 2: Reconstruction error comparison (tr: train; te: test). HCA stands for hierarchical component\nanalysis. PCA uses the L largest components. In the first column, 2(9) means L = 2 with 9 nodes\ninferred using our model. Others are similarly defined. HCA gives lower reconstruction errors.\n\n5 Conclusions\n\nIn this paper, we presented the variational inference algorithm for the nested Chinese restaurant\nprocess based on its tree-based stick-breaking construction. Our result indicates that variational\ninference is a powerful alternative to the widely used Gibbs sampling. We also adapt the\nnCRP to model continuous data, e.g. in hierarchical component analysis.\n\nAcknowledgements. We thank anonymous reviewers for insightful suggestions. David M. 
Blei is\nsupported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft.\n\nAppendix: efficiently manipulating S_{n,c} and q(c_n = c)\n\nCase 1: All nodes of the path are in T, c \u2282 M_T. Let Z_0 \u225c E_q[log p(t_n | x_n, W_c)]. We have\n\n    S_{n,c} = exp{ E_q[ \u2211_{\u2113=1}^{L} ( log v_{1,c_2,\u00b7\u00b7\u00b7,c_\u2113} + \u2211_{j=1}^{c_\u2113\u22121} log(1 \u2212 v_{1,c_2,\u00b7\u00b7\u00b7,j}) ) ] + Z_0 }.        (12)\n\nCase 2: At least one node is not in T, c \u2284 M_T. Although c \u2284 M_T, c must have some nodes\nin M_T. Then c can be written as c = [c\u0304, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L], where c\u0304 \u225c [1, c_2, \u00b7\u00b7\u00b7, c_{\u2113_0}] \u2282 M_T and\n[c\u0304, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_\u2113] \u2284 M_T for any \u2113 > \u2113_0. In the truncated tree T, let j_0 be the maximum index\nfor the child node whose parent path is c\u0304; then we know that if c_{\u2113_0+1} > j_0, [c\u0304, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L] \u2284 M_T.\nNow we fix the sub path c\u0304 and let [c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L] vary (satisfying c_{\u2113_0+1} > j_0). All these possible\npaths constitute a set: child(c\u0304) \u225c {[c\u0304, c_{\u2113_0+1}, \u00b7\u00b7\u00b7, c_L] : c_{\u2113_0+1} > j_0}. According to Equation 4, for\nany c \u2208 child(c\u0304), Z_0 \u225c E_q[log p(t_n | x_n, W_c)] is constant, since the variational distribution for w\noutside the truncated tree is the same prior distribution. We have\n\n    \u2211_{c\u2208child(c\u0304)} S_{n,c}\n      = \u2211_{c\u2208child(c\u0304)} exp{ Z_0 + E_q[ \u2211_{\u2113=1}^{L} ( log v_{1,\u00b7\u00b7\u00b7,c_\u2113} + \u2211_{j=1}^{c_\u2113\u22121} log(1 \u2212 v_{1,c_2,\u00b7\u00b7\u00b7,j}) ) ] }\n      = [ exp(Z_0 + (L \u2212 \u2113_0) E_p[log v]) / (1 \u2212 exp(E_p[log(1 \u2212 v)]))^{L\u2212\u2113_0} ] \u00d7 exp( E_q[ \u2211_{j=1}^{j_0} log(1 \u2212 v_{1,c_2,\u00b7\u00b7\u00b7,c_{\u2113_0},j}) ] )\n        \u00d7 exp{ E_q[ \u2211_{\u2113=1}^{\u2113_0} ( log v_{1,c_2,\u00b7\u00b7\u00b7,c_\u2113} + \u2211_{j=1}^{c_\u2113\u22121} log(1 \u2212 v_{1,c_2,\u00b7\u00b7\u00b7,j}) ) ] },        (13)\n\nwhere v \u223c Beta(1, \u03b3). Such cases contain all inner nodes in the truncated tree T. Note that Case 1\nis a special case of Case 2 by setting \u2113_0 = L. Given all these, \u2211_c S_{n,c} can be computed efficiently.\nFurthermore, given Equation 13 and Equation 7, we define\n\n    q(c_n = c\u0304) \u225c \u2211_{c\u2208child(c\u0304)} q(c_n = c) \u221d \u2211_{c\u2208child(c\u0304)} S_{n,c},        (14)\n\nwhich corresponds to the sum of probabilities from all paths in child(c\u0304). We note that this organization\nonly depends on the truncated tree T and is sufficient for variational inference.\n\nReferences\n\n[1] Blei, D. M., T. L. Griffiths, M. I. Jordan, et al. Hierarchical topic models and the nested Chinese restaurant\n\nprocess. In NIPS. 2003.\n\n[2] Bart, E., I. Porteous, P. Perona, et al. Unsupervised learning of visual taxonomies. In CVPR. 2008.\n[3] Sivic, J., B. C. Russell, A. Zisserman, et al. Unsupervised discovery of visual object class hierarchies. In\n\nCVPR. 2008.\n\n[4] Aldous, D. Exchangeability and related topics. In Ecole d\u2019Ete de Probabilities de Saint-Flour XIII 1983,\n\npages 1\u2013198. Springer, 1985.\n\n[5] Ferguson, T. S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209\u2013230,\n\n1973.\n\n[6] Neal, R. 
Probabilistic inference using Markov chain Monte Carlo methods. Tech. Rep. CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.\n[7] Robert, C., G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, NY, 2004.\n[8] Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, et al. An introduction to variational methods for graphical models. Learning in Graphical Models, 1999.\n[9] Blei, D. M., M. I. Jordan. Variational methods for the Dirichlet process. In ICML. 2004.\n[10] Kurihara, K., M. Welling, N. A. Vlassis. Accelerated variational Dirichlet process mixtures. In NIPS. 2006.\n[11] Kurihara, K., M. Welling, Y. W. Teh. Collapsed variational Dirichlet process mixture models. In IJCAI. 2007.\n[12] Teh, Y. W., K. Kurihara, M. Welling. Collapsed variational inference for HDP. In NIPS. 2008.\n[13] Sudderth, E. B., M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In NIPS. 2008.\n[14] Doshi, F., K. T. Miller, J. Van Gael, et al. Variational inference for the Indian buffet process. In AISTATS, vol. 12. 2009.\n[15] Escobar, M. D., M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577\u2013588, 1995.\n[16] Tipping, M. E., C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611\u2013622, 1999.\n[17] Bishop, C. M. Variational principal components. In ICANN. 1999.\n[18] Collins, M., S. Dasgupta, R. E. Schapire. A generalization of principal components analysis to the exponential family. In NIPS. 2001.\n[19] Mohamed, S., K. A. Heller, Z. Ghahramani. Bayesian exponential family PCA. In NIPS. 2008.\n[20] Bach, F. R., M. I. Jordan. Beyond independent components: Trees and clusters. JMLR, 4:1205\u20131233, 2003.\n[21] Antoniak, C. E.
Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152\u20131174, 1974.\n[22] Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639\u2013650, 1994.\n[23] Wainwright, M., M. Jordan. Variational inference in graphical models: The view from the marginal polytope. In Allerton Conference on Control, Communication and Computation. 2003.\n[24] Ueda, N., R. Nakano, Z. Ghahramani, et al. SMEM algorithm for mixture models. Neural Computation, 12(9):2109\u20132128, 2000.\n[25] Griffiths, T. L., M. Steyvers. Finding scientific topics. Proc Natl Acad Sci USA, 101 Suppl 1:5228\u20135235, 2004.\n[26] Tipping, M. E., C. M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443\u2013482, 1999.\n", "award": [], "sourceid": 214, "authors": [{"given_name": "Chong", "family_name": "Wang", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}