{"title": "Hierarchical Topic Models and the Nested Chinese Restaurant Process", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": "", "full_text": "Hierarchical Topic Models and the Nested Chinese Restaurant Process\n\nDavid M. Blei\nblei@cs.berkeley.edu\n\nThomas L. Griffiths\ngruffydd@mit.edu\n\nMichael I. Jordan\njordan@cs.berkeley.edu\n\nJoshua B. Tenenbaum\njbt@mit.edu\n\nUniversity of California, Berkeley\nBerkeley, CA 94720\n\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nAbstract\n\nWe address the problem of learning topic hierarchies from data. The\nmodel selection problem in this domain is daunting: which of the large\ncollection of possible trees to use? We take a Bayesian approach,\ngenerating an appropriate prior via a distribution on partitions that we refer\nto as the nested Chinese restaurant process. This nonparametric prior\nallows arbitrarily large branching factors and readily accommodates\ngrowing data collections. We build a hierarchical topic model by combining\nthis prior with a likelihood that is based on a hierarchical variant of latent\nDirichlet allocation. We illustrate our approach on simulated data and\nwith an application to the modeling of NIPS abstracts.\n\n1 Introduction\n\nComplex probabilistic models are increasingly prevalent in domains such as\nbioinformatics, information retrieval, and vision. These domains create fundamental modeling\nchallenges due to their open-ended nature: data sets often grow over time, and as they grow\nthey bring new entities and new structures to the fore. 
Current statistical modeling tools\noften seem too rigid in this regard; in particular, classical model selection techniques based\non hypothesis testing are poorly matched to problems in which data can continue to accrue\nand unbounded sets of often incommensurate structures must be considered at each step.\n\nAn important instance of such modeling challenges is provided by the problem of learning\na topic hierarchy from data. Given a collection of \u201cdocuments,\u201d each of which contains a\nset of \u201cwords,\u201d we wish to discover common usage patterns or \u201ctopics\u201d in the documents,\nand to organize these topics into a hierarchy. (Note that while we use the terminology of\ndocument modeling throughout this paper, the methods that we describe are general.) In\nthis paper, we develop efficient statistical methods for constructing such a hierarchy which\nallow it to grow and change as the data accumulate.\n\nWe approach this model selection problem by specifying a generative probabilistic model\nfor hierarchical structures and taking a Bayesian perspective on the problem of learning\nthese structures from data. Thus our hierarchies are random variables; moreover, these\nrandom variables are specified procedurally, according to an algorithm that constructs the\nhierarchy as data are made available. The probabilistic object that underlies this approach\nis a distribution on partitions of integers known as the Chinese restaurant process [1]. We\nshow how to extend the Chinese restaurant process to a hierarchy of partitions, and show\nhow to use this new process as a representation of prior and posterior distributions for topic\nhierarchies.\n\nThere are several possible approaches to the modeling of topic hierarchies. In our approach,\neach node in the hierarchy is associated with a topic, where a topic is a distribution across\nwords. 
A document is generated by choosing a path from the root to a leaf, repeatedly\nsampling topics along that path, and sampling the words from the selected topics. Thus\nthe organization of topics into a hierarchy aims to capture the breadth of usage of topics\nacross the corpus, reflecting underlying syntactic and semantic notions of generality and\nspecificity. This approach differs from models of topic hierarchies which are built on the\npremise that the distributions associated with parents and children are similar [2]. We\nassume no such constraint; for example, the root node may place all of its probability mass\non function words, with none of its descendants placing any probability mass on function\nwords. Our model more closely resembles the hierarchical topic model considered in [3],\nthough that work does not address the model selection problem which is our primary focus.\n\n2 Chinese restaurant processes\n\nWe begin with a brief description of the Chinese restaurant process and subsequently show\nhow this process can be extended to hierarchies.\n\n2.1 The Chinese restaurant process\n\nThe Chinese restaurant process (CRP) is a distribution on partitions of integers obtained\nby imagining a process by which M customers sit down in a Chinese restaurant with an\ninfinite number of tables.1 The basic process is specified as follows. The first customer sits\nat the first table. The mth subsequent customer sits at a table drawn from the following\ndistribution:\n\n$$p(\\text{occupied table } i \\mid \\text{previous customers}) = \\frac{m_i}{\\gamma + m - 1}$$\n$$p(\\text{next unoccupied table} \\mid \\text{previous customers}) = \\frac{\\gamma}{\\gamma + m - 1} \\qquad (1)$$\n\nwhere $m_i$ is the number of previous customers at table $i$, and $\\gamma$ is a parameter. After M\ncustomers sit down, the seating plan gives a partition of M items. This distribution gives\nthe same partition structure as draws from a Dirichlet process [4]. 
However, the CRP also\nallows several variations on the basic rule in Eq. (1), including a data-dependent choice of\n$\\gamma$ and a more general functional dependence on the current partition [5]. This flexibility\nwill prove useful in our setting.\n\nThe CRP has been used to represent uncertainty over the number of components in a\nmixture model. In a species sampling mixture [6], each table in the Chinese restaurant is\nassociated with a draw from $p(\\beta \\mid \\eta)$ where $\\beta$ is a mixture component parameter. Each\ndata point is generated by choosing a table $i$ from Eq. (1) and then sampling a value from\nthe distribution parameterized by $\\beta_i$ (the parameter associated with that table). Given a\ndata set, the posterior under this model has two components. First, it is a distribution over\nseating plans; the number of mixture components is determined by the number of tables\nwhich the data occupy. Second, given a seating plan, the particular data which are sitting at\neach table induce a distribution on the associated parameter $\\beta$ for that table. The posterior\ncan be estimated using Markov chain Monte Carlo [7]. Applications to various kinds of\nmixture models have begun to appear in recent years; examples include Gaussian mixture\nmodels [8], hidden Markov models [9] and mixtures of experts [10].\n\n1 The terminology was inspired by the Chinese restaurants in San Francisco which seem to have\nan infinite seating capacity. It was coined by Jim Pitman and Lester Dubins in the early eighties [1].\n\n2.2 Extending the CRP to hierarchies\n\nThe CRP is amenable to mixture modeling because we can establish a one-to-one\nrelationship between tables and mixture components and a one-to-many relationship between\nmixture components and data. 
In the models that we will consider, however, each data\npoint is associated with multiple mixture components which lie along a path in a hierarchy.\nWe develop a hierarchical version of the CRP to use in specifying a prior for such models.\n\nA nested Chinese restaurant process can be defined by imagining the following scenario.\nSuppose that there are an infinite number of infinite-table Chinese restaurants in a city. One\nrestaurant is determined to be the root restaurant and on each of its infinite tables is a card\nwith the name of another restaurant. On each of the tables in those restaurants are cards that\nrefer to other restaurants, and this structure repeats infinitely. Each restaurant is referred to\nexactly once; thus, the restaurants in the city are organized into an infinitely-branched tree.\nNote that each restaurant is associated with a level in this tree (e.g., the root restaurant is at\nlevel 1 and the restaurants it refers to are at level 2).\n\nA tourist arrives in the city for a culinary vacation. On the first evening, he enters the root\nChinese restaurant and selects a table using Eq. (1). On the second evening, he goes to the\nrestaurant identified on the first night\u2019s table and chooses another table, again from Eq. (1).\nHe repeats this process for L days. At the end of the trip, the tourist has sat at L restaurants\nwhich constitute a path from the root to a restaurant at the Lth level in the infinite tree\ndescribed above. After M tourists take L-day vacations, the collection of paths describes a\nparticular L-level subtree of the infinite tree (see Figure 1a for an example of such a tree).\n\nThis prior can be used to model topic hierarchies. 
Just as a standard CRP can be used to\nexpress uncertainty about a possible number of components, the nested CRP can be used\nto express uncertainty about possible L-level trees.\n\n3 A hierarchical topic model\n\nLet us consider a data set composed of a corpus of documents. Each document is a\ncollection of words, where a word is an item in a vocabulary. Our basic assumption is that\nthe words in a document are generated according to a mixture model where the mixing\nproportions are random and document-specific. Consider a multinomial variable $z$, and an\nassociated set of distributions over words $p(w \\mid z, \\beta)$, where $\\beta$ is a parameter. These\ntopics (one distribution for each possible value of $z$) are the basic mixture components in our\nmodel. The document-specific mixing proportions associated with these components are\ndenoted by a vector $\\theta$. Temporarily assuming K possible topics in the corpus, an\nassumption that we will soon relax, $z$ thus ranges over K possible values and $\\theta$ is a K-dimensional\nvector. Our document-specific mixture distribution is\n\n$$p(w \\mid \\theta) = \\sum_{i=1}^{K} \\theta_i \\, p(w \\mid z = i, \\beta_i),$$\n\nwhich is a random distribution since $\\theta$ is random.\n\nWe now specify the following two-level generative probabilistic process for generating a\ndocument: (1) choose a K-vector $\\theta$ of topic proportions from a distribution $p(\\theta \\mid \\alpha)$, where\n$\\alpha$ is a corpus-level parameter; (2) repeatedly sample words from the mixture distribution\n$p(w \\mid \\theta)$ for the chosen value of $\\theta$. When the distribution $p(\\theta \\mid \\alpha)$ is chosen to be a Dirichlet\ndistribution, we obtain the latent Dirichlet allocation model (LDA) [11]. 
LDA is thus a two-level\ngenerative process in which documents are associated with topic proportions, and the\ncorpus is modeled as a Dirichlet distribution on these topic proportions.\n\nWe now describe an extension of this model in which the topics lie in a hierarchy. For the\nmoment, suppose we are given an L-level tree and each node is associated with a topic.\nA document is generated as follows: (1) choose a path from the root of the tree to a leaf;\n(2) draw a vector of topic proportions $\\theta$ from an L-dimensional Dirichlet; (3) generate the\nwords in the document from a mixture of the topics along the path from root to leaf, with\nmixing proportions $\\theta$. This model can be viewed as a fully generative version of the cluster\nabstraction model [3].\n\nFinally, we use the nested CRP to relax the assumption of a fixed tree structure. As we have\nseen, the nested CRP can be used to place a prior on possible trees. We also place a prior on\nthe topics $\\beta_i$, each of which is associated with a restaurant in the infinite tree (in particular,\nwe assume a symmetric Dirichlet with hyperparameter $\\eta$). A document is drawn by first\nchoosing an L-level path through the restaurants and then drawing the words from the L\ntopics which are associated with the restaurants along that path. Note that all documents\nshare the topic associated with the root restaurant.\n\n1. Let $c_1$ be the root restaurant.\n2. For each level $\\ell \\in \\{2, \\ldots, L\\}$:\n    (a) Draw a table from restaurant $c_{\\ell-1}$ using Eq. (1). Set $c_\\ell$ to be the restaurant\n        referred to by that table.\n3. Draw an L-dimensional topic proportion vector $\\theta$ from Dir($\\alpha$).\n4. For each word $n \\in \\{1, \\ldots, N\\}$:\n    (a) Draw $z \\in \\{1, \\ldots, L\\}$ from Mult($\\theta$).\n    (b) Draw $w_n$ from the topic associated with restaurant $c_z$.\n\nThis model, hierarchical LDA (hLDA), is illustrated in Figure 1b. 
The node labeled T\nrefers to a collection of an infinite number of L-level paths drawn from a nested CRP.\nGiven T, the $c_{m,\\ell}$ variables are deterministic: simply look up the $\\ell$th level of the $m$th path\nin the infinite collection of paths. However, not having observed T, the distribution of $c_{m,\\ell}$\nwill be defined by the nested Chinese restaurant process, conditioned on all the $c_{q,\\ell}$ for\n$q < m$.\n\nNow suppose we are given a corpus of M documents, $w_1, \\ldots, w_M$. The posterior on\nthe $c$\u2019s is essentially transferred (via the deterministic relationship) to a posterior on the\nfirst M paths in T. Consider a new document $w_{M+1}$. Its posterior path will depend,\nthrough the unobserved T, on the posterior paths of all the documents in the original corpus.\nSubsequent new documents will also depend on the original corpus and any new documents\nwhich were observed before them. Note that, through Eq. (1), any new document can\nchoose a previously unvisited restaurant at any level of the tree. That is, even if we have a\npeaked posterior on T which has essentially selected a particular tree, a new document can\nchange that hierarchy if its words provide justification for such a change.\n\nIn another variation of this model, we can consider a process that flattens the nested CRP\ninto a standard CRP, but retains the idea that a tourist eats L meals. That is, the tourist eats\nL times in a single restaurant under the constraint that he does not choose the same table\ntwice. Though the vacation is less interesting, this model provides an interesting prior. In\nparticular, it can be used as a prior for a flat LDA model in which each document can use\nat most L topics from the potentially infinite total set of topics. 
We examine such a model\nin Section 5 to compare CRP methods with selection based on Bayes factors.\n\n4 Approximate inference by Gibbs sampling\n\nIn this section, we describe a Gibbs sampling algorithm for sampling from the posterior\nnested CRP and corresponding topics in the hLDA model. The Gibbs sampler provides\na method for simultaneously exploring the parameter space (the particular topics of the\ncorpus) and the model space (L-level trees).\n\nThe variables needed by the sampling algorithm are: $w_{m,n}$, the $n$th word in the $m$th\ndocument (the only observed variables in the model); $c_{m,\\ell}$, the restaurant corresponding to the\n$\\ell$th topic in document $m$; and $z_{m,n}$, the assignment of the $n$th word in the $m$th document\nto one of the L available topics. All other variables in the model, $\\theta$ and $\\beta$, are integrated\nout. The Gibbs sampler thus assesses the values of $z_{m,n}$ and $c_{m,\\ell}$.\n\nFigure 1: (a) The paths of four tourists through the infinite tree of Chinese restaurants (L = 3).\nThe solid lines connect each restaurant to the restaurants referred to by its tables. The\ncollected paths of the four tourists describe a particular subtree of the underlying infinite\ntree. This illustrates a sample from the state space of the posterior nested CRP of Figure 1b\nfor four documents. (b) The graphical model representation of hierarchical LDA with a\nnested CRP prior. We have separated the nested Chinese restaurant process from the topics.\nEach of the infinite $\\beta$\u2019s corresponds to one of the restaurants.\n\nConceptually, we divide the Gibbs sampler into two parts. First, given the current state\nof the CRP, we sample the $z_{m,n}$ variables of the underlying LDA model following the\nalgorithm developed in [12], which we do not reproduce here. 
Second, given the values of\nthe LDA hidden variables, we sample the $c_{m,\\ell}$ variables which are associated with the CRP\nprior. The conditional distribution for $c_m$, the L topics associated with document $m$, is:\n\n$$p(c_m \\mid w, c_{-m}, z) \\propto p(w_m \\mid c, w_{-m}, z) \\, p(c_m \\mid c_{-m}),$$\n\nwhere $w_{-m}$ and $c_{-m}$ denote the $w$ and $c$ variables for all documents other than $m$. This\nexpression is an instance of Bayes\u2019 rule with $p(w_m \\mid c, w_{-m}, z)$ as the likelihood of the data\ngiven a particular choice of $c_m$ and $p(c_m \\mid c_{-m})$ as the prior on $c_m$ implied by the nested\nCRP. The likelihood is obtained by integrating over the parameters $\\beta$, which gives:\n\n$$p(w_m \\mid c, w_{-m}, z) = \\prod_{\\ell=1}^{L} \\left( \\frac{\\Gamma(n^{(\\cdot)}_{c_{m,\\ell},-m} + W\\eta)}{\\prod_w \\Gamma(n^{(w)}_{c_{m,\\ell},-m} + \\eta)} \\cdot \\frac{\\prod_w \\Gamma(n^{(w)}_{c_{m,\\ell},-m} + n^{(w)}_{c_{m,\\ell},m} + \\eta)}{\\Gamma(n^{(\\cdot)}_{c_{m,\\ell},-m} + n^{(\\cdot)}_{c_{m,\\ell},m} + W\\eta)} \\right),$$\n\nwhere $n^{(w)}_{c_{m,\\ell},-m}$ is the number of instances of word $w$ that have been assigned to the topic\nindexed by $c_{m,\\ell}$, not including those in the current document, $W$ is the total vocabulary\nsize, and $\\Gamma(\\cdot)$ denotes the standard gamma function. When $c$ contains a previously unvisited\nrestaurant, $n^{(w)}_{c_{m,\\ell},-m}$ is zero.\n\nNote that the $c_m$ must be drawn as a block. The set of possible values for $c_m$ corresponds\nto the union of the set of existing paths through the tree, equal in number to the leaves,\nwith the set of possible novel paths, equal in number to the internal nodes. This set can be\nenumerated and scored using Eq. (1) and the definition of a nested CRP in Section 2.2.\n\nFigure 2: (a) Six sample documents from a 100 document corpus using the three\nlevel bars hierarchy described in Section 5 and $\\alpha$ skewed toward higher levels.\nEach document has 1000 words from a 25 term vocabulary. (b) The correct\nhierarchy found by the Gibbs sampler on this corpus.\n\nStructure        Leaf error 0   Leaf error 1   Leaf error 2   Other\n3 (7 6 5)        70%            14%            4%             12%\n4 (6 6 5 5)      48%            30%            2%             20%\n4 (6 6 6 4)      52%            36%            0%             12%\n5 (7 6 5 5 4)    30%            40%            16%            14%\n5 (6 5 5 5 4)    50%            22%            16%            12%\n\nFigure 3: Results of estimating hierarchies on simulated data. Structure refers to a three\nlevel hierarchy: the first integer is the number of branches from the root and is followed by\nthe number of children of each branch. Leaf error refers to how many leaves were\nincorrect in the resulting tree (0 is exact). Other subsumes all other errors.\n\n5 Examples and empirical results\n\nIn this section we describe a number of experiments using the models described above.\nIn all experiments, we let the sampler burn in for 10000 iterations and subsequently took\nsamples 100 iterations apart for another 1000 iterations. Local maxima can be a problem\nin the hLDA model. To avoid them, we randomly restart the sampler 25 times and take the\ntrajectory with the highest average posterior likelihood.\n\nWe illustrate that the nested CRP is feasible for learning text hierarchies in hLDA\nby using a contrived corpus on a small vocabulary. We generated a corpus of 100\n1000-word documents from a three-level hierarchy with a vocabulary of 25 terms. In this corpus,\ntopics on the vocabulary can be viewed as bars on a $5 \\times 5$ grid. The root topic places its\nprobability mass on the bottom bar. On the second level, one topic is identified with the\nleftmost bar, while the rightmost bar represents a second topic. 
The leftmost topic has two\nsubtopics while the rightmost topic has one subtopic. Figure 2a illustrates six documents\nsampled from this model. Figure 2b illustrates the recovered hierarchy using the Gibbs\nsampling algorithm described in Section 4.\n\nIn estimating hierarchy structures, hypothesis testing approaches to model selection are\nimpractical since they do not provide a viable method of searching over the large space\nof trees. To compare the CRP method on LDA models with a standard approach, we\nimplemented the simpler, flat model described at the end of Section 3. We generated 210\ncorpora of 100 1000-word documents each from an LDA model with $K \\in \\{5, \\ldots, 25\\}$,\n$L = 5$, a vocabulary size of 100, and randomly generated mixture components from a\nsymmetric Dirichlet ($\\eta = 0.1$). For comparison with the CRP prior, we used the approximate\nBayes factors method of model selection [13], where one chooses the model that\nmaximizes $p(\\mathrm{data} \\mid K) \\, p(K)$ for various K and an appropriate prior. With the LDA model, the\nBayes factors method is much slower than the CRP as it involves multiple runs of a Gibbs\nsampler with speed comparable to a single run of the CRP sampler. Furthermore, with the\nBayes factors method one must choose an appropriate range of K. With the CRP prior,\nthe only free parameter is $\\gamma$ (we used $\\gamma = 1.0$). As shown in Figure 4, the CRP prior was\nmore effective than Bayes factors in this setting. We should note that both the CRP and\nBayes factors are somewhat sensitive to the choice of $\\eta$, the hyperparameter to the prior on\nthe topics. 
However, in simulated data, this hyperparameter was known and thus we can\nprovide a fair comparison.\n\n[Figure 4 panels: CRP prior (left) and Bayes factors (right); horizontal axis: true dimension; vertical axis: found dimension.]\n\nFigure 4: (Left) The average dimension found by a CRP prior plotted against the true\ndimension on simulated data (the true value is jittered so that overlapping points are visible). For each\ndimension, we generated ten corpora with a vocabulary size of 100. Each corpus contains\n100 documents of 1000 words. (Right) Results of model selection with Bayes factors.\n\nIn a similar experiment, we generated 50 corpora each from five different hierarchies using\nan hLDA model and the same symmetric Dirichlet prior on topics. Each corpus has 100\ndocuments of 1000 words from a vocabulary of 100 terms. Figure 3 reports the results of\nsampling from the resulting posterior on trees with the Gibbs sampler from Section 4. In\nall cases, we recover the correct structure more often than any other and we usually recover a\nstructure within one leaf of the correct structure. In all experiments, no predicted structure\ndeviated by more than three nodes from the correct structure.\n\nLastly, to demonstrate its applicability to real data, we applied the hLDA model to a text\ndata set. Using 1717 NIPS abstracts from 1987\u20131999 [14] with 208,896 words and a\nvocabulary of 1600 terms, we estimated a three level hierarchy as illustrated in Figure 5. The\nmodel has nicely captured the function words without using an auxiliary list, a nuisance\nthat most practical applications of language models require. At the next level, it separated\nthe words pertaining to neuroscience abstracts and machine learning abstracts. Finally, it\ndelineated several important subtopics within the two fields. 
These results suggest that\nhLDA can be an effective tool in text applications.\n\n6 Summary\n\nWe have presented the nested Chinese restaurant process, a distribution on hierarchical\npartitions. We have shown that this process can be used as a nonparametric prior for a\nhierarchical extension to the latent Dirichlet allocation model. The result is a flexible,\ngeneral model for topic hierarchies that naturally accommodates growing data collections.\nWe have presented a Gibbs sampling procedure for this model which provides a simple\nmethod for simultaneously exploring the spaces of trees and topics.\n\nOur model has two natural extensions. First, we have restricted ourselves to hierarchies\nof fixed depth L for simplicity, but it is straightforward to consider a model in which L\ncan vary from document to document. Each document is still a mixture of topics along\na path in a hierarchy, but different documents can express paths of different lengths as\nthey represent varying levels of specialization. Second, although in our current model a\ndocument is associated with a single path, it is also natural to consider models in which\ndocuments are allowed to mix over paths. 
This would be a natural way to take advantage\nof syntactic structures such as paragraphs and sentences within a document.\n\nAcknowledgements\n\nWe wish to acknowledge support from the DARPA CALO program, Microsoft\nCorporation, and NTT Communication Science Laboratories.\n\nLevel 1: the, of, a, to, and, in, is, for\nLevel 2: neurons, visual, cells, cortex, synaptic, motion, response, processing\nLevel 2: algorithm, learning, training, method, we, new, problem, on\nLevel 3: cell, neuron, circuit, cells, input, i, figure, synapses\nLevel 3: chip, analog, vlsi, synapse, weight, digital, cmos, design\nLevel 3: recognition, speech, character, word, system, classification, characters, phonetic\nLevel 3: b, x, e, n, p, any, if, training\nLevel 3: hidden, units, layer, input, output, unit, x, vector\nLevel 3: control, reinforcement, learning, policy, state, actions, value, optimal\n\nFigure 5: A topic hierarchy estimated from 1717 abstracts from NIPS01 through NIPS12.\nEach node contains the top eight words from its corresponding topic distribution.\n\nReferences\n\n[1] D. Aldous. Exchangeability and related topics. In \u00c9cole d\u2019\u00e9t\u00e9 de probabilit\u00e9s de Saint-Flour,\nXIII\u20141983, pages 1\u2013198. Springer, Berlin, 1985.\n\n[2] E. Segal, D. Koller, and D. Ormoneit. Probabilistic abstraction hierarchies. In Advances in\nNeural Information Processing Systems 14.\n\n[3] T. Hofmann. The cluster-abstraction model: Unsupervised learning of topic hierarchies from\ntext data. In IJCAI, pages 682\u2013687, 1999.\n\n[4] T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics,\n1:209\u2013230, 1973.\n\n[5] J. Pitman. Combinatorial Stochastic Processes. Notes for St. Flour Summer School. 2002.\n\n[6] J. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species\nsampling mixture models. Statistica Sinica, 13:1211\u20131235, 2003.\n\n[7] R. Neal. 
Markov chain sampling methods for Dirichlet process mixture models. Journal of\nComputational and Graphical Statistics, 9(2):249\u2013265, June 2000.\n\n[8] M. West, P. Muller, and M. Escobar. Hierarchical priors and mixture models, with application\nin regression and density estimation. In Aspects of Uncertainty. John Wiley.\n\n[9] M. Beal, Z. Ghahramani, and C. Rasmussen. The infinite hidden Markov model. In Advances\nin Neural Information Processing Systems 14.\n\n[10] C. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances\nin Neural Information Processing Systems 14.\n\n[11] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\nResearch, 3:993\u20131022, January 2003.\n\n[12] T. Griffiths and M. Steyvers. A probabilistic approach to semantic representation. In\nProceedings of the 24th Annual Conference of the Cognitive Science Society, 2002.\n\n[13] R. Kass and A. Raftery. Bayes factors. Journal of the American Statistical Association,\n90(430):773\u2013795, 1995.\n\n[14] S. Roweis. NIPS abstracts, 1987\u20131999. http://www.cs.toronto.edu/ roweis/data.html.", "award": [], "sourceid": 2466, "authors": [{"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}