{"title": "Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": null, "full_text": "Sharing Clusters Among Related Groups:\n\nHierarchical Dirichlet Processes\n\nYee Whye Teh(1), Michael I. Jordan(1,2), Matthew J. Beal(3) and David M. Blei(1)\n(3)Dept. of Computer Science\n\n(1)Computer Science Div., (2)Dept. of Statistics\n\nUniversity of California at Berkeley\n\nBerkeley CA 94720, USA\n\n{ywteh,jordan,blei}@cs.berkeley.edu\n\nUniversity of Toronto\n\nToronto M5S 3G4, Canada\nbeal@cs.toronto.edu\n\nAbstract\n\nWe propose the hierarchical Dirichlet process (HDP), a nonparametric\nBayesian model for clustering problems involving multiple groups of\ndata. Each group of data is modeled with a mixture, with the number of\ncomponents being open-ended and inferred automatically by the model.\nFurther, components can be shared across groups, allowing dependencies\nacross groups to be modeled effectively as well as conferring generaliza-\ntion to new groups. Such grouped clustering problems occur often in\npractice, e.g. in the problem of topic discovery in document corpora. We\nreport experimental results on three text corpora showing the effective\nand superior performance of the HDP over previous models.\n\n1 Introduction\n\nOne of the most signi\ufb01cant conceptual and practical tools in the Bayesian paradigm is\nthe notion of a hierarchical model. 
Building on the notion that a parameter is a random\nvariable, hierarchical models have applications to a variety of forms of grouped or relational\ndata and to general problems involving \u201cmulti-task learning\u201d or \u201clearning to learn.\u201d A\nsimple and classical example is the Gaussian means problem, in which a grand mean \u00b50\nis drawn from some distribution, a set of K means are then drawn independently from a\nGaussian with mean \u00b50, and data are subsequently drawn independently from K Gaussian\ndistributions with these means. The posterior distribution based on these data couples the\nmeans, such that posterior estimates of the means are shrunk towards each other. The\nestimates \u201cshare statistical strength,\u201d a notion that can be made precise within both the\nBayesian and the frequentist paradigms.\n\nHere we consider the application of hierarchical Bayesian ideas to a problem in \u201cmulti-task\nlearning\u201d in which the \u201ctasks\u201d are clustering problems, and our goal is to share clusters\namong multiple, related clustering problems. We are motivated by the task of discovering\ntopics in document corpora [1]. A topic (i.e., a cluster) is a distribution across words while\ndocuments are viewed as distributions across topics. We want to discover topics that are\ncommon across multiple documents in the same corpus, as well as across multiple corpora.\n\nOur work is based on a tool from nonparametric Bayesian analysis known as the Dirichlet\nprocess (DP) mixture model [2, 3]. Skirting technical de\ufb01nitions for now, \u201cnonparametric\u201d\n\n\fcan be understood simply as implying that the number of clusters is open-ended. Indeed,\nat each step of generating data points, a DP mixture model can either assign the data point\nto a previously-generated cluster or can start a new cluster. 
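To make this assign-or-start-new mechanism concrete, it can be simulated directly, Chinese-restaurant style. The sketch below is illustrative only (the concentration parameter alpha0 and the seed are arbitrary choices, not values from the paper): each point joins an existing cluster with probability proportional to its size, or starts a new cluster with probability proportional to alpha0.

```python
import random

def crp_assignments(n, alpha0, seed=0):
    """Sequentially assign n points to clusters, DP/CRP style:
    join an existing cluster with prob. proportional to its size,
    or start a new cluster with prob. proportional to alpha0."""
    rng = random.Random(seed)
    counts = []   # counts[k] = number of points currently in cluster k
    labels = []   # labels[i] = cluster index of point i
    for i in range(n):
        # total unnormalized mass: i points seated so far, plus alpha0 for a new cluster
        r = rng.uniform(0.0, i + alpha0)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            counts.append(1)              # start a new cluster
            labels.append(len(counts) - 1)
    return labels, counts

labels, counts = crp_assignments(1000, alpha0=1.0)
# the number of occupied clusters grows slowly, roughly like alpha0 * log(n)
print(len(counts))
```

Running this for growing n illustrates the slow, roughly logarithmic growth in the number of occupied clusters that the next sentence describes.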
The number of clusters is a\nrandom variable whose mean grows at rate logarithmic in the number of data points.\n\nExtending the DP mixture model framework to the setting of multiple related clustering\nproblems, we will be able to make the (realistic) assumption that we do not know the\nnumber of clusters a priori in any of the problems, nor do we know how clusters should be\nshared among the problems.\n\nWhen generating a new cluster, a DP mixture model selects the parameters for the cluster\n(e.g., in the case of Gaussian mixtures, the mean and covariance matrix) from a distribution\nG0\u2014the base distribution. So as to allow any possible parameter value, the distribution\nG0 is often assumed to be a smooth distribution (i.e., non-atomic). Unfortunately, if we\nnow wish to extend DP mixtures to groups of clustering problems, the assumption that G0\nis smooth con\ufb02icts with the goal of sharing clusters among groups. That is, even if each\ngroup shares the same underlying base distribution G0, the smoothness of G0 implies that\nthey will generate distinct cluster parameters (with probability one). We will show that this\nproblem can be resolved by taking a hierarchical Bayesian approach. We present a notion\nof a hierarchical Dirichlet process (HDP) in which the base distribution G0 for a set of DPs\nis itself a draw from a DP. This turns out to provide an elegant and simple solution to the\nproblem of sharing clusters among multiple clustering problems.\n\nThe paper is organized as follows. In Section 2, we provide the basic technical de\ufb01nition\nof DPs and discuss related representations involving stick-breaking processes and Chinese\nrestaurant processes. Section 3 then introduces the HDP, motivated by the requirement\nof a more powerful formalism for the grouped data setting. As for the DP, we present\nanalogous stick-breaking and Chinese restaurant representations for the HDP. 
We present empirical results on a number of text corpora in Section 5, demonstrating various aspects of the HDP including its nonparametric nature, hierarchical nature, and the ease with which the framework can be applied to other realms such as hidden Markov models.

2 Dirichlet Processes

The Dirichlet process (DP) and the DP mixture model are mainstays of nonparametric Bayesian statistics (see, e.g., [3]). They have also begun to be seen in applications in machine learning (e.g., [7, 8, 9]). In this section we give a brief overview with an eye towards generalization to HDPs. We begin with the definition of DPs [4]. Let (Θ, B) be a measurable space, with G0 a probability measure on the space, and let α0 be a positive real number. A Dirichlet process is the distribution of a random probability measure G over (Θ, B) such that, for any finite partition (A1, . . . , Ar) of Θ, the random vector (G(A1), . . . , G(Ar)) is distributed as a finite-dimensional Dirichlet distribution:

(G(A1), . . . , G(Ar)) ∼ Dir(α0 G0(A1), . . . , α0 G0(Ar)).  (1)

We write G ∼ DP(α0, G0) if G is a random probability measure distributed according to a DP. We call G0 the base measure of G, and α0 the concentration parameter.

The DP can be used in the mixture model setting in the following way. Consider a set of data, x = (x1, . . . , xn), assumed exchangeable. Given a draw G ∼ DP(α0, G0), independently draw n latent factors from G: φi ∼ G. Then, for each i = 1, . . . , n, draw xi ∼ F(φi), for a distribution F. This setup is referred to as a DP mixture model. If the factors φi were all distinct, then this setup would yield an (uninteresting) mixture model with n components. In fact, the DP exhibits an important clustering property, such that the draws φi are generally not distinct. 
Rather, the number of distinct values grows as O(log n), and it is this that defines the random number of mixture components.

There are several perspectives on the DP that help to understand this clustering property. In this paper we will refer to two: the Chinese restaurant process (CRP), and the stick-breaking process. The CRP is a distribution on partitions that directly captures the clustering of draws from a DP via a metaphor in which customers share tables in a Chinese restaurant [5]. As we will see in Section 4, the CRP refers to properties of the joint distribution of the factors {φi}. The stick-breaking process, on the other hand, refers to properties of G, and directly reveals its discrete nature [6]. For k = 1, 2, . . ., let:

θk ∼ G0        β′k ∼ Beta(1, α0)        βk = β′k ∏_{l=1}^{k−1} (1 − β′l).  (2)

Then with probability one the random measure defined by G = Σ_{k=1}^∞ βk δθk is a sample from DP(α0, G0). The construction for β1, β2, . . . in (2) can be understood as taking a stick of unit length, and repeatedly breaking off segments of length βk. The stick-breaking construction shows that DP mixture models can be viewed as mixture models with a countably infinite number of components. To see this, identify each θk as the parameter of the kth mixture component, with mixing proportion given by βk.

3 Hierarchical Dirichlet Processes

We will introduce the hierarchical Dirichlet process (HDP) in this section. First we describe the general setting in which the HDP is most useful, that of grouped data. We assume that we have J groups of data, each consisting of nj data points (xj1, . . . , xjnj). We assume that the data points in each group are exchangeable, and are to be modeled with a mixture model. 
While each mixture model has mixing proportions specific to the group, we require that the different groups share the same set of mixture components. The idea is that while different groups have different characteristics given by a different combination of mixing proportions, using the same set of mixture components allows statistical strength to be shared across groups, and allows generalization to new groups.

The HDP is a nonparametric prior which allows the mixture models to share components. It is a distribution over a set of random probability measures over (Θ, B): one probability measure Gj for each group j, and a global probability measure G0. The global measure G0 is distributed as DP(γ, H), with H the base measure and γ the concentration parameter, while each Gj is conditionally independent given G0, with distribution Gj ∼ DP(α0, G0). To complete the description of the HDP mixture model, we associate each xji with a factor φji, with distributions given by F(φji) and Gj respectively. The overall model is given in Figure 1 left, with conditional distributions:

G0 | γ, H ∼ DP(γ, H)        Gj | α0, G0 ∼ DP(α0, G0)  (3)
φji | Gj ∼ Gj        xji | φji ∼ F(φji).  (4)

The stick-breaking construction (2) shows that a draw of G0 can be expressed as a weighted sum of point masses: G0 = Σ_{k=1}^∞ βk δθk. The fact that G0 is atomic plays an important role in ensuring that mixture components are shared across different groups. Since G0 is the base distribution for the individual Gj's, (2) again shows that the atoms of the individual Gj are samples from G0. 
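To make the role of atomicity concrete, here is a small truncated numerical sketch (the truncation level, parameter values, and helper names such as stick_breaking and draw_from_G0 are all illustrative, not from the paper): G0 is built by stick-breaking over a finite set of atoms, and each group's measure then draws its atoms from the discrete G0, so all groups place their mass on the same finite set of θk's.

```python
import random

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights: beta_k = b_k * prod_{l<k}(1 - b_l)."""
    weights, remaining = [], 1.0
    for _ in range(K):
        b = rng.betavariate(1.0, alpha)
        weights.append(remaining * b)
        remaining *= 1.0 - b
    weights.append(remaining)  # leftover stick mass, so the weights sum to 1
    return weights

rng = random.Random(1)
K = 20
# G0 = sum_k beta_k * delta_{theta_k}, with atoms theta_k drawn from a (here Gaussian) base H
theta = [rng.gauss(0.0, 1.0) for _ in range(K + 1)]
beta = stick_breaking(alpha=5.0, K=K, rng=rng)

def draw_from_G0():
    # G0 is discrete, so repeated draws from it land on the same atoms theta_k
    return rng.choices(theta, weights=beta, k=1)[0]

# each group's G_j puts its mass on atoms re-drawn from the discrete G0
group_atoms = [[draw_from_G0() for _ in range(10)] for _ in range(3)]
# atoms coincide across groups whenever the same theta_k is picked
shared = set(group_atoms[0]) & set(group_atoms[1])
```

Had G0 been smooth instead of atomic, the groups' atoms would have been distinct with probability one, which is exactly the problem the hierarchy resolves.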
In particular, since G0 places non-zero mass only on the atoms θ = (θk)_{k=1}^∞, the atoms of Gj must also come from θ, hence we may write:

G0 = Σ_{k=1}^∞ βk δθk        Gj = Σ_{k=1}^∞ πjk δθk.  (5)

Identifying θk as the parameters of the kth mixture component, we see that the submodels corresponding to distinct groups share the same set of mixture components, but have differing mixing proportions, πj = (πjk)_{k=1}^∞.

Finally, it is useful to explicitly describe the relationship between the mixing proportions β and (πj)_{j=1}^J. Details are provided in [10]. Note that the weights πj are conditionally independent given β since each Gj is independent given G0. Applying (1) to finite partitions

Figure 1: Left: graphical model of an example HDP mixture model with 3 groups. Corresponding to each DP node we also plot a sample draw from the DP using the stick-breaking construction. Right: an instantiation of the CRF representation for the 3 group HDP. 
Each of the 3 restaurants has customers sitting around tables, and each table is served a dish (which corresponds to a customer in the Chinese restaurant for the global DP).

of θ, we get πj ∼ DP(α0, β), where we interpret β and πj as probability measures over the positive integers. Hence β is simply the putative mixing proportion over the groups. We may in fact obtain an explicit stick-breaking construction for the πj's as well. Applying (1) to partitions ({1, . . . , k − 1}, {k}, {k + 1, . . .}) of the positive integers, we have:

π′jk ∼ Beta(α0 βk, α0 (1 − Σ_{l=1}^k βl))        πjk = π′jk ∏_{l=1}^{k−1} (1 − π′jl).  (6)

4 The Chinese Restaurant Franchise

We describe an alternative view of the HDP based directly upon the distribution a HDP induces on the samples φji, where we marginalize out G0 and the Gj's. This view directly leads to an efficient Gibbs sampler for HDP mixture models, which is detailed in the appendix.

Consider, for one group j, the distribution of φj1, . . . , φjnj as we marginalize out Gj. Recall that since Gj ∼ DP(α0, G0) we can describe this distribution by describing how to generate φj1, . . . , φjnj using the CRP. Imagine nj customers (each corresponding to a φji) at a Chinese restaurant with an unbounded number of tables. The first customer sits at the first table. A subsequent customer sits at an occupied table with probability proportional to the number of customers already there, or at the next unoccupied table with probability proportional to α0. Suppose customer i sat at table tji. The conditional distributions are:

tji | tj1, . . . , tji−1, α0 ∼ Σ_t njt/(Σ_{t′} njt′ + α0) δt + α0/(Σ_{t′} njt′ + α0) δtnew,  (7)

where njt is the number of customers currently at table t. Once all customers have sat down, the seating plan corresponds to a partition of φj1, . . . , φjnj. This is an exchangeable process, in that the probability of a partition does not depend on the order in which customers sit down. Now we associate with table t a draw ψjt from G0, and assign φji = ψjtji.

Performing this process independently for each group j, we have now integrated out all the Gj's, and have an assignment of each φji to a sample ψjtji from G0, with the partition structures given by CRPs. Notice now that all ψjt's are simply i.i.d. draws from G0, which is again distributed according to DP(γ, H), so we may apply the same CRP partitioning process to the ψjt's. Let the customer associated with ψjt sit at table kjt. We have:

kjt | k11, . . . , k1n1, k21, . . . , kjt−1, γ ∼ Σ_k mk/(Σ_{k′} mk′ + γ) δk + γ/(Σ_{k′} mk′ + γ) δknew.  (8)

Figure 2: Left: comparison of LDA and HDP mixture (perplexity on test abstracts against the number of LDA topics). Results are averaged over 10 runs, with error bars being 1 standard error. Right: histogram of the number of topics the HDP mixture used over 100 posterior samples.

Finally we associate with table k a draw θk from H and assign ψjt = θkjt. 
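As a sketch, this two-level seating scheme can be simulated directly; the code below is illustrative (the parameter values, the base draw, and helper names like crf_sample and pick are ours, not the paper's), seating customers at tables per (7) and tables at franchise-wide dishes per (8).

```python
import random

def pick(counts, new_mass, rng):
    """Return an index with prob. proportional to counts[idx],
    or len(counts) (i.e. 'new') with prob. proportional to new_mass."""
    r = rng.uniform(0.0, sum(counts) + new_mass)
    acc = 0.0
    for idx, c in enumerate(counts):
        acc += c
        if r < acc:
            return idx
    return len(counts)

def crf_sample(group_sizes, alpha0, gamma, base_draw, rng):
    """Chinese restaurant franchise: per-restaurant CRPs over tables,
    plus one franchise-wide CRP assigning dishes theta_k to tables."""
    dish_counts = []   # m_k: number of tables (across all groups) serving dish k
    dishes = []        # theta_k, drawn from the base measure H
    factors = []       # factors[j][i] = phi_ji
    for n_j in group_sizes:
        table_counts, table_dish, phis = [], [], []
        for _ in range(n_j):
            # seat customer: existing table prop. to n_jt, new table prop. to alpha0  (eq. 7)
            t = pick(table_counts, alpha0, rng)
            if t == len(table_counts):              # new table: choose its dish  (eq. 8)
                k = pick(dish_counts, gamma, rng)
                if k == len(dish_counts):
                    dish_counts.append(0)
                    dishes.append(base_draw(rng))   # theta_knew ~ H
                dish_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            phis.append(dishes[table_dish[t]])      # phi_ji = psi_{j,t_ji} = theta_{k_{j,t_ji}}
        factors.append(phis)
    return factors, dishes

rng = random.Random(0)
factors, dishes = crf_sample([30, 30, 30], alpha0=1.0, gamma=1.0,
                             base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng)
```

Every φji produced this way equals one of the shared dishes θk, which is exactly the component-sharing behavior the HDP is designed to give.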
This completes the generative process for the φji's, where we marginalize out G0 and the Gj's. We call this generative process the Chinese restaurant franchise (CRF). The metaphor is as follows: we have J restaurants, each with nj customers (the φji's), who sit at tables (the ψjt's). Each table is served a dish (a θk) from a menu common to all restaurants. The customers are sociable, preferring large tables with many customers present, and they also prefer popular dishes.

5 Experiments

We describe 3 experiments in this section to highlight the various aspects of the HDP: its nonparametric nature; its hierarchical nature; and the ease with which we can apply the framework to other models, specifically the HMM.

Nematode biology abstracts. To demonstrate the strength of the nonparametric approach as exemplified by the HDP mixture, we compared it against latent Dirichlet allocation (LDA), which is a parametric model similar in structure to the HDP [1]. In particular, we applied both models to a corpus of nematode biology abstracts1, evaluating the perplexity of both models on held-out abstracts. Here abstracts correspond to groups, words correspond to observations, and topics correspond to mixture components; exchangeability corresponds to the typical bag-of-words assumption. In order to study specifically the nonparametric nature of the HDP, we used the same experimental setup for both models2, except that in LDA we had to vary the number of topics used between 10 and 120, while the HDP obtained posterior samples over this automatically.

The results are shown in Figure 2. LDA performs best using between 50 and 80 topics, and the HDP performed just as well as these. Further, the posterior over the number of topics used by the HDP is consistent with this range. 
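For reference, the perplexity reported in these experiments is the standard exponentiated negative mean per-word log-likelihood on held-out text; a minimal sketch with made-up probabilities:

```python
import math

def perplexity(word_log_probs):
    """exp of the negative average log-probability per held-out word."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# toy example: per-word predictive log-probabilities from some model
logps = [math.log(p) for p in [0.01, 0.02, 0.005, 0.01]]
print(round(perplexity(logps), 2))  # → 100.0 (inverse geometric mean of the probabilities)
```

Lower perplexity means the model assigns higher probability to the held-out words.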
Notice however that the HDP infers the number of topics automatically, while LDA requires some method of model selection.

NIPS sections. We applied HDP mixture models to a dataset of NIPS 1-12 papers organized into sections3. To highlight the transfer of learning achievable with the HDP, we

1Available at http://elegans.swmed.edu/wli/cgcbib. There are 5838 abstracts in total. After removing standard stop words and words appearing fewer than 10 times, we are left with 476441 words in total and a vocabulary size of 5699.

2In both models, we used a symmetric Dirichlet distribution with weights of 0.5 for the prior H over topic distributions, while the concentration parameters are integrated out using a vague gamma prior. Gibbs sampling using the CRF is used, while the concentration parameters are sampled using a method described in [10]. This also applies to the NIPS sections experiment below.

3To ensure we are dealing with informative words in the documents, we culled stop words as well

Figure 3: Left: perplexity of test VS documents given training documents from VS and another section, for 3 different models (M1: additional section ignored; M2: flat, with the additional section; M3: hierarchical, with the additional section). Curves shown are averaged over the other sections and 5 runs. Right: perplexity of test VS documents given LT, AA and AP documents respectively, using M3, averaged over 5 runs. 
In both, the error bars are 1 standard error.

show improvements to the modeling of a section when the model is also given documents from another section. Our test section is always the VS (vision sciences) section, while the additional section is varied across the other eight. The training set always consists of 80 documents from the other section (so that larger sections like AA (algorithms and architectures) do not get an unfair advantage), plus between 0 and 80 documents from VS. There are 47 test documents, which are held fixed as we vary the other section and the number N of training VS documents. We compared 3 different models for this task. The first model (M1) simply ignores documents from the additional section, and uses a HDP to model the VS documents; it serves as a baseline. The second model (M2) uses a HDP mixture model, with one group per document, but lumps together training documents from both sections. The third model (M3) takes a hierarchical approach: it models each section separately using a HDP mixture model, and places another DP prior over the common base distributions for both submodels4.

As we see in Figure 3 left, the more hierarchical approach of M3 performs best, with perplexity decreasing drastically for modest values of N, while M1 does worst for small N. However, with increasing N, M1 improves until it is competitive with M3, but M2 does worst. This is because M2 lumps all the documents together and so is not able to differentiate between the sections; as a result, the influence of documents from the other section is unduly strong. 
This result confirms that the hierarchical approach to the transfer-of-learning problem is a useful one, as it allows useful information to be transferred to a new task (here the modeling of a new section) without the data from the previous tasks overwhelming those in the new task.

We also looked at the performance of the M3 model on VS documents given specific other sections. This is shown in Figure 3 right. As expected, the performance is worst given LT (learning theory), and improves as we move to AA and AP (applications). In Table 1 we show the topics pertinent to VS discovered by the M3 model. First we trained the model on all documents from the other section. Then, keeping the assignments of words to topics fixed in the other section, we introduced VS documents and the model decided to reuse some topics from the other section, as well as create new ones. The topics reused by VS documents conform to our expectations of the overlap between VS and the other sections.

as words occurring more than 4000 or fewer than 50 times in the documents. As sections differ over the years, we assigned by hand the various sections to one of 9 prototypical sections: CS, NS, LT, AA, IM, SP, VS, AP and CN.

4Though we have only described the 2-layer HDP, the 3-layer extension is straightforward. 
In fact, on our website http://www.cs.berkeley.edu/˜ywteh/research/npbayes we have an implementation of the general case where DPs are coupled hierarchically in a tree-structured model.

Table 1: Topics shared between VS and the other sections (CS, NS, LT, AA, IM, SP, AP and CN). Shown for each section are the two topics with the most VS words that also contain significant numbers of words from the other section.

Alice in Wonderland. 
The infinite hidden Markov model (iHMM) is a nonparametric model for sequential data where the number of hidden states is open-ended and inferred from data [11]. In [10] we show that the HDP framework can be applied to obtain a cleaner formulation of the iHMM, providing effective new inference algorithms and potentially hierarchical extensions. In fact the original iHMM paper [11] served as inspiration for this work and first coined the term "hierarchical Dirichlet processes", though their model is not hierarchical in the Bayesian sense, involving priors upon priors, but is rather a set of coupled urn models similar to the CRF. Here we report experimental comparisons of the iHMM against other approaches on sentences taken from Lewis Carroll's Alice's Adventures in Wonderland.

ML, MAP, and variational Bayesian (VB) [12] models with numbers of states ranging from 1 to 30 were trained multiple times on 20 sentences of average length 51 symbols (27 distinct symbols, consisting of 26 letters and ' '), and tested on 40 sequences of average length 100. Figure 4 shows the perplexity of test sentences. For VB, the predictive probability is intractable to compute, so the modal setting of parameters was used. Both MAP and VB models were given the optimal settings of the hyperparameters found by the iHMM. We see that the iHMM has a lower perplexity than every model size for ML, MAP, and VB, and obtains this with a single, countably infinite model.

Figure 4: Comparing the iHMM (horizontal line) versus ML, MAP and VB trained HMMs on test sentences of Alice, as the number of hidden states varies. Error bars are 1 standard error (those for the iHMM are too small to see).

6 Discussion

We have described the hierarchical Dirichlet process, a hierarchical, nonparametric model for clustering problems involving multiple groups of data. HDP mixture models are able to automatically determine the appropriate number of mixture components needed, and exhibit sharing of statistical strength across groups by having components shared across groups. We have described the HDP as a distribution over distributions, using both the stick-breaking construction and the Chinese restaurant franchise. In [10] we also describe a fourth perspective based on the infinite limit of finite mixture models, and give details for how the HDP can be applied to the iHMM. Direct extensions of the model include the use of nonparametric priors other than the DP, building higher-level hierarchies as in our NIPS experiment, as well as hierarchical extensions to the iHMM.

Appendix: Gibbs Sampling in the CRF

The CRF is defined by the variables t = (tji), k = (kjt), and θ = (θk). We describe an inference procedure for the HDP mixture model based on Gibbs sampling t, k and θ given the data items x. For the full derivation see [10]. Let f(·|θ) and h be the density functions for F(θ) and H respectively, let n^{−i}_{jt} be the number of tji′'s equal to t excluding tji, and let m^{−jt}_k be the number of kj′t′'s equal to k excluding kjt. The conditional probability for tji given the other variables is proportional to the product of a prior and a likelihood term. The prior term is given by (7) where, by exchangeability, we can take tji to be the last one assigned. The likelihood is given by f(xji|θkjt), where for t = tnew we may sample kjtnew using (8), and θknew ∼ H. 
The distribution is then:

p(tji = t | t\tji, k, θ, x) ∝ { α0 f(xji|θkjt)            if t = tnew
                               n^{−i}_{jt} f(xji|θkjt)    if t is currently used.  (9)

Similarly, the conditional distributions for kjt and θk are:

p(kjt = k | t, k\kjt, θ, x) ∝ { γ ∏_{i: tji = t} f(xji|θk)            if k = knew
                               m^{−jt}_k ∏_{i: tji = t} f(xji|θk)    if k is currently used.  (10)

p(θk | t, k, θ\θk, x) ∝ h(θk) ∏_{ji: kjtji = k} f(xji|θk)  (11)

where θknew ∼ H. If H is conjugate to F(·) we have the option of integrating out θ.

References
[1] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[2] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.
[3] S.N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7:223–238, 1998.
[4] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.
[5] D. Aldous. Exchangeability and related topics. In École d'été de probabilités de Saint-Flour XIII–1983, pages 1–198. Springer, Berlin, 1985.
[6] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.
[7] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[8] C.E. Rasmussen. The infinite Gaussian mixture model. In NIPS, volume 12, 2000.
[9] D.M. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2004.
[10] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California at Berkeley, 2004.
[11] M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. The infinite hidden Markov model. In NIPS, volume 14, 2002.
[12] M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, University College London, 2004.
", "award": [], "sourceid": 2698, "authors": [{"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Matthew", "family_name": "Beal", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}