{"title": "The Infinite Markov Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1024, "abstract": null, "full_text": "The Infinite Markov Model\n\nDaichi Mochihashi \u2217\n\nNTT Communication Science Laboratories\n\nHikaridai 2-4, Keihanna Science City\n\nKyoto, Japan 619-0237\n\nEiichiro Sumita\n\nATR / NICT\n\nHikaridai 2-2, Keihanna Science City\n\nKyoto, Japan 619-0288\n\ndaichi@cslab.kecl.ntt.co.jp\n\neiichiro.sumita@atr.jp\n\nAbstract\n\nWe present a nonparametric Bayesian method of estimating variable order Markov\nprocesses up to a theoretically infinite order. By extending a stick-breaking prior,\nwhich is usually defined on a unit interval, \u201cvertically\u201d to the trees of infinite depth\nassociated with a hierarchical Chinese restaurant process, our model directly infers\nthe hidden orders of Markov dependencies from which each symbol originated.\nExperiments on character and word sequences in natural language showed that\nthe model performs comparably to an exponentially large full-order\nmodel, while being computationally much more efficient in both time and space. We expect\nthat this basic model will also extend to the variable order hierarchical clustering\nof general data.\n\n1 Introduction\n\nSince the pioneering work of Shannon [1], Markov models have not only been taught in elementary\ninformation theory classes, but also served as indispensable tools and building blocks for sequence\nmodeling in many fields, including natural language processing, bioinformatics [2], and\ncompression [3]. In particular, (n\u22121)th order Markov models over words are called \u201cn-gram\u201d language\nmodels and play a key role in speech recognition and machine translation, where they are used to choose the\nmost natural sentence among candidate transcriptions [4].\n\nDespite its mathematical simplicity, an inherent problem with a Markov model is that we must\ndetermine its order. 
Because higher-order Markov models have an exponentially large number of\nparameters, their orders have been restricted to a small, often fixed number. In fact, for \u201cn-gram\u201d\nmodels the assumed word dependency n is usually set between three and five due to the high\ndimensionality of the lexicon. However, word dependencies will often have a span of greater than n for\nphrasal expressions or compound proper nouns, while a much shorter n will suffice for some\ngrammatical relationships. Similarly, DNA or amino acid sequences might have originated from multiple\ntemporal scales that are unknown to us.\n\nTo alleviate this problem, many \u201cvariable-order\u201d Markov models have been proposed [2, 5, 6, 7].\nHowever, all stemming from [5] and [7], they are based on pruning a huge candidate suffix tree by\nemploying such criteria as KL-divergences. This kind of \u201cpost-hoc\u201d approach suffers from several\nimportant limitations: First, when we want to consider deeper dependencies, the candidate tree to be\npruned will be extremely large. This is especially prohibitive when the lexicon size is large as with\nlanguage. Second, the criteria and threshold for pruning the tree are inherently exogenous and must\nbe set carefully so that they match the desired model and current data. Third, pruning by empirical\ncounts in advance, which is often used to build \u201carbitrary order\u201d candidate trees in these approaches,\nis shown to behave very badly [8] and has no theoretical justification.\n\nIn contrast, in this paper we propose a complete generative model of variable-order Markov\nprocesses up to a theoretically infinite order. 
By extending a stick-breaking prior, which is usually\ndefined on a unit interval, \u201cvertically\u201d to the trees of infinite depth associated with a hierarchical\nChinese restaurant process, our model directly infers the hidden orders of Markov dependencies\nfrom which each symbol originated. We show this is possible with a small change to the inference\nof the hierarchical Pitman-Yor process in discrete cases, and that this change actually makes inference more efficient in both\ncomputational time and space.\n\n[Figure 1 (image not reproduced): two suffix trees rooted at the empty context \u03b5 (depth 0). Depth-1 nodes are \u201cwill\u201d, \u201cof\u201d and \u201cand\u201d; depth-2 nodes include \u201cshe will\u201d, \u201che will\u201d, \u201cstates of\u201d and \u201cbread and\u201d. Panel (b) continues to deeper nodes such as \u201cunited states of\u201d and \u201cthe united states of\u201d. Words such as sing, like, cry, america and butter sit at the nodes as customers; a legend distinguishes customers from proxy customers.]\n\n(a) Suffix tree representation of the hierarchical Chinese\nrestaurant process on a second-order Markov model. Each\ncount is a customer in this suffix tree.\n\n(b) Infinite suffix tree of the proposed model.\nDeploying customers at suitable depths, i.e.\nMarkov orders, is our inference problem.\n\nFigure 1: Hierarchical Chinese restaurant processes over the finite and infinite suffix trees.\n\n\u2217This research was conducted while the first author was affiliated with ATR/NICT.\n\n
Furthermore, we extend the variable order model with latent topics to show\nthat we can induce variable length \u201cstochastic phrases\u201d topic by topic.\n\n2 Suffix Trees on Hierarchical Chinese Restaurant Processes\n\nThe main obstacle that has prevented consistent approaches to variable order Markov models is\nthe lack of a hierarchical generative model of Markov processes that allows estimating\nincreasingly sparse distributions as its order gets larger. However, now that we have the hierarchical (Poisson-)\nDirichlet process that can be used as a fixed order language model [9][10], it is natural for us to\nextend these models to variable orders also by using a nonparametric approach. While we\nconcentrate here on discrete distributions, the same basic approach can be applied to a Markov process on\ncontinuous distributions, such as Gaussians that inherit their means from their parent distributions.\nFor concreteness below we use a language model example, but the same model can be applied to\nany discrete sequences, such as characters, DNA, or even binary streams for compression.\n\nConsider a trigram language model, which is a second-order Markov model over words often\nemployed in speech recognition. Following [9], this Markov model can be represented by a suffix tree\nof depth two, as shown in Figure 1(a). When we predict a word \u201csing\u201d after a context \u201cshe will\u201d,\nwe descend this suffix tree from the root (which corresponds to null string context), using the\ncontext backwards to follow a branch \u201cwill\u201d and then \u201cshe will\u201d.1 Now we arrive at the leaf node that\nrepresents the context, and we can predict \u201csing\u201d by using the count distribution at this node.\n\nDuring the learning phase, we begin with a suffix tree that has no counts. 
Every time a three word\nsequence appears in the training data, such as \u201cshe will sing\u201d mentioned above, we add a count of the\nfinal word (\u201csing\u201d) given the context (\u201cshe will\u201d) to the context node in the suffix tree. In fact this\ncorresponds to a hierarchical Chinese restaurant process, where each context node is a restaurant\nand each count is a customer associated with a word. Here each node, i.e. restaurant, might not\nhave customers for all the words in the lexicon. Therefore, when a customer arrives at a node and\nstochastically needs a new table to sit down, a copy of him, namely a proxy customer, is sent to its\nparent node. When a node has no customer to compute the probability of some word, it uses the\ndistribution of customers at the parent node and appropriately interpolates it to sum to 1.\n\nAssume that the node \u201cshe will\u201d does not have a customer of \u201clike.\u201d We can nevertheless compute\nthe probability of \u201clike\u201d given \u201cshe will\u201d if its sibling \u201che will\u201d has a customer \u201clike\u201d. Because that\nsibling has sent a copy of the customer to the common parent \u201cwill\u201d, the probability is computed by\nappropriately interpolating the trigram probability given \u201cshe will\u201d, which is zero, with the bigram\nprobability given \u201cwill\u201d, which is not zero at the parent node.\n\n1 This is the leftmost path in Figure 1(a). When there is no corresponding branch, we will create it.\n\n[Figure 2 (image not reproduced): a path descending through nodes i, j, k, with the edges below them labeled 1 \u2212 q_i, 1 \u2212 q_j, 1 \u2212 q_k.]\n\nFigure 2: Probabilistic suffix tree of an infinite depth. 
(1 \u2212 q_i) is a \u201cpenetration probability\u201d of a\ndescending customer at each node i, defining a stick-breaking process over the infinite tree.\n\nConsequently, in the hierarchical Pitman-Yor language model (HPYLM), the predictive probability\nof a symbol s = s_t in context h = s_{t\u2212n} \u00b7\u00b7\u00b7 s_{t\u22121} is recursively computed by\n\n$$p(s|h) = \\frac{c(s|h) - d \\cdot t_{hs}}{\\theta + c(h)} + \\frac{\\theta + d \\cdot t_{h\\cdot}}{\\theta + c(h)} \\, p(s|h'), \\qquad (1)$$\n\nwhere h\u2032 = s_{t\u2212n+1} \u00b7\u00b7\u00b7 s_{t\u22121} is a shortened context with the farthest symbol dropped. c(s|h) is the\ncount of s at node h, and c(h) = \u2211_s c(s|h) is the total count at node h. t_{hs} is the number of times\nsymbol s is estimated to be generated from its parent distribution p(s|h\u2032) rather than p(s|h) in the\ntraining data, and t_{h\u00b7} = \u2211_s t_{hs} is its total. \u03b8 and d are the parameters of the Pitman-Yor process, and\ncan be estimated through the distribution of customers on a suffix tree by Gamma and Beta posterior\ndistributions, respectively. For details, see [9].\n\nAlthough this Bayesian Markov model is very principled and attractive, we can see from Figure 1(a)\nthat all the real customers (i.e., counts) are fixed at the depth (n\u22121) in the suffix tree. Because actual\nsequences will have heterogeneous Markov dependencies, we want a Markov model that deploys\ncustomers at different levels in the suffix tree according to the true Markov order from which each\ncustomer originated. But how can we model such a heterogeneous property of Markov sequences?\n\n3 Infinite-order Hierarchical Chinese Restaurant Processes\n\nIntuitively, we know that suffix trees that are too deep are improbable and symbol dependencies\ndecay largely exponentially with context lengths. 
However, some customers may reside in a very\ndeep node (for example, \u201cthe united states of america\u201d) and some in a shallow node (\u201cshorter than\u201d).\nOur model for deploying customers must be flexible enough to accommodate all these possibilities.\n\n3.1 Introducing Suffix Tree Prior\n\nFor this purpose, we assume that each node i in the suffix tree has a hidden probability q_i of stopping\nat node i when following a path from the root of the tree to add a customer. In other words, (1 \u2212 q_i)\nis the \u201cpenetration probability\u201d when descending an infinite depth suffix tree from its root (Figure 2).\nWe assume that each q_i is generated from a prior Beta distribution independently as:\n\n$$q_i \\sim \\mathrm{Be}(\\alpha, \\beta) \\quad \\text{i.i.d.} \\qquad (2)$$\n\nThis choice is mainly for simplicity; however, later we will show that the final predictive\nperformance does not significantly depend on \u03b1 or \u03b2.\n\nWhen we want to generate a symbol s_t given a context h = s_{\u2212\u221e} \u00b7\u00b7\u00b7 s_{t\u22122} s_{t\u22121}, we descend the suffix\ntree from the root following a path s_{t\u22121} \u2192 s_{t\u22122} \u2192 \u00b7\u00b7\u00b7, according to the probability of stopping at a\nlevel l given by\n\n$$p(n = l \\mid h) = q_l \\prod_{i=0}^{l-1} (1 - q_i) \\qquad (l = 0, 1, \\cdots, \\infty) \\qquad (3)$$\n\nWhen we stop at level l, we generate a symbol s_t using the context s_{t\u2212l} \u00b7\u00b7\u00b7 s_{t\u22122} s_{t\u22121}. Since q_i differs\nfrom node to node, we may reach very deep nodes with high probability if the q_i\u2019s along the path\nare all small (the \u201cpenetration\u201d of this branch is high); or, we may stop at a very shallow node\nif the q_i\u2019s are very high (the \u201cpenetration\u201d is low). 
In general, the probability of reaching a node decays\nexponentially with levels according to (3), but the rate of decay differs from branch to branch, which allows for long sequences\nof typical phrases.\n\nNote that even for the same context h, the context length that was used to generate the next symbol\nmay differ stochastically for each appearance according to (3).\n\n3.2 Inference\n\nOf course, we do not know the hidden probability q_i possessed by each node. How, then, can\nwe estimate it? Note that the generative model above amounts to introducing a vector of hidden\nvariables, n = n_1 n_2 \u00b7\u00b7\u00b7 n_T, that corresponds to each Markov order (n = 0 \u00b7\u00b7\u00b7 \u221e) from which each\nsymbol s_t in s = s_1 s_2 \u00b7\u00b7\u00b7 s_T originated. Therefore, we can write the probability of s as follows:\n\n$$p(\\mathbf{s}) = \\sum_{\\mathbf{n}} \\sum_{\\mathbf{z}} p(\\mathbf{s}, \\mathbf{z}, \\mathbf{n}) \\qquad (4)$$\n\nHere, z = z_1 z_2 \u00b7\u00b7\u00b7 z_T is a vector that represents the hidden seatings of the proxy customers described\nin Section 2, where 0 \u2264 z_t \u2264 n_t indicates how many times s_t\u2019s proxy customers are recursively (and stochastically)\nsent to parent nodes. To estimate these hidden variables n and z, we use a Gibbs sampler as in [9].\nSince in the hierarchical (Poisson-)Dirichlet process the customers are exchangeable [9] and q_i is\ni.i.d. as shown in (2), this process is also exchangeable and therefore we can always assume, by a\nsuitable permutation, that the customer to resample is the final customer.\n\nIn our case, we only explicitly resample n_t given n_{\u2212t} (n excluding n_t), as follows:\n\n$$n_t \\sim p(n_t \\mid \\mathbf{s}, \\mathbf{z}_{-t}, \\mathbf{n}_{-t}) \\qquad (5)$$\n\nNotice here that when we sample n_t, we already know the other depths n_{\u2212t} that other words have\nreached in the suffix tree. 
Therefore, when computing (5) using (3), the expectation of each q_i is\n\n$$E[q_i] = \\frac{a_i + \\alpha}{a_i + b_i + \\alpha + \\beta}, \\qquad (6)$$\n\nwhere a_i is the number of times node i was stopped at when generating other words, and b_i is\nthe number of times node i was passed by. Using this estimate, we decompose the conditional\nprobability of (5) as\n\n$$p(n_t \\mid \\mathbf{s}, \\mathbf{z}_{-t}, \\mathbf{n}_{-t}) \\propto p(s_t \\mid \\mathbf{s}_{-t}, \\mathbf{z}_{-t}, \\mathbf{n}) \\, p(n_t \\mid \\mathbf{s}_{-t}, \\mathbf{z}_{-t}, \\mathbf{n}_{-t}) \\qquad (7)$$\n\nThe first term is the probability of s_t under HPYLM when the Markov order is known to be n_t,\ngiven by (1). The second term is the prior probability of reaching that node at depth n_t. By using\n(6) and (3), this probability is given by\n\n$$p(n_t = l \\mid \\mathbf{s}_{-t}, \\mathbf{z}_{-t}, \\mathbf{n}_{-t}) = \\frac{a_l + \\alpha}{a_l + b_l + \\alpha + \\beta} \\prod_{i=0}^{l-1} \\frac{b_i + \\beta}{a_i + b_i + \\alpha + \\beta} \\qquad (8)$$\n\nExpression (7) is a tradeoff between these two terms: the prediction of s_t improves\nas the context length n_t grows, but we can select a long context only when the probability of reaching\nthat level in the suffix tree is supported by the other counts in the training data.\n\nUsing these probabilities, we can construct a Gibbs sampler, as shown in Figure 3, to iteratively\nresample n and z in order to estimate the parameter of the variable order hierarchical Pitman-Yor\nlanguage model (VPYLM)2. In this sampler, we first remove the t-th customer who resides at a depth\nof order[t] in the suffix tree, and decrement a_i or b_i accordingly along the path. Sampling a new\ndepth (i.e. Markov order) according to (7), we put the t-th customer back at the new depth recorded\nas order[t], and increment a_i or b_i accordingly along the new path. 
When we add a customer s_t, z_t\nis implicitly sampled because s_t\u2019s proxy customer is recursively sent to parent nodes whenever a new\ntable is needed to seat him.\n\nfor j = 1 \u00b7\u00b7\u00b7 N do\n    for t = randperm(1 \u00b7\u00b7\u00b7 T) do\n        if j > 1 then\n            remove customer (order[t], s_t, s_{1:t\u22121})\n        end if\n        order[t] = add customer (s_t, s_{1:t\u22121})\n    end for\nend for\n\nFigure 3: Gibbs Sampler of VPYLM.\n\nstruct ngram {               /* n-gram node */\n    ngram *parent;\n    splay *children;         /* = (ngram **) */\n    splay *symbols;          /* = (restaurant **) */\n    int stop;                /* a_h */\n    int through;             /* b_h */\n    int ncounts;             /* c(h) */\n    int ntables;             /* t_{h\u00b7} */\n    int id;                  /* symbol id */\n};\n\nFigure 4: Data structure of a suffix tree node.\nCounts a_h and b_h are maintained at each node. We\nused Splay Trees for efficient insertion/deletion.\n\n2 This is a specific application of our model to the hierarchical Pitman-Yor processes for discrete data.\n\n\u2018how queershaped little children drawling-desks, which would get through that dormouse!\u2019\nsaid alice; \u2018let us all for anything the secondly, but it to have and another question, but i\nshalled out, \u2018you are old,\u2019 said the you\u2019re trying to far out to sea.\n\n(a) Random walk generation from a character model.\n\nCharacter:    \u00b7\u00b7\u00b7 said alice; \u2018let us all for anything the secondly, \u00b7\u00b7\u00b7\nMarkov order: 5654710654371482446554455645677753345911648989444734 3 \u00b7\u00b7\u00b7\n\n(b) Markov orders used to generate each character above.\n\nFigure 5: Character-based infinite Markov model trained on \u201cAlice in Wonderland.\u201d\n\nThis sampler is an extension of that reported in [9] using stochastically different orders n (n =\n0 \u00b7\u00b7\u00b7 \u221e) for each customer. 
In practice, we can place some maximum order n_max on n and sample\nwithin it 3, or use a small threshold \u03b5 to stop the descent when the prior probability (8) of reaching\nthat level is smaller than \u03b5. In the latter case, we obtain an \u201cinfinite\u201d order Markov model: now we can\neliminate the order from Markov models by integrating it out.\n\nBecause each node in the suffix tree may have a huge number of children, we used Splay Trees [11]\nfor efficient search as in [6]. Splay Trees are self-organizing binary search trees with amortized\nO(log n) cost per operation that automatically put frequent items at shallower nodes. This is ideal for sequences\nwith a power law property like natural languages. Figure 4 shows our data structure of a node in a\nsuffix tree.\n\n3.3 Prediction\n\nSince we do not usually know the Markov order of a context h = s_{\u2212\u221e} \u00b7\u00b7\u00b7 s_{\u22122} s_{\u22121} beforehand, when\nmaking predictions we consider n as a latent variable and average over it, as follows:\n\n$$p(s|h) = \\sum_{n=0}^{\\infty} p(s, n|h) \\qquad (9)$$\n$$\\phantom{p(s|h)} = \\sum_{n=0}^{\\infty} p(s|h, n) \\, p(n|h) \\qquad (10)$$\n\nHere, p(s|h, n) is an HPYLM prediction of order n through (1), and p(n|h) is the probability\ndistribution of latent Markov order n possessed by the context h, obtained through (8). In practice, we\nfurther average (10) over the configurations of n and s through N Gibbs iterations on training data\ns, as HPYLM does.\n\nSince p(n|h) has a product form as (3), we can also write the above expression recursively by\nintroducing an auxiliary probability p(s|h, n^+) as follows:\n\n$$p(s|h, n^+) = q_n \\cdot p(s|h, n) + (1 - q_n) \\cdot p(s|h, (n+1)^+), \\qquad (11)$$\n$$p(s|h) \\equiv p(s|h, 0^+). \\qquad (12)$$\n\nThis formula shows that q_n in fact defines the stick-breaking process on an infinite tree, where\nbreaking proportions will differ branch to branch as opposed to a single proportion on a unit interval\nused in ordinary Dirichlet processes. In practice, we can truncate the infinite recursion in (11) and\nrescale it to make p(n|h) a proper distribution.\n\n3.4 \u201cStochastic Phrases\u201d on Suffix Tree\n\nIn the expression (9) above, p(s, n|h) is the probability that the symbol s is generated by a Markov\nprocess of order n on h, that is, using the last n symbols of h as a Markov state. This means that a\nsubsequence s_{\u2212n} \u00b7\u00b7\u00b7 s_{\u22121} s forms a \u201cphrase\u201d: for example, when \u201cGaussians\u201d was generated using a\ncontext \u201cmixture of\u201d, we can consider \u201cmixture of Gaussians\u201d as a phrase and assign a probability\nto this subsequence, which represents its cohesion strength irrespective of its length.\n\nIn other words, instead of emitting a single symbol s at the root node of the suffix tree, we can first\nstochastically descend the tree according to the stopping probability (3). Finally, we emit s given\nthe context s_{\u2212n} \u00b7\u00b7\u00b7 s_{\u22121}, which yields a phrase s_{\u2212n} \u00b7\u00b7\u00b7 s_{\u22121} s and its cohesion probability. Therefore,\nby traversing the suffix tree, we can compute p(s, n|h) for all the subsequences efficiently. For\nconcrete examples, see Figures 8 and 10 in Section 4.\n\n3 Notice that by setting (\u03b1, \u03b2) = (0, \u221e), we always obtain q_i = 0: with some maximum order n_max, this\nis equivalent to always using the maximum depth, and thus to reducing the model to the original HPYLM. 
In\nthis regard, VPYLM is a natural superset that includes HPYLM [9].\n\n[Figure 6 (Hinton diagram not reproduced): vertical axis is the Markov order n, from 0 to 9; horizontal axis is the word sequence of a training sentence.]\n\nFigure 6: Estimated Markov order distributions from which each word has been generated.\n\n4 Experiments\n\nTo investigate the behavior of the infinite Markov model, we conducted experiments on character\nand word sequences in natural language.\n\n4.1 Infinite character Markov model\n\nCharacter-based Markov models are widely employed in data compression and have important\napplications in language processing, such as OCR and unknown word recognition. In this experiment, we\nused the 140,931-character text of \u201cAlice in Wonderland\u201d and built an infinite Markov model using\na uniform Beta prior and the truncation threshold \u03b5 = 0.0001 of Section 3.2.\nFigure 5(a) is a random walk generation from this infinite model.\nTo generate this, we begin with an infinite sequence of \u2018beginning\nof sentence\u2019 special symbols, and sample the next character\naccording to the generative model given the already sampled sequence as\nthe context. Figure 5(b) shows the actual Markov orders used for\ngeneration by (8). Without any notion of \u201cword\u201d, we can see that our\nmodel correctly captures it and even higher dependencies between\n\u201cwords\u201d. 
In fact, the model contained many nodes that correspond to valid words as well as the\nconnective fragments between them. Table 1 shows predictive perplexity4 results on separate test\ndata. Compared with truncations n = 3, 5 and 10, the infinite model performs best among all the\nvariable order options.\n\nTable 1: Perplexity results of character models.\n\nMax. order   Perplexity\nn = 3        6.048\nn = 5        3.803\nn = 10       3.519\nn = \u221e        3.502\n\n4.2 Bayesian \u221e-gram Language Model\n\nData. For a word-based \u201cn-gram\u201d model of language, we used a random subset of the standard\nNAB Wall Street Journal language modeling corpus [12] 5, totalling 10,007,108 words (409,246\nsentences) for training and 10,000 sentences for testing. Symbols that occurred fewer than 10 times\nin total and punctuation (commas, quotation marks etc.) were mapped to special characters, and all\nsentences were lowercased, yielding a lexicon of 26,497 words. As HPYLM is shown to converge\nvery fast [9], according to preliminary experiments we used N = 200 Gibbs iterations for burn-in,\nand a further 50 iterations to evaluate the perplexity of the test data.\n\nResults. Figure 6 shows the Hinton diagram of estimated Markov order distributions on part of the\ntraining data, computed according to (7). As for the perplexity, Table 2 shows the results compared\nwith the fixed-order HPYLM, together with the number of nodes in each model. n means the fixed order for\nHPYLM, and the maximum order n_max in VPYLM. For the \u201cinfinite\u201d model of n = \u221e, we used a\nthreshold \u03b5 = 10^{\u22128} in Section 3.2 for descending the suffix tree.\nAs empirically found by [12], perplexities will saturate when n becomes large, because only a small\nportion of words actually exhibit long-range dependencies. 
However, we can see that the VPYLM\nperformance is comparable to that of HPYLM with far fewer nodes and restaurants up to n = 7\nand 8, where vanilla HPYLM encounters memory overflow caused by a rapid increase in the number\nof parameters. In fact, the inference of VPYLM is about 20% faster than that of HPYLM of the\nsame order despite the additional cost of sampling n-gram orders, because it appropriately avoids\nthe addition of unnecessarily deep nodes on the suffix tree. The perplexity at n = \u221e is the lowest\namong all fixed truncations, and the model contains only the necessary number of nodes.\n\nn    HPYLM    VPYLM    Nodes(H)   Nodes(V)\n3    113.60   113.74   1,417K     1,344K\n5    101.08   101.69   12,699K    7,466K\n7    N/A      100.68   27,193K    10,182K\n8    N/A      100.58   34,459K    10,434K\n\u221e    -        100.36   -          10,629K\n\nTable 2: Perplexity Results of VPYLM and\nHPYLM on the NAB corpus with the number\nof nodes in each model. N/A means a memory\noverflow caused by the expected number of nodes\nshown in italic.\n\n[Figure 7 (histogram not reproduced): horizontal axis is the Markov order n, from 0 to 12; vertical axis is the number of occurrences, from 0 to 3.5\u00d710^6.]\n\nFigure 7: Global distribution of sampled\nMarkov orders on the \u221e-gram VPYLM over\nthe NAB corpus. n = 0 is unigram, n = 1 is\nbigram, \u00b7\u00b7\u00b7.\n\n4 Perplexity is a reciprocal of average predictive probabilities, thus smaller is better.\n5 We also conducted experiments on standard corpora of Chinese (character-wise) and Japanese, and\nobtained the same line of results presented in this paper.\n\nFigure 7 shows a global n-gram order distribution from a single posterior sample of Gibbs iteration in\n\u221e-gram VPYLM. Note that since we added an infinite number of dummy symbols to the sentence\nheads as usual, every word context has a maximum possible length of \u221e. 
We can see from this\nfigure that the context lengths that were actually used decay largely exponentially, as intuitively\nexpected. Because of the tradeoff between using a longer, more predictive context and the penalty\nincurred when reaching a deeper node, interestingly a peak emerges around n = 3 \u223c 4 as a global\nphenomenon.\n\nWith regard to the hyperparameter that defines the prior forms of suffix trees, we used a (4, 1)-prior\nin this experiment.\nIn fact, this hyperparameter can be optimized by the empirical Bayes\nmethod using each Beta posterior of q_i in (6). By using the Newton-Raphson iteration of [13], this\nconverged to (0.85, 0.57) on a 1 million word subset of the NAB corpus. However, we can see that\nthe performance does not depend significantly on the prior. Figure 9 shows perplexity results for the\nsame data, using (\u03b1, \u03b2) \u2208 (0.1 \u223c 10)\u00d7(0.1 \u223c 10). We can see from this figure that the performance\nis almost stable, except when \u03b2 is significantly greater than \u03b1. Finally, we show in Figure 8 some\n\u201cstochastic phrases\u201d in Section 3.4 induced on the NAB corpus.\n\n4.3 Variable Order Topic Model\n\nWhile previous approaches to latent topic modeling assumed a fixed order such as unigrams or\nbigrams, the order is generally not fixed and unknown to us. Therefore, we used a Gibbs sampler\nfor the Markov chain LDA [14] and augmented it by sampling Markov orders at the same time.\n\nBecause \u201ctopic-specific\u201d sequences constitute only some part of the entire data, we assumed that the\n\u201cgeneric\u201d model generated the document according to probability \u03bb, and the rest are generated by\nthe LDA of VPYLM. We endowed \u03bb with a uniform Beta prior and used its posterior estimate, which\ndiffers from document to document, for sampling.\n\nFor the experiment, we used the NIPS papers dataset of 1739 documents. 
Among them, we used\n1500 random documents for training and 50 random documents from the remaining 239 documents\nfor testing, after the same preprocessing as for the NAB corpus. We set a symmetric Dirichlet prior\n\u03b3 = 0.1, the number of topics M = 5, and n_max = 5, and ran N = 200 Gibbs iterations to obtain\na single posterior set of models.\n\nAlthough the improvements in predictive perplexity are slight (VPYLDA=116.62,\nVPYLM=117.28), \u201cstochastic phrases\u201d computed on each topic VPYLM show interesting\ncharacteristics, as shown in Figure 10. Although we used a small number of latent topics in this\nexperiment to avoid data sparseness, in future research we need a more flexible topic model where\nthe number of latent topics will differ from node to node in the suffix tree.\n\np(s, n)   Stochastic phrases in the suffix tree\n0.9784    primary new issues\n0.9726    ^ at the same time\n0.9512    is a unit of\n0.9026    from # % in # to # %\n0.8896    in a number of\n0.8831    in new york stock exchange composite trading\n0.7566    mechanism of the european monetary\n0.7134    increase as a result of\n0.6617    tiffany & co.\n  :\n\nFigure 8: \u201cStochastic phrases\u201d induced by the 8-gram\nVPYLM trained on the NAB corpus.\n\n[Figure 9 (plot not reproduced): perplexity (PPL), roughly between 122 and 136, plotted against \u03b1 and \u03b2, each on an axis from 0.1 to 10.]\n\nFigure 9: Perplexity results using different\nhyperparameters on the 1M NAB\ncorpus.\n\n(a) Topic 0 (\u201cgeneric\u201d)\np(n, s)   Phrase\n0.9904    in section #\n0.9900    the number of\n0.9856    in order to\n0.9832    in table #\n0.9752    dealing with\n0.9693    with respect to\n\n(b) Topic 1\np(n, s)   Phrase\n0.9853    et al\n0.9840    receptive field\n0.9630    excitatory and inhibitory\n0.9266    in order to\n0.8939    primary visual cortex\n0.8756    corresponds to\n\n(c) Topic 4\np(n, s)   Phrase\n0.9823    monte carlo\n0.9524    associative memory\n0.9081    as can be seen\n0.8206    parzen windows\n0.8044    in the previous section\n0.7790    american institute of physics\n\nFigure 10: Topic based stochastic phrases.\n\n5 Discussion and Conclusion\n\nIn this paper, we presented a completely generative approach to estimating variable order Markov\nprocesses. By extending a stick-breaking process \u201cvertically\u201d over a suffix tree of hierarchical\nChinese restaurant processes, we can make a posterior inference on the Markov orders from which each\ndata point originates.\n\nAlthough our architecture looks similar to Polya Trees [15], in Polya Trees their recursive partitions\nare independent while our stick-breakings are hierarchically organized according to the suffix tree.\nIn addition to the apparent application of our approach to hierarchical continuous distributions like\nGaussians, we expect that the basic model can be used for the distribution of latent variables. Each\ndata point is assigned to a deeper level just when needed, and resides not only in leaf nodes but also in the\nintermediate nodes, by stochastically descending a clustering hierarchy from the root as described\nin this paper.\n\nReferences\n\n[1] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379\u2013423, 623\u2013656, 1948.\n\n[2] Alberto Apostolico and Gill Bejerano. Optimal amnesic probabilistic automata, or, how to learn and classify proteins in linear time and space. Journal of Computational Biology, 7:381\u2013393, 2000.\n\n[3] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The Context-Tree Weighting Method: Basic Properties. IEEE Trans. on Information Theory, 41:653\u2013664, 1995.\n\n[4] Frederick Jelinek. Statistical Methods for Speech Recognition. Language, Speech, and Communication Series. MIT Press, 1998.\n\n[5] Peter Buhlmann and Abraham J. Wyner. Variable Length Markov Chains. The Annals of Statistics, 27(2):480\u2013513, 1999.\n\n[6] Fernando Pereira, Yoram Singer, and Naftali Tishby. Beyond Word N-grams. In Proc. of the Third Workshop on Very Large Corpora, pages 95\u2013106, 1995.\n\n[7] Dana Ron, Yoram Singer, and Naftali Tishby. The Power of Amnesia. In Advances in Neural Information Processing Systems, volume 6, pages 176\u2013183, 1994.\n\n[8] Andreas Stolcke. Entropy-based Pruning of Backoff Language Models. In Proc. of DARPA Broadcast News Transcription and Understanding Workshop, pages 270\u2013274, 1998.\n\n[9] Yee Whye Teh. A Bayesian Interpretation of Interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS, 2006.\n\n[10] Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Interpolating Between Types and Tokens by Estimating Power-Law Generators. In NIPS 2005, 2005.\n\n[11] Daniel Sleator and Robert Tarjan. Self-Adjusting Binary Search Trees. JACM, 32(3):652\u2013686, 1985.\n\n[12] Joshua T. Goodman. A Bit of Progress in Language Modeling, Extended Version. Technical Report MSR\u2013TR\u20132001\u201372, Microsoft Research, 2001.\n\n[13] Thomas P. Minka. Estimating a Dirichlet distribution, 2000. http://research.microsoft.com/~minka/papers/dirichlet/.\n\n[14] Mark Girolami and Ata Kab\u00e1n. Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles. In NIPS 2003. 2003.\n\n[15] R. Daniel Mauldin, William D. Sudderth, and S. C. Williams. Polya Trees and Random Distributions. Annals of Statistics, 20(3):1203\u20131221, 1992.\n", "award": [], "sourceid": 837, "authors": [{"given_name": "Daichi", "family_name": "Mochihashi", "institution": null}, {"given_name": "Eiichiro", "family_name": "Sumita", "institution": null}]}