{"title": "Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1493, "page_last": 1501, "abstract": "Hierarchical Hidden Markov Models (HHMMs) are sophisticated stochastic models that enable us to capture a hierarchical context characterization of sequence data. However, existing HHMM parameter estimation methods require large computations of time complexity O(TN^{2D}) at least for model inference, where D is the depth of the hierarchy, N is the number of states in each level, and T is the sequence length. In this paper, we propose a new inference method of HHMMs for which the time complexity is O(TN^{D+1}). A key idea of our algorithm is application of the forward-backward algorithm to ''state activation probabilities''. The notion of a state activation, which offers a simple formalization of the hierarchical transition behavior of HHMMs, enables us to conduct model inference efficiently. We present some experiments to demonstrate that our proposed method works more efficiently to estimate HHMM parameters than do some existing methods such as the flattening method and Gibbs sampling method.", "full_text": "Forward-Backward Activation Algorithm for\n\nHierarchical Hidden Markov Models\n\nKei Wakabayashi\n\nFaculty of Library, Information and Media Science\n\nUniversity of Tsukuba, Japan\n\nTakao Miura\n\nDepartment of Engineering\n\nHosei University, Japan\n\nkwakaba@slis.tsukuba.ac.jp\n\nmiurat@hosei.ac.jp\n\nAbstract\n\nHierarchical Hidden Markov Models (HHMMs) are sophisticated stochastic mod-\nels that enable us to capture a hierarchical context characterization of sequence\ndata. However, existing HHMM parameter estimation methods require large com-\nputations of time complexity O(T N 2D) at least for model inference, where D is\nthe depth of the hierarchy, N is the number of states in each level, and T is the\nsequence length. 
In this paper, we propose a new inference method of HHMMs\nfor which the time complexity is O(T N D+1). A key idea of our algorithm is ap-\nplication of the forward-backward algorithm to state activation probabilities. The\nnotion of a state activation, which offers a simple formalization of the hierarchical\ntransition behavior of HHMMs, enables us to conduct model inference ef\ufb01ciently.\nWe present some experiments to demonstrate that our proposed method works\nmore ef\ufb01ciently to estimate HHMM parameters than do some existing methods\nsuch as the \ufb02attening method and Gibbs sampling method.\n\n1 Introduction\n\nLatent structure analysis of sequence data is an important technique for many applications such\nas speech recognition, bioinformatics, and natural language processing. Hidden Markov Models\n(HMMs) play a key role in solving these problems. HMMs assume a single Markov chain of hidden\nstates as the latent structure of sequence data. Because of this simple assumption, HMMs tend to\ncapture only local context patterns of sequence data. Hierarchical Hidden Markov Models (HH-\nMMs) are stochastic models which assume hierarchical Markov chains of hidden states as the latent\nstructure of sequence data [3]. HHMMs have a hierarchical state transition mechanism that yields\nthe capability of capturing global and local sequence patterns in various granularities. By their na-\nture, HHMMs are applicable to problems of many kinds including handwritten letter recognition [3],\ninformation extraction from documents [11], musical pitch structure modeling [12], video structure\nmodeling [13], and human activity modeling [8, 6].\nFor conventional HMMs, we can conduct unsupervised learning ef\ufb01ciently using the forward-\nbackward algorithm, which is a kind of dynamic programming [9].\nIn situations where few or\nno supervised data are available, the existence of the ef\ufb01cient unsupervised learning algorithm is\na salient advantage of using HMMs. 
The unsupervised learning of HHMMs is an important technique, as it is for HMMs. In this paper, we discuss unsupervised learning techniques for HHMMs. We introduce a key notion, activation probability, to formalize the hierarchical transition mechanism naturally. Using this notion, we propose a new exact inference algorithm which has lower time complexity than existing methods.\nThe remainder of the paper is organized as follows. In section 2, we overview HHMMs. In section 3, we survey HHMM parameter estimation techniques proposed to date. In section 4, we introduce our parameter estimation algorithm. Section 5 presents experiments to show the effectiveness of our algorithm. We conclude our discussion in section 6.\n\nFigure 1: (left) Dynamic Bayesian network of the HHMM. (top-right) Tree representation of the HHMM state space. (bottom-right) State identification by the absolute path of the tree.\n\n2 Hierarchical Hidden Markov Models\nLet O = {O_1, ..., O_t, ..., O_T} be a sequence of observations in which subscript t denotes the time in the sequence. We designate time as an integer index of observation numbered from the beginning of the sequence. HHMMs define Q^d_t for 1 ≤ t ≤ T, 1 ≤ d ≤ D as a hidden state at time t and level d, where d = 1 represents the top level and d = D represents the bottom level. HHMMs also define binary variables F^d_t, called termination indicators. If F^d_t = 1, the Markov chain of level d terminates at time t. In HHMMs, a state transition at level d is permitted only when the Markov chain of level d + 1 terminates, i.e. Q^d_t = Q^d_{t-1} if F^{d+1}_{t-1} = 0. A terminated Markov chain is initialized again at the next time step. Figure 1 (left) presents a Dynamic Bayesian Network (DBN) expression for an HHMM of hierarchical depth D = 3. The conditional probability distributions of Q, F and O are defined as follows [7].
p(Q^d_t = j | Q^d_{t-1} = i, F^{d+1}_{t-1} = b, F^d_{t-1} = f, Q^{1:d-1}_t = k) = δ(i, j) (if b = 0); A^d_k(i, j) (if b = 1, f = 0); π^d_k(j) (if b = 1, f = 1)\n\np(F^d_t = 1 | Q^d_t = i, Q^{1:d-1}_t = k, F^{d+1}_t = b) = 0 (if b = 0); A^d_k(i, end) (if b = 1)\n\np(O_t = v | Q^{1:D}_t = k) = B_k(v)\n\nWe use the notation Q^{1:d-1} for a combination of states {Q^1_t, ..., Q^{d-1}_t}. Probabilities of the initialization and the state transition of Markov chains at level d depend on all higher states Q^{1:d-1}. A^d_k(i, j) is a model parameter of the transition probability at level d from state i to j when Q^{1:d-1}_t = k. A^d_k(i, end) denotes a termination probability that state i terminates the Markov chain at level d when Q^{1:d-1}_t = k. π^d_k(j) is an initial state probability of state j at level d when Q^{1:d-1}_t = k. B_k(v) is an output probability of observation v when Q^{1:D}_t = k.\nA state space of an HHMM is expressed as a tree structure [3]. Figure 1 (top-right) presents a tree expression of the state space of an HHMM for which the depth D = 3 and the number of states in each level N = 3. The level of the tree corresponds to the level of HHMM states. Each node at level d corresponds to a combination of states Q^{1:d}. Each node has N children because there are N possible states for each level. The rectangles in the figure denote local HMMs in which nodes can mutually transit directly using the transition probability A. For the analysis described herein, we assume a balanced N-ary tree to simplify discussions of computational complexity. However, arbitrary state space trees do not change the substance of what follows.\nThe behavior of the Markov chain at level d depends on the combination of all higher-up states Q^{1:d-1}, not only on the individual Q^d.
In the tree structure, the absolute path which corresponds to Q^{1:d} is meaningful, rather than the relative path which corresponds to Q^d. We refer to Q^{1:d} as Z^d and call it the absolute path state. Figure 1 (bottom-right) presents an absolute path state identification. The set of values taken by an absolute path state at level d, denoted by Ω^d, contains N^d elements in the balanced N-ary tree state space. We define a function to obtain the parent absolute path state of Z^d as parent(Z^d). Similarly, we define a function to obtain the set of child absolute path states of Z^d as child(Z^d), and a function to obtain the set of siblings of Z^d as sib(Z^d) = child(parent(Z^d)).\n\nTable 1: Notation for HHMMs.\nD : Depth of hierarchy\nN : Number of states in each level\nΩ^d : Set of values taken by absolute path state at level d\nZ^d_t ∈ Ω^d : Absolute path state at time t and level d\nF^d_t ∈ {0, 1} : Termination indicator at time t and level d\nO_t ∈ {1, ..., V} : Observation at time t\nA_dij : State transition probability from state Z^d_t = i to state Z^d_{t+1} = j at level d\nA_diEnd : Termination probability of the Markov chain at level d from state Z^d_t = i\nπ_di : Initial state probability of state Z^d = i at level d\nB_iv : Output probability of observation v with Z^D = i\n\nTable 1 presents the notation used for the HHMM description. We use the notation of the absolute path state Z^d rather than Q^d throughout the paper. Therefore, we define compatible notations for the model parameters. Whereas the conventional notation π^d_k(j) denotes the initial state probability of Q^d = j when Q^{1:d-1} = k, we aggregate Q^d and Q^{1:d-1} into Q^{1:d} = Z^d and define π_di as the initial state probability of Z^d = i. Similarly, we define A_dij as the state transition probability from Z^d = i to j.
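The tree relations used throughout the paper (parent, child, sib, and the level sets Ω^d) are straightforward to realize concretely. Below is a minimal sketch (ours, not the paper's code), assuming absolute path states are encoded as Python tuples over {0, ..., N-1}:

```python
import itertools

N = 3  # number of states per level (assumption: balanced N-ary tree)

def parent(z):
    """Absolute path state one level above: drop the last component."""
    return z[:-1]

def child(z):
    """All N absolute path states one level below z."""
    return [z + (c,) for c in range(N)]

def sib(z):
    """Siblings of z, including z itself: sib(z) = child(parent(z))."""
    return child(parent(z))

def omega(d):
    """The set of absolute path states at level d; |omega(d)| = N**d."""
    return list(itertools.product(range(N), repeat=d))
```

For example, for z = (0, 2) at level 2, sib(z) returns all three level-2 states under the parent (0,), including z itself, and omega(3) has N^3 = 27 elements.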
Note that Σ_{i'∈sib(i)} π_di' = 1 and Σ_{j'∈{sib(i)∪End}} A_dij' = 1.\n\n3 Existing Parameter Estimation Methods for HHMMs\n\nThe first work for HHMMs [3] proposed the generalized Baum-Welch algorithm. This algorithm is based on the inside-outside algorithm used for inference of probabilistic context-free grammars. This method takes O(T^3) time complexity, which is not practical for long sequence data.\nA more efficient approach is the flattening method [7]. The hierarchical state sequence can be reduced to a single sequence of the bottom level absolute path states {Z^D_1, ..., Z^D_T}. If we regard Z^D as a flat HMM state, then we can conduct the inference by using the forward-backward algorithm with O(TN^{2D}) time complexity, since |Ω^D| = N^D. Notice that the flat state Z^D can transit to any other flat state, so we cannot apply efficient algorithms for HMMs with sparse transition matrices. In the flattening method, we must place a weak constraint on the HHMM parameters, called minimally self-referential (MinSR) [12], which forbids self-transitions at the higher levels, i.e. A_dii = 0 for 1 ≤ d ≤ D-1. The MinSR constraint enables us to identify the path connecting two flat states uniquely. This property is necessary for estimating HHMM parameters by using the flattening method.\nWe also discuss a sampling approach as an alternative parameter estimation technique. Gibbs sampling is often used for parameter estimation of probabilistic models including latent variables [4]. We can estimate HMM parameters using a Gibbs sampler, which samples each hidden state iteratively. This method is applicable to inference of HHMMs in a straightforward manner on the flat HMM.
This straightforward approach, called the Direct Gibbs Sampler (DGS), takes O(TN^D) time for a single iteration.\nThe convergence of a posterior distribution under the DGS method is known to be extremely slow for HMMs [10] because the DGS ignores long-range time dependencies. Chib [2] introduced an alternative method, called the Forward-Backward Gibbs Sampler (FBS), which calculates forward probabilities in advance. FBS samples hidden states from the end of the sequence with respect to the forward probabilities. The FBS method requires more computation for a single iteration than DGS does, but it can bring a posterior of hidden states to its stationary distribution with fewer iterations [10].\nHeller [5] proposed Infinite Hierarchical Hidden Markov Models (IHHMMs), which can have an infinitely large depth by weakening the dependency between the states at different levels. They proposed an inference method for IHHMMs based on a blocked Gibbs sampler whose sampling unit is a state sequence from t = 1 to T at a single level. This inference takes only O(TD) time for a single iteration. In HHMMs, the states in each level are strongly dependent, so resampling a state at an intermediate level causes all lower states to alter into states with completely different behavior. Therefore, it is not practical to apply this Gibbs sampler to HHMMs in terms of convergence speed.\n\n4 Forward-Backward Activation Algorithm\n\nIn this section, we introduce a new parameter estimation algorithm for HHMMs, which theoretically has O(TN^{D+1}) time complexity. The basic idea of our algorithm is a decomposition of the flat transition probability distribution p(Z^D_{t+1} | Z^D_t), which the flattening method calculates directly for all pairs of the flat states.
We can rewrite the flat transition probability distribution into a sum of two cases, according to whether the Markov chain at level D terminates or not, as follows.\n\np(Z^D_{t+1} | Z^D_t) = p(Z^D_{t+1} | Z^D_t, F^D_t = 0) p(F^D_t = 0 | Z^D_t) + p(Z^D_{t+1} | Z^{D-1}_{t+1}, F^D_t = 1) p(Z^{D-1}_{t+1} | Z^{D-1}_t, F^D_t = 1) p(F^D_t = 1 | Z^D_t)\n\nThe first term corresponds to the direct transition without the Markov chain termination. The actual computational complexity for calculating this term is O(N^{D+1}) because the direct transition is permitted only between the sibling states, i.e. A_Dij = 0 if j ∉ sib(i). The second term, corresponding to the case in which the Markov chain terminates at level D, contains two factors: the upper level transition probability p(Z^{D-1}_{t+1} | Z^{D-1}_t, F^D_t = 1) and the state initialization probability for the terminated Markov chain p(Z^D_{t+1} | Z^{D-1}_{t+1}, F^D_t = 1). We attempt to compute these probability distributions efficiently in a dynamic programming manner.\nThe transition probability at level d has the form p(Z^d_{t+1} | Z^d_t, F^{d+1}_t = 1). We define the ending activation e^d_t, as the condition of the transition probability from Z^d_t, formally:\n\np(e^d_t = i) = p(Z^d_t = i, F^{d+1}_t = 1) (if i ≠ null and d < D); p(Z^d_t = i) (if i ≠ null and d = D); p(F^{d+1}_t = 0) (if i = null)\n\nThe null value in e^d_t indicates that the Markov chain at level d + 1 does not terminate at time t. The state initialization probability for level d + 1 has the form p(Z^{d+1}_t | Z^d_t, F^{d+1}_{t-1} = 1).
We define the beginning activation b^d_t, as the condition of the state initialization probability from Z^d_t, formally, as\n\np(b^d_t = i) = p(Z^d_t = i, F^{d+1}_{t-1} = 1) (if i ≠ null and d < D and t > 1); p(Z^d_t = i) (if i ≠ null and (d = D or t = 1)); p(F^{d+1}_{t-1} = 0) (if i = null)\n\nThe null value in b^d_t indicates that the Markov chain at level d + 1 does not terminate at time t - 1. Using these notations, we can represent the flat transition with propagations of activation probabilities as shown in figure 2 (left), because p(Z^D_{t+1} | Z^D_t) = p(b^D_{t+1} | e^D_t). This representation naturally describes the decomposition of the flat transition probability distribution discussed above, and it enables us to apply the decomposition recursively for all levels. We can derive the conditional probability distributions of e^d_t and b^d_{t+1} as\n\np(e^d_t = i | e^{d+1}_t) = Σ_{c∈child(i)} p(e^{d+1}_t = c) A_(d+1)cEnd (if i ≠ null); Σ_{c∈Ω^{d+1}} p(e^{d+1}_t = c)(1 - A_(d+1)cEnd) + p(e^{d+1}_t = null) (if i = null)\n\np(b^d_{t+1} = i | e^d_t, b^{d-1}_{t+1}) = p(b^{d-1}_{t+1} = parent(i)) π_di + Σ_{j∈sib(i)} p(e^d_t = j) A_dji (if i ≠ null); p(e^d_t = null) (if i = null)\n\nIn the following subsections, we show the efficient inference algorithm and the parameter estimation algorithm using the activation probabilities.\n\n4.1 Inference using Forward and Backward Activation Probabilities\n\nWe can translate the DBN of HHMMs in figure 1 (left) equivalently into a simpler DBN using activation probabilities. The translated DBN is portrayed in figure 2 (right). The inference algorithm proposed herein is based on a forward-backward calculation over this DBN.
We define the forward activation probability α and backward activation probability β as follows.\n\nαe^d_t(i) = p(e^d_t = i, O_{1:t})\nαb^d_t(i) = p(b^d_t = i, O_{1:t-1})\nβe^d_t(i) = p(O_{t+1:T}, F^1_T = 1 | e^d_t = i)\nβb^d_t(i) = p(O_{t:T}, F^1_T = 1 | b^d_t = i)\n\nFigure 2: (left) Propagation of activation probabilities for calculating the flat transition probability from time t to t + 1. (right) Equivalent DBN of the HHMM using activation probabilities.\n\nAlgorithm 1 Calculate forward activation probabilities\n1: for t = 1 to T do\n2:   if t = 1 then\n3:     αb^1_1(i ∈ Ω^1) = π_1i\n4:     for d = 2 to D do\n5:       αb^d_1(i ∈ Ω^d) = αb^{d-1}_1(parent(i)) π_di\n6:     end for\n7:   else\n8:     αb^1_t(i ∈ Ω^1) = Σ_{j∈sib(i)} αe^1_{t-1}(j) A_1ji\n9:     for d = 2 to D do\n10:      αb^d_t(i ∈ Ω^d) = αb^{d-1}_t(parent(i)) π_di + Σ_{j∈sib(i)} αe^d_{t-1}(j) A_dji\n11:     end for\n12:   end if\n13:   αe^D_t(i ∈ Ω^D) = αb^D_t(i) B_iOt\n14:   for d = D - 1 to 1 do\n15:     αe^d_t(i ∈ Ω^d) = Σ_{c∈child(i)} αe^{d+1}_t(c) A_(d+1)cEnd\n16:   end for\n17: end for\n\nThese probabilities are efficiently calculable in a dynamic programming manner. Algorithm 1 presents the pseudocode to calculate the whole α. The αb^d_t are derived downward from αb^1_t to αb^D_t by summing up the initialization probability from the parent and the transition probabilities from the siblings (Line 8 to 11). The αe^d_t are propagated upward from αe^D_t to αe^1_t by summing up the probabilities of the child Markov chain termination (Line 13 to 16).
This algorithm includes the calculation of |Ω^d| = N^d quantities, each involving a summation of |sib(i)| = N terms, for d = 1 to D and for t = 1 to T. Therefore, the time complexity of algorithm 1 is O(T Σ_{d=1}^{D} N^{d+1}) = O(TN^{D+1}). Algorithm 2 propagates the backward activation probabilities similarly in backward order.\nWe can derive the conditional independence of O_{1:t} and {O_{t+1:T}, F^1_T = 1} given e^d_t ≠ null or b^d_{t+1} ≠ null, because both e^d_t ≠ null and b^d_{t+1} ≠ null indicate that the Markov chains at levels d + 1, ..., D terminate at time t. On the basis of this conditional independence, the exact inference of a posterior of activation probabilities can be obtained using α and β as presented below.\n\np(e^d_t = i | O_{1:T}, F^1_T = 1) ∝ p(e^d_t = i, O_{1:t}) p(O_{t+1:T}, F^1_T = 1 | e^d_t = i) = αe^d_t(i) βe^d_t(i)\np(b^d_t = i | O_{1:T}, F^1_T = 1) ∝ p(b^d_t = i, O_{1:t-1}) p(O_{t:T}, F^1_T = 1 | b^d_t = i) = αb^d_t(i) βb^d_t(i)\n\nThe inference of the flat state p(Z^D_t | O_{1:T}, F^1_T = 1) is identical to that of the bottom level activation probability p(e^D_t | O_{1:T}, F^1_T = 1).
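As a concrete illustration of Algorithm 1, the sketch below implements the forward activation pass for a balanced tree in Python. The encoding is ours, not the paper's: absolute path states are tuples over {0, ..., N-1}, and pi, A, A_end, B are dictionaries keyed by those tuples; the sequence likelihood p(O_{1:T}, F^1_T = 1) is read off at the end as Σ_{i∈Ω^1} αe^1_T(i) A_1iEnd.

```python
import itertools

def forward_activation(obs, D, N, pi, A, A_end, B):
    """Forward activation pass (a sketch of Algorithm 1) for a balanced
    N-ary tree state space.

    Encoding (ours, not the paper's): an absolute path state at level d
    is a tuple of length d over {0..N-1}.
      pi[z]      initial probability of z given its parent (sums to 1 over siblings)
      A[zi, zj]  sibling-to-sibling transition probability at zi's level
      A_end[z]   probability that z terminates its level's Markov chain
      B[z, v]    output probability of symbol v from bottom state z
    Returns (alpha_b, alpha_e, likelihood p(O_{1:T}, F^1_T = 1))."""
    levels = [list(itertools.product(range(N), repeat=d)) for d in range(1, D + 1)]
    ab, ae = {}, {}
    T = len(obs)
    for t in range(T):
        # alpha_b, downward from level 1 to D: initialization term from the
        # parent plus transition terms from the siblings.
        for d, states in enumerate(levels, start=1):
            for z in states:
                parent_ab = ab[t, z[:-1]] if d > 1 else (1.0 if t == 0 else 0.0)
                trans = 0.0
                if t > 0:
                    trans = sum(ae[t - 1, z[:-1] + (j,)] * A[z[:-1] + (j,), z]
                                for j in range(N))
                ab[t, z] = parent_ab * pi[z] + trans
        # alpha_e: emission at the bottom level, then upward sums of
        # child-chain termination probabilities.
        for z in levels[D - 1]:
            ae[t, z] = ab[t, z] * B[z, obs[t]]
        for d in range(D - 1, 0, -1):
            for z in levels[d - 1]:
                ae[t, z] = sum(ae[t, z + (c,)] * A_end[z + (c,)] for c in range(N))
    lik = sum(ae[T - 1, (i,)] * A_end[(i,)] for i in range(N))
    return ab, ae, lik
```

We validated this recursion on a toy two-level model by comparing the returned likelihood against brute-force enumeration over all bottom-level state pairs and termination patterns.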
We can calculate the likelihood of the whole observation as follows.\n\np(O_{1:T}, F^1_T = 1) = Σ_{i∈Ω^1} p(e^1_T = i, O_{1:T}) p(F^1_T = 1 | e^1_T = i) = Σ_{i∈Ω^1} αe^1_T(i) βe^1_T(i)\n\nAlgorithm 2 Calculate backward activation probabilities\n1: for t = T to 1 do\n2:   if t = T then\n3:     βe^1_T(i ∈ Ω^1) = A_1iEnd\n4:     for d = 2 to D do\n5:       βe^d_T(i ∈ Ω^d) = βe^{d-1}_T(parent(i)) A_diEnd\n6:     end for\n7:   else\n8:     βe^1_t(i ∈ Ω^1) = Σ_{j∈sib(i)} βb^1_{t+1}(j) A_1ij\n9:     for d = 2 to D do\n10:      βe^d_t(i ∈ Ω^d) = βe^{d-1}_t(parent(i)) A_diEnd + Σ_{j∈sib(i)} βb^d_{t+1}(j) A_dij\n11:     end for\n12:   end if\n13:   βb^D_t(i ∈ Ω^D) = βe^D_t(i) B_iOt\n14:   for d = D - 1 to 1 do\n15:     βb^d_t(i ∈ Ω^d) = Σ_{c∈child(i)} βb^{d+1}_t(c) π_(d+1)c\n16:   end for\n17: end for\n\n4.2 Updating Parameters\n\nUsing the forward and backward activation probabilities, we can estimate HHMM parameters efficiently in the EM framework. In the EM algorithm, the function Q(θ, θ̄) is defined, where θ is the parameter set before updating and θ̄ is the parameter set after updating, as described below.\n\nQ(θ, θ̄) = Σ_Y p_θ(Y | X) log p_θ̄(X, Y)\n\nIn that equation, X represents a set of observed variables, and Y is a set of latent variables. The difference of log likelihood between the models of θ and θ̄ is known to be greater than Q(θ, θ̄) - Q(θ, θ) [1]. For this reason, we can increase the likelihood monotonically by selecting a new parameter θ̄ to maximize the function Q.
For HHMMs, the set of parameters is θ = {A, π, B}. The set of observed variables is X = {O_{1:T}, F^1_T = 1}. The set of latent variables is Y = {Z^{1:D}_{1:T}, F^{1:D}_{1:T-1}}. Therefore, the function Q can be represented as shown below.\n\nQ(θ, θ̄) ∝ Σ_{Z^{1:D}_{1:T}, F^{1:D}_{1:T-1}} p_θ(O_{1:T}, F^1_T = 1, Z^{1:D}_{1:T}, F^{1:D}_{1:T-1}) log p_θ̄(O_{1:T}, F^1_T = 1, Z^{1:D}_{1:T}, F^{1:D}_{1:T-1})   (1)\n\nThe joint probability of observed variables and latent variables is given below (with the convention F^{D+1}_t = 1 for all t).\n\np_θ(O_{1:T}, F^1_T = 1, Z^{1:D}_{1:T}, F^{1:D}_{1:T-1}) = ∏_{d=1}^{D} π_dZ^d_1 · ∏_{t=1}^{T-1} ∏_{d=1}^{D} (A_dZ^d_tZ^d_{t+1})^{F^{d+1}_t(1-F^d_t)} (A_dZ^d_tEnd π_dZ^d_{t+1})^{F^d_t} · ∏_{d=1}^{D} A_dZ^d_TEnd · ∏_{t=1}^{T} B_Z^D_tO_t\n\nWe substitute this equation for the joint probability in equation (1). We integrate out irrelevant variables and organize around each parameter. Thereby, we obtain the following.\n\nQ(θ, θ̄) ∝ Σ_{d=1}^{D} Σ_{i∈Ω^d} gπ_di log π̄_di + Σ_{d=1}^{D} Σ_{i∈Ω^d} Σ_{j∈{sib(i)∪End}} gA_dij log Ā_dij + Σ_{i∈Ω^D} Σ_{v=1}^{V} gB_iv log B̄_iv\n\nTherein, gπ_di, gA_dij, gB_iv are shown by equations (2)(3)(4)(5).
They are calculable using forward and backward activation probabilities.\n\ngπ_di = αb^d_1(i) βb^d_1(i) + Σ_{t=1}^{T-1} αb^{d-1}_{t+1}(parent(i)) π_di βb^d_{t+1}(i)   (2)\n\ngA_diEnd = Σ_{t=1}^{T-1} αe^d_t(i) A_diEnd βe^{d-1}_t(parent(i)) + αe^d_T(i) βe^d_T(i)   (3)\n\ngA_dij = Σ_{t=1}^{T-1} αe^d_t(i) A_dij βb^d_{t+1}(j)   (4)\n\ngB_iv = Σ_{t:O_t=v} αe^D_t(i) βe^D_t(i)   (5)\n\nTable 2: Log-likelihood achieved at each iteration.\nIteration: 1 / 2 / 3 / 4 / 5 / 10 / 50 / 100\nFBA w/o MinSR: -773.47 / -672.44 / -668.50 / -631.30 / -610.63 / -577.33 / -457.66 / -447.90\nFBA with MinSR: -773.89 / -672.47 / -670.40 / -643.62 / -614.98 / -573.84 / -453.09 / -448.52\nFFB: -773.89 / -672.47 / -670.40 / -643.62 / -614.98 / -573.84 / -453.09 / -448.52\n\nUsing Lagrange multipliers, we can obtain parameters π̄, Ā, B̄, which maximize the function Q under the constraints Σ_{i'∈sib(i)} π̄_di' = 1, Σ_{j'∈{sib(i)∪End}} Ā_dij' = 1, and Σ_v B̄_iv = 1, as shown below.\n\nπ̄_di = gπ_di / Σ_{i'∈sib(i)} gπ_di' ;  Ā_dij = gA_dij / Σ_{j'∈{sib(i)∪End}} gA_dij' ;  B̄_iv = gB_iv / Σ_v gB_iv\n\nConsequently, we can calculate the update parameters using α and β.
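The closed-form update above is simply a per-group normalization of the expected counts g. A minimal sketch follows (the encoding is ours: tuple-keyed dictionaries g_pi, g_A, g_A_end, g_B holding the quantities of equations (2)-(5)):

```python
def m_step(g_pi, g_A, g_A_end, g_B, N):
    """Sketch of the HHMM M-step: each new parameter is its expected count
    divided by the total count of its normalization group (siblings for pi,
    siblings plus End for A, the vocabulary for B).  Absolute path states
    are tuples over {0..N-1}; container names are ours, not the paper's."""
    pi_new, A_new, A_end_new, B_new = {}, {}, {}, {}
    for z in g_pi:
        # normalize over the sibling group of z
        denom = sum(g_pi.get(z[:-1] + (j,), 0.0) for j in range(N))
        pi_new[z] = g_pi[z] / denom
    for (z, w) in g_A:
        # normalize over transitions to siblings plus the End event
        denom = (sum(g_A.get((z, z[:-1] + (j,)), 0.0) for j in range(N))
                 + g_A_end.get(z, 0.0))
        A_new[z, w] = g_A[z, w] / denom
    for z in g_A_end:
        denom = (sum(g_A.get((z, z[:-1] + (j,)), 0.0) for j in range(N))
                 + g_A_end[z])
        A_end_new[z] = g_A_end[z] / denom
    for (z, v) in g_B:
        # normalize over the output vocabulary for bottom state z
        denom = sum(val for (zz, vv), val in g_B.items() if zz == z)
        B_new[z, v] = g_B[z, v] / denom
    return pi_new, A_new, A_end_new, B_new
```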
The time complexity for computing a single EM iteration is O(TN^{D+1}), which is identical to that of the calculation of forward and backward activation probabilities.\n\n5 Experiments\n\nFirst, we experimentally confirm that the forward-backward activation algorithm yields exactly the same parameter estimates as the flattening method does. Recall that we must place the MinSR constraint on the HHMM parameter set in the flattening method (see section 3). We compare three parameter estimation algorithms: our forward-backward activation algorithm for a MinSR HHMM (FBA with MinSR), the same algorithm for an HHMM without MinSR (FBA w/o MinSR), and the flattening method (FFB). The training dataset includes 5 sequences of length 10, which are artificially generated by a MinSR HHMM with a biased parameter set. We execute the three algorithms and examine the log-likelihood achieved at each iteration.\nTable 2 presents the result. The FBA with MinSR and the FFB achieve identical log-likelihood throughout the training. This result provides experimental evidence that our algorithm estimates HHMM parameters exactly as the flattening method does. Furthermore, the FBA enables us to conduct the parameter estimation of HHMMs which have non-zero self-transition parameters.\nTo evaluate the computational costs empirically, we compare four methods of HHMM parameter estimation. Two are based on the EM algorithm, with inference by the forward-backward activation algorithm (FBA) and by the flattening forward-backward method (FFB). The other two are based on a sampling approach: direct Gibbs sampling for the flat HMMs (DGS) and forward-backward activation sampling (FBAS). FBAS is a straightforward application of the forward-backward sampling scheme to the translated DBN presented in figure 2. In FBAS, we first calculate forward activation probabilities.
Then we sample the state activation variables from e^1_T to b^1_1 in backward order with respect to the forward activation probabilities. We evaluate the four methods based on three aspects: execution time, convergence speed, and scalability with the state space size. We apply each method to four different HHMMs of (D = 3, N = 3), (D = 3, N = 4), (D = 4, N = 3), and (D = 4, N = 4). We examine the log-likelihood of the training dataset achieved at each iteration to ascertain the learning convergence. As a training dataset, we use 100 documents from the Reuters corpus as word sequences. The dataset includes 36,262 words in all, with a 4,899 word vocabulary.\nFigure 3 presents the log-likelihood of the training data. The horizontal axis shows the logarithmically scaled execution time. Table 3 presents the average execution time for a single iteration. From these results, we can say primarily that FBA outperforms FFB in terms of execution time. The improvement is remarkable, especially for the HHMMs of large state space size, because FBA has lower time complexity in N and D than FFB has.\n\nFigure 3: Convergence of log-likelihood for the training data on the Reuters corpus. Log-likelihood (vertical) is shown against the log-scaled execution time (horizontal) to display the execution time necessary for the learning of each algorithm to converge. (top-left) HHMM of D = 3, N = 3. (top-right) D = 3, N = 4. (bottom-left) D = 4, N = 3.
(bottom-right) HHMM of D = 4, N = 4.\n\nTable 3: Average execution time for a single iteration (ms).\nMethod: D = 3, N = 3 (N^D = 27) / D = 3, N = 4 (N^D = 64) / D = 4, N = 3 (N^D = 81) / D = 4, N = 4 (N^D = 256)\nFBA: 186.65 / 476.92 / 391.73 / 1652.03\nFFB: 1729.90 / 19257.80 / 9242.35 / 220224.00\nFBAS: 82.45 / 183.39 / 142.20 / 581.58\nDGS: 24.19 / 45.43 / 37.50 / 265.98\n\nThe results show that the likelihood convergence using DGS is much slower than that of the other methods. The execution time of DGS is less than that of the other methods for a single iteration, but this cannot compensate for the low convergence speed. However, FBAS achieves a competitive likelihood in comparison to FBA. The results show that FBAS might be appropriate for some situations because FBAS finds a better solution than FBA does in some results.\n\n6 Conclusion\n\nIn this work, we proposed a new inference algorithm for HHMMs based on the activation probability. Results show that the performance of our proposed algorithm surpasses that of existing methods. The forward-backward activation algorithm described herein enables us to conduct unsupervised parameter learning with a practical computational cost for HHMMs of larger state space size.\n\nReferences\n\n[1] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.\n[2] S. Chib. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 1996.\n[3] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications.
Machine Learning, 1998.\n[4] T. Griffiths and M. Steyvers. Finding scientific topics. Proc. the National Academy of Sciences of the United States of America, 2004.\n[5] K. Heller, Y. Teh, and D. Gorur. Infinite hierarchical hidden Markov models. In Proc. International Conference on Artificial Intelligence and Statistics, 2009.\n[6] S. Luhr, H. Bui, S. Venkatesh, and G. West. Recognition of human activity through hierarchical stochastic learning. In Proc. Pervasive Computing and Communication, 2003.\n[7] K. Murphy and M. Paskin. Linear time inference in hierarchical HMMs. In Proc. Neural Information Processing Systems, 2001.\n[8] N. Nguyen, D. Phung, and S. Venkatesh. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov models. In Proc. Computer Vision and Pattern Recognition, 2005.\n[9] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.\n[10] S. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 2002.\n[11] M. Skounakis, M. Craven, and S. Ray.
Hierarchical hidden Markov models for information extraction. In Proc. International Joint Conference on Artificial Intelligence, 2003.\n[12] M. Weiland, A. Smaill, and P. Nelson. Learning musical pitch structures with hierarchical hidden Markov models. In Proc. Journees Informatiques Musicales, 2005.\n[13] L. Xie, S. Chang, A. Divakaran, and H. Sun. Learning hierarchical hidden Markov models for video structure discovery. Technical report, Columbia University, 2002.\n", "award": [], "sourceid": 715, "authors": [{"given_name": "Kei", "family_name": "Wakabayashi", "institution": null}, {"given_name": "Takao", "family_name": "Miura", "institution": null}]}