{"title": "Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": "", "full_text": "Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models\n\nRadford M. Neal, Matthew J. Beal, and Sam T. Roweis\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nToronto, Ontario, Canada M5S 3G3\n\n{radford,beal,roweis}@cs.utoronto.ca\n\nAbstract\n\nWe describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of \u201cpools\u201d of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers.\n\n1 Introduction\n\nConsider a dynamical model in which a sequence of hidden states, $x = (x_0, \\ldots, x_{n-1})$, is generated according to some stochastic transition model. We observe $y = (y_0, \\ldots, y_{n-1})$, with each $y_t$ being generated from the corresponding $x_t$ according to some stochastic observation process. Both the $x_t$ and the $y_t$ could be multidimensional. 
We wish to randomly sample hidden state sequences from the conditional distribution for the state sequence given the observations, which we can then use to make Monte Carlo inferences about this posterior distribution for the state sequence. We suppose in this paper that we know the dynamics of hidden states and the observation process, but if these aspects of the model are unknown, the method we describe will be useful as part of a maximum likelihood learning algorithm such as EM, or a Bayesian learning algorithm using Markov chain Monte Carlo.\n\nIf the state space is finite, of size $K$, so that this is a Hidden Markov Model (HMM), a hidden state sequence can be sampled by a forward-backward dynamic programming algorithm in time proportional to $nK^2$ (see [5] for a review of this and related algorithms). If the state space is $\\mathbb{R}^p$ and the dynamics and observation process are linear, with Gaussian noise, an analogous adaptation of the Kalman filter can be used. For more general models, or for finite state space models in which $K$ is large, one might use Markov chain sampling (see [3] for a review). For instance, one could perform Gibbs sampling or Metropolis updates for each $x_t$ in turn. Such simple Markov chain updates may be very slow to converge, however, if the states at nearby times are highly dependent. A popular recent approach is to use a particle smoother, such as the one described by Doucet, Godsill, and West [2], but this approach can fail when the set of particles doesn\u2019t adequately cover the space, or when particles are eliminated prematurely.\n\nIn this paper, we present a Markov chain sampling method for a model with an arbitrary state space, $\\mathcal{X}$, in which efficient sampling is facilitated by using updates that are based on temporarily embedding an HMM whose finite state space is a subset of $\\mathcal{X}$, and then applying the efficient HMM sampling procedure. 
We illustrate the method on a simple one-dimensional example. We also show how it can be used to in effect discretize the state space without producing any discretization error. Finally, we demonstrate the embedded HMM on a problem of tracking human motion in 3D based on the 2D projections of marker positions, and compare it with a particle smoother.\n\n2 The Embedded HMM Algorithm\n\nIn our description of the algorithm, model probabilities will be denoted by $P$, which will denote probabilities or probability densities without distinction, as appropriate for the state space, $\\mathcal{X}$, and observation space, $\\mathcal{Y}$. The model\u2019s initial state distribution is given by $P(x_0)$, transition probabilities are given by $P(x_t \\mid x_{t-1})$, and observation probabilities are given by $P(y_t \\mid x_t)$. Our goal is to sample from the conditional distribution $P(x_0, \\ldots, x_{n-1} \\mid y_0, \\ldots, y_{n-1})$, which we will abbreviate to $\\pi(x_0, \\ldots, x_{n-1})$, or $\\pi(x)$.\n\nTo accomplish this, we will simulate a Markov chain whose state space is $\\mathcal{X}^n$; i.e., a state of this chain is an entire sequence of hidden states. We will arrange for the equilibrium distribution of this Markov chain to be $\\pi(x_0, \\ldots, x_{n-1})$, so that simulating the chain for a suitably long time will produce a state sequence from the desired distribution. The state at iteration $i$ of this chain will be written as $x^{(i)} = (x_0^{(i)}, \\ldots, x_{n-1}^{(i)})$. The transition probabilities for this Markov chain will be denoted using $Q$. In particular, we will use some initial distribution for the state of the chain, $Q(x^{(0)})$, and will simulate the chain according to the transition probabilities $Q(x^{(i)} \\mid x^{(i-1)})$. For validity of the sampling method, we need these transitions to leave $\\pi$ invariant:\n\n$$\\pi(x') = \\sum_{x \\in \\mathcal{X}^n} \\pi(x)\\, Q(x' \\mid x), \\quad \\text{for all } x' \\text{ in } \\mathcal{X}^n \\qquad (1)$$\n\n(If $\\mathcal{X}$ is continuous, the sum is replaced by an integral.) 
This is implied by the detailed balance condition:\n\n$$\\pi(x)\\, Q(x' \\mid x) = \\pi(x')\\, Q(x \\mid x'), \\quad \\text{for all } x \\text{ and } x' \\text{ in } \\mathcal{X}^n \\qquad (2)$$\n\nThe transition $Q(x^{(i)} \\mid x^{(i-1)})$ is defined in terms of \u201cpools\u201d of states for each time. The current state at time $t$ is always part of the pool for time $t$. Other states in the pool are produced using a pool distribution, $\\rho_t$, which is designed so that points drawn from $\\rho_t$ are plausible alternatives to the current state at time $t$. The simplest way to generate these additional pool states is to draw points independently from $\\rho_t$. This may not be feasible, however, or may not be desirable, in which case we can instead simulate an \u201cinner\u201d Markov chain defined by transition probabilities written as $R_t(\\cdot \\mid \\cdot)$, which leave the pool distribution, $\\rho_t$, invariant. The transitions for the reversal of this chain with respect to $\\rho_t$ will be denoted by $\\tilde{R}_t(\\cdot \\mid \\cdot)$, and are defined so as to satisfy the following condition:\n\n$$\\rho_t(x_t)\\, R_t(x'_t \\mid x_t) = \\rho_t(x'_t)\\, \\tilde{R}_t(x_t \\mid x'_t), \\quad \\text{for all } x_t \\text{ and } x'_t \\text{ in } \\mathcal{X} \\qquad (3)$$\n\nIf the transitions $R_t$ satisfy detailed balance with respect to $\\rho_t$, $\\tilde{R}_t$ will be the same as $R_t$. To generate pool states by drawing from $\\rho_t$ independently, we can let $R_t(x' \\mid x) = \\tilde{R}_t(x' \\mid x) = \\rho_t(x')$. For the proof of correctness below, we must not choose $\\rho_t$ or $R_t$ based on the current state, $x^{(i)}$, but we may choose them based on the observations, $y$.\n\nTo perform a transition $Q$ to a new state sequence, we begin by producing, at each time $t$, a pool of $K$ states, $C_t$. One of the states in $C_t$ is the current state, $x_t^{(i-1)}$; the others are produced using $R_t$ and $\\tilde{R}_t$. 
The new state sequence, $x^{(i)}$, is then randomly selected from among all sequences whose states at each time $t$ are in $C_t$, using a form of the forward-backward procedure.\n\nIn detail, the pool of candidate states for time $t$ is found as follows:\n\n1) Pick an integer $J_t$ uniformly from $\\{0, \\ldots, K-1\\}$.\n2) Let $x_t^{[0]} = x_t^{(i-1)}$. (So the current state is always in the pool.)\n3) For $j$ from $1$ to $J_t$, randomly pick $x_t^{[j]}$ according to the transition probabilities $R_t(x_t^{[j]} \\mid x_t^{[j-1]})$.\n4) For $j$ from $-1$ down to $-K+J_t+1$, randomly pick $x_t^{[j]}$ according to the reversed transition probabilities, $\\tilde{R}_t(x_t^{[j]} \\mid x_t^{[j+1]})$.\n5) Let $C_t$ be the pool consisting of $x_t^{[j]}$, for $j \\in \\{-K+J_t+1, \\ldots, 0, \\ldots, J_t\\}$. If some of the $x_t^{[j]}$ are the same, they will be present in the pool more than once.\n\nOnce the pools of candidate states have been found, a new state sequence, $x^{(i)}$, is picked from among all sequences, $x$, for which every $x_t$ is in $C_t$. The probability of picking $x^{(i)} = x$ is proportional to $\\pi(x) / \\prod_{t=0}^{n-1} \\rho_t(x_t)$, which is proportional to\n\n$$\\frac{P(x_0) \\prod_{t=1}^{n-1} P(x_t \\mid x_{t-1}) \\prod_{t=0}^{n-1} P(y_t \\mid x_t)}{\\prod_{t=0}^{n-1} \\rho_t(x_t)} \\qquad (4)$$\n\nThe division by $\\prod_{t=0}^{n-1} \\rho_t(x_t)$ is needed to compensate for the pool states having been drawn from the $\\rho_t$ distributions. If duplicate states occur in some of the pools, they are treated as if they were distinct when picking a sequence in this way. In effect, we pick indexes of states in these pools, with probabilities as above, rather than states themselves.\n\nThe distribution of these sequences of indexes can be regarded as the posterior distribution for a hidden Markov model, with the transition probability from state $j$ at time $t-1$ to state $k$ at time $t$ being proportional to $P(x_t^{[k]} \\mid x_{t-1}^{[j]})$, and the probabilities of the hypothetical observed symbols being proportional to $P(y_t \\mid x_t^{[k]}) / \\rho_t(x_t^{[k]})$. Crucially, using the forward-backward technique, it is possible to randomly pick a new state sequence from this distribution in time growing linearly with $n$, even though the number of possible sequences grows as $K^n$. After the above procedure has been used to produce the pool states, $x_t^{[j]}$ for $t = 0$ to $n-1$ and $j = -K+J_t+1$ to $J_t$, this algorithm operates as follows (see [5]):\n\n1) For $t = 0$ to $n-1$ and for $j = -K+J_t+1$ to $J_t$, let $u_{t,j} = P(y_t \\mid x_t^{[j]}) / \\rho_t(x_t^{[j]})$.\n2) For $j = -K+J_0+1$ to $J_0$, let $w_{0,j} = u_{0,j}\\, P(X_0 = x_0^{[j]})$.\n3) For $t = 1$ to $n-1$ and for $j = -K+J_t+1$ to $J_t$, let $w_{t,j} = u_{t,j} \\sum_k w_{t-1,k}\\, P(x_t^{[j]} \\mid x_{t-1}^{[k]})$.\n4) Randomly pick $s_{n-1}$ from $\\{-K+J_{n-1}+1, \\ldots, J_{n-1}\\}$, picking the value $j$ with probability proportional to $w_{n-1,j}$.\n5) For $t = n-1$ down to $1$, randomly pick $s_{t-1}$ from $\\{-K+J_{t-1}+1, \\ldots, J_{t-1}\\}$, picking the value $j$ with probability proportional to $w_{t-1,j}\\, P(x_t^{[s_t]} \\mid x_{t-1}^{[j]})$.\n\nNote that when implementing this algorithm, one must take some measure to avoid floating-point underflow, such as representing the $w_{t,j}$ by their logarithms.\n\nFinally, the embedded HMM transition is completed by letting the new state sequence, $x^{(i)}$, be equal to $(x_0^{[s_0]}, x_1^{[s_1]}, \\ldots, x_{n-1}^{[s_{n-1}]})$.\n\n3 Proof of Correctness\n\nTo show that a Markov chain with these transitions will converge to $\\pi$, we need to show that it leaves $\\pi$ invariant, and that the chain is ergodic. Ergodicity need not always hold, and proving that it does hold may require considering the particulars of the model. 
However, it is easy to see that the chain will be ergodic if all possible state sequences have non-zero probability density under $\\pi$, the pool distributions, $\\rho_t$, have non-zero density everywhere, and the transitions $R_t$ are ergodic. This probably covers most problems that arise in practice.\n\nTo show that the transitions $Q(\\cdot \\mid \\cdot)$ leave $\\pi$ invariant, it suffices to show that they satisfy detailed balance with respect to $\\pi$. This will follow from the stronger condition that the probability of moving from $x$ to $x'$ (starting from a state picked from $\\pi$) with given values for the $J_t$ and given pools of candidate states, $C_t$, is the same as the corresponding probability of moving from $x'$ to $x$ with the same pools of candidate states and with values $J'_t$ defined by $J'_t = J_t - h_t$, where $h_t$ is the index (from $-K+J_t+1$ to $J_t$) of $x'_t$ in the candidate pool.\n\nThe probability of such a move from $x$ to $x'$ is the product of several factors. First, there is the probability of starting from $x$ under $\\pi$, which is $\\pi(x)$. Then, for each time $t$, there is the probability of picking $J_t$, which is $1/K$, and of then producing the states in the candidate pool using the transitions $R_t$ and $\\tilde{R}_t$, which is\n\n$$\\prod_{j=1}^{J_t} R_t(x_t^{[j]} \\mid x_t^{[j-1]}) \\times \\prod_{j=-K+J_t+1}^{-1} \\tilde{R}_t(x_t^{[j]} \\mid x_t^{[j+1]})$$\n\n$$= \\prod_{j=0}^{J_t-1} R_t(x_t^{[j+1]} \\mid x_t^{[j]}) \\times \\prod_{j=-K+J_t+1}^{-1} R_t(x_t^{[j+1]} \\mid x_t^{[j]})\\, \\frac{\\rho_t(x_t^{[j]})}{\\rho_t(x_t^{[j+1]})} \\qquad (5)$$\n\n$$= \\frac{\\rho_t(x_t^{[-K+J_t+1]})}{\\rho_t(x_t^{[0]})} \\prod_{j=-K+J_t+1}^{J_t-1} R_t(x_t^{[j+1]} \\mid x_t^{[j]}) \\qquad (6)$$\n\nFinally, there is the probability of picking $x'$ from among all the sequences with states from the pools, $C_t$, which is proportional to $\\pi(x') / \\prod_t \\rho_t(x'_t)$. The product of all these factors is\n\n$$\\pi(x) \\times \\frac{1}{K^n} \\times \\prod_{t=0}^{n-1} \\left[ \\frac{\\rho_t(x_t^{[-K+J_t+1]})}{\\rho_t(x_t^{[0]})} \\prod_{j=-K+J_t+1}^{J_t-1} R_t(x_t^{[j+1]} \\mid x_t^{[j]}) \\right] \\times \\frac{\\pi(x')}{\\prod_{t=0}^{n-1} \\rho_t(x'_t)}$$\n\n$$= \\frac{1}{K^n}\\, \\frac{\\pi(x)\\, \\pi(x')}{\\prod_{t=0}^{n-1} \\rho_t(x_t)\\, \\rho_t(x'_t)} \\prod_{t=0}^{n-1} \\left[ \\rho_t(x_t^{[-K+J_t+1]}) \\prod_{j=-K+J_t+1}^{J_t-1} R_t(x_t^{[j+1]} \\mid x_t^{[j]}) \\right] \\qquad (7)$$\n\nWe can now see that the corresponding expression for a move from $x'$ to $x$ is identical, apart from a relabelling of candidate state $x_t^{[j]}$ as $x_t^{[j-h_t]}$.\n\n4 A simple demonstration\n\nThe following simple example illustrates the operation of the embedded HMM. The state space, $\\mathcal{X}$, and the observation space, $\\mathcal{Y}$, are both $\\mathbb{R}$, and each observation is simply the state plus Gaussian noise of standard deviation $\\sigma$; i.e., $P(y_t \\mid x_t) = N(y_t \\mid x_t, \\sigma^2)$. The state transitions are defined by $P(x_t \\mid x_{t-1}) = N(x_t \\mid \\tanh(\\eta x_{t-1}), \\tau^2)$, for some constant expansion factor $\\eta$ and transition noise standard deviation $\\tau$.\n\nFigure 1 shows a hidden state sequence, $x_0, \\ldots, x_{n-1}$, and observation sequence, $y_0, \\ldots, y_{n-1}$, generated by this model using $\\sigma = 2.5$, $\\eta = 2.5$, and $\\tau = 0.4$, with $n = 1000$. The state sequence stays in the vicinity of $+1$ or $-1$ for long periods, with rare switches between these regions. 
Because of the large observation noise, there is considerable uncertainty regarding the state sequence given the observation sequence, with the posterior distribution assigning fairly high probability to sequences that contain short-term switches between the $+1$ and $-1$ regions that are not present in the actual state sequence, or that lack some of the short-term switches that are actually present.\n\nWe sampled from this distribution over state sequences using an embedded HMM in which the pool distributions, $\\rho_t$, were normal with mean zero and standard deviation one, and the pool transitions simply sampled independently from this distribution (ignoring the current pool state). Figure 2 shows that after only two updates using pools of ten states, embedded HMM sampling produces a state sequence with roughly the correct characteristics. Figure 3 demonstrates how a single embedded HMM update can make a large change to the state sequence. It shows a portion of the state sequence after 99 updates, the pools of states produced for the next update, and the state sequence found by the embedded HMM using these pools. 
A large change is made to the state sequence in the region from time 840 to 870, with states in this region switching from the vicinity of $-1$ to the vicinity of $+1$.\n\nThis example is explored in more detail in [4], where it is shown that the embedded HMM is superior to simple Metropolis methods that update one hidden state at a time.\n\n5 Discretization without discretization error\n\nA simple way to handle a model with a continuous state space is to discretize the space by laying down a regular grid, after transforming to make the space bounded if necessary. An HMM with grid points as states can then be built that approximates the original model. Inference using this HMM is only approximate, however, due to the discretization error involved in replacing the continuous space by a grid of points.\n\nThe embedded HMM can use a similar grid as a deterministic method of creating pools of states, aligning the grid so that the current state lies on a grid point. This is a special case of the general procedure for creating pools, in which $\\rho_t$ is uniform, $R_t$ moves to the next grid point and $\\tilde{R}_t$ moves to the previous grid point, with both wrapping around when the first or last grid point is reached. If the number of pool states is set equal to the number of points in a grid, every pool will consist of a complete grid aligned to include the current state.\n\nOn their own, such embedded HMM updates will never change the alignments of the grids. However, we can alternately apply such an embedded HMM update and some other MCMC update (e.g., Metropolis) which is capable of making small changes to the state. These small changes will change the alignment of the new grids, since each grid is aligned to include the current state. The combined chain will be ergodic, and sample (asymptotically) from the correct distribution. 
This method uses a grid, but nevertheless has no discretization error.\n\nWe have tried this method on the example described above, laying the grid over the transformed state $\\tanh(x_t)$, with suitably transformed transition densities. With $K = 10$, the grid method samples more efficiently than when using $N(0, 1)$ pool distributions, as above.\n\nFigure 1: A state sequence (black dots) and observation sequence (gray dots) of length 1000 produced by the model with $\\sigma = 2.5$, $\\eta = 2.5$, and $\\tau = 0.4$.\n\nFigure 2: The state sequence (black dots) produced after two embedded HMM updates, starting with the states set equal to the data points (gray dots), as in the figure above.\n\nFigure 3: Closeup of an embedded HMM update. The true state sequence is shown by black dots and the observation sequence by gray dots. The current state sequence is shown by the dark line. The pools of ten states at each time used for the update are shown as small dots, and the new state sequence picked by the embedded HMM by the light line.\n\nFigure 4: The four-second motion sequence used for the experiment, shown in three snapshots with streamers showing earlier motion. The left plot shows frames 1-59, the middle plot frames 59-91, and the right plot frames 91-121. There were 30 frames per second. The orthographic projection in these plots is the one seen by the model. (These plots were produced using Hertzmann and Brand\u2019s mosey program.)\n\n6 Tracking human motion\n\nWe have applied the embedded HMM to the more challenging problem of tracking 3D human motion from 2D observations of markers attached to certain body points. 
We constructed this example using real motion-capture data, consisting of the 3D positions at each time frame of a set of identified markers. We chose one subject, and selected six markers (on left and right feet, left and right hands, lower back, and neck). These markers were projected to a 2D viewing plane, with the viewing direction being known to the model. Figure 4 shows the four-second sequence used for the experiment.\u00b9\n\nOur goal was to recover the 3D motion of the six markers, by using the embedded HMM to generate samples from the posterior distribution over 3D positions at each time (the hidden states of the model), given the 2D observations. To do this, we need some model of human dynamics. As a crude approximation, we used Langevin dynamics with respect to a simple hand-designed energy function that penalizes unrealistic body positions. In Langevin dynamics, a gradient descent step in the energy is followed by the addition of Gaussian noise, with variance related to the step size. The equilibrium distribution for this dynamics is the Boltzmann distribution for the energy function. The energy function we used contains terms pertaining to the pairwise distances between the six markers and to the heights of the markers above the plane of the floor, as well as a term that penalizes bending the torso far backwards while the legs are vertical. We chose the step size for the Langevin dynamics to roughly match the characteristics of the actual data.\n\nThe embedded HMM was initialized by setting the state at all times to a single frame of the subject in a typical stance, taken from a different trial. As the pool distribution at time $t$, we used the posterior distribution obtained by taking the Boltzmann distribution for the energy as the prior and conditioning on the single observation at time $t$. 
The pool transitions used were Langevin updates with respect to this pool distribution.\n\nFor comparison, we also tried solving this problem with the particle smoother of [2], in which a particle filter is applied to the data in time order, after which a state sequence is selected at random in a backwards pass. We used a stratified resampling method to reduce variance. The initial particle set was created by drawing frames randomly from sequences other than the sequence being tested, and translating the markers in each frame so that their centre of mass was at the same point as the centre of mass in the test sequence.\n\n\u00b9 Data from the graphics lab of Jessica Hodgins, at http://mocap.cs.cmu.edu. We chose markers 167, 72, 62, 63, 31, 38, downsampled to 30 frames per second. The experiments reported here use frames 400-520 of trial 20 for subject 14. The elevation of the view direction was 45 degrees, and the azimuth was 45 degrees away from a front view of the person in the first frame.\n\nBoth programs were implemented in MATLAB. The particle smoother was run with 5000 particles, taking 7 hours of compute time. The resulting sampled trajectories roughly fit the 2D observations, but were rather unrealistic; for instance, the subject\u2019s feet often floated above the floor. We ran the embedded HMM using five pool states for 300 iterations, taking 1.7 hours of compute time. The resulting sampled trajectories were more realistic than those produced by the particle smoother, and were quantitatively better with respect to likelihood and dynamical transition probabilities. However, the distribution of trajectories found did not overlap the true trajectory. 
The embedded HMM updates appeared to be sampling from the correct posterior distribution, but moving rather slowly among those trajectories that are plausible given the observations.\n\n7 Conclusions\n\nWe have shown that the embedded HMM can work very well for a non-linear model with a low-dimensional state. For the higher-dimensional motion tracking example, the embedded HMM has some difficulties exploring the full posterior distribution, due, we think, to the difficulty of creating pool distributions with a dense enough sampling of states to allow linking of new states at adjacent times. However, the particle smoother was even more severely affected by the high dimensionality of this problem. The embedded HMM therefore appears to be a promising alternative to particle smoothers in such contexts.\n\nThe idea behind the embedded HMM should also be applicable to more general tree-structured graphical models. A pool of values would be created for each variable in the tree (which would include the current value for the variable). The fast sampling algorithm possible for such an \u201cembedded tree\u201d (a generalization of the sampling algorithm used for the embedded HMM) would then be used to sample a new set of values for all variables, choosing from all combinations of values from the pools.\n\nFinally, while much of the elaboration in this paper is designed to create a Markov chain whose equilibrium distribution is exactly the correct posterior, $\\pi(x)$, the embedded HMM idea can also be used as a simple search technique, to find a state sequence, $x$, which maximizes $\\pi(x)$. For this application, any method is acceptable for proposing pool states (though some proposals will be more useful than others), and the selection of a new state sequence from the resulting embedded HMM is done using a Viterbi-style dynamic programming algorithm that selects the trajectory through pool states that maximizes $\\pi(x)$. 
If the current state at each time is always included in the pool, this Viterbi procedure will always either find a new $x$ that increases $\\pi(x)$, or return the current $x$ again. This embedded HMM optimizer has been successfully used to infer segment boundaries in a segmental model for voicing detection and pitch tracking in speech signals [1], as well as in other applications such as robot localization from sensor logs.\n\nAcknowledgments. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada, and by an Ontario Premier\u2019s Research Excellence Award. Computing resources were provided by a CFI grant to Geoffrey Hinton.\n\nReferences\n\n[1] Achan, K., Roweis, S. T., and Frey, B. J. (2004) \u201cA Segmental HMM for Speech Waveforms\u201d, Technical Report UTML-TR-2004-001, University of Toronto, January 2004.\n\n[2] Doucet, A., Godsill, S. J., and West, M. (2000) \u201cMonte Carlo filtering and smoothing with application to time-varying spectral estimation\u201d, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2000, volume II, pages 701-704.\n\n[3] Neal, R. M. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 144 pages. Available from http://www.cs.utoronto.ca/~radford.\n\n[4] Neal, R. M. (2003) \u201cMarkov chain sampling for non-linear state space models using embedded hidden Markov models\u201d, Technical Report No. 0304, Dept. of Statistics, University of Toronto, 9 pages. Available from http://www.cs.utoronto.ca/~radford.\n\n[5] Scott, S. L. (2002) \u201cBayesian methods for hidden Markov models: Recursive computing in the 21st century\u201d, Journal of the American Statistical Association, vol. 97, pp. 337\u2013351.", "award": [], "sourceid": 2391, "authors": [{"given_name": "Radford", "family_name": "Neal", "institution": null}, {"given_name": "Matthew", "family_name": "Beal", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}