{"title": "Constrained Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 782, "page_last": 788, "abstract": null, "full_text": "Constrained Hidden Markov Models \n\nSam Roweis \n\nroweis@gatsby.ucl.ac.uk \n\nGatsby Unit, University College London \n\nAbstract \n\nBy thinking of each state in a hidden Markov model as corresponding to some \nspatial region of a fictitious topology space it is possible to naturally define neighbouring \nstates as those which are connected in that space. The transition matrix \ncan then be constrained to allow transitions only between neighbours; this means \nthat all valid state sequences correspond to connected paths in the topology space. \nI show how such constrained HMMs can learn to discover underlying structure \nin complex sequences of high-dimensional data, and apply them to the problem \nof recovering mouth movements from acoustics in continuous speech. \n\n1  Latent variable models for structured sequence data \nStructured time-series are generated by systems whose underlying state variables change in \na continuous way but whose state to output mappings are highly nonlinear, many to one and \nnot smooth. Probabilistic unsupervised learning for such sequences requires models with \ntwo essential features: latent (hidden) variables and topology in those variables. Hidden \nMarkov models (HMMs) can be thought of as dynamic generalizations of discrete state \nstatic data models such as Gaussian mixtures, or as discrete state versions of linear dynamical \nsystems (LDSs) (which are themselves dynamic generalizations of continuous latent \nvariable models such as factor analysis). While both HMMs and LDSs provide probabilistic \nlatent variable models for time-series, both have important limitations. 
Traditional HMMs \nhave a very powerful model of the relationship between the underlying state and the associated \nobservations because each state stores a private distribution over the output variables. \nThis means that any change in the hidden state can cause arbitrarily complex changes in the \noutput distribution. However, it is extremely difficult to capture reasonable dynamics on \nthe discrete latent variable because in principle any state is reachable from any other state at \nany time step and the next state depends only on the current state. LDSs, on the other hand, \nhave an extremely impoverished representation of the outputs as a function of the latent \nvariables since this transformation is restricted to be global and linear. But it is somewhat \neasier to capture state dynamics since the state is a multidimensional vector of continuous \nvariables on which a matrix \"flow\" is acting; this enforces some continuity of the latent \nvariables across time. Constrained hidden Markov models address the modeling of state \ndynamics by building some topology into the hidden state representation. The essential \nidea is to constrain the transition parameters of a conventional HMM so that the discrete-valued \nhidden state evolves in a structured way.1 In particular, below I consider parameter \nrestrictions which constrain the state to evolve as a discretized version of a continuous \nmultivariate variable, i.e. so that it inscribes only connected paths in some space. This \nlends a physical interpretation to the discrete state trajectories in an HMM. \n\n1 A standard trick in traditional speech applications of HMMs is to use \"left-to-right\" transition \nmatrices which are a special case of the type of constraints investigated in this paper. 
However, left-to-right \n(Bakis) HMMs force state trajectories that are inherently one-dimensional and uni-directional \nwhereas here I also consider higher dimensional topology and free omni-directional motion. \n\n2  An illustrative game \nConsider playing the following game: divide a sheet of paper into several contiguous, non-overlapping \nregions which between them cover it entirely. In each region inscribe a symbol, \nallowing symbols to be repeated in different regions. Place a pencil on the sheet and move it \naround, reading out (in order) the symbols in the regions through which it passes. Add some \nnoise to the observation process so that some fraction of the time incorrect symbols are \nreported in the list instead of the correct ones. The game is to reconstruct the configuration \nof regions on the sheet from only such ordered list(s) of noisy symbols. Of course, the \nabsolute scale, rotation and reflection of the sheet can never be recovered, but learning the \nessential topology may be possible.2 Figure 1 illustrates this setup. \n\n[Figure 1 panels: True Generative Map (left); example noisy symbol sequences (centre); learned map at iteration:030, logLikelihood:-1.9624 (right).] \n\nFigure 1: (left) True map which generates symbol sequences by random movement between connected \ncells. (centre) An example noisy output sequence with noisy symbols circled. (right) Learned \nmap after training on 3 sequences (with 15% noise probability) each 200 symbols long. Each cell \nactually contains an entire distribution over all observed symbols, though in this case only the upper \nright cell has significant probability mass on more than one symbol (see figure 3 for display details). \n\nWithout noise or repeated symbols, the game is easy (non-probabilistic methods can solve \nit) but in their presence it is not. 
One way of mitigating the noise problem is to do statistical \naveraging. For example, one could attempt to use the average separation in time of each \npair of symbols to define a dissimilarity between them. It then would be possible to use \nmethods like multi-dimensional scaling or a sort of Kohonen mapping through time3 to explicitly \nconstruct a configuration of points obeying those distance relations. However, such \nmethods still cannot deal with many-to-one state to output mappings (repeated numbers in \nthe sheet) because by their nature they assign a unique spatial location to each symbol. \nPlaying this game is analogous to doing unsupervised learning on structured sequences. \n(The game can also be played with continuous outputs, although often high-dimensional \ndata can be effectively clustered around a manageable number of prototypes; thus a vector \ntime-series can be converted into a sequence of symbols.) Constrained HMMs incorporate \nlatent variables with topology yet retain powerful nonlinear output mappings and can deal \nwith the difficulties of noise and many-to-one mappings mentioned above; so they can \n\"win\" our game (see figs. 1 & 3). The key insight is that the game generates sequences exactly \naccording to a hidden Markov process whose transition matrix allows only transitions \nbetween neighbouring cells and whose output distributions have most of their probability \non a single symbol with a small amount on all other symbols to account for noise. \n\n2 The observed symbol sequence must be \"informative enough\" to reveal the map structure (this \ncan be quantified using the idea of persistent excitation from control theory). \n\n3 Consider a network of units which compete to explain input data points. Each unit has a position \nin the output space as well as a position in a lower dimensional topology space. 
The winning unit has \nits position in output space updated towards the data point; but also the recent (in time) winners have \ntheir positions in topology space updated towards the topology space location of the current winner. \nSuch a rule works well, and yields topological maps in which nearby units code for data that typically \noccur close together in time. However it cannot learn many-to-one maps in which more than one unit \nat different topology locations have the same (or very similar) outputs. \n\n3  Model definition: state topologies from cell packings \nDefining a constrained HMM involves identifying each state of the underlying (hidden) \nMarkov chain with a spatial cell in a fictitious topology space. This requires selecting \na dimensionality d for the topology space and choosing a packing (such as hexagonal or \ncubic) which fills the space. The number of cells in the packing is equal to the number of \nstates M in the original Markov model. Cells are taken to be all of equal size and (since \nthe scale of the topology space is completely arbitrary) of unit volume. Thus, the packing \ncovers a volume M in topology space with a side length l of roughly $l = M^{1/d}$. The \ndimensionality and packing together define a vector-valued function x(m), m = 1 ... M \nwhich gives the location of cell m in the packing. (For example, a cubic packing of d \ndimensional space defines x(m+1) to be $[m, m/l, m/l^2, ..., m/l^{d-1}]$ mod l.) State m \nin the Markov model is assigned to cell m in the packing, thus giving it a location x(m) \nin the topology space. Finally, we must choose a neighbourhood rule in the topology space \nwhich defines the neighbours of cell m; for example, all \"connected\" cells, all face neighbours, \nor all those within a certain radius. 
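The cell locations and the neighbourhood rule described above can be sketched in a few lines. This is a minimal illustration and not the author's implementation: it assumes a cubic packing, integer (floor) division in the coordinate formula, zero-based cell indices, face neighbours only and non-periodic boundaries. The function names cell_location and face_neighbours are hypothetical.

```python
def cell_location(m, l, d):
    # Coordinates of zero-based cell m in a cubic packing of side l in d dimensions.
    # Assumption: coordinate k is floor(m / l**k) mod l.
    return tuple((m // l**k) % l for k in range(d))


def face_neighbours(m, l, d):
    # Zero-based indices of the cells sharing a face with cell m.
    # Assumption: no wrap-around, so cells on the boundary have fewer neighbours.
    x = cell_location(m, l, d)
    result = []
    for k in range(d):
        for step in (-1, 1):
            if 0 <= x[k] + step < l:
                result.append(m + step * l**k)
    return sorted(result)


l, d = 4, 3
interior = 1 + l + l * l  # the cell at location (1, 1, 1)
print(cell_location(interior, l, d))
print(len(face_neighbours(interior, l, d)))  # interior cell
print(len(face_neighbours(0, l, d)))         # corner cell
```

With d=3 and l=4, an interior cell has six face neighbours while a corner cell has only three, because this sketch does not use periodic boundary conditions.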
(For cubic packings, there are $3^d - 1$ connected \nneighbours and 2d face neighbours in a d dimensional topology space.) The neighbourhood \nrule also defines the boundary conditions of the space; for example, periodic boundary conditions \nwould make cells on opposite extreme faces of the space neighbours with each other. \nThe transition matrix of the HMM is now preprogrammed to only allow transitions between \nneighbours. All other transition probabilities are set to zero, making the transition matrix \nvery sparse. (I have set all permitted transitions to be equally likely.) Now, all valid \nstate sequences in the underlying Markov model represent connected (\"city block\") paths \nthrough the topology space. Figure 2 illustrates this for a three-dimensional model. \n\n[Figure 2 panels: topology space with an example state trajectory (left); transition matrix structure (right).] \n\nFigure 2: (left) Physical depiction of the topology space for a constrained HMM with \nd=3, l=4 and M=64 showing an example state trajectory. (right) Corresponding transition \nmatrix structure for the 64-state HMM computed using face-centred cubic packing. The \ngaps in the inner bands are due to edge effects. \n\n4  State inference and learning \nThe constrained HMM has exactly the same inference procedures as a regular HMM: the \nforward-backward algorithm for computing state occupation probabilities and the Viterbi \ndecoder for finding the single best state sequence. Once these discrete state inferences have \nbeen performed, they can be transformed using the state position function x(m) to yield \nprobability distributions over the topology space (in the case of forward-backward) or paths \nthrough the topology space (in the case of Viterbi decoding). 
This transformation makes \nthe outputs of state decodings in constrained HMMs comparable to the outputs of inference \nprocedures for continuous state dynamical systems such as Kalman smoothing. \nThe learning procedure for constrained HMMs is also almost identical to that for HMMs. \nIn particular, the EM algorithm (Baum-Welch) is used to update model parameters. The \ncrucial difference is that the transition probabilities which are precomputed by the topology \nand packing are never updated during learning. In fact, this makes learning much easier \nin some cases. Not only do the transition probabilities not have to be learned, but their \nstructure constrains the hidden state sequences in such a way as to make the learning of the \noutput parameters much more efficient when the underlying data really does come from a \nspatially structured generative model. Figure 3 shows an example of parameter learning \nfor the game discussed above. Notice that in this case, each part of state space had only a \nsingle output (except for noise) so the final learned output distributions became essentially \nminimum entropy. But constrained HMMs can in principle model stochastic or multimodal \noutput processes since each state stores an entire private distribution over outputs. \n\n[Figure 3 panels: snapshots of the learned map; one legible panel label reads logLikelihood:-2.1451.] \n\nFigure 3: Snapshots of model parameters during constrained HMM learning for the game described \nin section 2. At every iteration each cell in the map has a complete distribution over all of the observed \nsymbols. Only the top three symbols of each cell's histogram are shown, with font size proportional \nto the square root of probability (to make ink roughly proportional). The map was trained on 3 noisy \nsequences each 200 symbols long generated from the map on the left of figure 1 using 15% noise \nprobability. 
The final map after convergence (30 iterations) is shown on the right of figure 1. \n\n5  Recovery of mouth movements from speech audio \nI have applied the constrained HMM approach described above to the problem of recovering \nmouth movements from the acoustic waveform in human speech. Data containing simultaneous \naudio and articulator movement information was obtained from the University \nof Wisconsin X-ray microbeam database [9]. Eight separate points (four on the tongue, one \non each lip and two on the jaw) located in the midsagittal plane of the speaker's head were \ntracked while subjects read various words, sentences, paragraphs and lists of numbers. The \nx and y coordinates (to within about \u00b1 1mm) of each point were sampled at 146Hz by an X-ray \nsystem which located gold beads attached to the feature points on the mouth, producing \na 16-dimensional vector every 6.9ms. The audio was sampled at 22kHz with roughly 14 \nbits of amplitude resolution but in the presence of machine noise. \nThese data are well suited to the constrained HMM architecture. They come from a system \nwhose state variables are known, because of physical constraints, to move in connected \npaths in a low degree-of-freedom space. In other words the (normally hidden) articulators \n(movable structures of the mouth), whose positions represent the underlying state of the \nspeech production system,4 move slowly and smoothly. The observed speech signal (the \nsystem's output) can be characterized by a sequence of short-time spectral feature vectors, \noften known as a spectrogram. In the experiments reported here, I have characterized the \naudio signal using 12 line spectral frequencies (LSFs) measured every 6.9ms (to coincide \nwith the articulatory sampling rate) over a 25ms window. These LSF vectors characterize \nonly the spectral shape of the speech waveform over a short time but not its energy. 
\nAverage energy (also over a 25ms window every 6.9ms) was measured as a separate one \ndimensional signal. Unlike the movements of the articulators, the audio spectrum/energy \ncan exhibit quite abrupt changes, indicating that the mapping between articulator positions \nand spectral shape is not smooth. Furthermore, the mapping is many to one: different \narticulator configurations can produce very similar spectra (see below). \nThe unsupervised learning task, then, is to explain the complicated sequences of observed \nspectral features (LSFs) and energies as the outputs of a system with a low-dimensional \nstate vector that changes slowly and smoothly. In other words, can we learn the parameters5 \nof a constrained HMM such that connected paths through the topology space (state space) \ngenerate the acoustic training data with high likelihood? Once this unsupervised learning \ntask has been performed, we can (as I show below) relate the learned trajectories in the \ntopology space to the true (measured) articulator movements. \n\n4 Articulator positions do not provide complete state information. For example, the excitation \nsignal (voiced or unvoiced) is not captured by the bead locations. They do, however, provide much \nimportant information; other state information is easily accessible directly from acoustics. \n\n5 Model structure (dimensionality and number of states) is currently set using cross validation. \n\nWhile many models of the speech production process predict the many-to-one and non-smooth \nproperties of the articulatory to acoustic mapping, it is useful to confirm these \nfeatures by looking at real data. Figure 4 shows the experimentally observed distribution \nof articulator configurations used to produce similar sounds. It was computed as follows. \nAll the acoustic and articulatory data for a single speaker are collected together. 
Starting \nwith some sample called the key sample, I find the 1000 samples \"nearest\" to this key \nby two measures: articulatory distance, defined using the Mahalanobis norm between two \nposition vectors under the global covariance of all positions for the appropriate speaker, \nand spectral shape distance, again defined using the Mahalanobis norm but now between \ntwo line spectral frequency vectors using the global LSF covariance of the speaker's audio \ndata. In other words, I find the 1000 samples that \"look most like\" the key sample in mouth \nshape and that \"sound most like\" the key sample in spectral shape. I then plot the tongue \nbead positions of the key sample (as a thick cross), and the 1000 nearest samples by mouth \nshape (as a thick ellipse) and spectral shape (as dots). The points of primary interest are the \ndots; they show the distribution of tongue positions used to generate very similar sounds. \n(The thick ellipses are shown only as a control to ensure that many nearby points to the key \nsample do exist in the dataset.) Spread or multimodality in the dots indicates that many \ndifferent articulatory configurations are used to generate the same sound. \n\n[Figure 4 panels: scatter plots of tongue bead positions with axes in mm; legible axis labels include tongue tip, tongue body1 and tongue body2.] \n\nFigure 4: Inverse mapping from acoustics to articulation is ill-posed in real speech production data. \nEach group of four articulator-space plots shows the 1000 samples which are \"nearest\" to one key \nsample (thick cross). The dots are the 1000 nearest samples using an acoustic measure based on line \nspectral frequencies. Spread or multimodality in the dots indicates that many different articulatory \nconfigurations are used to generate very similar sounds. Only the positions of the four tongue beads \nhave been plotted. Two examples (with different key samples) are shown, one in the left group of four \npanels and another in the right group. The thick ellipses (shown as a control) are the two-standard \ndeviation contour of the 1000 nearest samples using an articulatory position distance metric. \n\nWhy not do direct supervised learning from short-time spectral features (LSFs) to the articulator \npositions? The ill-posed nature of the inverse problem as shown in figure 4 makes this \nimpossible. To illustrate this difficulty, I have attempted to recover the articulator positions \nfrom the acoustic feature vectors using Kalman smoothing on an LDS. In this case, since we \nhave access to both the hidden states (articulator positions) and the system outputs (LSFs) \nwe can compute the optimal parameters of the model directly. (In particular, the state \ntransition matrix is obtained by regression from articulator positions and velocities at time \nt onto positions at time t + 1; the output matrix by regression from articulator positions \nand velocities onto LSF vectors; and the noise covariances from the residuals of these \nregressions.) Figure 5b shows the results of such smoothing; the recovery is quite poor. 
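The regression-based parameter estimates described above can be sketched as follows. This is only an illustration under stated assumptions, not the code behind Figure 5b: the state is treated as a generic vector (the paper's state stacks articulator positions and velocities and regresses onto next-step positions), the data are synthetic, and the name fit_lds is hypothetical.

```python
import numpy as np


def fit_lds(states, outputs):
    # Least-squares estimates for x[t+1] = A x[t] + w and y[t] = C x[t] + v
    # when both the states (T x k) and the outputs (T x p) are observed.
    A_t, _, _, _ = np.linalg.lstsq(states[:-1], states[1:], rcond=None)
    C_t, _, _, _ = np.linalg.lstsq(states, outputs, rcond=None)
    A, C = A_t.T, C_t.T
    # Noise covariances are taken from the residuals of the two regressions.
    Q = np.cov((states[1:] - states[:-1] @ A.T).T)
    R = np.cov((outputs - states @ C.T).T)
    return A, C, Q, R


# Synthetic check with a known, stable system (all values are made up).
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.1, 0.9]])
C_true = np.array([[1.0, 0.0], [0.5, -0.5], [0.0, 2.0]])
x = np.zeros((5000, 2))
for t in range(4999):
    x[t + 1] = A_true @ x[t] + 0.1 * rng.standard_normal(2)
y = x @ C_true.T + 0.05 * rng.standard_normal((5000, 3))

A, C, Q, R = fit_lds(x, y)
print(np.round(A, 2))
print(np.round(C, 2))
```

On linear synthetic data like this the estimates land close to the true matrices. That is consistent with the point made in the text: the poor recovery in Figure 5b reflects the global linear output model, not a failure of the parameter fit.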
\nConstrained HMMs can be applied to this recovery problem, as previously reported [6]. \n(My earlier results used a small subset of the same database that was not continuous speech \nand did not provide the hard experimental verification (fig. 4) of the many-to-one problem.) \n\nFigure 5: (A) Recovered articulator movements using state inference on a constrained HMM. A \nfour-dimensional model with 4096 states was trained on data (all beads) from a single speaker but \nnot including the test utterance shown. Dots show the actual measured articulator movements for a \nsingle bead coordinate versus time; the thin lines are estimated movements from the corresponding \nacoustics. (B) Unsuccessful recovery of articulator movements using Kalman smoothing on a global \nLDS model. All the (speaker-dependent) parameters of the underlying linear dynamical system are \nknown; they have been set to their optimal values using the true movement information from the \ntraining data. Furthermore, for this example, the test utterance shown was included in the training \ndata used to estimate model parameters. (C) All 16 bead coordinates; all vertical axes are the same \nscale. Bead names are shown on the left. Horizontal movements are plotted in the left-hand column \nand vertical movements in the right-hand column. The separation between the two horizontal lines \nnear the centre of the right panel indicates the machine measurement error. 
\n\n[Figure 5 panels: (A) Recovery of tongue tip vertical motion from acoustics; (B) Kalman smoothing on optimal linear dynamical system; horizontal axes: time [sec].] \n\nThe basic idea is to train (unsupervised) on sequences of acoustic-spectral features and \nthen map the topology space state trajectories onto the measured articulatory movements. \nFigure 5 shows movement recovery using state inference in a four-dimensional model with \n4096 states (d=4, l=8, M=4096) trained on data (all beads) from a single speaker. (Naive \nunsupervised learning runs into severe local minima problems. To avoid these, in the simulations \nshown above, models were trained by slowly annealing two learning parameters6: \na term $\\epsilon^{\\beta}$ was used in place of the zeros in the sparse transition matrix, and $\\gamma_t^{\\beta}$ was used \nin place of $\\gamma_t = p(m_t | observations)$ during inference of state occupation probabilities. \nInverse temperature $\\beta$ was raised from 0 to 1.) To infer a continuous state trajectory from \nan utterance after learning, I first do Viterbi decoding on the acoustics to generate a discrete \nstate sequence $m_t$ and then interpolate smoothly between the positions $x(m_t)$ of each state. \n\n6 An easier way (which I have used previously) to find good minima is to initialize the models \nusing the articulatory data themselves. This does not provide as impressive \"structure discovery\" as \nannealing but still yields a system capable of inverting acoustics into articulatory movements on previously \nunseen test data. First, a constrained HMM is trained on just the articulatory movements; this \nworks easily because of the natural geometric (physical) constraints. 
Next, I take the distribution of \nacoustic features (LSFs) over all times (in the training data) when Viterbi decoding places the model \nin a particular state and use those LSF distributions to initialize an equivalent acoustic constrained \nHMM. This new model is then retrained until convergence using Baum-Welch. \n\nAfter unsupervised learning, a single linear fit is performed between these continuous state \ntrajectories and actual articulator movements on the training data. (The model cannot discover \nthe units system or axes used to represent the articulatory data.) To recover articulator \nmovements from a previously unseen test utterance, I infer a continuous state trajectory as \nabove and then apply the single linear mapping (learned only once from the training data). \n\n6  Conclusions, extensions and other work \nBy enforcing a simple constraint on the transition parameters of a standard HMM, a link \ncan be forged between discrete state dynamics and the motion of a real-valued state vector \nin a continuous space. For complex time-series generated by systems whose underlying \nlatent variables do in fact change slowly and smoothly, such constrained HMMs provide a \npowerful unsupervised learning paradigm. They can model state to output mappings that \nare highly nonlinear, many to one and not smooth. Furthermore, they rely only on well \nunderstood learning and inference procedures that come with convergence guarantees. \nResults on synthetic and real data show that these models can successfully capture the low-dimensional \nstructure present in complex vector time-series. In particular, I have shown \nthat a speaker dependent constrained HMM can accurately recover articulator movements \nfrom continuous speech to within the measurement error of the data. This acoustic to \narticulatory inversion problem has a long history in speech processing (see e.g. 
[7] and \nreferences therein). Many previous approaches have attempted to exploit the smoothness \nof articulatory movements for inversion or modeling: Hogden et al. (e.g. [4]) provided early \ninspiration for my ideas, but do not address the many-to-one problem; Simon Blackburn [1] \nhas investigated a forward mapping from articulation to acoustics but does not explicitly \nattempt inversion; early work at Waterloo [5] suggested similar constraints for improving \nspeech recognition systems but did not look at real articulatory data; more recent work at \nRutgers [2] developed a very similar system much further with good success. Perpinan [3] \nconsiders a related problem in sequence learning using EPG speech data as an example. \nWhile in this note I have described only \"diffusion\" type dynamics (transitions to all neighbours \nare equally likely) it is also possible to consider directed flows which give certain \nneighbours of a state lower (or zero) probability. The left-to-right HMMs mentioned earlier \nare an example of this for one-dimensional topologies. For higher dimensions, flows can \nbe derived from discretization of matrix (linear) dynamics or from other physical/structural \nconstraints. It is also possible to have many connected local flow regimes (either diffusive \nor directed) rather than one global regime as discussed above; this gives rise to mixtures of \nconstrained HMMs which have block-structured rather than banded transition matrices. \nSmyth [8] has considered such models in the case of one-dimensional topologies and \ndirected flows; I have applied these to learning character sequences from English text. \nAnother application I have investigated is map learning from multiple sensor readings. 
\nAn explorer (robot) navigates in an unknown environment and records at each time many \nlocal measurements such as altitude, pressure, temperature, humidity, etc. We wish to \nreconstruct from only these sequences of readings the topographic maps (in each sensor \nvariable) of the area as well as the trajectory of the explorer. A final application is tracking \n(inferring movements) of articulated bodies using video measurements of feature positions. \n\nReferences \n[1] S. Blackburn & S. Young. ICSLP 1996, Philadelphia, v.2 pp.969-972 \n[2] S. Chennoukh et al. Eurospeech 1997, Rhodes, Greece, v.1 pp.429-432 \n[3] M. Carreira-Perpinan. NIPS'12, 2000. (This volume.) \n[4] D. Nix & J. Hogden. NIPS'11, 1999, pp.744-750 \n[5] G. Ramsay & L. Deng. J. Acoustical Society of America, 95(5), 1994, p.2873 \n[6] S. Roweis & A. Alwan. Eurospeech 1997, Rhodes, Greece, v.3 pp.1227-1230 \n[7] J. Schroeter & M. Sondhi. IEEE Trans. Speech & Audio Processing, 2(1 p2), 1994, pp.133-150 \n[8] P. Smyth. NIPS'9, 1997, pp.648-654 \n[9] J. Westbury. X-ray microbeam speech production database user's handbook version 1.0. \nUniversity of Wisconsin, Madison, June 1994. \n\n\f", "award": [], "sourceid": 1738, "authors": [{"given_name": "Sam", "family_name": "Roweis", "institution": null}]}