{"title": "Boltzmann Chains and Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 442, "abstract": null, "full_text": "Boltzmann Chains and Hidden Markov Models\n\nLawrence K. Saul and Michael I. Jordan\n\nlksaul@psyche.mit.edu, jordan@psyche.mit.edu\nCenter for Biological and Computational Learning\nMassachusetts Institute of Technology\n79 Amherst Street, E10-243\nCambridge, MA 02139\n\nAbstract\n\nWe propose a statistical mechanical framework for the modeling of discrete time series. Maximum likelihood estimation is done via Boltzmann learning in one-dimensional networks with tied weights. We call these networks Boltzmann chains and show that they contain hidden Markov models (HMMs) as a special case. Our framework also motivates new architectures that address particular shortcomings of HMMs. We look at two such architectures: parallel chains that model feature sets with disparate time scales, and looped networks that model long-term dependencies between hidden states. For these networks, we show how to implement the Boltzmann learning rule exactly, in polynomial time, without resort to simulated or mean-field annealing. The necessary computations are done by exact decimation procedures from statistical mechanics.\n\n1 INTRODUCTION AND SUMMARY\n\nStatistical models of discrete time series have a wide range of applications, most notably to problems in speech recognition (Juang & Rabiner, 1991) and molecular biology (Baldi, Chauvin, Hunkapiller, & McClure, 1992). A common problem in these fields is to find a probabilistic model, and a set of model parameters, that account for sequences of observed data. 
Hidden Markov models (HMMs) have been particularly successful at modeling discrete time series. One reason for this is the powerful learning rule (Baum, 1972), a special case of the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster, Laird, & Rubin, 1977).\n\nIn this work, we develop a statistical mechanical framework for the modeling of discrete time series. The framework enables us to relate HMMs to a large family of exactly solvable models in statistical mechanics. The connection to statistical mechanics was first noticed by Sourlas (1989), who studied spin glass models of error-correcting codes. We view the estimation procedure for HMMs as a special (and particularly tractable) case of the Boltzmann learning rule (Ackley, Hinton, & Sejnowski, 1985; Byrne, 1992).\n\nThe rest of this paper is organized as follows. In Section 2, we review the modeling problem for discrete time series and establish the connection between HMMs and Boltzmann machines. In Section 3, we show how to quickly determine whether or not a particular Boltzmann machine is tractable, and if so, how to efficiently compute the correlations in the Boltzmann learning rule. Finally, in Section 4, we look at two architectures that address particular weaknesses of HMMs: the modeling of disparate time scales and long-term dependencies.\n\n2 MODELING DISCRETE TIME SERIES\n\nA discrete time series is a sequence of symbols $\\{j_l\\}_{l=1}^{L}$ in which each symbol belongs to a finite set, i.e. $j_l \\in \\{1, 2, \\ldots, m\\}$. Given one long sequence, or perhaps many shorter ones, the modeling task is to characterize the probability distribution from which the time series are generated.
\n\n2.1 HIDDEN MARKOV MODELS\n\nA first-order Hidden Markov Model (HMM) is characterized by a set of n hidden states, an alphabet of m symbols, a transition matrix $a_{ii'}$, an emission matrix $b_{ij}$, and a prior distribution $\\pi_i$ over the initial hidden state. The sequence of states $\\{i_l\\}_{l=1}^{L}$ and symbols $\\{j_l\\}_{l=1}^{L}$ is modeled to occur with probability\n\n$P(\\{i_l, j_l\\}) = \\pi_{i_1} \\prod_{l=1}^{L-1} a_{i_l i_{l+1}} \\prod_{l=1}^{L} b_{i_l j_l}.$ (1)\n\nThe modeling problem is to find the parameter values $(a_{ii'}, b_{ij}, \\pi_i)$ that maximize the likelihood of observed sequences of training data. We will elaborate on the learning rule in Section 2.3, but first let us make the connection to a well-known family of stochastic neural networks, namely Boltzmann machines.\n\n2.2 BOLTZMANN MACHINES\n\nConsider a Boltzmann machine with m-state visible units, n-state hidden units, tied weights, and the linear architecture shown in Figure 1. This example represents the simplest possible Boltzmann \"chain,\" one that is essentially equivalent to a first-order HMM unfolded in time (MacKay, 1994). The transition weights $A_{ii'}$ connect adjacent hidden units, while the emission weights $B_{ij}$ connect each hidden unit to its visible counterpart. In addition, boundary weights $\\Pi_i$ model an extra bias on the first hidden unit.\n\nFigure 1: Boltzmann chain with n-state hidden units, m-state visible units, transition weights $A_{ii'}$, emission weights $B_{ij}$, and boundary weights $\\Pi_i$.\n\nEach configuration of units represents a state of energy\n\n$\\mathcal{H}[\\{i_l, j_l\\}] = -\\Pi_{i_1} - \\sum_{l=1}^{L-1} A_{i_l i_{l+1}} - \\sum_{l=1}^{L} B_{i_l j_l},$ (2)\n\nwhere $\\{i_l\\}_{l=1}^{L}$ ($\\{j_l\\}_{l=1}^{L}$) is the sequence of states over the hidden (visible) units. 
The probability of finding the network in a particular configuration is given by\n\n$P(\\{i_l, j_l\\}) = \\frac{1}{Z} e^{-\\beta \\mathcal{H}},$ (3)\n\nwhere $\\beta = 1/T$ is the inverse temperature, and the partition function\n\n$Z = \\sum_{\\{i_l, j_l\\}} e^{-\\beta \\mathcal{H}}$ (4)\n\nis the sum over states that normalizes the Boltzmann distribution, eq. (3). Comparing this to the HMM distribution, eq. (1), it is clear that any first-order HMM can be represented by the Boltzmann chain of Figure 1, provided we take$^1$\n\n$A_{ii'} = T \\ln a_{ii'}, \\quad B_{ij} = T \\ln b_{ij}, \\quad \\Pi_i = T \\ln \\pi_i.$ (5)\n\nLater, in Section 4, we will consider more complicated chains whose architectures address particular shortcomings of HMMs. For now, however, let us continue to develop the example of Figure 1, making explicit the connection to HMMs.\n\n$^1$ Note, however, that the reverse statement, that for any set of parameters this Boltzmann chain can be represented as an HMM, is not true. The weights in the Boltzmann chain represent arbitrary energies between $\\pm\\infty$, whereas the HMM parameters represent probabilities that are constrained to obey sum rules, such as $\\sum_{i'} a_{ii'} = 1$. The Boltzmann chain of Figure 1 therefore has slightly more degrees of freedom than a first-order HMM. An interpretation of these extra degrees of freedom is given by MacKay (1994).\n\n2.3 LEARNING RULES\n\nIn the framework of Boltzmann learning (Williams & Hinton, 1990), the data for our problem consist of sequences of states over the visible units; the goal is to find the weights $(A_{ii'}, B_{ij}, \\Pi_i)$ that maximize the likelihood of the observed data. The likelihood of a sequence $\\{j_l\\}$ is given by the ratio\n\n$P(\\{j_l\\}) = \\frac{P(\\{i_l, j_l\\})}{P(\\{i_l\\} | \\{j_l\\})} = \\frac{e^{-\\beta \\mathcal{H}}/Z}{e^{-\\beta \\mathcal{H}}/Z_c} = \\frac{Z_c}{Z},$ (6)\n\nwhere $Z_c$ is the clamped partition function\n\n$Z_c = \\sum_{\\{i_l\\}} e^{-\\beta \\mathcal{H}}.$ (7)\n\nNote that the sum in $Z_c$ is only over the hidden states in the network, while the visible states are clamped to the observed values $\\{j_l\\}$.\n\nThe Boltzmann learning rule adjusts the weights of the network by gradient ascent on the log-likelihood. For the example of Figure 1, this leads to weight updates\n\n$\\Delta A_{ii'} = \\eta \\beta \\sum_{l=1}^{L-1} [ \\langle \\delta_{i i_l} \\delta_{i' i_{l+1}} \\rangle_c - \\langle \\delta_{i i_l} \\delta_{i' i_{l+1}} \\rangle ],$ (8)\n\n$\\Delta B_{ij} = \\eta \\beta \\sum_{l=1}^{L} [ \\langle \\delta_{i i_l} \\delta_{j j_l} \\rangle_c - \\langle \\delta_{i i_l} \\delta_{j j_l} \\rangle ],$ (9)\n\n$\\Delta \\Pi_i = \\eta \\beta [ \\langle \\delta_{i i_1} \\rangle_c - \\langle \\delta_{i i_1} \\rangle ],$ (10)\n\nwhere $\\delta_{ij}$ stands for the Kronecker delta function, $\\eta$ is a learning rate, and $\\langle \\cdot \\rangle$ and $\\langle \\cdot \\rangle_c$ denote expectations over the free and clamped Boltzmann distributions.\n\nThe Boltzmann learning rule may also be derived as an Expectation-Maximization (EM) algorithm. The EM procedure is an alternating two-step method for maximum likelihood estimation in probability models with hidden and observed variables. For Boltzmann machines in general, neither the E-step nor the M-step can be done exactly; one must estimate the necessary statistics by Monte Carlo simulation (Ackley et al., 1985) or mean-field theory (Peterson & Anderson, 1987). In certain special cases (e.g. trees and chains), however, the necessary statistics can be computed to perform an exact E-step (as shown below). While the M-step in these Boltzmann machines cannot be done exactly, the weight updates can be approximated by gradient ascent. This leads to learning rules in the form of eqs. (8-10).\n\nHMMs may be viewed as a special case of Boltzmann chains for which both the E-step and the M-step are analytically tractable. 
In this case, the maximization in the M-step is performed subject to the constraints $\\sum_i e^{\\beta \\Pi_i} = 1$, $\\sum_{i'} e^{\\beta A_{ii'}} = 1$, and $\\sum_j e^{\\beta B_{ij}} = 1$. These constraints imply $Z = 1$ and lead to closed-form equations for the weight updates in HMMs.\n\n3 EXACT METHODS FOR BOLTZMANN LEARNING\n\nThe key technique to compute partition functions and correlations in Boltzmann chains is known as decimation. The idea behind decimation$^2$ is the following. Consider three units connected in series, as shown in Figure 2a. Though not directly connected, the end units have an effective interaction that is mediated by the middle one. In fact, the two weights in series exert the same influence as a single effective weight, given by\n\n$e^{\\beta A_{ii''}} = \\sum_{i'} e^{\\beta ( A^{(1)}_{ii'} + A^{(2)}_{i'i''} )}.$ (11)\n\n$^2$ A related method, the transfer matrix, is described by Stolorz (1994).\n\nFigure 2: Decimation, pruning, and joining in Boltzmann machines.\n\nReplacing the weights in this way amounts to integrating out, or decimating, the degree of freedom represented by the middle unit. An analogous rule may be derived for the situation shown in Figure 2b. Summing over the degrees of freedom of the dangling unit generates an effective bias on its parent, given by\n\n$e^{\\beta B_i} = \\sum_j e^{\\beta B_{ij}}.$ (12)\n\nWe call this the pruning rule. Another type of equivalence is shown in Figure 2c. The two weights in parallel have the same effect as the sum total weight\n\n$A_{ii'} = A^{(1)}_{ii'} + A^{(2)}_{ii'}.$ (13)\n\nWe call this the joining rule. It holds trivially for biases as well as weights.\n\nThe rules for decimating, pruning, and joining have simple analogs in other types of networks (e.g. 
the law for combining resistors in electric circuits), and the strategy for exploiting them is a familiar one. Starting with a complicated network, we iterate the rules until we have a simple network whose properties are easily computed. A network is tractable for Boltzmann learning if it can be reduced to any pair of connected units. In this case, we may use the rules to compute all the correlations required for Boltzmann learning. Clearly, the rules do not make all networks tractable; certain networks (e.g. trees and chains), however, lend themselves naturally to these types of operations.\n\n4 DESIGNER NETS\n\nThe rules in Section 3 can be used to quickly assess whether or not a network is tractable for Boltzmann learning. Conversely, they can be used to design networks that are computationally tractable. This section looks at two networks designed to address particular shortcomings of HMMs.\n\n4.1 PARALLEL CHAINS AND DISPARATE TIME SCALES\n\nFigure 3: Coupled parallel chains for features with different time scales. (Panel labels: fast features, coupled hidden units, slow features.)\n\nAn important problem in speech recognition (Juang et al., 1991) is how to \"combine feature sets with fundamentally different time scales.\" Spectral parameters, such as the cepstrum and delta-cepstrum, vary on a time scale of 10 msec; on the other hand, prosodic parameters, such as the signal energy and pitch, vary on a time scale of 100 msec. A model that takes into account this disparity should avoid two things. The first is redundancy: in particular, the rather lame solution of oversampling the nonspectral features. The second is overfitting. How might this arise? 
Suppose we have trained two separate HMMs on sequences of spectral and prosodic features, knowing that the different features \"may not warrant a single, unified Markov chain\" (Juang et al., 1991). To exploit the correlation between feature sets, we must now couple the two HMMs. A naive solution is to form the Cartesian product of their hidden state spaces and resume training. Unfortunately, this results in an explosion in the number of parameters that must be fit from the training data. The likely consequences are overfitting and poor generalization.\n\nFigure 3 shows a network for modeling feature sets with disparate time scales, in this case a 2:1 disparity. Two parallel Boltzmann chains are coupled by weights that connect their hidden units. Like the transition and emission weights within each chain, the coupling weights are tied across the length of the network. Note that coupling the time scales in this way introduces far fewer parameters than forming the Cartesian product of the hidden state spaces. Moreover, the network is tractable by the rules of Section 3. Suppose, for example, that we wish to compute the correlation between two neighboring hidden units in the middle of the network. This is done by first pruning all the visible units, then repeatedly decimating hidden units from both ends of the network.\n\nFigure 4 shows typical results on a simple benchmark problem, with data generated by an artificially constructed HMM. We tested the parallel chains model on 10 training sets, with varying levels of built-in correlation between features. A two-step method was used to train the parallel chains. First, we set the coupling weights to zero and trained each chain by a separate Baum-Welch procedure. 
Then, after learning in this phase was complete, we lifted the zero constraints and resumed training with the full Boltzmann learning rule. The percent gain in this second phase was directly related to the degree of correlation built into the training data, suggesting that the coupling weights were indeed capturing the correlation between feature sets. We also compared the performance of this Boltzmann machine versus that of a simple Cartesian-product HMM trained by an additional Baum-Welch procedure. While in both cases the second phase of learning led to reduced training error, the Cartesian-product HMMs were decidedly more prone to overfitting.\n\nFigure 4: (a) Log-likelihood versus epoch (training and cross-validation curves) for parallel chains with 4-state hidden units, 6-state visible units, and 100 hidden-visible unit pairs (per chain). The second jump in log-likelihood occurred at the onset of Boltzmann learning (see text). (b) Percent gain in log-likelihood versus built-in correlation between feature sets.\n\n4.2 LOOPS AND LONG-TERM DEPENDENCIES\n\nAnother shortcoming of first-order HMMs is that they cannot exhibit long-term dependencies between the hidden states (Juang et al., 1991). Higher-order and duration-based HMMs have been used in this regard with varying degrees of success. The rules of Section 3 suggest another approach: namely, designing tractable networks with limited long-range connectivity. 
As an example, Figure 5a shows a Boltzmann chain with an internal loop and a long-range connection between the first and last hidden units. These extra features could be used to enforce known periodicities in the time series. Though tractable for Boltzmann learning, the loops in this network do not fit naturally into the framework of HMMs. Figure 5b shows learning curves for a toy problem, with data generated by another looped network.\n\nFigure 5: (a) Looped network. (b) Log-likelihood versus epoch (training and cross-validation curves) for a looped network with 4-state hidden units, 6-state visible units, and 100 hidden-visible unit pairs.\n\nCarefully chosen loops and long-range connections provide additional flexibility in the design of probabilistic models for time series. Can networks with these extra features capture the long-term dependencies exhibited by real data? This remains an important issue for future research.\n\nAcknowledgements\n\nWe thank G. Hinton, D. MacKay, P. Stolorz, and C. Williams for useful discussions. This work was funded by ATR Human Information Processing Laboratories, Siemens Corporate Research, and NSF grant CDA-9404932.\n\nReferences\n\nD. H. Ackley, G. E. Hinton, and T. J. Sejnowski. (1985) A Learning Algorithm for Boltzmann Machines. Cog. Sci. 9: 147-160.\n\nP. Baldi, Y. Chauvin, T. Hunkapiller, and M. A. McClure. (1992) Proc. Nat. Acad. Sci. (USA) 91: 1059-1063.\n\nL. Baum. (1972) An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes, Inequalities 3:1-8.\n\nW. Byrne. (1992) Alternating Minimization and Boltzmann Machine Learning. IEEE Trans. 
Neural Networks 3:612-620.\n\nA. P. Dempster, N. M. Laird, and D. B. Rubin. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J. Roy. Statist. Soc. B, 39:1-38.\n\nC. Itzykson and J. Drouffe. (1991) Statistical Field Theory, Cambridge: Cambridge University Press.\n\nB. H. Juang and L. R. Rabiner. (1991) Hidden Markov Models for Speech Recognition, Technometrics 33: 251-272.\n\nD. J. MacKay. (1994) Equivalence of Boltzmann Chains and Hidden Markov Models, submitted to Neural Comp.\n\nC. Peterson and J. R. Anderson. (1987) A Mean Field Theory Learning Algorithm for Neural Networks, Complex Systems 1:995-1019.\n\nL. Saul and M. Jordan. (1994) Learning in Boltzmann Trees. Neural Comp. 6: 1174-1184.\n\nN. Sourlas. (1989) Spin Glass Models as Error Correcting Codes. Nature 339: 693-695.\n\nP. Stolorz. (1994) Links Between Dynamic Programming and Statistical Physics for Heterogeneous Systems, JPL/Caltech preprint.\n\nC. Williams and G. E. Hinton. (1990) Mean Field Networks That Learn To Discriminate Temporally Distorted Strings. Proc. Connectionist Models Summer School: 18-22.", "award": [], "sourceid": 966, "authors": [{"given_name": "Lawrence", "family_name": "Saul", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}