{"title": "Diffusion of Credit in Markovian Models", "book": "Advances in Neural Information Processing Systems", "page_first": 553, "page_last": 560, "abstract": null, "full_text": "Diffusion of Credit in Markovian Models\n\nYoshua Bengio*\nDept. I.R.O., Universite de Montreal,\nMontreal, Qc, Canada H3C-3J7\nbengioy@IRO.UMontreal.CA\n\nPaolo Frasconi\nDipartimento di Sistemi e Informatica\nUniversita di Firenze, Italy\npaolo@mcculloch.ing.unifi.it\n\nAbstract\n\nThis paper studies the problem of diffusion in Markovian models, such as hidden Markov models (HMMs), and how it makes the task of learning long-term dependencies in sequences very difficult. Using results from Markov chain theory, we show that the problem of diffusion is reduced if the transition probabilities approach 0 or 1. Under this condition, standard HMMs have very limited modeling capabilities, but input/output HMMs can still perform interesting computations.\n\n1 Introduction\n\nThis paper presents an important new element in our research on the problem of learning long-term dependencies in sequences. In our previous work [4] we found theoretical reasons for the difficulty in training recurrent networks (or, more generally, parametric non-linear dynamical systems) to learn long-term dependencies. The main result stated that either long-term storing or gradient propagation would be harmed, depending on whether the norm of the Jacobian of the state-to-state function was greater or less than 1. In this paper we consider a special case in which the norm of the Jacobian of the state-to-state function is constrained to be exactly 1 because this matrix is a stochastic matrix.\n\nWe consider both homogeneous and non-homogeneous Markovian models. 
Let n be the number of states and $A_t$ be the transition matrices (constant in the homogeneous case): $A_{ij}(u_t) = P(q_t = j \\mid q_{t-1} = i, u_t; \\theta)$, where $u_t$ is an external input (constant in the homogeneous case) and $\\theta$ is a vector of parameters. In the homogeneous case (e.g., standard HMMs), such models can learn the distribution of output sequences by associating an output distribution to each state. In the non-homogeneous case, transition and output distributions are conditional on the input sequences, allowing one to model relationships between input and output sequences (e.g., to do sequence regression or classification as with recurrent networks). We thus called this kind of non-homogeneous HMM an Input/Output HMM (IOHMM). In [3, 2] we proposed a connectionist implementation of IOHMMs. In both cases, training requires propagating forward probabilities and backward probabilities, taking products with the transition probability matrix or its transpose. This paper studies under which conditions these products of matrices might gradually converge to lower rank, thus harming storage and learning of long-term context. However, we find in this paper that IOHMMs can better deal with this problem than homogeneous HMMs.\n\n* Also, AT&T Bell Labs, Holmdel, NJ 07733\n\n2 Mathematical Preliminaries\n\n2.1 Definitions\n\nA matrix A is said to be non-negative, written $A \\geq 0$, if $A_{ij} \\geq 0\\ \\forall i, j$. Positive matrices are defined similarly. A non-negative square matrix $A \\in R^{n \\times n}$ is called row stochastic (or simply stochastic in this paper) if $\\sum_{j=1}^{n} A_{ij} = 1\\ \\forall i = 1 \\ldots n$. A non-negative matrix is said to be row [column] allowable if every row [column] sum is positive. 
An allowable matrix is both row and column allowable. A non-negative matrix can be associated to the directed transition graph $\\mathcal{G}$ that constrains the Markov chain. The incidence matrix corresponding to a given non-negative matrix A replaces all positive entries of A by 1. The incidence matrix of A is a connectivity matrix corresponding to the graph $\\mathcal{G}$ (assumed to be connected here). Some algebraic properties of A are described in terms of the topology of $\\mathcal{G}$.\n\nDefinition 1 (Irreducible matrix) A non-negative n x n matrix A is said to be irreducible if for every pair i, j of indices, there exists a positive integer m = m(i, j) s.t. $(A^m)_{ij} > 0$.\n\nA matrix A is irreducible if and only if the associated graph is strongly connected (i.e., there exists a path between any pair of states i, j). If $\\exists k$ s.t. $(A^k)_{ii} > 0$, d(i) is called the period of index i if it is the greatest common divisor (g.c.d.) of those k for which $(A^k)_{ii} > 0$. In an irreducible matrix all the indices have the same period d, which is called the period of the matrix. The period of a matrix is the g.c.d. of the lengths of all cycles in the associated transition graph.\n\nDefinition 2 (Primitive matrix) A non-negative matrix A is said to be primitive if there exists a positive integer k s.t. $A^k > 0$.\n\nAn irreducible matrix is either periodic or primitive (i.e., of period 1). A primitive stochastic matrix is necessarily allowable.\n\n2.2 The Perron-Frobenius Theorem\n\nTheorem 1 (See [6], Theorem 1.1.) Suppose A is an n x n non-negative primitive matrix. Then there exists an eigenvalue r such that:\n\n1. r is real and positive;\n2. with r can be associated strictly positive left and right eigenvectors;\n3. $r > |\\lambda|$ for any eigenvalue $\\lambda \\neq r$;\n4. the eigenvectors associated with r are unique to constant multiples;\n5. if $0 \\leq B \\leq A$ and $\\beta$ is an eigenvalue of B, then $|\\beta| \\leq r$; moreover, $|\\beta| = r$ implies B = A;\n6. r is a simple root of the characteristic equation of A.\n\nA simple consequence of the theorem for stochastic matrices is the following:\n\nCorollary 1 Suppose A is a primitive stochastic matrix. Then its largest eigenvalue is 1 and there is only one corresponding right eigenvector, which is $\\mathbf{1} = [1, 1, \\ldots, 1]'$. Furthermore, all other eigenvalues have modulus < 1.\n\nProof. $A\\mathbf{1} = \\mathbf{1}$ by definition of stochastic matrices. This eigenvector is unique and all other eigenvalues have modulus < 1 by the Perron-Frobenius Theorem.\n\nIf A is stochastic but periodic with period d, then A has d eigenvalues of modulus 1, which are the d complex roots of 1.\n\n3 Learning Long-Term Dependencies with HMMs\n\nIn this section we analyze the case of a primitive transition matrix as well as the general case with a canonical re-ordering of the matrix indices. We discuss how ergodicity coefficients can be used to measure the difficulty in learning long-term dependencies. Finally, we find that in order to avoid all diffusion, the transitions should be deterministic (0 or 1 probability).\n\n3.1 Training Standard HMMs\n\nTheorem 2 (See [6], Theorem 4.2.) If A is a primitive stochastic matrix, then as $t \\to \\infty$, $A^t \\to \\mathbf{1}v'$, where v' is the unique stationary distribution of the Markov chain. The rate of approach is geometric.\n\nThus if A is primitive, then $\\lim_{t \\to \\infty} A^t$ is a matrix whose eigenvalues are all 0 except for $\\lambda = 1$ (with eigenvector $\\mathbf{1}$), i.e., the rank of this product converges to 1 and its rows become equal. 
A consequence of Theorem 2 is that it is very difficult to train ordinary hidden Markov models, with a primitive transition matrix, to model long-term dependencies in observed sequences. The reason is that the distribution over the states at time $t > t_0$ becomes gradually independent of the distribution over the states at time $t_0$ as t increases. It means that states at time $t_0$ become equally responsible for increasing the likelihood of an output at time t. This corresponds, in the backward phase of the EM algorithm for training HMMs, to a diffusion of credit over all the states. In practice we train HMMs with finite sequences. However, training will become more and more numerically ill-conditioned as one considers longer-term dependencies. Consider two events $e_0$ (occurring at $t_0$) and $e_t$ (occurring at t), and suppose there are also \"interesting\" events occurring in between. Let us consider the overall influence of states at times $\\tau < t$ upon the likelihood of the outputs at time t. Because of the phenomenon of diffusion of credit, and because gradients are added together, the influence of intervening events (especially those occurring shortly before t) will be much stronger than the influence of $e_0$. Furthermore, this problem gets geometrically worse as t increases. Clearly a positive matrix is primitive. Thus in order to learn long-term dependencies, we would like to have many zeros in the matrix of transition probabilities. Unfortunately, this generally supposes prior knowledge of an appropriate connectivity graph. 
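The geometric convergence stated in Theorem 2 can be checked numerically. The following is a minimal sketch, not taken from the paper: the 4-state matrix, the random seed, and the helper name row_spread are all illustrative assumptions. It raises a strictly positive (hence primitive) stochastic matrix to increasing powers and tracks how far apart its rows remain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-state transition matrix with strictly positive entries (hence primitive).
A = rng.random((4, 4)) + 0.1
A = A / A.sum(axis=1, keepdims=True)  # rows sum to 1, so A is row stochastic

def row_spread(M):
    # Largest gap, over all columns, between the entries of any two rows.
    # It is 0 exactly when all rows of M are identical.
    return float(np.max(M.max(axis=0) - M.min(axis=0)))

spreads = []
P = np.eye(4)
for t in range(1, 31):
    P = P @ A  # P now equals A^t
    spreads.append(row_spread(P))

# Stationary distribution v: left eigenvector of A for eigenvalue 1, scaled to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
v = v / v.sum()

# Entrywise distance between A^30 and the rank-1 limit whose rows all equal v.
limit_gap = float(np.max(np.abs(P - np.outer(np.ones(4), v))))

for t in (1, 5, 10, 30):
    print('t =', t, ' row spread of A^t =', spreads[t - 1])
print('max entrywise gap between A^30 and its rank-1 limit:', limit_gap)
```

Under these assumptions the row spread shrinks by a roughly constant factor per step, so the state distribution at time t quickly stops depending on the state at time 0, which is the diffusion effect discussed in this section.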
\n\n3.2 Coefficients of Ergodicity\n\nTo study products of non-negative matrices and the loss of information about initial state in Markov chains (particularly in the non-homogeneous case), we introduce the projective distance between vectors x and y:\n\n$$d(x', y') = \\max_{i,j} \\ln\\left(\\frac{x_i y_j}{x_j y_i}\\right).$$\n\nClearly, some contraction takes place when $d(x'A, y'A) \\leq d(x', y')$.\n\nDefinition 3 Birkhoff's contraction coefficient $\\tau_B(A)$, for a non-negative column-allowable matrix A, is defined in terms of the projective distance:\n\n$$\\tau_B(A) = \\sup_{x, y > 0;\\ x \\neq \\lambda y} \\frac{d(x'A, y'A)}{d(x', y')}.$$\n\nDobrushin's coefficient $\\tau_1(A)$, for a stochastic matrix A, is defined as follows:\n\n$$\\tau_1(A) = \\frac{1}{2} \\sup_{i,j} \\sum_k |a_{ik} - a_{jk}|.$$\n\nBoth are proper ergodicity coefficients: $0 \\leq \\tau(A) \\leq 1$ and $\\tau(A) = 0$ if and only if A has identical rows. Furthermore, $\\tau(A_1 A_2) \\leq \\tau(A_1)\\tau(A_2)$ (see [6]).\n\n3.3 Products of Stochastic Matrices\n\nLet $A^{(1,t)} = A_1 A_2 \\cdots A_{t-1} A_t$ denote a forward product of stochastic matrices $A_1, A_2, \\ldots, A_t$. From the properties of $\\tau_B$ and $\\tau_1$, if $\\tau(A_t) < 1$, t > 0, then $\\lim_{t \\to \\infty} \\tau(A^{(1,t)}) = 0$, i.e., $A^{(1,t)}$ has rank 1 and identical rows. Weak ergodicity is then defined in terms of a proper ergodic coefficient $\\tau$ such as $\\tau_B$ and $\\tau_1$:\n\nDefinition 4 (Weak Ergodicity) The products of stochastic matrices $A^{(p,r)}$ are weakly ergodic if and only if for all $t_0 \\geq 0$, as $t \\to \\infty$, $\\tau(A^{(t_0,t)}) \\to 0$.\n\nTheorem 3 (See [6], Lemma 3.3 and 3.4.) Let $A^{(1,t)}$ be a forward product of non-negative and allowable matrices; then the products $A^{(1,t)}$ are weakly ergodic if and only if the following conditions both hold:\n\n1. $\\exists t_0$ s.t. $A^{(t_0,t)} > 0\\ \\forall t > t_0$;\n2. $A^{(t_0,t)}_{ik} / A^{(t_0,t)}_{jk} \\to W_{ij}(t) > 0$ as $t \\to \\infty$, i.e., rows of $A^{(t_0,t)}$ tend to proportionality.\n\nFor stochastic matrices, row-proportionality is equivalent to row-equality since rows sum to 1. $\\lim_{t \\to \\infty} A^{(t_0,t)}$ does not need to exist in order to have weak ergodicity.\n\n3.4 Canonical Decomposition and Periodic Graphs\n\nAny non-negative matrix A can be rewritten by relabeling its indices in the following canonical decomposition [6], with diagonal blocks $B_i$, $C_i$ and Q:\n\n$$A = \\begin{pmatrix} B_1 & 0 & \\cdots & 0 & \\cdots & 0 & 0 \\\\\\\\ 0 & B_2 & \\cdots & 0 & \\cdots & 0 & 0 \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots & & \\vdots & \\vdots \\\\\\\\ 0 & 0 & \\cdots & C_{s+1} & \\cdots & 0 & 0 \\\\\\\\ \\vdots & \\vdots & & \\vdots & \\ddots & \\vdots & \\vdots \\\\\\\\ 0 & 0 & \\cdots & 0 & \\cdots & C_r & 0 \\\\\\\\ L_1 & L_2 & \\cdots & L_{s+1} & \\cdots & L_r & Q \\end{pmatrix} \\quad (1)$$\n\nwhere $B_i$ and $C_i$ are irreducible, $B_i$ are primitive and $C_i$ are periodic. Define the corresponding sets of states as $S_{B_i}$, $S_{C_i}$, $S_Q$. Q might be reducible, but the groups of states in $S_Q$ leak into the B or C blocks, i.e., $S_Q$ represents the transient part of the state space. This decomposition is illustrated in Figure 1a. For homogeneous and non-homogeneous Markov models (with a constant incidence matrix, i.e., every $A_t$ has the same incidence matrix as $A_0$), because $P(q_t \\in S_Q \\mid q_{t-1} \\in S_Q) < 1$, $\\lim_{t \\to \\infty} P(q_t \\in S_Q \\mid q_0 \\in S_Q) = 0$. Furthermore, because the $B_i$ are primitive, we can apply Theorem 1, and starting from a state in $S_{B_i}$, all information about an initial state at $t_0$ is gradually lost.\n\nFigure 1: (a) Transition graph corresponding to the canonical decomposition. (b) Periodic graph $\\mathcal{G}_1$ becomes primitive (period 1) graph $\\mathcal{G}_2$ when adding a loop with states 4, 5.\n\nA more difficult case is the one of $(A^{(t_0,t)})_{jk}$ with initial state $j \\in S_{C_i}$. Let $d_i$ be the period of the i-th periodic block $C_i$. 
It can be shown [6] that taking d products of periodic matrices with the same incidence matrix and period d yields a block-diagonal matrix whose d blocks are primitive. Thus $C^{(t_0,t)}$ retains information about the initial block in which $q_{t_0}$ was. However, for every such block of size > 1, information will be gradually lost about the exact identity of the state within that block. This is best demonstrated through a simple example. Consider the incidence matrix represented by the graph $\\mathcal{G}_1$ of Figure 1b. It has period 3 and the only non-deterministic transition is from state 1, which can lead into either one of two loops. When many stochastic matrices with this graph are multiplied together, information about the loop in which the initial state was is gradually lost (i.e., if the initial state was 2 or 3, this information is gradually lost). What is retained is the phase information, i.e., in which block ({0}, {1}, or {2,3}) of a cyclic chain was the initial state. This suggests that it will be easy to learn about the type of outputs associated to each block of a cyclic chain, but it will be hard to learn anything else. Suppose now that the sequences to be modeled are slightly more complicated, requiring an extra loop of period 4 instead of 3, as in Figure 1b. In that case A is primitive: all information about the initial state will be gradually lost.\n\n3.5 Learning Long-Term Dependencies: a Discrete Problem?\n\nWe might wonder if, starting from a positive stochastic matrix, the learning algorithm could learn the topology, i.e., replace some transition probabilities by zeroes. Let us consider the update rule for transition probabilities in the EM algorithm:\n\n$$A_{ij} \\leftarrow \\frac{A_{ij} \\frac{\\partial L}{\\partial A_{ij}}}{\\sum_j A_{ij} \\frac{\\partial L}{\\partial A_{ij}}} \\quad (2)$$\n\nStarting from $A_{ij} > 0$ we could obtain a new $A_{ij} = 0$ only if $\\frac{\\partial L}{\\partial A_{ij}} = 0$, i.e., on a local maximum of the likelihood L. Thus the EM training algorithm will not exactly obtain zero probabilities. Transition probabilities might however approach 0.\n\nFigure 2: (a) Convergence of Dobrushin's coefficient (see Definition 3) as a function of T, with curves for the periodic, left-to-right, left-to-right (triangular) and fully connected cases. (b) Evolution of products $A^{(1,t)}$ for a fully connected graph. Matrix elements are visualized with gray levels.\n\nIt is also interesting to ask under which conditions we are guaranteed that there will not be any diffusion (of influence in the forward phase, and credit in the backward phase of training). It requires that some of the eigenvalues other than $\\lambda_1 = 1$ have a norm that is also 1. This can be achieved with periodic matrices C (of period d), which have d eigenvalues that are the d roots of 1 on the complex unit circle. To avoid any loss of information also requires that $C^d = I$ be the identity, since any diagonal block of $C^d$ with size more than 1 will yield a loss of information (because of diffusion in primitive matrices). This can be generalized to reducible matrices whose canonical form is composed of periodic blocks $C_i$ with $C_i^d = I$. 
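The contrast between a deterministic cycle and a periodic graph with one stochastic branch can be illustrated with a small numerical sketch. It is not taken from the paper: the 4-state matrix A is an assumed reading of the period-3 graph described in the text (state 1 branches to state 2 or state 3, and both return to state 0), and the branching probability p is hypothetical.

```python
import numpy as np

d = 3

# Deterministic cycle 0 -> 1 -> 2 -> 0: a permutation matrix of period d = 3.
C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
eig_moduli = np.abs(np.linalg.eigvals(C))
C_d = np.linalg.matrix_power(C, d)

# Assumed reading of the period-3 graph: 0 -> 1, 1 -> 2 or 3, 2 -> 0, 3 -> 0.
# p is a hypothetical branching probability out of state 1.
p = 0.3
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, p, 1.0 - p],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])
A_d = np.linalg.matrix_power(A, d)

print('moduli of the eigenvalues of C:', eig_moduli)
print('C^d is the identity:', bool(np.allclose(C_d, np.eye(3))))
print('A^d =')
print(A_d)
print('rows of A^d for initial states 2 and 3 coincide:',
      bool(np.allclose(A_d[2], A_d[3])))
```

In this sketch all eigenvalues of C lie on the unit circle and C^d is the identity, so the initial state is never forgotten. For A, the d-step matrix is block diagonal over the phase blocks {0}, {1}, {2,3}, but the two rows of the {2,3} block are equal, so only the phase of the initial state survives.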
\n\nThe condition we are describing actually corresponds to a matrix with only 1's and 0's. If $A_t$ is fixed, it would mean that the Markov chain is also homogeneous. It appears that many interesting computations cannot be achieved with such constraints (i.e., only allowing one or more cycles of the same period and a purely deterministic and homogeneous Markov chain). Furthermore, if the parameters of the system are the transition probabilities themselves (as in ordinary HMMs), such solutions correspond to a subset of the corners of the 0-1 hypercube in parameter space. Away from those solutions, learning is mostly influenced by short-term dependencies, because of diffusion of credit. Furthermore, as seen in equation 2, algorithms like EM will tend to stay near a corner once it is approached. This suggests that discrete optimization algorithms, rather than continuous local algorithms, may be more appropriate to explore the (legal) corners of this hypercube.\n\n4 Experiments\n\n4.1 Diffusion: Numerical Simulations\n\nFirstly, we wanted to measure how (and if) different kinds of products of stochastic matrices converged, for example to a matrix of equal rows. We ran 4 simulations, each with an 8-state non-homogeneous Markov chain but with different constraints on the transition graph: 1) $\\mathcal{G}$ fully connected; 2) $\\mathcal{G}$ is a left-to-right model (i.e., A is upper triangular); 3) $\\mathcal{G}$ is left-to-right but only one-state skips are allowed (i.e., A is upper bidiagonal); 4) $A_t$ are periodic with period 4. Results shown in Figure 2 confirm the convergence towards zero of the ergodicity coefficient [footnote 1], at a rate that depends on the graph topology. In Figure 2b, we represent visually the convergence of fully connected matrices, in only 4 time steps, towards a matrix with identical rows (i.e., constant columns). 
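A simulation in the spirit of this experiment can be sketched as follows. This is not the authors' code: the way random stochastic matrices are sampled, the random seed, the product length T = 30, and the helper names dobrushin and random_stochastic are assumptions made for illustration. Only the four graph constraints and the 8-state size come from the text.

```python
import numpy as np

def dobrushin(A):
    # tau_1(A) = 1/2 * max over row pairs (i, j) of sum_k |a_ik - a_jk|
    gaps = np.abs(A[:, None, :] - A[None, :, :]).sum(axis=2)
    return 0.5 * float(gaps.max())

def random_stochastic(mask, rng):
    # Random row-stochastic matrix whose positive entries are exactly those allowed by mask.
    M = mask * (rng.random(mask.shape) + 0.05)
    return M / M.sum(axis=1, keepdims=True)

n, T = 8, 30
idx = np.arange(n)
masks = {
    'fully connected': np.ones((n, n)),
    'left-to-right': np.triu(np.ones((n, n))),
    'bidiagonal': np.eye(n) + np.eye(n, k=1),
    # Period 4: state i may only move to states j with j mod 4 == (i + 1) mod 4.
    'periodic': (idx[None, :] % 4 == (idx[:, None] + 1) % 4).astype(float),
}

rng = np.random.default_rng(0)
tau = {}
for name, mask in masks.items():
    P = np.eye(n)
    tau[name] = []
    for t in range(T):
        P = P @ random_stochastic(mask, rng)  # forward product A_1 A_2 ... A_t
        tau[name].append(dobrushin(P))

for name in masks:
    print(name, [round(tau[name][t], 6) for t in (0, 3, 9, 29)])
```

With these assumptions the coefficient of the fully connected products collapses within a few steps, the two left-to-right variants decay more slowly, and the periodic products stay at 1 because rows belonging to different phases never share support.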
\n\nFootnote 1: except for the experiments with periodic matrices, as expected.\n\nFigure 3: (a) Generating HMM. Numbers outside state circles denote output symbols. (b) Percentage of convergence to a good solution (over 20 trials) for various series of experiments as the span of dependencies is increased (horizontal axis: span; curve labels include fully connected with 16, 24 and 40 states, and randomly connected with 24 states).\n\n4.2 Training Experiments\n\nTo evaluate how diffusion impairs training, a set of controlled experiments was performed, in which the training sequences were generated by a simple homogeneous HMM with long-term dependencies, depicted in Figure 3a. Two branches generate similar sequences except for the first and last symbol. The extent of the long-term context is controlled by the self-transition probabilities of states 2 and 5, $\\lambda = P(q_t = 2 \\mid q_{t-1} = 2) = P(q_t = 5 \\mid q_{t-1} = 5)$. Span or \"half-life\" is $\\log(0.5)/\\log(\\lambda)$, i.e., $\\lambda^{\\mathrm{span}} = 0.5$. Following [4], data was generated for various spans of long-term dependencies (0.1 to 1000).\n\nFor each series of experiments, varying the span, 20 different training trials were run per span value, with 100 training sequences [footnote 2]. 
Training was stopped either after a maximum number of epochs (200), or after the likelihood did not improve significantly, i.e., $(L(t) - L(t-1))/|L(t)| < 10^{-5}$, where L(t) is the logarithm of the likelihood of the training set at epoch t.\n\nIf the HMM is fully connected (except for the final absorbing state) and has just the right number of states, trials almost never converge to a good solution (1 in 160 did). Increasing the number of states and randomly putting zeroes helps. The randomly connected HMMs had 3 times more states than the generating HMM and random connections were created with 20% probability. Figure 3b shows the average number of converged trials for these different types of HMM topology. A trial is considered successful when it yields a likelihood almost as good as or better than the likelihood of the generating HMM on the same data. In all cases the number of successful trials rapidly drops to zero beyond some value of span.\n\nFootnote 2: It appeared sufficient since the likelihood of the generating HMM did not improve much when trained on this data.\n\n5 Conclusion and Future Work\n\nIn previous work on recurrent networks we had found that propagating credit over the long term was incompatible with storing information for the long term. For Markovian models, we found that when the transition probabilities are close to 1 and 0, information can be stored for the long term AND credit can be propagated over the long term. However, as for recurrent networks, this makes the problem of learning long-term dependencies look more like a discrete optimization problem. 
Thus it appears difficult for a local learning algorithm such as EM to learn optimal transition probabilities near 1 or 0, i.e., to learn the topology, while taking into account long-term dependencies. The arguments presented are essentially an application of established mathematical results on Markov chains to the problem of learning long-term dependencies in homogeneous and non-homogeneous HMMs. These arguments were also supported by experiments on artificial data, studying the phenomenon of diffusion of credit and the corresponding difficulty in training HMMs to learn long-term dependencies.\n\nIOHMMs [1] introduce a reparameterization of the problem: instead of directly learning the transition probabilities, we learn parameters of a function of an input sequence. Even with a fully connected topology, transition probabilities computed at each time step might be very close to 0 and 1. Because of the non-stationarity, more interesting computations can emerge than the simple cycles studied above. For example, in [3] we found IOHMMs effective in grammar inference tasks. In [1] comparative experiments were performed with a preliminary version of IOHMMs and other algorithms such as recurrent networks, on artificial data on which the span of long-term dependencies was controlled. IOHMMs were found much better than the other algorithms at learning these tasks.\n\nBased on the analysis presented here, we are also exploring another approach to learning long-term dependencies that consists in building a hierarchical representation of the state. This can be achieved by introducing several sub-state variables whose Cartesian product corresponds to the system state. 
Each of these sub-state variables can operate at a different time scale, thus allowing credit to propagate over long temporal spans for some of these variables. Another interesting issue to be investigated is whether techniques of symbolic prior knowledge injection (such as in [5]) can be exploited to choose good topologies. One advantage, compared to traditional neural network approaches, is that the model has an underlying finite-state structure and is thus well suited to inject discrete transition rules.\n\nAcknowledgments\n\nWe would like to thank Leon Bottou for his many useful comments and suggestions, and the NSERC and FCAR Canadian funding agencies for support.\n\nReferences\n\n[1] Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropagation. In J. D. Cowan et al., eds., Advances in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.\n\n[2] Y. Bengio and P. Frasconi. An Input Output HMM Architecture. In this volume: J. D. Cowan et al., eds., Advances in Neural Information Processing Systems 7. Morgan Kaufmann, 1994.\n\n[3] Y. Bengio and P. Frasconi. An EM approach to learning sequential behavior. Technical Report RT-DSI-11/94, University of Florence, 1994.\n\n[4] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5(2):157-166, 1994.\n\n[5] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit rules and learning by example in recurrent networks. IEEE Trans. on Knowledge and Data Engineering, 7(1), 1995.\n\n[6] E. Seneta. Nonnegative Matrices and Markov Chains. Springer, New York, 1981. 
\n\n\f", "award": [], "sourceid": 919, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Paolo", "family_name": "Frasconi", "institution": null}]}