{"title": "A Dynamical Approach to Temporal Pattern Processing", "book": "Neural Information Processing Systems", "page_first": 750, "page_last": 759, "abstract": null, "full_text": "750 \n\nA DYNAMICAL APPROACH TO TEMPORAL PATTERN \n\nPROCESSING \n\nW. Scott Stornetta \n\nStanford University, Physics Department, Stanford, Ca., 94305 \n\nTad Hogg and B. A.  Huberman \n\nXerox Palo Alto Research Center, Palo Alto, Ca. 94304 \n\nABSTRACT \n\nRecognizing  patterns  with  temporal  context  is  important for \nsuch  tasks  as  speech  recognition,  motion  detection  and  signature \nverification.  We  propose  an  architecture  in  which  time  serves as its \nown representation, and temporal context is encoded in the state of the \nnodes. We contrast this with the approach of replicating portions of the \narchitecture to represent time. \n\nAs one example of these ideas, we demonstrate an architecture \nwith  capacitive  inputs  serving  as  temporal  feature  detectors  in  an \notherwise  standard  back  propagation  model.  Experiments  involving \nmotion  detection  and  word  discrimination  serve  to  illustrate  novel \nfeatures  of the system.  Finally, we discuss possible  extensions of the \narchitecture. \n\nINTRODUCTION \n\nRecent  interest in connectionist,  or \"neural\" networks  has emphasized  their \nability to store, retrieve and process patterns1,2.  For most applications, the patterns to \nbe processed are static in the sense that they lack temporal context. \n\nAnother important class consists of those problems that require the processing \nof temporal  patterns.  In  these  the  information  to  be  learned  or  processed  is  not  a \nparticular  pattern  but  a  sequence  of  patterns.  Such  problems  include  speech \nprocessing,  signature  verification,  motion  detection,  and  predictive \nsignal \nprocessin,r-8. 
\n\nMore  precisely,  temporal  pattern  processing  means  that  the  desired  output \ndepends  not only on the  current input but also  on  those  preceding or  following  it as \nwell.  This  implies  that  two  identical  inputs  at  different  time  steps  might  yield \ndifferent desired outputs depending on what patterns precede or follow them. \n\nThere  is another feature  characteristic of much  temporal  pattern processing. \nHere  an  entire  sequence  of  patterns  is  recognized  as  a  single  distinct  category, \n\n\u00a9 American Institute of Physics 1988 \n\n\f751 \n\ngenerating a  single output.  A typical example of this would  be  the need  to  recognize \nwords  from  a  rapidly  sampled  acoustic  signal.  One  should  respond  only once  to  the \nappearance of each word, even though the word consists of many samples. Thus, each \ninput may not produce an output. \n\nWith these features  in mind,  there are at least three additional issues  which \nnetworks  that process temporal  patterns must address, above  and beyond those  that \nwork with static patterns. The first is how to represent temporal context in the state of \nthe network. The second is how  to train at intermediate time steps before  a  temporal \npattern is complete. The third issue is how to interpret the outputs during recognition, \nthat is,  how  to  tell  when the sequence has been completed.  Solutions to  each of these \nissues require the construction of appropriate  input and output representations. This \npaper  is  an  attempt  to  address  these  issues,  particularly  the  issue  of representing \ntemporal context in the state of the machine. We  note  in passing that the recognition \nof temporal sequences is distinct from the  related problem of generating a  sequence, \ngiven its first few  members9.l O\u202211 . 
\n\nTEMPORAL CLASSIFICATION \n\nWith some exceptions10.12, in  most  previous  work  on  temporal  problems  the \nsystems  record  the  temporal  pattern by  replicating part of the architecture for  each \ntime step. In some instances input nodes and their associated links are replicated3,4. In \nother  cases  only  the  weights  or  links  are  replicated,  once  for  each  of several  time \ndelays 7,8.  In either case, this amounts to  mapping the temporal  pattern into a  spatial \none of much higher dimension before processing. \n\nThese systems have generated significant and encouraging results.  However, \nthese approaches also  have  inherent drawbacks.  First,  by  replicating portions of the \narchitecture for  each time step the amount of redundant computation is significantly \nincreased.  This  problem  becomes  extreme  when  the  signal \nis  sampled  very \nfrequently4.  :-.l' ext, by re lying on replications of the architecture for each time step, the \nsystem is quite inflexible to variations in the rate at which the data is presented or size \nof the temporal window. Any variability in the rate of the input signal can generate an \ninput  pattern  which  bears  little  or  no  resemblance  to  the  trained  pattern.  Such \nvariability is an important issue, for example, in speech recognition . Moreover, having \na  temporal  window  of  any  fixed  length  makes  it  manifestly  impossible  to  detect \ncontextual effects on time scales longer than the window  size.  An additional difficulty \nis  that  a  misaligned  signal,  in  its  spatial  representation,  may  have  very  little \nresemblance to  the correctly aligned training signal. That is, these systems typically \nsuffer from not being translationally invariant in time. \n\n~etworks based on  relaxation to  equilibrium 11,13,14 also  have  difficulties for \nuse  with  temporal  problems.  
Such an approach removes any dependence on initial conditions and hence is difficult to reconcile directly with temporal problems, which by their nature depend on inputs from earlier times. Also, if a temporal problem is to be handled in terms of relaxation to equilibrium, the equilibrium points themselves must be changing in time. \n\nA NON-REPLICATED, DYNAMIC ARCHITECTURE \n\nWe believe that many of the difficulties mentioned above are tied to the attempt to map an inherently dynamical problem into a static problem of higher dimension. As an alternative, we propose to represent the history of the inputs in the state of the nodes of a system, rather than by adding additional units. Such an approach to capturing temporal context shows some very immediate advantages over the systems mentioned above. First, it requires no replication of units for each distinct time step. Second, it does not fix in the architecture itself the window for temporal context or the presentation rate. These advantages are a direct result of the decision to let time serve as its own representation for temporal sequences, rather than creating additional spatial dimensions to represent time. \n\nIn addition to providing a solution to the above problems, this system lends itself naturally to interpretation as an evolving dynamical system. Our approach allows one to think of the process of mapping an evolving input into a discrete sequence of outputs (such as mapping continuous speech input into a sequence of words) as a dynamical system moving from one attractor to another15. \n\nAs a preliminary example of the application of these ideas, we introduce a system that captures the temporal context of input patterns without replicating units for each time step. 
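The contrast between the two representations of time can be made concrete in a few lines. The sketch below is our own illustration, not part of the paper's implementation; the channel count, window length, and decay value are hypothetical and chosen only to show the difference in input dimensionality:

```python
# Two ways to give a network temporal context, for C input channels:
# (1) spatialize time: replicate the inputs for each of W window steps;
# (2) dynamic state: keep C nodes whose values carry a decaying trace.
C, W = 16, 25  # channels and window length (hypothetical values)

# Fixed-window (replicated) representation: C*W input units, with the
# window length W frozen into the architecture itself.
window_input_size = C * W  # 400 units

# State-based representation: C input units updated in place, with no
# architectural commitment to a window length or presentation rate.
state_input_size = C  # 16 units

def update_state(state, frame, decay=0.9):
    # Fold the new input frame into the running state as an
    # exponentially decaying trace (one example of carried state).
    return [decay * s + x for s, x in zip(state, frame)]
```

The state-based form keeps the input layer the same size no matter how long the relevant temporal context is, which is the flexibility argued for above.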
We modify the conventional back propagation algorithm by making the input units capacitive. In contrast to the conventional architecture, in which the input nodes are used simply to distribute the signal to the next layer, our system performs an additional computation. Specifically, let X_i be the value computed by an input node at time t_i, and I_i be the input signal to this node at the same time. Then the node computes successive values according to \n\nX_i = a I_i + d X_{i-1}     (1) \n\nwhere a is an input amplitude and d is a decay rate. Thus, the result computed by an input unit is the sum of the current input value multiplied by a, plus a fractional part, d, of the previously computed value of the input unit. In the absence of further input, this produces an exponential decay in the activation of the input nodes. The value for d is chosen so that this decay reaches 1/e of its original value in a time t characteristic of the time scale for the particular problem, i.e., d = e^{-1/(tr)}, where r is the presentation rate. The value for a is chosen to produce a specified maximum value for X, given by X_max = a I_max/(1-d). We note that Eq. (1) is equivalent to having a non-modifiable recurrent link with weight d on the input nodes, as illustrated in Fig. 1. \n\nFig. 1: Schematic architecture with capacitive inputs. The input nodes compute values according to Eq. (1). Hidden and output units are identical to standard back propagation nets. \n\nThe processing which takes place at the input node can also be thought of in terms of an infinite impulse response (IIR) digital filter. The infinite impulse response of the filter allows input from the arbitrarily distant past to influence the current output of the filter, in contrast to methods which employ fixed windows, which can be viewed in terms of finite impulse response (FIR) filters. The capacitive node of Fig. 
1 is equivalent to pre-processing the signal with a filter with transfer function a/(1 - d z^{-1}). \n\nThis system has the unique feature that a simple transformation of the parameters a and d allows it to respond in a near-optimal way to a signal which differs from the training signal in its rate. Consider a system initially trained at rate r with decay rate d and amplitude a. To make use of these weights for a different presentation rate, r', one simply adjusts the values a' and d' according to \n\nd' = d^{r/r'}     (2) \n\na' = a (1 - d')/(1 - d)     (3) \n\nThese equations can be derived by the following argument. The general idea is that the values computed by the input nodes at the new rate should be as close as possible to those computed at the original rate. Specifically, suppose one wishes to change the sampling rate from r to nr, where n is an integer. Suppose that at a time t_0 the computed value of the input node is X_0. If this node receives no additional input, then after m time steps, the computed value of the input node will be X_0 d^m. For the more rapid sampling rate, X_0 d^m should be the value obtained after nm time steps. Thus we require \n\nX_0 d^m = X_0 (d')^{nm}     (4) \n\nwhich leads to Eq. (2) because n = r'/r. Now suppose that an input I is presented m times in succession to an input node that is initially zero. After the mth presentation, the computed value of the input node is \n\nX = a I (1 - d^m)/(1 - d)     (5) \n\nRequiring this value to be equal to the corresponding value for the faster presentation rate after nm time steps leads to Eq. (3). These equations, then, make the computed values of the input nodes identical, independent of the presentation rate. Of course, this statement only holds exactly in the limit that the computed values of the input nodes change only infinitesimally from one time step to the next. 
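The update of Eq. (1) and the rescaling of Eqs. (2) and (3) can be written out directly. The following is a minimal Python sketch of our own (not the paper's implementation); it presents a constant unit input at one rate and, with rescaled parameters, at twice that rate:

```python
def capacitive_update(x, inp, a, d):
    # Eq. (1): X_i = a * I_i + d * X_{i-1}
    return a * inp + d * x

def rescale(a, d, r, r_new):
    # Eqs. (2) and (3): adapt decay and amplitude to a new presentation rate
    d_new = d ** (r / r_new)
    a_new = a * (1.0 - d_new) / (1.0 - d)
    return a_new, d_new

# Present a constant unit input 8 times at rate r = 1 ...
a, d = 0.4424, 0.7788
x = 0.0
for _ in range(8):
    x = capacitive_update(x, 1.0, a, d)

# ... and 16 times at rate r' = 2 with rescaled parameters.
a2, d2 = rescale(a, d, r=1.0, r_new=2.0)
y = 0.0
for _ in range(16):
    y = capacitive_update(y, 1.0, a2, d2)
# For a constant input, x and y agree to machine precision.
```

For a signal that varies between samples the correspondence is only approximate, which is why the equality holds exactly only in the slowly varying limit noted above.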
Thus, in practice, one must ensure that the signal is sampled frequently enough that the computed value of the input nodes is slowly changing. \n\nThe point in weight space obtained after initial training at the rate r has two desirable properties. First, the system can be trained on a signal at one sampling rate and then the values of the weights arrived at can be used as a near-optimal starting point to further train the system on the same signal but at a different sampling rate. Alternatively, the system can respond to temporal patterns which differ in rate from the training signal, without any retraining of the weights. These factors are a result of the choice of input representation, which essentially presents the same pattern to the hidden and other layers, independent of sampling rate. These features highlight the fact that in this system the weights to some degree represent the temporal pattern independent of the rate of presentation. In contrast, in systems which use temporal windows, the weights obtained after training on a signal at one sampling rate would have little or no relation to the desired values of the weights for a different sampling rate or window size. \n\nEXPERIMENTS \n\nAs an illustration of this architecture and related algorithm, a three-layer, 15-30-2 system was trained to detect the leftward or rightward motion of a Gaussian pulse moving across the field of input units with sudden changes in direction. The values of d and a were 0.7788 and 0.4424, respectively. These values were chosen to give a characteristic decay time of 4 time steps with a maximum value computed by the input nodes of 2.0. The pulse was of unit height with a half-width, \u03c3, of 1.3. Figure 2 shows the input pulse as well as the values computed by the input nodes for leftward or rightward motion. 
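The quoted values of d and a follow directly from the stated design choices. As a quick arithmetic check (our own, using a decay to 1/e in t·r sampling steps, so d = e^{-1/(tr)}, and the maximum-value relation X_max = a I_max/(1-d)):

```python
import math

tau_steps = 4   # characteristic decay time t*r, in sampling steps
x_max = 2.0     # desired maximum value computed by an input node
i_max = 1.0     # unit-height input pulse

d = math.exp(-1.0 / tau_steps)   # decay rate per step
a = x_max * (1.0 - d) / i_max    # input amplitude

print(round(d, 4), round(a, 4))  # -> 0.7788 0.4424
```

The same relations reproduce the speech-experiment choices below to the precision quoted there (a decay time of a few hundred 2.5 ms samples with X_max = 4.0 gives d near 0.9944 and a near 0.022).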
Once trained at a velocity of 0.1 unit per sampling time, the velocity was varied over a wide range, from a factor of 2 slower to a factor of 2 faster, as shown in Fig. 3. For small variations in velocity the system continued to correctly identify the type of motion. More impressive was its performance when the scaling relations given in Eqs. (2) and (3) were used to modify the amplitude and decay rate. In this case, acceptable performance was achieved over the entire range of velocities tested, without any additional retraining at the new rates. The difference in performance between the two curves also demonstrates that the excellent performance of the system is not an anomaly of the particular problem chosen, but characteristic of rescaling a and d according to Eqs. (2) and (3). We thus see that a simple use of capacitive links to store temporal context allows for motion detection at variable velocities. \n\nA second experiment involving speech data was performed to compare the system's performance to the time-delay neural network of Watrous and Shastri8. In their work, they trained a system to discriminate between suitably processed acoustic signals of the words \"no\" and \"go.\" Once trained on a single utterance, the system was able to correctly identify other samples of these words from the same speaker. One drawback of their approach was that the weights did not converge to a fixed point. We were therefore particularly interested in whether our system could converge smoothly and rapidly to a stable solution, using the same data, and yet generalize as well as theirs did. This experiment also provided an opportunity to test a solution to the intermediate step training problem. \n\nThe architecture was a 16-30-2 network. 
Each of the input nodes received an input signal corresponding to the energy (sampled every 2.5 milliseconds) as a function of time in one of 16 frequency channels. The input values were normalized to lie in the range 0.0 to 1.0. The values of d and a were 0.9944 and 0.022, respectively. These values were chosen to give a characteristic decay time comparable to the length of each word (they were nearly the same length), and a maximum value computed by the input nodes of 4.0. For an input signal that was part of the word \"no\", the training signal was (1.0, 0.0), while for the word \"go\" it was (0.0, 1.0). Thus the outputs that were compared to the training signal can be interpreted as evidence for one word or the other at each time step. The error shown in Fig. 4 is the sum of the squares of the difference between the desired outputs and the computed outputs for each time step, for both words, after training up to the number of iterations indicated along the x-axis. \n\nFig. 2: a) Packet presented to input nodes. The x-axis represents the input nodes. b) Computed values from input nodes during rightward motion. c) Computed values during leftward motion. \n\nFig. 3: Performance of motion detection experiment for various velocities (percent correct vs. v'/v). 
Dashed curve is performance without scaling and solid curve is with the scaling given in Eqs. (2) and (3). \n\nFig. 4: Error in no/go discrimination as a function of the number of training iterations. \n\nEvidence for each word was obtained by summing the values of the respective nodes over time. This suggests a mechanism for signaling the completion of a sequence: when this sum crosses a certain threshold value, the sequence (in this case, the word) is considered recognized. Moreover, it may be possible to extend this mechanism to apply to the case of connected speech: after a word is recognized, the sums could be reset to zero, and the input nodes reinitialized. \n\nOnce we had trained the system on a single utterance, we tested the performance of the resulting weights on additional utterances of the same speaker. Preliminary results indicate an ability to correctly discriminate between \"no\" and \"go.\" This suggests that the system has at least a limited ability to generalize in this task domain. \n\nDISCUSSION \n\nAt a more general level, this paper raises and addresses some issues of representation. By choosing input and output representations in a particular way, we are able to make a static optimizer work on a temporal problem while still allowing time to serve as its own representation. In this broader context, one realizes that the choice of capacitive inputs for the input nodes was only one among many possible temporal feature detectors. \n\nOther possibilities include refractory units, derivative units and delayed spike units. Refractory units would compute a value which was some fraction of the current input. 
The fraction would decrease the more frequently and recently the node had been \"on\" in the recent past. A derivative unit would have a larger output the more rapidly a signal changed from one time step to the next. A delayed spike unit might have a transfer function of the form t^n e^{-at}, where t is the time since the presentation of the signal. This is similar to the function used by Tank and Hopfield7, but here it could serve a different purpose. The maximum value that a given input generated would be delayed by a certain amount of time. By similarly delaying the training signal, the system could be trained to recognize a given input in the context of signals not only preceding but also following it. An important point to note is that the transfer functions of each of these proposed temporal feature detectors could be rescaled in a manner similar to the capacitive nodes. This would preserve the property of the system that the weights contain information about the temporal sequence to some degree independent of the sampling rate. \n\nAn even more ambitious possibility would be to have the system train the parameters, such as d in the capacitive node case. It may be feasible to do this in the same way that weights are trained, namely by taking the partial derivative of the computed error with respect to the parameter in question. Such a system may be able to determine the relevant time scales of a temporal signal and adapt accordingly. \n\nACKNOWLEDGEMENTS \n\nWe are grateful for fruitful discussions with Jeff Kephart and the help of Raymond Watrous in providing data from his own experiments. This work was partially supported by DARPA ISTO Contract # N00140-86-C-8996 and ONR Contract # N00014-82-0699. \n\nREFERENCES \n\n1. D. Rumelhart, ed., Parallel Distributed Processing (MIT Press, Cambridge, 1986). \n\n2. J. Denker, ed., Neural Networks for Computing, AIP Conf. Proc. 151 (1986). \n\n3. T. J. Sejnowski and C. R. Rosenberg, NETtalk: A Parallel Network that Learns to Read Aloud, Johns Hopkins Univ. Report No. JHU/EECS-86/01 (1986). \n\n4. J. L. McClelland and J. L. Elman, in Parallel Distributed Processing, vol. II, p. 58. \n\n5. W. Keirstead and B. A. Huberman, Phys. Rev. Lett. 56, 1094 (1986). \n\n6. A. Lapedes and R. Farber, Nonlinear Signal Processing Using Neural Networks, Los Alamos preprint LA-UR-87-2662 (1987). \n\n7. D. Tank and J. Hopfield, Proc. Nat. Acad. Sci. 84, 1896 (1987). \n\n8. R. Watrous and L. Shastri, Proc. 9th Ann. Conf. Cog. Sci. Soc. (Lawrence Erlbaum, Hillsdale, 1987), p. 518. \n\n9. P. Kanerva, Self-Propagating Search: A Unified Theory of Memory, Stanford Univ. Report No. CSLI-84-7 (1984). \n\n10. M. I. Jordan, Proc. 8th Ann. Conf. Cog. Sci. Soc. (Lawrence Erlbaum, Hillsdale, 1986), p. 531. \n\n11. J. Hopfield, Proc. Nat. Acad. Sci. 79, 2554 (1982). \n\n12. S. Grossberg, The Adaptive Brain, vol. II, ch. 6 (North-Holland, Amsterdam, 1987). \n\n13. G. Hinton and T. J. Sejnowski, in Parallel Distributed Processing, vol. I, p. 282. \n\n14. B. Gold, in Neural Networks for Computing, p. 158. \n\n15. T. Hogg and B. A. Huberman, Phys. Rev. A 32, 2338 (1985).", "award": [], "sourceid": 76, "authors": [{"given_name": "W.", "family_name": "Stornetta", "institution": null}, {"given_name": "Tad", "family_name": "Hogg", "institution": null}, {"given_name": "Bernardo", "family_name": "Huberman", "institution": null}]}