{"title": "Modeling Temporal Structure in Classical Conditioning", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 10, "abstract": null, "full_text": "Modeling Temporal  Structure in  Classical \n\nConditioning \n\nAaron C.  Courville1,3  and David  S.  Touretzky 2,3 \n\n1 Robotics  Institute,  2Computer Science Department \n\n3Center for  the Neural  Basis of Cognition \n\nCarnegie Mellon  University,  Pittsburgh, PA  15213-3891 \n\n{ aarone, dst} @es.emu.edu \n\nAbstract \n\nThe Temporal Coding Hypothesis of Miller  and colleagues  [7]  sug(cid:173)\ngests  that  animals  integrate  related  temporal  patterns  of stimuli \ninto  single  memory  representations.  We  formalize  this  concept \nusing  quasi-Bayes  estimation  to  update  the  parameters  of a  con(cid:173)\nstrained hidden Markov model.  This approach allows us to account \nfor  some surprising temporal effects  in the second order condition(cid:173)\ning  experiments  of Miller  et  al.  [1 ,  2,  3],  which  other  models  are \nunable to explain. \n\n1 \n\nIntroduction \n\nAnimal learning involves more than just predicting reinforcement.  The well-known \nphenomena  of  latent  learning  and  sensory  preconditioning  indicate  that  animals \nlearn about stimuli in their environment before any reinforcement is supplied.  More \nrecently,  a  series  of experiments  by  R.  R.  Miller  and  colleagues  has  demonstrated \nthat in classical conditioning paradigms, animals appear to learn the temporal struc(cid:173)\nture of the stimuli  [8].  We  will  review three of these experiments.  We  then present \na  model of conditioning based on a  constrained hidden Markov model, using quasi(cid:173)\nBayes estimation to adjust the model parameters online.  Simulation results confirm \nthat the model  reproduces the experimental observations,  suggesting that this  ap(cid:173)\nproach is  a  viable  alternative to earlier models of classical conditioning which can(cid:173)\nnot account for  the Miller et al.  experiments.  Table 1 summarizes the experimental \nparadigms and the results. \n\nExpt.  1:  Simultaneous  Conditioning.  Responding to a  conditioned stimulus \n(CS)  is  impaired when it is  presented simultaneously with the unconditioned stimu(cid:173)\nlus  (US)  rather than preceding the US.  The failure of the simultaneous conditioning \nprocedure to demonstrate a  conditioned response  (CR)  is  a  well  established result \nin the classical  conditioning literature [9].  Barnet et al.  [1]  reported an interesting \n\n\fExpt.  1 \n\nExpt.2A \nExpt.  2B \n\nExpt.  3 \n\nPhase 1 \n(4)T+ US \n(12)T  -+  C \n(12)T  -+  C \n(96)L  -+  US  -+  X \n\nPhase 2 \n(4)C  -+  T \n(8)T -+  US \n(8)T  ---+  US \n(8) B  -+  X \n\nTest  => Result  Test  => Result \n\nT=> -\n\nC=> -\nC  =>CR \n\nX=> -\n\nC  =>CR \n\nB  =>CR \n\nTable 1:  Experimental Paradigms.  Phases  1 and 2 represent two  stages of training trials, \neach presented (n)  times.  The plus sign  (+ ) indicates simultaneous presentation of stimuli; \nthe short  arrow  (-+)  indicates  one  stimulus immediately following  another;  and  the  long \narrow  (-----+)  indicates a 5 sec gap between stimulus offset and the following  stimulus onset. \nFor  Expt.  1,  the tone  T,  click  train  C,  and footshock  US  were  all of  5 sec  duration.  For \nExpt.  2,  the  tone  and  click  train  durations  were  5 sec  and  the  footshock  US  lasted  0.5 \nsec.  For  Expt. 
Expt. 2: Sensory Preconditioning. Cole et al. [2] exposed rats to a tone T immediately followed by a click train C. In a second phase, the tone was paired with a footshock US that either immediately followed tone offset (variant A), or occurred 5 sec after tone offset (variant B). They found that when C and US both immediately follow T, little conditioned response is elicited by the presentation of C. However, when the US occurs 5 sec after tone offset, so that it occurs later than C (measured relative to T), then C does come to elicit a CR.

Expt. 3: Backward Conditioning. In another experiment by Cole et al. [3], rats were presented with a flashing light L followed by a footshock US, followed by an auditory stimulus X (either a tone or white noise). In phase 2, a buzzer B was followed by X. Testing revealed that while X did not elicit a CR (in fact, it became a conditioned inhibitor), X did impart an excitatory association to B.

2 Existing Models of Classical Conditioning

The Rescorla-Wagner model [11] is still the best-known model of classical conditioning, but as a trial-level model, it cannot account for within-trial effects such as second order conditioning or sensitivity to stimulus timing. Sutton and Barto developed V-dot theory [14] as a real-time extension of Rescorla-Wagner. Further refinements led to the Temporal Difference (TD) learning algorithm [14]. These extensions can produce second order conditioning. And using a memory buffer representation (what Sutton and Barto call a complete serial compound), TD can represent the temporal structure of a trial. However, TD cannot account for the empirical data in Experiments 1-3 because it does not make inferences about temporal relationships among stimuli; it focuses solely on predicting the US. In Experiment 1, some versions of TD can account for the reduced associative strength of a CS whose onset occurs simultaneously with the US, but no version of TD can explain why the second-order stimulus C should acquire greater associative strength than T. In Experiment 2, no learning occurs in Phase 1 with TD because no prediction error is generated by pairing T with C. As a result, no CR is elicited by C after T has been paired with the US in Phase 2. In Experiment 3, TD fails to predict the results because X is not predictive of the US; thus X acquires no associative strength to pass on to B in the second phase.
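To make the failure concrete, a minimal TD(0) learner with a complete serial compound representation is sketched below. The function and its parameter values are generic textbook choices, our own rather than drawn from [14]; it serves only to show where the prediction error comes from.

```python
import numpy as np

# A textbook TD(0) learner with a complete serial compound (CSC)
# representation: each stimulus is expanded into one indicator feature
# per time step since its onset. A generic sketch for illustration, not
# the implementation used in any of the models discussed here.

def td_csc(trials, alpha=0.1, gamma=0.98):
    """trials: list of (features, us) pairs, where features is a
    (T, n_features) binary CSC array and us is a length-T reinforcement
    signal. Returns the learned weight (associative strength) vector."""
    n_features = trials[0][0].shape[1]
    w = np.zeros(n_features)
    for features, us in trials:
        for t in range(features.shape[0] - 1):
            v_now = w @ features[t]
            v_next = w @ features[t + 1]
            delta = us[t + 1] + gamma * v_next - v_now  # TD prediction error
            w += alpha * delta * features[t]
    return w
```

In Phase 1 of Experiment 2 the us signal is identically zero, so with zero-initialized weights every TD error delta is zero and the weights never move; C therefore enters Phase 2 with no associative strength to transfer, exactly the failure described above.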
Dayan's  \"successor  representation\"  [4],  the  world  model  of  Sutton  and \nPinette  [15],  and  the  basal  ganglia model  of Suri  and  Schultz  [13]  all  attempt  to \npredict  future  stimulus  vectors.  Suri  and  Schultz's  model  can  even  produce  one \nform  of sensory  preconditioning.  However,  none  of these  models  can  account  for \nthe responses in any of the three experiments in Table 1,  because they do not make \nthe necessary inferences about relations among stimuli. \n\nTemporal  Coding  Hypothesis  The  temporal  coding  hypothesis  (TCH)  [7] \nposits that temporal contiguity is sufficient to produce an association between stim(cid:173)\nuli.  A  CS  does  not need to  predict reward in  order to acquire an association with \nthe US.  Furthermore, the association is  not a simple scalar quantity.  Instead, infor(cid:173)\nmation  about  the temporal  relationships  among stimuli  is  encoded  implicitly  and \nautomatically  in  the  memory  representation  of the trial.  Most  importantly,  TCH \nclaims that memory representations of trials with similar stimuli become integrated \nin such a  way  as to preserve the relative temporal information  [3]. \n\nIf we  apply the concept of memory integration to Experiment 1, we  get the memory \nrepresentation,  C  ---+  T + US.  If we  interpret  a  CR  as  a  prediction  of  imminent \nreinforcement,  then  we  arrive  at  the  correct  prediction  of a  strong  response  to  C \nand a  weak response to T. Integrating the hypothesized memory representations of \nthe two phases of Experiment 2 results in:  A)  T  ---+  C+US and B) T  ---+  C ---+  US. The \nstimulus C is only predictive ofthe US in variant B, consistent with the experimental \nfindings.  For Experiment 3,  an integrated memory representation of the two phases \nproduces L+ B ---+  US  ---+  X.  Stimulus B is predictive of the US  while X is not.  Thus, \nthe temporal coding hypothesis is  able to account for the results of each of the three \nexperiments by associating stimuli with a  timeline. \n\n3  A  Computational Model of Temporal Coding \n\nA  straightforward  formalization  of  a  timeline  is  a  Markov  chain  of  states.  For \nthis  initial  version  of our  model,  state  transitions  within  the  chain  are  fixed  and \ndeterministic.  Each  state  represents  one  instant  of time,  and  at  each  timestep  a \ntransition is  made to  the next  state in the  chain.  This restricted representation is \nkey  to  capturing  the  phenomena underlying  the  empirical  results.  Multiple  time(cid:173)\nlines  (or  Markov chains)  emanate from  a  single holding  state.  The transitions out \nof this holding state are the only probabilistic and adaptive transitions in the sim(cid:173)\nplified  model.  These  transition  probabilities  represent  the  frequency  with  which \nthe timelines  are experienced.  Figure  1 illustrates the model  structure used  in  all \nsimulations. \n\nOur goal is  to show  that our model  successfully integrates the timelines of the two \ntraining phases of each experiment.  In the context of a collection of Markov chains, \nintegrating timelines amounts to both phases of training becoming associated with \na single Markov chain.  Figure 1 shows the integration of the two phases of Expt. 2B. \n\n\fFigure 1:  A depiction of the state and observation structure of the model.  Shown are two \ntimelines, one headed by state j  and the other headed by state k.  
During the second phase of the experiments, the second Markov chain (shown in Figure 1 starting with state k) offers an alternative to the chain associated with the first phase of learning. If we successfully integrate the timelines, this second chain is not used.

As suggested in Figure 1, associated with each state is a stimulus observation. "Stimulus space" is an n-dimensional continuous space, where n is the number of distinct stimuli that can be observed (tone, light, shock, etc.). Each state has an expectation concerning the stimuli that should be observed when that state is occupied. This expectation is modeled by a probability density function over this space, defined by a mixture of two multivariate Gaussians. The probability density at stimulus observation $x^t$ in state $i$ at time $t$ is

$$p(x^t \mid s^t = i) = (1 - w_i)\, \mathcal{N}(x^t; \mu_{i0}, \sigma^2_{i0}) + w_i\, \mathcal{N}(x^t; \mu_{i1}, \sigma^2_{i1}), \qquad (1)$$

where $w_i$ is a mixture coefficient for the two Gaussians associated with state $i$. The Gaussian means $\mu_{i0}$ and $\mu_{i1}$ and variances $\sigma^2_{i0}$ and $\sigma^2_{i1}$ are vectors of the same dimension as the stimulus vector $x^t$. Given knowledge of the state, the stimulus components are assumed to be mutually independent (covariance terms are zero). We chose a continuous model of observations over a discrete observation model to capture stimulus generalization effects. These are not pursued in this paper.

For each state, the first Gaussian pdf is non-adaptive, meaning $\mu_{i0}$ is fixed about a point in stimulus space representing the absence of stimuli. $\sigma^2_{i0}$ is fixed as well. For the second Gaussian, $\mu_{i1}$ and $\sigma^2_{i1}$ are adaptive. This mixture of one fixed and one adaptive Gaussian is an approximation to the animal's belief distribution about stimuli, reflecting the observed tolerance animals have to absent expected stimuli. Put another way, animals seem to be less surprised by the absence of an expected stimulus than by the presence of an unexpected stimulus.

We assume that knowledge of the current state $s^t$ is inaccessible to the learner. This information must be inferred from the observed stimuli. In the case of a Markov chain, learning with hidden state is exactly the problem of parameter estimation in hidden Markov models. That is, we must update the estimates of $w_i$, $\mu_{i1}$ and $\sigma^2_{i1}$ for each state, and $a_{ij}$ for each state transition (out of the holding state), in order to maximize the likelihood of the sequence of observations.

The standard algorithm for hidden Markov model parameter estimation is the Baum-Welch method [10]. Baum-Welch is an off-line learning algorithm that requires all observations used in training to be held in memory. In a model of classical conditioning, this is an unrealistic assumption about animals' memory capabilities. We therefore require an online learning scheme for the hidden Markov model, with only limited memory requirements.
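Equation (1) translates directly into code. The sketch below assumes the parameter layout described above, with one fixed and one adaptive diagonal Gaussian per state, and is illustrative only.

```python
import numpy as np

# A direct transcription of equation (1): each state's observation
# density is a mixture of a fixed "no stimulus" Gaussian and an adaptive
# Gaussian, with diagonal covariances so the density factors over
# stimulus dimensions.

def obs_density(x, w_i, mu0, sig2_0, mu1, sig2_1):
    """p(x | s = i) for one state. x, the means, and the variances are
    all length-n stimulus-space vectors."""
    def diag_gaussian(x, mu, sig2):
        # product of independent 1-D normal densities over dimensions d
        return np.prod(np.exp(-0.5 * (x - mu) ** 2 / sig2) /
                       np.sqrt(2.0 * np.pi * sig2))
    return ((1.0 - w_i) * diag_gaussian(x, mu0, sig2_0)
            + w_i * diag_gaussian(x, mu1, sig2_1))
```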
Recursive Bayesian inference is one possible online learning scheme. It offers the appealing property of combining prior beliefs about the world with current observations through the recursive application of Bayes' theorem,

$$p(\lambda \mid X^t) \propto p(x^t \mid X^{t-1}, \lambda)\, p(\lambda \mid X^{t-1}).$$

The prior distribution $p(\lambda \mid X^{t-1})$ reflects the belief over the parameter $\lambda$ before the observation at time $t$, $x^t$. $X^{t-1}$ is the observation history up to time $t-1$, i.e. $X^{t-1} = \{x^{t-1}, x^{t-2}, \ldots\}$. The likelihood $p(x^t \mid X^{t-1}, \lambda)$ is the probability density over $x^t$ as a function of the parameter $\lambda$.

Unfortunately, the implementation of exact recursive Bayesian inference for a continuous density hidden Markov model (CDHMM) is computationally intractable. This is a consequence of there being missing data in the form of hidden state. With hidden state, the posterior distribution over the model parameters, after the observation, is given by

$$p(\lambda \mid X^t) \propto \sum_{i=1}^{N} p(x^t \mid s^t = i, X^{t-1}, \lambda)\, p(s^t = i \mid X^{t-1}, \lambda)\, p(\lambda \mid X^{t-1}), \qquad (2)$$

where we have summed over the N hidden states. Computing the recursion for multiple time steps results in an exponentially growing number of terms contributing to the exact posterior.

We instead use a recursive quasi-Bayes approximate inference scheme developed by Huo and Lee [5], who employ a quasi-Bayes approach [12]. The quasi-Bayes approach exploits the existence of a repeating distribution (natural conjugate) over the parameters for the complete-data CDHMM (i.e. where missing data such as the state sequence are taken to be known). Briefly, we estimate the value of the missing data. We then use these estimates, together with the observations, to update the hyperparameters governing the prior distribution over the parameters (using Bayes' theorem). This results in an approximation to the exact posterior distribution over CDHMM parameters within the conjugate family of the complete-data CDHMM. See [5] for a more detailed description of the algorithm.

Estimating the missing data (hidden state) involves estimating transition probabilities between states, $\xi_{ij}^{\tau} = \Pr(s^{\tau} = i, s^{\tau+1} = j \mid X^t, \hat{\lambda})$, and joint state and mixture component label probabilities, $\zeta_{ik}^{\tau} = \Pr(s^{\tau} = i, l^{\tau} = k \mid X^t, \hat{\lambda})$. Here $l^{\tau} = k$ is the mixture component label indicating which Gaussian, $k \in \{0, 1\}$, is the source of the stimulus observation at time $\tau$. $\hat{\lambda}$ is the current estimate of all model parameters.

We use an online version of the forward-backward algorithm [6] to estimate $\xi_{ij}^{\tau}$ and $\zeta_{ik}^{\tau}$. The forward pass computes the joint probability over state occupancy (taken to be both the state value and the mixture component label) at time $\tau$ and the sequence of observations up to time $\tau$. The backward pass computes the probability of the observations in a memory buffer from time $\tau$ to the present time $t$ given the state occupancy at time $\tau$. The forward and backward passes over state/observation sequences are combined to give an estimate of the state occupancy at time $\tau$ given the observations up to the present time $t$. In the simulations reported here the memory buffer was 7 time steps long ($t - \tau = 6$).

We use the estimates from the forward-backward algorithm together with the observations to update the hyperparameters. For the CDHMM, this prior is taken to be a product of Dirichlet probability density functions (pdfs) for the transition probabilities ($a_{ij}$), beta pdfs for the observation model mixture coefficients ($w_i$), and normal-gamma pdfs for the Gaussian parameters ($\mu_{i1}$ and $\sigma^2_{i1}$). The basic hyperparameters are exponentially weighted counts of events, with recency weighting determined by a forgetting parameter $\rho$. For example, $\kappa_{ij}$ is the number of expected transitions observed from state $i$ to state $j$, and is used to update the estimate of parameter $a_{ij}$. The hyperparameter $\nu_{ik}$ estimates the number of stimulus observations in state $i$ credited to Gaussian $k$, and is used to update the mixture parameter $w_i$. The remaining hyperparameters $\psi$, $\phi$, and $\theta$ serve to define the pdfs over $\mu_{i1}$ and $\sigma^2_{i1}$. The variable $d$ in the equations below indexes over stimulus dimensions. $S_{i1d}$ is an estimate of the sample variance, and is a constant in the present model.

$$\kappa_{ij}^{\tau} = \rho\,(\kappa_{ij}^{\tau-1} - 1) + 1 + \xi_{ij}^{\tau}$$
$$\nu_{ik}^{\tau} = \rho\,(\nu_{ik}^{\tau-1} - 1) + 1 + \zeta_{ik}^{\tau}$$
$$\psi_{i1d}^{\tau} = \rho\,\psi_{i1d}^{\tau-1} + \zeta_{i1}^{\tau}$$
$$\phi_{i1d}^{\tau} = \rho\,(\phi_{i1d}^{\tau-1} - 1) + 1 + \zeta_{i1}^{\tau}$$
$$\theta_{i1d}^{\tau} = \rho\,\theta_{i1d}^{\tau-1} + \frac{\zeta_{i1}^{\tau} S_{i1d}}{2} + \frac{\rho\,\psi_{i1d}^{\tau-1}\,\zeta_{i1}^{\tau}}{2\,(\rho\,\psi_{i1d}^{\tau-1} + \zeta_{i1}^{\tau})}\,(x_d^{\tau} - \mu_{i1d}^{\tau-1})^2$$

In the last step of our inference procedure, we update our estimate of the model parameters as the mode of their approximate posterior distribution. While this is an approximation to proper Bayesian inference on the parameter values, the mode of the approximate posterior is guaranteed to converge to a mode of the exact posterior. For example, the modes of the beta and Dirichlet posteriors give

$$w_i^{\tau} = \frac{\nu_{i1}^{\tau} - 1}{\nu_{i0}^{\tau} + \nu_{i1}^{\tau} - 2}, \qquad a_{ij}^{\tau} = \frac{\kappa_{ij}^{\tau} - 1}{\sum_{j'=1}^{N} \kappa_{ij'}^{\tau} - N},$$

where $N$ is the number of states in the model.
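The recursions and mode estimates above can be collected into a single update step per state. The sketch below is our reconstruction; in particular, the update for the mean hyperparameter mu is a standard normal-gamma step that we assume here, since the paper's corresponding equation did not survive extraction.

```python
import numpy as np

# A compact rendering of the hyperparameter recursions for a single state
# i, with the forward-backward estimates xi (transition credits to each
# successor j) and zeta (credits to components k = 0, 1) supplied by the
# caller. A sketch, not the authors' code.

def quasi_bayes_update(h, xi, zeta, x, rho=0.9975, S=0.2):
    """h: dict of hyperparameters for state i -- 'kappa' (per successor),
    'nu' (length 2), and 'psi', 'phi', 'theta', 'mu' (per stimulus
    dimension, adaptive component only). x: current stimulus vector."""
    h["kappa"] = rho * (h["kappa"] - 1.0) + 1.0 + xi
    h["nu"] = rho * (h["nu"] - 1.0) + 1.0 + zeta
    psi_old = h["psi"].copy()
    h["psi"] = rho * psi_old + zeta[1]
    h["phi"] = rho * (h["phi"] - 1.0) + 1.0 + zeta[1]
    h["theta"] = (rho * h["theta"] + 0.5 * zeta[1] * S
                  + rho * psi_old * zeta[1] * (x - h["mu"]) ** 2
                  / (2.0 * (rho * psi_old + zeta[1])))
    # Assumed normal-gamma mean step: credit-weighted blend of old mean
    # and the current observation (not verbatim from the paper).
    h["mu"] = (rho * psi_old * h["mu"] + zeta[1] * x) / h["psi"]
    # Point estimates: modes of the approximate beta/Dirichlet posteriors.
    w_i = (h["nu"][1] - 1.0) / (h["nu"][0] + h["nu"][1] - 2.0)
    a_ij = (h["kappa"] - 1.0) / np.sum(h["kappa"] - 1.0)
    return w_i, a_ij
```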
4 Results and Discussion

The model contained two timelines (Markov chains). Let $i$ denote the holding state and $j, k$ the initial states of the two chains. The transition probabilities were initialized as $a_{ij} = a_{ik} = 0.025$ and $a_{ii} = 0.95$. Adaptive Gaussian means $\mu_{i1d}$ were initialized to small random values around a baseline of $10^{-4}$ for all states. The exponential forgetting factor was $\rho = 0.9975$, and both the sample variances $S_{i1d}$ and the fixed variances $\sigma^2_{i0d}$ were set to 0.2.

We trained the model on each of the experimental protocols of Table 1, using the same numbers of trials reported in the original papers. The model was run continuously through both phases of the experiments with a random intertrial interval.

[Figure 2: three bar plots, one per experiment, showing the US prediction for the test stimuli: T (noCR) vs. C (CR) in Experiment 1; C in variant A (noCR) vs. C in variant B (CR) in Experiment 2; and X (noCR) vs. B (CR) in Experiment 3.]

Figure 2: Results from 20 runs of the model simulation with each experimental paradigm. On the ordinate is the total reinforcement (US), on a log scale, above the baseline (an arbitrary perception threshold) expected to occur on the next time step. The error bars represent two standard deviations away from the mean.
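For completeness, one possible readout of the quantity plotted on the ordinate of Figure 2, assuming a filtered (forward) belief over states, is sketched below; both the function and the readout itself are our own illustrative construction, following the assumption discussed next that the CR tracks the US prediction.

```python
import numpy as np

# An illustrative readout: the US intensity expected on the next time
# step, given the current filtered belief over states.

def expected_us_next_step(alpha, A, w, mu1, us_dim):
    """alpha: belief over states (filtered forward probabilities);
    A: transition matrix; w: per-state mixture coefficients; mu1:
    per-state adaptive Gaussian means; us_dim: US stimulus dimension."""
    belief_next = alpha @ A            # one-step prediction of occupancy
    # Expected US level per state; component 0 sits at the no-stimulus
    # baseline, so it is treated as contributing (approximately) nothing.
    exp_us = w * mu1[:, us_dim]
    return float(belief_next @ exp_us)
```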
\n\nFigure  2  shows  t he  simulation  results  from  each  of the  three  experiments.  If we \nassume that the CR varies monotonically with the US  prediction, then in each case, \nt he  model's  predicted CR agreed with the observations of Miller et al. \n\nThe CR predictions are the result of the model integrating t he two phases of learning \ninto one t imeline.  At the time of the presentation of the Phase 2 stimuli, the states \nforming  the  timeline  describing  the  Phase  1  pattern  of stimuli  were  judged  more \nlikely to have produced the Phase 2 stimuli than states in the other timeline,  which \nserved as  a  null hypothesis.  In another experiment, not shown here, we  trained the \nmodel  on  disjoint  stimuli  in the  two  phases.  In  that  situation it  correctly  chose  a \nseparate timeline for  each  phase, rather than merging the two. \n\nWe  have  shown that under the assumption t hat observation probabilities are mod(cid:173)\neled  by a  mixture of Gaussians,  and a  very restrictive state transition structure, a \nhidden Markov model can integrate the memory representations of similar temporal \nstimulus  patterns.  \"Similarity\"  is  formalized  in this framework as likelihood under \nthe t imeline  model.  We  propose t his  model  as  a  mechanism for the  integration of \nmemory representations postulated in the Temporal  Coding Hypothesis. \n\nThe model can be extended in many ways.  The current version assumes t hat event \nchains are long enough to represent an entire trial, but short enough that the model \nwill  return  to  the  holding  state  before  the  start  of  the  next  trial.  An  obvious \nrefinement  would  be  a  mechanism  to  dynamically  adjust  chain  lengths  based  on \nexperience.  We are also exploring a generalization of the model to the semi-Markov \ndomain, where state occupancy duration is modeled explicitly as  a  pdf.  State tran(cid:173)\nsitions  would then be tied to changes in observations, rather than following  a  rigid \nprogression as is currently the case.  Finally, we  are experimenting with mechanisms \nthat allow new  chains to be split  off from  old ones when the model determines that \ncurrent stimuli differ  consistently from  t he  closest matching timeline. \n\nFitting stimuli  into  existing  timelines  serves  to maximize the likelihood of current \nobservations in light  of past experience.  But why should animals learn the temporal \nstructure  of  stimuli  as  timelines?  A  collection  of timelines  may  be  a  reasonable \nmodel of the natural world.  If t his is true, t hen learning with such a strong inductive \nbias may help t he animal to bring experience of related phenomena to bear in novel \nsit uations- a  desirable characteristic for an adaptive  system in  a  changing  world. \n\n\fAcknowledgments \n\nThanks to  Nathaniel  Daw  and  Ralph Miller  for  helpful  discussions.  This  research \nwas  funded  by  National  Science  Foundation  grants  IRI-9720350  and  IIS-997S403. \nAaron Courville was  funded  in part by a  Canadian NSERC  PGS B fellowship. \n\nReferences \n\n[1]  R.  C.  Barnet,  H.  M.  Arnold,  and  R.  R.  Miller.  Simultaneous  conditioning  demon(cid:173)\n\nstrated  in  second-order  conditioning:  Evidence  for  similar  associative  structure  in \nforward  and simultaneous conditioning.  Learning  and  Motivation,  22:253- 268,  1991. \n\n[2]  R.  P.  Cole, R.  C.  Barnet, and R.  R . Miller.  Temporal encoding in trace conditioning. 
\n\nAnimal Learning  and  Behavior, 23(2) :144- 153,  1995. \n\n[3]  R.  P.  Cole  and R.  R.  Miller.  Conditioned excitation  and conditioned  inhibition  ac(cid:173)\nquired through backward conditioning.  Learning  and  Motivation, 30:129- 156,  1999. \n\n[4]  P.  Dayan.  Improving  generalization  for  temporal  difference  learning:  the  successor \n\nrepresentation.  Neural  Computation,  5:613- 624,  1993. \n\n[5]  Q.  Huo  and  C.-H.  Lee.  On-line  adaptive  learning  of the  continuous  density  hidden \nMarkov  model  based on  approximate  recursive  Bayes  estimate.  IEEE  Transactions \non  Speech  and  Audio  Processing,  5(2):161- 172,  1997. \n\n[6]  V .  Krishnamurthy  and  J .  B.  Moore.  On-line  estimation  of  hidden  Markov  model \nparameters based on  the Kullback-Leibler information  measure.  IEEE  Transactions \non  Signal  Processing,  41(8):2557- 2573,  1993. \n\n[7]  L.  D. Matzel, F.  P.  Held, and R.  R.  Miller.  Information  and the expression  of simul(cid:173)\n\ntaneous and backward associations:  Implications for  contiguity  theory.  Learning  and \nMotivation,  19:317- 344,  1988. \n\n[8]  R.  R.  Miller  and R .  C. Barnet.  The role  of time in elementary associations.  Current \n\nDirections  in Psychological  Science,  2(4):106- 111,  1993. \n\n[9]  1.  P.  Pavlov.  Conditioned  Reflexes.  Oxford University Press,  1927. \n\n[10]  L.  R.  Rabiner.  A  tutorial  on  hidden  Markov  models  and  selected  applications  III \n\nspeech recognition.  Proceedings  of the  IEEE, 77(2) :257- 285,  1989. \n\n[11]  R.  A.  Rescorla and A. R.  Wagner.  A  theory of Pavlovian conditioning:  Variations in \nthe effectiveness  of reinforcement  and  nonreinforcement.  In  A.  H.  Black  and W.  F. \nProkasy, editors,  Classical  Conditioning II.  Appleton-Century-Crofts, 1972. \n\n[12]  A. F .  M.  Smith and U. E.  Makov.  A  quasi-Bayes sequential  procedure for  mixtures. \n\nJournal  of th e Royal Statistical  Society, 40(1):106- 112,  1978. \n\n[13]  R.  E.  Suri and W. Schultz.  Temporal difference model reproduces anticipatory neural \n\nactivity.  Neural  Computation, 13(4):841- 862,  200l. \n\n[14]  R.  S.  Sutton and A. G.  Barto.  Time-derivative models of Pavlovian reinforcement.  In \n\nM.  Gabriel  and J.  Moore, editors,  Learning  and  Computational  Neuroscience:  Foun(cid:173)\ndations  of Adaptive  N etworks, chapter 12, pages 497- 537.  MIT  Press,  1990. \n\n[15]  R.  S.  Sutton and B.  Pinette.  The learning of world models by connectionist networks. \nIn L.  Erlbaum,  editor,  Proceedings  of the  seventh  annual  conference  of the  cognitive \nscience  society, pages  54- 64,  Irvine,  California,  August  1985. \n\n\f", "award": [], "sourceid": 2026, "authors": [{"given_name": "Aaron", "family_name": "Courville", "institution": null}, {"given_name": "David", "family_name": "Touretzky", "institution": null}]}