{"title": "Foraging in an Uncertain Environment Using Predictive Hebbian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 598, "page_last": 605, "abstract": null, "full_text": "Foraging in an Uncertain Environment Using \n\nPredictive Hebbian Learning \n\nP.  Read Montague:  Peter Dayan, and  Terrence J. Sejnowski \n\nComputational Neurobiology Lab, The Salk Institute, \n\n100 ION. Torrey Pines Rd, \nLa Jolla, CA, 92037, USA \nread~bohr.bcm.tmc.edu \n\nAbstract \n\nSurvival  is  enhanced  by  an  ability  to  predict  the  availability  of food, \nthe  likelihood of predators,  and  the  presence  of mates.  We  present  a \nconcrete model that uses diffuse neurotransmitter systems to implement \na  predictive version  of a  Hebb  learning  rule  embedded  in  a  neural  ar(cid:173)\nchitecture based  on  anatomical  and  physiological studies on  bees.  The \nmodel captured the strategies seen in the behavior of bees and a number of \nother animals when foraging in an uncertain environment.  The predictive \nmodel  suggests a unified  way  in which  neuromodulatory influences can \nbe used to bias actions and control synaptic plasticity. \n\nSuccessful predictions enhance adaptive behavior by allowing organisms to prepare for fu(cid:173)\nture actions, rewards, or punishments.  Moreover, it is possible to improve upon behavioral \nchoices  if the consequences  of executing different actions can  be  reliably predicted.  Al(cid:173)\nthough classical and instrumental conditioning results from the psychological literature [1] \ndemonstrate that the vertebrate brain is capable of reliable prediction, how these predictions \nare computed in brains is not yet known. \n\nThe  brains  of  vertebrates  and  invertebrates  possess  small  nuclei  which  project  axons \nthroughout large  expanses  of target  tissue  and  deliver  various  neurotransmitters  such  as \ndopamine, norepinephrine, and acetylcholine [4].  The activity in these systems may report \non reinforcing stimuli in the world or may reflect an expectation of future reward [5, 6,7,8]. \n\n*Division of Neuroscience, Baylor College of Medicine, Houston, TX 77030 \n\n598 \n\n\fForaging in an Uncertain Environment Using Predictive Hebbian Learning \n\n599 \n\nA particularly striking example is that of the honeybee.  Honeybees can be conditioned to \na sensory stimulus such as a color, visual pattern, or an odorant when the sensory stimulus \nis  paired with application of sucrose to the antennae  or  proboscis.  An  identified neuron, \nVUMmxl,  projects  widely  throughout the  entire  bee  brain,  becomes  active  in  response \nto  sucrose,  and  its  firing  can  substitute for  the  unconditioned  odor stimulus  in  classical \nconditioning experiments  [8].  Similar diffusely  projecting neurons in  the bee brain  may \nsubstitute for reward  when paired with a visual stimulus. \n\nIn this paper, we suggest a role for diffuse neurotransmitter systems in learning and behavior \nthat is  analogous to the function we previously postulated for them in developmental self(cid:173)\norganization[3,  2].  
In this paper, we suggest a role for diffuse neurotransmitter systems in learning and behavior that is analogous to the function we previously postulated for them in developmental self-organization [3, 2]. Specifically, we: (i) identify a neural substrate/architecture which is known to exist in both vertebrates and invertebrates and which delivers information to widespread regions of the brain; (ii) describe an algorithm that is both mathematically sound and biologically feasible; and (iii) show that a version of this local algorithm, in the context of the neural architecture, reproduces the foraging and decision behavior observed in bumble bees and a number of other animals.

Our premise is that the predictive relationships between sensory stimuli and rewards are constructed through these diffuse systems and are used to shape both ongoing behavior and reward-dependent synaptic plasticity. We illustrate this using a simple example from the ethological literature for which constraints are available at a number of different levels.

A Foraging Problem

Real and colleagues [9, 10] performed a series of experiments on bumble bees foraging on artificial flowers whose colors, blue and yellow, predicted the delivery of nectar. They examined how bees respond to the mean and variability of this reward delivery in a foraging version of a stochastic two-armed bandit problem [11]. All the blue flowers contained 2 µl of nectar, 1/3 of the yellow flowers contained 6 µl, and the remaining 2/3 of the yellow flowers contained no nectar at all. In practice, 85% of the bees' visits were to the constant-yield blue flowers despite the equivalent mean return from the more variable yellow flowers. When the contingencies for reward were reversed, the bees switched their preference for flower color within 1 to 3 visits to flowers. They further demonstrated that the bees could be induced to visit the variable and constant flowers with equal frequency if the mean reward from the variable flower type was made sufficiently high.

This experimental finding shows that bumble bees, like honeybees, can learn to associate color with reward. Further, color and odor learning in honeybees has approximately the same time course as the shift in preference described above for the bumble bees [12]. It also indicates that under the conditions of a foraging task, bees prefer less variable rewards and compute the reward availability in the short term. This is a behavioral strategy utilized by a variety of animals under similar conditions for reward [9, 10, 13], suggesting a common set of constraints in the underlying neural substrate.
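As a quick check on this schedule, the short sketch below (our illustration; the variable names and the use of Python are ours, not the paper's) samples the two reward distributions and confirms that they share a 2 µl mean while only the yellow type is variable:

    import numpy as np

    rng = np.random.default_rng(0)

    def nectar_volume(color, rng):
        # Blue always yields 2 ul; 1/3 of yellow flowers yield 6 ul and
        # the remaining 2/3 yield nothing (Real's schedule [9, 10]).
        if color == "blue":
            return 2.0
        return 6.0 if rng.random() < 1.0 / 3.0 else 0.0

    for color in ("blue", "yellow"):
        s = np.array([nectar_volume(color, rng) for _ in range(100000)])
        print(color, round(s.mean(), 2), round(s.var(), 2))
    # blue: mean 2.0, variance 0.0; yellow: mean ~2.0, variance ~8.0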
The Model

Fig. 1 shows a diagram of the model architecture, which is based on the considerations above about diffuse systems. Sensory input drives the units 'B' and 'Y' representing blue and yellow flowers. These neurons (outputs x^B_t and x^Y_t respectively at time t) project through excitatory connection weights both to a diffusely projecting neuron P (weights w^B and w^Y) and to other processing stages which control the selection of actions such as steering in flight and landing. P receives additional input r_t through unchangeable weights. In the absence of nectar (r_t = 0), the net input to P becomes

    V_t = w_t · x_t = w^B_t x^B_t + w^Y_t x^Y_t.

Figure 1: Neural architecture showing how predictions about future expected reinforcement can be made in the brain using a diffuse neurotransmitter system [3, 2]. In the context of bee foraging [9], sensory input drives the units B and Y representing blue and yellow flowers. These units project to a reinforcement neuron P through a set of variable weights (filled circles w^B and w^Y) and to an action selection system. Unit S provides input to R and fires while the bee sips the nectar. R projects its output r_t through a fixed weight to P. The variable weights onto P implement predictions about future reward r_t (see text), and P's output is sensitive to temporal changes in its input. The output projections of P, δ_t (lines with arrows), influence learning and also the selection of actions such as steering in flight and landing, as in equation 5 (see text). Modulated lateral inhibition (dark circle) in the action selection layer symbolizes this. Before encountering a flower and its nectar, the output of P will reflect the temporal difference only between the sensory inputs B and Y. During an encounter with a flower and nectar, the prediction error δ_t is determined by the output of B or Y and R, and learning occurs at connections w^B and w^Y. These strengths are modified according to the correlation between presynaptic activity and the prediction error δ_t produced by neuron P, as in equation 3 (see text). Learning is restricted to visits to flowers [14].

The first assumption in the construction of this model is that learning (adjustment of weights) is contingent upon approaching and landing on a flower. This assumption is supported specifically by data from learning in the honeybee: color learning for flowers is restricted to the final few seconds prior to landing on the flower and experiencing the nectar [14].

This fact suggests a simple model in which the strengths of the variable connections w_t are adjusted according to a presynaptic correlational rule:

    w_{t+1} = w_t + α x_t r_t    (1)

where α is the learning rate [15]. There are two problems with this formulation: (i) learning would only occur about contingencies in the presence of a reinforcing stimulus (r_t ≠ 0); and (ii) there is no provision for allowing a sensory event to predict the future delivery of reinforcement. The latter problem makes equation 1 inconsistent with a substantial volume of data on classical and instrumental conditioning [16]. Adding a postsynaptic factor to equation 1 does not alter these conclusions [17].

Figure 2: Simulations of bee foraging behavior using predictive Hebbian learning. A) Reinforcement neuron output as a function of nectar volume for a fixed concentration of nectar [9, 10]. B) Proportion of visits to blue flowers. Each trial represents approximately 40 flower visits averaged over 5 real bees and exactly 40 flower visits for a single model bee. Trials 1-15 for the real and model bees had blue flowers as the constant type; the remaining trials had yellow flowers as constant. At the beginning of each trial, w^Y and w^B were set to 0.5, consistent with evidence that information from past foraging bouts is not used [14]. The real bees were more variable than the model bees; sources of stochasticity such as the two-dimensional feeding ground were not represented. The real bees also had a slight preference for blue flowers [21]. Note the slower drop for λ = 0.1 when the flowers are switched.

This inadequacy suggests another form of learning rule, and a model in which P has a direct input from r_t. Assume that the firing rate of P is sensitive only to changes in its input over time and habituates to constant or slowly varying input, like magnocellular ganglion cells in the retina [18]. Under this assumption, the output of P, δ_t, reflects a temporal derivative of its net input, approximated by:

    δ_t = r_t + γ V_t − V_{t−1}    (2)

where γ is a factor that controls the weighting of near against distant rewards. We take γ = 1 for the current discussion.

In the presence of the reinforcement, the weights w^B and w^Y are adjusted according to the simple correlational rule:

    w_{t+1} = w_t + α δ_t x_{t−1}    (3)

This permits the weights onto P to act as predictions of the expected reward consequent on landing on a flower, and can also be derived in a more general way for the prediction of future values of any scalar quantity [19].
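To make equations 2 and 3 concrete, here is a minimal sketch of a single update step, assuming the two sensory activities are held in an array x and the modifiable weights in w; the helper name td_step and the learning-rate value are our choices, since the text quotes no value for α:

    import numpy as np

    ALPHA = 0.1   # learning rate alpha (assumed; the paper quotes no value)
    GAMMA = 1.0   # gamma = 1, as in the text

    def td_step(w, x_prev, x_now, r_now, alpha=ALPHA, gamma=GAMMA):
        # Equation 2: P's output is a temporal difference of its net input.
        v_prev = np.dot(w, x_prev)               # V_{t-1}
        v_now = np.dot(w, x_now)                 # V_t
        delta = r_now + gamma * v_now - v_prev   # prediction error delta_t
        # Equation 3: correlate presynaptic activity with delta_t.
        w = w + alpha * x_prev * delta
        return w, delta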
Figure 3: Tradeoff between the mean and variance of nectar delivery. A) Method of selecting indifference points. The indifference point is taken as the first mean for a given variance (bold v in legend) for which a stochastic trial demonstrates the indifference. This method of calculation tends to bias the indifference points to the left. B) Indifference plot for model and real bees. Each point represents the (mean, variance) pair for which the bee sampled each flower type equally. The circles are for λ = 0.1 and the pluses are for λ = 0.9.

When the bee actually lands on a flower and samples the nectar, R influences the output of P through its fixed connection (Fig. 1). Suppose that just prior to sampling the nectar the bee switched to viewing a blue flower, for example. Then, since r_{t−1} = 0, δ_t would be r_t − x^B_{t−1} w^B_{t−1}. In this way, the term x^B_{t−1} w^B_{t−1} is a prediction of the value of r_t, and the difference r_t − x^B_{t−1} w^B_{t−1} is the error in that prediction. Adjusting the weight w^B according to the correlational rule in equation 3 allows the weight w^B, through P's outputs, to report to the rest of the brain the amount of reinforcement r_t expected from blue flowers when they are sensed.
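Continuing the sketch above (and again purely illustrative), repeated encounters with a 2 µl blue flower drive w^B toward the delivered reward, after which the prediction error reported by P decays to zero:

    w = np.array([0.5, 0.5])        # [w_B, w_Y], initialized as in Fig. 2
    x_blue = np.array([1.0, 0.0])   # viewing a blue flower just before landing
    x_none = np.array([0.0, 0.0])   # no flower in view after sampling
    for _ in range(50):
        w, delta = td_step(w, x_blue, x_none, r_now=2.0)
    print(w, delta)                 # w_B is near 2.0 and delta_t is near 0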
As the model bee flies between flowers, reinforcement from nectar is not present (r_t = 0) and δ_t is proportional to V_t − V_{t−1}. w^B and w^Y can again be used as predictions, but through modulation of action choice. For example, suppose the learning process in equation 3 sets w^Y less than w^B. In flight, switching from viewing yellow flowers to viewing blue flowers causes δ_t to be positive and biases the activity in any action selection units driven by outgoing connections from B. This makes the bee more likely than chance to land on or steer towards blue flowers. This discussion is not offered as an accurate model of action choice; rather, it simply indicates how output from a diffuse system could also be used to influence action choice.

The biological assumptions of this neural architecture are explicit: (i) the diffusely projecting neuron changes its firing according to the temporal difference in its inputs; (ii) the output of P is used to adjust its weights upon landing; and (iii) the output otherwise biases the selection of actions by modulating the activity of its target neurons.

For the particular case of the bee, both the learning rule described in equation 3 and the biasing of action selection described above can be further simplified for the purposes of a simple demonstration. As mentioned above, significant learning about a particular flower color may occur only in the 1-2 seconds just prior to an encounter [21, 14]. This is tantamount to restricting weight changes to each encounter with the reinforcer, which allows only the sensory input just preceding the delivery or non-delivery of r_t to drive synaptic plasticity. We therefore make the learning rule punctate, updating the weights on a flower-by-flower basis. During each encounter with the reinforcer in the environment, P produces a prediction error δ_t = r_t − V_{t−1}, where r_t is the actual reward at time t, and the last flower color seen by the bee at time t, say blue, causes a prediction V_{t−1} = w^B_{t−1} x^B_{t−1} of future reward r_t to be made through the weight w^B_{t−1} and the input activity x^B_{t−1}. The weights are then updated using a form of the delta rule [20]:

    w_t = w_{t−1} + λ δ_t x_{t−1}    (4)

where λ is a time constant and controls the rate of forgetting. In this rule, the weights from the sensory input onto P still mediate a prediction of r; however, the temporal component for choosing how to steer and when to land has been removed.

We model the temporal biasing of actions such as steering and landing with a probabilistic algorithm that uses the same weights onto P to choose which flower is actually visited on each trial. At each flower visit, the predictions are used directly to choose an action, according to:

    q(Y) = e^{µ w^Y x^Y} / (e^{µ w^B x^B} + e^{µ w^Y x^Y})    (5)

where q(Y) is the probability of choosing a yellow flower. Values of µ > 0 amplify the difference between the two predictions, so that larger values of µ make it more likely that the larger prediction will result in a choice of the associated flower color. In the limit as µ → ∞ this approaches a winner-take-all rule. In the simulations, µ was varied from 2.8 to 6.0 and comparable results were obtained. Changing µ alters the magnitude of the weights that develop onto neuron P, since different values of µ enforce different degrees of competition between the predictions.
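A minimal sketch of this punctate variant follows, again ours rather than the paper's: flower_update implements equation 4 for the single color just seen (so x = 1 and only that weight changes), and choose_flower implements equation 5:

    import numpy as np

    def choose_flower(w, mu, rng):
        # Equation 5: softmax over the two predictions; with x_B = x_Y = 1,
        # q(Y) = exp(mu*w_Y) / (exp(mu*w_B) + exp(mu*w_Y)).
        q_yellow = 1.0 / (1.0 + np.exp(mu * (w[0] - w[1])))
        return 1 if rng.random() < q_yellow else 0   # 0 = blue, 1 = yellow

    def flower_update(w, chosen, r, lam):
        # Equation 4: punctate delta rule; lambda sets the forgetting rate.
        delta = r - w[chosen]     # prediction error on this encounter
        w[chosen] += lam * delta
        return w, delta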
To apply the model to the foraging experiment, it is necessary to specify how the amount of nectar in a particular flower gets reported to P. We assume that the reinforcement neuron R delivers its signal r_t as a saturating function of nectar volume (Fig. 2A). Harder and Real [10] suggest just this sort of decelerating function of nectar volume and justify it on biomechanical grounds. Fig. 2B shows the behavior of model bees compared with that of real bees [9] in the experiment testing the extent to which they prefer a constant reward to a variable reward of the same long-term mean. Further details are presented in the figure legend.

The behavior of the model matched the observed data for λ = 0.9, suggesting that the real bee utilizes information over a small time window for controlling its foraging [9]. At this value of λ, the average proportion of visits to blue was 85% for the real bees and 83% for the model bees. The constant and variable flower types were switched at trial 15, and both real and model bees switched flower preference within 1 to 3 subsequent visits. The average proportion of visits to blue changed to 23% and 20%, respectively, for the real and model bee. Part of the reason for the real bees' apparent preference for blue may come from inherent biases: honeybees, for instance, are known to learn about shorter wavelengths more quickly than others [21]. In our model, λ is a measure of the length of time over which an observation exerts an influence on flower selection, rather than being a measure of the bee's time horizon in terms of the mean rate of energy intake [9, 10].

Real bees can be induced to forage equally on the constant and variable flower types if the mean reward from the variable type is made sufficiently large, as in Fig. 3B. For a given variance, the mean reward was increased until the bees appeared indifferent between the flowers. In this experiment, the constant flower type contained 0.5 µl of nectar. The data for the real bee is shown as points connected by a solid line in order to make clear the envelope of the real data. The indifference points for λ = 0.1 (circles) and λ = 0.9 (pluses) also demonstrate that a higher value of λ is again better at reproducing the bee's behavior. The model captured both the functional relationship and the spread of the real data.

The diffuse neurotransmitter system reports prediction errors to control learning and bias the selection of actions. Distributing such a signal diffusely throughout a large set of target structures permits this prediction error to influence learning generally, as a factor in a correlational or Hebbian rule. The same signal, in its second role, biases activity in an action selection system to favor rewarding behavior. In the model, construction of the prediction error requires only convergent input from sensory representations onto a neuron or neurons whose output is a temporal derivative of its input. The output of this neuron can also be used as a secondary reinforcer to associate other sensory stimuli with the predicted reward. We have shown how this relatively simple predictive learning system closely simulates the behavior of bumble bees in a foraging task.
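For completeness, the sketch below strings the pieces together into the reversal protocol of Fig. 2B, reusing choose_flower and flower_update from above. The saturating transform of nectar volume is an assumption on our part (the paper plots the function in Fig. 2A but does not print its form), as is the choice µ = 4.0 from the quoted range:

    rng = np.random.default_rng(1)

    def reinforcement(volume):
        # Decelerating (saturating) function of nectar volume, as in
        # Fig. 2A; this hyperbolic form is assumed, not from the paper.
        return volume / (volume + 2.0)

    for trial in range(30):
        w = np.array([0.5, 0.5])            # weights reset each trial (Fig. 2)
        constant = 0 if trial < 15 else 1   # blue constant for trials 1-15
        blue = 0
        for _ in range(40):                 # exactly 40 visits per model trial
            chosen = choose_flower(w, mu=4.0, rng=rng)
            if chosen == constant:
                r = reinforcement(2.0)                  # constant type: 2 ul
            elif rng.random() < 1.0 / 3.0:
                r = reinforcement(6.0)                  # variable type: 6 ul
            else:
                r = 0.0                                 # variable type: empty
            w, _ = flower_update(w, chosen, r, lam=0.9)
            blue += (chosen == 0)
        print(f"trial {trial + 1:2d}: {100 * blue / 40:3.0f}% visits to blue")

With λ = 0.9, the model bee concentrates its visits on the constant color and reverses its preference within a few visits of the switch at trial 15, qualitatively as in Fig. 2B.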
Acknowledgements

This work was supported by the Howard Hughes Medical Institute, the National Institute of Mental Health, the UK Science and Engineering Research Council, and computational resources from the San Diego Supercomputer Center. We would like to thank Patricia Churchland, Anthony Dayan, Alexandre Pouget, David Raizen, Steven Quartz and Richard Zemel for their helpful comments and criticisms.

References

[1] Konorski, J. Conditioned Reflexes and Neuron Organization. (Cambridge, England: Cambridge University Press, 1948).

[2] Quartz, SR, Dayan, P, Montague, PR, Sejnowski, TJ. (1992) Society for Neuroscience Abstracts, 18, 210.

[3] Montague, PR, Dayan, P, Nowlan, SJ, Pouget, A, Sejnowski, TJ. (1993) In Advances in Neural Information Processing Systems 5, SJ Hanson, JD Cowan, CL Giles, editors, (San Mateo, CA: Morgan Kaufmann), pp. 969-976.

[4] Morrison, JH and Magistretti, PJ. Trends in Neurosciences, 6, 146 (1983).

[5] Wise, RA. Behavioral and Brain Sciences, 5, 39 (1982).

[6] Cole, BJ and Robbins, TW. Neuropsychopharmacology, 7, 129 (1992).

[7] Schultz, W. Seminars in the Neurosciences, 4, 129 (1992).

[8] Hammer, M, thesis, FU Berlin (1991).

[9] Real, LA. Science, 253, 980 (1991).

[10] Real, LA. Ecology, 62, 20 (1981); Harder, LD and Real, LA. Ecology, 68(4), 1104 (1987); Real, LA, Ellner, S, Harder, LD. Ecology, 71(4), 1625 (1990).

[11] Berry, DA and Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. (London, England: Chapman and Hall, 1985).

[12] Gould, JL. In Foraging Behavior, AC Kamil, JR Krebs and HR Pulliam, editors, (New York, NY: Plenum, 1987), p. 479.

[13] Krebs, JR, Kacelnik, A, Taylor, P. Nature, 275, 27 (1978); Houston, A, Kacelnik, A, McNamara, J. In Functional Ontogeny, D McFarland, editor, (London: Pitman, 1982).

[14] Menzel, R and Erber, J. Scientific American, 239(1), 102 (1978).

[15] Carew, TJ, Hawkins, RD, Abrams, TW and Kandel, ER. Journal of Neuroscience, 4(5), 1217 (1984).

[16] Mackintosh, NJ. Conditioning and Associative Learning. (Oxford, England: Oxford University Press, 1983); Sutton, RS and Barto, AG. Psychological Review, 88(2), 135 (1981); Sutton, RS and Barto, AG. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA (1987).

[17] Reeke, GN, Jr and Sporns, O. Annual Review of Neuroscience, 16, 597 (1993).

[18] Dowling, JE. The Retina. (Cambridge, MA: Harvard University Press, 1987).
[19] The overall algorithm is a temporal difference (TD) learning rule and is related to an algorithm Samuel devised for teaching a checker-playing program: Samuel, AL. IBM Journal of Research and Development, 3, 211 (1959). It was first suggested in its present form in Sutton, RS, thesis, University of Massachusetts (1984); Sutton and Barto [16] showed how it could be used for classical conditioning; Barto, AG, Sutton, RS and Anderson, CW. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834 (1983) used a variant of it in a form of instrumental conditioning task; Barto, AG, Sutton, RS, Watkins, CJCH, Technical Report 89-95, (Computer and Information Science, University of Massachusetts, Amherst, MA, 1989) and Barto, AG, Bradtke, SJ, Singh, SP, Technical Report 91-57, (Computer and Information Science, University of Massachusetts, Amherst, MA, 1991) showed its relationship to dynamic programming, an engineering method of optimal control.

[20] Rescorla, RA and Wagner, AR. In Classical Conditioning II: Current Research and Theory, AH Black and WF Prokasy, editors, (New York, NY: Appleton-Century-Crofts, 1972), p. 64; Widrow, B and Stearns, SD. Adaptive Signal Processing, (Englewood Cliffs, NJ: Prentice-Hall, 1985).

[21] Menzel, R, Erber, J and Masuhr, J. In Experimental Analysis of Insect Behavior, LB Browne, editor, (Berlin, Germany: Springer-Verlag, 1974), p. 195.