{"title": "Dopamine Bonuses", "book": "Advances in Neural Information Processing Systems", "page_first": 131, "page_last": 137, "abstract": null, "full_text": "Dopamine Bonuses \n\nSham Kakade \n\nPeter Dayan \n\nGatsby Computational Neuroscience Unit \n\n17 Queen Square, London, England, WC1N 3AR. \n\nsham@gat sby.u c l. ac . uk \n\nda y a n @gat sby.u c l. ac .uk \n\nAbstract \n\nSubstantial data support a temporal difference (TO) model of \ndopamine (OA) neuron activity in which the cells provide a global \nerror signal for reinforcement learning. However, in certain cir(cid:173)\ncumstances, OA activity seems anomalous under the TO model, \nresponding to non-rewarding stimuli. We address these anoma(cid:173)\nlies by suggesting that OA cells multiplex information about re(cid:173)\nward bonuses, including Sutton's exploration bonuses and Ng et \nal's non-distorting shaping bonuses. We interpret this additional \nrole for OA in terms of the unconditional attentional and psy(cid:173)\nchomotor effects of dopamine, having the computational role of \nguiding exploration. \n\n1 \n\nIntroduction \n\nMuch evidence suggests that dopamine cells in the primate midbrain play an im(cid:173)\nportant role in reward and action learning. Electrophysiological studies support \na theory that OA cells signal a global prediction error for summed future reward \nin appetitive conditioning tasks (Montague et al, 1996; Schultz et al, 1997), in the \nform of a temporal difference prediction error term. This term can simultaneously \nbe used to train predictions (in the model, the projections of the OA cells in the \nventral tegmental area to the limbic system and the ventral striatum) and to train \nactions (the projections of OA cells in the substantia nigra to the dorsal striatum \nand motor and premotor cortex). Appetitive prediction learning is associated with \nclassical conditioning, the task of learning which stimuli are associated with re(cid:173)\nward; appetitive action learning is associated with instrumental conditioning, the \ntask of learning actions that result in reward delivery. \nThe computational role of dopamine in reward learning is controversial for two \nmain reasons (Ikemoto & Panksepp, 1999; Redgrave et al, 1999). First, stimuli that \nare not associated with reward prediction are known to activate the dopamine sys(cid:173)\ntem persistently, including in particular stimuli that are novel and salient, or that \nphysically resemble other stimuli that do predict reward (Schultz, 1998). Second, \ndopamine release is associated with a set of motor effects, such as species- and \nstimulus-specific approach behaviors, that seem either irrelevant or detrimental to \nthe delivery of reward. We call these unconditional effects. \nIn this paper, we study this apparently anomalous activation of the OA system, \nsuggesting that it multiplexes information about bonuses, potentially including ex(cid:173)\nploration bonuses (Sutton, 1990; Dayan & Sejnowski, 1996) and shaping bonuses \n(Ng et al, 1999), on top of reward prediction errors. These responses are associated \nwith the unconditional effects of OA, and are part of an attentional system. \n\n\fA \n.' \\\" \",L,.I\\ ,,11,1 +'lj,.'-4'I~~II/\"\"\"\"\"\" \n\n+ \nlight \n\nB \n\"'1\"'oApIL\"\", \u2022\u2022 ~alijwb\"\"'Jo.''''',jJ,'Ll.u. \nr\n\n. ~ \n\n' \nIIlf!t \n\nro\"Nd \n\ni \n\n111.1.1 .. . I~ .. ~ \"d~~L.II. \".11 11 11,,1, \n\n+ \nraward \n\n7oom. \n\n\u00b74ooms \n\nD \n\n\"\" \u2022 \u2022 ' \n\n, 1 .... + ,i..IeI ............... 
[Figure 1: peri-stimulus rasters and histograms of individual DA-cell activity in panels A-E, aligned to events labeled light, reward, cue+/- and door+/-; 300 ms bins.]

Figure 1: Activity of individual DA neurons, though substantial data suggest the homogeneous character of these responses (Schultz, 1998). See text for description. The latency and duration of the DA activation are about 100 ms. The depression has a duration of about 200 ms. The baseline spike rate is about 2-4 Hz. Adapted from Schultz et al (1990, 1992, 1993) and Jacobs et al (1997).

2 DA Activity

Figure 1 shows three different types of dopamine responses that have been observed by Schultz et al and Jacobs et al. Figures 1A;B show the response to a conditioned stimulus that becomes predictive of reward (CS+). In early trials (figure 1A), there is no, or only a weak, response to the CS+, but a strong response just after the time of delivery of the reward. In later trials (figure 1B), after learning is complete (but before overtraining), the DA cells are activated in response to the stimulus, and fire at background rates to the reward. Indeed, if the reward is omitted, there is depression of DA activity at just the time at which the reward used to excite the cells in early trials. These are the key data for which the temporal difference model accounts. Under the model, the cells report the temporal difference (TD) error for reward, ie the difference between the amount of reward that is delivered and the amount that is expected. Let r(t) be the amount of reward received at time t and v(t) be the prediction of the sum total (undiscounted) reward to be delivered in a trial after time t:

    v(t) ≈ Σ_{τ≥0} r(t + τ).    (1)

The TD component of the dopamine activity is the prediction error:

    δ(t) = r(t) + v(t + 1) − v(t),    (2)

which uses r(t) + v(t + 1) as an estimate of Σ_{τ≥0} r(t + τ), so that the TD error is an estimate of Σ_{τ≥0} r(t + τ) − v(t). Provided that the information about state includes information about how much time has elapsed since the CS+ was presented (which must be available because of the precisely timed nature of the inhibition at the time of reward when the expected reward is not presented), this model accounts well for the results in figure 1A.

The general framework of reinforcement learning methods for Markov decision problems (MDPs) extends these results to the case of control. An MDP consists of states, actions, transition probabilities between states under the chosen action, and the rewards associated with these transitions. The goal of a subject solving an MDP is to find a policy (a choice of action in each state) that optimizes the sum total reward it receives. The TD error δ(t) can be used to learn optimal policies by implementing a form of policy iteration, an optimal control technique that is standard in engineering (Sutton & Barto, 1998; Bertsekas & Tsitsiklis, 1996).
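To make equations 1 and 2 concrete, the following is a minimal tabular TD(0) sketch; the trial length, reward time, learning rate, and the device of clamping the pre-stimulus prediction at zero (standing in for an unpredictable stimulus onset) are illustrative assumptions, not details of the recordings or of the paper's own simulations.

```python
import numpy as np

# Minimal tabular TD(0) sketch of equations (1) and (2). States index time
# since stimulus onset (t = 0); reward arrives at a fixed delay. All the
# parameters below are illustrative choices.
T_TRIAL, T_REWARD, LR, N_TRIALS = 10, 6, 0.3, 60

v = np.zeros(T_TRIAL + 1)          # v(t): predicted summed future reward

def trial_deltas(rewarded=True):
    """TD errors (equation 2) across one trial, given the current v."""
    r = np.zeros(T_TRIAL)
    if rewarded:
        r[T_REWARD] = 1.0
    return [r[t] + v[t + 1] - v[t] for t in range(T_TRIAL)]

print(np.round(trial_deltas(), 2))       # first trial: delta peaks at the reward

for _ in range(N_TRIALS):
    for t, delta in enumerate(trial_deltas()):
        if t > 0:                  # v(0) stays 0: the stimulus onset itself
            v[t] += LR * delta     # cannot be predicted before it occurs

print(np.round(trial_deltas(), 2))       # after learning: the peak moves to onset
print(np.round(trial_deltas(False), 2))  # omitted reward: depression at T_REWARD
```

Run as-is, this reproduces the qualitative pattern of figures 1A;B: an error at the reward early on, a transfer of the response to the stimulus after training, and a negative error when a predicted reward is withheld.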
Figures 1C;D show that reporting a prediction error for reward does not exhaust the behavioral repertoire of the DA cells. Figure 1C shows responses to salient, novel stimuli. The dominant effect is a phasic activation of the dopamine cells followed by a phasic inhibition, both locked to the stimulus. These novelty responses decrease over trials, but quite slowly for very salient stimuli (Schultz, 1998). In some cases, particularly in early trials of appetitive learning (figure 1A, top), there seems to be little or no phasic inhibition of the cells following the activation. Figure 1D shows what happens when a stimulus (door−) that resembles a reward-predicting stimulus (door+) is presented without reinforcement. Again, a phasic increase over baseline followed by a depression is seen (lower 1D). However, unlike the case in figure 1B, there is no persistent reward prediction, since if a reward is subsequently delivered (unexpectedly), the cells become active (not shown) (Schultz, 1998).

3 Multiplexing and reward distortion

The most critical issue is whether it is possible to reconcile the behavior of the DA cells seen in figures 1C;D with the putative computational role of DA in terms of reporting prediction error for reward. Intuitively, these apparently anomalous responses are benign, that is, they do not interfere with the end point of normal reward learning, provided that they sum to zero over a trial.

To see this, consider what happens once learning is complete. If we sum the prediction error terms from equation 2, starting from the time of stimulus onset at t = 1, we get

    Σ_{t≥1} δ(t) = v(t_end) − v(1) + Σ_{t≥1} r(t)

where t_end is the time at the end of the trial. Assuming that v(t_end) = 0 and v(1) = 0, ie that the monkey confines its reward predictions to within a trial, we can see that any additional influences on δ(t) that sum to 0 preserve predicted sum future rewards. From figure 1, this seems true of the majority of the extra responses, ie anomalous activation is canceled by anomalous inhibition, though it is not true of the uncancelled DA responses shown in figure 1A (upper). Altogether, DA activity can still be used to learn predictions and choose actions, although it should not strictly be referred to solely in terms of prediction error for reward.
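Written out, the sum above is just a telescoping of equation 2 over the transitions of a trial:

```latex
\sum_{t=1}^{t_{\mathrm{end}}-1} \delta(t)
  = \sum_{t=1}^{t_{\mathrm{end}}-1} r(t)
  + \sum_{t=1}^{t_{\mathrm{end}}-1} \bigl[ v(t+1) - v(t) \bigr]
  = v(t_{\mathrm{end}}) - v(1) + \sum_{t=1}^{t_{\mathrm{end}}-1} r(t)
```

Any extra term b(t) added to δ(t) therefore shifts this sum by Σ_t b(t), and leaves the end point of learning untouched exactly when the b(t) cancel over the trial.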
Apart from the issue of anomalous activation that is not canceled (upper figure 1A), this leaves open two key questions: what drives the extra DA responses, and what effects do they have. We offer a set of possible interpretations (mostly associated with bonuses) that it is hard to decide between on the basis of current data.

4 Novelty and Bonuses

Three very different sorts of bonuses have been considered in reinforcement learning: novelty, shaping, and exploration bonuses. The presence of the first two of these is suggested by the responses in figure 1. Bonuses modify the reward signals and so change the course of learning. They are mostly used to guide exploration of the world, and are typically heuristic ways of addressing the computationally intractable exploration-exploitation dilemma.

[Figure 2: two rows of five plots of δ, against time t (0-20) within a trial and against trial number (0-20).]

Figure 2: Activity of the DA system given novelty bonuses. The plots show different aspects of the TD error δ as a function of time t within a trial (first three plots in each row) or as a function of trial number T (last two). Upper) A novelty signal was applied for just the first timestep of the stimulus and decayed hyperbolically with trial number as 1/T. Lower) A novelty signal was applied for the first two timesteps of the stimulus and decayed exponentially as e^{−0.3T}, to demonstrate that the precise form of decay is irrelevant. Trial numbers and times are shown in the plots. The learning rate was ε = 0.3.

We first consider a novelty bonus, which we take as a model for uncancelled anomalous activity. A novelty bonus is a value added to states or state-action pairs according to their unfamiliarity: novelty is made intrinsically rewarding. This is computationally reasonable, at least in moderation, and indeed it has become standard practice in reinforcement learning to use optimistic initial values for states to encourage systems to plan to get to novel or unfamiliar states. In TD terms, this is like replacing the true environmental reward r(t) at time t with

    r(t) → r(t) + n(x(t), T)

where x(t) is the state at time t and n(x(t), T) is the novelty of this state in trial T (an index we generally suppress). The effect on the TD error is then

    δ(t) = r(t) + n(x(t), T) + v(t + 1) − v(t).    (3)

The upper plots in figure 2 show the effect of including such a novelty bonus, in a case in which just the first timestep of a new stimulus in any given trial is awarded a novelty signal, which decays hyperbolically to 0 as the stimulus becomes more familiar. Here, a novel stimulus is presented for 25 trials without any reward consequences. The effect is just a positive signal that decreases over trials. Learning has no effect on this, since the stimulus cannot predict away a novelty signal that lasts only a single timestep. The lower plots in figure 2 show that it is possible to get partial apparent cancellation through learning, if the novelty signal is applied for the first two timesteps of a stimulus (for instance, if the novelty signal is computed relatively slowly). In this case, the initial effect is just a positive signal (leftmost plot), TD learning then gives it a negative transient after a few trials (second plot), and, as the novelty signal decays to 0, the effect goes away (third plot). The righthand plots show how δ(t) behaves across trials. If there were no learning, there would be no negative transient; the depression of the DA signal comes from the decay of the novelty bonuses.

Novelty bonuses are true bonuses in the sense that they actually distort the reward function. In particular, this means that we would not expect the sum of the extra TD error terms to be 0 across a trial. This property makes them useful, for instance, in distorting the optimal policy in Markov decision problems so that exploration is planned and executed at the expense of exploitation. However, they can be dangerous for exactly the same reason: there are reports of them leading to incorrect behavior, making agents search too much.
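The simulations of figure 2 can be sketched in the same tabular setting as before; the bonus durations and decay schedules below follow the caption, while the trial length and the remaining details are our illustrative choices.

```python
import numpy as np

# Sketch of the figure-2 setting: no environmental reward, a decaying novelty
# bonus n(x, T) on the stimulus, tabular v with the pre-stimulus state clamped.
T_TRIAL, LR, N_TRIALS = 10, 0.3, 25

def run(bonus_steps, decay):
    """Return the delta(t) trace of each trial under a decaying novelty bonus."""
    v = np.zeros(T_TRIAL + 1)
    traces = []
    for T in range(1, N_TRIALS + 1):
        trace = []
        for t in range(T_TRIAL):
            n = decay(T) if t < bonus_steps else 0.0
            delta = n + v[t + 1] - v[t]          # equation (3), with r(t) = 0
            trace.append(delta)
            if t > 0:
                v[t] += LR * delta
        traces.append(trace)
    return traces

upper = run(1, lambda T: 1.0 / T)           # one-timestep bonus, 1/T decay: a
                                            # pure positive signal, never learned away
lower = run(2, lambda T: np.exp(-0.3 * T))  # two-timestep bonus, exponential decay:
                                            # a learned negative transient appears
```

In the one-timestep case the prediction never moves, so the trace is just the decaying bonus; in the two-timestep case v learns to anticipate the second bonus timestep, then overshoots it as the bonus decays, producing the negative transient of the lower plots.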
[Figure 3: two rows of five plots of δ, in the same format as figure 2; vertical scale 0 to -1.]

Figure 3: Activity of the DA system given shaping bonuses (in the same format as figure 2). Upper) The plots show different aspects of the TD error δ as a function of time t within a trial (first three plots) or as a function of trial number T (last two). Here, the shaping bonus comes from a potential φ(x) = 1 for the first two timesteps a stimulus is presented within a trial (t = 1, 2), and 0 thereafter, irrespective of trial number. The learning rate was ε = 0.3. Lower) The same plots for ε = 0.

In answer to this concern, Ng et al (1999) invented the idea of non-distorting shaping bonuses. Ng et al's shaping bonuses are guaranteed not to distort optimal policies, although they can still change the exploratory behavior of agents. This guarantee comes because a shaping bonus is derived from a potential function φ(x) of a state, distorting the TD error to

    δ(t) = r(t) + φ(x(t + 1)) − φ(x(t)) + v(t + 1) − v(t).    (4)

The difference from the novelty bonus of equation 3 is that the bonus comes from the difference between the potential functions for one state and the previous state, and the bonuses thus cancel themselves out when summed over a trial. Shaping bonuses must remain constant for the guarantee about the policies to hold.

The upper plots in figure 3 show the effect of shaping bonuses on the TD error. Here, the potential function is set to the value 1 for the first two timesteps of a stimulus in a trial, and 0 otherwise. The most significant difference between shaping and novelty bonuses is that the former exhibit a negative transient even in the very first trial, whereas, for the latter, it is a learned effect. If the learning rate is non-zero, then shaping bonuses are exactly predicted away over the course of normal learning. Thus, even though the same bonus is provided on trial 25 as on trial 1, the TD error becomes 0, since the shaping bonus is predicted away. The dynamics of the decay shown in the last two plots are controlled by the learning rate for TD. The lower plots show what happens if learning is switched off at the time the shaping bonus is provided; this would be the case if the system responsible for computing the bonus takes effect before the inputs associated with the stimulus are plastic. In this case, the shaping bonus is preserved.
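A corresponding sketch of equation 4, again with illustrative parameters: the potential follows the figure 3 caption (on for the first two stimulus timesteps), and setting the learning rate to zero reproduces the preserved transient of the lower plots.

```python
import numpy as np

# Potential-based shaping (equation 4) in the same tabular setting: t = 0 is
# the pre-stimulus state, phi = 1 on the first two stimulus timesteps (t = 1, 2).
T_TRIAL, N_TRIALS = 10, 25
phi = np.zeros(T_TRIAL + 1)
phi[1:3] = 1.0

def run(lr):
    """Return the delta(t) trace of each trial under the shaping bonus."""
    v = np.zeros(T_TRIAL + 1)
    traces = []
    for _ in range(N_TRIALS):
        trace = []
        for t in range(T_TRIAL):
            delta = phi[t + 1] - phi[t] + v[t + 1] - v[t]   # r(t) = 0 here
            trace.append(delta)
            if t > 0:
                v[t] += lr * delta
        # the shaping terms telescope to phi(end) - phi(0) = 0, so the
        # per-trial sum of deltas never depends on phi
        traces.append(trace)
    return traces

learned = run(0.3)  # the first-trial transient (+1 at onset, -1 two steps
                    # later) is exactly predicted away over trials
frozen = run(0.0)   # with learning off, the same transient persists forever
```

Note that the very first trial already shows the paired activation and depression, without any learning; this is the signature that distinguishes shaping from novelty bonuses in the model.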
The final category of bonus is an ongoing exploration bonus (Sutton, 1990; Dayan & Sejnowski, 1996), which is used to ensure continued exploration. Sutton (1990) suggested adding to the estimated value of each state (or each state-action pair) a number proportional to the length of time since it was last visited. This ultimately makes it irresistible to go and visit states that have not been visited for a long time. Dayan & Sejnowski (1996) derived a bonus of this form from a model of environmental change that justifies the bonus. There is no evidence for this sort of continuing exploration bonus in the dopamine data, perhaps not surprisingly, since the tasks undertaken by the monkey offer little opportunity for any trade-off between exploration and exploitation.

[Figure 4: panels A-F of δ(t) traces; times 0-20 (panels A-D) and 0-40 (panels E-F).]

Figure 4: Activity δ(t) of the dopamine system for partial predictability. del = delivered, pred = predicted. A;B) CS+ is presented with (A) or, surprisingly, without (B) reward. C;D) CS− is presented without (C) or, surprisingly, with (D) reward. On each trial, an initial stimulus (presented at t = 3) is ambiguous as to whether CS+ or CS− has been presented (each occurs equally often), and the ambiguity is perfectly resolved at t = 4. E;F) The model shows the same behavior. Since the CS± comes at a random interval after the cue, the traces are stimulus locked to the relevant events.

5 Generalization Responses and Partial Observability

Generalization responses (figure 1D) show a persistent effect of stimuli that merely resemble a rewarded stimulus. However, animals do not terminally confuse normally rewarded and normally non-rewarded stimuli, since if a reward is provided in the latter case, it engenders DA activity (as an unexpected reward should), and if it is not provided, there is no depression (as there would be if an expected reward were not delivered) (Schultz, 1998).

One possibility is that this activity comes from a shaping bonus that is not learned away, as in the lower plots of figure 3. An alternative interpretation comes from partial observability. If the initial information from the world is ambiguous as to whether the stimulus is actually rewarding (door+, called CS+ trials) or non-rewarding (door−, called CS− trials), because of the similarity, then the animal should develop an initial expectation that there could be a reward (whose mean value is related to the degree of confusion). This should lead to a partial activation of the DA system. If the expectation is canceled by subsequent information about the stimulus (available, for instance, following a saccade), then the DA system will be inhibited below baseline, exactly nullifying the earlier positive prediction. If the expectation is confirmed, then there will be continued activity representing the difference between the value of the reward and the expected value given the ambiguous stimulus. Figure 4 shows an example of this in a simplified case in which the animal receives information about the true stimulus over two timesteps: the first timestep is ambiguous to the tune of 50%; the second perfectly resolves the ambiguity. Figures 4A;B show CS+ trials, with and without the delivery of reward; figures 4C;D show CS− trials, without and with the delivery of reward. The similarity of 4A;C to figure 1D is clear.

Another instance of this generalization response is shown in figure 1E. Here, a cue light (c±) is provided indicating whether a CS+ or a CS− (d±) is to appear at a random later time, which in turn is followed (or not) after a fixed interval by a reward (r±). DA cells show a generalization response to the cue light; then fire to the CS+ or are unaffected by the CS−; and finally do not respond to the appropriate presence or absence of the reward. Figures 4E;F show that this is exactly the behavior of the model. The DA response locked to the CS+ arises because of the variability in the interval between the cue light and the CS+; if this interval were fixed, then the cells would only respond to the cue (c+), as in Schultz et al (1993).
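The figure 4 traces follow directly from hand-set belief-state values. In the sketch below, the stimulus time (t = 3) and resolution time (t = 4) come from the caption; the reward time and trial length are our assumptions.

```python
# Hand-set belief-state values for the figure-4 setting: v(t) is the expected
# summed future reward given the information available at time t. Only the
# stimulus (t = 3) and resolution (t = 4) times come from the caption; the
# reward time (t = 6) and trial length are assumed for illustration.
def trial_deltas(cs_plus: bool, rewarded: bool, T: int = 8):
    t_stim, t_resolve, t_reward = 3, 4, 6
    v = [0.0] * (T + 1)
    v[t_stim] = 0.5                      # ambiguous stimulus: reward with p = 0.5
    for t in range(t_resolve, t_reward + 1):
        v[t] = 1.0 if cs_plus else 0.0   # ambiguity perfectly resolved
    r = [0.0] * T
    if rewarded:
        r[t_reward] = 1.0
    return [r[t] + v[t + 1] - v[t] for t in range(T)]   # equation (2)

# CS+ with reward: +0.5 at stimulus onset, +0.5 at resolution, none at reward
# CS- without reward: +0.5 at onset, -0.5 at resolution, and no depression
# CS- with reward: the unexpected reward still drives a full +1 response
for cs, rew in [(True, True), (True, False), (False, False), (False, True)]:
    print(cs, rew, trial_deltas(cs, rew))
```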
6 Discussion

We have suggested a set of interpretations for the activity of the DA system to add to that of reporting prediction error for reward. The two theoretically most interesting features are novelty and shaping bonuses. The former distort the reward function in such a way as to encourage exploration of new stimuli and new places. The latter are non-distorting, and can be seen as being multiplexed by the DA system together with the prediction error signal.

Since shaping bonuses are non-distorting, they have no ultimate effect on action choice. However, the signal provided by the activation (and then cancellation) of DA can nevertheless have a significant neural effect. We suggest that DA release has unconditional effects in the ventral striatum (perhaps allowing stimuli to be read into pre-frontal working memory, Cohen et al, 1998) and the dorsal striatum (perhaps engaging stimulus-directed approach and exploratory orienting behaviors; see Ikemoto & Panksepp, 1999, for review). For stimuli that actually predict rewards (and so cause an initial activation of the DA system), these behaviors are often called appetitive; for novel, salient, and potentially important stimuli that are not known to predict rewards, they allow the system to pay appropriate attention. These effects of DA are unconditional, since they are hard-wired and not learned. In the case of partial observability, DA release due to the uncertain prediction of reward directly causes further investigation, and therefore resolution of the uncertainty. When unconditional and conditioned behaviors conflict, the former seem to dominate, as in the inability of animals to learn to run away from a stimulus in order to get food from it.

The major lacuna in the model is its lack of one or more opponent processes to DA that might report on punishments and the absence of predicted rewards. There is substantial circumstantial evidence that this might be one role for serotonin (which itself has unconditional effects, associated with fear, fight, and flight responses, that are opposite to those of DA), but the physiological evidence to support or refute this possibility is lacking. Understanding the interaction of dopamine and serotonin in terms of their conditioned and unconditioned effects is a major task for future work.

Acknowledgements

Funding is from the NSF and the Gatsby Charitable Foundation.

References

[1] Bertsekas, DP & Tsitsiklis, JN (1996). Neuro-dynamic Programming. Cambridge, MA: Athena Scientific.
[2] Cohen, JD, Braver, TS & O'Reilly, RC (1998). In AC Roberts & TW Robbins, editors, The Prefrontal Cortex: Executive and Cognitive Functions. Oxford: OUP.
[3] Dayan, P & Sejnowski, TJ (1996). Machine Learning, 25: 5-22.
[4] Horvitz, JC, Stewart, T & Jacobs, B (1997). Brain Research, 759: 251-258.
[5] Ikemoto, S & Panksepp, J (1999). Brain Research Reviews, 31: 6-41.
[6] Montague, PR, Dayan, P & Sejnowski, TJ (1996). Journal of Neuroscience, 16: 1936-1947.
[7] Ng, AY, Harada, D & Russell, S (1999). Proceedings of the Sixteenth International Conference on Machine Learning.
[8] Redgrave, P, Prescott, T & Gurney, K (1999). Trends in Neurosciences, 22: 146-151.
[9] Schultz, W (1992). Seminars in the Neurosciences, 4: 129-138.
[10] Schultz, W (1998). Journal of Neurophysiology, 80: 1-27.
[11] Schultz, W, Apicella, P & Ljungberg, T (1993). Journal of Neuroscience, 13: 900-913.
[12] Schultz, W, Dayan, P & Montague, PR (1997). Science, 275: 1593-1599.
[13] Schultz, W & Romo, R (1990). Journal of Neurophysiology, 63: 607-624.
[14] Sutton, RS (1990). Machine Learning: Proceedings of the Seventh International Conference, 216-224.
[15] Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
", "award": [], "sourceid": 1872, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}