{"title": "Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 944, "abstract": null, "full_text": "Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning\n\nGediminas Lukšys1,2 (gediminas.luksys@epfl.ch), Jérémie Knüsel1 (jeremie.knuesel@epfl.ch), Denis Sheynikhovich1 (denis.sheynikhovich@epfl.ch), Carmen Sandi2 (carmen.sandi@epfl.ch), Wulfram Gerstner1 (wulfram.gerstner@epfl.ch)\n\n1 Laboratory of Computational Neuroscience, 2 Laboratory of Behavioral Genetics\nEcole Polytechnique Fédérale de Lausanne, CH-1015, Switzerland\n\nAbstract\n\nStress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as the learning rate, the future reward discount factor, and the exploitation-exploration factor control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with the C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. We then used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. 
The dynamics of several estimated meta-parameters were qualitatively similar across the two simulated experiments, with statistically significant differences between genetic strains and stress conditions.\n\n1 Introduction\n\nAnimals choose their actions based on reward expectations and motivational drives. Different aspects of learning are known to be influenced by acute stress [1, 2, 3] and genetic background [4, 5]. Stress effects on learning depend on the stress type (e.g. task-specific or unspecific) and intensity, as well as on the learning paradigm (e.g. spatial/episodic vs. procedural learning) [3]. It is known that stress can affect short- and long-term memory by modulating plasticity through stress hormones and neuromodulators [1, 2, 3, 6]. However, there is no integrative model that would accurately predict and explain the differential effects of acute stress. Although stress factors can be described in quantitative measures, their effects on learning, memory, and performance are strongly influenced by how an animal perceives them. This subjective experience can be influenced by emotional memories as well as by behavioral genetic traits such as anxiety, impulsivity, and novelty reactivity [4, 5, 7].\n\nIn the present study, behavioral experiments conducted on two different genetic strains of mice and under different stress conditions were combined with a modeling approach. In our models, behavioral performance as a function of time was described in the framework of temporal difference reinforcement learning (TDRL).\n\nIn TDRL models [8] a modeled animal, termed an agent, can occupy various states and undertake actions in order to acquire rewards. The expected values of cumulative future reward (Q-values) are learned by observing immediate rewards delivered under different state-action combinations. 
Their updates are controlled by meta-parameters such as the learning rate, the future reward discount factor, and the memory decay/interference factor. The Q-values (together with the exploitation/exploration factor) determine which actions are more likely to be chosen when the animal is in a certain state, i.e. they represent the goal-oriented behavioral strategy learned by the agent. The activity of certain neuromodulators in the brain is thought to be associated with the role the meta-parameters play in TDRL models. Besides dopamine (DA), whose levels are known to be related to the TD reward prediction error [9], serotonin (5-HT), noradrenaline (NA), and acetylcholine (ACh) have been discussed in relation to TDRL meta-parameters [10]. Thus, knowledge of the characteristic meta-parameter dynamics can give insight into the putative neuromodulatory activities in the brain. Dynamic parameter estimation approaches, recently applied to behavioral data in the context of TDRL [11], could be used for this purpose.\n\nIn our study, we carried out 5-hole-box light conditioning and Morris water maze experiments with the C57BL/6 and DBA/2 inbred mouse strains (referred to as C57 and DBA from now on), renowned for their differences in anxiety, impulsivity, and spatial learning [4, 5, 12]. We exposed subgroups of animals to different kinds of stress (such as motivational stress or task-specific uncertainty) in order to evaluate its effects on immediate performance, and also tested their long-term memory after a break of 4-7 weeks. We then used TDRL models to describe the mouse behavior and established a number of performance measures relevant to task learning and memory (such as mean response times and latencies to reach the platform) in order to compare the outcome of the model with the animal performance. 
Finally, for each experimental session we ran an optimization procedure to find the set of meta-parameters that best fit the experimental data, as quantified by the performance measures. This approach made it possible to relate the effects of stress and genotype to differences in the meta-parameter values, allowing us to make specific inferences about learning dynamics (generalized over two different experimental paradigms) and their neurobiological correlates.\n\n2 Reinforcement learning model of animal behavior\n\nIn the TDRL framework [8] animal behavior is modelled as a sequence of actions. After an action is performed, the animal is in a new state where it can again choose from a set of possible actions. In certain states the animal is rewarded, and the goal of learning is to choose actions so as to maximize the expected future reward, or Q-value, formally defined as\n\nQ(s_t, a_t) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t, a_t ] ,  (1)\n\nwhere (s_t, a_t) is the state-action pair, r_t is the reward received at time step t, and 0 < γ < 1 is the future reward discount factor, which controls to what extent future rewards are taken into account. As soon as state s_{t+1} is reached and a new action is selected, the estimate of the previous state's value Q(s_t, a_t) is updated based on the reward prediction error δ_t [8]:\n\nδ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ,  (2)\nQ(s_t, a_t) ← Q(s_t, a_t) + α δ_t ,  (3)\n\nwhere α is the learning rate. Action selection at each state is controlled by the exploitation factor β: actions with high Q-values are chosen more often when β is high, whereas actions are chosen almost at random when β is close to zero. 
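As an illustration, the updates of Eqs. (2)-(3) can be sketched in a few lines of Python; the two-state toy task below is purely hypothetical and is not the 5HB or WM model.

```python
# Sketch of the TD update in Eqs. (2)-(3); the toy task is illustrative only.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    '''One SARSA-style update of a Q-table stored as a dict keyed by (state, action).'''
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # Eq. (2)
    Q[(s, a)] += alpha * delta                           # Eq. (3)
    return delta

# Toy episode: from state 0, action 'move' yields reward 1.0 and leads to state 1.
Q = {(s, a): 0.0 for s in (0, 1) for a in ('move', 'stay')}
delta = sarsa_update(Q, 0, 'move', 1.0, 1, 'stay', alpha=0.1, gamma=0.9)
print(delta, Q[(0, 'move')])  # 1.0 0.1
```

With all Q-values initialized to zero, the first prediction error equals the full reward, and the Q-value moves a fraction α of the way toward it.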
Meta-parameters α, β and γ are the free parameters of the model.\n\n3 5-hole-box experiment and modeling\n\nExperimental subjects were male mice (24 of the C57 strain and 24 of the DBA strain), 2.5 months old at the beginning of the experiment and food-deprived to 85-90% of their initial weight. During an experimental session, each animal was placed into the 5-hole-box (5HB) (Figure 1a). The animals had to learn to make a nose poke into any of the holes upon the onset of lights and not to poke in the absence of light. After a response to light, the animals received a reward in the form of a food pellet. Once a poke was initiated (see starting a poke in Figure 1b), the mouse had to stay in the hole at least for a short time (0.3-0.5 sec) in order to find the delivered reward (continuing a poke). The trial ended (lights turned off) as soon as the nose poke was finished. If the mouse did not find the reward, the reward remained in the box and the animal could find it during the next poke in the same box. The inter-trial interval (ITI) between subsequent trials was 15 sec. However, a new trial could only start if there were no wrong (ITI) pokes during the last 3 sec before it, so as to penalize spontaneous poking. The total session time was 10 min. Hence, the number of trials depended on how fast the animals responded to light and how often they made ITI pokes.\n\n[Figure 1 omitted: panel a shows the 5HB apparatus with 5 holes and lights; panel b shows the state-action chart with states [ITI, trial] × [staying outside, starting a poke, continuing a poke] and reward delivery after a correct poke.]\n\nFigure 1: a. Scheme of the 5HB experiment. Open circles are the holes where the food is delivered, filled circles are the lights. All 5 holes were treated as equivalent during the experiment. b. 5HB state-action chart. 
Rectangles are states, arrows are actions.\n\nAfter 2 days of habituation, during which the mice learned that food could be delivered in the holes, they underwent 8 consecutive days of training. During days 5-7, subsets of the animals were exposed to different stress conditions: motivational stress (MS, food deprivation to 85-87% of the initial weight vs. 88-90% in controls) and uncertainty in the reward delivery (US, in 50% of correct responses they received either no food pellet or 2 of them). Mice of each strain were divided into 4 stress groups: controls, MS, US, and MS+US. After a break of 26 days, the long-term memory of the mice was tested by retraining them for another 8 days. During days 5-8 of the retraining, we again evaluated the impact of stress factors by exposing half of the mice to extrinsic stress (ES, 30 min on an elevated platform right before the 5HB experiment).\n\nTo model the mouse behavior we used a discrete-state TDRL model with 6 states: [ITI, trial] × [staying outside, starting a poke, continuing a poke], and 2 actions: move (in or out) and stay (see Figure 1b). Actions were chosen according to the soft-max method [8]:\n\np(a|s) = exp(β Q(s, a)) / Σ_k exp(β Q(s, a_k)) ,  (4)\n\nwhere k runs over all actions and β is the exploitation factor. Initial Q-values were equal to zero. Since the time spent outside the holes was comparatively long and included multiple (task-irrelevant) actions, the state/action pair staying outside/stay was given much more weight in the above formula. The time step (0.43 sec) was constant throughout the experiment and was chosen to fit the animal performance at the beginning of the experiment. 
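The soft-max choice rule (4) can be sketched as follows. This is a generic illustration, not the actual simulation code (in particular, the extra weighting of the staying outside/stay pair mentioned above is omitted).

```python
import math

def softmax_policy(q_values, beta):
    '''Action probabilities from Eq. (4): p(a|s) proportional to exp(beta * Q(s, a)).'''
    m = max(q_values)  # subtracting the max is a numerical-stability trick, mathematically equivalent
    exp_q = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(exp_q)
    return [e / z for e in exp_q]

# beta close to zero gives near-random choice; a high beta exploits the best Q-value.
p_random = softmax_policy([1.0, 0.5], beta=0.0)
p_greedy = softmax_policy([1.0, 0.5], beta=10.0)
print(p_random)  # [0.5, 0.5]
print(p_greedy)  # roughly [0.993, 0.007]
```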
Finally, to account for memory decay, after each day all Q(s, a) values were updated as follows:\n\nQ(s, a) ← Q(s, a) · (1 − λ) + ⟨Q(s, a)⟩_{s,a} · λ ,  (5)\n\nwhere λ is a memory decay/interference factor and ⟨Q(s, a)⟩_{s,a} is the average over the Q-values of all states and actions at the end of the day.\n\nAll performance measures (PMs) used in the 5HB paradigm (number of trials, number of ITI pokes, mean response time, mean poke length, TimePref^1 and LengthPref^2) were evaluated over the entire session (10 min, 1400 time steps), during which different states^3 could be visited multiple times. As opposed to an online 'SARSA'-type update of Q-values, we work with state occupancy probabilities p(s_t) and update Q-values with the following reward prediction error:\n\nδ_t = E[r_t] − Q(a_t, s_t) + γ Σ_{a_{t+1}, s_{t+1}} Q(a_{t+1}, s_{t+1}) · p(a_{t+1}, s_{t+1} | a_t, s_t) .  (6)\n\n^1 TimePref = (average time between adjacent ITI pokes) / (average response time)\n^2 LengthPref = (average response length) / (average ITI poke length)\n^3 including the pseudo-states corresponding to time steps within the 15 sec ITI\n\n4 Morris water maze experiment and modeling\n\nThe same mice as in the 5HB (4.5 months old at the beginning of the experiment) were tested in a variant of the Morris water maze (WM) task [13]. Starting from one of 4 starting positions in a circular pool filled with an opaque liquid, they had to learn the location of a hidden escape platform using stable extra-maze cues (Fig. 2a). Animals were initially trained for 4 days with 4 sessions a day (to avoid confusion with the 5HB, we consider each WM session as consisting of only one trial). Trial length was limited to 60 s, and the inter-session interval was 25 min. 
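The end-of-day decay of Eq. (5) above simply relaxes every Q-value toward the mean over all entries; a minimal sketch, with made-up numbers:

```python
def decay_q_values(Q, lam):
    '''Eq. (5): Q(s,a) <- Q(s,a)*(1 - lambda) + <Q>*lambda, with <Q> the mean over all entries.'''
    mean_q = sum(Q.values()) / len(Q)
    for key in Q:
        Q[key] = Q[key] * (1.0 - lam) + mean_q * lam

# Illustrative values only: two Q-entries with mean 1.0, decayed with lambda = 0.5.
Q = {('trial', 'move'): 2.0, ('trial', 'stay'): 0.0}
decay_q_values(Q, lam=0.5)
print(Q[('trial', 'move')], Q[('trial', 'stay')])  # 1.5 0.5
```

Each value is pulled a fraction λ of the way toward the common mean, so differences between Q-values shrink while their average is preserved.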
Half of the mice had to swim in cold water of 19°C (motivational stress, MS), while the rest were learning at 26°C (control).\n\nAfter a 7-week break, 3-day memory testing was done at 22-23°C for all animals. Finally, after another 2 weeks, the mice performed the task for 5 more days: half of them did a version with uncertainty stress (US), where the platform location varied randomly between the old position and its rotational opposite; the other half did the same task as before.\n\nBehavior was quantified using the following 4 PMs: time to reach the goal (escape latency), time spent in the target platform quadrant, in the opposite platform quadrant, and in the wall region (Fig. 2a).\n\n[Figure 2 omitted: panel a shows the pool with quadrants, starting positions, and platform locations; panel b shows the network of place cells (PC) projecting to action cells (AC) through weights w_ij.]\n\nFigure 2: WM experiment and model. a. Experimental setup. 1 – target platform quadrant, 2 – opposite platform quadrant, 3 – wall region. Small filled circles mark the 4 starting positions, the large filled circle marks the target platform, the open circle marks the opposite platform (used only in the US condition), pool ∅ = 1.4 m. b. 
Activities of place cells (PC) encode the position of the animal in the WM, activities of action cells encode the direction of the next movement.\n\nA TDRL paradigm (1)-(3) in continuous state and action spaces has been used to model the mouse behavior in the WM [14, 15]. The position of the animal is represented as the population activity of N_pc = 211 'place cells' (PC) whose preferred locations are distributed uniformly over the area of a modelled circular arena (Fig. 2b). The activity of place cell j is modelled by a Gaussian centered at the preferred location p_j of the cell:\n\nr^pc_j = exp(−‖p − p_j‖² / 2σ²_pc) ,  (7)\n\nwhere p is the current position of the modelled animal and σ_pc = 0.25 defines the width of the spatial receptive field relative to the pool radius. Place cells project to a population of N_ac = 36 'action cells' (AC) via feed-forward all-to-all connections with modifiable weights. Each action cell is associated with an angle φ_i, all φ_i being distributed uniformly in [0, 2π]. Thus, an activity profile on the level of the place cells (i.e. state s_t) causes a different activity profile on the level of the action cells, depending on the value of the weight vector. The activity of action cell i is taken as the value of the action, defined as a movement in direction φ_i^4:\n\nQ(s_t, a_t) = r^ac_i = Σ_j w_ij r^pc_j .  (8)\n\n^4 A constant step length was chosen to fit the average speed of the animals during the experiment.\n\nAction selection follows an ε-greedy policy, where the optimal action a* is chosen with probability β = 1 − ε and a random action with probability 1 − β = ε. Action a* is defined as movement in the direction of the center of mass φ* of the AC population^5. 
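The population-vector readout of φ* (footnote 5) and the ε-greedy choice can be sketched as follows; this is a generic illustration, using atan2 rather than a bare arctan so that the correct quadrant is obtained.

```python
import math
import random

def optimal_direction(r_ac, phis):
    '''Center of mass of the AC population (footnote 5), via atan2 for quadrant safety.'''
    s = sum(r * math.sin(phi) for r, phi in zip(r_ac, phis))
    c = sum(r * math.cos(phi) for r, phi in zip(r_ac, phis))
    return math.atan2(s, c) % (2.0 * math.pi)

def epsilon_greedy_direction(r_ac, phis, beta, rng=random):
    '''Choose the optimal direction with probability beta = 1 - epsilon, else a random one.'''
    if rng.random() < beta:
        return optimal_direction(r_ac, phis)
    return rng.uniform(0.0, 2.0 * math.pi)

# Two action cells at angles 0 and pi/2 with equal activity: the population vector points at pi/4.
phi_star = optimal_direction([1.0, 1.0], [0.0, math.pi / 2])
print(round(phi_star, 4))  # 0.7854
```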
The Q-value corresponding to an action with continuous angle φ is calculated by linear interpolation between the activities of the two closest action cells. During learning, the PC→AC connection weights are updated at each time step in such a way as to decrease the reward prediction error δ_t (3):\n\nΔw_ij = α δ_t r^ac_i r^pc_j .  (9)\n\nThe Hebbian-like form of the update rule (9) is due to the fact that we use distributed representations for states and actions, i.e. there is no single state/action pair responsible for the last movement. To simulate one experimental session it is necessary to (i) initialize the weight matrix {w_ij}, (ii) choose meta-parameter values and a starting position p_0, and (iii) compute (7)-(8) and perform the corresponding movements until ‖p − p_pl‖ < R_pl, at which point reward r = 15 is delivered (R_pl is the platform radius). Wall hits result in a small negative reward (r_wall = −3).\nFor each session and each set of meta-parameters, 48 different sets of random initial weights w_ij (corresponding to individual mice) were used to run the model, with 50 simulations started from each set. 
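Equations (7)-(9) can be sketched with a deliberately tiny network (2 place cells and 1 action cell instead of the paper's 211 and 36); the positions and the stand-in action-cell activity are illustrative only.

```python
import math

SIGMA_PC = 0.25  # place-field width relative to the pool radius (from the text)

def place_cell_activities(p, centers):
    '''Eq. (7): Gaussian activity of each place cell around its preferred location.'''
    return [math.exp(-((p[0] - cx) ** 2 + (p[1] - cy) ** 2) / (2.0 * SIGMA_PC ** 2))
            for (cx, cy) in centers]

def action_cell_activities(r_pc, w):
    '''Eq. (8): Q(s_t, a_i) = r_ac_i = sum over j of w_ij * r_pc_j.'''
    return [sum(w_i[j] * r_pc[j] for j in range(len(r_pc))) for w_i in w]

def hebbian_update(w, alpha, delta, r_ac, r_pc):
    '''Eq. (9): each weight changes by alpha * delta * r_ac_i * r_pc_j.'''
    for i, w_i in enumerate(w):
        for j in range(len(w_i)):
            w_i[j] += alpha * delta * r_ac[i] * r_pc[j]

centers = [(0.0, 0.0), (0.5, 0.0)]  # 2 toy place cells
w = [[0.0, 0.0]]                    # 1 action cell, weights start at zero
r_pc = place_cell_activities((0.0, 0.0), centers)
q = action_cell_activities(r_pc, w)
print(q)  # [0.0] -- all weights are zero before learning
hebbian_update(w, alpha=0.01, delta=15.0, r_ac=[1.0], r_pc=r_pc)  # reward r = 15 as in the text
print(round(w[0][0], 4))  # 0.15
```

After one update, each weight has grown in proportion to the product of its pre- and postsynaptic activities, scaled by the prediction error.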
Final values of the PMs were averaged over all repetitions for each subgroup of mice.\n\nTo account for the loss of memory, after each day all weights were updated as follows:\n\nw^new_ij = w^old_ij · (1 − λ) + w^initial_ij · λ ,  (10)\n\nwhere λ is the memory decay factor, w^old_ij is the weight value at the end of the day, and w^initial_ij is the initial weight value before any learning took place.\n\n5 Goodness-of-fit function and optimization procedure\n\nTo compare the model with the experiment we used the following goodness-of-fit function [16]:\n\nχ² = Σ_{k=1}^{N_PM} (PM^exp_k − PM^mod_k(α, β, γ, λ))² / (σ^exp_k)² ,  (11)\n\nwhere PM^exp_k and PM^mod_k are the PMs calculated for the animals and the model, respectively, and N_PM is the number of PMs. PM^mod_k(α, β, γ, λ) are calculated after simulating one session with fixed values of the meta-parameters. PM^exp_k were calculated either for each animal (5HB) or for each subgroup (WM). Using stochastic gradient descent, we minimized (11) with respect to α, β, γ for each session separately by systematically varying the meta-parameters in the following ranges: for the WM, α ∈ [10⁻⁵, 5·10⁻²] and β, γ ∈ [0.01, 0.99]; for the 5HB, α, γ ∈ [0.03, 0.99] and β ∈ [0.3, 9.9]. The decay factor λ ∈ [0.01, 0.99] was estimated only for the first session after the break; otherwise constant values of λ = 0.03 (5HB) and λ = 0.2 (WM) were used.\nSeveral control procedures were performed to ensure that the meta-parameter optimization was statistically efficient and self-consistent. To evaluate how well the model fits the experimental data we used a χ²-test with ν = N_PM − 3 degrees of freedom (since most of the time we had only 3 free meta-parameters). 
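The goodness-of-fit of Eq. (11) is a standard weighted sum of squared deviations; it can be sketched with hypothetical PM values (not numbers from the experiments):

```python
def chi_square(pm_exp, pm_mod, sigma_exp):
    '''Eq. (11): sum over PMs of (PM_exp - PM_mod)^2 / sigma_exp^2.'''
    return sum((e - m) ** 2 / s ** 2 for e, m, s in zip(pm_exp, pm_mod, sigma_exp))

# Hypothetical values for three PMs; each term contributes exactly 1.0 here.
pm_exp = [20.0, 5.0, 0.75]
pm_mod = [18.0, 5.5, 0.5]     # model PMs for one candidate meta-parameter set
sigma_exp = [2.0, 0.5, 0.25]  # experimental standard deviations
print(chi_square(pm_exp, pm_mod, sigma_exp))  # 3.0
```

Dividing by the experimental standard deviations makes the PMs commensurable, so the optimizer is not dominated by whichever measure happens to have the largest units.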
The P(χ², ν) value, defined as the probability that a realization of a chi-square-distributed random variable would exceed χ² by chance, was calculated for each session separately. Generally, values of P(χ², ν) > 0.01 correspond to a fairly good model [16]. To check the reliability of the estimated meta-parameters, we ran the same optimization procedure with PM^exp_k artificially generated by the model itself. In a self-consistent model, such a procedure is expected to find meta-parameter values similar to those with which the PMs were generated. Finally, to see how well the model generalizes to previously unseen data, we used half of the available experimental data for optimization and tested the estimated parameters on the other half. Then we evaluated the χ² and P(χ², ν) values for the testing as well as the training data.\n\n6 Results\n\nThe meta-parameter estimation procedure was performed for the models of both experiments using stochastic gradient descent on the χ² goodness-of-fit. For the 5HB, meta-parameters were estimated for\n\n^5 i.e. φ* = arctan(Σ_i r^ac_i sin φ_i / Σ_i r^ac_i cos φ_i)\n\n[Figure 3 omitted: panel a plots platform quadrant time (%) in the WM and mean response time (s) in the 5HB over days, for model and experimental data; panel b plots discount rate against exploitation factor and learning rate for the self-consistency check.]\n\nFigure 3: a. Example of PM evolution with learning in the WM (platform quadrant time, top) and in the 5HB (mean response time, bottom). b. 
Self-consistency check: true (open circles) and estimated (filled circles) meta-parameter values for the 24 random sets in the 5HB.\n\neach animal and each experimental day. Further (sub)group values were calculated by averaging the individual estimates. For the WM, meta-parameters were estimated for each subgroup and each experimental session. Learning dynamics in both experiments are illustrated in Figure 3a for 2 representative PMs, where the average performance of all mice and of the corresponding models (with estimated meta-parameters) is shown.\n\nThe results of both meta-parameter estimation procedures indicated a reasonably good fit between the model and animal performance. On the testing data, the condition P(χ², ν) > 0.01 was satisfied for 92.5% of the 5HB estimated parameter sets, and for 98.4% in the WM. The mean χ² values for the testing data were ⟨χ²⟩ = 1.59 in the WM (P(χ², 1) = 0.21) and ⟨χ²⟩ = 5.27 in the 5HB (P(χ², 3) = 0.15). There was a slight over-fitting only in the WM estimation.\nTo evaluate the quality of the estimated optima and the sensitivity to different meta-parameters, we calculated the eigenvalues of the Hessian of 1/χ² around each of the estimated points. 98.4% of all eigenvalues were negative, and most of the corresponding eigenvectors were aligned with the directions of α, β, and γ, indicating that there were no significant correlations in the parameter estimates. Furthermore, the absolute eigenvalues were highest in the directions of β and γ, i.e. the error surface is steep along these meta-parameters. To test the reliability of the estimated meta-parameters, the self-consistency check was performed using a number of random meta-parameter sets. 
The mean absolute errors (distances between true and estimated parameter values) were quite small for the exploitation factors (β) – approximately 6% of the total range – but higher for the reward discount factors (γ) and the learning rates (α) – 10-29% of the total range (Figure 3b). This indicates that estimated β values should be considered more reliable than those of α and γ.\n\n6.1 Meta-parameter dynamics\n\nDuring the course of learning, the exploitation factors (β) (Figure 4a,b) showed a progressive increase (regression p ≪ 0.001 for both the 5HB and the WM), reaching a peak at the end of each learning block. They were consistently higher for the C57 mice than for the DBA mice (2-way ANOVA with replications, p ≪ 0.001 for both experiments), indicating that the DBA mice were exploring the environment more actively and/or were not able to focus their attention well on the specific task. Finally, the C57 mouse groups exposed to motivational stress in the WM and to extrinsic stress in the 5HB had elevated exploitation factors (ANOVA p < 0.01 for both experiments); however, there was no such effect for the DBA mice.\n\nThe estimated learning rates (α) did not show any obvious changes or trends with learning in either the 5HB or the WM. 
There were no differences between the 2 genetic strains (nor among the stress conditions), with one exception: for the first several days of training, C57 learning rates were significantly higher (ANOVA p < 0.01 in both experiments), indicating that C57 mice could learn a novel task more quickly.\n\n[Figure 4 omitted: panels a,b plot estimated exploitation factors across days for C57BL/6 and DBA/2 mice; panels c,d plot future reward discount factors for fixed vs. variable platform (WM) and control vs. uncertainty (5HB); panel e plots memory decay/interference factors for control mice and mice previously exposed to US.]\n\nFigure 4: a,b. Estimated exploitation factors β for the 5HB (a, break between days 8 & 9) and the WM (b, breaks between days 4 & 5 and between days 7 & 8). c,d. Estimated future reward discount factors for the variable platform trials in the WM (c) and for the uncertainty trials in the 5HB (d). e. Estimated memory decay/interference factors for the first day after the break in the 5HB.\n\nUnder uncertainty (in reward delivery for the 5HB, and in the target platform location for the WM), future reward discount factors (γ) were significantly elevated (ANOVA p < 0.02, Figure 4c,d). 
In the 5HB, memory decay factors (λ), estimated for the first day after the break, were significantly higher (p < 0.01, unpaired t-test) for animals previously exposed to uncertainty (Figure 4e). This suggests that uncertainty makes animals consider rewards further into the future, and that it seems to impair memory consolidation.\n\n7 Discussion\n\nIn this paper we showed that various behavioral outcomes (caused by genetic traits and/or stress factors) could be predicted by our TDRL models for 2 different tasks. This provides hypotheses concerning the underlying neuromodulatory mechanisms, which we plan to test using pharmacological manipulations (typically, injections of agonists or antagonists of the relevant neurotransmitter systems).\n\nThe results for the exploitation factors suggest that with learning (and decreasing reward prediction errors) the acquired knowledge is used more and more for choosing actions. This might also be related to decreased subjective stress and higher stressor controllability. The difference between the C57 and DBA strains shows two things. Firstly, the anxious DBA mice cannot exploit their knowledge as well as the C57 mice can. Secondly, in response to motivational or extrinsic stress, only the C57 mice increase their exploitation. This may be related to an inverse-U-shaped effect of noradrenergic influences on focused attention and performance accuracy [17]. 
Animals with low anxiety (C57) might be on the left side of the curve, where additional stress can lead them to optimal performance, while those with high anxiety may already be on the right side, where additional stress can impair performance. Our results may also suggest that the widely proclaimed deficiency of DBA mice in spatial learning (as compared to C57) [4, 12] might be primarily due to differential attentional capabilities.\n\nThe increased future reward discount factors under uncertainty indicate a reasonable adaptive response – animals should not concentrate their learning on immediate events when task-reward relations become ambiguous. Uncertainty in behaviorally relevant outcomes under stress causes a decrease in subjective stressor controllability, which is known to be related to elevated serotonin levels [18]. Higher memory decay/interference factors for the animals previously exposed to uncertainty could be due to partially impaired memory consolidation and/or to stronger competition between different strategies and perceptions of the uncertain task.\n\nAlthough estimated meta-parameter values can easily be compared between experimental conditions, it is difficult in this way to study the interactions between different genetic and environmental factors, or to extrapolate beyond the limits of the available conditions. One could overcome this disadvantage by developing a black-box parameter model that would allow us to evaluate, in a flexible way, the contributions of specific factors (motivation, uncertainty, genotype) to the meta-parameter dynamics, as well as their relationship with the dynamics of the TD errors (δ_t) during learning.\n\nAcknowledgments\n\nThis work was partially supported by a grant from the Swiss National Science Foundation to C.S. (3100A0-108102).\n\nReferences\n\n[1] J. J. Kim and D. M. Diamond. The stressed hippocampus, synaptic plasticity and lost memories. 
Nat Rev Neurosci., 3(6):453–62, Jun 2002.\n\n[2] C. Sandi, M. Loscertales, and C. Guaza. Experience-dependent facilitating effect of corticosterone on spatial memory formation in the water maze. Eur J Neurosci., 9(4):637–42, Apr 1997.\n\n[3] M. Joëls, Z. Pu, O. Wiegert, M. S. Oitzl, and H. J. Krugers. Learning under stress: how does it work? Trends Cogn Sci., 10(4):152–8, Apr 2006.\n\n[4] J. M. Wehner, R. A. Radcliffe, and B. J. Bowers. Quantitative genetics and mouse behavior. Annu Rev Neurosci., 24:845–67, 2001.\n\n[5] A. Holmes, C. C. Wrenn, A. P. Harris, K. E. Thayer, and J. N. Crawley. Behavioral profiles of inbred strains on novel olfactory, spatial and emotional tests for reference memory in mice. Genes Brain Behav., 1(1):55–69, Jan 2002.\n\n[6] J. L. McGaugh. The amygdala modulates the consolidation of memories of emotionally arousing experiences. Annu Rev Neurosci., 27:1–28, 2004.\n\n[7] M. J. Kreek, D. A. Nielsen, E. R. Butelman, and K. S. LaForge. Genetic influences on impulsivity, risk taking, stress responsivity and vulnerability to drug abuse and addiction. Nat Neurosci., 8:1450–7, 2005.\n\n[8] R. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n[9] W. Schultz, P. Dayan, and P. R. Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–9, Mar 1997.\n\n[10] K. Doya. Metalearning and neuromodulation. Neural Netw., 15(4-6):495–506, Jun-Jul 2002.\n\n[11] K. Samejima, K. Doya, Y. Ueda, and M. Kimura. Estimating internal variables and parameters of a learning agent by a particle filter. In Advances in Neural Information Processing Systems 16, 2004.\n\n[12] C. Rossi-Arnaud and M. Ammassari-Teule. What do comparative studies of inbred mice add to current investigations on the neural basis of spatial behaviors? Exp Brain Res., 123(1-2):36–44, Nov 1998.\n\n[13] R. G. M. Morris. 
Spatial localization does not require the presence of local cues. Learning and Motivation, 12:239–260, 1981.\n\n[14] D. J. Foster, R. G. M. Morris, and P. Dayan. A model of hippocampally dependent navigation, using the temporal difference learning rule. Hippocampus, 10(1):1–16, 2000.\n\n[15] T. Strösslin, D. Sheynikhovich, R. Chavarriaga, and W. Gerstner. Modelling robust self-localisation and navigation using hippocampal place cells. Neural Networks, 18(9):1125–1140, 2005.\n\n[16] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.\n\n[17] G. Aston-Jones, J. Rajkowski, and J. Cohen. Locus coeruleus and regulation of behavioral flexibility and attention. Prog Brain Res., 126:165–82, 2000.\n\n[18] J. Amat, M. V. Baratta, E. Paul, S. T. Bland, L. R. Watkins, and S. F. Maier. Medial prefrontal cortex determines how stressor controllability affects behavior and dorsal raphe nucleus. Nat Neurosci., 8(3):365–71, Mar 2005.\n", "award": [], "sourceid": 2958, "authors": [{"given_name": "Gediminas", "family_name": "Lukšys", "institution": null}, {"given_name": "Jérémie", "family_name": "Knüsel", "institution": null}, {"given_name": "Denis", "family_name": "Sheynikhovich", "institution": null}, {"given_name": "Carmen", "family_name": "Sandi", "institution": null}, {"given_name": "Wulfram", "family_name": "Gerstner", "institution": null}]}