{"title": "Neurally Plausible Reinforcement Learning of Working Memory Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1871, "page_last": 1879, "abstract": "A key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: during learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity. It is however not well known how these neurons acquire their task-relevant tuning. Here we introduce a biologically plausible learning scheme that explains how neurons become selective for relevant information when animals learn by trial and error. We propose that the action selection stage feeds back attentional signals to earlier processing levels. These feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. A globally released neuromodulatory signal interacts with these tagged synapses to determine the sign and strength of plasticity. The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to (1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks and (2) learn to optimally integrate probabilistic evidence for perceptual decision making.", "full_text": "Neurally Plausible Reinforcement Learning of\n\nWorking Memory Tasks\n\nJaldert O. Rombouts, Sander M. Bohte\n\nPieter R. 
Roelfsema\n\nNetherlands Institute for Neuroscience\n\nAmsterdam, The Netherlands\n\np.r.roelfsema@nin.knaw.nl\n\nCWI, Life Sciences\n\nAmsterdam, The Netherlands\n\n{j.o.rombouts, s.m.bohte}@cwi.nl\n\nAbstract\n\nA key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: during learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity [1]. It is however not well known how such neurons acquire these task-relevant working memories. Here we introduce a biologically plausible learning scheme grounded in Reinforcement Learning (RL) theory [2] that explains how neurons become selective for relevant information by trial and error learning. The model has memory units which learn useful internal state representations to solve working memory tasks by transforming partially observable Markov decision problems (POMDPs) into MDPs. We propose that synaptic plasticity is guided by a combination of attentional feedback signals from the action selection stage to earlier processing levels and a globally released neuromodulatory signal. Feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. The neuromodulatory signal interacts with tagged synapses to determine the sign and strength of plasticity. 
The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) optimally integrate probabilistic evidence for perceptual decision making [5, 6].\n\n1 Introduction\n\nBy giving reward at the right times, animals like monkeys can be trained to perform complex tasks that require the mapping of sensory stimuli onto responses, the storage of information in working memory and the integration of uncertain sensory evidence. While significant progress has been made in reinforcement learning theory [2, 7, 8, 9], a generic learning rule for neural networks that is biologically plausible and also accounts for the versatility of animal learning has yet to be described. We propose a simple biologically plausible neural network model that can solve a variety of working memory tasks. The network predicts action-values (Q-values) for different possible actions [2], and it learns to minimize SARSA [10, 2] temporal difference (TD) prediction errors by stochastic gradient descent. The model has memory units inspired by neurons in lateral intraparietal (LIP) cortex and prefrontal cortex. Such neurons exhibit persistent activations for task-related cues in visual working memory tasks [1, 11, 4]. Memory units learn to represent an internal state that allows the network to solve working memory tasks by transforming POMDPs into MDPs [25]. The updates for synaptic weights have two components. The first is a synaptic tag [12] that arises from an interaction between feedforward and feedback activations. Tags form on those synapses that are responsible for the chosen actions by an attentional feedback process [13]. The second factor is a\n\nFigure 1: Model and learning (see section 2). 
Pentagons represent synaptic tags.\n\nglobal neuromodulatory signal δ that reflects the TD error, and this signal interacts with the tags to yield synaptic plasticity. TD-errors are represented by dopamine neurons in the ventral tegmental area and substantia nigra [9, 14]. The persistence of tags permits learning if time passes between synaptic activity and the animal's choice, for example if information is stored in working memory or evidence accumulates before a decision is made. The learning rules are biologically plausible because the information required for computing the synaptic updates is available at the synapse. We call the new learning scheme AuGMEnT (Attention-Gated MEmory Tagging).\n\nWe first discuss the model and then show that it explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) optimally integrate probabilistic evidence for perceptual decision making [5, 6].\n\n2 Model\n\nAuGMEnT is modeled as a three layer neural network (Fig. 1). Units in the motor (output) layer predict Q-values [2] for their associated actions. Predictions are learned by stochastic gradient descent on prediction errors.\n\nThe sensory layer contains two types of units: instantaneous and transient on(+)/off(−) units. Instantaneous units x_i encode sensory inputs s_i(t), and + and − units encode positive and negative changes in sensory inputs with respect to the previous time step t − 1:\n\nx+_i(t) = [s_i(t) − s_i(t − 1)]+ ;  x−_i(t) = [s_i(t − 1) − s_i(t)]+ ,  (1)\n\nwhere [.]+ is a threshold operator that returns 0 for all negative inputs but leaves positive inputs unchanged. Each sensory variable s_i is thus represented by three units x_i, x+_i, x−_i (we only explicitly write the time dependence if it is ambiguous). We denote the set of differentiating units as x'. 
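As a concrete illustration, eqn. (1) can be sketched in a few lines (a minimal sketch; the function name and the use of NumPy arrays are our own, not part of the model specification):

```python
import numpy as np

def sensory_layer(s_t, s_prev):
    """Encode a sensory vector as instantaneous, on(+) and off(-) units (eqn. 1)."""
    x_inst = s_t                              # instantaneous units x_i
    x_on = np.maximum(s_t - s_prev, 0.0)      # x+_i: positive changes only
    x_off = np.maximum(s_prev - s_t, 0.0)     # x-_i: negative changes only
    return x_inst, x_on, x_off
```

A stimulus that switches on drives only the + unit, and one that switches off drives only the − unit, as in the text.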
The hidden layer models the association cortex and it contains regular units and memory units. The regular units j (Fig. 1, circles) are fully connected to the instantaneous units i in the sensory layer by connections v^R_ij; v^R_0j is a bias weight. Regular unit activations y^R_j are computed as:\n\ny^R_j = σ(a^R_j) = 1 / (1 + exp(θ − a^R_j)), with a^R_j = Σ_i v^R_ij x_i .  (2)\n\nMemory units m (Fig. 1, diamonds) are fully connected to the +/− units in the sensory layer by connections v^M_lm and they derive their activations y^M_m(t) by integrating their inputs:\n\ny^M_m = σ(a^M_m), with a^M_m = a^M_m(t − 1) + Σ_l v^M_lm x'_l ,  (3)\n\nwith σ as defined in eqn. (2). Output layer units k are fully connected to the hidden layer by connections w^R_jk (for regular hiddens, w^R_0k is a bias weight) and w^M_mk (for memory hiddens). Activations are computed as:\n\nq_k = Σ_j y^R_j w^R_jk + Σ_m y^M_m w^M_mk .  (4)\n\nA Winner Takes All (WTA) competition now selects an action based on the estimated Q-values. We used a max-Boltzmann [15] controller which executes the action with the highest estimated Q-value with probability 1 − ε and otherwise chooses an action with probabilities according to the Boltzmann distribution:\n\nPr(z_k = 1) = exp(q_k) / Σ_k' exp(q_k') .  (5)\n\nThe WTA mechanism then sets the activation of the winning unit to 1 and the activation of all other units to 0; z_k = δ_kK where δ_kK is the Kronecker delta function. The winning unit sends feedback signals to the earlier processing layers, informing the rest of the network about the action that was taken. 
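The max-Boltzmann controller of eqn. (5) and the WTA step can be sketched as follows (a sketch under our own naming; the max-subtraction inside the exponential is a standard numerical-stability trick, not part of the original formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(q, epsilon=0.025):
    """Max-Boltzmann controller (eqn. 5): greedy action with probability
    1 - epsilon, otherwise sample from the Boltzmann distribution over Q-values."""
    if rng.random() < epsilon:
        p = np.exp(q - q.max())           # subtract max for numerical stability
        k = int(rng.choice(len(q), p=p / p.sum()))
    else:
        k = int(np.argmax(q))             # exploit: highest estimated Q-value
    z = np.zeros_like(q)
    z[k] = 1.0                            # WTA: winner set to 1, all others 0
    return k, z
```

The returned vector z plays the role of the Kronecker-delta activations that gate the feedback signals.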
This feedback signal interacts with the feedforward activations to give rise to synaptic tags on those synapses that were involved in taking the decision. The tags then interact with a neuromodulatory signal δ, which codes a TD error, to modify synaptic strengths.\n\n2.1 Learning\n\nAfter executing an action, the environment returns a new observation s', a scalar reward r, and possibly a signal indicating the end of a trial. The network computes a SARSA TD error [10, 2]:\n\nδ = r + γ q_K' − q_K ,  (6)\n\nwhere q_K' is the predicted value of the winning action for the new observation, and γ ∈ [0, 1] is the temporal discount parameter [2]. AuGMEnT learns by minimizing the squared prediction error E:\n\nE = (1/2) δ² = (1/2) (r + γ q_K' − q_K)² .  (7)\n\nThe synaptic updates have two factors. The first is a synaptic tag (Fig. 1, pentagons; equivalent to an eligibility trace in RL [2]) that arises from an interaction between feedforward and feedback activations. The second is a global neuromodulatory signal δ which interacts with these tags to yield synaptic plasticity. The updates can be derived by the chain rule for derivatives [16]. The update for synapses w^R_jk is:\n\nΔw^R_jk = −β (∂E/∂q_K) Tag^R_jk = β δ(t) Tag^R_jk ,  (8)\n\nΔTag^R_jk = (λγ − 1) Tag^R_jk + ∂q_K/∂w^R_jk = (λγ − 1) Tag^R_jk + y^R_j z_k ,  (9)\n\nwhere β is a learning rate, Tag^R_jk are the synaptic tags on synapses between regular hidden units and the motor layer, and λ is a decay parameter [2]. Note that Δw^R_jk ∝ −β ∂E/∂w^R_jk, holding with equality if λγ = 0. 
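The TD error of eqn. (6) and the tag and weight dynamics of eqns. (8)-(9) can be sketched as follows (variable names and the matrix layout are our own; note that eqn. (9) leaves the new tag at λγ·Tag plus the activity product):

```python
import numpy as np

def td_error(r, q_next, q_prev, gamma=0.90):
    """SARSA TD error (eqn. 6): delta = r + gamma * q_K' - q_K."""
    return r + gamma * q_next - q_prev

def update_output_weights(w, tag, y_hidden, z, delta,
                          beta=0.15, lam=0.20, gamma=0.90):
    """Tag and weight updates for hidden-to-output synapses (eqns. 8-9).
    Tags decay by lam*gamma and grow where presynaptic activity y coincides
    with the selected action z; weights then move by beta * delta * tag."""
    tag = lam * gamma * tag + np.outer(y_hidden, z)   # eqn. 9
    w = w + beta * delta * tag                        # eqn. 8
    return w, tag
```

With λγ = 0 the tag reduces to the instantaneous activity product and the update is exact gradient descent on E, as noted in the text.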
If λγ > 0, tags decay exponentially so that synapses that were responsible for previous actions are also assigned credit for the currently observed error. Equivalently, updates for synapses between memory units and motor units are:\n\nΔw^M_mk = β δ(t) Tag^M_mk ,  (10)\n\nΔTag^M_mk = (λγ − 1) Tag^M_mk + y^M_m z_k .  (11)\n\nThe updates for synapses between instantaneous sensory units and regular association units are:\n\nΔv^R_ij = −β (∂E/∂q_K) Tag^R_ij = β δ Tag^R_ij ,  (12)\n\nΔTag^R_ij = (λγ − 1) Tag^R_ij + (∂q_K/∂y^R_j)(∂y^R_j/∂a^R_j)(∂a^R_j/∂v^R_ij)  (13)\n\n= (λγ − 1) Tag^R_ij + w'^R_Kj y^R_j (1 − y^R_j) x_i ,  (14)\n\nwhere w'^R_Kj are feedback weights from the motor layer back to the association layer. The intuition for the last equation is that the winning output unit K provides feedback to the units in the association layer that were responsible for its activation. Association units with a strong feedforward connection also have a strong feedback connection. As a result, synapses onto association units that provided strong input to the winning unit will have the strongest plasticity. This 'attentional feedback' mechanism was introduced in [13]. For convenience, we have assumed that feedforward and feedback weights are symmetrical, but they can also be trained as in [13].\n\nFor the updates for the synapses between +/− sensory units and memory units we first approximate the activation a^M_m (see eqn. (3)) as:\n\na^M_m = a^M_m(t − 1) + Σ_l v^M_lm x'_l(t) ≈ Σ_l v^M_lm Σ_{t'=0..t} x'_l(t') ,  (15)\n\nwhich is a good approximation if the synapses v^M_lm change slowly. 
We can then write the updates as:\n\nΔv^M_lm = −β (∂E/∂q_K) Tag^M_lm = β δ Tag^M_lm ,  (16)\n\nΔTag^M_lm = −Tag^M_lm + (∂q_K/∂y^M_m)(∂y^M_m/∂a^M_m)(∂a^M_m/∂v^M_lm)  (17)\n\n= −Tag^M_lm + w'^M_Km y^M_m(t)(1 − y^M_m(t)) [ Σ_{t'=0..t} x'_l(t') ] .  (18)\n\nNote that one can interpret a memory unit as a regular one that receives all sensory input in a trial simultaneously. For synapses onto memory units, we set λ = 0 to arrive at the last equation. The intuition behind the last equation is that because the activity of a memory unit does not decay, the influence of its inputs x'_l on the activity in the motor layer does not decay either (λγ = 0).\n\nA special condition occurs when the environment returns the end-trial signal. In this case, the estimate q_K' in eqn. (6) is set to 0 (see [2]) and after the synaptic updates we reset the memory units and synaptic tags, so that there is no confounding between different trials.\n\nAuGMEnT is biologically plausible because the information required for the synaptic updates is locally available by the interaction of feedforward and feedback signals and a globally released neuromodulator coding TD errors. As we will show, this mechanism is powerful enough to learn non-linear transformations and to create relevant working memories.\n\n3 Experiments\n\nWe tested AuGMEnT on a set of memory tasks that have been used to investigate the effects of training on neuronal activity in area LIP. Across all of our simulations, we fixed the configuration of the association layer (three regular units, four memory units) and Q-layer (three output units, for directing gaze to the left, center or right of a virtual screen). The input layer was tailored to the specific task (see below). 
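Putting eqns. (2)-(4) together, one forward pass through a network of this shape can be sketched as follows (a simplified sketch: bias weights are omitted for brevity and all names are our own, so this is not the authors' implementation):

```python
import numpy as np

def forward(x_inst, x_trans, a_mem, vR, vM, wR, wM, theta=2.5):
    """One forward pass (eqns. 2-4): regular units see instantaneous inputs,
    memory units integrate the transient (+/-) inputs, and output units
    sum both pools to produce Q-values."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(theta - a))
    yR = sigma(vR.T @ x_inst)          # regular association units, eqn. 2
    a_mem = a_mem + vM.T @ x_trans     # memory units integrate input, eqn. 3
    yM = sigma(a_mem)
    q = wR.T @ yR + wM.T @ yM          # Q-values, eqn. 4
    return q, yR, yM, a_mem
```

The persistent state a_mem is carried from one time step to the next and, per the text, reset at the end of each trial.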
In all tasks, we trained the network by trial and error to fixate on a fixation mark and to respond to task-related cues. As is usual in training animals for complex tasks, we used a small shaping reward r_fix (arbitrary units) to facilitate learning to fixate [17]. At the end of trials the model had to make an eye-movement to the left or right. The full task reward r_fin was given if this saccade was accurate, while we aborted trials and gave no reward if the model made the wrong eye-movement or broke fixation before the go signal. We used a single set of parameters for the network: β = 0.15; λ = 0.20; γ = 0.90; ε = 0.025 and θ = 2.5, which shifts the sigmoidal activation function for association units so that units with little input have almost zero output. Initial synaptic weights were drawn from a uniform distribution U[−0.25, 0.25]. For all tasks we used r_fix = 0.2 and r_fin = 1.5.\n\n3.1 Saccade/Antisaccade\n\nThe memory saccade/anti-saccade task (Fig. 2A) is based on [3]. This task requires a non-linear transformation and cannot be solved by a direct mapping from sensory units to Q-value units. Trials started with an empty screen, shown for one time step. Then either a black or white fixation mark was shown, indicating a pro-saccade or anti-saccade trial, respectively. The model had to fixate on the fixation mark within ten time-steps, or the trial was terminated. After fixating for two time-steps, a cue was presented on the left or right and a small shaping reward r_fix was given. The\n\nFigure 2: A Memory saccade/antisaccade task. B Model network. In the association layer, a regular unit and two memory units are color coded gray, green and orange, respectively. Output units L, F, R are colored green, blue and red, respectively. C Unit activation traces for a sample trained network. Symbols in bottom graph indicate highest valued action. 
F, fixation onset; C, cue onset; D, delay; G, fixation offset ('Go' signal). Thick blue: fixate, dashed green: left, red: right. D Selectivity indices of memory units in saccade/antisaccade task (black) and in pro-saccade only task (red).\n\ncue was shown for one time-step, and then only the fixation mark was visible for two time-steps before turning off. In the pro-saccade condition, the offset of the fixation mark indicated that the model should make an eye-movement towards the cue location to collect r_fin. In the anti-saccade condition, the model had to make an eye-movement away from the cue location. The model had to make the correct eye-movement within eight time steps. The input to the model (Fig. 2B) consisted of four binary variables representing the information on the virtual screen; two for the fixation marks and two for the cue location. Due to the +/− cells, the input layer thus had 12 binary units.\n\nWe trained the models for at most 25,000 trials, or until convergence. We measured convergence as the proportion of correct trials for the last 50 examples of all trial-types (N = 4). When this proportion reached 0.9 or higher for all trial-types, learning in the network was stopped and we evaluated accuracy on all trial types without stochastic exploration of actions. We considered learning successful if the model performed all trial-types accurately.\n\nWe trained 10,000 randomly initialized networks with and without a shaping reward (r_fix = 0). Of the networks that received fixation rewards, 9,945 learned the task versus 7,641 that did not receive fixation rewards; χ²(1, N = 10,000) = 2,498, P < 10^−6. The 10,000 models trained with shaping learned the complete task in a median of 4,117 trials. This is at least an order of magnitude faster than monkeys, which typically learn such a task after months of training with more than 1,000 trials per day, e.g. 
[6].\n\nThe activity of a trained network is illustrated in Fig. 2C. The Q-unit for fixating at the center had strongest activity at fixation onset and throughout the fixation and memory delays, whereas the Q-unit for the appropriate eye movement became more active after the go-signal. Interestingly, the activity of the Q-cells also depended on cue-location during the memory delay, as is observed, for example, in the frontal eye fields [18]. This activity derives from memory units in the association layer that maintain a trace of the cue as a persistent elevation of their activity and are also tuned to the difference between pro- and antisaccade trials. To illustrate this, we defined selectivity indices (SIs) to characterize the tuning of memory units to the difference between pro- and antisaccade trials and to the difference in cue location. The sensitivity of units to differences in trial types, SI_type, was |0.5((R_PL + R_PR) − (R_AL + R_AR))|, with R representing a unit's activation level (at 'Go' time) in pro (P) and anti-saccade (A) trials with a left (L) or right (R) cue. A unit has an SI of 0 if it does not distinguish between pro- and antisaccade trials, and an SI of 1 if it is fully active for one trial type and inactive for the other. The sensitivity to cue location, SI_cue, was defined as |0.5((R_PL + R_AL) − (R_PR + R_AR))|. We trained 100 networks and found that units tuned to cue-location also tended to be selective for trial-type (black data points in Fig. 2D; SI correlation 0.79, N = 400, P < 10^−6). 
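The two selectivity indices follow directly from the four activation levels; a minimal sketch (function and argument names are our own):

```python
def selectivity_indices(R_PL, R_PR, R_AL, R_AR):
    """Selectivity of a memory unit for trial type and cue location,
    from its activation R at 'Go' time in the four trial types
    (P/A = pro/anti-saccade, L/R = left/right cue)."""
    si_type = abs(0.5 * ((R_PL + R_PR) - (R_AL + R_AR)))  # pro vs. anti
    si_cue = abs(0.5 * ((R_PL + R_AL) - (R_PR + R_AR)))   # left vs. right
    return si_type, si_cue
```

A unit fully active on pro-saccade trials and silent on anti-saccade trials thus scores SI_type = 1 regardless of cue side.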
To show that the association layer only learns to represent relevant features, we trained the same 100 networks using the same stimuli, but now only required pro-saccades, rendering the color of the fixation point irrelevant.\n\nFigure 3: A Probabilistic classification task (redrawn from [6]). B Model network. C Population averages, conditional on LogLR-quintile (inset), for LIP neurons (redrawn from [6]) (top) and model memory units over 100,000 trials after learning had converged (bottom). D Subjective weights inferred for a trained monkey (redrawn from [6]) (left) and average synaptic weights to an example memory unit (right) versus true symbol weights (A, right). E Histogram of weight correlations for 400 memory units from 100 trained networks.\n\nMemory units in the 97 converged networks now became tuned to cue-location but not to fixation point color (Fig. 2D, red data points; SI correlation 0.04, N = 388, P > 0.48), indicating that the association layer indeed only learns to represent relevant features.\n\n3.2 Probabilistic Classification\n\nNeurons in area LIP also play a role in perceptual decision making [5]. We hypothesized that memory units could learn to integrate probabilistic evidence for a decision. Yang and Shadlen [6] investigated how monkeys learn to combine information about four briefly presented symbols, which provided probabilistic cues whether a red or green eye movement target was baited with reward (Fig. 3A). A previous model with only one layer of modifiable synapses could learn a simplified, linear version of this task [19]. 
We tested whether AuGMEnT could train the network to adapt to the full complexity of the task, which demands a non-linear combination of information about the four symbols with the position of the red and green eye-movement targets. Trials followed the same structure as described in section 3.1, but now four cues were subsequently added to the display. Cues were drawn with replacement from a set of ten (Fig. 3A, right), each with a different associated weight. The sum of these weights, W, determined the probability that r_fin was assigned to the red target (R) as follows: P(R|W) = 10^W / (1 + 10^W). For the green target G, P(G|W) = 1 − P(R|W). At fixation mark offset, the model had to make a saccade to the target with the highest reward probability. The sensory layer of the model (Fig. 3B) had four retinotopic fields with binary units for all possible symbols, a binary unit for the fixation mark and four binary units coding the locations of the colored targets on the virtual screen. Due to the +/− units, this made 45 × 3 units in total.\n\nAs in [6], we increased the difficulty of the task gradually (i.e. we used a shaping strategy) by increasing the set of input symbols (2, 4, . . . , 10) and the sequence length (1−4) in eight steps. Training started with the 'trump' shapes, which guarantee reward for the correct decision (Fig. 3A, right; see [6]), and then added the symbols with the next highest absolute weights. We determined that the task had been learned when the proportion of trials on which the correct decision was taken over the last n trials reached 0.85, where n was increased with the difficulty level l of the task. For the first 5 levels, n(l) = 500 + 500l, and for l = 6, 7, 8, n was 10,000; 10,000 and 20,000, respectively. Networks were trained for at most 500,000 trials.\n\nThe behavior of a trained network is shown in figure 3C (bottom). 
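The reward-assignment rule above can be sketched as follows (function names and the use of Python's random module are our own, not from the original task code):

```python
import random

def p_red(symbol_weights):
    """Probability that the red target is baited, given the summed
    weights W of the presented symbols: P(R|W) = 10^W / (1 + 10^W)."""
    W = sum(symbol_weights)
    return 10.0 ** W / (1.0 + 10.0 ** W)

def assign_reward(symbol_weights, rng=random.Random(0)):
    """Sample which target is baited with the full reward r_fin."""
    return 'red' if rng.random() < p_red(symbol_weights) else 'green'
```

With W = 0 the two targets are equally likely; positive sums favour red and negative sums favour green, matching Fig. 3A.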
Memory units integrated information for one of the choices over the symbol sequence and maintained information about the value of this choice as persistent activity during the memory delay. Their activation was correlated to the log likelihood that the targets were baited, just like LIP neurons [6] (Fig. 3C). The graphs show average activations of populations of real and model neurons in the four cue presentation epochs.\n\nFigure 4: Association layer scaling behavior for A default learning parameters and B optimized learning parameters. Error bars are 95% confidence intervals. Parameters used are indicated by shading (see inset).\n\nEach possible sub-sequence of cues was assigned to a log-likelihood ratio (logLR) quintile, which correlates with the probability that the neuron's preferred eye-movement is rewarded. Note that sub-sequences from the same trial might be assigned to different quintiles. We computed logLR quintiles by enumerating all combinations of four symbols and then computing the probabilities of reward for saccades to red and green targets. Given these probabilities, we computed the reward probability for all sub-sequences by marginalizing over the unknown symbols; i.e. to compute the probability that the red target was baited given only a first symbol s_i, P(R|s_i), we summed the probabilities for full sequences starting with s_i and divided by the number of such full sequences. We then computed the logLR for the sub-sequences and divided those into quintiles. 
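The marginalization just described can be sketched as follows (a simplified sketch: the symbol weights passed in are placeholders, and the 'trump' shapes with infinite weights are excluded because they make the log-odds diverge):

```python
import itertools
import math

def log_likelihood_ratio(prefix, symbol_weights, seq_len=4):
    """logLR in favour of the red target after seeing `prefix` (a tuple of
    symbol weights already shown), marginalizing P(R|W) = 10^W / (1 + 10^W)
    over all equally likely completions of the four-symbol sequence."""
    remaining = seq_len - len(prefix)
    total, n = 0.0, 0
    for rest in itertools.product(symbol_weights, repeat=remaining):
        W = sum(prefix) + sum(rest)
        total += 10.0 ** W / (1.0 + 10.0 ** W)
        n += 1
    p_red = total / n                 # marginal P(red baited | prefix)
    return math.log10(p_red / (1.0 - p_red))
```

For a symmetric symbol set the empty prefix gives logLR = 0, and each informative symbol shifts the ratio toward its own sign.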
For model units we rearranged the quintiles so that they were aligned in the last epoch to compute the population average.\n\nSynaptic weights from input neurons to memory cells became strongly correlated to the true weights of the symbols (Fig. 3D, right; Spearman correlation, ρ = 1, P < 10^−6). Thus, the training of synaptic weights to memory neurons in parietal cortex can explain how the monkeys valuate the symbols [19]. We trained 100 networks on the same task and computed Spearman correlations of the memory unit weights with the true weights, and found that in general they learn to represent the symbols (Fig. 3E). The learning scheme thus offers a biologically realistic explanation of how neurons in LIP learn to integrate relevant information in a probabilistic classification task.\n\n3.3 Scaling behavior\n\nTo show that the learning scheme scales well, we ran a series of simulations with increasing numbers of association units. We scaled the number of association units by powers of two, from 2^1 = 2 (yielding 6 regular units and 8 memory units) to 2^7 = 128 (yielding 384 regular and 512 memory units). For each scale, we trained 100 networks on the saccade/antisaccade task, as described in section 3.1. We first evaluated these scaled networks with the standard set of learning parameters and found that these yielded stable results within a wide range, but that performance deteriorated for the largest networks (from 2^6 = 64; 192 regular units and 256 memory units) (Fig. 4A). In a second experiment (Fig. 4B), we also varied the learning rate (β) and trace decay (λ) parameters. We jointly scaled these parameters by 1/2, 1/4 and 1/8 and selected the parameter combination which resulted in the highest convergence rate and the fastest median convergence speed. It can be seen that the performance of the larger networks was at least as good as that of the default network, provided the learning parameters were scaled. 
Furthermore, we ran extensive grid-searches over the λ, β parameter space using default networks (not shown) and found that the model robustly learns both tasks with a wide range of parameters.\n\n4 Discussion\n\nWe have shown that AuGMEnT can train networks to solve working memory tasks that require non-linear stimulus-response mappings and the integration of sensory evidence in a biologically plausible way. All the information required for the synaptic updates is available locally, at the synapses. The network is trained by a form of SARSA(λ) [10, 2], and synaptic updates minimize TD errors by stochastic gradient descent. Although there is an ongoing debate whether SARSA- or Q-learning [20]-like algorithms are used by the brain [21, 22], we used SARSA because it has stronger convergence guarantees than Q-learning when used to train neural networks [23]. Although stability is considered a problem for neural networks implementing reinforcement learning methods [24], AuGMEnT robustly trained networks on our tasks for a wide range of model parameters.\n\nTechnically, working memory tasks are Partially Observable Markov Decision Processes (POMDPs), because current observations do not contain the information to make optimal decisions [25]. 
Although AuGMEnT is not a solution for all POMDPs, as these are in general intractable [25], its simple learning mechanism is well able to learn challenging working memory tasks.\n\nThe problem of learning new working memory representations by reinforcement learning is not well-studied. Some early work used the biologically implausible backpropagation-through-time algorithm to learn memory representations [26, 27]. Most other work pre-wires some aspects of working memory and only has a single layer of plastic weights (e.g. [19]), so that the learning mechanism is not general. To our knowledge, the model by O'Reilly and Frank [7] is most closely related to AuGMEnT. This model is able to learn a variety of working memory tasks, but it requires a teaching signal that provides the correct actions on each time-step, and the architecture and learning rules are elaborate. AuGMEnT only requires scalar rewards and the learning rules are simple and well-grounded in RL theory [2].\n\nAuGMEnT explains how neurons become tuned to relevant sensory stimuli in sequential decision tasks that animals learn by trial and error. The scheme uses units with properties that resemble cortical and subcortical neurons: transient and sustained neurons in sensory cortices [28], action-value coding neurons in frontal cortex and basal ganglia [29, 30], and neurons which integrate input and therefore carry traces of previously presented stimuli in association cortex. The persistent activity of these memory cells could derive from intracellular processes, local circuit reverberations or recurrent activity in larger networks spanning cortex, thalamus and basal ganglia [31]. The learning scheme adopts previously proposed ideas that globally released neuromodulatory signals code deviations from reward expectancy and gate synaptic plasticity [8, 9, 14]. 
In addition to this neuromodulatory signal, plasticity in AuGMEnT is gated by an attentional feedback signal that tags synapses responsible for the chosen action. Such a feedback signal exists in the brain, because neurons at the motor stage that code a selected action enhance the activity of upstream neurons that provided input for this action [32], a signal that explains a corresponding shift of visual attention [33]. AuGMEnT trains networks to direct feedback (i.e. selective attention) to features that are critical for the stimulus-response mapping and are associated with reward. Although the hypothesis that attentional feedback controls the formation of tags is new, there is ample evidence for the existence of synaptic tags [34, 12]. Recent studies have started to elucidate the identity of the tags [35, 36] and future work could investigate how they are influenced by attention. Interestingly, neuromodulatory signals influence synaptic plasticity even if released seconds or minutes later than the plasticity-inducing event [12, 35], which supports the idea that they interact with a trace of the stimulus, i.e. some form of tag. Here we have shown how interactions between synaptic tags and neuromodulatory signals explain how neurons in association areas acquire working memory representations for apparently disparate tasks that rely on working memory or decision making. These tasks now fit in a single, unified reinforcement learning framework.\n\nReferences\n\n[1] Gnadt, J. and Andersen, R. A. Memory related motor planning activity in posterior parietal cortex of macaque. Experimental Brain Research, 70(1):216–220, 1988.\n\n[2] Sutton, R. S. and Barto, A. G. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.\n\n[3] Gottlieb, J. and Goldberg, M. E. Activity of neurons in the lateral intraparietal area of the monkey during an antisaccade task. Nature Neuroscience, 2(10):906–912, 1999.\n\n[4] Bisley, J. W. and Goldberg, M. E. 
Attention, intention, and priority in the parietal lobe. Annual Review of Neuroscience, 33:1–21, 2010.

[5] Gold, J. I. and Shadlen, M. N. The neural basis of decision making. Annual Review of Neuroscience, 30:535–574, 2007.

[6] Yang, T. and Shadlen, M. N. Probabilistic reasoning by neurons. Nature, 447(7148):1075–1080, 2007.

[7] O'Reilly, R. C. and Frank, M. J. Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Computation, 18(2):283–328, 2006.

[8] Izhikevich, E. M. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, 17(10):2443–2452, 2007.

[9] Montague, P. R., Hyman, S. E., et al. Computational roles for dopamine in behavioural control. Nature, 431(7010):760–767, 2004.

[10] Rummery, G. A. and Niranjan, M. Online Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, 1994.

[11] Funahashi, S., Bruce, C. J., et al. Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2):331–349, 1989.

[12] Cassenaer, S. and Laurent, G. Conditional modulation of spike-timing-dependent plasticity for olfactory learning. Nature, 482(7383):47–52, 2012.

[13] Roelfsema, P. R. and van Ooyen, A. Attention-gated reinforcement learning of internal representations for classification. Neural Computation, 17(10):2176–2214, 2005.

[14] Schultz, W. Multiple dopamine functions at different time courses. Annual Review of Neuroscience, 30:259–288, 2007.

[15] Wiering, M. and Schmidhuber, J. HQ-Learning. Adaptive Behavior, 6(2):219–246, 1997.

[16] Rumelhart, D. E., Hinton, G. E., et al. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[17] Krueger, K. A. and Dayan, P.
Flexible shaping: how learning in small steps helps. Cognition, 110(3):380–394, 2009.

[18] Sommer, M. A. and Wurtz, R. H. Frontal eye field sends delay activity related to movement, memory, and vision to the superior colliculus. Journal of Neurophysiology, 85(4):1673–1685, 2001.

[19] Soltani, A. and Wang, X.-J. Synaptic computation underlying probabilistic inference. Nature Neuroscience, 13(1):112–119, 2009.

[20] Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[21] Morris, G., Nevet, A., et al. Midbrain dopamine neurons encode decisions for future action. Nature Neuroscience, 9(8):1057–1063, 2006.

[22] Roesch, M. R., Calu, D. J., et al. Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nature Neuroscience, 10(12):1615–1624, 2007.

[23] van Seijen, H., van Hasselt, H., et al. A theoretical and empirical analysis of Expected Sarsa. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184, 2009.

[24] Baird, L. Residual algorithms: reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML), pages 30–37, 1995.

[25] Todd, M. T., Niv, Y., et al. Learning to use working memory in partially observable environments through dopaminergic reinforcement. In NIPS, volume 21, pages 1689–1696, 2009.

[26] Zipser, D. Recurrent network model of the neural mechanism of short-term active memory. Neural Computation, 3(2):179–193, 1991.

[27] Moody, S. L., Wise, S. P., et al. A model that accounts for activity in primate frontal cortex during a delayed matching-to-sample task. The Journal of Neuroscience, 18(1):399–410, 1998.

[28] Nassi, J. J. and Callaway, E. M. Parallel processing strategies of the primate visual system. Nature Reviews
Neuroscience, 10(5):360–372, 2009.

[29] Hikosaka, O., Nakamura, K., et al. Basal ganglia orient eyes to reward. Journal of Neurophysiology, 95(2):567–584, 2006.

[30] Samejima, K., Ueda, Y., et al. Representation of action-specific reward values in the striatum. Science, 310(5752):1337–1340, 2005.

[31] Wang, X.-J. Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences, 24(8):455–463, 2001.

[32] Roelfsema, P. R., van Ooyen, A., et al. Perceptual learning rules based on reinforcers and attention. Trends in Cognitive Sciences, 14(2):64–71, 2010.

[33] Deubel, H. and Schneider, W. Saccade target selection and object recognition: evidence for a common attentional mechanism. Vision Research, 36(12):1827–1837, 1996.

[34] Frey, U. and Morris, R. Synaptic tagging and long-term potentiation. Nature, 385(6616):533–536, 1997.

[35] Moncada, D., Ballarini, F., et al. Identification of transmitter systems and learning tag molecules involved in behavioral tagging during memory formation. PNAS, 108(31):12931–12936, 2011.

[36] Sajikumar, S. and Korte, M. Metaplasticity governs compartmentalization of synaptic tagging and capture through brain-derived neurotrophic factor (BDNF) and protein kinase Mzeta (PKMzeta). PNAS, 108(6):2551–2556, 2011.