{"title": "Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation", "book": "Advances in Neural Information Processing Systems", "page_first": 385, "page_last": 392, "abstract": "Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through the cortical and basal ganglia. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming separate time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms.", "full_text": "Temporal Difference Based Actor Critic Learning -\n\nConvergence and Neural Implementation\n\nDotan Di Castro, Dmitry Volkinshtein and Ron Meir\n\nDepartment of Electrical Engineering\n\nTechnion, Haifa 32000, Israel\n\n{dot@tx},{dmitryv@tx},{rmeir@ee}.technion.ac.il\n\nAbstract\n\nActor-critic algorithms for reinforcement learning are achieving renewed popular-\nity due to their good convergence properties in situations where other approaches\noften fail (e.g., when function approximation is involved). Interestingly, there is\ngrowing evidence that actor-critic approaches based on phasic dopamine signals\nplay a key role in biological learning through cortical and basal ganglia loops.\nWe derive a temporal difference based actor critic learning algorithm, for which\nconvergence can be proved without assuming widely separated time scales for the\nactor and the critic. The approach is demonstrated by applying it to networks\nof spiking neurons. 
The established relation between phasic dopamine and the\ntemporal difference signal lends support to the biological relevance of such algo-\nrithms.\n\n1 Introduction\n\nActor-critic (AC) algorithms [22] were probably among the \ufb01rst algorithmic approaches to reinforce-\nment learning (RL). In recent years much work focused on state, or state-action, value functions as a\nbasis for learning. These methods, while possessing desirable convergence attributes in the context\nof table lookup representation, led to convergence problems when function approximation was in-\nvolved. A more recent line of research is based on directly (and usually parametrically) representing\nthe policy, and performing stochastic gradient ascent on the expected reward, estimated through try-\ning out various actions and sampling trajectories [3, 15, 23]. However, such direct policy methods\noften lead to very slow convergence due to large estimation variance. One approach suggested in\nrecent years to remedy this problem is the utilization of AC approaches, where the value function\nis estimated by a critic, and passed to an actor which selects an appropriate action, based on the\napproximated value function. The \ufb01rst convergence result for a policy gradient AC algorithm based\non function approximation was established in [13], and extended recently in [5, 6]. At this stage\nit seems that AC based algorithms provide a solid foundation for provably effective approaches to\nRL based on function approximation. Whether these methods will yield useful solutions to practical\nproblems remains to be seen.\nRL has also been playing an increasingly important role in neuroscience, and experimentalists have\ndirectly recorded the activities of neurons while animals perform learning tasks [20], and used imag-\ning techniques to characterize human brain activities [17, 24] during learning. 
It was suggested long\nago that the basal ganglia, a set of ancient sub-cortical brain nuclei, are implicated in RL. Moreover,\nthese nuclei are naturally divided into two components, based on the separation of the striatum (the\nmain input channel to the basal ganglia) into the ventral and dorsal components. Several imaging\nstudies [17, 24] have suggested that the ventral stream is associated with value estimation by a so\ncalled critic, while the dorsal stream has been implicated in motor output, action selection, and\nlearning by a so called actor. Two further experimental \ufb01ndings support the view taken in this work.\n\n\fFirst, it has been observed [20] that the short latency phasic response of the dopamine neurons in\nthe midbrain strongly resembles the temporal difference (TD) signal introduced in theory of TD-\nlearning [22], which can be used by AC algorithms for both the actor and the critic. Since mid-brain\ndopaminergic neurons project diffusively to both the ventral and dorsal components of the striatum,\nthese results are consistent with a TD-based AC learning interpretation of the basal ganglia. Second,\nrecent results suggest that synaptic plasticity occurring at the cortico-striatal synapses is strongly\nmodulated by dopamine [18]. Based on these observations it has been suggested that the basal gan-\nglia take part in TD based RL, with the (global) phasic dopamine signal serving as the TD signal\n[16] modulating synaptic plasticity.\nSome recent work has been devoted to implementing RL in networks of spiking neurons (e.g., [1,\n9, 12]). Such an approach may lead to speci\ufb01c and experimentally veri\ufb01able hypotheses regarding\nthe interaction of known synaptic plasticity rules and RL. In fact, one tantalizing possibility is to\ntest these derived rules in the context of ex-vivo cultured neural networks (e.g., [19]), which are\nconnected to the environment through input (sensory) and output (motor) channels. 
We then envision dopamine serving as a biological substrate for implementing the TD signal in such a system. The work cited above is mostly based on direct policy gradient algorithms (e.g., [3]), leading to non-AC approaches. Moreover, these algorithms were based directly on the reward, rather than on the biologically better-motivated TD signal, which provides more information than the reward itself and is expected to lead to improved convergence.\n\n2 A Temporal Difference Based Actor-Critic Algorithm\n\nThe TD-based AC algorithm developed in this section is related to the one presented in [5, 6]. While the derivation of the present algorithm differs from the latter work (which also stressed the issue of the natural gradient), the essential novel theoretical feature here is the establishment of convergence¹ without the restriction to two time scales which was used in [5, 6, 13]. This result is also important in a biological context, where, as far as we are aware, there is no evidence for such a time scale separation.\n\n2.1 Problem Formulation\nWe consider a finite Markov Decision Process (MDP) in discrete time with a finite state set X of size |X| and a finite action set U. The MDP models the environment in which the agent acts. Each selected action u ∈ U determines a stochastic matrix P(u) = [P(y|x, u)]_{x,y∈X}, where P(y|x, u) is the transition probability from a state x ∈ X to a state y ∈ X given the control u. A parameterized policy is described by a conditional probability function, denoted by µ(u|x, θ), which maps an observation x ∈ X into a control u ∈ U given a parameter θ ∈ R^K. For each state x ∈ X the agent receives a corresponding reward r(x). The agent's goal is to adjust the parameter θ in order to attain maximum average reward over time.\nFor each θ ∈ R^K, we have a Markov Chain (MC) induced by P(y|x, u) and µ(u|x, θ). 
The state transitions of the MC are obtained by first generating an action u according to µ(u|x, θ), and then generating the next state according to P(y|x, u). Thus, the MC has a transition matrix P(θ) = [P(y|x, θ)]_{x,y∈X} which is given by P(y|x, θ) = ∫_U P(y|x, u) dµ(u|x, θ). We denote the set of these transition probabilities by P = {P(θ) | θ ∈ R^K}, and its closure by P̄. We denote by P(x, u, y) the stationary probability of being in state x, choosing action u, and moving to state y. Several technical assumptions are required in the proofs below.\nAssumption 2.1. (i) Each MC P(θ) ∈ P̄ is aperiodic, recurrent, and contains a single equivalence class. (ii) The function µ(u|x, θ) is twice differentiable. Moreover, there exist positive constants B_r and B_µ, such that for all x ∈ X, u ∈ U, θ ∈ R^K and 1 ≤ k1, k2 ≤ K, we have |r(x)| ≤ B_r, |∂µ(u|x, θ)/∂θ_k| ≤ B_µ, |∂²µ(u|x, θ)/∂θ_k1 ∂θ_k2| ≤ B_µ.\nAs a result of Assumption 2.1(i), we have the following lemma regarding the stationary distribution (Theorem 3.1 in [8]).\n\n¹Throughout this paper convergence refers to convergence to a small ball around a stationary point; see Theorem 2.6 for a precise definition.\n\nLemma 2.1. Under Assumption 2.1(i), each MC P(θ) ∈ P̄ has a unique stationary distribution, denoted by π(θ), satisfying π(θ)′P(θ) = π(θ)′, where x′ is the transpose of vector x.\n\nNext, we define a measure for the performance of an agent in an environment. 
The average reward per stage of a MC starting from an initial state x0 ∈ X is defined by\n\n J(x|θ) ≜ lim_{T→∞} E_θ[ (1/T) Σ_{n=0}^{T} r(x_n) | x_0 = x ],\n\nwhere E_θ[·] denotes the expectation under the probability measure P(θ), and x_n is the state at time n. The agent's goal is to find θ ∈ R^K which maximizes J(x|θ). The following lemma shows that under Assumption 2.1, the average reward per stage does not depend on the initial state (see Theorem 4.7 in [10]).\nLemma 2.2. Under Assumption 2.1 and Lemma 2.1, the average reward per stage, J(x|θ), is independent of the starting state, is denoted by η(θ), and satisfies η(θ) = π(θ)′r.\n\nBased on Lemma 2.2, the agent's goal is to find a parameter vector θ which maximizes the average reward per stage η(θ). Performing the maximization directly on η(θ) is hard. In the sequel we show how this maximization can be performed by stochastic gradient ascent, using ∇η(θ). A consequence of Assumption 2.1 and the definition of η(θ) is the following lemma (see Lemma 1 in [15]).\nLemma 2.3. For each x, y ∈ X and for each θ ∈ R^K, the functions P(y|x, θ), π(x|θ), and η(θ) are bounded, twice differentiable, and have bounded first and second derivatives.\nNext, we define the differential value function of a state x ∈ X, which represents the average differential reward the agent receives upon starting from state x and reaching the recurrent state x* for the first time. Mathematically,\n\n h(x|θ) ≜ E_θ[ Σ_{n=0}^{T} (r(x_n) − η(θ)) | x_0 = x ],    (1)\n\nwhere T ≜ min{k > 0 | x_k = x*}. We define h(θ) ≜ (h(x_1|θ), . . . , h(x_|X||θ)) ∈ R^|X|. For each θ ∈ R^K and x ∈ X, h(x|θ), r(x), and η(θ) satisfy Poisson's equation (see Theorem 7.4.1 in [4]),\n\n h(x|θ) = r(x) − η(θ) + Σ_{y∈X} P(y|x, θ) h(y|θ).    (2)\n\nBased on the differential value definition we define the temporal difference (TD) between states x ∈ X and y ∈ X. Formally,\n\n d(x, y) ≜ r(x) − η(θ) + h(y|θ) − h(x|θ).    (3)\n\nThe TD measures the difference between the differential value estimate following the receipt of reward r(x) and a move to a new state y, and the estimate of the current differential value at state x.\n\n2.2 Algorithmic details and single time scale convergence\nWe start with a definition of the likelihood ratio derivative, ψ(x, u|θ) ≜ ∇µ(u|x, θ)/µ(u|x, θ), which we assume to be bounded.\nAssumption 2.2. For all x ∈ X, u ∈ U, and θ ∈ R^K, there exists a positive constant B_ψ such that |ψ(x, u|θ)| ≤ B_ψ < ∞.\nIn order to improve the agent's performance, we need to follow the gradient direction. The following theorem shows how the gradient of the average reward per stage can be calculated from the TD signal. Similar variants of the theorem were proved using the Q-value [23] or the state value [15] instead of the TD signal.\nTheorem 2.4. The gradient of the average reward per stage for θ ∈ R^K can be expressed as\n\n ∇η(θ) = Σ_{x,y∈X, u∈U} P(x, u, y) ψ(x, u|θ) (d(x, y) + f(x))    (f(x) arbitrary).    (4)\n\nThe theorem was proved using an advantage function argument in [6]. We provide a direct proof in section A of the supplementary material. 
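Theorem 2.4 can be checked numerically on a toy problem. The sketch below is a hypothetical 3-state, 2-action MDP with a softmax policy (none of its quantities come from the paper): it solves Poisson's equation (2) for h, forms the TD signal (3), evaluates the gradient expression (4) with f(x) = 0, and compares the result against a finite-difference gradient of η(θ).

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP, used only to illustrate Theorem 2.4.
np.random.seed(0)
X, U, K = 3, 2, 2                                  # |states|, |actions|, dim(theta)
P = np.random.dirichlet(np.ones(X), size=(X, U))   # P[x, u, y] = P(y|x, u)
r = np.array([1.0, 0.0, 0.5])                      # reward r(x)
phi = np.random.randn(X, U, K)                     # features of a softmax policy

def mu(theta):
    """Softmax policy mu(u|x, theta)."""
    z = np.exp(phi @ theta)
    return z / z.sum(axis=1, keepdims=True)

def stationary(theta):
    """Stationary distribution pi(theta) and transition matrix P(theta)."""
    Pu = mu(theta)
    Pt = np.einsum('xu,xuy->xy', Pu, P)            # P(y|x, theta)
    w, v = np.linalg.eig(Pt.T)
    pi = np.real(v[:, np.argmax(np.real(w))])      # Perron eigenvector
    return pi / pi.sum(), Pt

def eta(theta):
    """Average reward per stage, eta(theta) = pi(theta)' r (Lemma 2.2)."""
    pi, _ = stationary(theta)
    return pi @ r

def grad_eta_td(theta):
    """Gradient of eta via Eq. (4), with f(x) = 0."""
    pi, Pt = stationary(theta)
    # Differential values from Poisson's equation (2), via the fundamental matrix.
    h = np.linalg.inv(np.eye(X) - Pt + np.outer(np.ones(X), pi)) @ (r - eta(theta))
    Pu = mu(theta)
    # psi(x, u|theta) = grad log mu(u|x, theta) for the softmax policy
    psi = phi - np.einsum('xu,xuk->xk', Pu, phi)[:, None, :]
    g = np.zeros(K)
    for x in range(X):
        for u in range(U):
            for y in range(X):
                d = r[x] - eta(theta) + h[y] - h[x]        # TD signal, Eq. (3)
                g += pi[x] * Pu[x, u] * P[x, u, y] * d * psi[x, u]
    return g

theta0 = np.zeros(K)
g_td = grad_eta_td(theta0)
g_fd = np.array([(eta(theta0 + 1e-6 * e) - eta(theta0 - 1e-6 * e)) / 2e-6
                 for e in np.eye(K)])                      # finite differences
```

The two gradients agree to within the finite-difference error, which is the content of Theorem 2.4; since Σ_u ∇µ(u|x, θ) = 0, adding any f(x) inside the parentheses leaves the result unchanged.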
The flexibility resulting from the function f(x) allows us to encode the TD signal using biologically realistic positive values only, without influencing the convergence proof. In this paper, for simplicity, we use f(x) = 0.\nBased on Theorem 2.4, we suggest a TD-based AC algorithm. This algorithm is motivated by [15], where an actor-only algorithm was proposed. In [15] the differential value function was re-estimated afresh for each regenerative cycle, leading to a large estimation variance. By the continuity of the actor's policy function in θ, the difference between the estimates in successive regenerative cycles is small. Thus, the critic has a good initial estimate at the beginning of each cycle, which is used here in order to reduce the variance. A related AC algorithm was proposed in [5, 6], where two time scales were assumed in order to use Borkar's two time scales convergence theorem [7]. In our proposed algorithm, and associated convergence theorem, we do not assume different time scales for the actor and for the critic.\nWe present batch mode update equations² in Algorithm 1 for the actor and the critic. The algorithm is based on some recurrent state x*; the visit times to this state are denoted by t0, t1, . . .. Updates occur only at these times (batch mode). We define a cycle of the algorithm by the time indices which satisfy tm ≤ n < tm+1. 
The variables d̃, h̃(x), and η̃ are the critic's estimates for d, h(x|θ), and η(θ), respectively.\n\nAlgorithm 1 Temporal Difference Based Actor Critic Algorithm\n1: Given\n • An MDP with finite set X of states and a recurrent state x*, satisfying 2.1(i).\n • Hitting times t0 < t1 < t2 < ··· for the state x*.\n • Step coefficients γm such that Σ_{m=1}^{∞} γm = ∞ and Σ_{m=1}^{∞} γm² < ∞.\n • A parameterized policy µ(u|x, θ), θ ∈ R^K, which satisfies Assumption 2.1(ii).\n • A set H, constants B_h̃ and B_θ, and an operator Π_H according to Assumption B.1.\n • Step parameters Γη and Γh satisfying Theorem 2.6.\n2: Initiate the critic's variables:\n • η̃0 = 0 (the estimate of the average reward per stage)\n • h̃0(x) = 0, ∀x ∈ X (the estimate of the differential value function)\n3: Initiate the actor: θ0 = 0 and choose f(x) (see (4)).\n4: for each state x_{tm+1} visited do\n5: Critic: For all x ∈ X, let Nm(x) ≜ min{tm < k < tm+1 | xk = x} (min(∅) = ∞), and\n  d̃(xn, xn+1) = r(xn) − η̃m + h̃m(xn+1) − h̃m(xn),\n  h̃m+1(x) = h̃m(x) + γm Γh Σ_{n=Nm(x)}^{tm+1−1} d̃(xn, xn+1),  ∀x ∈ X,\n  η̃m+1 = η̃m + γm Γη Σ_{n=tm}^{tm+1−1} (r(xn) − η̃m).\n6: Actor: θm+1 = θm + γm Σ_{n=tm}^{tm+1−1} ψ(xn, un|θm) (d̃(xn, xn+1) + f(xn)).\n7: Project each component of h̃m+1 and θm+1 onto H (see Assumption B.1).\n8: end for\n\nIn order to prove the convergence of Algorithm 1, we establish two basic results. 
The first shows that the algorithm converges to the set of ordinary differential equations (5), and the second establishes conditions under which the differential equations converge locally.\n\n²In order to prove convergence certain boundedness conditions need to be imposed, which appear as step 7 in the algorithm. For lack of space, the precise definition of the set H is given in Assumption B.1 of the supplementary material.\n\nTheorem 2.5. Under Assumptions 2.1 and B.1, Algorithm 1 converges with probability 1 to the following set of ODEs:\n\n dθ/dt = T(θ)∇η(θ) + C(θ)(η(θ) − η̃) + Σ_{x∈X} D(x)(θ)(h(x|θ) − h̃(x)),\n dh̃(x)/dt = Γh (h(x|θ) − h̃(x)) + Γh T(θ)(η(θ) − η̃),  x ∈ X,    (5)\n dη̃/dt = Γη T(θ)(η(θ) − η̃),\n\nwhere\n\n T = min{k > 0 | x0 = x*, xk = x*},  T(θ) = E_θ[T],  C(θ) = E_θ[ Σ_{n=0}^{T−1} ψ(xn, un|θ) | x0 = x* ],\n D(x)(θ) = E_θ[ Σ_{n=0}^{T−1} 1{xn+1 = x} ψ(xn, un|θ) | x0 = x* ] + E_θ[ Σ_{n=0}^{T−1} 1{xn = x} ψ(xn, un|θ) | x0 = x* ],\n\nand where T(θ), C(θ), and D(x)(θ) are continuous with respect to θ.\n\nTheorem 2.5 is proved in section B of the supplementary material, based on the theory of stochastic approximation, and more specifically, on Theorem 5.2.1 in [14]. 
An advantage of the proof technique is that it does not need to assume two time scales.\nThe second theorem, proved in section C of the supplementary material, states the conditions under which η(θ(t)) converges to a ball around the local optimum.\nTheorem 2.6. If we choose Γη ≥ B²_η̇/ε_η and Γh ≥ B²_ḣ/ε_h, for some positive constants ε_h and ε_η, then lim sup_{t→∞} ‖∇η(θ(t))‖ ≤ ε, where ε ≜ B_C ε_η + |X| B_D ε_h. The constants B_η̇ and B_ḣ are defined in Section C of the supplementary material.\n\n3 A Neural Algorithm for the Actor Using McCulloch-Pitts Neurons\n\nIn this section we apply the previously developed algorithm to the case of neural networks. We start with the classic binary-valued McCulloch-Pitts neuron, and then consider a more realistic spiking neuron model. While the algorithm presented in Section 2 was derived and proved to converge in batch mode, we apply it here in an online fashion. The derivation of an online learning algorithm from the batch version is immediate (e.g., [15]), and a proof of convergence in this setting is currently underway.\n\nA McCulloch-Pitts actor network\nThe dynamics of the binary-valued neurons, given at time n by {ui(n)}_{i=1}^{N}, ui(n) ∈ {0, 1}, is assumed to be based on stochastic discrete-time parallel updates, given by\n\n Pr(ui(n) = 1) = σ(vi(n)),  where  vi(n) = Σ_{j=1}^{N} wij uj(n − 1)   (i = 1, 2, . . . , N).\n\nHere σ(v) = 1/(1 + exp(−v)), and the parameters θ in Algorithm 1 are given by {wij}, where wij(n) is the j ↦ i synaptic weight at time n. 
Each neuron's stochastic output ui is viewed as an action.\nApplying the actor update from Algorithm 1 we obtain the following online learning rule\n\n wij(n + 1) = wij(n) + γ d(x(n), x(n + 1)) (ui(n) − σ(vi(n))) uj(n − 1),    (6)\n\nwhere d(x(n), x(n + 1)) is the TD signal.\nThe update (6) can be interpreted as an error-driven Hebbian-like learning rule modulated by the TD signal. It resembles the direct policy update rule presented in [2], except that in this rule the reward signal is replaced by the TD signal (computed by the critic). Moreover, the eligibility trace formalism in [2] differs from our formulation.\n\nWe describe a simulation experiment conducted using a one-layer feed-forward artificial neural network which functions as an actor, combined with a critic that is not biologically motivated. The purpose of the experiment is to examine a simple neuronal model, using different actor and critic architectures. The actor network consists of a single-layer feed-forward network of McCulloch-Pitts neurons, with TD-modulated synapses as described above, where the TD signal is calculated by a critic. The environment is a maze with barriers consisting of 36 states, see Figure 1(b), where a reward of value 1 is provided at the top right corner and the reward is zero elsewhere. Every time the agent receives a reward, it is transferred randomly to a different location in the maze. At each time step, the agent is given an input vector which represents the state. The output layer consists of 4 output neurons, where each neuron represents an action from the action set U = {up, down, left, right}. 
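A minimal sketch of one online step of rule (6), assuming a one-hot input encoding, illustrative network sizes and step size, and a TD signal d supplied by an external critic:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda v: 1.0 / (1.0 + np.exp(-v))   # logistic firing probability

N_in, N_out = 36, 4          # input neurons (one per maze state), action neurons
gamma = 0.05                 # step-size coefficient (illustrative)
w = np.zeros((N_out, N_in))  # w[i, j]: synaptic weight j -> i

def actor_step(w, u_prev, td):
    """One online application of rule (6).

    u_prev -- binary input-layer activity u_j(n-1)
    td     -- TD signal d(x(n), x(n+1)) computed by the critic
    """
    v = w @ u_prev                                    # membrane potentials v_i(n)
    u = (rng.random(N_out) < sigma(v)).astype(float)  # stochastic firing u_i(n)
    # w_ij <- w_ij + gamma * d * (u_i - sigma(v_i)) * u_j(n-1)
    return w + gamma * td * np.outer(u - sigma(v), u_prev), u

x = np.zeros(N_in)
x[7] = 1.0                       # one-hot encoding of a hypothetical maze state
w, action = actor_step(w, x, td=0.3)
```

Only synapses from the active input neuron change, and the sign of the change follows the TD signal, reflecting the error-driven, TD-modulated Hebbian character noted in the text.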
We used two different input representations for the actor, consisting of either 12 or 36 neurons (note that the minimum number of input neurons needed to represent 36 states is 6, and the maximum number is 36). The architecture with 36 input neurons represents each maze state with one exclusive neuron; thus there is no overlap between input vectors. The architecture with 12 input neurons uses a representation where each state is represented by two neurons, leading to overlaps between the input vectors. We tested two types of critic: a table-based critic which iterates according to Algorithm 1, and an exact critic which provides the TD of the optimal policy. The results are shown in Figure 1(c), averaged over 25 runs, and demonstrate the importance of good input representations and precise value estimates.\n\nFigure 1: (a) An illustration of the McCulloch-Pitts network. (b) A diagram of the maze where the agent needs to reach the reward at the upper right corner. (c) The average reward per stage in four different cases: an actor consisting of 12 input neurons and a table-based critic (blue crosses), an actor consisting of 36 input neurons and a table-based critic (green stars), an actor consisting of 12 input neurons and an exact critic (red circles), and an actor consisting of 36 input neurons and an exact critic (black crosses). The optimal average reward per stage is denoted by the dotted line, while a random agent achieves a reward of 0.005.\n\nA spiking neuron actor\n\nActual neurons function in continuous time, producing action potentials. Extending [1, 9], we developed an update rule which is based on the Spike Response Model (SRM) [11]. For each neuron we define a state variable vi(t) which represents the membrane potential. 
The dynamics of vi(t) is given by\n\n vi(t) = ϑi(t − t̂i) + Σ_{j=1}^{N} wij(t) Σ_{tf_j} εij(t − t̂i, t − tf_j),    (7)\n\nwhere wij(t) is the synaptic efficacy, t̂i is the last spike time of neuron i prior to t, ϑi(t) is the refractory response, tf_j are the times of the presynaptic spikes emitted prior to time t, and εij(t − t̂i, t − tf_j) is the response induced by neuron j at neuron i. The second summation in (7) is over all spike times of neuron j emitted prior to time t. The neuron model is assumed to have a noisy threshold, which we model by an escape noise model [11]. According to this model, the neuron fires in the time interval [t, t + δt) with probability ui(t)δt = ρi(vi(t) − vth)δt, where vth is the firing threshold and ρi(·) is a monotonically increasing function. When the neuron reaches the threshold it is assumed to fire, and the membrane potential is reset to vr.\n\nWe consider a network of continuous-time neurons and synapses. Based on Algorithm 1, using a small time step δt, we find\n\n wij(t + δt) = wij(t) + γ d(t) ψij(t).    (8)\n\nWe define the output of the neuron (interpreted as an action) at time t by ui(t). We note that the neuron's output is discrete and that at each time t, a neuron can fire, ui(t) = 1, or be quiescent, ui(t) = 0. Using the definition of ψ from Section 2.2 yields (similar to [9])\n\n ψij(t) = (ρ′i(t)/ρi(t)) Σ_{tf_j ∈ H^t_j} εij(t − t̂i, t − tf_j),   if ui(t) = 1,\n ψij(t) = −(δt ρ′i(t)/(1 − δt ρi(t))) Σ_{tf_j ∈ H^t_j} εij(t − t̂i, t − tf_j),   if ui(t) = 0.\n\nTaking the limit δt → 0 yields the following continuous-time update rule\n\n dwij(t)/dt = γ d(t) Fpost({tf_i}) Fpre({tf_j}),    (9)\n\nwhere Fpost({tf_i}) = ρ′i(t) ((1/ρi(t)) Σ_{tf_i} δ(t − tf_i) − 1) and Fpre({tf_j}) = Σ_{tf_j ∈ H^t_j} εij(t − t̂i, t − tf_j).\n\nSimilarly to [1, 9], we interpret the update rule (9) as a TD-modulated spike-time-dependent plasticity rule. A detailed discussion and interpretation of this update in a more biological context will be left to the full paper.\nWe applied the update rule (9) to an actor network consisting of spiking neurons based on (7). The network's goal was to reach a circle at the center of a 2D plane, where the agent can move, using Newtonian dynamics, in the four principal directions. The actor is composed of an input layer and a single layer of modifiable weights. The input layer consists of 'sensory' neurons which fire according to the agent's location in the environment. The synaptic dynamics of the actor is determined by (9). The critic receives the same inputs as the actor, but uses a linear function approximation architecture rather than the table lookup used in Algorithm 1. A standard parameter update rule appropriate for this architecture (e.g., ch. 8 in [22]) was used to update the critic's parameters³. 
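A sketch of what such a linear function-approximation critic could look like; the Gaussian feature map and step sizes are illustrative assumptions (the paper defers these details). The TD signal of Eq. (3) drives a semi-gradient update of the critic parameters together with a running estimate of the average reward:

```python
import numpy as np

K = 8                    # number of features (illustrative)
alpha, beta = 0.1, 0.01  # critic and average-reward step sizes (illustrative)

def feat(x):
    """Hypothetical Gaussian features for a scalar (continuous) state x."""
    centers = np.linspace(0.0, 1.0, K)
    return np.exp(-((x - centers) ** 2) / 0.02)

def critic_step(v, eta_hat, x, r, x_next):
    """Average-reward TD(0) with linear function approximation (cf. [22], ch. 8)."""
    phi, phi_next = feat(x), feat(x_next)
    d = r - eta_hat + phi_next @ v - phi @ v   # TD signal, Eq. (3), with h ~ feat' v
    v = v + alpha * d * phi                    # semi-gradient critic update
    eta_hat = eta_hat + beta * (r - eta_hat)   # running average-reward estimate
    return v, eta_hat, d                       # d is broadcast to the actor

v, eta_hat = np.zeros(K), 0.0
v, eta_hat, d = critic_step(v, eta_hat, x=0.2, r=1.0, x_next=0.25)
```

The returned TD value d plays the role of the global (dopamine-like) signal that modulates the actor's synaptic update (9).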
The output layer of the actor consists of four neuronal groups, representing the directions in which the agent can move, coded based on a firing rate model using Gaussian tuning curves. The TD signal is calculated according to (3). Whenever the agent reaches the central circle, it receives a reward and is transferred randomly to a new position in the environment.\nResults of such a simulation are presented in Figure 2. Figure 2(a) displays the agent's typical random-walk-like behavior prior to learning. Figure 2(b) depicts four typical trajectories representing the agent's actions after a learning phase. Finally, Figure 2(c) demonstrates the increase of the average reward per stage, η, vs. time.\n\nFigure 2: (a) Typical agent tracks prior to learning. (b) Agent trajectories following learning. (c) Average reward per stage plotted against time.\n\n4 Discussion\n\nWe have presented a temporal difference based actor-critic learning algorithm for reinforcement learning. The algorithm was derived from first principles based on following a noisy gradient of the average reward, and a convergence proof was presented without relying on the widely used two-time-scale separation for the actor and the critic.\n\n³Algorithm 1 relies on a table lookup critic, while in this example we used a function approximation based critic, due to the large (continuous) state space.\n\nThe derived algorithm was applied to neural networks, demonstrating their effective operation in maze problems. The motivation for the proposed algorithm was biological, providing a coherent computational explanation for several recently observed phenomena: actor-critic architectures in the basal ganglia, the relation of phasic dopaminergic neuromodulators to the TD signal, and the modulation of spike-time-dependent plasticity rules by dopamine. 
While a great deal of further work needs to be done on both the theoretical and biological components of the framework, we hope that these results provide a tentative step in the (noisy!) direction of explaining biological RL.\n\nReferences\n[1] D. Baras and R. Meir. Reinforcement learning, spike time dependent plasticity and the BCM rule. Neural Comput., 19(8):2245–2279, 2007.\n[2] J. Baxter and P.L. Bartlett. Hebbian synaptic modifications in spiking neurons that learn. Technical report, Research School of Information Sciences and Engineering, Australian National University, Canberra, 1999.\n[3] J. Baxter and P.L. Bartlett. Infinite-horizon policy-gradient estimation. J. of Artificial Intelligence Research, 15:319–350, 2001.\n[4] D.P. Bertsekas. Dynamic Programming and Optimal Control, Vol. I, 3rd Ed. Athena Scientific, 2006.\n[5] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 105–112. MIT Press, Cambridge, MA, 2008.\n[6] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, to appear, 2008.\n[7] V.S. Borkar. Stochastic approximation with two time scales. Syst. Control Lett., 29(5):291–294, 1997.\n[8] P. Bremaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer, 1999.\n[9] R.V. Florian. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19:1468–1502, 2007.\n[10] R.G. Gallager. Discrete Stochastic Processes. Kluwer Academic Publishers, 1995.\n[11] W. Gerstner and W.M. Kistler. Spiking Neuron Models. Cambridge University Press, Cambridge, 2002.\n[12] E.M. Izhikevich. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex, 17(10):2443–2452, 2007.\n[13] V.R. Konda and J. 
Tsitsiklis. On actor critic algorithms. SIAM J. Control Optim., 42(4):1143–1166, 2003.\n[14] H.J. Kushner and G.G. Yin. Stochastic Approximation Algorithms and Applications. Springer, 1997.\n[15] P. Marbach and J. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Trans. Automatic Control, 46:191–209, 2001.\n[16] P.R. Montague, P. Dayan, and T.J. Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16:1936–1947, 1996.\n[17] J. O'Doherty, P. Dayan, J. Schultz, R. Deichmann, K. Friston, and R.J. Dolan. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science, 304:452–454, 2004.\n[18] J.N.J. Reynolds and J.R. Wickens. Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15(4-6):507–521, 2002.\n[19] S. Marom and G. Shahaf. Development, learning and memory in large random networks of cortical neurons: lessons beyond anatomy. Quarterly Reviews of Biophysics, 35:63–87, 2002.\n[20] W. Schultz. Multiple reward signals in the brain. Nature Reviews Neuroscience, 1:199–207, Dec. 2000.\n[21] S. Singh and P. Dayan. Analytical mean squared error curves for temporal difference learning. Machine Learning, 32:5–40, 1998.\n[22] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.\n[23] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy-gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.\n[24] E.M. Tricomi, M.R. Delgado, and J.A. Fiez. Modulation of caudate activity by action contingency. Neuron, 41(2):281–292, 2004.\n", "award": [], "sourceid": 437, "authors": [{"given_name": "Dotan", "family_name": "Castro", "institution": null}, {"given_name": "Dmitry", "family_name": "Volkinshtein", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}