{"title": "Gradient Descent for Spiking Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1433, "page_last": 1443, "abstract": "Most large-scale network models use neurons with static nonlinearities that produce analog output, despite the fact that information processing in the brain is predominantly carried out by dynamic neurons that produce discrete pulses called spikes. Research in spike-based computation has been impeded by the lack of efficient supervised learning algorithm for spiking neural networks. Here, we present a gradient descent method for optimizing spiking network models by introducing a differentiable formulation of spiking dynamics and deriving the exact gradient calculation. For demonstration, we trained recurrent spiking networks on two dynamic tasks: one that requires optimizing fast (~ millisecond) spike-based interactions for efficient encoding of information, and a delayed-memory task over extended duration (~ second). The results show that the gradient descent approach indeed optimizes networks dynamics on the time scale of individual spikes as well as on behavioral time scales. In conclusion, our method yields a general purpose supervised learning algorithm for spiking neural networks, which can facilitate further investigations on spike-based computations.", "full_text": "Gradient Descent for Spiking Neural Networks\n\nDongsung Huh\nSalk Institute\n\nLa Jolla, CA 92037\n\nhuh@salk.edu\n\nTerrence J. Sejnowski\n\nSalk Institute\n\nLa Jolla, CA 92037\nterry@salk.edu\n\nAbstract\n\nMost large-scale network models use neurons with static nonlinearities that pro-\nduce analog output, despite the fact that information processing in the brain is\npredominantly carried out by dynamic neurons that produce discrete pulses called\nspikes. Research in spike-based computation has been impeded by the lack of\nef\ufb01cient supervised learning algorithm for spiking neural networks. 
Here, we\npresent a gradient descent method for optimizing spiking network models by in-\ntroducing a differentiable formulation of spiking dynamics and deriving the exact\ngradient calculation. For demonstration, we trained recurrent spiking networks on\ntwo dynamic tasks: one that requires optimizing fast (\u2248 millisecond) spike-based\ninteractions for ef\ufb01cient encoding of information, and a delayed-memory task over\nextended duration (\u2248 second). The results show that the gradient descent approach\nindeed optimizes networks dynamics on the time scale of individual spikes as well\nas on behavioral time scales. In conclusion, our method yields a general purpose\nsupervised learning algorithm for spiking neural networks, which can facilitate\nfurther investigations on spike-based computations.\n\n1\n\nIntroduction\n\nThe brain operates in a highly decentralized event-driven manner, processing multiple asynchronous\nstreams of sensory-motor data in real-time. The main currency of neural computation is spikes:\ni.e. brief impulse signals transmitted between neurons. Experimental evidence shows that brain\u2019s\narchitecture utilizes not only the rate, but the precise timing of spikes to process information [1].\nDeep-learning models solve simpli\ufb01ed problems by assuming static units that produce analog output,\nwhich describes the time-averaged \ufb01ring-rate response of a neuron. These rate-based arti\ufb01cial neural\nnetworks (ANNs) are easily differentiated, and therefore can be ef\ufb01ciently trained using gradient\ndescent learning rules. The recent success of deep learning demonstrates the computational potential\nof trainable, hierarchical distributed architectures.\nThis brings up the natural question: What types of computation would be possible if we could train\nspiking neural networks (SNNs)? 
The set of implementable functions by SNNs subsumes that of\nANNs, since a spiking neuron reduces to a rate-based unit in the high \ufb01ring-rate limit. Moreover,\nin the low \ufb01ring-rate range in which the brain operates (1\u223c10 Hz), spike-times can be utilized as\nan additional dimension for computation. However, such computational potential has never been\nexplored due to the lack of general learning algorithms for SNNs.\n\n1.1 Prior work\n\nDynamical systems are most generally described by ordinary differential equations, but linear time-\ninvariant systems can also be characterized by impulse response kernels. Most SNN models are\nconstructed using the latter approach, by de\ufb01ning a neuron\u2019s membrane voltage vi(t) as a weighted\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\flinear summation of kernels Kij(t \u2212 tk) that describe how the spike-event of neuron j at previous\ntime tk affects neuron i at time t. When the neuron\u2019s voltage approaches a suf\ufb01cient level, it generates\na spike in deterministic [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] or stochastic manner [13, 14, 15, 16, 17].\nThese kernel-based neuron models are known as spike response models (SRMs).\nThe appeal of SRMs is that they can simulate SNN dynamics without explicit integration steps.\nHowever, this representation takes individual spike-times as the state variables of SNNs, which causes\nproblems for learning algorithms when spikes are needed to be created or deleted during the learning\nprocess. 
For example, Spikeprop [2] and its variants [3, 4, 5, 6, 18] calculate the derivatives of\nspike-times to derive accurate gradient-based update rules, but they are only applicable to problems\nwhere each neuron is constrained to generating a prede\ufb01ned number of spikes.\nCurrently, learning algorithms compatible with variable spike counts have multiple shortcomings:\nMost gradient-based methods can only train \"visible neurons\" that directly receive desired target\noutput patterns [7, 8, 9, 11, 13, 17]. While extensions have been proposed to enable training of\nhidden neurons in multilayer [10, 14, 16, 19] and recurrent networks [15], they require neglecting\nthe derivative of the self-kernel terms, i.e. Kii(t), which is crucial for the gradient information to\npropagate through spike events. Moreover, the learning rules derived for speci\ufb01c neuron dynamics\nmodels cannot be easily generalized to other neuron models. Also, most methods require the training\ndata to be prepared in spike-time representations. For instance, they use loss functions that penalize\nthe difference between the desired and the actual output spike-time patterns. In practice, however,\nsuch spike-time data are rarely available.\nAlternative approaches take inspiration from biological spike-time dependent plasticity (STDP)\n[20, 21], and reward-modulated STDP process [22, 23, 24]. However, it is generally hard to guarantee\nconvergence of these bottom-up approaches, which do not consider the complex effects network\ndynamics nor the task information in designing of the learning rule.\nLastly, there are rate-based learning approaches, which convert trained ANN models into spiking\nmodels [25, 26, 27, 28, 29, 30], or apply rate-based learning rules to training SNNs [31]. 
However,\nthese approaches can at best replicate the solutions from rate-based ANN models, rather than\nexploring computational solutions that can utilize spike-times.\n\n1.2 New learning framework for spiking neural networks\n\nHere, we derive a novel learning approach for training SNNs represented by ordinary differential\nequations. The state vector is composed of dynamic variables, such as membrane voltage and synaptic\ncurrent, rather than spike-time history. This approach is compatible with the usual setup in optimal\ncontrol, which allows the gradient to be calculated with existing optimal control tools. Moreover,\nthe resulting process closely resembles the familiar backpropagation rule, so it can fully utilize the\nexisting statistical optimization methods of deep learning frameworks.\nNote that, unlike the prior literature, our work here provides not just a single learning rule for a\nparticular model and task, but a general framework for calculating gradients for arbitrary network\narchitectures, neuron models, and loss functions. Moreover, the goal of this research is not necessarily\nto replicate a biological learning phenomenon, but to derive efficient learning methods that can\nexplore the computational solutions implementable by the networks of spiking neurons in biology.\nThe trained SNN model could then be analyzed to reveal the computational processes of the brain, or\nprovide algorithmic solutions that can be implemented with neuromorphic hardware.\n\n2 Methods\n\n2.1 Differentiable synapse model\n\nIn spiking networks, transmission of neural activity is mediated by synaptic current. 
Most models\ndescribe the synaptic current dynamics as a linear filter process which instantly activates when the\npresynaptic membrane voltage v crosses a threshold: e.g.,\n\n$$\\tau \\dot{s} = -s + \\sum_k \\delta(t - t_k), \\tag{1}$$\n\nwhere $\\delta(\\cdot)$ is the Dirac-delta function, and $t_k$ denotes the time of the kth threshold-crossing. Such\nthreshold-triggered dynamics generates discrete, all-or-none responses of synaptic current, which is\nnon-differentiable.\n\nFigure 1: Differentiability of synaptic current dynamics: The synaptic current traces from eq (2)\n(solid lines, upper panels) are shown with the corresponding membrane voltage traces (lower panels).\nHere, the gate function is g = 1/\u2206 within the active zone of width \u2206 (shaded area, lower panels);\ng = 0 otherwise. (A,B) The pre-synaptic membrane voltage depolarizes beyond the active zone.\nDespite the different rates of depolarization, both events incur the same amount of charge in the\nsynaptic activity: $\\int s\\,dt = 1$. (C,D,E) Graded synaptic activity due to insufficient depolarization\nlevels that do not exceed the active zone. The threshold-triggered synaptic dynamics in eq (1) is also\nshown for comparison (red dashed lines, upper panels). The effect of voltage reset is ignored for the\npurpose of illustration. \u03c4 = 10 ms.\n\nHere, we replace the threshold with a gate function g(v): a non-negative ($g \\geq 0$), unit integral\n($\\int g\\,dv = 1$) function with narrow support (footnote 1), which we call the active zone. This allows the synaptic\ncurrent to be activated in a gradual manner throughout the active zone. The corresponding synaptic\ncurrent dynamics is\n\n$$\\tau \\dot{s} = -s + g\\,\\dot{v}, \\tag{2}$$\n\nwhere $\\dot{v}$ is the time derivative of the pre-synaptic membrane voltage. 
The $\\dot{v}$ term is required for\nthe dimensional consistency between eq (1) and (2): The $g\\,\\dot{v}$ term has the same $[\\mathrm{time}]^{-1}$ dimension\nas the Dirac-delta impulses of eq (1), since the gate function has the dimension $[\\mathrm{voltage}]^{-1}$ and $\\dot{v}$\nhas the dimension $[\\mathrm{voltage}][\\mathrm{time}]^{-1}$. Hence, the time integral of synaptic current, i.e. charge, is\na dimensionless quantity. Consequently, a depolarization event beyond the active zone induces a\nconstant amount of total charge regardless of the time scale of depolarization, since\n\n$$\\int s\\,dt = \\int g\\,\\dot{v}\\,dt = \\int g\\,dv = 1.$$\n\nTherefore, eq (2) generalizes the threshold-triggered synapse model while preserving the fundamental\nproperty of spiking neurons: i.e. all supra-threshold depolarizations induce the same amount of\nsynaptic responses regardless of the depolarization rate (Figure 1A,B). Depolarizations below the\nactive zone induce no synaptic responses (Figure 1E), and depolarizations within the active zone\ninduce graded responses (Figure 1C,D). This contrasts with the threshold-triggered synaptic dynamics,\nwhich causes abrupt, non-differentiable change of response at the threshold (Figure 1, dashed lines).\nNote that the $g\\,\\dot{v}$ term reduces to the Dirac-delta impulses in the zero-width limit of the active zone,\nwhich reduces eq (2) back to the threshold-triggered synapse model eq (1).\nThe gate function, without the $\\dot{v}$ term, was previously used as a differentiable model of synaptic\nconnection [32]. 
In such a model, however, a spike event delivers a varying amount of charge depending\non the depolarization rate: the slower the presynaptic depolarization, the greater the amount of charge\ndelivered to the post-synaptic targets.\n\nFootnote 1: Support of a function $g : X \\to \\mathbb{R}$ is the subset of the domain X where g(x) is non-zero.\n\n[Figure 1, panels A-E: synaptic current (ms^-1, upper panels) and membrane voltage (lower panels) versus t (ms).]\n\nFigure 2: The model receives time varying input, $\\vec{i}(t)$, processes it through a network of spiking\nneurons, and produces time varying output, $\\vec{o}(t)$. The internal state variables are the membrane\nvoltage $\\vec{v}(t)$ and the synaptic current $\\vec{s}(t)$.\n\n2.2 Network model\n\nTo complete the input-output dynamics of a spiking neuron, the synaptic current dynamics must\nbe coupled with the presynaptic neuron\u2019s internal state dynamics. For simplicity, we consider\ndifferentiable neural dynamics that depend only on the membrane voltage and the input current:\n\n$$\\dot{v} = f(v, I). \\tag{3}$$\n\nThe dynamics of an interconnected network of neurons can then be constructed by linking the\ndynamics of individual neurons and synapses eq (2,3) through the input current vector:\n\n$$\\vec{I} = W \\vec{s} + U \\vec{i} + \\vec{I}_o, \\tag{4}$$\n\nwhere W is the recurrent connectivity weight matrix, U is the input weight matrix, $\\vec{i}(t)$ is the input\nsignal for the network, and $\\vec{I}_o$ is the tonic current. Note that this formulation describes general, fully\nconnected networks; specific network structures can be imposed by constraining the connectivity: e.g.\ntriangular matrix structure W for multi-layer feedforward networks.\nLastly, we define the output of the network as the linear readout of the synaptic current:\n\n$$\\vec{o}(t) = O \\vec{s}(t),$$\n\nwhere O is the readout matrix. The overall schematic of the model is shown in Figure 2.\nAll of the network parameters W, U, O, $\\vec{I}_o$ can be tuned to minimize the total cost, $C \\equiv \\int l(t)\\,dt$,\nwhere l is the cost function that evaluates the performance of network output for a given task.\n\n2.3 Gradient calculation\n\nThe above spiking neural network model can be optimized via gradient descent. In general, the exact\ngradient of a dynamical system can be calculated using either Pontryagin\u2019s minimum principle [33],\nalso known as backpropagation through time, or real-time recurrent learning, which yield identical\nresults. We present the former approach here, which scales better with network size, $O(N^2)$ instead\nof $O(N^3)$, but the latter approach can also be straightforwardly implemented.\nBackpropagation through time for the spiking dynamics eq (2,3) utilizes the following backpropagating\ndynamics of adjoint state variables ($p_v$, $p_s$; see Supplementary Materials):\n\n$$-\\dot{p}_v = \\partial_v f\\, p_v - g\\, \\dot{p}_s \\tag{5}$$\n$$-\\tau \\dot{p}_s = -p_s + \\xi, \\tag{6}$$\n\nwhere $p_v$, $p_s$ are the modified adjoints of v and s, $\\partial_v f \\equiv \\partial f / \\partial v$, and $\\xi$ is called the error current.\nFor the recurrently connected network eq (4), the error current vector has the following expression\n\n$$\\vec{\\xi} = W^\\top \\overrightarrow{(\\partial_I f\\, p_v)} + \\overrightarrow{\\partial_s l}, \\tag{7}$$\n\nwhich links the backpropagating dynamics eq (5,6) of individual neurons. Here, $\\partial_I f \\equiv \\partial f / \\partial I$,\n$(\\partial_I f\\, p_v)_k \\equiv (\\partial f / \\partial I)_k\\, p_{v_k}$, and $(\\partial_s l)_k \\equiv \\partial l / \\partial s_k$.\nInterestingly, the coupling term of the backpropagating dynamics, $g\\, \\dot{p}_s$, has the same form as the\ncoupling term $g\\, \\dot{v}$ of the forward-propagating dynamics. Thus, the same gating mechanism that\nmediates the spike-based communication of signals also controls the propagation of error in the\nsame sparse, compressed manner.\nGiven the adjoint state vectors that satisfy eq (5,6,7), the gradient of the total cost with respect to the\nnetwork parameters can be calculated as\n\n$$\\nabla_W C = \\int \\overrightarrow{(\\partial_I f\\, p_v)}\\, \\vec{s}^\\top dt$$\n$$\\nabla_U C = \\int \\overrightarrow{(\\partial_I f\\, p_v)}\\, \\vec{i}^\\top dt$$\n$$\\nabla_{I_o} C = \\int \\overrightarrow{(\\partial_I f\\, p_v)}\\, dt$$\n$$\\nabla_O C = \\int \\overrightarrow{\\partial_o l}\\, \\vec{s}^\\top dt$$\n\nwhere $(\\partial_o l)_k \\equiv \\partial l / \\partial o_k$. Note that the gradient calculation procedure involves multiplication between\nthe presynaptic input source and the postsynaptic adjoint state $p_v$, which is driven by the $g\\, \\dot{p}_s$ term:\ni.e. the product of postsynaptic spike activity and temporal difference of error. This is analogous to\nreward-modulated spike-time dependent plasticity (STDP) [24].\n\n3 Results\n\nWe demonstrate our method by training spiking networks on dynamic tasks that require information\nprocessing over time. Tasks are defined by the relationship between time-varying input-output signals,\nwhich are used as training examples. 
We draw mini-batches of \u2248 50 training examples from the\nsignal distribution, calculate the gradient of the average total cost, and use stochastic gradient descent\n[34] for optimization.\nHere, we use a cost function l that penalizes the readout error and the overall synaptic activity:\n\n$$l = \\frac{\\|\\vec{o} - \\vec{o}_d\\|^2 + \\lambda \\|\\vec{s}\\|^2}{2},$$\n\nwhere $\\vec{o}_d(t)$ is the desired output, and $\\lambda$ is a regularization parameter.\n\n3.1 Predictive Coding Task\n\nWe first consider predictive coding tasks [35, 36], which optimize spike-based representations to\naccurately reproduce the input-output behavior of a linear dynamical system of full-rank input and\noutput matrices. Analytical solutions for this class of problems can be obtained in the form of\nnon-leaky integrate and fire (NIF) neural networks, although an insignificant amount of leak current is\noften added [36]. The solutions also require the networks to be equipped with a set of instantaneous\nsynapses for fast time-scale interactions between neurons, as well as slower synapses for readout.\nDespite its simplicity, the predictive coding framework reproduces important features of biological\nneural networks, such as the balance of excitatory and inhibitory inputs and efficient coding [35].\nAlso, its analytical solutions provide a great benchmark for assessing the effectiveness of our learning\nmethod.\nThe membrane voltage dynamics of a NIF neuron is given by\n\n$$f(v, I) = I.$$\n\nHere, we impose two thresholds at $v_{\\theta+} = 1$ and $v_{\\theta-} = -1$, and the reset voltage at $v_{\\mathrm{reset}} = 0$, where\nthe $v_{\\theta-}$ threshold would trigger negative synaptic responses. This bi-threshold NIF model naturally\nfits with the inherent sign symmetry of the task, and also provides an easy solution to ensure that the\nmembrane voltage stays within a finite range. However, the training also works with the usual single\nthreshold model. 
We also introduce two different synaptic time constants, as proposed in [35, 36]: a\nfast constant $\\tau = 1$ ms for synapses for the recurrent connections, and a slow constant $\\tau_s = 10$ ms\nfor readout.\n\nFigure 3: Balanced dynamics of a spiking network trained for auto-encoding task. (A) Readout\nsignals: actual (solid), and desired (dashed). (B) Input current components into a single neuron:\nexternal input current ($U \\vec{i}(t)$, blue), and fast recurrent synaptic current ($W_f \\vec{s}_f(t)$, red). (C) Total\ninput current into a single neuron ($U \\vec{i}(t) + W_f \\vec{s}_f(t)$). (D) Single neuron membrane voltage traces:\nthe actual voltage trace driven by both external input and fast recurrent synaptic current (solid, 6\nspikes), and a virtual trace driven by external input only (dashed, 29 spikes). (E) Fast recurrent\nweight: trained ($W_f$, above) and predicted ($-UO$, below). Diagonal elements are set to zero to avoid\nself-excitation/inhibition. (F) Readout weight O vs input weight U.\n\nIn the predictive-coding task, the desired output signal is the low-pass filtered version of the input\nsignal:\n\n$$\\tau_s \\dot{\\vec{o}}_d = -\\vec{o}_d + \\vec{i},$$\n\nwhere $\\tau_s$ is the slow synaptic time constant [35, 36]. The goal is to accurately represent the analog\nsignals using the least number of spikes. We used a network of 30 NIF neurons, 2 input and 2 output\nchannels. Randomly generated sum-of-sinusoid signals with period 1200 ms were used as the input.\nThe output of the trained network accurately tracks the desired output (Figure 3A). 
Analysis of the\nsimulation reveals that the network operates in a tightly balanced regime: The fast recurrent synaptic\ninput, $W \\vec{s}(t)$, provides opposing current that mostly cancels the input current from the external signal,\n$U \\vec{i}(t)$, such that the neuron generates a greatly reduced number of spike outputs (Figure 3B,C,D).\nThe network structure also shows close agreement to the prediction. The optimal input weight matrix\nis equal to the transpose of the readout matrix (up to a scale factor), $U \\propto O^\\top$, and the optimal fast\nrecurrent weight is approximately the product of the input and readout weights, $W \\approx -UO$, which\nare in close agreement with [35, 36, 37]. Such network structures have been shown to maintain\ntight input balance and remove redundant spikes to encode the signals in the most efficient manner:\nThe representation error scales as $1/K$, where K is the number of involved spikes, compared to the\n$1/\\sqrt{K}$ error of encoding with independent Poisson spikes.\n\n[Figure 3, panels A-D: readout, input current (components, ms^-1), input current (total, ms^-1), and membrane voltage versus time (ms); panels E-F: weight plots.]\n\nFigure 4: Delayed-memory XOR task: Each panel shows the single-trial input, go-cue, output traces,\nand spike raster of an optimized QIF neural network. The y-axis of the raster plot is the neuron\nID. Note the similarity of the initial portion of spike patterns for trials of the same first input pulses\n(A,B,C vs D,E,F). 
In contrast, the spike patterns after the go-cue signal are similar for trials of the\nsame desired output pulses: (A,D: negative output), (B,E: positive output), and (C,F: null output).\n\n3.2 Delayed-memory XOR task\n\nA major challenge for spike-based computation is in bridging the wide divergence between the time-\nscales of spikes and behavior: How do millisecond spikes perform behaviorally relevant computations\non the order of seconds?\nHere, we consider a delayed-memory XOR task, which performs the exclusive-or (XOR) operation\non the input history stored over extended duration. Speci\ufb01cally, the network receives binary pulse\nsignals, + or \u2212, through an input channel and a go-cue through another channel. If the network\nreceives two input pulses since the last go-cue signal, it should generate the XOR output pulse on the\nnext go-cue: i.e. a positive output pulse if the input pulses are of opposite signs (+\u2212 or \u2212+), and\na negative output pulse if the input pulses are of equal signs (++ or \u2212\u2212). Additionally, it should\ngenerate a null output if only one input pulse is received since the last go-cue signal. Variable time\ndelays are introduced between the input pulses and the go-cues.\nA simpler version of the task was proposed in [26], whose solution involved \ufb01rst training an analog,\nrate-based ANN model and converting the trained ANN dynamics with a larger network of spik-\ning neurons (\u2248 3000), using the results from predictive coding [35]. 
It also required a dendritic\nnonlinearity function to match the transfer function of rate neurons.\nWe trained a network of 80 quadratic integrate and fire (QIF) neurons (footnote 2), whose dynamics is\n\n$$f(v, I) = (1 + \\cos(2\\pi v))/\\tau_v + (1 - \\cos(2\\pi v))\\, I,$$\n\nalso known as the Theta neuron model [38], with the threshold and the reset voltage at $v_\\theta = 1$, $v_{\\mathrm{reset}} = 0$.\nTime constants of $\\tau_v = 25$, $\\tau_f = 5$, and $\\tau = 20$ ms were used, whereas the time-scale of the task\nwas \u2248 500 ms, much longer than the time constants. The intrinsic nonlinearity of the QIF spiking\ndynamics proves to be sufficient for solving this task without requiring extra dendritic nonlinearity.\nThe trained network successfully solves the delayed-memory XOR task (Figure 4): The spike patterns\nexhibit time-varying, but sustained activities that maintain the input history, generate the correct\noutputs when triggered by the go-cue signal, and then return to the background activity. More analysis\nis needed to understand the exact underlying computational mechanism.\nThis result shows that our algorithm can indeed optimize spiking networks to perform nonlinear\ncomputations over extended time.\n\n[Figure 4, panels A-F: input, go-cue, output traces and spike rasters versus time (ms), 0 to 600 ms.]\n\nFootnote 2: NIF networks fail to learn the delayed-memory XOR task: the memory requirement for past input history\ndrives the training toward strong recurrent connections and runaway excitation.\n\n4 Discussion\n\nWe have presented a novel, differentiable formulation of spiking neural networks and derived the\ngradient calculation for supervised learning. 
Unlike previous learning methods, our method optimizes\nthe spiking network dynamics for general supervised tasks on the time scale of individual spikes as\nwell as the behavioral time scales.\nExact gradient-based learning methods, such as ours, may depart from the known biological learning\nmechanisms. Nonetheless, these methods provide a solid theoretical foundation for understanding the\nprinciples underlying biological learning rules. For example, our result shows that the gradient update\noccurs in a sparsely compressed manner near spike times, similar to reward-modulated STDP, which\ndepends only on a narrow 20 ms window around the postsynaptic spike. Further analysis may reveal\nthat certain aspects of the gradient calculation can be approximated in a biologically plausible manner\nwithout significantly compromising the efficiency of optimization. For example, it was recently\nshown that the biologically implausible aspects of the backpropagation method can be resolved through\nfeedback alignment in rate-based multilayer feedforward networks [39]. Such approximations could\nalso apply to spiking neural networks.\nHere, we coupled the synaptic current model with differentiable single-state spiking neuron models.\nWe want to emphasize that the synapse model can be coupled to any neuron model, including\nbiologically realistic multi-state neuron models with action potential dynamics (footnote 3), such as the\nHodgkin-Huxley model, the Morris-Lecar model and the FitzHugh-Nagumo model; and an even\nwider range of neuron models with internal adaptation variables and neuron models having\nnon-differentiable reset dynamics, such as the leaky integrate and fire model, the exponential integrate\nand fire model, and the Izhikevich model. This will be examined in future work.\n\nFootnote 3: Simple modification of the gate function would be required to prevent activation during the falling phase of\naction potential.\n\nReferences\n[1] Rufin VanRullen, Rudy Guyonneau, and Simon J Thorpe. Spike times make sense. Trends in neurosciences, 28(1):1\u20134, 2005.\n\n[2] Sander M Bohte, Joost N Kok, and Han La Poutre. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing, 48(1):17\u201337, 2002.\n\n[3] Jianguo Xin and Mark J Embrechts. Supervised learning with spiking neural networks. In Neural Networks, 2001. Proceedings. IJCNN\u201901. International Joint Conference on, volume 3, pages 1772\u20131777. IEEE, 2001.\n\n[4] Benjamin Schrauwen and Jan Van Campenhout. Extending spikeprop. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 1, pages 471\u2013475. IEEE, 2004.\n\n[5] Olaf Booij and Hieu tat Nguyen. A gradient descent rule for spiking neurons emitting multiple spikes. Information Processing Letters, 95(6):552\u2013558, 2005.\n\n[6] Peter Ti\u0148o and Ashely JS Mills. Learning beyond finite memory in recurrent networks of spiking neurons. Neural computation, 18(3):591\u2013613, 2006.\n\n[7] Robert G\u00fctig and Haim Sompolinsky. The tempotron: a neuron that learns spike timing\u2013based decisions. Nature neuroscience, 9(3):420\u2013428, 2006.\n\n[8] Robert Urbanczik and Walter Senn. A gradient learning rule for the tempotron. Neural computation, 21(2):340\u2013352, 2009.\n\n[9] Raoul-Martin Memmesheimer, Ran Rubin, Bence P \u00d6lveczky, and Haim Sompolinsky. Learning precisely timed spikes. Neuron, 82(4):925\u2013938, 2014.\n\n[10] Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in neuroscience, 10, 2016.\n\n[11] Robert G\u00fctig. Spiking neurons can discover predictive features by aggregate-label learning. Science, 351(6277):aab4113, 2016.\n\n[12] R\u0103zvan V Florian. 
The chronotron: a neuron that learns to \ufb01re temporally precise spike patterns. PloS one,\n\n7(8):e40233, 2012.\n\n[13] Jean-Pascal P\ufb01ster, Taro Toyoizumi, David Barber, and Wulfram Gerstner. Optimal spike-timing-dependent\nplasticity for precise action potential \ufb01ring in supervised learning. Neural computation, 18(6):1318\u20131348,\n2006.\n\n[14] Danilo J Rezende, Daan Wierstra, and Wulfram Gerstner. Variational learning for recurrent spiking\n\nnetworks. In Advances in Neural Information Processing Systems, pages 136\u2013144, 2011.\n\n[15] Johanni Brea, Walter Senn, and Jean-Pascal P\ufb01ster. Matching recall and storage in sequence learning with\n\nspiking neural networks. Journal of neuroscience, 33(23):9565\u20139575, 2013.\n\n[16] Brian Gardner, Ioana Sporea, and Andr\u00e9 Gr\u00fcning. Learning spatiotemporally encoded pattern transforma-\n\ntions in structured spiking neural networks. Neural computation, 27(12):2548\u20132586, 2015.\n\n[17] Brian Gardner and Andr\u00e9 Gr\u00fcning. Supervised learning in spiking neural networks for precise temporal\n\nencoding. PloS one, 11(8):e0161335, 2016.\n\n[18] Sam McKennoch, Thomas Voegtlin, and Linda Bushnell. Spike-timing error backpropagation in theta\n\nneuron networks. Neural computation, 21(1):9\u201345, 2009.\n\n[19] Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multi-layer spiking neural\n\nnetworks. arXiv preprint arXiv:1705.11146, 2017.\n\n[20] Filip Ponulak and Andrzej Kasi\u00b4nski. Supervised learning in spiking neural networks with resume: sequence\n\nlearning, classi\ufb01cation, and spike shifting. Neural Computation, 22(2):467\u2013510, 2010.\n\n[21] Ioana Sporea and Andr\u00e9 Gr\u00fcning. Supervised learning in multilayer spiking neural networks. Neural\n\ncomputation, 25(2):473\u2013509, 2013.\n\n[22] Eugene M Izhikevich. 
Solving the distal reward problem through linkage of stdp and dopamine signaling. Cerebral cortex, 17(10):2443\u20132452, 2007.\n\n[23] Robert Legenstein, Dejan Pecevski, and Wolfgang Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Comput Biol, 4(10):e1000180, 2008.\n\n[24] Nicolas Fr\u00e9maux and Wulfram Gerstner. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Frontiers in neural circuits, 9, 2015.\n\n[25] Eric Hunsberger and Chris Eliasmith. Spiking deep networks with lif neurons. arXiv preprint arXiv:1510.08829, 2015.\n\n[26] LF Abbott, Brian DePasquale, and Raoul-Martin Memmesheimer. Building functional networks of spiking model neurons. Nature neuroscience, 19(3):350\u2013355, 2016.\n\n[27] Peter O\u2019Connor, Daniel Neil, Shih-Chii Liu, Tobi Delbruck, and Michael Pfeiffer. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in neuroscience, 7:178, 2013.\n\n[28] Peter U Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In Neural Networks (IJCNN), 2015 International Joint Conference on, pages 1\u20138. IEEE, 2015.\n\n[29] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, and Michael Pfeiffer. Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv preprint arXiv:1612.04052, 2016.\n\n[30] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. arXiv preprint arXiv:1802.02627, 2018.\n\n[31] Peter O\u2019Connor and Max Welling. Deep spiking networks. arXiv preprint arXiv:1602.08323, 2016.\n\n[32] Guillaume Lajoie, Kevin K Lin, and Eric Shea-Brown. Chaos and reliability in balanced spiking networks with temporal drive. 
Physical Review E, 87(5):052901, 2013.\n\n[33] Lev Semenovich Pontryagin, EF Mishchenko, VG Boltyanskii, and RV Gamkrelidze. The mathematical theory of optimal processes. 1962.\n\n[34] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[35] Sophie Den\u00e8ve and Christian K Machens. Efficient codes and balanced networks. Nature neuroscience, 19(3):375\u2013382, 2016.\n\n[36] Martin Boerlin, Christian K Machens, and Sophie Den\u00e8ve. Predictive coding of dynamical variables in balanced spiking networks. PLoS Comput Biol, 9(11):e1003258, 2013.\n\n[37] Wieland Brendel, Ralph Bourdoukan, Pietro Vertechi, Christian K Machens, and Sophie Den\u00e8ve. Learning to represent signals spike by spike. arXiv preprint arXiv:1703.03777, 2017.\n\n[38] Bard Ermentrout. Ermentrout-Kopell canonical model. Scholarpedia, 3(3):1398, 2008.\n\n[39] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.\n\nSupplementary Materials: Gradient calculation for the spiking neural network\n\nPontryagin\u2019s minimum principle: According to [33], the Hamiltonian for the dynamics of Eqs. (2)\u2013(4) is\n\n$$H = \\sum_i \\left( \\bar{p}_{v_i} \\dot{v}_i + \\bar{p}_{s_i} \\dot{s}_i \\right) + l(\\vec{s}) = \\sum_i \\left[ \\left( \\bar{p}_{v_i} + g_i \\bar{p}_{s_i}/\\tau \\right) f_i - \\bar{p}_{s_i} s_i/\\tau \\right] + l(\\vec{s}),$$\n\nwhere $\\bar{p}_{v_i}$ and $\\bar{p}_{s_i}$ are the adjoint state variables for the membrane voltage $v_i$ and the synaptic current $s_i$ of neuron $i$, respectively, and $l(\\vec{s})$ is the cost function. 
The back-propagating dynamics of the adjoint state variables are:\n\n$$-\\dot{\\bar{p}}_{v_i} = \\frac{\\partial H}{\\partial v_i} = \\left( \\bar{p}_{v_i} + g_i \\bar{p}_{s_i}/\\tau \\right) \\partial_v f_i + f_i g'_i \\bar{p}_{s_i}/\\tau$$\n\n$$-\\dot{\\bar{p}}_{s_i} = \\frac{\\partial H}{\\partial s_i} = \\sum_j \\left( \\bar{p}_{v_j} + g_j \\bar{p}_{s_j}/\\tau \\right) \\cdot \\partial_I f_j W_{ji} - \\bar{p}_{s_i}/\\tau + l_{s_i}$$\n\nwhere $f_i \\equiv f(v_i, I_i)$, $g_i \\equiv g(v_i)$, $\\partial_v f_i \\equiv \\partial f/\\partial v_i$, $\\partial_I f_i \\equiv \\partial f/\\partial I_i$, $g'_i \\equiv dg/dv_i$, and $l_{s_i} \\equiv \\partial l/\\partial s_i$.\n\nThis formulation can be simplified via the change of variables $p_{v_i} \\equiv \\bar{p}_{v_i} + g_i \\bar{p}_{s_i}/\\tau$, $p_{s_i} \\equiv \\bar{p}_{s_i}/\\tau$, which yields\n\n$$H = \\vec{p}_v \\cdot \\vec{f} - \\vec{p}_s \\cdot \\vec{s} + l$$\n\n$$-\\dot{p}_{v_i} = \\partial_v f_i \\, p_{v_i} - g_i \\dot{p}_{s_i}$$\n\n$$-\\tau \\dot{p}_{s_i} = -p_{s_i} + l_{s_i} + \\sum_j W_{ji} \\, \\partial_I f_j \\, p_{v_j},$$\n\nwhere we used $\\dot{p}_{v_i} = \\dot{\\bar{p}}_{v_i} + f_i g'_i \\bar{p}_{s_i}/\\tau + g_i \\dot{\\bar{p}}_{s_i}/\\tau$.\n\nThe gradient of the total cost can be obtained by integrating the partial derivative of the Hamiltonian with respect to the parameter (e.g. $\\partial H/\\partial W_{ij}$, $\\partial H/\\partial U_{ij}$, $\\partial H/\\partial I_{o_i}$, $\\partial H/\\partial O_{ij}$).", "award": [], "sourceid": 746, "authors": [{"given_name": "Dongsung", "family_name": "Huh", "institution": "MIT-IBM AI Center"}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": "Salk Institute"}]}