{"title": "Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 3882, "page_last": 3890, "abstract": "Recurrent Neural Networks (RNNs) have become the state-of-the-art choice for extracting patterns from temporal sequences. Current RNN models are ill suited to process irregularly sampled data triggered by events generated in continuous time by sensors or other neurons. Such data can occur, for example, when the input comes from novel event-driven artificial sensors which generate sparse, asynchronous streams of events or from multiple conventional sensors with different update intervals. In this work, we introduce the Phased LSTM model, which extends the LSTM unit by adding a new time gate. This gate is controlled by a parametrized oscillation with a frequency range which require updates of the memory cell only during a small percentage of the cycle. Even with the sparse updates imposed by the oscillation, the Phased LSTM network achieves faster convergence than regular LSTMs on tasks which require learning of long sequences.   The model naturally integrates inputs from sensors of arbitrary sampling rates, thereby opening new areas of investigation for processing asynchronous sensory events that carry timing information.  It also greatly improves the performance of LSTMs in standard RNN applications, and does so with an order-of-magnitude fewer computes.", "full_text": "Phased LSTM: Accelerating Recurrent Network\n\nTraining for Long or Event-based Sequences\n\nDaniel Neil, Michael Pfeiffer, and Shih-Chii Liu\n\nInstitute of Neuroinformatics\n\nUniversity of Zurich and ETH Zurich\n\nZurich, Switzerland 8057\n\n{dneil, pfeiffer, shih}@ini.uzh.ch\n\nAbstract\n\nRecurrent Neural Networks (RNNs) have become the state-of-the-art choice for\nextracting patterns from temporal sequences. 
However, current RNN models are\nill-suited to process irregularly sampled data triggered by events generated in\ncontinuous time by sensors or other neurons. Such data can occur, for example,\nwhen the input comes from novel event-driven arti\ufb01cial sensors that generate\nsparse, asynchronous streams of events or from multiple conventional sensors with\ndifferent update intervals. In this work, we introduce the Phased LSTM model,\nwhich extends the LSTM unit by adding a new time gate. This gate is controlled\nby a parametrized oscillation with a frequency range that produces updates of the\nmemory cell only during a small percentage of the cycle. Even with the sparse\nupdates imposed by the oscillation, the Phased LSTM network achieves faster\nconvergence than regular LSTMs on tasks which require learning of long sequences.\nThe model naturally integrates inputs from sensors of arbitrary sampling rates,\nthereby opening new areas of investigation for processing asynchronous sensory\nevents that carry timing information. It also greatly improves the performance of\nLSTMs in standard RNN applications, and does so with an order-of-magnitude\nfewer computes at runtime.\n\n1\n\nIntroduction\n\nInterest in recurrent neural networks (RNNs) has greatly increased in recent years, since larger\ntraining databases, more powerful computing resources, and better training algorithms have enabled\nbreakthroughs in both processing and modeling of temporal sequences. Applications include speech\nrecognition [13], natural language processing [1, 20], and attention-based models for structured\nprediction [5, 29]. RNNs are attractive because they equip neural networks with memories, and\nthe introduction of gating units such as LSTM and GRU [16, 6] has greatly helped in making the\nlearning of these networks manageable. 
RNNs are typically modeled as discrete-time dynamical\nsystems, thereby implicitly assuming a constant sampling rate of input signals, which also becomes\nthe update frequency of recurrent and feed-forward units. Although early work such as [25, 10, 4]\nhas realized the resulting limitations and suggested continuous-time dynamical systems approaches\ntowards RNNs, the great majority of modern RNN implementations uses \ufb01xed time steps.\nAlthough \ufb01xed time steps are perfectly suitable for many RNN applications, there are several\nimportant scenarios in which constant update rates impose constraints that affect the precision and\nef\ufb01ciency of RNNs. Many real-world tasks for autonomous vehicles or robots need to integrate input\nfrom a variety of sensors, e.g. for vision, audition, distance measurements, or gyroscopes. Each sensor\nmay have its own data sampling rate, and short time steps are necessary to deal with sensors with\nhigh sampling frequencies. However, this leads to an unnecessarily higher computational load and\npower consumption so that all units in the network can be updated with one time step.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f(a)\n\n(b)\n\nFigure 1: Model architecture. (a) Standard LSTM model. (b) Phased LSTM model, with time gate kt\ncontrolled by timestamp t. In the Phased LSTM formulation, the cell value ct and the hidden output\nht can only be updated during an \u201copen\u201d phase; otherwise, the previous values are maintained.\n\nAn interesting\nnew application area is processing of event-based sensors, which are data-driven, and record stimulus\nchanges in the world with short latencies and accurate timing. Processing the asynchronous outputs of\nsuch sensors with time-stepped models would require high update frequencies, thereby counteracting\nthe potential power savings of event-based sensors. 
And \ufb01nally there is an interest coming from\ncomputational neuroscience, since brains can be viewed loosely as very large RNNs. However,\nbiological neurons communicate with spikes, and therefore perform asynchronous, event-triggered\nupdates in continuous time. This work presents a novel RNN model which can process inputs sampled\nat asynchronous times and is described further in the following sections.\n\n2 Model Description\n\nLong short-term memory (LSTM) units [16] (Fig. 1(a)) are an important ingredient for modern deep\nRNN architectures. We \ufb01rst de\ufb01ne their update equations in the commonly-used version from [12]:\n\n$i_t = \sigma_i(x_t W_{xi} + h_{t-1} W_{hi} + w_{ci} \odot c_{t-1} + b_i)$ (1)\n$f_t = \sigma_f(x_t W_{xf} + h_{t-1} W_{hf} + w_{cf} \odot c_{t-1} + b_f)$ (2)\n$c_t = f_t \odot c_{t-1} + i_t \odot \sigma_c(x_t W_{xc} + h_{t-1} W_{hc} + b_c)$ (3)\n$o_t = \sigma_o(x_t W_{xo} + h_{t-1} W_{ho} + w_{co} \odot c_t + b_o)$ (4)\n$h_t = o_t \odot \sigma_h(c_t)$ (5)\n\nThe main difference to classical RNNs is the use of the gating functions it, ft, ot, which represent\nthe input, forget, and output gate at time t respectively. ct is the cell activation vector, whereas xt\nand ht represent the input feature vector and the hidden output vector respectively. The gates use the\ntypical sigmoidal nonlinearities \u03c3i, \u03c3f , \u03c3o and tanh nonlinearities \u03c3c, and \u03c3h with weight parameters\nWhi, Whf , Who, Wxi, Wxf , and Wxo, which connect the different inputs and gates with the memory\ncells and outputs, as well as biases bi, bf , and bo. The cell state ct itself is updated with a fraction of\nthe previous cell state that is controlled by ft, and a new input state created from the element-wise\n(Hadamard) product, denoted by $\odot$, of it and the output of the cell state nonlinearity \u03c3c. 
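For concreteness, a single LSTM update following Eqs. (1)-(5) can be sketched in NumPy. The dictionary-based parameter layout and all names below are our own illustration, not the paper's Theano/Lasagne implementation; the peephole vectors act element-wise, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, w_peep, b):
    """One LSTM update (Eqs. 1-5). W maps gate name -> (W_x, W_h) pair,
    w_peep holds the element-wise peephole vectors, b holds the biases."""
    i_t = sigmoid(x_t @ W["i"][0] + h_prev @ W["i"][1] + w_peep["i"] * c_prev + b["i"])
    f_t = sigmoid(x_t @ W["f"][0] + h_prev @ W["f"][1] + w_peep["f"] * c_prev + b["f"])
    c_t = f_t * c_prev + i_t * np.tanh(x_t @ W["c"][0] + h_prev @ W["c"][1] + b["c"])
    o_t = sigmoid(x_t @ W["o"][0] + h_prev @ W["o"][1] + w_peep["o"] * c_t + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Iterating this step once per fixed time step gives the discrete-time behaviour that the Phased LSTM later relaxes.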
Optional\npeephole [11] connection weights wci, wcf , wco further in\ufb02uence the operation of the input, forget,\nand output gates.\nThe Phased LSTM model extends the LSTM model by adding a new time gate, kt (Fig. 1(b)). The\nopening and closing of this gate is controlled by an independent rhythmic oscillation speci\ufb01ed by\nthree parameters; updates to the cell state ct and ht are permitted only when the gate is open. The\n\ufb01rst parameter, \u03c4, controls the real-time period of the oscillation. The second, ron, controls the ratio\nof the duration of the \u201copen\u201d phase to the full period. The third, s, controls the phase shift of the\noscillation to each Phased LSTM cell. All parameters can be learned during the training process.\nThough other variants are possible, we propose here a particularly successful linearized formulation\n\n(a)\n\n(b)\n\nFigure 2: Diagram of Phased LSTM behaviour. (a) Top: The rhythmic oscillations to the time gates of\n3 different neurons; the period \u03c4 and the phase shift s are shown for the lowest neuron. The parameter\nron is the ratio of the open period to the total period \u03c4. Bottom: Note that in a multilayer scenario,\nthe timestamp is distributed to all layers which are updated at the same time point. (b) Illustration of\nPhased LSTM operation. A simple linearly increasing function is used as an input. The time gate\nkt of each neuron has a different \u03c4, identical phase shift s, and an open ratio ron of 0.05. 
Note that\nthe input (top panel) \ufb02ows through the time gate kt (middle panel) to be held as the new cell state ct\n(bottom panel) only when kt is open.\n\nof the time gate, with analogy to the recti\ufb01ed linear unit that propagates gradients well:\n\n$\phi_t = \frac{(t - s) \bmod \tau}{\tau}, \qquad k_t = \begin{cases} \frac{2\phi_t}{r_{on}}, & \text{if } \phi_t < \frac{1}{2} r_{on} \\ 2 - \frac{2\phi_t}{r_{on}}, & \text{if } \frac{1}{2} r_{on} < \phi_t < r_{on} \\ \alpha \phi_t, & \text{otherwise} \end{cases}$ (6)\n\n\u03c6t is an auxiliary variable, which represents the phase inside the rhythmic cycle. The gate kt has three\nphases (see Fig. 2a): in the \ufb01rst two phases, the \"openness\" of the gate rises from 0 to 1 (\ufb01rst phase)\nand drops from 1 to 0 (second phase). During the third phase, the gate is closed and the previous cell\nstate is maintained. The leak with rate \u03b1 is active in the closed phase, and plays a similar role as the\nleak in a parametric \u201cleaky\u201d recti\ufb01ed linear unit [15] by propagating important gradient information\neven when the gate is closed. Note that the linear slopes of kt during the open phases of the time gate\nallow effective transmission of error gradients.\nIn contrast to traditional RNNs, and even sparser variants of RNNs [19], updates in Phased LSTM\ncan optionally be performed at irregularly sampled time points tj. This allows the RNNs to work with\nevent-driven, asynchronously sampled input data. We use the shorthand notation cj = ctj for cell\nstates at time tj (analogously for other gates and units), and let cj\u22121 denote the state at the previous\nupdate time tj\u22121. We can then rewrite the regular LSTM cell update equations for cj and hj (from Eq. 3 and Eq. 
5), using proposed cell updates $\tilde{c}_j$ and $\tilde{h}_j$ mediated by the time gate kj:\n\n$\tilde{c}_j = f_j \odot c_{j-1} + i_j \odot \sigma_c(x_j W_{xc} + h_{j-1} W_{hc} + b_c)$ (7)\n$c_j = k_j \odot \tilde{c}_j + (1 - k_j) \odot c_{j-1}$ (8)\n$\tilde{h}_j = o_j \odot \sigma_h(\tilde{c}_j)$ (9)\n$h_j = k_j \odot \tilde{h}_j + (1 - k_j) \odot h_{j-1}$ (10)\n\nA schematic of Phased LSTM with its parameters can be found in Fig. 2a, accompanied by an\nillustration of the relationship between the time, the input, the time gate kt, and the state ct in Fig. 2b.\nOne key advantage of this Phased LSTM formulation lies in the rate of memory decay. For the simple\ntask of keeping an initial memory state c0 as long as possible without receiving additional inputs (i.e.\nij = 0 at all time steps tj), a standard LSTM with a nearly fully-opened forget gate (i.e. $f_j = 1 - \epsilon$)\nafter n update steps would contain\n\n$c_n = f_n \odot c_{n-1} = (1 - \epsilon) \odot (f_{n-1} \odot c_{n-2}) = \ldots = (1 - \epsilon)^n \odot c_0$. (11)\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 3: Frequency discrimination task. The network is trained to discriminate waves of different\nfrequency sets (shown in blue and gray); every circle is an input point. (a) Standard condition: the\ndata is regularly sampled every 1 ms. (b) High resolution sampling condition: new input points\nare gathered every 0.1ms. (c) Asynchronous sampling condition: new input points are presented at\nintervals of 0.02 ms to 10 ms. (d) The accuracy of Phased LSTM under the three sampling conditions\nis maintained, but the accuracy of the BN-LSTM and standard LSTM drops signi\ufb01cantly in the\nsampling conditions (b) and (c). 
Error bars indicate standard deviation over 5 runs.\n\nThis means the memory for 0 < \u03b5 < 1 decays exponentially with every time step. Conversely, the Phased\nLSTM state only decays during the open periods of the time gate, but maintains a perfect memory\nduring its closed phase, i.e. cj = cj\u2212\u2206 if kt = 0 for tj\u2212\u2206 \u2264 t \u2264 tj. Thus, during a single oscillation\nperiod of length \u03c4, the units only update during a duration of ron \u00b7 \u03c4, which will result in substantially\nfewer than n update steps. Because of this cyclic memory, Phased LSTM can have much longer and\nadjustable memory length via the parameter \u03c4.\nThe oscillations impose sparse updates of the units, therefore substantially decreasing the total number\nof updates during network operation. During training, this sparseness ensures that the gradient is\nrequired to backpropagate through fewer updating timesteps, allowing an undecayed gradient to be\nbackpropagated through time and enabling faster learning convergence. Similar to the shielding of\nthe cell state ct (and its gradient) by the input gates and forget gates of the LSTM, the time gate\nprevents external inputs and time steps from dispersing and mixing the gradient of the cell state.\n\n3 Results\n\nIn the following sections, we investigate the advantages of the Phased LSTM model in a variety\nof scenarios that require either precise timing of updates or learning from a long sequence. For all\nthe results presented here, the networks were trained with Adam [18] set to default learning rate\nparameters, using Theano [2] with Lasagne [9]. Unless otherwise speci\ufb01ed, the leak rate was set to\n\u03b1 = 0.001 during training and \u03b1 = 0 during test. The phase shift, s, for each neuron was uniformly\nchosen from the interval [0, \u03c4 ]. 
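The time gate of Eq. (6), together with the blended state update of Eqs. (8) and (10), can be written down almost verbatim in NumPy. This is a sketch with our own variable names, not the paper's implementation; tau and s may be per-neuron vectors, in which case the gate broadcasts element-wise.

```python
import numpy as np

def time_gate(t, tau, s, r_on, alpha):
    """Openness k_t of the time gate (Eq. 6) at timestamp t."""
    phi = np.mod(t - s, tau) / tau                       # phase in [0, 1)
    k = np.where(phi < 0.5 * r_on, 2.0 * phi / r_on,     # rising half of open phase
        np.where(phi < r_on, 2.0 - 2.0 * phi / r_on,     # falling half of open phase
                 alpha * phi))                           # closed phase: small leak
    return k

def phased_update(k, new_state, prev_state):
    """Blend proposed and previous state (Eqs. 8 and 10)."""
    return k * new_state + (1.0 - k) * prev_state
```

Because `phased_update` returns the previous state unchanged whenever k = 0, a closed gate preserves memory exactly, which is the mechanism behind the decay argument above.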
The parameters \u03c4 and s were learned during training, while the open\nratio ron was \ufb01xed at 0.05 and not adjusted during training, except in the \ufb01rst task to demonstrate\nthat the model can train successfully while learning all parameters.\n\n3.1 Frequency Discrimination Task\n\nIn this \ufb01rst experiment, the network is trained to distinguish two classes of sine waves from different\nfrequency sets: those with a period in a target range T \u223c U(5, 6), and those outside the range, i.e.\nT \u223c {U(1, 5) \u222a U(6, 100)}, using U(a, b) for the uniform distribution on the interval (a, b). This\ntask illustrates the advantages of Phased LSTM, since it involves a periodic stimulus and requires\n\ufb01ne timing discrimination. The inputs are presented as pairs \u27e8y, t\u27e9, where y is the amplitude and t\nthe timestamp of the sample from the input sine wave.\nFigure 3 illustrates the task: the blue curves must be separated from the lighter curves based on\nthe samples shown as circles. We evaluate three conditions for sampling the input signals: In the\nstandard condition (Fig. 3a), the sine waves are regularly sampled every 1 ms; in the oversampled\ncondition (Fig. 3b), the sine waves are regularly sampled every 0.1 ms, resulting in ten times\nas many data points.\n\n(a)\n\n(b)\n\nFigure 4: (a) Accuracy during training for the superimposed frequencies task. The Phased LSTM\noutperforms both LSTM and BN-LSTM while exhibiting lower variance. Shading shows maximum\nand minimum over 5 runs, while dark lines indicate the mean. (b) Mean-squared error over training\non the addition task, with an input length of 500. Note that longer periods accelerate learning\nconvergence.\n\n
Finally, in the asynchronously sampled condition (Fig. 3c), samples are\ncollected at asynchronous times over the duration of the input. Additionally, the sine waves have\na uniformly drawn random phase shift from all possible shifts, random numbers of samples drawn\nfrom U(15, 125), a random duration drawn from U(15, 125), and a start time drawn from U(0, 125\u2212\nduration). The number of samples in the asynchronous and standard sampling condition is equal.\nThe classes were approximately balanced, yielding a 50% chance success rate.\nSingle-layer RNNs are trained on this data, each repeated with \ufb01ve random initial seeds. We compare\nour Phased LSTM con\ufb01guration to regular LSTM, and batch-normalized (BN) LSTM which has\nfound success in certain applications [14]. For the regular LSTM and the BN-LSTM, the timestamp\nis used as an additional input feature dimension; for the Phased LSTM, the time input controls\nthe time gates kt. The architecture consists of 2-110-2 neurons for the LSTM and BN-LSTM, and\n1-110-2 for the Phased LSTM. The oscillation periods of the Phased LSTMs are drawn uniformly in\nthe exponential space to give a wide variety of applicable frequencies, i.e., \u03c4 \u223c exp(U(0, 3)). All\nother parameters match between models where applicable. The default LSTM parameters are given\nin the Lasagne Theano implementation, and were kept for LSTM, BN-LSTM, and Phased LSTM.\nAppropriate gate biasing was investigated but did not resolve the discrepancies between the models.\nAll three networks excel under standard sampling conditions as expected, as seen in Fig. 3d (left).\nHowever, for the same number of epochs, increasing the data sampling by a factor of ten has\ndevastating effects for both LSTM and BN-LSTM, dropping their accuracy down to near chance\n(Fig. 3d, middle). Presumably, if given enough training iterations, their accuracies would return to\nthe normal baseline. 
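For reference, the asynchronously sampled inputs described above could be generated roughly as follows. The distribution parameters come from the text; the function name, the sorting of uniform timestamps, and the 50/50 split between the two off-target period ranges are our own assumptions.

```python
import numpy as np

def async_sine_sample(rng, target=True):
    """One asynchronously sampled example for the frequency discrimination
    task: returns (amplitudes y, timestamps t, binary label)."""
    if target:                                   # period inside the target range
        period = rng.uniform(5.0, 6.0)
    else:                                        # period outside the target range
        period = rng.uniform(1.0, 5.0) if rng.random() < 0.5 else rng.uniform(6.0, 100.0)
    phase = rng.uniform(0.0, 2.0 * np.pi)        # random phase shift
    n = rng.integers(15, 126)                    # number of samples ~ U(15, 125)
    duration = rng.uniform(15.0, 125.0)          # duration ~ U(15, 125)
    start = rng.uniform(0.0, 125.0 - duration)   # start ~ U(0, 125 - duration)
    t = np.sort(rng.uniform(start, start + duration, size=n))  # asynchronous times
    y = np.sin(phase + 2.0 * np.pi * t / period)
    return y, t, int(target)
```

Each pair (y_i, t_i) is then fed to the network as amplitude plus timestamp, exactly as in the task description.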
However, for the oversampled condition, Phased LSTM actually increases in\naccuracy, as it receives more information about the underlying waveform. Finally, if the updates are\nnot evenly spaced and are instead sampled at asynchronous times, even when controlled to have the\nsame number of points as the standard sampling condition, it appears to make the problem rather\nchallenging for traditional state-of-the-art models (Fig. 3d, right). However, the Phased LSTM has\nno dif\ufb01culty with the asynchronously sampled data, because the time gates kt do not need regular\nupdates and can be correctly sampled at any continuous time within the period.\nWe extend the previous task by training the same RNN architectures on signals composed of two\nsine waves. The goal is to distinguish signals composed of sine waves with periods T1 \u223c U(5, 6)\nand T2 \u223c U(13, 15), each with independent phase, from signals composed of sine waves with\nperiods T1 \u223c {U(1, 5) \u222a U(6, 100)} and T2 \u223c {U(1, 13) \u222a U(15, 100)}, again with independent\nphase. Despite being signi\ufb01cantly more challenging, Fig. 4a demonstrates how quickly the Phased\nLSTM converges to the correct solution compared to the standard approaches, using exactly the same\nparameters. Additionally, the Phased LSTM appears to exhibit very low variance during training.\n\n(a)\n\n(b)\n\n(c)\n\nFigure 5: N-MNIST experiment. (a) Sketch of digit movement seen by the image sensor. (b)\nFrame-based representation of an \u20188\u2019 digit from the N-MNIST dataset [24] obtained by integrating all\ninput spikes for each pixel. (c) Spatio-temporal representation of the digit, presented in three saccades\nas in (a). 
Note that this representation shows the digit more clearly than the blurred frame-based one.\n\n3.2 Adding Task\n\nTo investigate how introducing time gates helps learning when long memory is required, we revisit\nan original LSTM task called the adding task [16]. In this task, a sequence of random numbers\nis presented along with an indicator input stream. When there is a 0 in the indicator input stream,\nthe presented value should be ignored; a 1 indicates that the value should be added. At the end of\npresentation the network produces a sum of all indicated values. Unlike the previous tasks, there is no\ninherent periodicity in the input, and it is one of the original tasks that LSTM was designed to solve\nwell. This would seem to work against the advantages of Phased LSTM, but using a longer period for\nthe time gate kt could allow more effective training as a unit opens only for a few timesteps during\ntraining.\nIn this task, a sequence of numbers (of length 490 to 510) was drawn from U(\u22120.5, 0.5). Two\nnumbers in this stream of numbers are marked for addition: one from the \ufb01rst 10% of numbers\n(drawn with uniform probability) and one in the last half (drawn with uniform probability), producing\na model of a long and noisy stream of data with only a few signi\ufb01cant points. Importantly, this should\nchallenge the Phased LSTM model because there is no inherent periodicity and every timestep could\ncontain the important marked points.\nThe same network architecture is used as before. The period \u03c4 was drawn uniformly in the exponential domain, comparing four sampling intervals exp(U(0, 2)), exp(U(2, 4)), exp(U(4, 6)), and\nexp(U(6, 8)). Note that despite different \u03c4 values, the total number of LSTM updates remains approximately the same, since the overall sparseness is set by ron. 
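The adding-task inputs described above can be sketched as follows; the function name and the two-column value/indicator layout are our own choices.

```python
import numpy as np

def adding_task_sample(rng):
    """A value stream with two marked positions: one in the first 10%,
    one in the last half; the target is the sum of the two marked values."""
    length = rng.integers(490, 511)              # sequence length in [490, 510]
    values = rng.uniform(-0.5, 0.5, size=length)
    markers = np.zeros(length)
    first = rng.integers(0, length // 10)        # marker in the first 10%
    second = rng.integers(length // 2, length)   # marker in the last half
    markers[first] = 1.0
    markers[second] = 1.0
    target = values[first] + values[second]
    # each timestep carries (value, indicator)
    return np.stack([values, markers], axis=1), target
```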
However, a longer period \u03c4 provides\na longer jump through the past timesteps for the gradient during backpropagation-through-time.\nMoreover, we investigate whether the model can learn longer sequences more effectively when longer\nperiods are used. By varying the period \u03c4, the results in Fig. 4b show that a longer \u03c4 allows the\nnetwork to learn much longer sequences faster.\n\n3.3 N-MNIST Event-Based Visual Recognition\n\nTo test performance on real-world asynchronously sampled data, we make use of the publicly-available N-MNIST [24] dataset for neuromorphic vision. The recordings come from an event-based\nvision sensor that is sensitive to local temporal contrast changes [26]. An event is generated from\na pixel when its local contrast change exceeds a threshold. Every event is encoded as a 4-tuple\n\u27e8x, y, p, t\u27e9 with position x, y of the pixel, a polarity bit p (indicating a contrast increase or decrease),\nand a timestamp t indicating the time when the event is generated. The recordings consist of events\ngenerated by the vision sensor while the sensor undergoes three saccadic movements facing a static\ndigit from the MNIST dataset (Fig. 5a). An example of the event responses can be seen in Fig. 5(c).\nIn previous work using event-based input data [21, 23], the timing information was sometimes\nremoved and instead a frame-based representation was generated by computing the pixel-wise\nevent-rate over some time period (as shown in Fig. 5(b)). 
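A frame-based representation of this kind simply accumulates per-pixel event counts over a time window. A minimal sketch, assuming the 34x34 pixel resolution of N-MNIST; the function name and window defaults are ours:

```python
import numpy as np

def events_to_frame(events, width=34, height=34, t_start=0, t_end=100_000):
    """Pixel-wise event counts over a time window: the frame-based
    representation that discards precise event timing."""
    frame = np.zeros((height, width))
    for x, y, p, t in events:        # each event is <x, y, polarity, timestamp>
        if t_start <= t < t_end:
            frame[y, x] += 1
    return frame
```

The Phased LSTM experiments below skip this lossy step and consume the raw ⟨x, y, p, t⟩ stream directly.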
Table 1: Accuracy on N-MNIST\n\n                       CNN              BN-LSTM            Phased LSTM (\u03c4 = 100ms)\nAccuracy at Epoch 1    73.81% \u00b1 3.5     40.87% \u00b1 13.3      90.32% \u00b1 2.3\nTrain/test \u03c1 = 0.75    95.02% \u00b1 0.3     96.93% \u00b1 0.12      97.28% \u00b1 0.1\nTest with \u03c1 = 0.4      90.67% \u00b1 0.3     94.79% \u00b1 0.03      95.11% \u00b1 0.2\nTest with \u03c1 = 1.0      94.99% \u00b1 0.3     96.55% \u00b1 0.63      97.27% \u00b1 0.1\nLSTM Updates           \u2013                3153 per neuron    159 \u00b1 2.8 per neuron\n\nNote that the spatio-temporal surface of events in Fig. 5(c) reveals details of the digit much more clearly than in the blurred frame-based\nrepresentation. The Phased LSTM allows us to operate directly on such spatio-temporal event streams.\nTable 1 summarizes classi\ufb01cation results for three different network types: a CNN trained on frame-based representations of N-MNIST digits and two RNNs, a BN-LSTM and a Phased LSTM, trained\ndirectly on the event streams. Regular LSTM is not shown, as it was found to perform worse. The\nCNN was composed of three alternating layers of 8 kernels of 5x5 convolution with a leaky ReLU\nnonlinearity and 2x2 max-pooling, which were then fully-connected to 256 neurons, and \ufb01nally fully-connected to the 10 output classes. The event pixel address was used to produce a 40-dimensional\nembedding via a learned embedding matrix [9], and combined with the polarity to produce the input.\nTherefore, the network architecture was 41-110-10 for the Phased LSTM and 42-110-10 for the\nBN-LSTM, with the time given as an extra input dimension to the BN-LSTM.\nTable 1 shows that Phased LSTM trains faster than alternative models and achieves much higher\naccuracy with a lower variance even within the \ufb01rst epoch of training. We further de\ufb01ne a factor, \u03c1,\nwhich represents the probability that an event is included, i.e. 
\u03c1 = 1.0 means all events are included.\nThe RNN models are trained with \u03c1 = 0.75, and again the Phased LSTM achieves slightly higher\nperformance than the BN-LSTM model. When testing with \u03c1 = 0.4 (fewer events) and \u03c1 = 1.0 (more\nevents) without retraining, both RNN models perform well and greatly outperform the CNN. This is\nbecause the accumulated statistics of the frame-based input to the CNN change drastically when the\noverall spike rates are altered. The Phased LSTM RNNs seem to have learned a stable spatio-temporal\nsurface on the input and are only slightly altered by sampling it more or less frequently.\nFinally, as each neuron of the Phased LSTM only updates about 5% of the time, on average, 159\nupdates are needed in comparison to the 3153 updates needed per neuron of the BN-LSTM, leading\nto an approximate twenty-fold reduction in run time compute cost. It is also worth noting that these\nresults form a new state-of-the-art accuracy for this dataset [24, 7].\n\n3.4 Visual-Auditory Sensor Fusion for Lip Reading\n\nFinally, we demonstrate the use of Phased LSTM on a task involving sensors with different sampling\nrates. Few RNN models ever attempt to merge sensors of different input frequencies, although the\nsampling rates can vary substantially. For this task, we use the GRID dataset [8]. This corpus contains\nvideo and audio of 30 speakers each uttering 1000 sentences composed of a \ufb01xed grammar and a\nconstrained vocabulary of 51 words. The data was randomly divided into a 90%/10% train-test set.\nAn OpenCV [17] implementation of a face detector was used on the video stream to extract the face\nwhich was then resized to grayscale 48x48 pixels. The goal here is to obtain a model that can use\naudio alone, video alone, or both inputs to robustly classify the sentence. 
However, since the audio\nalone is suf\ufb01cient to achieve greater than 99% accuracy, sensor modalities were randomly masked to\nzero during training to encourage robustness towards sensory noise and loss.\nThe network architecture \ufb01rst separately processes video and audio data before merging them in\ntwo RNN layers that receive both modalities. The video stream uses three alternating layers of 16\nkernels of 5x5 convolution and 2x2 subsampling to reduce the input of 1x48x48 to 16x2x2, which is\nthen used as the input to 110 recurrent units. The audio stream connects the 39-dimensional MFCCs\n(13 MFCCs with \ufb01rst and second derivatives) to 150 recurrent units. Both streams converge into\nthe Merged-1 layer with 250 recurrent units, which is connected to a second hidden layer with 250\nrecurrent units named Merged-2. The output of the Merged-2 layer is fully-connected to 51 output\nnodes, which represent the vocabulary of GRID. For the Phased LSTM network, all recurrent units\nare Phased LSTM units.\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 6: Lip reading experiment. (a) Inputs and openness of time gates for the lip reading experiment.\nNote that the audio input frequency (100 Hz) is an integer multiple of the 25 fps video frame rate. Phased\nLSTM timing parameters are con\ufb01gured to align to the sampling time of their inputs. (b) Example\ninput of video (top) and audio (bottom). (c) Test loss using the video stream alone. The video frame\ninterval is 40 ms. Top: low resolution condition, MFCCs computed every 40ms with a network update every\n40 ms; Bottom: high resolution condition, MFCCs every 10 ms with a network update every 10 ms.\n\nIn the audio and video Phased LSTM layers, we manually align the open periods of the time gates\nto the sampling times of the inputs and disable learning of the \u03c4 and s parameters (see Fig. 
6a).\nThis prevents presenting zeros or arti\ufb01cial interpolations to the network when data is not present.\nIn the merged layers, however, the parameters of the time gate are learned, with the period \u03c4 of the\n\ufb01rst merged layer drawn from U(10, 1000) and the second from U(500, 3000). Fig. 6b shows a\nvisualization of one frame of video and the complete duration of an audio sample.\nDuring evaluation, all networks achieve greater than 98% accuracy on audio-only and combined\naudio-video inputs. However, video-only evaluation with an audio-video capable network proved\nthe most challenging, so the results in Fig. 6c focus on this case (though result rankings are\nrepresentative of all conditions). Two differently-sampled versions of the data were used: In the \ufb01rst\n\u201clow resolution\u201d version (Fig. 6c, top), the sampling rate of the MFCCs was matched to the sampling\nrate of the 25 fps video. In the second \u201chigh-resolution\u201d condition, the sampling rate was set to the\nmore common value of 100 Hz (Fig. 6c, bottom and shown in Fig. 6a). The\nhigher audio sampling rate did not increase accuracy, but allows for a lower latency (10 ms instead of\n40 ms). The Phased LSTM again converges substantially faster than both LSTM and batch-normalized\nLSTM. The peak accuracy of 81.15% compares favorably against lipreading-focused state-of-the-art\napproaches [28] while avoiding manually-crafted features.\n\n4 Discussion\n\nThe Phased LSTM has many surprising advantages. With its rhythmic periodicity, it acts like a\nlearnable, gated Fourier transform on its input, permitting very \ufb01ne timing discrimination. Alternatively, the rhythmic periodicity can be viewed as a kind of persistent dropout that preserves state [27],\nenhancing model diversity. The rhythmic inactivation can even be viewed as a shortcut to the past\nfor gradient backpropagation, accelerating training. 
The presented results support these interpretations, demonstrating the ability to discriminate rhythmic signals and to learn long memory traces.\nImportantly, in all experiments, Phased LSTM converges more quickly and theoretically requires\nonly 5% of the computes at runtime, while often improving in accuracy compared to standard LSTM.\nThe presented methods can also easily be extended to GRUs [6], and it is likely that even simpler\nmodels, such as ones that use a square-wave-like oscillation, will perform well, thereby encouraging\neven more ef\ufb01cient alternative Phased LSTM formulations. An inspiration for using\noscillations in recurrent networks comes from computational neuroscience [3], where rhythms have\nbeen shown to play important roles for synchronization and plasticity [22]. Phased LSTMs were\nnot designed as biologically plausible models, but may help explain some of the advantages and\nrobustness of learning in large spiking recurrent networks.\n\nReferences\n[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.\narXiv preprint arXiv:1409.0473, 2014.\n\n[2] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley,\nand Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for\nscienti\ufb01c computing conference (SciPy), volume 4, page 3, 2010.\n\n[3] G. Buzsaki. Rhythms of the Brain. Oxford University Press, 2006.\n[4] G. Cauwenberghs. An analog VLSI recurrent neural network learning a continuous-time trajectory. IEEE\nTransactions on Neural Networks, 7(2):346\u2013361, 1996.\n\n[5] K. Cho, A. Courville, and Y. Bengio. 
Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886, 2015.

[6] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[7] G. K. Cohen, G. Orchard, S. H. Ieng, J. Tapson, R. B. Benosman, and A. van Schaik. Skimming digits: Neuromorphic classification of spike-encoded images. Frontiers in Neuroscience, 10(184), 2016.

[8] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.

[9] S. Dieleman et al. Lasagne: First release, Aug. 2015.

[10] K.-I. Funahashi and Y. Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993.

[11] F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 3, pages 189–194. IEEE, 2000.

[12] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[13] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649, 2013.

[14] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
In The IEEE International Conference on Computer Vision (ICCV), 2015.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[17] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015.

[18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. arXiv preprint arXiv:1402.3511, 2014.

[20] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. Interspeech, 2:3, 2010.

[21] D. Neil and S.-C. Liu. Effective sensor fusion with event-based sensors and deep network architectures. In IEEE Int. Symposium on Circuits and Systems (ISCAS), 2016.

[22] B. Nessler, M. Pfeiffer, L. Buesing, and W. Maass. Bayesian computation emerges in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS Comput Biol, 9(4):e1003037, 2013.

[23] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, and M. Pfeiffer. Real-time classification and sensor fusion with a spiking Deep Belief Network. Frontiers in Neuroscience, 7, 2013.

[24] G. Orchard, A. Jayawant, G. Cohen, and N. Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. arXiv preprint arXiv:1507.07629, 2015.

[25] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263–269, 1989.

[26] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck. Retinomorphic event-based vision sensors: bioinspired cameras with spiking outputs. Proceedings of the IEEE, 102(10):1470–1484, 2014.

[27] S. Semeniuta, A. Severyn, and E. Barth. Recurrent dropout without memory loss. arXiv preprint arXiv:1603.05118, 2016.

[28] M. Wand, J. Koutník, and J. Schmidhuber. Lipreading with long short-term memory.
arXiv preprint arXiv:1601.08188, 2016.

[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015.