{"title": "LSTM can Solve Hard Long Time Lag Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 473, "page_last": 479, "abstract": "", "full_text": "LSTM CAN SOLVE HARD\n\nLO G TIME LAG PROBLEMS\n\nSepp Hochreiter\n\nFakultat fur Informatik\n\nJiirgen Schmidhuber\n\nIDSIA\n\nTechnische Universitat Munchen\n\n80290 Miinchen, Germany\n\nCorso Elvezia 36\n\n6900 Lugano, Switzerland\n\nAbstract\n\nStandard recurrent nets cannot deal with long minimal time lags\nbetween relevant signals. Several recent NIPS papers propose alter(cid:173)\nnative methods. We first show: problems used to promote various\nprevious algorithms can be solved more quickly by random weight\nguessing than by the proposed algorithms. We then use LSTM,\nour own recent algorithm, to solve a hard problem that can neither\nbe quickly solved by random search nor by any other recurrent net\nalgorithm we are aware of.\n\n1 TRIVIAL PREVIOUS LONG TIME LAG PROBLEMS\n\nTraditional recurrent nets fail in case 'of long minimal time lags between input sig(cid:173)\nnals and corresponding error signals [7, 3]. Many recent papers propose alternative\nmethods, e.g., [16, 12, 1,5,9]. For instance, Bengio et ale investigate methods such\nas simulated annealing, multi-grid random search, time-weighted pseudo-Newton\noptimization, and discrete error propagation [3]. They also propose. an EM ap(cid:173)\nproach [1]. Quite a few papers use variants of the \"2-sequence problem\" (and ('latch\nproblem\" ) to show the proposed algorithm's superiority, e.g. [3, 1, 5, 9]. Some pa(cid:173)\npers also use the \"parity problem\", e.g., [3, 1]. Some of Tomita's [18] grammars are\nalso often used as benchmark problems for recurrent nets [2, 19, 14, 11].\n\nTrivial versus non-trivial tasks. By our definition, a \"trivial\" task is one that\ncan be solved quickly by random search (RS) in weight space. 
RS works as follows: REPEAT randomly initialize the weights and test the resulting net on a training set UNTIL solution found.

Random search (RS) details. In all our RS experiments, we randomly initialize weights in [-100.0, 100.0]. Binary inputs are -1.0 (for 0) and 1.0 (for 1). Targets are either 1.0 or 0.0. All activation functions are logistic sigmoid in [0.0, 1.0]. We use two architectures (A1, A2) suitable for many widely used \"benchmark\" problems: A1 is a fully connected net with 1 input, 1 output, and n biased hidden units. A2 is like A1 with n = 10, but less densely connected: each hidden unit sees the input unit, the output unit, and itself; the output unit sees all other units; all units are biased. All activations are set to 0 at each sequence begin. We will indicate where we also use different architectures of other authors. All sequence lengths are randomly chosen between 500 and 600 (most other authors facilitate their problems by using much shorter training/test sequences). The \"benchmark\" problems always require to classify two types of sequences. Our training set consists of 100 sequences, 50 from class 1 (target 0) and 50 from class 2 (target 1). Correct sequence classification is defined as \"absolute error at sequence end below 0.1\". We stop the search once a random weight matrix correctly classifies all training sequences. Then we test on the test set (100 sequences). All results below are averages of 10 trials. In all our simulations below, RS finally classified all test set sequences correctly; average final absolute test set errors were always below 0.001, in most cases below 0.0001.

\"2-sequence problem\" (and \"latch problem\") [3, 1, 9]. The task is to observe and classify input sequences. There are two classes. There is only one input unit or input line. Only the first N real-valued sequence elements convey relevant information about the class. 
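For concreteness, the RS procedure above can be sketched as follows, applied to a much-shortened 2-sequence-style task. This is a hypothetical Python illustration, not the authors' code: the net is a simplified fully recurrent A1-style architecture, sequence lengths and training set size are drastically reduced so the sketch runs quickly, and all function and variable names are our own.

```python
import math
import random

def logistic(x):
    # logistic sigmoid in [0, 1]; clipped to avoid overflow with huge weights
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, x))))

def forward(W, b, seq, n_units):
    # Fully recurrent net of n_units sigmoid units (last unit = output unit).
    # Each unit sees the scalar input, all unit activations from the previous
    # step, and a bias; all activations start at 0 at each sequence begin.
    act = [0.0] * n_units
    for x in seq:
        act = [logistic(W[i][0] * x
                        + sum(W[i][j + 1] * act[j] for j in range(n_units))
                        + b[i])
               for i in range(n_units)]
    return act[-1]

def rs_solve(train_set, n_units, max_trials, rng):
    # REPEAT: guess all weights uniformly from [-100, 100] and test,
    # UNTIL every training sequence ends with absolute error below 0.1.
    for trial in range(1, max_trials + 1):
        W = [[rng.uniform(-100.0, 100.0) for _ in range(n_units + 1)]
             for _ in range(n_units)]
        b = [rng.uniform(-100.0, 100.0) for _ in range(n_units)]
        if all(abs(forward(W, b, seq, n_units) - target) < 0.1
               for seq, target in train_set):
            return trial, W, b
    return None  # no random guess succeeded within the trial budget

# Shortened 2-sequence-style data: element 1 carries the class, the rest
# is Gaussian noise; target 1.0 for class 1, 0.0 for class 2.
rng = random.Random(0)
def make_example(cls, length=10):
    seq = [1.0 if cls == 1 else -1.0]
    seq += [rng.gauss(0.0, math.sqrt(0.2)) for _ in range(length - 1)]
    return seq, 1.0 if cls == 1 else 0.0

train = [make_example(1) for _ in range(5)] + [make_example(2) for _ in range(5)]
result = rs_solve(train, n_units=2, max_trials=10000, rng=rng)
```

On a toy instance of this size a solution is typically guessed within a few hundred to a few thousand trials; with the paper's 500-600-step sequences and 100 training sequences the search is correspondingly more expensive.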
Sequence elements at positions t > N (we use N = 1) are generated by a Gaussian with mean zero and variance 0.2. The first sequence element is 1.0 (-1.0) for class 1 (2). Target at sequence end is 1.0 (0.0) for class 1 (2) (the latch problem is a simple version of the 2-sequence problem that allows for input tuning instead of weight tuning).

Bengio et al.'s results. For the 2-sequence problem, the best method among the six tested by Bengio et al. [3] was multigrid random search (sequence lengths 50-100; N and stopping criterion undefined), which solved the problem after 6,400 sequence presentations, with final classification error 0.06. In more recent work, Bengio and Frasconi reported that an EM approach [1] solves the problem within 2,900 trials.

RS results. RS with architecture A2 (A1, n = 1) solves the problem within only 718 (1247) trials on average. Using an architecture with only 3 parameters (as in Bengio et al.'s architecture for the latch problem [3]), the problem was solved within only 22 trials on average, due to tiny parameter space. According to our definition above, the problem is trivial. RS outperforms Bengio et al.'s methods in every respect: (1) many fewer trials required, (2) much less computation time per trial. Also, in most cases (3) the solution quality is better (less error).

It should be mentioned, however, that different input representations and different types of noise may lead to worse RS performance (Yoshua Bengio, personal communication, 1996).

\"Parity problem\". The parity task [3, 1] requires to classify sequences with several hundred elements (only 1's or -1's) according to whether the number of 1's is even or odd. The target at sequence end is 1.0 for odd and 0.0 for even.

Bengio et al.'s results. 
For sequences with only 25-50 steps, among the six methods tested in [3] only simulated annealing was reported to achieve final classification error of 0.000 (within about 810,000 trials; the authors did not mention the precise stopping criterion). A method called \"discrete error BP\" took about 54,000 trials to achieve final classification error 0.05. In more recent work [1], for sequences with 250-500 steps, their EM approach took about 3,400 trials to achieve final classification error 0.12.

RS results. RS with A1 (n = 1) solves the problem within only 2906 trials on average. RS with A2 solves it within 2797 trials. We also ran another experiment with architecture A2, but without self-connections for hidden units. RS solved the problem within 250 trials on average.

Again it should be mentioned that different input representations and noise types may lead to worse RS performance (Yoshua Bengio, personal communication, 1996).

Tomita grammars. Many authors also use Tomita's grammars [18] to test their algorithms. See, e.g., [2, 19, 14, 11, 10]. Since we already tested parity problems above, we now focus on a few \"parity-free\" Tomita grammars (nrs. #1, #2, #4). Previous work facilitated the problems by restricting sequence length. E.g., in [11], maximal test (training) sequence length is 15 (10). Reference [11] reports the number of sequences required for convergence (for various first and second order nets with 3 to 9 units): Tomita #1: 23,000 - 46,000; Tomita #2: 77,000 - 200,000; Tomita #4: 46,000 - 210,000. RS, however, clearly outperforms the methods in [11]. The average results are: Tomita #1: 182 (A1, n = 1) and 288 (A2), Tomita #2: 1,511 (A1, n = 3) and 17,953 (A2), Tomita #4: 13,833 (A1, n = 2) and 35,610 (A2).

Non-trivial tasks / Outline of remainder. Solutions of non-trivial tasks are sparse in weight space. 
They require either many free parameters (e.g., input weights) or high weight precision, such that RS becomes infeasible. To solve such tasks we need a novel method called \"Long Short-Term Memory\", or LSTM for short [8]. Section 2 will briefly review LSTM. Section 3 will show results on a task that cannot be solved at all by any other recurrent net learning algorithm we are aware of. The task involves distributed, high-precision, continuous-valued representations and long minimal time lags: there are no short time lag training exemplars facilitating learning.

2 LONG SHORT-TERM MEMORY

Memory cells and gate units: basic ideas. LSTM's basic unit is called a memory cell. Within each memory cell, there is a linear unit with a fixed-weight self-connection (compare Mozer's time constants [12]). This enforces constant, non-exploding, non-vanishing error flow within the memory cell. A multiplicative input gate unit learns to protect the constant error flow within the memory cell from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit learns to protect other units from perturbation by currently irrelevant memory contents stored in the memory cell. The gates learn to open and close access to constant error flow. Why is constant error flow important? For instance, with conventional \"backprop through time\" (BPTT, e.g., [20]) or RTRL (e.g., [15]), error signals \"flowing backwards in time\" tend to vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights. For the first theoretical error flow analysis see [7]. See [3] for a more recent, independent, essentially identical analysis.

LSTM details. In what follows, w_{uv} denotes the weight on the connection from unit v to unit u. net_u(t), y^u(t) are net input and activation of unit u (with activation function f_u) at time t. 
For all non-input units that aren't memory cells (e.g. output units), we have y^u(t) = f_u(net_u(t)), where net_u(t) = Σ_v w_{uv} y^v(t-1). The j-th memory cell is denoted c_j. Each memory cell is built around a central linear unit with a fixed self-connection (weight 1.0) and identity function as activation function (see definition of s_{c_j} below). In addition to net_{c_j}(t) = Σ_u w_{c_j u} y^u(t-1), c_j also gets input from a special unit out_j (the \"output gate\"), and from another special unit in_j (the \"input gate\"). in_j's activation at time t is denoted by y^{in_j}(t). out_j's activation at time t is denoted by y^{out_j}(t). in_j, out_j are viewed as ordinary hidden units. We have y^{out_j}(t) = f_{out_j}(net_{out_j}(t)), y^{in_j}(t) = f_{in_j}(net_{in_j}(t)), where net_{out_j}(t) = Σ_u w_{out_j u} y^u(t-1), net_{in_j}(t) = Σ_u w_{in_j u} y^u(t-1). The summation indices u may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also paragraph on \"network topology\" below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like w_{c_j c_j}. It is up to the user to define the network topology. At time t, c_j's output y^{c_j}(t) is computed in a sigma-pi-like fashion: y^{c_j}(t) = y^{out_j}(t) h(s_{c_j}(t)), where

s_{c_j}(0) = 0, s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t) g(net_{c_j}(t)) for t > 0.

The differentiable function g scales net_{c_j}. The differentiable function h scales memory cell outputs computed from the internal state s_{c_j}.

Why gate units? in_j controls the error flow to memory cell c_j's input connections w_{c_j u}. out_j controls the error flow from unit c_j's output connections. 
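The cell dynamics just defined can be sketched numerically. This is a hypothetical Python illustration of a single memory cell (all names are ours); the logistic ranges for g and h match those used in the experiment of Section 3, and the net inputs are supplied directly instead of being computed from weighted sums over other units.

```python
import math

def squash(x, lo=0.0, hi=1.0):
    # logistic sigmoid rescaled into [lo, hi]
    return lo + (hi - lo) / (1.0 + math.exp(-x))

class MemoryCell:
    # One memory cell c_j with input gate in_j and output gate out_j:
    #   s(0) = 0,  s(t) = s(t-1) + y_in(t) * g(net_c(t)),
    #   y_c(t) = y_out(t) * h(s(t)).
    # The internal state s is a linear unit with fixed self-weight 1.0.
    def __init__(self):
        self.s = 0.0

    def step(self, net_c, net_in, net_out):
        y_in = squash(net_in)            # input gate activation in [0, 1]
        y_out = squash(net_out)          # output gate activation in [0, 1]
        self.s += y_in * squash(net_c, -2.0, 2.0)   # g squashes into [-2, 2]
        return y_out * squash(self.s, -1.0, 1.0)    # h squashes into [-1, 1]

# A strongly negative input-gate net input keeps the gate shut, so repeated
# irrelevant cell input barely perturbs the stored state:
closed_cell = MemoryCell()
for _ in range(100):
    closed_cell.step(net_c=2.0, net_in=-10.0, net_out=0.0)
# With the gate open, a single presentation of the same input is stored:
open_cell = MemoryCell()
y = open_cell.step(net_c=2.0, net_in=10.0, net_out=10.0)
```

Because the gates multiply rather than add, a shut gate (activation near 0) blocks both storage and read-out regardless of the other net inputs.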
Error signals trapped within a memory cell cannot change; but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its memory cell, by appropriately scaling them. Likewise, the input gate will have to learn when to release errors. Gates open and close access to constant error flow.

Network topology. There is one input, one hidden, and one output layer. The fully self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as hidden units located in the hidden layer). The hidden layer may also contain \"conventional\" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in higher layers.

Memory cell blocks. S memory cells sharing one input gate and one output gate form a \"memory cell block of size S\". They can facilitate information storage.

Learning with excellent computational complexity (see details in the appendix of [8]). We use a variant of RTRL which properly takes into account the altered (sigma-pi-like) dynamics caused by input and output gates. However, to ensure constant error backprop, like with truncated BPTT [20], errors arriving at \"memory cell net inputs\" (for cell c_j, this includes net_{c_j}, net_{in_j}, net_{out_j}) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells, errors are propagated back through previous internal states s_{c_j}. This enforces constant error flow within memory cells. Thus only the derivatives ∂s_{c_j}/∂w_{uv} need to be stored and updated. 
Hence, the algorithm is very efficient, and LSTM's update complexity per time step is excellent in comparison to other approaches such as RTRL: given n units and a fixed number of output units, LSTM's update complexity per time step is at most O(n^2), just like BPTT's.

3 EXPERIMENT: ADDING PROBLEM

Our previous experimental comparisons (on widely used benchmark problems) with RTRL (e.g., [15]; results compared to the ones in [17]), Recurrent Cascade-Correlation [6], Elman nets (results compared to the ones in [4]), and Neural Sequence Chunking [16], demonstrated that LSTM leads to many more successful runs than its competitors, and learns much faster [8]. The following task, though, is more difficult than the above benchmark problems: it cannot be solved at all in reasonable time by RS (we tried various architectures) nor any other recurrent net learning algorithm we are aware of (see [13] for an overview). The experiment will show that LSTM can solve non-trivial, complex long time lag problems involving distributed, high-precision, continuous-valued representations.

Task. Each element of each input sequence is a pair consisting of two components. The first component is a real value randomly chosen from the interval [-1,1]. The second component is either 1.0, 0.0, or -1.0, and is used as a marker: at the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. The value T is used to determine average sequence length, which is a randomly chosen integer between T and T + T/10. With a given sequence, exactly two pairs are marked as follows: we first randomly select and mark one of the first ten pairs (whose first component is called X1). Then we randomly select and mark one of the first T/2 - 1 still unmarked pairs (whose first component is called X2). 
The second components of the remaining pairs are zero except for the first and final pair, whose second components are -1 (X1 is set to zero in the rare case where the first pair of the sequence got marked). An error signal is generated only at the sequence end: the target is 0.5 + (X1 + X2)/4.0 (the sum X1 + X2 scaled to the interval [0,1]). A sequence was processed correctly if the absolute error at the sequence end is below 0.04.

Architecture. We use a 3-layer net with 2 input units, 1 output unit, and 2 memory cell blocks of size 2 (a cell block size of 1 works well, too). The output layer receives connections only from memory cells. Memory cells/gate units receive inputs from memory cells/gate units (fully connected hidden layer). Gate units (f_{in_j}, f_{out_j}) and output units are sigmoid in [0,1]. h is sigmoid in [-1,1], and g is sigmoid in [-2,2].

State drift versus initial bias. Note that the task requires to store the precise values of real numbers for long durations: the system must learn to protect memory cell contents against even minor \"internal state drifts\". Our simple but highly effective way of solving drift problems at the beginning of learning is to initially bias the input gate in_j towards zero. There is no need for fine tuning initial bias: with sigmoid logistic activation functions, the precise initial bias hardly matters because vastly different initial bias values produce almost the same near-zero activations. In fact, the system itself learns to generate the most appropriate input gate bias. To study the significance of the drift problem, we bias all non-input units, thus artificially inducing internal state drifts. Weights (including bias weights) are randomly initialized in the range [-0.1,0.1]. 
The first (second) input gate bias is initialized with -3.0 (-6.0) (recall that the precise initialization values hardly matter, as confirmed by additional experiments).

Training / Testing. The learning rate is 0.5. Training examples are generated on-line. Training is stopped if the average training error is below 0.01, and the 2000 most recent sequences were processed correctly (see definition above).

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than 3 incorrectly processed sequences. The following results are means of 10 trials: For T = 100 (T = 500, T = 1000), training was stopped after 74,000 (209,000; 853,000) training sequences, and then only 1 (0, 1) of the test sequences was not processed correctly. For T = 1000, the number of required training examples varied between 370,000 and 2,020,000, exceeding 700,000 in only 3 cases.

The experiment demonstrates even for very long minimal time lags: (1) LSTM is able to work well with distributed representations. (2) LSTM is able to perform calculations involving high-precision, continuous values. Such tasks are impossible to solve within reasonable time by other algorithms: the main problem of gradient-based approaches (including TDNN, pseudo-Newton) is their inability to deal with very long minimal time lags (vanishing gradient). A main problem of \"global\" and \"discrete\" approaches (RS, Bengio's and Frasconi's EM approach, discrete error propagation) is their inability to deal with high-precision, continuous values.

Other experiments. In [8] LSTM is used to solve numerous additional tasks that cannot be solved by any other recurrent net learning algorithm we are aware of. For instance, LSTM can extract information conveyed by the temporal order of widely separated inputs. 
LSTM also can learn real-valued, conditional expectations of strongly delayed, noisy targets, given the inputs.

Conclusion. For non-trivial tasks (where RS is infeasible), we recommend LSTM.

4 ACKNOWLEDGMENTS

This work was supported by DFG grant SCHM 942/3-1 from \"Deutsche Forschungsgemeinschaft\".

References

[1] Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropagation. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 75-82. San Mateo, CA: Morgan Kaufmann, 1994.

[2] Y. Bengio and P. Frasconi. An input output HMM architecture. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, Cambridge, MA, 1995.

[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[4] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite-state automata and simple recurrent networks. Neural Computation, 1:372-381, 1989.

[5] S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems 8, 1995. To appear.

[6] S. E. Fahlman. The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 190-196. San Mateo, CA: Morgan Kaufmann, 1991.

[7] J. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. See www7.informatik.tu-muenchen.de/~hochreit.

[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. 
Technical Report FKI-207-95, Fakultät für Informatik, Technische Universität München, 1995. Revised 1996 (see www.idsia.ch/~juergen, www7.informatik.tu-muenchen.de/~hochreit).

[9] T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies is not as difficult with NARX recurrent neural networks. Technical Report UMIACS-TR-95-78 and CS-TR-3500, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, 1995.

[10] P. Manolios and R. Fanelli. First-order recurrent neural networks and deterministic finite state automata. Neural Computation, 6:1155-1173, 1994.

[11] C. B. Miller and C. L. Giles. Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):849-872, 1993.

[12] M. C. Mozer. Induction of multiscale temporal structure. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 275-282. San Mateo, CA: Morgan Kaufmann, 1992.

[13] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212-1228, 1995.

[14] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227-252, 1991.

[15] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[16] J. H. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.

[17] A. W. Smith and D. Zipser. Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2):125-131, 1989.

[18] M. Tomita. 
Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105-108. Ann Arbor, MI, 1982.

[19] R. L. Watrous and G. M. Kuhn. Induction of finite-state automata using second-order recurrent networks. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309-316. San Mateo, CA: Morgan Kaufmann, 1992.

[20] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491-501, 1990.
", "award": [], "sourceid": 1215, "authors": [{"given_name": "Sepp", "family_name": "Hochreiter", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}