{"title": "Recurrent Networks and NARMA Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 301, "page_last": 308, "abstract": null, "full_text": "Recurrent Networks and N ARMA Modeling \n\nJerome Connor \n\nLes E. Atlas \n\nDouglas R. Martin \n\nFT-lO \n\nInteractive Systems Design Laboratory \n\nDept. of Electrical Engineering \n\nUniversity of Washington \nSeattle, Washington 98195 \n\nB-317 \n\nDept. of Statistics \n\nUniversity of Washington \nSeattle, Washington 98195 \n\nAbstract \n\nThere exist large classes of time series, such as those with nonlinear moving \naverage components, that are not well modeled by feedforward networks \nor linear models, but can be modeled by recurrent networks. We show that \nrecurrent neural networks are a type of nonlinear autoregressive-moving \naverage (N ARMA) model. Practical ability will be shown in the results of \na competition sponsored by the Puget Sound Power and Light Company, \nwhere the recurrent networks gave the best performance on electric load \nforecasting. \n\n1 \n\nIntroduction \n\nThis paper will concentrate on identifying types of time series for which a recurrent \nnetwork provides a significantly better model, and corresponding prediction, than \na feedforward network. Our main interest is in discrete time series that are par(cid:173)\nsimoniously modeled by a simple recurrent network, but for which, a feedforward \nneural network is highly non-parsimonious by virtue of requiring an infinite amount \nof past observations as input to achieve the same accuracy in prediction. \nOur approach is to consider predictive neural networks as stochastic models. Section \n2 will be devoted to a brief summary of time series theory that will be used to \nillustrate the the differences between feedforward and recurrent networks. Section 3 \nwill investigate some of the problems associated with nonlinear moving average and \nstate space models of time series. 
In particular, neural networks will be analyzed as nonlinear extensions of traditional linear models. From the preceding sections, it will become apparent that the recurrent network will have advantages over feedforward neural networks in much the same way that ARMA models have over autoregressive models for some types of time series. \n\nFinally, in section 4, the results of a competition in electric load forecasting sponsored by the Puget Sound Power and Light Company will be discussed. In this competition, a recurrent network model gave superior results to feedforward networks and various types of linear models. The advantages of a state space model for multivariate time series will be shown on the Puget Power time series. \n\n2 Traditional Approaches to Time Series Analysis \n\nThe statistical approach to forecasting involves the construction of stochastic models to predict the value of an observation x_t using previous observations. This is often accomplished using linear stochastic difference equation models with random inputs. \n\nA very general class of linear models used for forecasting purposes is the class of ARMA(p,q) models \n\nx_t = Σ_{i=1}^p φ_i x_{t-i} + Σ_{i=1}^q θ_i e_{t-i} + e_t \n\nwhere e_t denotes random noise, independent of past x_t's. The conditional mean (minimum mean square error) predictor x̂_t of x_t can be expressed in the recurrent form \n\nx̂_t = Σ_{i=1}^p φ_i x_{t-i} + Σ_{i=1}^q θ_i ê_{t-i}, \n\nwhere ê_k is approximated by \n\nê_k = x_k - x̂_k,  k = t-1, ..., t-q. \n\nThe key properties of interest for an ARMA(p,q) model are stationarity and invertibility. If the process x_t is stationary, its statistical properties are independent of time. 
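The recurrent form of the ARMA predictor above, with estimated residuals ê_k = x_k - x̂_k fed back in place of the unobserved noise, can be sketched in code. This is an illustrative sketch only, not from the paper; the coefficient vectors phi and theta are assumed to be already known rather than estimated:

```python
import numpy as np

def arma_predict(x, phi, theta):
    """One-step ARMA(p,q) prediction in recurrent form:
    x_hat[t] = sum_i phi[i]*x[t-1-i] + sum_j theta[j]*e_hat[t-1-j],
    where the residuals e_hat[t] = x[t] - x_hat[t] stand in for the noise."""
    p, q = len(phi), len(theta)
    x_hat = np.zeros(len(x))
    e_hat = np.zeros(len(x))
    for t in range(len(x)):
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * e_hat[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x_hat[t] = ar + ma
        e_hat[t] = x[t] - x_hat[t]  # feed the residual back as the noise estimate
    return x_hat, e_hat

# Simulate an ARMA(1,1) process and predict it with the true coefficients;
# the recovered residuals then coincide with the driving noise.
rng = np.random.default_rng(0)
e = rng.normal(size=500)
x = np.zeros(500)
x[0] = e[0]
for t in range(1, 500):
    x[t] = 0.7 * x[t - 1] + 0.4 * e[t - 1] + e[t]
x_hat, e_hat = arma_predict(x, phi=[0.7], theta=[0.4])
```

With the true coefficients the residual recursion reproduces the noise exactly, which is the invertibility property discussed next.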
Any stationary ARMA(p,q) process can be written as a moving average \n\nx_t = Σ_{k=1}^∞ h_k e_{t-k} + e_t. \n\nAn invertible process can be equivalently expressed in terms of previous observations or residuals. For a process to be invertible, all the poles of the z-transform must lie inside the unit circle of the z plane. An invertible ARMA(p,q) process can be written as an infinite autoregression \n\nx_t = Σ_{k=1}^∞ φ_k x_{t-k} + e_t. \n\nAs an example of how the inverse process occurs, let e_t be solved for in terms of x_t and then substitute previous e_t's into the original process. This can be illustrated with an MA(1) process \n\nx_t = e_t + θ e_{t-1} \ne_{t-i} = x_{t-i} - θ e_{t-i-1} \nx_t = e_t + θ (x_{t-1} - θ e_{t-2}) \nx_t = e_t + Σ_i (-1)^{i-1} θ^i x_{t-i}. \n\nLooking at this example, it can be seen that an MA(1) process with |θ| ≈ 1 will depend significantly on observations in the distant past. However, if |θ| < 1, then the effect of the distant past is negligible. \n\nIn the nonlinear case, it will be shown that it is not always possible to go back and forth between descriptions in terms of observables (e.g. x_i) and descriptions in terms of unobservables (e.g. e_i) even when s_t = 0. For a review of time series prediction in greater depth see the works of Box [1] or Harvey [2]. \n\n3 Nonlinear ARMA Models \n\nMany types of nonlinear models have been proposed in the literature. Here we focus on feedforward and recurrent neural networks and how they relate to nonlinear ARMA models. \n\n3.1 Nonlinear Autoregressive Models \n\nThe simplest generalization to the nonlinear case would be the nonlinear autoregressive (NAR) model \n\nx_t = h(x_{t-1}, x_{t-2}, ..., x_{t-p}) + e_t, \n\nwhere h(·) is an unknown smooth function, with the assumption that the best (i.e., minimum mean square error) prediction of x_t given x_{t-1}, ..., x_{t-p} is its conditional mean \n\nx̂_t = E(x_t | x_{t-1}, ..., x_{t-p}) = h(x_{t-1}, ..., x_{t-p}). \n\nFeedforward networks were first proposed as an NAR model for time series prediction by Lapedes and Farber [3]. A feedforward network is a nonlinear approximation to h given by \n\nx̂_t = ĥ(x_{t-1}, ..., x_{t-p}) = Σ_{i=1}^I W_i f(Σ_{j=1}^p w_{ij} x_{t-j}). \n\nThe weight matrix is lower diagonal and will allow no feedback. Thus the feedforward network is a nonlinear mapping from previous observations onto predictions of future observations. The function f(x) is a smooth bounded monotonic function, typically a sigmoid. \n\nThe parameters W_i and w_{ij} are estimated from a training sample x_1, ..., x_N, thereby obtaining an estimate ĥ of h. Estimates are obtained by minimizing the sum of squared residuals Σ_{t=1}^N (x_t - x̂_t)^2 by the gradient descent procedure known as \"backpropagation\" [4]. \n\n3.2 NARMA or NMA \n\nA simple nonlinear generalization of ARMA models is \n\nx_t = h(x_{t-1}, x_{t-2}, ..., x_{t-p}, e_{t-1}, ..., e_{t-q}) + e_t. \n\nIt is natural to predict \n\nx̂_t = h(x_{t-1}, x_{t-2}, ..., x_{t-p}, ê_{t-1}, ..., ê_{t-q}). \n\nIf this model is chosen, then a recurrent network can approximate it as \n\nx̂_t = ĥ(x_{t-1}, ..., x_{t-p}) = Σ_{i=1}^I W_i f(Σ_{j=1}^p w_{ij} x_{t-j} + Σ_{j=1}^q w'_{ij} (x_{t-j} - x̂_{t-j})). \n\nThis model is a special case of the fully interconnected recurrent network \n\nx̂_t = Σ_{i=1}^I W_i f(Σ_{j=1}^n w'_{ij} x_{t-j}) \n\nwhere the w'_{ij} are coefficients of a full matrix. \n\nNonlinear autoregressive models and nonlinear moving average models are not always equivalent for nondeterministic processes, as they are in the linear case. If the probability of the next observation depends on the previous state of the process, a representation built on e_t may not be complete unless some information on the previous state is added [8]. The problem is that if e_t, ..., e_{t-m} are known, there is still not enough information to determine which state the series is in at t - m. 
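The recurrent approximation above, in which hidden units see both past observations and past residuals x_{t-j} - x̂_{t-j}, can be made concrete with a small sketch. The weights here are random placeholders, not fitted values (in practice they would be trained, e.g. by gradient methods); the point is only the recurrence structure:

```python
import numpy as np

def narma_net_predict(x, W_out, W_ar, W_ma):
    """Recurrent-network NARMA predictor: each hidden unit i computes
    f(sum_j W_ar[i,j]*x[t-1-j] + sum_j W_ma[i,j]*(x[t-1-j] - x_hat[t-1-j])),
    so past residuals x - x_hat play the role of the moving-average inputs."""
    f = np.tanh  # a smooth bounded monotonic activation, as in the text
    n_hidden, p = W_ar.shape
    _, q = W_ma.shape
    x_hat = np.zeros(len(x))
    for t in range(len(x)):
        net = np.zeros(n_hidden)
        for j in range(p):
            if t - 1 - j >= 0:
                net += W_ar[:, j] * x[t - 1 - j]        # autoregressive inputs
        for j in range(q):
            if t - 1 - j >= 0:
                net += W_ma[:, j] * (x[t - 1 - j] - x_hat[t - 1 - j])  # residual feedback
        x_hat[t] = W_out @ f(net)
    return x_hat

# Placeholder (untrained) weights, just to exercise the recurrence.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
x_hat = narma_net_predict(x,
                          W_out=0.1 * rng.normal(size=3),
                          W_ar=0.1 * rng.normal(size=(3, 2)),
                          W_ma=0.1 * rng.normal(size=(3, 1)))
```

Setting W_ma to zero recovers the feedforward NAR network of section 3.1, which makes the relationship between the two model classes explicit.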
Given the lack of knowledge of the initial state, it is impossible to predict future states, and without the state information the best predictions cannot be made. \n\nIf the moving average representation cannot be made with e_t alone, it still may be possible to express a model in terms of past e_t and state information. It has been shown that for a large class of nondeterministic Markov processes, a model of this form can be constructed [8]. This link is important, because a recurrent network is this type of model. For further details on using recurrent networks for NARMA modeling see Connor et al. [9]. \n\n4 Competition on Load Forecasting Data \n\nA fully interconnected recurrent network trained with the Williams and Zipser algorithm [10] was part of a competition to predict the loads of the Puget Sound Power and Light Company from November 11, 1990 to March 31, 1991. The object was to predict the demand for electric power, known as the load, over the profile of each day on the previous working day. Because the forecast is made on Friday morning, the Monday prediction is the most difficult. Actual loads and temperatures of the past are available, as well as forecasted temperatures for the day of the prediction. \n\nNeural networks are not parsimonious and many parameters need to be determined. Seasonality limits the amount of useful data for the load forecasting problem. For example, the load profile in August is not useful for predicting the load profile in January. This limited amount of data severely constrains the number of parameters a model can accurately determine. We avoided seasonality, while increasing the size of the training set, by including data from the last four winters. In total, 26,976 vectors were available when data from August 1 to March 31 for 1986 to 1990 were included. 
The larger training set enables neural network models to be trained with less danger of overfitting the data. If the network can accurately model load growth over the years, then the network will have the added advantage of being exposed to a larger temperature spectrum on which to base future predictions. The larger temperature spectrum is hypothetically useful for predicting phenomena such as cold snaps, which can result in larger loads than normal. It should be noted that neural networks have been applied to this model in the past [6]. \n\nInitially five recurrent models were constructed, one for each day of the week, with Wednesday, Thursday, and Friday in a single network. Each network has temperature and load values from a week previous at that hour, the forecasted temperature of the hour to be predicted, and the hour, year, and week of the forecast. The week of the forecast was included to allow the network to model the seasonality of the data. Some models have added load and temperature from earlier in the week, depending on the availability of the data. The networks themselves consisted of three to four neurons in the hidden layer. This predictor is of the form \n\nl_t(k) = e_t(k-7) + f(l_t(k-7), e_t(k-7), T̂_t(k), T_s(k-1), t, d, y), \n\nwhere f(·) is a nonlinear function, l_t(k) is the load at time t and day k, e_t is the noise, T_s is the temperature, T̂ is the forecasted temperature, d is the day of the week, and y is the year of the data. \n\nAfter comparing its performance to that of the winner of the competition, the linear model in Fig. 1, the poor performance could be attributed to the choice of model, rather than a problem with recurrent networks. It should be mentioned that the linear model took as one of its inputs the square of the last available load. This is a parsimonious way of modeling nonlinearities. 
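The idea of a squared-load input as a parsimonious nonlinearity can be illustrated with a toy regression. The data below are synthetic, not the competition data; they merely show that augmenting a linear model with the square of an input captures curvature a purely linear model misses:

```python
import numpy as np

# Synthetic illustration: the target depends quadratically on load, so a
# linear model with a squared-load column fits far better than load alone.
rng = np.random.default_rng(2)
load = rng.uniform(1000.0, 4000.0, size=200)
target = 0.5 * load + 1e-4 * load**2 + rng.normal(scale=10.0, size=200)

X_lin = np.column_stack([np.ones_like(load), load])           # intercept, load
X_sq = np.column_stack([np.ones_like(load), load, load**2])   # ... + load^2

beta_lin, *_ = np.linalg.lstsq(X_lin, target, rcond=None)
beta_sq, *_ = np.linalg.lstsq(X_sq, target, rcond=None)

mse_lin = np.mean((X_lin @ beta_lin - target) ** 2)
mse_sq = np.mean((X_sq @ beta_sq - target) ** 2)
```

The single extra column buys the nonlinearity at the cost of one parameter, whereas a neural network absorbs such curvature into its hidden-layer weights, as the recurrent predictor described next does.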
A second recurrent predictor was then built with the same input and output configuration as the linear model, save the square of the previous load term, which the net's nonlinearities can handle. This net, denoted as the Recurrent Network, had a different recurrent model for each hour of the day; this yielded the best predictions. This predictor is of the form \n\nl_t(k) = e_t(k) + f_t(l_t(k-1), e_t(k-1), T̂_t(k), T_s(k-1), d, y). \n\nAll of the models in the figure use the last available load, forecasted temperature at the hour to be predicted, maximum forecasted temperature of the day to be predicted, the previous midnight temperatures, and the hour and year of the prediction. A second recurrent network was also trained with the last available load at that hour; this enabled e_{t-1} to be modeled. The availability of e_{t-1} turned out to be the difference between making superior and average predictions. It should be noted that the use of e_{t-1} did not improve the results of linear models. \n\nThe three most important error measures are the weekly morning, afternoon, and total loads and are listed in the table below. \n\nTable 1: Mean Square Error (Recurrent: .0275, .0355, .0218, .0311) \n\nThe A.M. peak is the mean absolute percent error (MAPE) of the summed predictions of 7 A.M. to 9 A.M., the P.M. peak is the MAPE of the summed predictions of 5 P.M. to 7 P.M., and the total is the MAPE of the summed predictions over the entire day. Results of the recurrent network and other predictors, for the total power for the day prediction, are shown in Fig. 1. The performance on the A.M. and P.M. peaks was similar [9]. \n\nThe failure of the daily recurrent network to accurately predict is a product of trying to model too complex a problem. 
When the complexity of the problem was reduced to that of predicting a single hour of the day, results improved significantly [7]. \n\nThe superior performance of the recurrent network over the feedforward network is time series dependent. A feedforward and a recurrent network with the same input representation were trained to predict the 5 P.M. load on the previous work day. The feedforward network succeeded in modeling the training set with a mean square error of .0153, compared to the recurrent network's .0179. However, when tested on several winters outside the training set, the results, listed in the table below, varied. For the 1990-91 winter, the recurrent network did better, with a mean square error of .0311 compared to the feedforward network's .0331. For the other winters, of the years before the training set, the results were quite different: the feedforward network won in all cases. The differences in prediction performance can be explained by the inability of the feedforward network to model load growth in the future. The loads experienced in the 1990-91 winter were outside the range of the entire training set. The earlier winters' range of loads was not as far from the training set, and the feedforward network modeled them well. \n\nThe effect of the nonlinear nature of neural networks was apparent in the error residuals of the training and test sets. Figs. 2 and 3 are plots of the residuals against the predicted load for the training and test sets, respectively. In Fig. 2, the mean and variance of the residuals are roughly constant as a function of the predicted load; this is indicative of a good fit to the data. However, in Fig. 3, the errors tend to be positive for larger loads and negative for lesser loads. This is a product of the squashing effect of the sigmoidal nonlinearities. The squashing effect becomes acute during the prediction of the peak loads of the winter. 
These peak loads are caused when a cold spell occurs and the power demand reaches record levels. This is the only measure on which the performance of the recurrent networks is surpassed: human experts outperformed the recurrent network for predictions during cold spells. The recurrent network did outperform all other statistical models on this measure. \n\nFigure 1: Competition Performance on Total Power \n\nFigure 2: Prediction vs. Residual on Training Set \n\nFigure 3: Prediction vs. 
Residual on Testing Set \n\n5 Conclusion \n\nRecurrent networks are the nonlinear neural network analog of linear ARMA models. As such, they are well-suited for time series that possess moving average components, are state dependent, or have trends. Recurrent neural networks can give superior results for load forecasting, but as with linear models, the choice of model is critical to good prediction performance. \n\n6 Acknowledgements \n\nWe would like to thank Milan Casey Brace of the Puget Power Corporation, Dr. Seho Oh, Dr. Mohammed El-Sharkawi, Dr. Robert Marks, and Dr. Mark Damborg for helpful discussions. We would also like to thank the National Science Foundation for partially supporting this work. \n\nReferences \n\n[1] G. Box, Time Series Analysis: Forecasting and Control, Holden-Day, 1976. \n\n[2] A. C. Harvey, The Econometric Analysis of Time Series, MIT Press, 1990. \n\n[3] A. Lapedes and R. Farber, \"Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling\", Technical Report LA-UR87-2662, Los Alamos National Laboratory, Los Alamos, New Mexico, 1987. \n\n[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, \"Learning internal representations by error propagation\", in Parallel Distributed Processing, vol. 1, D. E. Rumelhart and J. L. McClelland, eds., Cambridge: M.I.T. Press, 1986, pp. 318-362. \n\n[5] M. C. Brace, \"A Comparison of the Forecasting Accuracy of Neural Networks with Other Established Techniques\", Proc. of the 1st Int. Forum on Applications of Neural Networks to Power Systems, Seattle, July 23-26, 1991. \n\n[6] L. Atlas, J. Connor, et al., \"Performance Comparisons Between Backpropagation Networks and Classification Trees on Three Real-World Applications\", Advances in Neural Information Processing Systems 2, pp. 622-629, ed. D. Touretzky, 1989. \n\n[7] S. 
Oh et al., \"Electric Load Forecasting Using an Adaptively Trained Layered Perceptron\", Proc. of the 1st Int. Forum on Applications of Neural Networks to Power Systems, Seattle, July 23-26, 1991. \n\n[8] M. Rosenblatt, Markov Processes: Structure and Asymptotic Behavior, Springer-Verlag, 1971, pp. 160-182. \n\n[9] J. Connor, L. E. Atlas, and R. D. Martin, \"Recurrent Neural Networks and Time Series Prediction\", to be submitted to IEEE Trans. on Neural Networks, 1992. \n\n[10] R. Williams and D. Zipser, \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation, 1, 1989, pp. 270-280. \n", "award": [], "sourceid": 475, "authors": [{"given_name": "Jerome", "family_name": "Connor", "institution": null}, {"given_name": "Les", "family_name": "Atlas", "institution": null}, {"given_name": "Douglas", "family_name": "Martin", "institution": null}]}