{"title": "Generalization by Weight-Elimination with Application to Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 875, "page_last": 882, "abstract": null, "full_text": "Generalization by Weight-Elimination \n\nwith Application to Forecasting \n\nAndreas S. Weigend \nPhysics Department \nStanford University \nStanford, CA 94305 \n\nDavid E. Rumelhart \nPsychology Department \n\nStanford University \nStanford, CA 94305 \n\nBernardo A. Huberman \nDynamics of Computation \nXeroxPARC \nPalo Alto, CA 94304 \n\nAbstract \n\nInspired by the information theoretic idea of minimum description length, we add \na term  to the back propagation cost function that penalizes network complexity. \nWe  give  the  details  of the  procedure,  called  weight-elimination,  describe  its \ndynamics, and clarify the meaning of the parameters involved. From a Bayesian \nperspective,  the complexity term  can  be usefully interpreted as  an  assumption \nabout prior distribution of the weights.  We  use  this  procedure  to  predict  the \nsunspot time series and the notoriously noisy series of currency exchange rates. \n\n1 \n\nINTRODUCTION \n\nLearning procedures for connectionist networks are essentially statistical devices for per(cid:173)\nforming inductive inference.  There is a trade-off between two goals:  on the one hand, we \nwant such devices  to be as  general as possible so that they are able to learn a broad range \nof problems.  This recommends  large and flexible networks.  On the other hand, the true \nmeasure of an  inductive device is  not how well  it performs  on the examples  it has  been \nshown, but how it performs on cases it has not yet seen, i.e., its out-of-sample performance. \n\nToo many weights of high precision make it easy for a net to fit the idiosyncrasies or \"noise\" \nof the training data and thus fail to generalize well to  new cases.  
This overfitting problem is familiar in inductive inference, such as polynomial curve fitting. There are a number of potential solutions to this problem. We focus here on the so-called minimal network strategy. The underlying hypothesis is: if several nets fit the data equally well, the simplest one will on average provide the best generalization. Evaluating this hypothesis requires (i) some way of measuring simplicity and (ii) a search procedure for finding the desired net. \n\nThe complexity of an algorithm can be measured by the length of its minimal description in some language. Rissanen [Ris89] and Cheeseman [Che90] formalized the old but vague intuition of Occam's razor as the information theoretic minimum description length (MDL) criterion: given some data, the most probable model is the model that minimizes \n\ndescription length = description length(data | model) + description length(model) , \n\nwhere the first term plays the role of the error, the second that of the complexity, and their sum is the cost. This sum represents the trade-off between residual error and model complexity. The goal is to find a net that has the lowest complexity while fitting the data adequately. The complexity is dominated by the number of bits needed to encode the weights. It is roughly proportional to the number of weights times the number of bits per weight. We focus here on the procedure of weight-elimination that tries to find a net with the smallest number of weights. We compare it with a second approach that tries to minimize the number of bits per weight, thereby creating a net that is not too dependent on the precise values of its weights. \n\n2 WEIGHT-ELIMINATION \n\nIn 1987, Rumelhart proposed a method for finding minimal nets within the framework of back propagation learning. 
In this section we explain and interpret the procedure and, for the first time, give the details of its implementation.1 \n\n2.1 METHOD \n\nThe idea is indeed simple in conception: add to the usual cost function a term which counts the number of parameters, and minimize the sum of performance error and the number of weights by back propagation, \n\ncost = sum_{k in T} (target_k - output_k)^2 + λ sum_{i in C} (w_i^2/w_0^2) / (1 + w_i^2/w_0^2) . (1) \n\nThe first term measures the performance of the net. In the simplest case, it is the sum squared error over the set of training examples T. The second term measures the size of the net. Its sum extends over all connections C. λ represents the relative importance of the complexity term with respect to the performance term. \n\nThe learning rule is then to change the weights according to the gradient of the entire cost function, continuously doing justice to the trade-off between error and complexity. This differs from methods that consider a set of fixed models, estimate the parameters for each of them, and then compare between the models by considering the number of parameters. \n\nThe complexity cost as a function of w_i/w_0 is shown in Figure 1(b). The extreme regions of very large and very small weights are easily interpreted. For |w_i| >> w_0, the cost of a weight approaches unity (times λ). This justifies the interpretation of the complexity term as a counter of significantly sized weights. For |w_i| << w_0, the cost is close to zero. \"Large\" and \"small\" are defined with respect to the scale w_0, a free parameter of the weight-elimination procedure that has to be chosen. \n\n1The original formulation benefited from conversations with Paul Smolensky. Variations and alternatives have been developed by Hinton, Hanson and Pratt, Mozer and Smolensky, le Cun, Denker and Solla, Ji, Snapp and Psaltis, and others. They are discussed in Weigend [Wei91]. 
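As a sketch of how the complexity term of Equation (1) enters training (our own illustration, not the authors' code; the function names are ours), the penalty and the gradient it contributes to back propagation can be written as:

```python
import numpy as np

def complexity_penalty(weights, w0=1.0):
    # Second term of Equation (1), without the factor lambda:
    # sum over connections of (w_i/w0)^2 / (1 + (w_i/w0)^2).
    # Each term tends to 1 for |w_i| >> w0 and to 0 for |w_i| << w0,
    # so the sum roughly counts the significantly sized weights.
    r2 = (np.asarray(weights) / w0) ** 2
    return float(np.sum(r2 / (1.0 + r2)))

def penalty_gradient(weights, w0=1.0):
    # Derivative of one penalty term: (2 w_i / w0^2) / (1 + (w_i/w0)^2)^2.
    # During learning, lambda times this is added to the error gradient.
    w = np.asarray(weights)
    r2 = (w / w0) ** 2
    return (2.0 * w / w0 ** 2) / (1.0 + r2) ** 2

# A large weight costs about 1, a small one almost nothing:
print(complexity_penalty([5.0, 0.01]))  # ≈ 0.96
```

Note that the penalty saturates for large weights, so, unlike plain weight-decay, it exerts almost no shrinking force on weights well above w_0.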
\n\nFigure 1: (a) Prior probability distribution for a weight, for several values of λ. (b) Corresponding cost. (c) Cost for different values of S/w_0 as a function of α = w_1/S, where S = w_1 + w_2. \n\nTo clarify the meaning of w_0, let us consider a unit which is connected, redundantly, by two weights (w_1 and w_2) to the same signal source. Is it cheaper to have two smaller weights or just one large weight? Interestingly, as shown in Figure 1(c), the answer depends on the ratio S/w_0, where S = w_1 + w_2 is the relevant sum for the receiving unit. For values of S/w_0 up to about 1.1, there is only one minimum, at α = w_1/S = 0.5, i.e., both weights are present and equal. When S/w_0 increases, this symmetry gets broken; it is cheaper to set one weight close to S and eliminate the other one. \n\nWeight-decay, proposed by Hinton and by le Cun in 1987, is contained in our method of weight-elimination as the special case of large w_0. In the statistics community, this limit (cost proportional to w_i^2) is known as ridge regression. The scale parameter w_0 thus allows us to express a preference for fewer large weights (w_0 small) or many small weights (w_0 large). In our experience, choosing w_0 of order unity is good for activations of order unity. 
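The symmetry breaking described above can be checked numerically. A minimal sketch (ours; the grid search over the split α = w_1/S is purely illustrative):

```python
import numpy as np

def penalty(w, w0=1.0):
    # Complexity cost of a single weight, in units of lambda.
    return (w / w0) ** 2 / (1.0 + (w / w0) ** 2)

def best_split(S, w0=1.0):
    # For two redundant weights with w1 + w2 = S, find the split
    # alpha = w1/S that minimizes the total complexity cost.
    alphas = np.linspace(0.0, 1.0, 1001)
    cost = penalty(alphas * S, w0) + penalty((1.0 - alphas) * S, w0)
    return float(alphas[np.argmin(cost)])

# Below S/w0 of roughly 1.1 the symmetric split alpha = 0.5 is cheapest;
# well above it, one weight takes almost all of S.
print(best_split(1.0))  # 0.5
print(best_split(2.0))  # far from 0.5 (close to 0 or, symmetrically, to 1)
```

The crossover illustrates why weight-elimination prefers concentrating a given total connection strength in a single weight once that strength exceeds the scale w_0.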
\n\n2.2 INTERPRETATION AS PRIOR PROBABILITY \n\nFurther insight can be gained by viewing the cost as the negative log likelihood of the network, given the data. In this framework2, the error term is the negative logarithm of the probability of the data given the net, and the complexity term is the negative logarithm of the prior probability of the weights. \n\n2This perspective is expanded in a forthcoming paper by Rumelhart et al. [RDGC92]. \n\nThe cost function corresponds approximately to the assumption that the weights come from a mixture of two distributions. Relevant weights are drawn from a uniform distribution (to allow for normalization of the probability, up to a certain maximum size). Weights that are merely the result of \"noise\" are drawn from a Gaussian-like distribution centered on zero; they are expected to be small. We show the prior probability for our complexity term for several values of λ in Figure 1(a). If we wish to approximate the bump around zero by a Gaussian, its variance is given by sigma^2 = w_0^2/λ. Its width scales with w_0. \n\nPerhaps surprisingly, the innocent weighting factor λ now influences the width: the variance of the \"noise\" is inversely proportional to λ. The larger λ is, the closer to zero a weight must be to have a reasonable probability of being a member of the \"noise\" distribution. Also, the larger λ is, the more \"pressure\" small weights feel to become even smaller. \n\nThe following technical section describes how λ is dynamically adjusted in training. From the perspective taken in Section 2.1, the usual increase of λ during training corresponds to attaching more importance to the complexity term. From the perspective developed in this section, it corresponds to sharpening the peak of the weight distribution around zero. 
\n\n2.3 DETAILS \n\nAlthough the basic form of the weight-elimination procedure is simple, it is sensitive to the choice of λ.3 If λ is too small, it will have no effect. If λ is too large, all of the weights will be driven to zero. Worse, a value of λ which is useful for a problem that is easily learned may be too large for a hard problem, and a problem which is difficult in one region (at the start, for example) may require a larger value of λ later on. We have developed some rules that make the performance relatively insensitive to the exact values of the parameters. \n\nWe start with λ = 0 so that the network can initially use all of its resources. λ is changed after each epoch. It is usually gently incremented, sometimes decremented, and, in emergencies, cut down. The choice among these three actions depends on the value of the error on the training set, E_n. The subscript n denotes the number of the epoch that has just finished. (Note that E_n is only the first term of the cost function, Equation (1). Since gradient descent minimizes the sum of both terms, E_n by itself can decrease or increase.) E_n is compared to three quantities, the first two derived from previous values of that error itself, the last one given externally: \n\n\u2022 E_{n-1}, the previous error. \n\u2022 A_n, the average error, exponentially weighted over the past. It is defined as A_n = γ A_{n-1} + (1 - γ) E_n, with γ relatively close to 1. \n\u2022 D, the desired error, the externally provided performance criterion. The strategy for choosing D depends on the specific problem. For example, \"solutions\" with an error larger than D might not be acceptable. Or, we may have observed (by monitoring the out-of-sample performance during training) that overfitting starts when a certain in-sample error is reached. Or, we may have some other estimate of the amount of noise in the training data. 
\n\nFor toy problems, derived from approximating analytically defined functions (where perfect performance on the training data can be expected), a good choice is D = 0. For hard problems, such as the prediction of currency exchange rates, D is set just below the error that corresponds to chance performance, since overfitting would occur if the error were reduced further. \n\n3The reason that λ appears at all is that weight-elimination only deals with a part of the complete network complexity, and this only approximately. In a theory rigidly derived from the minimum description length principle, no such parameter would appear. \n\nAfter each epoch in training, we evaluate whether E_n is above or below each of these quantities. This gives eight possibilities. Three actions are possible: \n\n\u2022 λ ← λ + Δλ. In six cases, we increment λ slightly. These are the situations in which things are going well: the error is already below the criterion (E_n < D) and/or is still falling (E_n < E_{n-1}). Incrementing λ means attaching more importance to the complexity term and making the Gaussian a little sharper. Note that the primary parameter is actually Δλ. Its size is fairly small, of order 10^-6. In the remaining two cases, the error is worse than the criterion and has grown compared to just before (E_n ≥ E_{n-1}). The action then depends on its relation to its long term average A_n. \n\u2022 λ ← λ - Δλ [if E_n ≥ E_{n-1} and E_n < A_n and E_n ≥ D]. In the less severe of those two cases, the performance is still improving with respect to the long term average (E_n < A_n). Since the error can have grown only slightly, we reduce λ slightly. \n\u2022 λ ← 0.9 λ [if E_n ≥ E_{n-1} and E_n ≥ A_n and E_n ≥ D]. In this last case, the error has increased and exceeds its long term average. 
This can happen for two reasons. The error might have grown a lot in the last iteration. Or, it might not have improved by much in the whole period covered by the long term average, i.e., the network might be trapped somewhere before reaching the performance criterion. The value of λ is cut, hopefully preventing weight-elimination from devouring the whole net. \n\nWe have found that this set of heuristics for finding a minimal network while achieving a desired level of performance on the training data works rather well on a wide range of tasks. We give two examples of applications of weight-elimination. In the second example we show how λ changes during training. \n\n3 APPLICATION TO TIME SERIES PREDICTION \n\nA central problem in science is predicting the future of temporal sequences; examples range from forecasting the weather to anticipating currency exchange rates. The desire to know the future is often the driving force behind the search for laws in science. The ability to forecast the behavior of a system hinges on two types of knowledge. The first and most powerful one is the knowledge of the laws underlying a given phenomenon. When expressed in the form of equations, the future outcome of an experiment can be predicted. The second, albeit less powerful, type of knowledge relies on the discovery of empirical regularities without resorting to knowledge of the underlying mechanism. In this case, the key problem is to determine which aspects of the data are merely idiosyncrasies and which aspects are truly indicators of the intrinsic behavior. This issue is particularly serious for real world data, which are limited in precision and sample size. We have applied nets with weight-elimination to time series of sunspots and currency exchange rates. 
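Before turning to the examples, the per-epoch λ-adjustment rules of Section 2.3 can be condensed into a short sketch (ours; E_n, E_prev, A_n, and D follow the notation of the text):

```python
def update_lambda(lam, E_n, E_prev, A_n, D, d_lam=2.5e-6):
    # One lambda update after an epoch with training error E_n.
    # E_prev: error of the previous epoch; A_n: exponentially weighted
    # average error; D: externally chosen desired error.
    if E_n < D or E_n < E_prev:
        return lam + d_lam    # things are going well: increment slightly
    if E_n < A_n:
        return lam - d_lam    # error grew, but is still below its average
    return 0.9 * lam          # error grew and exceeds its average: cut lambda

# The average itself is tracked as A_n = gamma * A_prev + (1 - gamma) * E_n,
# with gamma relatively close to 1.
```

The first branch covers the six benign cases, the other two the remaining cases in which the error is above the criterion and has grown.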
\n\n3.1 SUNSPOT SERIES 4 \n\nWhen applied to predict the famous yearly sunspot averages, weight-elimination reduces the number of hidden units to three. Just having a small net, however, is not the ultimate goal: predictive power is what counts. The net has one half the out-of-sample error (on iterated single step predictions) of the benchmark model by Tong [Ton90]. \n\nWhat happens when we enlarge the input size from twelve, the optimal size for the benchmark model, to four times that size? As shown in [WRH90], the performance does not deteriorate (as might have been expected from a less dense distribution of data points in higher dimensional spaces). Instead, the net manages to ignore irrelevant information. \n\n4We here only briefly summarize our results on sunspots. Details have been published in [WHR90] and [WRH90]. \n\n3.2 CURRENCY EXCHANGE RATES 5 \n\nWe use daily exchange rates (or prices with respect to the US Dollar) for five currencies (German Mark (DM), Japanese Yen, Swiss Franc, Pound Sterling and Canadian Dollar) to predict the returns at day t, defined as \n\nr_t = ln(p_t / p_{t-1}) = ln(1 + (p_t - p_{t-1}) / p_{t-1}) ≈ (p_t - p_{t-1}) / p_{t-1} . (2) \n\nFor small changes, the return is the difference to the previous day normalized by the price p_{t-1}. Since different currencies and different days of the week may have different dynamics, we pick one day (Monday) and one currency (DM). We define the task to be to learn Monday DM dynamics: given exchange rate information through a Monday, predict the DM - US$ rate for the following day. \n\nThe net has 45 inputs for past daily DM returns, 5 inputs for the present Monday's returns of all available currencies, and 11 inputs for additional information (trends and volatilities), solely derived from the original exchange rates. 
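As a sketch (ours; the sample prices are made up), the returns of Equation (2), together with the trend and volatility quantities defined next in the text, can be computed as:

```python
import numpy as np

def returns(prices):
    # Equation (2): r_t = ln(p_t / p_{t-1}), which for small moves is
    # approximately the relative change (p_t - p_{t-1}) / p_{t-1}.
    p = np.asarray(prices, dtype=float)
    return np.log(p[1:] / p[:-1])

def trend(r, k):
    # k-day trend at the last day: mean of the last k returns.
    return float(np.mean(r[-k:]))

def volatility(r, k):
    # k-day volatility: standard deviation of the last k returns.
    return float(np.std(r[-k:]))

r = returns([1.70, 1.71, 1.69, 1.70, 1.72])  # hypothetical DM/US$ prices
print(trend(r, 3), volatility(r, 3))
```

Since these quantities are derived solely from the raw exchange rates, they add no outside information; they only re-present the series at coarser time scales.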
\n\nThe k day trend at day t is the mean of the returns of the k last days, (1/k) sum_{s=t-k+1}^{t} r_s. Similarly, the k day volatility is defined to be the standard deviation of the returns of the k last days. \n\nThe inputs are fully connected to the 5 sigmoidal hidden units with range (-1, 1). The hidden units are fully connected to two output units. The first one is to predict the next day return, r_{t+1}. This is a linear unit, trained with quadratic error. The second output unit focuses on the sign of the change. Its target value is one when the price goes up and zero otherwise. Since we want the unit to predict the probability that the return is positive, we choose a sigmoidal unit with range (0, 1) and minimize cross entropy error. \n\nThe central question is whether the net is able to extract any signal from the training set that generalizes to the test sets. The performance is given as a function of training time in epochs in Figure 2.6 \n\nThe result is that the out-of-sample prediction is significantly better than chance. Weight-elimination reliably extracts a signal that accounts for between 2.5 and 4.0 per cent of the variance, corresponding to a correlation coefficient of 0.21 \u00b1 0.03 for both test sets. In contrast, nets without precautions against overfitting show hopeless out-of-sample performance almost before the training has started. Also, none of the control experiments (randomized series and time-reversed series) reaches any significant predictability. \n\nThe dynamics of weight-elimination, discussed in Section 2.3, is also shown in Figure 2. λ first grows very slowly. Then, around epoch 230, the error reaches the performance criterion.7 The network starts to focus on the elimination of weights (indicated by growing λ) without further reducing its in-sample errors (solid lines), since that would probably correspond to overfitting. \n\nFigure 2: Learning curves of currency exchange rates for training with weight-elimination (left) and training with added noise (right). In-sample predictions are shown as solid lines, out-of-sample predictions in grey and dashed. Top: average relative variance of the unit predicting the return (r-unit). Center: root-mean-square error of the unit predicting the sign (s-unit). Bottom: weighting λ of the complexity term. (Training set 5/75-12/84, 501 points; early test set 9/73-4/75, 87 points; late test set 12/84-5/87, 128 points.) \n\nWe also compare training with weight-elimination with a method intended to make the parameters more robust. We add noise to the inputs, independently to each input unit, different at each presentation of each pattern.8 This can be viewed as artificially enlarging the training set by smearing the data points around their centers. Smoother boundaries of the \"basins of attraction\" are the result. Viewed from the description length angle, it means saving bits by specifying the (input) weights with less precision, as opposed to eliminating some of them. The corresponding learning curves are shown on the right hand side of Figure 2. This simple method also successfully avoids overfitting. \n\n5We thank Blake LeBaron for sending us the data. \n\n6The error of the unit predicting the return is expressed as the average relative variance \n\narv(S) = [sum_{k in S} (target_k - prediction_k)^2] / [sum_{k in S} (target_k - mean_S)^2] = (1/sigma_S^2) (1/N_S) sum_{k in S} (r_k - r_hat_k)^2 . (3) \n\nThe averaging (division by N_S, the number of observations in set S) makes the measure independent of the size of the set. The normalization (division by sigma_S^2, the estimated variance of the data in S) removes the dependence on the dynamic range of the data. Since the mean of the returns is close to zero, the random walk hypothesis corresponds to arv = 1.0. \n\n7Guided by cross-validation, we set the criterion (for the sum of the squared errors from both outputs) to 650. With this value, the choice of the other parameters is not critical, as long as they are fairly small. We used a learning rate of 2.5 x 10^-4, no momentum, and an increment Δλ of 2.5 x 10^-6. If the criterion were set to zero, the balance between error and complexity would be fragile in such a hard problem. \n\n8We add Gaussian noise with a rather large standard deviation of 1.5 times the signal. The exact value is not crucial: similar performance is obtained for noise levels between 0.7 and 2.0. \n\nFinally, we analyze the weight-eliminated network solution. The weights from the hidden units to the outputs are in a region where the complexity term acts as a counter. 
In fact, only one or two hidden units remain. The weights from the inputs to the dead hidden units are also eliminated. For time series prediction, weight-elimination acts as hidden-unit elimination. \n\nThe weights between inputs and remaining hidden units are fairly small. Weight-elimination is in its quadratic region and prevents them from growing too large. Consequently, the activation of the hidden units lies in (-0.4, 0.4). This prompted us to try a linear net, where our procedure also works surprisingly well, yielding performance comparable to sigmoids. \n\nSince all inputs are scaled to zero mean and unit standard deviation, we can gauge the importance of different inputs directly by the size of the weights. With weight-elimination, it becomes fairly clear which quantities are important, since connections that do not manage to reduce the error are not worth their price. A detailed description will be published in [WHR91]. Weight-elimination enhances the interpretability of the solution. \n\nTo summarize, we have a working procedure that finds small nets and can help prevent overfitting. With our rules for the dynamics of λ, weight-elimination is fairly stable with respect to the values of most parameters. In the examples we analyzed, the network manages to pick out some significant part of the dynamics underlying the time series. \n\nReferences \n\n[Che90] Peter C. Cheeseman. On finding the most probable model. In J. Shrager and P. Langley (eds.), Computational Models of Scientific Discovery and Theory Formation, p. 73. Morgan Kaufmann, 1990. \n\n[RDGC92] David E. Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin. Backpropagation: theoretical foundations. In Y. Chauvin and D. E. Rumelhart (eds.), Backpropagation and Connectionist Theory. Lawrence Erlbaum, 1992. \n\n[Ris89] Jorma Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989. 
\n\n[Ton90] Howell Tong. Non-linear Time Series: a Dynamical System Approach. Oxford University Press, 1990. \n\n[Wei91] Andreas S. Weigend. Connectionist Architectures for Time Series Prediction. PhD thesis, Stanford University, 1991 (in preparation). \n\n[WHR90] Andreas S. Weigend, Bernardo A. Huberman, and David E. Rumelhart. Predicting the future: a connectionist approach. International Journal of Neural Systems, 1:193, 1990. \n\n[WHR91] Andreas S. Weigend, Bernardo A. Huberman, and David E. Rumelhart. Predicting sunspots and currency rates with connectionist networks. In M. Casdagli and S. Eubank (eds.), Proceedings of the 1990 NATO Workshop on Nonlinear Modeling and Forecasting (Santa Fe). Addison-Wesley, 1991. \n\n[WRH90] Andreas S. Weigend, David E. Rumelhart, and Bernardo A. Huberman. Backpropagation, weight-elimination and time series prediction. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Proceedings of the 1990 Connectionist Models Summer School, p. 105. Morgan Kaufmann, 1990. ", "award": [], "sourceid": 323, "authors": [{"given_name": "Andreas", "family_name": "Weigend", "institution": null}, {"given_name": "David", "family_name": "Rumelhart", "institution": null}, {"given_name": "Bernardo", "family_name": "Huberman", "institution": null}]}