{"title": "Robust Learning of Chaotic Attractors", "book": "Advances in Neural Information Processing Systems", "page_first": 879, "page_last": 885, "abstract": null, "full_text": "Robust Learning of Chaotic Attractors \n\nRembrandt Bakker* \nChemical Reactor Engineering \nDelft Univ. of Technology \nr.bakker@stm.tudelft.nl \n\nJaap C. Schouten \nChemical Reactor Engineering \nEindhoven Univ. of Technology \nJ.C.Schouten@tue.nl \n\nMarc-Olivier Coppens \nChemical Reactor Engineering \nDelft Univ. of Technology \ncoppens@stm.tudelft.nl \n\nFloris Takens \nDept. Mathematics \nUniversity of Groningen \nF.Takens@math.rug.nl \n\nC. Lee Giles \nNEC Research Institute \nPrinceton NJ \ngiles@research.nj.nec.com \n\nCor M. van den Bleek \nChemical Reactor Engineering \nDelft Univ. of Technology \nvdbleek@stm.tudelft.nl \n\nAbstract \n\nA fundamental problem with the modeling of chaotic time series data is that \nminimizing short-term prediction errors does not guarantee a match \nbetween the reconstructed attractors of model and experiments. We \nintroduce a modeling paradigm that simultaneously learns to make short-term \npredictions and to locate the outlines of the attractor by a new way of nonlinear \nprincipal component analysis. Closed-loop predictions are constrained to \nstay within these outlines, to prevent divergence from the attractor. Learning \nis exceptionally fast: parameter estimation for the 1000 sample laser data \nfrom the 1991 Santa Fe time series competition took less than a minute on \na 166 MHz Pentium PC. \n\n1 Introduction \n\nWe focus on the following objective: given a set of experimental data and the assumption that \nit was produced by a deterministic chaotic system, find a set of model equations that will \nproduce a time-series with identical chaotic characteristics, having the same chaotic attractor. 
\nThe common approach consists of two steps: (1) identify a model that makes accurate short-term \npredictions; and (2) generate a long time-series with the model and compare the \nnonlinear-dynamic characteristics of this time-series with the original, measured time-series. \n\nPrincipe et al. [1] found that in many cases the model can make good short-term predictions \nbut does not learn the chaotic attractor. The method would be greatly improved if we could \nminimize directly the difference between the reconstructed attractors of the model-generated \nand measured data, instead of minimizing prediction errors. However, we cannot reconstruct \nthe attractor without first having a prediction model. Until now research has focused on how \nto optimize both step 1 and step 2. For example, it is important to optimize the prediction \nhorizon of the model [2] and to reduce complexity as much as possible. This way it was \npossible to learn the attractor of the benchmark laser time series data from the 1991 Santa Fe \ntime series competition. While training a neural network for this problem, we noticed [3] that \nthe attractor of the model fluctuated from a good match to a complete mismatch from one \niteration to another. We were able to circumvent this problem by selecting exactly that model \nthat matches the attractor. However, after carrying out more simulations we found that what \nwe neglected as an unfortunate phenomenon [3] is really a fundamental limitation of current \napproaches. \n\n*DelftChemTech, Chemical Reactor Engineering Lab, Julianalaan 136, 2628 BL, Delft, The \nNetherlands; http://www.cpt.stm.tudelft.nl/cpt/cre/research/bakker/. \n\nAn important development is the work of Principe et al. [4] who use Kohonen Self Organizing \nMaps (SOMs) to create a discrete representation of the state space of the system. 
This creates \na partitioning of the input space that becomes an infrastructure for local (linear) model \nconstruction. This partitioning makes it possible to verify that the model input is near the original data (i.e., \nto detect that the model is not extrapolating) without keeping the training data set with the model. \nWe propose a different partitioning of the input space that can be used to (i) learn the outlines \nof the chaotic attractor by means of a new way of nonlinear Principal Component Analysis \n(PCA), and (ii) force the model never to predict outside these outlines. The nonlinear PCA \nalgorithm is inspired by the work of Kambhatla and Leen [5] on local PCA: they partition the \ninput space and perform local PCA in each region. Unfortunately, this introduces \ndiscontinuities between neighboring regions. We resolve them by introducing a hierarchical \npartitioning algorithm that uses fuzzy boundaries between the regions. This partitioning closely \nresembles the hierarchical mixtures of experts of Jordan and Jacobs [6]. \n\nIn Sec. 2 we put forward the fundamental problem that arises when trying to learn a chaotic \nattractor by creating a short-term prediction model. In Sec. 3 we describe the proposed \npartitioning algorithm. Sec. 4 outlines how this partitioning can be used to learn the \noutline of the attractor by defining a potential that measures the distance to the attractor. In Sec. \n5 we show modeling results on a toy example, the logistic map, and on a more serious \nproblem, the laser data from the 1991 Santa Fe time series competition. Section 6 concludes. \n\n2 The attractor learning dilemma \n\nImagine an experimental system with a chaotic attractor, and a time-series of noise-free \nmeasurements taken from this system. The data is used to fit the parameters of the model \n$z_{t+1} = F_w(z_t, z_{t-1}, \\ldots, z_{t-m})$, where $F$ is a nonlinear function, $w$ contains its adjustable parameters \nand $m$ is a positive constant. 
What happens if we fit the parameters $w$ by nonlinear least \nsquares regression? Will the model be stable, i.e., will the closed-loop long term prediction \nconverge to the same attractor as the one represented by the measurements? \n\nFigure 1 shows the result of a test by Diks et al. [7] that measures the difference between the \nmodel and measured attractors. The figure shows that while the neural network is trained to \npredict chaotic data, the model quickly converges to the measured attractor ($S = 0$), but once in a while, from one \niteration to another, the match between the attractors is lost. \n\nFigure 1: Diks test monitoring curve for a neural network model trained on data from an \nexperimental chaotic pendulum [3]. Horizontal axis: training progress (iterations). \n\nTo understand what causes this instability, imagine that we try to fit the \nparameters of a model $z_{t+1} = \\vec{a} + B z_t$ while the real system has a point \nattractor, $z = a$, where $z$ is the state of the system and $a$ its attracting value. \nClearly, measurements taken from this system contain no information to estimate both $\\vec{a}$ and $B$. 
If we fit the model parameters with non-robust linear least squares, $B$ \nmay be assigned any value and if its largest eigenvalue happens to be greater than zero, the \nmodel will be unstable! \n\nFor the linear model this problem was solved long ago with the introduction of \nsingular value decomposition. There still is a need for a nonlinear counterpart of this technique, \nin particular since we have to work with very flexible models that are designed to fit a wide \nvariety of nonlinear shapes, see for example the early work of Lapedes and Farber [8]. It is \nalready common practice to control the complexity of nonlinear models by pruning or \nregularization. Unfortunately, these methods do not always solve the attractor learning \nproblem, since there is a good chance that a nonlinear term explains a lot of variance in one \npart of the state space, while it causes instability of the attractor (without affecting the one-step-ahead \nprediction accuracy) elsewhere. In Secs. 3 and 4 we will introduce a new method for \nnonlinear principal component analysis that will detect and prevent unstable behavior. \n\n3. The split and fit algorithm \n\nThe nonlinear regression procedure of this section will form the basis of the nonlinear principal \ncomponent algorithm in Sec. 4. It consists of (i) a partitioning of the input space, (ii) a local \nlinear model for each region, and (iii) fuzzy boundaries between regions to ensure global \nsmoothness. The partitioning scheme is outlined in Procedure 1: \n\nProcedure 1: Partitioning the input space \n\n1) Start with the entire set Z of input data \n2) Determine the direction of largest variance of Z: perform a singular value \ndecomposition of Z into the product $U \\Sigma V^T$ and take the eigenvector (column \nof $V$) with the largest singular value (on the diagonal of $\\Sigma$). 
\n3) Split the data in two subsets (to be called: clusters) by creating a plane \nperpendicular to the direction of largest variance, through the center of \ngravity of Z. \n4) Select the cluster with the largest sum squared error to be split next, \nand recursively apply 2-4 until a stopping criterion is met. \n\nFigures 2 and 3 show examples of the partitioning. The disadvantage of dividing regression \nproblems into localized subproblems was pointed out by Jordan and Jacobs [6]: the spread of \nthe data in each region will be much smaller than the spread of the data as a whole, and this \nwill increase the variance of the model parameters. Since we always split perpendicular to the \ndirection of maximum variance, this problem is minimized. \n\nThe partitioning can be written as a binary tree, with each non-terminal node being a split and \neach terminal node a cluster. Procedure 2 creates fuzzy boundaries between the clusters. \n\nProcedure 2. Creating fuzzy boundaries \n\n1) An input $z$ enters the tree at the top of the partitioning tree. \n2) The Euclidean distance to the splitting hyperplane is divided by the \nbandwidth $\\beta$ of the split, and passed through a sigmoidal function with range \n[0,1]. This results in $z$'s share $\\sigma$ in the subset on $z$'s side of the splitting \nplane. The share in the other subset is $1 - \\sigma$. \n3) The previous step is carried out for all non-terminal nodes of the tree. \n4) The membership $\\mu^c$ of $z$ to subset (terminal node) $c$ is computed by \ntaking the product of all previously computed shares $\\sigma$ along the path from \nthe terminal node to the top of the tree. 
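The membership computation of Procedure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names share and memberships and the dictionary-based tree are hypothetical, the logistic function is assumed as the sigmoid, a signed distance is used so that the positive side of a plane receives sigma and the other side receives 1 - sigma, and every bandwidth is a fixed number rather than the input-dependent value of Procedure 3.

```python
import math

def share(z, normal, center, beta):
    # Signed distance of z to the hyperplane (unit normal assumed), scaled by
    # the bandwidth beta and squashed to (0, 1) by a logistic sigmoid.
    d = sum((zi - ci) * ni for zi, ci, ni in zip(z, center, normal))
    return 1.0 / (1.0 + math.exp(-d / beta))

def memberships(node, z, weight=1.0, out=None):
    # Walk the tree from the top; the membership of a terminal node is the
    # product of the shares collected along the path leading to it.
    if out is None:
        out = {}
    if 'cluster' in node:  # terminal node
        out[node['cluster']] = weight
        return out
    s = share(z, node['normal'], node['center'], node['beta'])
    memberships(node['pos'], z, weight * s, out)          # positive side
    memberships(node['neg'], z, weight * (1.0 - s), out)  # negative side
    return out

# Hypothetical two-level tree in the plane: split on the first coordinate,
# then split the positive half on the second coordinate.
tree = {
    'normal': (1.0, 0.0), 'center': (0.0, 0.0), 'beta': 1.0,
    'neg': {'cluster': 'A'},
    'pos': {
        'normal': (0.0, 1.0), 'center': (1.0, 0.0), 'beta': 0.5,
        'neg': {'cluster': 'B'},
        'pos': {'cluster': 'C'},
    },
}
mu = memberships(tree, (2.0, 1.0))
```

For the input (2.0, 1.0) most of the membership goes to cluster C, and the memberships of all terminal nodes sum to one because each split divides its incoming weight into sigma and 1 - sigma.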
\n\nIf we made all parameters adjustable, that is (i) the orientation of the splitting \nhyperplanes, (ii) the bandwidths $\\beta$, and (iii) the local linear model parameters, the above model \nstructure would be identical to the hierarchical mixtures of experts of Jordan and Jacobs [6]. \nHowever, we already fixed the hyperplanes and use Procedure 3 to compute the bandwidths: \n\nProcedure 3. Computing the Bandwidths \n\n1) The bandwidths of the terminal nodes are taken to be a constant (we use 1.65, \nthe 90% confidence limit of a normal distribution) times the variance of the \nsubset before it was last split, in the direction of the eigenvector of that last split. \n2) The other bandwidths do depend on the input $z$. They are computed by \nclimbing upward in the tree. The bandwidth of node $n$ is computed as a \nweighted sum of the $\\beta$s of its right and left child, by the implicit formula \n$\\beta_n = \\sigma_L \\beta_L + \\sigma_R \\beta_R$, in which $\\sigma_L$ and $\\sigma_R$ depend on $\\beta_n$. Starting from the initial guess \n$\\beta_n = \\beta_L$ if $\\sigma_L > 0.5$, or else $\\beta_n = \\beta_R$, the formula is solved in a few iterations. \n\nThis procedure is designed to create large overlap between neighboring regions and almost no \noverlap between non-neighboring regions. What remains to be fitted is the set of the local \nlinear models. The j-th output of the split&fit model for a given input $z_p$ is computed as \n\n$$\\hat{y}_{j,p} = \\sum_{c=1}^{C} \\mu_p^c \\left\\{ \\vec{a}_j^c z_p + b_j^c \\right\\},$$ \n\nwhere $\\vec{a}^c$ and $b^c$ contain the linear model parameters of subset $c$, $\\mu_p^c$ is the membership of $z_p$ to subset $c$, \nand $C$ is the number of clusters. We can determine the parameters of all local linear models in \none global fit that is linear in the parameters. However, we prefer to locally optimize the \nparameters for two reasons: (i) it makes it possible to locally control the stability of the \nattractor and do the principal component analysis of Sec. 4; and (ii) the computing time for a \nlinear regression problem with $r$ regressors scales as $\\sim O(r^3)$. 
If we would adopt global fitting, $r$ \nwould scale linearly with $C$ and, while growing the model, the regression problem would \nquickly become intractable. We use the following iterative local fitting procedure instead. \n\nProcedure 4. Iterative Local Fitting \n\n1) Initialize a $J$ by $N$ matrix of residuals $R$ to zero, $J$ being the number of \noutputs and $N$ the number of data. \n2) For cluster $c$, if an estimate for its linear model parameters already exists, \nfor each input vector $z_p$ add $\\mu_p^c \\hat{y}_{j,p}^c$ (the membership-weighted current prediction of cluster $c$) to the matrix of residuals, otherwise add \n$\\mu_p^c y_{j,p}$ to $R$, $y_{j,p}$ being the j-th element of the desired output vector for sample \n$p$. \n3) Least squares fit the linear model parameters of cluster $c$ to predict the \ncurrent residuals $R$, and subtract the (new) estimate, $\\mu_p^c \\hat{y}_{j,p}^c$, from $R$. \n4) Do 2-4 for each cluster and repeat the fitting several times (default: 3). \n\nFrom simulations we found that the above fast optimization method converges to the global \nminimum if it is repeated many times. Just as with neural network training, it is often better to \nuse early stopping when the prediction error on an independent test set starts to increase. \n\n4. Nonlinear Principal Component Analysis \n\nTo learn a chaotic attractor from a single experimental time-series we use the method of delays: \nthe state $z$ consists of $m$ delays taken from the time series. The embedding dimension $m$ must \nbe chosen large enough to ensure that it contains sufficient information for faithful \nreconstruction of the chaotic attractor, see Takens [9]. Typically, this results in an $m$-dimensional \nstate space with all the measurements covering only a much lower dimensional, but \nnon-linearly shaped, subspace. This creates the danger pointed out in Sec. 2: the stability of the \nmodel in directions perpendicular to this low dimensional subspace cannot be guaranteed. \n\nWith the split & fit algorithm from Sec. 
3 we can learn the non-linear shape of the low \ndimensional subspace, and, if the state of the system escapes from this subspace, we use the \nalgorithm to redirect the state to the nearest point on the subspace. See Malthouse [10] for \nlimitations of existing nonlinear PCA approaches. To obtain the low dimensional subspace, we \nproceed according to Procedure 5. \n\nProcedure 5. Learning the Low-dimensional Subspace \n\n1) Augment the output of the model with the $m$-dimensional state $z$: the \nmodel will learn to predict its own input. \n2) In each cluster $c$, perform a singular value decomposition to create a set of \n$m$ principal directions, sorted in order of decreasing explained variance. The \nresult of this decomposition is also used in step 3 of Procedure 4. \n3) Allow the local linear model of each cluster to use no more than $m_{red}$ of \nthese principal directions. \n4) Define a potential $P$ to be the squared Euclidean distance between the \nstate $z$ and its prediction by the model. \n\nThe potential $P$ implicitly defines the lower dimensional subspace: if a state $z$ is on the subspace, $P$ will be zero. $P$ will increase with the distance of $z$ from the subspace. The model has learned to predict its own input with small error, meaning that it has tried to reduce $P$ as much as possible at exactly those points in state space where the training data was sampled. In other words, $P$ will be low if the input $z$ is close to one of the original points in the training data set. From the split&fit algorithm we can analytically compute the gradient $dP/dz$. Since the evaluation of the split&fit model involves a backward pass (computing the bandwidths) and a forward pass (computing memberships), the gradient algorithm involves a forward and backward pass through the tree. The gradient is used to project states that are off the nonlinear subspace onto the subspace in one or a few Newton-Raphson iterations. Figure 2 illustrates the algorithm for the problem of creating a one-dimensional representation of the number '8'. The training set consists of 136 clean samples, and Fig. 2 shows how a set of 272 noisy inputs is projected by a 48 subset split&fit model onto the one-dimensional subspace. Note that the center of the '8' cannot be well represented by a one-dimensional space. We leave development of an algorithm that automatically detects the optimum local subspace dimension for future research. \n\nFigure 2. Projecting two-dimensional data on a one-dimensional self-intersecting subspace. The colorscale represents the potential $P$, white indicates $P > 0.04$. \n\n5. Application Examples \n\nFirst we show the nonlinear principal component analysis result for a toy example, the logistic map $z_{t+1} = 4 z_t (1 - z_t)$. If we use a model $z_{t+1} = F_w(z_t)$, where the prediction only depends on one previous output, there is no lower dimensional space to which the attractor is confined. However, if we allow the output to depend on more than a single delay, we create a possibility for unstable behavior. Figure 3 shows how well the split&fit algorithm learns the one-dimensional shape of the attractor after creating only five regions. The parabola is slightly deformed (seen from the white lines perpendicular to the attractor), but this may be solved by increasing the number of splits. \n\nFigure 3. Learning the attractor of the two-input logistic map. The order of creation of the splits is indicated. The colorscale represents the potential $P$, white indicates $P > 0.05$. \n\nNext we look at the laser data. 
The complex behavior of chaotic systems is caused by an \ninterplay of destabilizing and stabilizing forces: the destabilizing forces make nearby points in \nstate space diverge, while the stabilizing forces keep the state of the system bounded. This \nprocess, known as 'stretching and folding', results in the attractor of the system: the set of \npoints that the state of the system will visit after all transients have died out. In the case of the \nlaser data this behavior is clear cut: destabilizing forces make the signal grow exponentially \nuntil the increasing amplitude triggers a collapse that reinitiates the sequence. We have seen \nin neural network based models [3] and in this study that it is very hard for the models to cope \nwith the sudden collapses. \n\nFigure 4. Laser data from the Santa Fe time series competition. The 1000 sample train \ndata set is followed by iterated prediction of the model (a). After every prediction a \ncorrection is made to keep $P$ (see Sec. 4) small. Plot (b) shows $P$ before this correction. \n\nWithout the nonlinear subspace correction of Sec. 4, most of the \nmodels we tested grow without bounds after one or more rise and collapse sequences. That is \nnot very surprising: the training data set contains only three examples of a collapse. Figure 4 \nshows how this is solved with the subspace correction: every time the model is about to grow \nto infinity, a high potential $P$ is detected (depicted in Fig. 4b) and the state of the system is \ndirected to the nearest point on the subspace as learned from the nonlinear principal component \nanalysis. After some trial and error, we selected an embedding dimension $m$ of 12 and a \nreduced dimension $m_{red}$ of 4. The split&fit model starts with the whole data set as a single subset and was grown \nto 48 subsets. 
At that point, the error on the 1000 sample train set was still decreasing \nrapidly but the error on an independent 1000 sample test set increased. We compared the \nreconstructed attractors of the model and measurements, using 9000 samples of closed-loop \ngenerated and 9000 samples of measured data. No significant difference between the two \ncould be detected by the Diks test [7]. \n\n6. Conclusions \n\nWe present an algorithm that robustly models chaotic attractors. It simultaneously learns (1) \nto make accurate short term predictions; and (2) the outlines of the attractor. In closed-loop \nprediction mode, the state of the system is corrected after every prediction, to stay within these \noutlines. The algorithm is very fast, since the main computation is to least squares fit a set of \nlocal linear models. In our implementation the largest matrix to be stored is N by C, N being \nthe number of data and C the number of clusters. We see many applications other than attractor \nlearning: the split&fit algorithm can be used as a fast learning alternative to neural networks \nand the new form of nonlinear PCA will be useful for data reduction and object recognition. \nWe envisage applying the technique to a wide range of applications, from the control and \nmodeling of chaos in fluid dynamics to problems in finance and biology. \n\nAcknowledgements \n\nThis work is supported by the Netherlands Foundation for Chemical Research (SON) with financial \naid from the Netherlands Organization for Scientific Research (NWO). \n\nReferences \n\n[1] J.C. Principe, A. Rathie, and J.M. Kuo. \"Prediction of Chaotic Time Series with Neural Networks \nand the Issue of Dynamic Modeling\". Int. J. Bifurcation and Chaos. 2, 1992. p. 989. \n[2] J.M. Kuo and J.C. Principe. \"Reconstructed Dynamics and Chaotic Signal Modeling\". In Proc. \nIEEE Int'l Conf. Neural Networks, 5, 1994, p. 3131. \n[3] R. Bakker, J.C. Schouten, C.L. Giles, F. Takens, C.M. 
van den Bleek, \"Learning Chaotic \nAttractors by Neural Networks\", submitted. \n[4] J.C. Principe, L. Wang, M.A. Motter, \"Local Dynamic Modeling with Self-Organizing Maps and \nApplications to Nonlinear System Identification and Control\". Proc. IEEE. 86(11). 1998. \n[5] N. Kambhatla, T.K. Leen. \"Dimension Reduction by Local PCA\". Neural Computation. 9, 1997. \np. 1493. \n[6] M.I. Jordan, R.A. Jacobs. \"Hierarchical Mixtures of Experts and the EM Algorithm\". Neural \nComputation. 6. 1994. p. 181. \n[7] C. Diks, W.R. van Zwet, F. Takens, and J. de Goede, \"Detecting differences between delay vector \ndistributions\", Physical Review E. 53, 1996. p. 2169. \n[8] A. Lapedes, R. Farber. \"Nonlinear Signal Processing Using Neural Networks: Prediction and \nSystem Modelling\". Los Alamos Technical Report LA-UR-87-2662. \n[9] F. Takens, \"Detecting strange attractors in turbulence\", Lecture Notes in Mathematics, \n898, 1981, p. 365. \n[10] E.C. Malthouse. \"Limitations of Nonlinear PCA as performed with Generic Neural Networks\". \nIEEE Trans. Neural Networks. 9(1). 1998. p. 165. \n", "award": [], "sourceid": 1642, "authors": [{"given_name": "Rembrandt", "family_name": "Bakker", "institution": null}, {"given_name": "Jaap", "family_name": "Schouten", "institution": null}, {"given_name": "Marc-Olivier", "family_name": "Coppens", "institution": null}, {"given_name": "Floris", "family_name": "Takens", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}, {"given_name": "Cor", "family_name": "van den Bleek", "institution": null}]}