{"title": "Adaptive On-line Learning in Changing Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 599, "page_last": 605, "abstract": null, "full_text": "Adaptive On-line Learning in Changing Environments \n\nNoboru Murata, Klaus-Robert M\u00fcller, Andreas Ziehe \n\nGMD-First, Rudower Chaussee 5, 12489 Berlin, Germany \n\n{mura.klaus.ziehe}@first.gmd.de \n\nShun-ichi Amari \n\nLaboratory for Information Representation, RIKEN \n\nHirosawa 2-1, Wako-shi, Saitama 351-01, Japan \n\namari@zoo.riken.go.jp \n\nAbstract \n\nAn adaptive on-line algorithm extending the learning of learning idea is proposed and theoretically motivated. Relying only on gradient flow information it can be applied to learning continuous functions or distributions, even when no explicit loss function is given and the Hessian is not available. Its efficiency is demonstrated for a non-stationary blind separation task of acoustic signals. \n\n1 Introduction \n\nNeural networks provide powerful tools to capture the structure in data by learning. Often the batch learning paradigm is assumed, where the learner is given all training examples simultaneously and allowed to use them as often as desired. In large practical applications batch learning is often experienced to be rather infeasible and instead on-line learning is employed. \nIn the on-line learning scenario only one example is given at a time and then discarded after learning. So it is less memory consuming and at the same time it fits well into more natural learning, where the learner receives new information and should adapt to it, without having a large memory for storing old data. On-line learning has been analyzed extensively within the framework of statistics (Robbins & Monro [1951], Amari [1967] and others) and statistical mechanics (see e.g. Saad & Solla [1995]). 
It was shown that on-line learning is asymptotically as effective as batch learning (cf. Robbins & Monro [1951]). However this only holds if the appropriate learning rate \u03b7 is chosen. A too large \u03b7 spoils the convergence of learning. In earlier work on dichotomies, Sompolinsky et al. [1995] showed the effect of a constant, annealed and adaptive learning rate on the rate of convergence of the generalization error. In particular, the annealed learning rate provides an optimal convergence rate, however it cannot follow changes in the environment. Since on-line learning aims to follow the change of the rule which generated the data, Sompolinsky et al. [1995], Darken & Moody [1991] and Sutton [1992] proposed adaptive learning rates, which learn how to learn. Recently Cichocki et al. [1996] proposed an adaptive on-line learning algorithm for blind separation based on low pass filtering to stabilize learning. \nWe will extend the reasoning of Sompolinsky et al. in several points: (1) we give an adaptive learning rule for learning continuous functions (section 3) and (2) we consider the case where no explicit loss function is given and the Hessian cannot be accessed (section 4). This will help us to apply our idea to the problem of on-line blind separation in a changing environment (section 5). \n\n2 On-line Learning \n\nLet us consider an infinite sequence of independent examples (z_1, y_1), (z_2, y_2), .... The purpose of learning is to obtain a network with parameter w which can simulate the rule inherent to this data. To this end, the neural network modifies its parameter w_t at time t into w_{t+1} by using only the next example (z_{t+1}, y_{t+1}) given by the rule. We introduce a loss function l(z, y; w) to evaluate the performance of the network with parameter w. 
Let R(w) = \u27e8l(z, y; w)\u27e9 be the expected loss or the generalization error of the network having parameter w, where \u27e8 \u27e9 denotes the average over the distribution of examples (z, y). The parameter w* of the best machine is given by w* = argmin R(w). We use the following stochastic gradient descent algorithm (see Amari [1967] and Rumelhart et al. [1986]): \n\nw_{t+1} = w_t - \u03b7_t C(w_t) \u2202/\u2202w l(z_{t+1}, y_{t+1}; w_t),   (1) \n\nwhere \u03b7_t is the learning rate which may depend on t and C(w_t) is a positive-definite matrix which may depend on w_t. The matrix C plays the role of the Riemannian metric tensor of the underlying parameter space {w}. \nWhen \u03b7_t is fixed to be equal to a small constant \u03b7, E[w_t] converges to w* and Var[w_t] converges to a non-zero matrix which is of order O(\u03b7). It means that w_t fluctuates around w* (see Amari [1967], Heskes & Kappen [1991]). If \u03b7_t = c/t (annealed learning rate), w_t converges to w* locally (Sompolinsky et al. [1995]). However when the rule changes over time, an annealed learning rate cannot follow the changes fast enough since \u03b7_t = c/t is too small. \n\n3 Adaptive Learning Rate \n\nThe idea of an adaptively changing \u03b7_t was called learning of the learning rule (Sompolinsky et al. [1995]). In this section we investigate an extension of this idea to differentiable loss functions. Following their algorithm, we consider \n\nw_{t+1} = w_t - \u03b7_t K^{-1}(w_t) \u2202/\u2202w l(z_{t+1}, y_{t+1}; w_t),   (2) \n\n\u03b7_{t+1} = \u03b7_t + \u03b1 \u03b7_t ( \u03b2 ( l(z_{t+1}, y_{t+1}; w_t) - R\u0302 ) - \u03b7_t ),   (3) \n\nwhere \u03b1 and \u03b2 are constants, K(w_t) is the Hessian matrix of the expected loss function, \u2202^2 R(w_t)/\u2202w\u2202w, and R\u0302 is an estimator of R(w*). Intuitively speaking, the coefficient \u03b7 in Eq. (3) is controlled by the remaining error. When the error is large, \u03b7 takes a relatively large value. When the error is small, it means that the estimated parameter is close to the optimal parameter; \u03b7 approaches 0 automatically. 
However, for the above algorithm all quantities (K, l, R\u0302) have to be accessible, which they are certainly not in general. Furthermore l(z_{t+1}, y_{t+1}; w_t) - R\u0302 could take negative values. Nevertheless, in order to still get an intuition of the learning behaviour, we use the continuous versions of (2) and (3), averaged with respect to the current input-output pair (z_t, y_t), and we omit correlations and variances between the quantities (\u03b7_t, w_t, l) for the sake of simplicity. \nNoting that \u27e8\u2202l(z, y; w*)/\u2202w\u27e9 = 0, we have the asymptotic evaluations \n\n\u27e8\u2202/\u2202w l(z, y; w_t)\u27e9 \u2248 K*(w_t - w*), \n\n\u27e8l(z, y; w_t) - R\u0302\u27e9 \u2248 R(w*) - R\u0302 + (1/2)(w_t - w*)^T K*(w_t - w*), \n\nwith K* = \u2202^2 R(w*)/\u2202w\u2202w. Assuming R(w*) - R\u0302 is small and K(w_t) \u2248 K* yields \n\nd/dt w_t = -\u03b7_t (w_t - w*),  d/dt \u03b7_t = \u03b1 \u03b7_t ( (1/2)(w_t - w*)^T K*(w_t - w*) - \u03b7_t ).   (4) \n\nIntroducing the squared error e_t = (1/2)(w_t - w*)^T K*(w_t - w*) gives rise to \n\nd/dt e_t = -2 \u03b7_t e_t,  d/dt \u03b7_t = \u03b1 \u03b7_t ( e_t - \u03b7_t ).   (5) \n\nThe behavior of the above equation system is interesting: the origin (0, 0) is its attractor and the basin of attraction has a fractal boundary. Starting from an adequate initial value, it has a solution of the form \n\ne_t \u221d 1/t,  \u03b7_t = (1/2) \u00b7 (1/t).   (6) \n\nIt is important to note that this 1/t-convergence rate of the generalization error e_t is the optimal order of any estimator w_t converging to w*. So we find that Eq. (4) gives us an on-line learning algorithm which converges at a fast rate. This holds also if the target rule is slowly fluctuating or suddenly changing. The technique used to prove convergence was the scalar distance in weight space, e_t. Note also that Eq. (6) holds only within an appropriate parameter range; for small \u03b7 and w_t - w*, correlations and variances between (\u03b7_t, w_t, l) can no longer be neglected. 
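As a quick sanity check, the coupled system (5), de/dt = -2\u03b7e together with d\u03b7/dt = \u03b1\u03b7(e - \u03b7), can be integrated numerically; started on the scaling solution, the learning rate indeed anneals like \u03b7_t = (1/2)(1/t). This is only an illustrative sketch: the choice \u03b1 = 4, the step size and the initial values are our own illustrative assumptions, not settings from the paper.

```python
# Explicit Euler integration of the averaged dynamics (5):
#   de/dt   = -2 * eta * e
#   deta/dt = alpha * eta * (e - eta)
# The scaling ansatz e = a/t, eta = b/t forces b = 1/2 and a = 1/2 - 1/alpha,
# so we start on that solution (alpha > 2 keeps e positive) and check that
# the trajectory stays close to eta = (1/2) * (1/t).

alpha = 4.0                    # illustrative constant, must exceed 2 here
dt = 1e-2                      # integration step
t = 1.0
e = (0.5 - 1.0 / alpha) / t    # squared-error coordinate, starts at 0.25
eta = 0.5 / t                  # learning rate, starts at 0.5

for _ in range(100_000):       # integrate until t = 1001
    de = -2.0 * eta * e
    deta = alpha * eta * (e - eta)
    e += dt * de
    eta += dt * deta
    t += dt

# eta * t should hover near 1/2, the annealing rate predicted by Eq. (6)
```

The product \u03b7_t \u00b7 t staying near 1/2 over three decades of t is exactly the 1/t-annealing claimed in Eq. (6); for initial values far outside the basin of attraction the trajectory can diverge instead, consistent with the fractal basin boundary mentioned above.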
\n\n4 Modification \n\nFrom the practical point of view, (1) the Hessian K* of the expected loss or (2) the minimum value of the expected loss R are in general not known, or (3) in some applications we cannot access the explicit loss function (e.g. blind separation). Let us therefore consider a generalized learning algorithm: \n\nw_{t+1} = w_t - \u03b7_t f(z_{t+1}, y_{t+1}; w_t), \n\nwhere f is a flow which determines the modification when an example (z_{t+1}, y_{t+1}) is given. Here we do not assume the existence of a loss function and we only assume that the averaged flow vanishes at the optimal parameter, i.e. \u27e8f(z, y; w*)\u27e9 = 0. With a loss function, the flow corresponds to the gradient of the loss. We consider the averaged continuous equation and expand it around the optimal parameter: \n\nd/dt w_t = -\u03b7_t \u27e8f(z, y; w_t)\u27e9 \u2248 -\u03b7_t K*(w_t - w*),   (7) \n\nwhere K* = \u27e8\u2202f(z, y; w*)/\u2202w\u27e9. Suppose that we have an eigenvector v of the Hessian K* satisfying v^T K* = \u03bb v^T, and let us define \n\ne_t = v^T \u27e8f(z, y; w_t)\u27e9;   (8) \n\nthen the dynamics of e can be approximately represented as \n\nd/dt e_t = -\u03bb \u03b7_t e_t.   (9) \n\nBy using e, we define a continuous and a discrete modification of the rule for \u03b7: \n\nd/dt \u03b7_t = \u03b1 \u03b7_t ( \u03b2 |e_t| - \u03b7_t )   (10) \n\nand \n\n\u03b7_{t+1} = \u03b7_t + \u03b1 \u03b7_t ( \u03b2 |e_t| - \u03b7_t ).   (11) \n\nIntuitively e corresponds to a 1-dimensional pseudo distance, where the average flow f is projected down to a single direction v. The idea is to choose a clever direction such that it is sufficient to observe all dynamics of the flow only along this projection. In this sense the scalar e is the simplest obtainable value to observe learning. Noting that e is always positive or negative depending on its initial value and that \u03b7 is always positive, these two equations (10) and (11) are equivalent to the equation system (5). Therefore their asymptotic solutions are \n\n|e_t| \u221d 1/t  and  \u03b7_t = (1/\u03bb) \u00b7 (1/t).   (12) \n\nAgain, similar to the last section, we have shown that the algorithm converges properly, however this time without using loss or Hessian. 
In this algorithm, an important problem is how to get a good projection v. Here we assume the following facts and approximate the previous algorithm: (1) the minimum eigenvalue of the matrix K* is sufficiently smaller than the second smallest eigenvalue, and (2) therefore, after a large number of iterations, the parameter vector w_t will approach w* from the direction of the minimum eigenvector of K*. Since under these conditions the evolution of the estimated parameter can be thought of as a one-dimensional process, any vector can be used as v except for the vectors which are orthogonal to the minimum eigenvector. The most efficient vector will be the minimum eigenvector itself, which can be approximated (for a large number of iterations) by \n\nv = \u27e8f\u27e9 / ||\u27e8f\u27e9||, \n\nwhere || || denotes the L2 norm. Hence we can adopt e = ||\u27e8f\u27e9||. Substituting the instantaneous average of the flow by a leaky average, we arrive at \n\nw_{t+1} = w_t - \u03b7_t f(z_{t+1}, y_{t+1}; w_t),   (13) \n\nr_{t+1} = (1 - \u03b4) r_t + \u03b4 f(z_{t+1}, y_{t+1}; w_t),  (0 < \u03b4 < 1)   (14) \n\n\u03b7_{t+1} = \u03b7_t + \u03b1 \u03b7_t ( \u03b2 ||r_{t+1}|| - \u03b7_t ),   (15) \n\nwhere \u03b4 controls the leakiness of the average and r is used as an auxiliary variable to calculate the leaky average of the flow f. This set of rules is easy to compute. However \u03b7 will now approach a small value because of fluctuations in the estimation of r, which depend on the choice of \u03b1, \u03b2 and \u03b4. In practice, to assure the stability of the algorithm, the learning rate in Eq. (13) should be limited to a maximum value \u03b7_max and a cut-off \u03b7_min should be imposed. \n\n5 Numerical Experiment: an application to blind separation \n\nIn the following we will describe the blind separation experiment that we conducted (see e.g. Bell & Sejnowski [1995], Jutten & Herault [1991], Molgedey & Schuster [1994] for more details on blind separation). 
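Before turning to the acoustic experiment, the practical rules (13)-(15) can be sketched on a toy problem. Everything specific here is an illustrative assumption rather than a setting from the paper: the flow is simply the gradient of a squared error on noisy 2-d observations, the constants \u03b1, \u03b2, \u03b4 and the cut-offs are arbitrary, and the data-generating rule switches suddenly halfway through the run.

```python
import numpy as np

# Toy run of the update rules (13)-(15): the flow f = w - x vanishes on
# average at the optimum (the mean of the data), and the leaky average r
# drives the learning rate eta up whenever the rule changes.
rng = np.random.default_rng(0)
alpha, beta, delta = 0.02, 2.0, 0.05   # illustrative constants
eta_min, eta_max = 1e-4, 0.5           # practical cut-offs

w = np.zeros(2)                        # parameter estimate
r = np.zeros(2)                        # leaky average of the flow
eta = 0.1                              # adaptive learning rate
target = np.array([1.0, -1.0])
etas = []

for t in range(4000):
    if t == 2000:                      # sudden rule switch
        target = np.array([-2.0, 2.0])
    x = target + 0.1 * rng.standard_normal(2)
    f = w - x                                   # flow (gradient of squared error)
    w = w - eta * f                             # Eq. (13)
    r = (1 - delta) * r + delta * f             # Eq. (14): leaky average
    eta = eta + alpha * eta * (beta * np.linalg.norm(r) - eta)  # Eq. (15)
    eta = min(max(eta, eta_min), eta_max)       # enforce eta_min, eta_max
    etas.append(eta)
```

In this toy run \u03b7 first anneals to a small value, rises sharply right after the switch at t = 2000 while the leaky-averaged flow is large, and then cools down again once w has reached the new optimum, qualitatively the behavior reported for the separation experiment in Fig. 2.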
As an example we use the two Sun audio files (sampling rate 8 kHz) \"rooster\" (s_t^1) and \"space music\" (s_t^2) (see Fig. 1). Both sources are mixed on the computer via x_t = (I + A) s_t for 0s \u2264 t < 1.25s and 3.75s \u2264 t \u2264 5s, and x_t = (I + B) s_t for 1.25s \u2264 t < 3.75s, using A = (0 0.9; 0.7 0) and B = (0 0.8; 0.6 0) as mixing matrices. So the rule switches twice in the given data. The goal is to obtain the sources s_t by estimating A and B, given only the measured mixed signals x_t. A change of the mixing is a scenario often encountered in real blind separation tasks, e.g. a speaker turns his head or moves during his utterances. Our on-line algorithm is especially suited to this non-stationary separation task, since adaptation is not limited by the above-discussed generic drawbacks of a constant learning rate as in Bell & Sejnowski [1995], Jutten & Herault [1991], Molgedey & Schuster [1994]. Let u_t be the unmixed signals \n\nu_t = (I + T_t)^{-1} x_t,   (16) \n\nwhere T_t is the estimated mixing matrix. Along the lines of Molgedey & Schuster [1994] we use as modification rule for T_t \n\n\u0394T_t^{ij} \u221d \u03b7_t f( \u27e8x_t^i u_t^j\u27e9, \u27e8u_t^i u_t^j\u27e9, \u27e8x_t^i u_{t-1}^j\u27e9, \u27e8u_t^i u_{t-1}^j\u27e9 ) \u221d \u03b7_t [ \u27e8x_t^i u_t^j\u27e9\u27e8u_t^i u_t^j\u27e9 + \u27e8x_t^i u_{t-1}^j\u27e9\u27e8u_t^i u_{t-1}^j\u27e9 ],  (i, j = 1, 2, i \u2260 j), \n\nwhere we substitute instantaneous averages with leaky averages \n\n\u27e8x_t^i u_t^j\u27e9_leaky = (1 - \u03b5) \u27e8x_{t-1}^i u_{t-1}^j\u27e9_leaky + \u03b5 x_t^i u_t^j. \n\nNote that the necessary ingredients for the flow f in Eq. (13)-(14) are in this case simply the correlations at equal or different times; \u03b7_t is computed according to Eq. (15). In Fig. 2 we observe the results of the simulation (for parameter details, see figure caption). After a short time (t=0.4s) of large \u03b7 and strong fluctuations in \u03b7, the mixing matrix is estimated correctly. Until t=1.25s the learning rate adapts, cooling down approximately like 1/t (cf. Fig. 2c), which was predicted in Eq. (12) in the previous section, i.e. it finds the optimal rate for annealing. At the point of the switch, where simple annealed learning would have failed to adapt to the sudden change, our adaptive rule increases \u03b7 drastically and is able to follow the switch within another 0.4s resp. 0.1s. Then again, the learning rate is cooled down automatically as intended. Comparing the mixed, original and unmixed signals in Fig. 1 confirms the accurate and fast estimate that we already observed in the mixing matrix elements. The same also holds for an acoustic cross check: for a small part of a second both signals are audible, then as time proceeds only one signal, and again after the switches both signals are audible, but only for a very short moment. The fading away of the signal is so fast to the listener that it seems that one signal is simply \"switched off\" by the separation algorithm. \nAltogether we found an excellent adaptation behavior of the proposed on-line algorithm, which was also reproduced in other simulation examples omitted here. \n\n6 Conclusion \n\nWe gave a theoretically motivated adaptive on-line algorithm extending the work of Sompolinsky et al. [1995]. Our algorithm applies to general feed-forward networks and can be used to accelerate learning by the learning about learning strategy in the difficult setting where (a) continuous functions or distributions are to be learned, (b) the Hessian K is not available and (c) no explicit loss function is given. Note that if an explicit loss function or K is given, this additional information can be incorporated easily, e.g. we can make use of the real gradient; otherwise we only rely on the flow. Non-stationary blind separation is a typical implementation of the setting (a)-(c) and we use it as an application of the adaptive on-line algorithm in a changing environment. 
Note that we can apply the learning rate adaptation to most existing blind separation algorithms and thus make them feasible for a non-stationary environment. However, we would like to emphasize that blind separation is just an example for the general adaptive on-line strategy proposed, and applications of our algorithm are by no means limited to this scenario. Future work will also consider applications where the rules change more gradually (e.g. drift). \n\nReferences \n\nAmari, S. (1967) IEEE Trans. EC 16(3):299-307. \nBell, T., Sejnowski, T. (1995) Neural Computation 7:1129-1159. \nCichocki, A., Amari, S., Adachi, M., Kasprzak, W. (1996) Self-Adaptive Neural Networks for Blind Separation of Sources, ISCAS'96 (IEEE), Vol. 2, 157-160. \nDarken, C., Moody, J. (1991) in NIPS 3, Morgan Kaufmann, Palo Alto. \nHeskes, T.M., Kappen, B. (1991) Phys. Rev. A 44:2718-2726. \nJutten, C., Herault, J. (1991) Signal Processing 24:1-10. \nMolgedey, L., Schuster, H.G. (1994) Phys. Rev. Lett. 72(23):3634-3637. \nRobbins, H., Monro, S. (1951) Ann. Math. Statist. 22:400-407. \nRumelhart, D., McClelland, J.L. and the PDP Research Group (eds.) (1986) PDP Vol. 1, pp. 318-362, Cambridge, MA: MIT Press. \nSaad, D., and Solla, S. (1995) Workshop at NIPS'95, see World-Wide-Web page: http://neural-server.aston.ac.uk/nips95/workshop.html and references therein. \nSompolinsky, H., Barkai, N., Seung, H.S. (1995) in Neural Networks: The Statistical Mechanics Perspective, pp. 105-130. Singapore: World Scientific. \nSutton, R.S. (1992) in Proc. 10th nat. conf. on AI, 171-176, MIT Press. \n\nFigure 1: s_t^2 \"space music\", the mixture signal x_t^2, the unmixed signal u_t^2 and the separation error u_t^2 - s_t^2 as functions of time in seconds. \n\nFigure 2: Estimated mixing matrix T_t, evolution of the learning rate \u03b7_t and inverse learning rate 1/\u03b7_t over time. Rule switches (t=1.25s, 3.75s) are clearly observed as drastic changes in \u03b7_t. Asymptotic 1/t scaling in \u03b7 amounts to a straight line in 1/\u03b7_t. Simulation parameters are \u03b1 = 0.002, \u03b2 = 20/max||r||, \u03b5 = \u03b4 = 0.01. max||r|| denotes the maximal value of the past observations. ", "award": [], "sourceid": 1183, "authors": [{"given_name": "Noboru", "family_name": "Murata", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Andreas", "family_name": "Ziehe", "institution": null}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}