{"title": "Dynamics of Supervised Learning with Restricted Training Sets and Noisy Teachers", "book": "Advances in Neural Information Processing Systems", "page_first": 237, "page_last": 243, "abstract": null, "full_text": "Dynamics of Supervised Learning with Restricted Training Sets and Noisy Teachers

A.C.C. Coolen
Dept of Mathematics
King's College London
The Strand, London WC2R 2LS, UK
tcoolen@mth.kcl.ac.uk

C.W.H. Mace
Dept of Mathematics
King's College London
The Strand, London WC2R 2LS, UK
cmace@mth.kcl.ac.uk

Abstract

We generalize a recent formalism to describe the dynamics of supervised learning in layered neural networks, in the regime where data recycling is inevitable, to the case of noisy teachers. Our theory generates reliable predictions for the evolution in time of training and generalization errors, and extends the class of mathematically solvable learning processes in large neural networks to those situations where overfitting can occur.

1 Introduction

Tools from statistical mechanics have been used successfully over the last decade to study the dynamics of learning in layered neural networks (for reviews see e.g. [1] or [2]). The simplest theories result upon assuming the data set to be much larger than the number of weight updates made, which rules out recycling and ensures that any distribution of relevance will be Gaussian. Unfortunately, both in terms of applications and in terms of mathematical interest, this regime is not the most relevant one. Most complications and peculiarities in the dynamics of learning arise precisely due to data recycling, which creates for the system the possibility to improve performance by memorizing answers rather than by learning an underlying rule.
The dynamics of learning with restricted training sets was first studied analytically in [3] (linear learning rules) and [4] (systems with binary weights). The latter studies were ahead of their time, and did not get the attention they deserved, simply because at that stage even the simpler learning dynamics without data recycling had not yet been studied. More recently attention has moved back to the dynamics of learning in the recycling regime. Some studies aimed at developing a general theory [5, 6, 7], some at finding exact solutions for special cases [8]. All general theories published so far have in common that they as yet considered realizable scenarios: the rule to be learned was implementable by the student, and overfitting could not yet occur. The next hurdle is the regime where restricted training sets are combined with unrealizable rules. Again some have turned to non-typical but solvable cases, involving Hebbian rules and noisy [9] or 'reverse wedge' teachers [10]. More recently the cavity method has been used to build a general theory [11] (as yet for batch learning only). In this paper we generalize the general theory launched in [6, 5, 7], which applies to arbitrary learning rules, to the case of noisy teachers. We will mirror closely the presentation in [6] (dealing with the simpler case of noise-free teachers), and we refer to [5, 7] for background reading on the ideas behind the formalism.

2 Definitions

As in [6, 5] we restrict ourselves for simplicity to perceptrons.
A student perceptron operates a linear separation, parametrised by a weight vector J \in R^N:

S: \{-1,1\}^N \to \{-1,1\},   S(\xi) = sgn[J \cdot \xi]

It aims to emulate a teacher operating a similar rule which, however, is characterized by a variable weight vector B \in R^N, drawn at random from a distribution P(B) such as

output noise:   P(B) = \lambda \delta[B + B^*] + (1-\lambda) \delta[B - B^*]   (1)

Gaussian weight noise:   P(B) = [\Sigma \sqrt{2\pi/N}]^{-N} e^{-N(B - B^*)^2 / 2\Sigma^2}   (2)

The parameters \lambda and \Sigma control the amount of teacher noise, with the noise-free teacher B = B^* recovered in the limits \lambda \to 0 and \Sigma \to 0. The student modifies J iteratively, using examples of input vectors \xi which are drawn at random from a fixed (randomly composed) training set containing p = \alpha N vectors \xi \in \{-1,1\}^N with \alpha > 0, together with the corresponding values of the teacher outputs. We choose the teacher noise to be consistent, i.e. the answer given by the teacher to a question \xi will remain the same when that particular question re-appears during the learning process. Thus T(\xi^\mu) = sgn[B^\mu \cdot \xi^\mu], with p teacher weight vectors B^\mu, drawn randomly and independently from P(B), and we generalize the training set accordingly to \tilde{D} = \{(\xi^1, B^1), ..., (\xi^p, B^p)\}. Consistency of teacher noise is natural in terms of applications, and a prerequisite for overfitting phenomena. Averages over the training set will be denoted as \langle ... \rangle_{\tilde{D}}; averages over all possible input vectors \xi \in \{-1,1\}^N as \langle ... \rangle_\xi. We analyze two classes of learning rules, of the form J(\ell+1) = J(\ell) + \Delta J(\ell):

on-line:   \Delta J(\ell) = \eta \{ \xi(\ell)\, G[J(\ell) \cdot \xi(\ell), B(\ell) \cdot \xi(\ell)] - \gamma J(\ell) \}   (3)

batch:   \Delta J(\ell) = \eta \{ \langle \xi\, G[J(\ell) \cdot \xi, B \cdot \xi] \rangle_{\tilde{D}} - \gamma J(\ell) \}

In on-line learning one draws at each step \ell a question/answer pair (\xi(\ell), B(\ell)) at random from the training set.
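The ingredients above (a consistent output-noise teacher as in (1), and the on-line rule (3)) can be sketched numerically as follows. This is our own minimal numpy illustration, not the authors' code: the network size, learning rate, decay, random seed and the 1/N update scaling (which makes one unit of time correspond to N update steps) are all our assumptions, and we take the Hebbian choice G[x, z] = sgn(z) purely as an example:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, lam, eta, gamma = 400, 1.0, 0.2, 1.0, 0.1   # example settings (our choice)
p = int(alpha * N)

# Clean teacher B*, normalized to |B*| = 1 so the field B*.xi is of order 1.
B_star = rng.standard_normal(N)
B_star /= np.linalg.norm(B_star)

# Restricted training set of p = alpha*N random binary questions.
xi = rng.choice([-1.0, 1.0], size=(p, N))

# Consistent output noise, eq. (1): each question mu gets its own frozen
# teacher vector B^mu (= -B* with probability lam, else B*), so the answer
# sgn(B^mu . xi^mu) never changes when question mu reappears.
B_mu = np.where((rng.random(p) < lam)[:, None], -B_star, B_star)

def G(x, z):
    return np.sign(z)      # Hebbian example; any modulation function G[x, z] fits here

# On-line rule (3): at each step draw one stored (question, teacher) pair at
# random; the 1/N scaling keeps fields of order 1, one time unit = N steps.
J = rng.standard_normal(N) / np.sqrt(N)
for step in range(10 * N):                      # i.e. up to time t = 10
    mu = rng.integers(p)
    x, z = J @ xi[mu], B_mu[mu] @ xi[mu]        # student and noisy-teacher fields
    J += (eta / N) * (xi[mu] * G(x, z) - gamma * J)
```

With these settings the teacher-student overlap R = J·B* grows steadily, so the student aligns with the clean teacher even though a fraction λ of the stored answers is flipped.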
In batch learning one iterates a deterministic map which is an average over all data in the training set. Our performance measures are the training and generalization errors, defined as follows (with the step function \theta[x > 0] = 1, \theta[x < 0] = 0):

E_t(J) = \langle \theta[-(J \cdot \xi)(B \cdot \xi)] \rangle_{\tilde{D}},   E_g(J) = \langle \theta[-(J \cdot \xi)(B^* \cdot \xi)] \rangle_\xi   (4)

We introduce macroscopic observables, tailored to the present problem, generalizing [5, 6]:

Q[J] = J^2,   R[J] = J \cdot B^*,   P[x, y, z; J] = \langle \delta[x - J \cdot \xi]\, \delta[y - B^* \cdot \xi]\, \delta[z - B \cdot \xi] \rangle_{\tilde{D}}   (5)

As in [5, 6] we eliminate technical subtleties by assuming the number of arguments (x, y, z) for which P[x, y, z; J] is evaluated to go to infinity only after the limit N \to \infty has been taken.

3 Derivation of Macroscopic Laws

Upon generalizing the calculations in [6, 5], one finds for on-line learning:

(d/dt) Q = 2\eta \int dx\,dy\,dz\, P[x, y, z]\, x\, G[x, z] - 2\eta\gamma Q + \eta^2 \int dx\,dy\,dz\, P[x, y, z]\, G^2[x, z]   (6)

(d/dt) R = \eta \int dx\,dy\,dz\, P[x, y, z]\, y\, G[x, z] - \eta\gamma R   (7)

(\partial/\partial t) P[x, y, z] = (1/\alpha) \int dx'\, P[x', y, z] \{ \delta[x - x' - \eta G[x', z]] - \delta[x - x'] \}
  - \eta \int dx'\,dy'\,dz'\, G[x', z']\, A[x, y, z; x', y', z'] + \eta\gamma (\partial/\partial x) \{ x P[x, y, z] \}
  + (1/2)\eta^2 \int dx'\,dy'\,dz'\, P[x', y', z']\, G^2[x', z']\, (\partial^2/\partial x^2) P[x, y, z]   (8)

The complexity of the problem is concentrated in a Green's function:

A[x, y, z; x', y', z'] = \lim_{N \to \infty} \langle \langle \langle (1 - \delta_{\xi\xi'})\, \delta[x - J \cdot \xi] \delta[y - B^* \cdot \xi] \delta[z - B \cdot \xi]\, (\xi \cdot \xi')\, \delta[x' - J \cdot \xi'] \delta[y' - B^* \cdot \xi'] \delta[z' - B' \cdot \xi'] \rangle_{\tilde{D}} \rangle_{\tilde{D}} \rangle_{Q,R,P;t}

It involves a conditional average of the form \langle K[J] \rangle_{Q,R,P;t} = \int dJ\, P_t(J|Q, R, P)\, K[J], with

P_t(J|Q, R, P) = P_t(J)\, \delta[Q - Q[J]]\, \delta[R - R[J]] \prod_{xyz} \delta[P[x, y, z] - P[x, y, z; J]] / \int dJ\, P_t(J)\, \delta[Q - Q[J]]\, \delta[R - R[J]] \prod_{xyz} \delta[P[x, y, z] - P[x, y, z; J]]

in which P_t(J) is the weight probability density at time t.
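Definitions (4) and (5) translate directly into measurable quantities in a finite-N experiment. The sketch below (again our own illustration with hypothetical parameter choices, not part of the theory) estimates E_t over the stored noisy answers, estimates E_g against the clean teacher on fresh questions, and checks the latter against the large-N expression E_g = \pi^{-1} arccos(R/\sqrt{Q}) used later in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, lam = 400, 1.0, 0.2                   # example settings (our choice)
p = int(alpha * N)

B_star = rng.standard_normal(N)
B_star /= np.linalg.norm(B_star)                # |B*| = 1, so B*.xi ~ N(0,1)
xi = rng.choice([-1.0, 1.0], size=(p, N))       # training questions
B_mu = np.where((rng.random(p) < lam)[:, None], -B_star, B_star)  # consistent noise
T = np.sign(np.einsum('ij,ij->i', B_mu, xi))    # frozen noisy answers sgn(B^mu.xi^mu)

def observables(J):
    return J @ J, J @ B_star                    # Q and R of eq. (5)

def train_error(J):
    # E_t of eq. (4): fraction of stored (question, noisy answer) pairs misclassified
    return np.mean((xi @ J) * T < 0)

def gen_error(J, n_test=20000):
    # E_g of eq. (4): error against the *clean* teacher B* on fresh random questions
    xi_test = rng.choice([-1.0, 1.0], size=(n_test, N))
    return np.mean((xi_test @ J) * (xi_test @ B_star) < 0)

# One-shot Hebbian student as a test case: for large N its generalization error
# should match arccos(R/sqrt(Q))/pi, the student and teacher fields being
# jointly Gaussian over random binary inputs.
J = xi.T @ T / N
Q, R = observables(J)
Eg_formula = np.arccos(R / np.sqrt(Q)) / np.pi
```

The empirical gen_error(J) and the arccos overlap formula agree to within Monte-Carlo and finite-N fluctuations, which is the content of the N → ∞ identity exploited by the theory.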
The solution of (6, 7, 8) can be used to generate the N \to \infty performance measures (4) at any time:

E_t = \int dx\,dy\,dz\, P[x, y, z]\, \theta[-xz],   E_g = \pi^{-1} \arccos[R/\sqrt{Q}]   (9)

Expansion of these equations in powers of \eta, retaining only the terms linear in \eta, gives the corresponding equations describing batch learning. So far this analysis is exact.

4 Closure of Macroscopic Laws

As in [6, 5] we close our macroscopic laws (6, 7, 8) by making the two key assumptions underlying dynamical replica theory:

(i) For N \to \infty our macroscopic observables obey closed dynamic equations.
(ii) These equations are self-averaging with respect to the specific realization of \tilde{D}.

Assumption (i) implies that probability variations within the {Q, R, P} subshells are either absent or irrelevant to the macroscopic laws. We may thus make the simplest choice for P_t(J|Q, R, P):

P_t(J|Q, R, P) \to \delta[Q - Q[J]]\, \delta[R - R[J]] \prod_{xyz} \delta[P[x, y, z] - P[x, y, z; J]]   (10)

The procedure (10) leads to exact laws if our observables {Q, R, P} indeed obey closed equations for N \to \infty; it is a maximum entropy approximation if not. Assumption (ii) allows us to average the macroscopic laws over all training sets; it is observed in simulations, and proven using the formalism of [4]. Our assumptions (10) result in the closure of (6, 7, 8), since now the Green's function can be written in terms of {Q, R, P}. The final ingredient of dynamical replica theory is performing the average of fractions with the replica identity

\langle \int dJ\, W[J; \tilde{D}]\, G[J; \tilde{D}] / \int dJ\, W[J; \tilde{D}] \rangle_{sets} = \lim_{n \to 0} \int dJ^1 \cdots dJ^n \langle G[J^1; \tilde{D}] \prod_{\alpha=1}^{n} W[J^\alpha; \tilde{D}] \rangle_{sets}

Our problem has been reduced to calculating (non-trivial) integrals and averages.
One finds that P[x, y, z] = P[x, z|y]\, P[y] with P[y] = (2\pi)^{-1/2} e^{-y^2/2}. With the short-hands Dy = P[y]\,dy and \langle f(x, y, z) \rangle = \int Dy\,dx\,dz\, P[x, z|y]\, f(x, y, z) we can write the resulting macroscopic laws, for the case of output noise (1), in the following compact way:

(d/dt) Q = 2\eta (V - \gamma Q) + \eta^2 Z,   (d/dt) R = \eta (W - \gamma R)   (11)

(\partial/\partial t) P[x, z|y] = (1/\alpha) \int dx'\, P[x', z|y] \{ \delta[x - x' - \eta G[x', z]] - \delta[x - x'] \} + (1/2)\eta^2 Z (\partial^2/\partial x^2) P[x, z|y]
  - \eta (\partial/\partial x) \{ P[x, z|y] ( U(x - Ry) + Wy - \gamma x + [V - RW - (Q - R^2)U]\, \Phi[x, y, z] ) \}   (12)

with

U = \langle \Phi[x, y, z]\, G[x, z] \rangle,   V = \langle x\, G[x, z] \rangle,   W = \langle y\, G[x, z] \rangle,   Z = \langle G^2[x, z] \rangle

The solution of (12) is at any time of the following form:

P[x, z|y] = (1-\lambda)\, \delta[y - z]\, P^+[x|y] + \lambda\, \delta[y + z]\, P^-[x|y]   (13)

Finding the function \Phi[x, y, z] (in replica symmetric ansatz) requires solving a saddle-point problem for a scalar observable q and two functions M^\pm[x|y]. Upon introducing

B = \sqrt{qQ - R^2} / [Q(1-q)],   \langle f[x, y] \rangle_\pm^s = \int dx\, M^\pm[x|y]\, e^{Bxs}\, f[x, y] / \int dx\, M^\pm[x|y]\, e^{Bxs}

(with \int dx\, M^\pm[x|y] = 1 for all y) the saddle-point equations acquire the form

for all X, y:   P^\pm[X|y] = \int Ds\, \langle \delta[X - x] \rangle_\pm^s   (14)

\langle (x - Ry)^2 \rangle + (qQ - R^2)[1 - 1/\alpha] = [qQ + Q - 2R^2] / \sqrt{qQ - R^2} \int Dy\,Ds\, s\, [ (1-\lambda) \langle x \rangle_+^s + \lambda \langle x \rangle_-^s ]   (15)

The equations (14) which determine M^\pm[x|y] have the same structure as the corresponding (single) equation in [5, 6], so the proofs in [5, 6] again apply, and the solutions M^\pm[x|y], given a q in the physical range q \in [R^2/Q, 1], are unique. The function \Phi[x, y, z] is then given by

\Phi[X, y, z] = \int Ds\, s\, \{ (1-\lambda)\, \delta[z - y]\, \langle \delta[X - x] \rangle_+^s + \lambda\, \delta[z + y]\, \langle \delta[X - x] \rangle_-^s \} / [ \sqrt{qQ - R^2}\; P[X, z|y] ]   (16)

Working out predictions from these equations is generally CPU-intensive, mainly due to the functional saddle-point equation (14) which has to be solved at each time step.
However, as in [7] one can construct useful approximations of the theory, with increasing complexity:

(i) Large \alpha approximation (giving the simplest theory, without saddle-point equations)
(ii) Conditionally Gaussian approximation for M[x|y] (with y-dependent moments)
(iii) Annealed approximation of the functional saddle-point equation

5 Benchmark Tests: The Limits \alpha \to \infty and \lambda \to 0

We first show that in the limit \alpha \to \infty our theory reduces to the simple (Q, R) formalism of infinite training sets, as worked out for noisy teachers in [12]. Upon making the ansatz

P^\pm[x|y] = P[x|y] = [2\pi(Q - R^2)]^{-1/2} e^{-[x - Ry]^2 / 2(Q - R^2)}   (17)

one finds

M^\pm[x|y] = P[x|y],   \Phi[x, y, z] = (x - Ry)/(Q - R^2)

Insertion of our ansatz into (12), followed by rearrangement of terms and use of the above expression for \Phi[x, y, z], shows that (12) is satisfied. The remaining equations (11) involve only averages over the Gaussian distribution (17), and indeed reduce to those of [12]:

(1/\eta)(d/dt) Q = (1-\lambda) \{ 2 \langle x\, G[x, y] \rangle + \eta \langle G^2[x, y] \rangle \} + \lambda \{ 2 \langle x\, G[x, -y] \rangle + \eta \langle G^2[x, -y] \rangle \} - 2\gamma Q

(1/\eta)(d/dt) R = (1-\lambda) \langle y\, G[x, y] \rangle + \lambda \langle y\, G[x, -y] \rangle - \gamma R

Next we turn to the limit \lambda \to 0 (restricted training sets and noise-free teachers) and show that here our theory reproduces the formalism of [6, 5]. Now we make the following ansatz:

P^+[x|y] = P[x|y],   P[x, z|y] = \delta[z - y]\, P[x|y]   (18)

Insertion shows that for \lambda = 0 solutions of this form indeed solve our equations, giving \Phi[x, y, z] \to \Phi[x, y] and M^+[x|y] = M[x|y], and leaving us exactly with the formalism of [6, 5] describing the case of noise-free teachers and restricted training sets (apart from some new terms due to the presence of weight decay, which was absent in [6, 5]).
Figure 1: On-line Hebbian learning: conditionally Gaussian approximation versus the exact solution of [9] (\eta = 1, \lambda = 0.2). Left: \gamma = 0.1, right: \gamma = 0.5. Solid lines: approximated theory; dashed lines: exact result. Upper curves: E_g as functions of time (here the two theories agree); lower curves: E_t as functions of time. Curves are shown for \alpha \in \{0.5, 1, 2, 4\}.

6 Benchmark Tests: Hebbian Learning

The special case of Hebbian learning, i.e. G[x, z] = sgn(z), can be solved exactly at any time, for arbitrary {\alpha, \lambda, \gamma} [9], providing yet another excellent benchmark for our theory. For batch execution of Hebbian learning the macroscopic laws are obtained upon expanding (11, 12) and retaining only those terms which are linear in \eta. All integrations can now be done and all equations solved explicitly, resulting in U = 0, Z = 1, W = (1-2\lambda)\sqrt{2/\pi}, and

Q = Q_0\, e^{-2\eta\gamma t} + 2 R_0 (1-2\lambda)\, e^{-\eta\gamma t} [1 - e^{-\eta\gamma t}] \sqrt{2/\pi} / \gamma + [ (2/\pi)(1-2\lambda)^2 + 1/\alpha ] [1 - e^{-\eta\gamma t}]^2 / \gamma^2

R = R_0\, e^{-\eta\gamma t} + (1-2\lambda) \sqrt{2/\pi}\, [1 - e^{-\eta\gamma t}] / \gamma

P^\pm[x|y] = [2\pi(Q - R^2)]^{-1/2} e^{-[x - Ry \mp sgn(y)(1 - e^{-\eta\gamma t})/\alpha\gamma]^2 / 2(Q - R^2)}

q = [ \alpha R^2 + (1 - e^{-\eta\gamma t})^2 / \gamma^2 ] / \alpha Q   (19)

From these results, in turn, follow the performance measures E_g = \pi^{-1} \arccos[R/\sqrt{Q}] and

E_t = 1/2 - (1/2)(1-\lambda) \int Dy\, erf[ (|y|R + [1 - e^{-\eta\gamma t}]/\alpha\gamma) / \sqrt{2(Q - R^2)} ] + (1/2)\lambda \int Dy\, erf[ (|y|R - [1 - e^{-\eta\gamma t}]/\alpha\gamma) / \sqrt{2(Q - R^2)} ]

Comparison with the exact solution, calculated along the lines of [9] or, equivalently, obtained upon putting t \ll \eta^{-2} in [9], shows that the above expressions are all exact.

For on-line execution we cannot (yet) solve the functional saddle-point equation in general. However, some analytical predictions can still be extracted from (11, 12, 13):

Q = Q_0\, e^{-2\eta\gamma t} + 2 R_0 (1-2\lambda)\, e^{-\eta\gamma t} [1 - e^{-\eta\gamma t}] \sqrt{2/\pi} / \gamma + [ (2/\pi)(1-2\lambda)^2 + 1/\alpha ] [1 - e^{-\eta\gamma t}]^2 / \gamma^2 + (\eta/2\gamma) [1 - e^{-2\eta\gamma t}]

R = R_0\, e^{-\eta\gamma t} + (1-2\lambda) \sqrt{2/\pi}\, [1 - e^{-\eta\gamma t}] / \gamma

\int dx\, x\, P^\pm[x|y] = Ry \pm sgn(y) [1 - e^{-\eta\gamma t}] / \alpha\gamma

with U = 0, W = (1-2\lambda)\sqrt{2/\pi}, V = WR + [1 - e^{-\eta\gamma t}]/\alpha\gamma, and Z = 1. Comparison with the results in [9] shows that the above expressions, and thus also that of E_g, are fully exact, at any time. Observables involving P[x, y, z] (including the training error) are not as easily solved from our equations. Instead we used the conditionally Gaussian approximation (found to be adequate for the noiseless Hebbian case [5, 6, 7]). The result is shown in figure 1. The agreement is reasonable, but significantly less close than that in [6]; apparently teacher noise adds to the deformation of the field distribution away from a Gaussian shape.

Figure 2: Large \alpha approximation versus numerical simulations (with N = 10,000), for \gamma = 0 and \lambda = 0.2. Top row: Perceptron rule. Bottom row: Adatron rule. Left: training errors E_t and generalization errors E_g as functions of time, for \alpha \in \{1/2, 1, 2\}. Lines: approximated theory; markers: simulations (circles: E_t, squares: E_g). Right: joint distributions for student field and teacher noise P^\pm[x] = \int dy\, P[x, y, z = \pm y] (upper: P^+[x], lower: P^-[x]). Histograms: simulations; lines: approximated theory.

7 Non-Linear Learning Rules: Theory versus Simulations

In the case of non-linear learning rules no exact solution is known against which to test our formalism, leaving numerical simulations as the yardstick. We have evaluated numerically the large \alpha approximation of our theory for Perceptron learning, G[x, z] = sgn(z)\, \theta[-xz], and for Adatron learning, G[x, z] = sgn(z)\, |z|\, \theta[-xz]. This approximation leads to the following fully explicit equation for the field distributions:

(d/dt) P^\pm[x|y] = (1/\alpha) \int dx'\, P^\pm[x'|y] \{ \delta[x - x' - \eta G[x', \pm y]] - \delta[x - x'] \} + (1/2)\eta^2 Z (\partial^2/\partial x^2) P^\pm[x|y]
  - \eta (\partial/\partial x) \{ P^\pm[x|y] ( Wy - \gamma x + [ U(\bar{x}^\pm(y) - Ry) + (V - RW) ] [x - \bar{x}^\pm(y)] / (Q - R^2) ) \}

with

U = \int Dy\,dx \{ (1-\lambda) P^+[x|y] [x - \bar{x}^+(y)]\, G[x, y] + \lambda P^-[x|y] [x - \bar{x}^-(y)]\, G[x, -y] \}

V = \int Dy\,dx\; x \{ (1-\lambda) P^+[x|y]\, G[x, y] + \lambda P^-[x|y]\, G[x, -y] \}

W = \int Dy\,dx\; y \{ (1-\lambda) P^+[x|y]\, G[x, y] + \lambda P^-[x|y]\, G[x, -y] \}

Z = \int Dy\,dx \{ (1-\lambda) P^+[x|y]\, G^2[x, y] + \lambda P^-[x|y]\, G^2[x, -y] \}

and with the short-hands \bar{x}^\pm(y) = \int dx\, x\, P^\pm[x|y]. The result of our comparison is shown in figure 2. Note that E_t increases monotonically with \alpha, and E_g decreases monotonically with \alpha, at any t. As in the noise-free formalism [7], the large \alpha approximation appears to capture the dominant terms both for \alpha \to \infty and for \alpha \to 0. The predictive power of our theory is mainly limited by numerical constraints.
For instance, the Adatron learning rule generates singularities at x = 0 in the distributions P^\pm[x|y] (especially for small \gamma) which, although predicted by our theory, are almost impossible to capture in numerical solutions.

8 Discussion

We have shown how a recent theory describing the dynamics of supervised learning with restricted training sets (designed to apply in the data recycling regime, and for arbitrary on-line and batch learning rules) [5, 6, 7] in large layered neural networks can be generalized successfully to deal also with noisy teachers. In our generalized approach the joint distribution P[x, y, z] for the fields of student, 'clean' teacher and noisy teacher is taken to be a dynamical order parameter, in addition to the conventional observables Q and R. From the order parameter set {Q, R, P} we derive the generalization error E_g and the training error E_t. Following the prescriptions of dynamical replica theory one finds a diffusion equation for P[x, y, z], which we have evaluated by making the replica-symmetric ansatz. We have carried out several orthogonal benchmark tests of our theory: (i) for \alpha \to \infty (no data recycling) our theory is exact, (ii) for \lambda \to 0 (no teacher noise) our theory reduces to that of [5, 6, 7], and (iii) for batch Hebbian learning our theory is exact. For on-line Hebbian learning our theory is exact with regard to the predictions for Q, R, E_g and the y-dependent conditional averages \int dx\, x\, P^\pm[x|y], at any time, and a crude approximation of our equations already gives reasonable agreement with the exact results [9] for E_t. For non-linear learning rules (Perceptron and Adatron) we have compared the numerical solution of a simple large \alpha approximation of our equations to numerical simulations, and found satisfactory agreement.
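For reference, the modulation functions of the three rules benchmarked in this paper have simple closed forms; the sketch below writes them out in our own notation (x is the student field J·ξ, z the noisy-teacher field B·ξ, and θ the step function of section 2):

```python
import numpy as np

def theta(u):                       # step function: theta[u > 0] = 1, theta[u < 0] = 0
    return np.heaviside(u, 0.0)

def g_hebbian(x, z):                # Hebbian rule (section 6): G[x, z] = sgn(z)
    return np.sign(z)

def g_perceptron(x, z):             # Perceptron rule (section 7): G[x, z] = sgn(z) theta[-xz]
    return np.sign(z) * theta(-x * z)

def g_adatron(x, z):                # Adatron rule (section 7): G[x, z] = sgn(z)|z| theta[-xz]
    return np.sign(z) * np.abs(z) * theta(-x * z)
```

Since θ[-xz] is non-zero only when student and teacher fields disagree in sign, the Perceptron and Adatron rules update the weights only on misclassified stored patterns, whereas the Hebbian rule updates on every presentation.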
This paper is a preliminary presentation of results obtained in the second stage of a research programme aimed at extending our theoretical tools in the arena of learning dynamics, building on [5, 6, 7]. Ongoing work is aimed at systematic application of our theory and its approximations to various types of non-linear learning rules, and at generalization of the theory to multi-layer networks.

References

[1] Mace C.W.H. and Coolen A.C.C. (1998), Statistics and Computing 8, 55
[2] Saad D. (ed.) (1998), On-Line Learning in Neural Networks (Cambridge: CUP)
[3] Hertz J.A., Krogh A. and Thorbergsson G.I. (1989), J. Phys. A 22, 2133
[4] Horner H. (1992a), Z. Phys. B 86, 291 and Horner H. (1992b), Z. Phys. B 87, 371
[5] Coolen A.C.C. and Saad D. (1998), in On-Line Learning in Neural Networks, Saad D. (ed.) (Cambridge: CUP)
[6] Coolen A.C.C. and Saad D. (1999), in Advances in Neural Information Processing Systems 11, Kearns M.J., Solla S.A., Cohn D.A. (eds.) (MIT Press)
[7] Coolen A.C.C. and Saad D. (1999), preprints KCL-MTH-99-32 & KCL-MTH-99-33
[8] Rae H.C., Sollich P. and Coolen A.C.C. (1999), in Advances in Neural Information Processing Systems 11, Kearns M.J., Solla S.A., Cohn D.A. (eds.) (MIT Press)
[9] Rae H.C., Sollich P. and Coolen A.C.C. (1999), J. Phys. A 32, 3321
[10] Inoue J.I. (1999), private communication
[11] Wong K.Y.M., Li S. and Tong Y.W. (1999), preprint cond-mat/9909004
[12] Biehl M., Riegler P. and Stechert M. (1995), Phys. Rev. E 52, 4624
", "award": [], "sourceid": 1693, "authors": [{"given_name": "Anthony", "family_name": "Coolen", "institution": null}, {"given_name": "C.", "family_name": "Mace", "institution": null}]}