{"title": "Some results on convergent unlearning algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 358, "page_last": 364, "abstract": null, "full_text": "Some results on convergent unlearning algorithm \n\nSerguei A. Semenov & Irina B. Shuvalova \n\nInstitute of Physics and Technology \nPrechistenka St. 13/7 \nMoscow 119034, Russia \n\nAbstract \n\nIn this paper we consider the probabilities of the different asymptotics of the convergent unlearning algorithm for the Hopfield-type neural network (Plakhov & Semenov, 1994), treating the case of unbiased random patterns. We also show that failed unlearning results in total memory breakdown. \n\n1 INTRODUCTION \n\nIn recent years unsupervised learning schemes have aroused strong interest among researchers, but for the time being little is known about the underlying learning mechanisms, and even fewer rigorous results, such as convergence theorems, have been obtained in this field. One promising concept along this line is so-called \"unlearning\" for Hopfield-type neural networks (Hopfield et al, 1983, van Hemmen & Klemmer, 1992, Wimbauer et al, 1994). Elaborating on these elegant ideas, a convergent unlearning algorithm has recently been proposed (Plakhov & Semenov, 1994) which executes without presentation of the patterns. It aims to correct the initial Hebbian connectivity in order to provide extensive storage of arbitrarily correlated data. \n\nThe algorithm is stated as follows. At iteration step $m$, $m = 0, 1, 2, \\ldots$, pick a random network state $s^{(m)} = (S_1^{(m)}, \\ldots, S_N^{(m)})$, with the values $S_i^{(m)} = \\pm 1$ having equal probability 1/2, calculate the local fields generated by $s^{(m)}$, \n\n$h_i^{(m)} = \\sum_{j=1}^N J_{ij}^{(m)} S_j^{(m)}, \\quad i = 1, \\ldots, N,$ \n\nand then update the synaptic weights by \n\n$J_{ij}^{(m+1)} = J_{ij}^{(m)} - \\epsilon N^{-1} h_i^{(m)} h_j^{(m)}, \\quad i, j = 1, \\ldots, N. \\quad (1)$ \n\nHere $\\epsilon > 0$ stands for the unlearning strength parameter. We stress that self-interactions, $J_{ii}$, are necessarily involved in the iteration process. The initial condition for (1) is given by the Hebb matrix, $J^{(0)} = J^H$: \n\n$J_{ij}^H = N^{-1} \\sum_{\\mu=1}^p \\xi_i^\\mu \\xi_j^\\mu \\quad (2)$ \n\nwith arbitrary $(\\pm 1)$-patterns $\\xi^\\mu$, $\\mu = 1, \\ldots, p$. \n\nFor $\\epsilon < \\epsilon_c$, the (rescaled) synaptic matrix has been proven to converge with probability one to the projection matrix on the linear subspace spanned by a maximal subset of linearly independent patterns (Plakhov & Semenov, 1994). As a sufficient condition for that convergence to occur, the unlearning strength $\\epsilon$ should be less than $\\epsilon_c = \\lambda_{max}^{-1}$, where $\\lambda_{max}$ denotes the largest eigenvalue of the Hebb matrix. Very often in real-world situations there is no means of knowing $\\epsilon_c$ in advance, and it is therefore of interest to explore the asymptotic behaviour of the iterated synaptic matrix for arbitrary values of $\\epsilon$. As it turns out, there are only three possible limiting behaviours of the normalized synaptic matrix (Plakhov, 1995, Plakhov & Semenov, 1995). The corresponding convergence theorems relate the spectrum dynamics to the limiting behaviour of the normalized synaptic matrix $\\hat{J} = J/\\|J\\|$ (with $\\|J\\| = (\\sum_{i,j=1}^N J_{ij}^2)^{1/2}$), which can be described in terms of $\\lambda_{min}^{(m)}$, the smallest eigenvalue of $J^{(m)}$: \n\nI. if $\\lambda_{min}^{(m)} = 0$ for every $m = 0, 1, 2, \\ldots$, with the multiplicity of the zero eigenvalue being fixed, then \n\n(A) $\\lim_{m \\to \\infty} \\hat{J}_{ij}^{(m)} = s^{-1/2} P_{ij}$ \n\nwhere $P$ marks the projection matrix on the linear subspace $\\mathcal{L} \\subset R^N$ spanned by the nominated patterns set $\\xi^\\mu$, $\\mu = 1, \\ldots, p$, and $s = \\dim \\mathcal{L} \\le p$; \n\nII. if $\\lambda_{min}^{(m)} = 0$, $m = 0, 1, 2, \\ldots$, but at some (at least one) steps the multiplicity of the zero eigenvalue increases, then \n\n(B) $\\lim_{m \\to \\infty} \\hat{J}_{ij}^{(m)} = s'^{-1/2} P'_{ij}$ \n\nwhere $P'$ is the projector on some subspace $\\mathcal{L}' \\subset \\mathcal{L}$, $s' = \\dim \\mathcal{L}' < s$; \n\nIII. if $\\lambda_{min}^{(m)} < 0$ starting from some value of $m$, then \n\n(C) $\\lim_{m \\to \\infty} \\hat{J}_{ij}^{(m)} = -\\xi_i \\xi_j \\quad (3)$ \n\nwith some random unit vector $\\xi = (\\xi_1, \\ldots, \\xi_N)$ (not a $(\\pm 1)$-vector). \n\nThese three cases exhaust all possible asymptotic behaviours of $\\hat{J}_{ij}^{(m)}$; that is, their total probability is unity: $P_A + P_B + P_C = 1$. The patterns set is supposed to be fixed. \n\nThe convergence theorems say nothing about the relative probabilities of the specific asymptotics as functions of the model parameters. In this paper we present some general results elucidating this question and verify them by numerical simulation. We show further that the limiting synaptic matrix in the case (C), which is minus the projector on the random direction $\\xi$, cannot maintain any associative memory. A brief discussion of the retrieval properties in the intermediate case (B) is also given. \n\n2 PROBABILITIES OF POSSIBLE LIMITING BEHAVIOURS OF $\\hat{J}^{(m)}$ \n\nThe unlearning procedure under consideration is stochastic in nature. Which outcome of the iteration process, (A), (B) or (C), is realized depends upon the value of $\\epsilon$, the size and statistical properties of the patterns set $\\{\\xi^\\mu, \\mu = 1, \\ldots, p\\}$, and the realization of the unlearning sequence $\\{s^{(m)}, m = 0, 1, 2, \\ldots\\}$. \n\nFor a fixed patterns set, the probability of appearance of each limiting behaviour of the synaptic matrix is determined by the value of the unlearning strength $\\epsilon$ only. In this section we consider these probabilities as functions of $\\epsilon$. \n\nGenerally speaking, the considered probabilities exhibit a strong dependence on the patterns set, which makes it impossible to calculate them explicitly. It is possible, however, to obtain some general knowledge concerning these probabilities, namely: $P_A(\\epsilon) \\to 1$ as $\\epsilon \\to 0^+$, and hence $P_{B,C}(\\epsilon) \\to 0$; otherwise $P_C(\\epsilon) \\to 1$ as $\\epsilon \\to \\infty$, and hence $P_{A,B}(\\epsilon) \\to 0$, because of $P_A + P_B + P_C = 1$. This means that the risk of failed unlearning rises as $\\epsilon$ increases.
Specifically, we are able to prove the following: \n\nProposition. There exist positive $\epsilon_1$ and $\epsilon_2$ such that $P_A(\epsilon) = 1$ for $0 < \epsilon < \epsilon_1$, and $P_C(\epsilon) = 1$ for $\epsilon > \epsilon_2$. \n\nBefore passing to the proof we bring forward an alternative formulation of the above classification. Multiplying both sides of (1) by $S_i^{(m)} S_j^{(m)}$ and summing over all $i$ and $j$, we obtain in matrix notation \n\n$s^{(m)T} J^{(m+1)} s^{(m)} = \Delta_m s^{(m)T} J^{(m)} s^{(m)} \quad (4)$ \n\nwhere the contraction factor $\Delta_m = 1 - \epsilon N^{-1} s^{(m)T} J^{(m)} s^{(m)}$ controls the asymptotics of $\hat{J}^{(m)}$, as is suggested by detailed analysis (Plakhov & Semenov, 1995). (Here and below the superscript $T$ designates the transpose.) The hypotheses of the convergence theorems can thus be restated in terms of $\Delta_m$, instead of $\lambda_{min}^{(m)}$, respectively: I. $\Delta_m > 0$ for all $m$; II. $\Delta_m = 0$ for $l$ steps $m_1, \ldots, m_l$; III. $\Delta_m < 0$ at some step $m$. \n\nProof. It is obvious that $\Delta_m \ge 1 - \epsilon \lambda_{max}^{(m)}$, where $\lambda_{max}^{(m)}$ marks the largest eigenvalue of $J^{(m)}$. From (4) it follows that the sequence $\{\lambda_{max}^{(m)}, m = 0, 1, 2, \ldots\}$ is nonincreasing, and consequently $\Delta_m \ge 1 - \epsilon \lambda_{max}^H$, with \n\n$\lambda_{max}^H = \sup_{|x|=1} x^T J^H x = \sup_{|x|=1} N^{-1} \sum_{\mu=1}^p \Big( \sum_{i=1}^N \xi_i^\mu x_i \Big)^2 \le \sup_{|x|=1} N^{-1} \sum_{\mu=1}^p \sum_{i=1}^N (\xi_i^\mu)^2 \sum_{i=1}^N x_i^2 = p.$ \n\nFrom this it is straightforward to see that if $\epsilon < p^{-1}$, then $\Delta_m > 0$ for any $m$. By the convergence theorem (Plakhov & Semenov, 1995), iteration process (1) thus leads to the limiting relation (A). \n\nLet by definition $\gamma = \min_s N^{-1} s^T J^H s$, where the minimum is taken over those $(\pm 1)$-vectors $s$ for which $J^H s \ne 0$ (so $\gamma > 0$, in view of the positive semidefiniteness of $J^H$), and put $\epsilon > \gamma^{-1}$. Let us further denote by $n$ the iteration step such that $J^H s^{(m)} = 0$ for $m = 0, 1, \ldots, n-1$ and $J^H s^{(n)} \ne 0$. Needless to say, this condition may be satisfied even at the initial step, $n = 0$: $J^H s^{(0)} \ne 0$. At step $n$ one has \n\n$\Delta_n = 1 - \epsilon N^{-1} s^{(n)T} J^H s^{(n)} \le 1 - \epsilon\gamma < 0.$ \n\nThe latter implies loss of positive semidefiniteness of $J^{(m)}$, which results in asymptotics (C) (Plakhov, 1995, Plakhov & Semenov, 1995). Choosing $\epsilon_1 = p^{-1}$ and $\epsilon_2 = \gamma^{-1}$, we come to the statement of the Proposition. \n\nComparison of numerical estimates of the considered probabilities with analytical approximations can be done for simple patterns statistics. In what follows the patterns are assumed to be random and unbiased. \n\nThe dependence $P(\epsilon)$ has been found in computer simulation with unbiased random patterns. It is worth noting, in passing, that calculating $\Delta_m$ from current simulation data supplies a good control of the unlearning process, owing to the alternative formulation of the convergence theorems. In the simulation we calculate $P_A^N(\epsilon)$ averaged over the sets of unbiased random patterns, as well as over the realizations of the unlearning sequence. As $N$ increases, with $\alpha = p/N$ remaining fixed, the curves slope steeply down, approaching the step function $P_A^\infty(\epsilon) = \Theta(\alpha^{-1} - \epsilon)$ (Plakhov & Semenov, 1995). Without presenting a derivation or proof, we advance the reasoning suggesting it. First, it can be checked that $\Delta_m$ is a self-averaging quantity with mean $1 - \epsilon N^{-1} \mathrm{Tr}\, J^{(m)}$ and variance vanishing as $N$ goes to infinity. Initially one has $N^{-1} \mathrm{Tr}\, J^H = \alpha$, and obviously the sequence $\{\mathrm{Tr}\, J^{(m)}, m = 0, 1, 2, \ldots\}$ is nonincreasing. Therefore $\Delta_0 = 1 - \epsilon\alpha$, and all other $\Delta_m$ are not less than $\Delta_0$. If one chooses $\epsilon < \alpha^{-1}$, then all $\Delta_m$ will be positive, and the case (A) will be realized. On the other hand, when $\epsilon > \alpha^{-1}$, we have $\Delta_0 < 0$, and the case (C) will take place. \n\nWhat is the probability for asymptotics (B) to appear? We will adduce an argument (the detailed analysis (Plakhov & Semenov, 1995) is rather cumbersome and omitted here) indicating that this probability is quite small.
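The self-averaging argument above can be probed with a small Monte Carlo experiment (our own sketch; sizes and run counts are illustrative). A run is classified as (C) as soon as $J^{(m)}$ loses positive semidefiniteness, i.e. as soon as case III is detected; case (B) is ignored, since its probability vanishes as $N$ grows:

```python
import numpy as np

def run_case(N, p, eps, steps, rng):
    """Classify one unlearning run: 'C' once J loses positive
    semidefiniteness (lambda_min < 0, case III), else 'A'."""
    xi = rng.choice([-1.0, 1.0], size=(p, N))
    J = xi.T @ xi / N                        # Hebb matrix (2)
    for _ in range(steps):
        s = rng.choice([-1.0, 1.0], size=N)
        h = J @ s
        J -= eps / N * np.outer(h, h)        # update rule (1)
        if np.linalg.eigvalsh(J)[0] < -1e-9:
            return "C"                       # failed unlearning
    return "A"

rng = np.random.default_rng(1)
N, p = 60, 6                                 # alpha = p/N = 0.1
trials = 10
# eps < 1/p: the Proposition guarantees case (A)
pa_small = sum(run_case(N, p, 0.1, 100, rng) == "A" for _ in range(trials)) / trials
# eps > 1/alpha = 10: Delta_0 = 1 - eps*alpha < 0 and case (C) takes over
pa_large = sum(run_case(N, p, 20.0, 100, rng) == "A" for _ in range(trials)) / trials
```

With these parameters the estimated $P_A$ is 1 well below $\alpha^{-1}$ and drops to (nearly) 0 above it, consistent with the step-function limit.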
First note that for a given patterns set the probability of the asymptotics (B) is nonzero only for isolated values of $\epsilon$. Under the assumption that the patterns are random and unbiased, we have calculated the probability of an $l$-fold appearance of $\Delta_m = 0$, summed over those isolated values of $\epsilon$. Using a Gaussian approximation at large $N$, we have found that this probability falls off as a power of $N$ depending on $l$ and $m$. The total probability can then be obtained by summing over the integer values $l$, $0 < l < s$, and over all the iteration steps $m = 0, 1, 2, \ldots$. As a result, the main contribution to the total probability comes from the $m = 0$ term, which is of order $N^{-3/2}$. \n\n3 LIMITING RETRIEVAL PROPERTIES \n\nHow does the reduction of the dimension of the \"memory space\" in the case (B), $s \to s' = s - l$, affect the retrieval properties of the system? They may vary considerably depending on $l$. In the most probable case, $l = 1$, a slight decrease in storage capacity is expected, but the size of the attraction basins will change negligibly. This is corroborated by calculating the stability parameter for each pattern $\mu$: \n\n$\kappa_i^\mu = \xi_i^\mu \sum_{j \ne i} P'_{ij} \xi_j^\mu. \quad (5)$ \n\nLet $s^{(m_1)}$ be the state vector whose normalized projection on $\mathcal{L}$ is $V = P s^{(m_1)}/|P s^{(m_1)}|$, such that \n\n$|P s^{(m_1)}| \sim \sqrt{\alpha N}, \quad V_i \sim N^{-1/2}, \quad \sum_{i=1}^N V_i \xi_i^\mu \sim 1.$ \n\nThen the stability parameter (5) is estimated by \n\n$\kappa_i^\mu = \xi_i^\mu \sum_{j \ne i} (P_{ij} - V_i V_j) \xi_j^\mu = (1 - P_{ii}) - \Big( V_i \xi_i^\mu \sum_{j=1}^N V_j \xi_j^\mu - V_i^2 \Big) \approx 1 - P_{ii} + O(N^{-1/2}).$ \n\nSince $P_{ii}$ has mean $\alpha$ and variance vanishing as $N \to \infty$, we thus conclude that the stability parameter differs only slightly from that calculated for the projector rule ($s = s'$) (Kanter & Sompolinsky, 1987). \n\nOn the other hand, in the situation $0 < s'/s \ll 1$ (the possible case $s' = 0$ is trivial) the system will be capable of retrieving only a few of the nominated patterns, and which ones we cannot specify beforehand.
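For the unperturbed projector rule the estimate above is in fact exact, $\kappa_i^\mu = 1 - P_{ii}$, which is easy to confirm numerically (our own sketch; the projector is built by orthonormalizing the patterns, so $s = p$ for random patterns):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 20                      # alpha = p/N = 0.1
xi = rng.choice([-1.0, 1.0], size=(p, N))

# Projector P onto the subspace spanned by the patterns (s = p here, since
# random patterns are linearly independent with probability one)
Q, _ = np.linalg.qr(xi.T)           # orthonormal basis of the pattern space
P = Q @ Q.T

# Stability parameter (5) with P' -> P (full projector rule, s' = s):
# kappa[mu, i] = xi[mu, i] * sum_{j != i} P[i, j] * xi[mu, j]
kappa = xi * (xi @ P) - np.diag(P) * xi ** 2
```

Because $P\xi^\mu = \xi^\mu$, every entry of `kappa` equals $1 - P_{ii}$, with mean $1 - \alpha$, so all patterns are stable fixed points.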
As mentioned above, such a strong reduction of the memory space is realized with very small but finite probability. \n\nThe main effect of the self-interactions $J_{ii}$ lies in a substantial decrease in storage capacity (Kanter & Sompolinsky, 1987). This is relevant when considering the cases (A) and (B). In the case (C) the system possesses an interesting dynamics, exhibiting a permanent walk over the state space: there are no fixed points at all. To show this, we write down the fixed point condition for an arbitrary state $S$: $S_i \sum_{j=1}^N \hat{J}_{ij} S_j > 0$, $i = 1, \ldots, N$. Using the explicit expression (3) for the limiting matrix $\hat{J}_{ij}$ and summing over $i$, we get $(\sum_j S_j \xi_j)^2 < 0$, which is impossible. \n\nIf the self-interactions are excluded from the local fields at the stage of network dynamics, the dynamics is driven by the energy function of the form $H = -(2N)^{-1} \sum_{i \ne j} \hat{J}_{ij} S_i S_j$. (Zero-temperature sequential dynamics, either random or regular, is assumed.) In the rest of this section we examine the dynamics of the network equipped with the limiting synaptic matrix (3) of case (C). We will show that in this limit the system lacks any associative memory. There is a single global maximum of $H$, given by $S_i = \mathrm{sgn}(\xi_i)$, and exponentially many shallow minima concentrated close to the hyperplane orthogonal to $\xi$. Moreover, it turns out that all the metastable states are protected by barriers of a single spin flip only, whatever the realization of the limiting vector $\xi$. Therefore, after a spin flips, the system can relax into a new nearby energy minimum. Through a sequence of steps, each consisting of a single spin flip followed by relaxation, one can, in principle, pass from one metastable state to another. \n\nWe will prove in what follows that from any given metastable state $S'$ one can pass to any other one $S$ through a sequence of steps, each consisting of a single spin flip and subsequent relaxation to some new metastable state.
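The absence of fixed points under the limiting matrix (3) can be confirmed exhaustively for a small network (our own sketch; $N = 10$ gives only $2^{10}$ states):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
N = 10
xi = rng.normal(size=N)
xi /= np.linalg.norm(xi)            # random unit vector, not a (+/-1)-vector
J = -np.outer(xi, xi)               # limiting matrix (3) of case (C)

# Fixed point condition: S_i * sum_j J_ij S_j > 0 for every i.
# Summing over i gives -(xi . S)^2 > 0, which is impossible,
# so the search below comes up empty for every realization of xi.
fixed_points = [S for S in product([-1.0, 1.0], repeat=N)
                if np.all(np.array(S) * (J @ np.array(S)) > 0)]
```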
Note that this reachability statement gives no indication of the order of spin flips when moving along a particular trajectory in the state space. \n\nWe now turn to the proof. Let us enumerate the spins in increasing order of the absolute values of the vector components, $0 \le |\xi_1| \le \ldots \le |\xi_N|$. The proof is carried out by induction on $j = 1, \ldots, N$, where $j$ is the maximal index for which $S'_j \ne S_j$. For $j = 1$ the statement is evident. Assuming that it holds for $1, \ldots, j-1$ ($2 \le j \le N$), let us prove it for $j$. One has $j = \max\{i: S'_i \ne S_i\}$. After flipping spin $j$ in the state $S'$, we allow relaxation by flipping spins $1, \ldots, j-1$ only. The system finally reaches a state $S^2$ realizing the conditional energy minimum under fixed $S_j, \ldots, S_N$. \n\nWe show that $S^2$ is a true energy minimum. There are two possibilities: \n\n(i) For some $i$, $1 \le i \le j-1$, one has $\mathrm{sgn}(\xi_i S_i^2) = \mathrm{sgn}(\xi^T S^2)$. The fixed point condition for $S^2$ can then be written as \n\n$|\xi^T S^2| \le \min\{|\xi_i|: 1 \le i \le j-1,\ \mathrm{sgn}(\xi_i S_i^2) = \mathrm{sgn}(\xi^T S^2)\}.$ \n\nFrom this, in view of the increasing order of the $|\xi_i|$'s, one gets immediately \n\n$|\xi^T S^2| \le \min\{|\xi_i|: 1 \le i \le N,\ \mathrm{sgn}(\xi_i S_i^2) = \mathrm{sgn}(\xi^T S^2)\},$ \n\nwhich implies that $S^2$ is a true energy minimum. \n\n(ii) If $\xi^T S^2 = 0$, the fixed point condition for $S^2$ is automatically satisfied. Otherwise, for $1 \le i \le j-1$ one has $\xi_i S_i^2 = -\mathrm{sgn}(\xi^T S^2)\, |\xi_i|$, and \n\n$\xi^T S^2 = -\mathrm{sgn}(\xi^T S^2) \sum_{i=1}^{j-1} |\xi_i| + \sum_{i=j}^N \xi_i S_i. \quad (6)$ \n\nFor the sake of definiteness, we set $\xi^T S > 0$. (The opposite case is treated analogously.) In this case $\xi^T S^2 > 0$, since otherwise, according to (6), it would be \n\n$0 \ge \xi^T S^2 = \sum_{i=1}^{j-1} |\xi_i| + \sum_{i=j}^N \xi_i S_i \ge \xi^T S,$ \n\nwhich contradicts our setting.
\n\nOne thus obtains \n\n$\xi^T S^2 = -\sum_{i=1}^{j-1} |\xi_i| + \sum_{i=j}^N \xi_i S_i \le \xi^T S, \quad (7)$ \n\nand using the fixed point condition for $S$ one gets \n\n$\xi^T S \le \min\{|\xi_i|: \xi_i S_i > 0\} \le \min\{|\xi_i|: j \le i \le N,\ \xi_i S_i > 0\} = \min\{|\xi_i|: \xi_i S_i^2 > 0\}. \quad (8)$ \n\nIn the latter equality of (8) one uses that $\xi_i S_i^2 < 0$ for $1 \le i \le j-1$ and $S_i^2 = S_i$ for $j \le i \le N$. Taking (7) and (8) into account, we arrive at the condition for $S^2$ to be a true energy minimum: \n\n$0 < \xi^T S^2 \le \min\{|\xi_i|: \xi_i S_i^2 > 0\}.$ \n\nBy the inductive hypothesis, since $S_i^2 = S_i$ for $j \le i \le N$, from the state $S^2$ one can pass to $S$, and therefore from $S'$ through $S^2$ to $S$. This proves the statement. \n\nIn general, metastable states might be grouped in clusters surrounded by high energy barriers. The meaning of the proven statement resides in excluding the possibility of even such a type of memory. Conversely, by allowing a sequence of single spin flips (for instance, at finite temperature) it is possible to walk through the whole set of metastable states. \n\n4 CONCLUSION \n\nIn this paper we have begun the study of the probabilities of the different asymptotics of the convergent unlearning algorithm, considering the case of unbiased random patterns. We have also shown that failed unlearning results in total memory breakdown. \n\nReferences \n\nHopfield, J.J., Feinstein, D.I. & Palmer, R.G. (1983) \"Unlearning\" has a stabilizing effect in collective memories. Nature 304:158-159. \n\nvan Hemmen, J.L. & Klemmer, N. (1992) Unlearning and its relevance to REM sleep: Decorrelating correlated data. In J.G. Taylor et al (eds.), Neural Network Dynamics, pp. 30-43. London: Springer. \n\nWimbauer, U., Klemmer, N. & van Hemmen, J.L. (1994) Universality of unlearning. Neural Networks 7:261-270. \n\nPlakhov, A.Yu. & Semenov, S.A. (1994) Neural networks: iterative unlearning algorithm converging to the projector rule matrix. J. Phys. I France 4:253-260.
 \n\nPlakhov, A.Yu. (1995) Private communication. \n\nPlakhov, A.Yu. & Semenov, S.A. (1995) Preprint IPT. \n\nKanter, I. & Sompolinsky, H. (1987) Associative recall of memory without errors. Phys. Rev. A 35:380-392.", "award": [], "sourceid": 1130, "authors": [{"given_name": "Serguei", "family_name": "Semenov", "institution": null}, {"given_name": "Irina", "family_name": "Shuvalova", "institution": null}]}