{"title": "Convergence of the Wake-Sleep Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 239, "page_last": 245, "abstract": null, "full_text": "Convergence of the Wake-Sleep Algorithm \n\nShiro Ikeda \nPRESTO, JST \nWako, Saitama, 351-0198, Japan \nshiro@brain.riken.go.jp \n\nShun-ichi Amari \nRIKEN Brain Science Institute \nWako, Saitama, 351-0198, Japan \namari@brain.riken.go.jp \n\nHiroyuki Nakahara \nRIKEN Brain Science Institute \nhiro@brain.riken.go.jp \n\nAbstract \n\nThe W-S (Wake-Sleep) algorithm is a simple learning rule for models with hidden variables. It can be applied to a factor analysis model, which is a linear version of the Helmholtz machine, but even for this simple model its convergence has not been proved theoretically. In this article, we give a geometrical understanding of the W-S algorithm, contrasting it with the EM (Expectation-Maximization) algorithm and the em algorithm. As a result, we prove the convergence of the W-S algorithm for the factor analysis model. We also give a condition for its convergence in general models. \n\n1 INTRODUCTION \n\nThe W-S algorithm[5] is a simple Hebbian learning algorithm. Neal and Dayan applied the W-S algorithm to a factor analysis model[7], which can be seen as a linear version of the Helmholtz machine[3]. As mentioned in [7], the convergence of the W-S algorithm has not been proved theoretically even for this simple model. \n\nFrom the similarity of the W-S and EM algorithms, and also from empirical results, the W-S algorithm seems to work for a factor analysis model. But there is an essential difference between the W-S and the EM algorithms. In this article, we present the em algorithm[2], the information-geometrical version of the EM algorithm, and describe this essential difference. 
From this result, we show that we cannot rely on this similarity as the reason the W-S algorithm works. However, even with this difference, the W-S algorithm works on the factor analysis model, and we can prove it theoretically. We give the proof and also give a condition for the W-S algorithm to work in general models. \n\n2 FACTOR ANALYSIS MODEL AND THE W-S ALGORITHM \n\nA factor analysis model with a single factor is defined as the following generative model, \n\nGenerative model: x = μ + y g + ε, \n\nwhere x = (x_1, ..., x_n)^T is an n-dimensional real-valued visible input, y ~ N(0, 1) is the single invisible factor, g is the vector of \"factor loadings\", μ is the overall mean vector, which is set to zero in this article, and ε ~ N(0, Σ) is noise with a diagonal covariance matrix Σ = diag(σ_i^2). In a Helmholtz machine, this generative model is accompanied by a recognition model, which is defined as, \n\nRecognition model: y = r^T x + δ, \n\nwhere r is the vector of recognition weights and δ ~ N(0, s^2) is noise. \n\nWhen data x_1, ..., x_N are given, we want to estimate the MLE (Maximum Likelihood Estimator) of g and Σ. The W-S algorithm can be applied[7] to learn this model. \n\nWake-phase: From the training set {x_s}, choose a number of x randomly and, for each datum, generate y according to the recognition model y = r_t^T x + δ, δ ~ N(0, s_t^2). Update g and Σ as follows using these x's and y's, where α is a small positive number and β is slightly less than 1, \n\ng_{t+1} = g_t + α⟨(x − g_t y) y⟩   (1) \nσ_{i,t+1}^2 = β σ_{i,t}^2 + (1 − β)⟨(x_i − g_{i,t} y)^2⟩,   (2) \n\nwhere ⟨·⟩ denotes averaging over the chosen data. \n\nSleep-phase: According to the updated generative model x = y g_{t+1} + ε, y ~ N(0, 1), ε ~ N(0, diag(σ_{t+1}^2)), generate a number of x and y, and update r and s^2 as, \n\nr_{t+1} = r_t + α⟨(y − r_t^T x) x⟩   (3) \ns_{t+1}^2 = β s_t^2 + (1 − β)⟨(y − r_t^T x)^2⟩.   (4) \n\nBy iterating these two phases, the algorithm tries to find the MLE as the converged point. \n\nFor the following discussion, let us define two probability densities p and q, where p is the density of the generative model and q is that of the recognition model. Let θ = (g, Σ); the generative model gives the density function of x and y as, \n\np(y, x; θ) = exp(−(1/2)(y, x^T) A (y, x^T)^T − ψ(θ)), \nA = [[1 + g^T Σ^{-1} g, −g^T Σ^{-1}], [−Σ^{-1} g, Σ^{-1}]], ψ(θ) = (1/2)(Σ_i log σ_i^2 + (n + 1) log 2π),   (5) \n\nwhile the recognition model gives the distribution of y conditional on x as, \n\nq(y|x; η) ~ N(r^T x, s^2), \n\nwhere η = (r, s^2). From the data x_1, ..., x_N, we define, \n\nC = (1/N) Σ_{s=1}^N x_s x_s^T, q(x) ~ N(0, C). \n\nWith this q(x), we define q(y, x; η) as, \n\nq(y, x; η) = q(x) q(y|x; η) = exp(−(1/2)(y, x^T) B (y, x^T)^T − ψ(η))   (6) \nB = [[1/s^2, −r^T/s^2], [−r/s^2, C^{-1} + r r^T/s^2]], ψ(η) = (1/2)(log s^2 + log|C| + (n + 1) log 2π). \n\n3 THE EM AND THE em ALGORITHMS FOR A FACTOR ANALYSIS MODEL \n\nIt has been mentioned that the W-S algorithm is similar to the EM algorithm[4] ([5][7]), but there is an essential difference between them. In this section, we first show the EM algorithm. We also describe the em algorithm[2], which gives an information-geometrical understanding of the EM algorithm. With these results, we show the difference between the W-S and the EM algorithms in the next section. \n\nThe EM algorithm consists of the following two steps. \n\nE-step: Define Q(θ, θ_t) as, \n\nQ(θ, θ_t) = (1/N) Σ_{s=1}^N E_{p(y|x_s; θ_t)}[log p(y, x_s; θ)]. \n\nM-step: Update θ as, \n\nθ_{t+1} = argmax_θ Q(θ, θ_t), \ng_{t+1} = (1 + g_t^T Σ_t^{-1} g_t) C Σ_t^{-1} g_t / (1 + g_t^T Σ_t^{-1} g_t + g_t^T Σ_t^{-1} C Σ_t^{-1} g_t), \nΣ_{t+1} = diag(C − g_{t+1} g_t^T Σ_t^{-1} C / (1 + g_t^T Σ_t^{-1} g_t)).   (7) \n\nHere E_p[·] denotes taking the average with respect to the probability distribution p. The iteration of these two steps converges to give the MLE. \n\nThe EM algorithm only uses the generative model, but the em algorithm[2] also uses the recognition model. The em algorithm consists of the e- and m-steps, which are defined as the e- and m-projections[1] between the two manifolds M and D, defined as follows. \n\nModel manifold M: M = {p(y, x; θ) | θ = (g, diag(σ_i^2)), g ∈ R^n, 0 < σ_i < ∞}. \nData manifold D: D = {q(y, x; η) | η = (r, s^2), r ∈ R^n, 0 < s < ∞}; q(x) includes the matrix C, which is defined by the data, hence the name \"data manifold\". \n\nFigure 1: Information geometrical understanding of the em algorithm \n\nFigure 1 schematically shows the em algorithm. It consists of the e- and m-steps; on each step, the parameters of the recognition and the generative models are updated respectively. \n\ne-step: Update η as the e-projection of p(y, x; θ_t) onto D, \n\nη_{t+1} = argmin_η KL(q(η), p(θ_t))   (8) \nr_{t+1} = Σ_t^{-1} g_t / (1 + g_t^T Σ_t^{-1} g_t), s_{t+1}^2 = 1 / (1 + g_t^T Σ_t^{-1} g_t),   (9) \n\nwhere KL(q(η), p(θ)) is the Kullback-Leibler divergence defined as, \n\nKL(q(η), p(θ)) = E_{q(y,x;η)}[log(q(y, x; η) / p(y, x; θ))]. \n\nm-step: Update θ as the m-projection of q(y, x; η_{t+1}) onto M, \n\nθ_{t+1} = argmin_θ KL(q(η_{t+1}), p(θ))   (10) \ng_{t+1} = C r_{t+1} / (s_{t+1}^2 + r_{t+1}^T C r_{t+1}), Σ_{t+1} = diag(C − g_{t+1} r_{t+1}^T C).   (11) \n\nBy substituting (9) for r_{t+1} and s_{t+1}^2 in (11), it is easily proved that (11) is equivalent to (7), and hence the em and EM algorithms are equivalent. 
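As a concrete check of this equivalence, the closed-form e-step (9) and m-step (11) can be iterated directly. The following is a minimal numerical sketch, not from the paper: the synthetic data, the iteration count, and names such as g_true and sigma2_true are our own illustrative choices, assuming a single-factor model with zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the generative model x = y g + eps
# (g_true, sigma2_true, N are illustrative choices, not from the paper).
n, N = 5, 2000
g_true = rng.normal(size=n)
sigma2_true = rng.uniform(0.1, 0.5, size=n)
y = rng.normal(size=N)
X = y[:, None] * g_true + rng.normal(size=(N, n)) * np.sqrt(sigma2_true)

C = X.T @ X / N                    # sample covariance (mu = 0)

g = rng.normal(size=n)             # factor loadings
Sigma = np.ones(n)                 # diagonal of the noise covariance
for _ in range(200):
    gamma = 1.0 + g @ (g / Sigma)  # 1 + g^T Sigma^{-1} g
    r = (g / Sigma) / gamma        # e-step (9): recognition weights
    s2 = 1.0 / gamma               # e-step (9): recognition noise
    g = C @ r / (s2 + r @ C @ r)             # m-step (11): new loadings
    Sigma = np.diag(C - np.outer(g, r) @ C)  # m-step (11): new noise variances
```

Since substituting the e-step values into the m-step reproduces (7), this loop is exactly the EM iteration for the factor analysis model; after a few hundred iterations the estimated loading direction should be close to g_true up to sign.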
\n\n4 THE DIFFERENCE BETWEEN THE W-S AND THE EM ALGORITHMS \n\nThe wake-phase corresponds, in the stochastic sense, to a gradient flow of the m-step[7]. But the sleep-phase is not a gradient flow of the e-step. To make this clear, we examine the W-S phases in detail in this section. \n\nFirst, the averages of (1), (2), (3) and (4) are, \n\ng_{t+1} = g_t − α(s_t^2 + r_t^T C r_t)(g_t − C r_t / (s_t^2 + r_t^T C r_t))   (12) \nΣ_{t+1} = Σ_t − (1 − β)(Σ_t − diag(C − 2(C r_t) g_t^T + (s_t^2 + r_t^T C r_t) g_t g_t^T))   (13) \nr_{t+1} = r_t − α(Σ_{t+1} + g_{t+1} g_{t+1}^T)(r_t − Σ_{t+1}^{-1} g_{t+1} / (1 + g_{t+1}^T Σ_{t+1}^{-1} g_{t+1}))   (14) \ns_{t+1}^2 = s_t^2 − (1 − β)(s_t^2 − ((1 − g_{t+1}^T r_t)^2 + r_t^T Σ_{t+1} r_t)).   (15) \n\nThe K-L divergence KL(q(η), p(θ)) is rewritten as, \n\nKL(q(η), p(θ)) = (1/2) tr(B^{-1} A) − (n + 1)/2 + ψ(θ) − ψ(η), \n\nand its derivatives with respect to θ = (g, Σ) are, \n\n∂/∂g KL(q(η), p(θ)) = (s^2 + r^T C r) Σ^{-1} (g − C r / (s^2 + r^T C r))   (16) \n∂/∂Σ KL(q(η), p(θ)) = (1/2) Σ^{-2} (Σ − diag(C − 2 C r g^T + (s^2 + r^T C r) g g^T)).   (17) \n\nWith these results, we can rewrite the wake-phase as, \n\ng_{t+1} = g_t − α Σ_t ∂/∂g_t KL(q(η_t), p(θ_t))   (18) \nΣ_{t+1} = Σ_t − 2(1 − β) Σ_t^2 ∂/∂Σ_t KL(q(η_t), p(θ_t)).   (19) \n\nSince Σ is a positive definite matrix, the wake-phase is a gradient flow of the m-step, which is defined in (10). \nOn the other hand, KL(p(θ), q(η)) is, \n\nKL(p(θ), q(η)) = (1/2) tr(A^{-1} B) − (n + 1)/2 + ψ(η) − ψ(θ), \n\nand its derivatives with respect to r and s^2 are, \n\n∂/∂r KL(p(θ), q(η)) = (1/s^2)(Σ + g g^T)(r − Σ^{-1} g / (1 + g^T Σ^{-1} g))   (20) \n∂/∂(s^2) KL(p(θ), q(η)) = (1/(2 s^4))(s^2 − ((1 − g^T r)^2 + r^T Σ r)).   (21) \n\nTherefore, the sleep-phase can be rewritten as, \n\nr_{t+1} = r_t − α s_t^2 ∂/∂r_t KL(p(θ_{t+1}), q(η_t))   (22) \ns_{t+1}^2 = s_t^2 − 2(1 − β)(s_t^2)^2 ∂/∂(s_t^2) KL(p(θ_{t+1}), q(η_t)).   (23) \n\nThese are also a gradient flow, but because of the asymmetry of the K-L divergence, (22) and (23) differ from the on-line version of the e-step. This is the essential difference between the EM and the W-S algorithms. Therefore, we cannot prove the convergence of the W-S algorithm from the similarity of these two algorithms[7]. \n\nFigure 2: The Wake-Sleep algorithm \n\n5 CONVERGENCE PROPERTY \n\nWe want to prove the convergence of the W-S algorithm. If we could find a Lyapunov function for the W-S algorithm, convergence would be guaranteed[7], but we could not find one. Instead of finding a Lyapunov function, we take continuous time and examine the behavior of the parameters and of the K-L divergence KL(q(η_t), p(θ_t)). \n\nKL(q(η), p(θ)) is a function of g, r, Σ and s^2. The derivatives with respect to g and Σ are given in (16) and (17). The derivatives with respect to r and s^2 are, \n\n∂/∂r KL(q(η), p(θ)) = (1 + g^T Σ^{-1} g) C (r − Σ^{-1} g / (1 + g^T Σ^{-1} g))   (24) \n∂/∂(s^2) KL(q(η), p(θ)) = ((1 + g^T Σ^{-1} g) / (2 s^2))(s^2 − 1 / (1 + g^T Σ^{-1} g)).   (25) \n\nOn the other hand, we set the flows of g, r, Σ and s^2 to follow the updates of the W-S algorithm, that is, \n\ndg/dt = −α(s_t^2 + r_t^T C r_t)(g_t − C r_t / (s_t^2 + r_t^T C r_t))   (26) \ndr/dt = −α(Σ_t + g_t g_t^T)(r_t − Σ_t^{-1} g_t / (1 + g_t^T Σ_t^{-1} g_t))   (27) \ndΣ/dt = −(1 − β)(Σ_t − diag(C − 2(C r_t) g_t^T + (s_t^2 + r_t^T C r_t) g_t g_t^T))   (28) \nd(s^2)/dt = −(1 − β)(s_t^2 − ((1 − g_t^T r_t)^2 + r_t^T Σ_t r_t)).   (29) \n\nWith these results, dKL(q(η_t), p(θ_t))/dt is, \n\ndKL(q(η_t), p(θ_t))/dt = (∂KL/∂g)(dg/dt) + (∂KL/∂r)(dr/dt) + (∂KL/∂Σ)(dΣ/dt) + (∂KL/∂(s^2))(d(s^2)/dt).   (30) \n\nThe first three terms on the right side of (30) are non-positive; only the fourth is not obviously so, \n\n(∂KL/∂(s^2))(d(s^2)/dt) = −(1 − β)((1 + g_t^T Σ_t^{-1} g_t) / (2 s_t^2))(s_t^2 − 1 / (1 + g_t^T Σ_t^{-1} g_t))(s_t^2 − ((1 − g_t^T r_t)^2 + r_t^T Σ_t r_t)). \n\nKL(q(η_t), p(θ_t)) does not decrease while s_t^2 stays between ((1 − g_t^T r_t)^2 + r_t^T Σ_t r_t) and 1/(1 + g_t^T Σ_t^{-1} g_t). But if the following equation holds, these two quantities coincide, \n\nr_t = Σ_t^{-1} g_t / (1 + g_t^T Σ_t^{-1} g_t).   (31) \n\nFrom the above results, the flows of g, r and Σ decrease KL(q(η_t), p(θ_t)) at all times. s_t^2 converges to ((1 − g_t^T r_t)^2 + r_t^T Σ_t r_t), which does not always decrease KL(q(η_t), p(θ_t)); but since r converges so as to satisfy (31) independently of s_t^2, s_t^2 finally converges to 1/(1 + g_t^T Σ_t^{-1} g_t). \n\n6 DISCUSSION \n\nThis factor analysis model has the special property that p(y|x; θ) and q(y|x; η) are equal when the following conditions are satisfied[7], \n\nr = Σ^{-1} g / (1 + g^T Σ^{-1} g), s^2 = 1 / (1 + g^T Σ^{-1} g).   (32) \n\nFrom this property, minimizing KL(p(θ), q(η)) and KL(q(η), p(θ)) with respect to η leads to the same point. Writing \n\nKL(p(θ), q(η)) = E_{p(x;θ)}[log(p(x; θ) / q(x))] + E_{p(y,x;θ)}[log(p(y|x; θ) / q(y|x; η))]   (33) \nKL(q(η), p(θ)) = E_{q(x)}[log(q(x) / p(x; θ))] + E_{q(y,x;η)}[log(q(y|x; η) / p(y|x; θ))],   (34) \n\nwe see that both (33) and (34) include η only in the second term on the right side. If (32) holds, those two terms are 0; therefore KL(p(θ), q(η)) and KL(q(η), p(θ)) are minimized at the same point. \n\nWe can use this result to modify the W-S algorithm. If the factor analysis model does not alternate the wake- and sleep-phases but \"sleeps well\" until convergence, it finds the η that the e-step of the em algorithm gives. Since the wake-phase is a gradient flow of the m-step, this procedure converges to the MLE. This algorithm is equivalent to what is called the GEM (Generalized EM) algorithm[6]. \n\nThe reason the GEM and the W-S algorithms work is that p(y|x; θ) is realizable by the recognition model q(y|x; η). 
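In this realizable setting, the averaged wake-sleep updates (12)-(15) can be simulated to watch r approach the condition (31) and s^2 approach 1/(1 + g^T Σ^{-1} g). This is a numerical sketch of our own, assuming a covariance C that lies exactly in the single-factor model family; the step sizes, initialization, and all variable names are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# A covariance realizable by the single-factor model (illustrative parameters).
n = 4
g_true = rng.normal(size=n)
Sigma_true = rng.uniform(0.2, 0.6, size=n)
C = np.outer(g_true, g_true) + np.diag(Sigma_true)

alpha, beta = 0.05, 0.95                # illustrative step sizes
g = g_true + 0.3 * rng.normal(size=n)   # start near the answer for brevity
Sigma = np.ones(n)                      # diagonal of Sigma
r = 0.1 * rng.normal(size=n)
s2 = 1.0

for _ in range(4000):
    # wake-phase averages (12), (13): both use the current g
    v = s2 + r @ C @ r
    g_new = g - alpha * v * (g - C @ r / v)
    Sigma = Sigma - (1 - beta) * (
        Sigma - np.diag(C - 2 * np.outer(C @ r, g) + v * np.outer(g, g)))
    g = g_new
    # sleep-phase averages (14), (15): use the updated g and Sigma
    gamma = 1.0 + g @ (g / Sigma)
    xi = (1.0 - g @ r) ** 2 + r @ (Sigma * r)
    r = r - alpha * (np.diag(Sigma) + np.outer(g, g)) @ (r - (g / Sigma) / gamma)
    s2 = s2 - (1 - beta) * (s2 - xi)

gamma = 1.0 + g @ (g / Sigma)           # final 1 + g^T Sigma^{-1} g
```

If the flow behaves as the convergence argument above predicts, r ends up at Σ^{-1} g/(1 + g^T Σ^{-1} g), s^2 at 1/(1 + g^T Σ^{-1} g), and g g^T + diag(Σ) reproduces C, i.e. the MLE in this realizable case.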
If the recognition model is not realizable, the W-S algorithm won't converge to the MLE. We show an example and conclude this article. \n\nSuppose that the mean of y in the recognition model is not a linear function of r and x but comes through a nonlinear function f(·) as, \n\nRecognition model: y = f(r^T x) + δ, \n\nwhere f(·) is a scalar function of a single input and δ ~ N(0, s^2) is noise. In this case, the generative model is in general not realizable by the recognition model, and minimizing (33) with respect to η leads to a different point from minimizing (34). KL(p(θ), q(η)) is minimized when r and s^2 satisfy, \n\nE_{p(x;θ)}[f(r^T x) f'(r^T x) x] = E_{p(y,x;θ)}[y f'(r^T x) x]   (35) \ns^2 = 1 − E_{p(y,x;θ)}[2 y f(r^T x) − f^2(r^T x)],   (36) \n\nwhile KL(q(η), p(θ)) is minimized when r and s^2 satisfy, \n\n(1 + g^T Σ^{-1} g) E_{q(x)}[f(r^T x) f'(r^T x) x] = E_{q(x)}[f'(r^T x) x x^T] Σ^{-1} g   (37) \ns^2 = 1 / (1 + g^T Σ^{-1} g).   (38) \n\nHere, f'(·) is the derivative of f(·). If f(·) is a linear function, f'(·) is a constant and (35), (36) and (37), (38) give the same η as (32), but in general they are different. \n\nWe studied a factor analysis model and showed that the W-S algorithm works on this model. From further analysis, we showed that the reason the algorithm works on this model is that the generative model is realizable by the recognition model. We also showed, with a simple example, that the W-S algorithm does not converge to the MLE if the generative model is not realizable. \n\nAcknowledgment \n\nWe thank Dr. Noboru Murata for very useful discussions on this work. \n\nReferences \n\n[1] Shun-ichi Amari. Differential-Geometrical Methods in Statistics, volume 28 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1985. \n\n[2] Shun-ichi Amari. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379-1408, 1995. 
\n\n[3] Peter Dayan, Geoffrey E. Hinton, and Radford M. Neal. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995. \n\n[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statistical Society, Series B, 39:1-38, 1977. \n\n[5] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The \"wake-sleep\" algorithm for unsupervised neural networks. Science, 268:1158-1160, 1995. \n\n[6] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley series in probability and statistics. John Wiley & Sons, Inc., 1997. \n\n[7] Radford M. Neal and Peter Dayan. Factor analysis using delta-rule wake-sleep learning. Neural Computation, 9(8):1781-1803, 1997. \n", "award": [], "sourceid": 1632, "authors": [{"given_name": "Shiro", "family_name": "Ikeda", "institution": null}, {"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}, {"given_name": "Hiroyuki", "family_name": "Nakahara", "institution": null}]}