{"title": "Ensemble Learning and Linear Response Theory for ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 542, "page_last": 548, "abstract": null, "full_text": "Ensemble Learning and Linear Response Theory for ICA \n\nPedro A.d.F.R. H\u00f8jen-S\u00f8rensen\u00b9, Ole Winther\u00b2, Lars Kai Hansen\u00b9 \n\n\u00b9 Department of Mathematical Modelling, Technical University of Denmark B321 \n\nDK-2800 Lyngby, Denmark, phs,lkhansen@imm.dtu.dk \n\n\u00b2 Theoretical Physics, Lund University, S\u00f6lvegatan 14 A \n\nS-223 62 Lund, Sweden, winther@nimis.thep.lu.se \n\nAbstract \n\nWe propose a general Bayesian framework for performing independent component analysis (ICA) which relies on ensemble learning and linear response theory known from statistical physics. We apply it to both discrete and continuous sources. For the continuous source the underdetermined (overcomplete) case is studied. The naive mean-field approach fails in this case, whereas linear response theory, which gives an improved estimate of covariances, is very efficient. The examples given are for sources without temporal correlations. However, this derivation can easily be extended to treat temporal correlations. Finally, the framework offers a simple way of generating new ICA algorithms without needing to define the prior distribution of the sources explicitly. \n\n1 Introduction \n\nReconstruction of statistically independent source signals from linear mixtures is an active research field. For historical background and early references see e.g. [1]. The source separation problem has a Bayesian formulation, see e.g. [2, 3], for which there has been some recent progress based on ensemble learning [4]. \n\nIn the Bayesian framework, the covariances of the sources are needed in order to estimate the mixing matrix and the noise level. 
Unfortunately, ensemble learning using factorized trial distributions only treats self-interactions correctly and trivially predicts $\\langle S_i S_j \\rangle - \\langle S_i \\rangle \\langle S_j \\rangle = 0$ for $i \\neq j$. This naive mean-field (NMF) approximation, first introduced in the neural computing context by Ref. [5] for Boltzmann machine learning, may completely fail in some cases [6]. Recently, Kappen and Rodriguez [6] introduced an efficient learning algorithm for Boltzmann machines based on linear response (LR) theory. LR theory gives a recipe for computing an improved approximation to the covariances directly from the solution to the NMF equations [7]. \n\nEnsemble learning has been applied in many contexts within neural computation, e.g. for sigmoid belief networks [8], where advanced mean field methods such as LR theory or TAP [9] may also be applicable. In this paper, we show how LR theory can be applied to independent component analysis (ICA). The performance of this approach is compared to the NMF approach. We observe that NMF may fail for high noise levels and binary sources and for the underdetermined continuous case. In these cases the NMF approach ignores one of the sources and consequently overestimates the noise. The LR approach, on the other hand, succeeds in all cases studied. \n\nThe derivation of the mean-field equations is kept completely general and is thus valid for a general source prior (without temporal correlations). The final equations show that the mean-field framework may be used to propose ICA algorithms for which the source prior is only defined implicitly. \n\n2 Probabilistic ICA \n\nFollowing Ref. [10], we consider a collection of $N$ temporal measurements, $X = \\{X_{dt}\\}$, where $X_{dt}$ denotes the measurement at the $d$th sensor at time $t$. Similarly, let $S = \\{S_{mt}\\}$ denote a collection of $M$ mutually independent sources where $S_{m\\cdot}$ 
is the $m$th source which in general may have temporal correlations. The measured signals $X$ are assumed to be an instantaneous linear mixing of the sources corrupted with additive Gaussian noise $\\Gamma$, that is, \n\n$$X = AS + \\Gamma , \\tag{1}$$ \n\nwhere $A$ is the mixing matrix. Furthermore, to simplify this exposition the noise is assumed to be iid Gaussian with variance $\\sigma^2$. The likelihood of the parameters is then given by \n\n$$P(X|A, \\sigma^2) = \\int dS \\, P(X|A, \\sigma^2, S) P(S) , \\tag{2}$$ \n\nwhere $P(S)$ is the prior on the sources which might include temporal correlations. We will, however, throughout this paper assume that the sources are temporally uncorrelated. We choose to estimate the mixing matrix $A$ and noise level $\\sigma^2$ by Maximum Likelihood (ML-II). The saddlepoint of $P(X|A, \\sigma^2)$ is attained at \n\n$$\\frac{\\partial \\log P(X|A, \\sigma^2)}{\\partial A} = 0 : \\quad A = X \\langle S \\rangle^T \\langle S S^T \\rangle^{-1} \\tag{3}$$ \n\n$$\\frac{\\partial \\log P(X|A, \\sigma^2)}{\\partial \\sigma^2} = 0 : \\quad \\sigma^2 = \\frac{1}{DN} \\langle \\mathrm{Tr} (X - AS)^T (X - AS) \\rangle , \\tag{4}$$ \n\nwhere $\\langle \\cdot \\rangle$ denotes an average over the posterior and $D$ is the number of sensors. \n\n3 Mean field theory \n\nFirst, we derive mean field equations using ensemble learning. Secondly, using linear response theory, we obtain improved estimates of the off-diagonal terms of $\\langle S S^T \\rangle$ which are needed for estimating $A$ and $\\sigma^2$. The following derivation is performed for an arbitrary source prior. \n\n3.1 Ensemble learning \n\nWe adopt a standard ensemble learning approach and approximate \n\n$$P(S|X, A, \\sigma^2) = \\frac{P(X|A, \\sigma^2, S) P(S)}{P(X|A, \\sigma^2)} \\tag{5}$$ \n\nin a family of product distributions $Q(S) = \\prod_{mt} Q(S_{mt})$. It has been shown in Ref. [11] that for a Gaussian $P(X|A, \\sigma^2, S)$, the optimal choice of $Q(S_{mt})$ is given by a Gaussian times the prior: \n\n$$Q(S_{mt}) \\propto P(S_{mt}) \\, e^{\\frac{1}{2} \\lambda_{mt} S_{mt}^2 + \\gamma_{mt} S_{mt}} . \\tag{6}$$ \n\nIn the following, it is convenient to use standard physics notation to keep everything as general as possible. 
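We therefore parameterize the Gaussian as \n\n$$P(X|A, \\sigma^2, S) = P(X|J, h, S) = C e^{\\frac{1}{2} \\mathrm{Tr}(S^T J S) + \\mathrm{Tr}(h^T S)} , \\tag{7}$$ \n\nwhere $J = -A^T A / \\sigma^2$ is the $M \\times M$ interaction matrix and $h = A^T X / \\sigma^2$ has the same dimensions as the source matrix $S$. Note that $h$ acts as an external field from which we can obtain all moments of the sources. This is a property that we will make use of in the next section when we derive the linear response corrections. The Kullback-Leibler divergence between the optimal product distribution $Q(S)$ and the true source posterior is given by \n\n$$KL = \\int dS \\, Q(S) \\ln \\frac{Q(S)}{P(S|X, A, \\sigma^2)} = \\ln P(X|A, \\sigma^2) - \\ln \\tilde{P}(X|A, \\sigma^2) \\tag{8}$$ \n\n$$\\ln \\tilde{P}(X|A, \\sigma^2) = \\sum_{mt} \\log \\int dS \\, P(S) e^{\\frac{1}{2} \\lambda_{mt} S^2 + \\gamma_{mt} S} + \\frac{1}{2} \\sum_{mt} (J_{mm} - \\lambda_{mt}) \\langle S_{mt}^2 \\rangle + \\frac{1}{2} \\mathrm{Tr} \\langle S \\rangle^T (J - \\mathrm{diag}(J)) \\langle S \\rangle + \\mathrm{Tr} (h - \\gamma)^T \\langle S \\rangle + \\ln C , \\tag{9}$$ \n\nwhere $\\tilde{P}(X|A, \\sigma^2)$ is the naive mean field approximation to the likelihood and $\\mathrm{diag}(J)$ is the diagonal matrix of $J$. The saddlepoints define the mean field equations: \n\n$$\\frac{\\partial KL}{\\partial \\langle S \\rangle} = 0 : \\quad \\gamma = h + (J - \\mathrm{diag}(J)) \\langle S \\rangle \\tag{10}$$ \n\n$$\\frac{\\partial KL}{\\partial \\langle S_{mt}^2 \\rangle} = 0 : \\quad \\lambda_{mt} = J_{mm} . \\tag{11}$$ \n\nThe remaining two equations depend explicitly on the source prior, $P(S)$: \n\n$$\\frac{\\partial KL}{\\partial \\gamma_{mt}} = 0 : \\quad \\langle S_{mt} \\rangle = \\frac{\\partial}{\\partial \\gamma_{mt}} \\log \\int dS_{mt} \\, P(S_{mt}) e^{\\frac{1}{2} \\lambda_{mt} S_{mt}^2 + \\gamma_{mt} S_{mt}} \\equiv f(\\gamma_{mt}, \\lambda_{mt}) \\tag{12}$$ \n\n$$\\frac{\\partial KL}{\\partial \\lambda_{mt}} = 0 : \\quad \\langle S_{mt}^2 \\rangle = 2 \\frac{\\partial}{\\partial \\lambda_{mt}} \\log \\int dS_{mt} \\, P(S_{mt}) e^{\\frac{1}{2} \\lambda_{mt} S_{mt}^2 + \\gamma_{mt} S_{mt}} . \\tag{13}$$ \n\nIn section 4, we calculate $f(\\gamma_{mt}, \\lambda_{mt})$ for some of the prior distributions found in the ICA literature. \n\n3.2 Linear response theory \n\nAs mentioned already, $h$ acts as an external field. This makes it possible to calculate the means and covariances as derivatives of $\\log P(X|J, h)$, i.e. \n\n$$\\langle S_{mt} \\rangle = \\frac{\\partial \\log P(X|J, h)}{\\partial h_{mt}} \\tag{14}$$ \n\n$$\\chi_{mm'}^{tt'} \\equiv \\langle S_{mt} S_{m't'} \\rangle - \\langle S_{mt} \\rangle \\langle S_{m't'} \\rangle = \\frac{\\partial^2 \\log P(X|J, h)}{\\partial h_{m't'} \\partial h_{mt}} = \\frac{\\partial \\langle S_{mt} \\rangle}{\\partial h_{m't'}} . \\tag{15}$$ \n\nTo derive an equation for $\\chi_{mm'}^{tt'}$, we use eqs. 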
(10), (11) and (12) to get \n\n$$\\chi_{mm'}^{tt'} = \\frac{\\partial f(\\gamma_{mt}, \\lambda_{mt})}{\\partial \\gamma_{mt}} \\frac{\\partial \\gamma_{mt}}{\\partial h_{m't'}} = \\frac{\\partial f(\\gamma_{mt}, \\lambda_{mt})}{\\partial \\gamma_{mt}} \\left( \\sum_{m'', m'' \\neq m} J_{mm''} \\chi_{m''m'}^{tt'} + \\delta_{mm'} \\delta_{tt'} \\right) . \\tag{16}$$ \n\nFigure 1: Binary source recovery for low noise level ($M = 2$, $D = 2$). Shows from left to right +/- the column vectors of: the true $A$ (with the observations superimposed); the estimated $A$ (NMF); the estimated $A$ (LR). \n\nFigure 2: Binary source recovery for low noise level ($M = 2$, $D = 2$). Shows the dynamics of the fix-point iterations. From left to right: +/- the column vectors of $A$ (NMF); +/- the column vectors of $A$ (LR); variance $\\sigma^2$ versus iteration (solid: NMF, dashed: LR, thick dash-dotted: the true empirical noise variance). \n\nWe now see that the $\\chi$-matrix factorizes in time, $\\chi_{mm'}^{tt'} = \\delta_{tt'} \\chi_{mm'}^{t}$. This is a direct consequence of the fact that the model has no temporal correlations. The above equation is linear and may straightforwardly be solved to yield \n\n$$\\chi_{mm'}^{t} = \\left[ (\\Lambda_t - J)^{-1} \\right]_{mm'} , \\tag{17}$$ \n\nwhere we have defined the diagonal matrix \n\n$$\\Lambda_t = \\mathrm{diag} \\left( \\left( \\frac{\\partial f(\\gamma_{1t}, \\lambda_{1t})}{\\partial \\gamma_{1t}} \\right)^{-1} + J_{11} , \\ldots , \\left( \\frac{\\partial f(\\gamma_{Mt}, \\lambda_{Mt})}{\\partial \\gamma_{Mt}} \\right)^{-1} + J_{MM} \\right) .$$ \n\nAt this point it is appropriate to explain why linear response theory is more precise than using the factorized distribution, which predicts $\\chi_{mm'}^{t} = 0$ for non-diagonal terms. Here, we give an argument that can be found in Parisi's book on statistical field theory [7]: Let us assume that the approximate and exact distributions are close in some sense, i.e. $Q(S) - P(S|X, A, \\sigma^2) = \\varepsilon$; then $\\langle S_{mt} S_{m't} \\rangle_{ex} = \\langle S_{mt} S_{m't} \\rangle_{ap} + O(\\varepsilon)$. Mean field theory gives a lower bound on the log-likelihood since $KL$, eq. (8), is non-negative. Consequently, the linear term vanishes in the expansion of the log-likelihood: $\\log P(X|A, \\sigma^2) = \\log \\tilde{P}(X|A, \\sigma^2) + O(\\varepsilon^2)$. It is therefore more precise to obtain moments of the variables through derivatives of the approximate log-likelihood, i.e. by linear response. \n\nA final remark to complete the picture: if $\\mathrm{diag}(J)$ in eq. (10) is exchanged with $\\mathrm{diag}(\\lambda_{1t}, \\ldots, \\lambda_{Mt})$, and likewise in the definition of $\\Lambda_t$ above, we get the TAP equations [9]. The TAP equation for $\\lambda_{mt}$ is $\\chi_{mm}^{t} = \\frac{\\partial f(\\gamma_{mt}, \\lambda_{mt})}{\\partial \\gamma_{mt}} = \\left[ (\\Lambda_t - J)^{-1} \\right]_{mm}$. \n\nFigure 3: Binary source recovery for high noise level ($M = 2$, $D = 2$). Shows from left to right +/- the column vectors of: the true $A$ (with the observations superimposed); the estimated $A$ (NMF); the estimated $A$ (LR). \n\n
Figure 4: Binary source recovery for high noise level ($M = 2$, $D = 2$). Same plot as in figure 2. \n\n4 Examples \n\nIn this section we compare the LR approach and the NMF approach on the noisy ICA model. The two approaches are demonstrated using binary and continuous sources. \n\n4.1 Binary source \n\nIndependent component analysis of binary sources (e.g. studied in [12]) is considered for data transmission using binary modulation schemes such as MSK or biphase (Manchester) codes. Here, we consider a binary source $S_{mt} \\in \\{-1, 1\\}$ with prior distribution $P(S_{mt}) = \\frac{1}{2} [\\delta(S_{mt} - 1) + \\delta(S_{mt} + 1)]$. In this case we get the well known mean field equations $\\langle S_{mt} \\rangle = \\tanh(\\gamma_{mt})$. Figures 1 and 2 show the results of the NMF approach as well as the LR approach in a low-noise setting using two sources ($M = 2$) and two sensors ($D = 2$). Figures 3 and 4 show the same but in a high-noise setting. The dynamical plots show the trajectory of the fix-point iteration where 'x' marks the starting point and 'o' the final point. Ideally, the noise-less measurements would consist of the four combinations (with signs) of the columns in the mixing matrix. However, due to the noise, the measurements will be scattered around these \"prototype\" observations. \n\nIn the low-noise setting both approaches find good approximations to the true mixing matrix and sources. However, the convergence rate of the LR approach is found to be faster. For high noise variance the NMF approach fails to recover the true statistics. It is seen that one of the directions in the mixing matrix vanishes, which in turn results in overestimating the noise variance. 
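
The binary case can be sketched for a single time point as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: the coupling J12 and the field h are made-up values, and the helper names nmf_means, lr_covariance and exact_covariance are hypothetical. The sketch iterates the NMF equations (10) and (12) with f = tanh, forms the LR covariance of eq. (17) using that the derivative of tanh is one minus the squared mean, and compares the off-diagonal covariance with exact enumeration over the four source states.

```python
import math

# Illustrative (assumed) values: off-diagonal coupling J12 and field h for M = 2.
# The diagonal of J only adds a constant for binary sources because S^2 = 1.
J12 = -0.2
h = [0.3, -0.1]

def nmf_means(h, J12, iters=200):
    # Fixed-point iteration of mean_m = tanh(h_m + J_mm' * mean_m'), m' != m
    m = [0.0, 0.0]
    for _ in range(iters):
        m[0] = math.tanh(h[0] + J12 * m[1])
        m[1] = math.tanh(h[1] + J12 * m[0])
    return m

def lr_covariance(m, J12):
    # chi = (Lambda - J)^{-1}; the diagonal of Lambda - J is 1 / (1 - mean_m^2)
    a = 1.0 / (1.0 - m[0] ** 2)
    d = 1.0 / (1.0 - m[1] ** 2)
    b = -J12
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def exact_covariance(h, J12):
    # Enumerate the four states of (S_1, S_2) in {-1, 1}^2
    z = s1 = s2 = s12 = 0.0
    for a in (-1, 1):
        for b in (-1, 1):
            w = math.exp(J12 * a * b + h[0] * a + h[1] * b)
            z += w
            s1 += a * w
            s2 += b * w
            s12 += a * b * w
    return s12 / z - (s1 / z) * (s2 / z)

means = nmf_means(h, J12)
chi_lr = lr_covariance(means, J12)
cov_exact = exact_covariance(h, J12)
print('NMF off-diagonal covariance: 0 (by construction)')
print('LR off-diagonal covariance:  ', chi_lr[0][1])
print('exact off-diagonal covariance:', cov_exact)
```

For this weakly coupled toy setting the LR value lies much closer to the enumerated covariance than the NMF value of zero. The example says nothing about the high-noise regime, where the fixed-point iteration itself may behave differently.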
\n\nFigure 5: Overcomplete continuous source recovery with $M = 3$ and $D = 2$. Shows from left to right the observations and +/- the column vectors of: the true $A$; the estimated $A$ (NMF); the estimated $A$ (LR). \n\nFigure 6: Overcomplete continuous source recovery with $M = 3$ and $D = 2$. Same plot as in figure 2. Note that the initial iteration step for $A$ is very large. \n\n4.2 Continuous Source \n\nTo give a tractable example which illustrates the improvement by LR, we consider the Gaussian prior $P(S_{mt}) \\propto \\exp(-\\alpha S_{mt}^2 / 2)$ (not suitable for source separation). This leads to $f(\\gamma_{mt}, \\lambda_{mt}) = \\gamma_{mt} / (\\alpha - \\lambda_{mt})$. Since we have a factorized distribution, ensemble learning predicts $\\langle S_{mt} S_{m't'} \\rangle - \\langle S_{mt} \\rangle \\langle S_{m't'} \\rangle = \\delta_{mm'} \\delta_{tt'} (\\alpha - \\lambda_{mt})^{-1} = \\delta_{mm'} \\delta_{tt'} (\\alpha - J_{mm})^{-1}$, where the second equality follows from eq. (11). Linear response, eq. (17), gives $\\langle S_{mt} S_{m't'} \\rangle - \\langle S_{mt} \\rangle \\langle S_{m't'} \\rangle = \\delta_{tt'} [ (\\alpha I - J)^{-1} ]_{mm'}$, which is identical to the exact result obtained by direct integration. \n\nFor the popular choice of prior $P(S_{mt}) = \\frac{1}{\\pi \\cosh S_{mt}}$ [1], it is not possible to derive $f(\\gamma_{mt}, \\lambda_{mt})$ analytically. However, $f(\\gamma_{mt}, \\lambda_{mt})$ can be calculated analytically for the very similar Laplace distribution. Both these examples have positive kurtosis. 
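
The statement about the Gaussian prior can be checked numerically. The sketch below assumes illustrative values for A, the noise variance and alpha with M = D = 2, and its helper names (transpose, matmul, inverse) are hypothetical. It builds the LR covariance of eq. (17), where Lambda_t reduces to alpha times the identity because the derivative of f with respect to gamma is 1 / (alpha - J_mm), and compares it with the posterior covariance computed independently through the matrix inversion lemma.

```python
# Assumed illustrative values, not taken from the paper's experiments.
A = [[1.0, 0.5], [0.2, 1.0]]   # mixing matrix (D = 2 sensors, M = 2 sources)
sigma2 = 0.5                   # noise variance
alpha = 1.0                    # precision of the Gaussian source prior

def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

At = transpose(A)
AtA = matmul(At, A)
J = [[-AtA[i][j] / sigma2 for j in range(2)] for i in range(2)]

# NMF: factorized covariance, variance 1 / (alpha - J_mm), no off-diagonal terms
nmf_var = [1.0 / (alpha - J[m][m]) for m in range(2)]

# LR, eq. (17): Lambda_mm = (df/dgamma)^{-1} + J_mm = (alpha - J_mm) + J_mm = alpha
lam = [1.0 / nmf_var[m] + J[m][m] for m in range(2)]
chi_lr = inverse([[(lam[i] if i == j else 0.0) - J[i][j] for j in range(2)]
                  for i in range(2)])

# Exact posterior covariance by the matrix inversion lemma:
# I / alpha - A^T (sigma2 I + A A^T / alpha)^{-1} A / alpha^2
AAt = matmul(A, At)
inner = inverse([[(sigma2 if i == j else 0.0) + AAt[i][j] / alpha
                  for j in range(2)] for i in range(2)])
corr = matmul(matmul(At, inner), A)
exact = [[(1.0 / alpha if i == j else 0.0) - corr[i][j] / alpha ** 2
          for j in range(2)] for i in range(2)]

print('LR covariance:   ', chi_lr)
print('exact covariance:', exact)
print('NMF variances:   ', nmf_var)
```

In this toy setting the LR matrix agrees with the exact covariance up to rounding, while the NMF variances are smaller than the exact diagonal and the NMF off-diagonal terms are zero.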
\nMean field equations for negative kurtosis can be obtained using the prior $P(S_{mt}) \\propto \\exp(-(S_{mt} - \\mu)^2 / 2) + \\exp(-(S_{mt} + \\mu)^2 / 2)$ [1], leading to \n\n$$f(\\gamma_{mt}, \\lambda_{mt}) = \\frac{\\gamma_{mt} + \\mu \\tanh \\left( \\frac{\\mu \\gamma_{mt}}{1 - \\lambda_{mt}} \\right)}{1 - \\lambda_{mt}} .$$ \n\nFigures 5 and 6 show simulations using this source prior with $\\mu = 1$ in an overcomplete setting with $D = 2$ and $M = 3$. Note that $\\mu = 1$ yields a unimodal source distribution and is hence qualitatively different from the bimodal prior considered in the binary case. In the overcomplete setting the NMF approach fails to recover the true sources. See [13] for further discussion of the overcomplete case. \n\n5 Conclusion \n\nWe have presented a general ICA mean field framework based upon ensemble learning and linear response theory. The naive mean-field approach (pure ensemble learning) fails in some cases and we speculate that it is incapable of handling the overcomplete case (more sources than sensors). Linear response theory, on the other hand, succeeds in all the examples studied. \n\nThere are two directions in which we plan to extend this work: (1) to sources with temporal correlations and (2) to source models defined not by a parametric source prior, but directly in terms of the function $f$, which defines the mean field equations. Starting directly from the $f$-function makes it possible to test a whole range of implicitly defined source priors. A detailed analysis of a large selection of constrained and unconstrained source priors as well as comparisons of LR and the TAP approach can be found in [14]. \n\nAcknowledgments \n\nPHS wishes to thank Mike Jordan for stimulating discussions on the mean field and variational methods. This research is supported by the Swedish Foundation for Strategic Research as well as the Danish Research Councils through the Computational Neural Network Center (CONNECT) and the THOR Center for Neuroinformatics. 
\n\nReferences \n\n[1] T.-W. Lee: Independent Component Analysis, Kluwer Academic Publishers, Boston (1998). \n[2] A. Belouchrani and J.-F. Cardoso: Maximum Likelihood Source Separation by the Expectation-Maximization Technique: Deterministic and Stochastic Implementation, in Proc. NOLTA, 49-53 (1995). \n[3] D. MacKay: Maximum Likelihood and Covariant Algorithms for Independent Components Analysis, \"Draft 3.7\" (1996). \n[4] H. Lappalainen and J. W. Miskin: Ensemble Learning, Advances in Independent Component Analysis, Ed. M. Girolami, In press (2000). \n[5] C. Peterson and J. Anderson: A Mean Field Theory Learning Algorithm for Neural Networks, Complex Systems 1, 995-1019 (1987). \n[6] H. J. Kappen and F. B. Rodriguez: Efficient Learning in Boltzmann Machines Using Linear Response Theory, Neural Computation 10, 1137-1156 (1998). \n[7] G. Parisi: Statistical Field Theory, Addison Wesley, Reading, Massachusetts (1988). \n[8] L. K. Saul, T. Jaakkola and M. I. Jordan: Mean Field Theory of Sigmoid Belief Networks, Journal of Artificial Intelligence Research 4, 61-76 (1996). \n[9] M. Opper and O. Winther: Tractable Approximations for Probabilistic Models: The Adaptive TAP Mean Field Approach, Submitted to Phys. Rev. Lett. (2000). \n[10] L. K. Hansen: Blind Separation of Noisy Image Mixtures, Advances in Independent Component Analysis, Ed. M. Girolami, In press (2000). \n[11] L. Csat\u00f3, E. Fokoue, M. Opper, B. Schottky and O. Winther: Efficient Approaches to Gaussian Process Classification, in Advances in Neural Information Processing Systems 12 (NIPS'99), Eds. S. A. Solla, T. K. Leen, and K.-R. Muller, MIT Press (2000). \n[12] A.-J. van der Veen: Analytical Method for Blind Binary Signal Separation, IEEE Trans. on Signal Processing 45(4), 1078-1082 (1997). \n[13] M. S. Lewicki and T. J. 
Sejnowski: Learning Overcomplete Representations, Neural Computation 12, 337-365 (2000). \n[14] P. A. d. F. R. H\u00f8jen-S\u00f8rensen, O. Winther and L. K. Hansen: Mean Field Approaches to Independent Component Analysis, In preparation. \n\n", "award": [], "sourceid": 1806, "authors": [{"given_name": "Pedro", "family_name": "H\u00f8jen-S\u00f8rensen", "institution": null}, {"given_name": "Ole", "family_name": "Winther", "institution": null}, {"given_name": "Lars", "family_name": "Hansen", "institution": null}]}