{"title": "A New Learning Algorithm for Blind Signal Separation", "book": "Advances in Neural Information Processing Systems", "page_first": 757, "page_last": 763, "abstract": null, "full_text": "A  New Learning Algorithm for  Blind \n\nSignal  Separation \n\ns.  Amari* \n\nUniversity of Tokyo \n\nBunkyo-ku,  Tokyo 113,  JAPAN \n\namari@sat.t. u-tokyo.ac.jp \n\nLab.  for  Artificial Brain Systems \n\nA.  Cichocki \n\nFRP, RIKEN \n\nWako-Shi,  Saitama, 351-01,  JAPAN \n\ncia@kamo.riken.go.jp \n\nLab.  for  Information Representation \n\nH.  H. Yang \n\nFRP, RIKEN \n\nWako-Shi,  Saitama, 351-01,  JAPAN \n\nhhy@koala.riken.go.jp \n\nAbstract \n\nA new on-line learning algorithm which minimizes a  statistical de(cid:173)\npendency among outputs is  derived for  blind separation  of mixed \nsignals.  The  dependency  is  measured  by  the  average  mutual  in(cid:173)\nformation  (MI)  of the outputs.  The source signals and the mixing \nmatrix  are  unknown  except  for  the  number  of  the  sources.  The \nGram-Charlier  expansion  instead  of the  Edgeworth  expansion  is \nused  in  evaluating  the  MI.  The  natural gradient  approach is  used \nto minimize the MI. A novel activation function is proposed for the \non-line learning algorithm  which  has  an equivariant property and \nis easily implemented on a  neural network like model.  The validity \nof the new learning algorithm are verified by computer simulations. \n\n1 \n\nINTRODUCTION \n\nThe problem  of blind signal separation arises in many areas such as speech recog(cid:173)\nnition, data communication, sensor signal processing, and medical science.  Several \nneural  network  algorithms  [3,  5,  7]  have  been  proposed  for  solving  this  problem. \nThe performance of these algorithms is  usually  affected  by the selection  of the ac(cid:173)\ntivation functions  for  the formal  neurons  in  the networks.  However,  all  activation \n\n\u00b7Lab.  for  Information Representation, FRP, RIKEN,  Wako-shi,  Saitama, JAPAN \n\n\f758 \n\nS.  AMARI, A.  CICHOCKI, H. H.  YANG \n\nfunctions  attempted are  monotonic  and  the  selections  of the  activation  functions \nare ad hoc.  How should the activation function be determined to minimize the MI? \nIs it necessary to use monotonic activation functions for  blind signal separation?  In \nthis paper,  we  shall answer  these  questions  and  give  an  on-line  learning algorithm \nwhich  uses  a  non-monotonic  activation function  selected  by the independent  com(cid:173)\nponent  analysis  (ICA)  [7].  Moreover,  we  shall  show  a  rigorous  way  to  derive  the \nlearning algorithm which has the equivariant  property, i.e.,  the performance of the \nalgorithm is  independent of the scaling parameters in the noiseless  case. \n\n2  PROBLEM \n\nLet  us  consider  unknown  source  signals  Si(t), i  =  1\"\", n  which  are  mutually in(cid:173)\ndependent.  It is  assumed  that  the sources  Si(t)  are stationary  processes  and each \nsource has moments of any order with a zero mean.  The model for the sensor output \nis \n\nx(t) =  As(t) \n\nis  an  unknown  non-singular  mixing  matrix,  set) \n\nwhere  A  E  R nxn \n[Sl(t),\u00b7 .. , sn(t)]T  and x(t) =  [Xl(t), .. \u00b7, xn(t)JT. \nWithout knowing the source signals and the mixing matrix, we want to recover the \noriginal signals from  the observations x(t) by the following  linear transform: \n\nyet)  =  Wx(t) \n\nwhere yet) =  [yl(t), ... , yn(t)]T and WE R nxn  is  a  de-mixing matrix. 
\nIt is impossible to obtain the original sources Si(t)  because they are not identifiable \nin the statistical sense.  However,  except for  a  permutation of indices,  it is  possible \nto  obtain  CiSi(t)  where  the  constants  Ci  are indefinite  nonzero  scalar factors.  The \nsource signals are identifiable in this sense.  So our goal is to find the matrix W  such \nthat  [yl, ... , yn]  coincides  with a  permutation of  [Sl, ... ,sn]  except  for  the  scalar \nfactors.  The solution W  is  the matrix which finds  all  independent  components in \nthe  outputs.  An  on-line  learning  algorithm for  W  is  needed  which  performs  the \nICA. It is possible to find such a learning algorithm which minimizes the dependency \namong the outputs.  The algorithm in [6]  is based on the Edgeworth expansion[8] for \nevaluating the marginal negentropy.  Both the  Gram-Charlier expansion[8]  and the \nEdgeworth expansion[8]  can be used  to approximate probability density functions. \nWe  shall use the Gram-Charlier expansion  instead of the Edgeworth expansion for \nevaluating the marginal entropy.  We  shall explain  the reason in section 3. \n\n3 \n\nINDEPENDENCE  OF  SIGNALS \n\nThe mathematical framework for the ICA is formulated in [6].  The basic idea of the \nICA is to minimize the dependency among the output components.  The dependency \nis  measured by the Kullback-Leibler divergence between  the joint and the  product \nof the marginal distributions of the outputs: \n\nD(W) = \n\np(y) \n\n( a) dy \n\nJ \np(y) log rr \n\na=lPa  y \n\n(1) \n\nwhere Pa(ya) is the marginal probability density function  (pdf).  Note the Kullback(cid:173)\nLeibler  divergence  has  some  invariant  properties  from  the differential-geometrical \npoint of view[l]. \n\n\fA New  Learning Algorithm for  Blind Signal  Separation \n\n759 \n\nIt  is  easy to relate the Kullback-Leibler divergence D(W) to the average MI of y: \n\nD(W) =  -H(y) + LH(ya) \n\nn \n\na=l \n\n(2) \n\nwhere \n\nH(y) =  - J p(y) logp(y)dy, \nH(ya)  =  - J Pa(ya)logPa(ya)dya  is  the marginal entropy. \n\nThe minimization of the Kullback-Leibler divergence leads to an ICA algorithm for \nestimating W  in  [6]  where the Edgeworth expansion is  used to evaluate the negen(cid:173)\ntropy.  We  use  the  truncated  Gram-Charlier  expansion  to  evaluate  the  Kullback(cid:173)\nLeibler divergence.  The Edgeworth expansion has some advantages over the Gram(cid:173)\nCharlier expansion  only for  some  special distributions.  In  the case  of the  Gamma \ndistribution or the distribution of a random variable which is the sum of iid random \nvariables, the coefficients of the Edgeworth expansion decrease uniformly.  However, \nthere is  no  such advantage for  the mixed output  ya  in general cases. \nTo  calculate  each  H(ya)  in  (2),  we  shall  apply  the  Gram-Charlier  expansion  to \napproximate the  pdf Pa(ya).  Since  E[y]  =  E[W As]  = 0,  we  have  E[ya]  = 0.  To \nsimplify the calculations for  the entropy H(ya)  to be carried out  later,  we  assume \nm2  =  1.  
To calculate each H(y^a) in (2), we shall apply the Gram-Charlier expansion to approximate the pdf p_a(y^a). Since E[y] = E[WAs] = 0, we have E[y^a] = 0. To simplify the calculations for the entropy H(y^a) to be carried out later, we assume m_2^a = 1. We use the following truncated Gram-Charlier expansion to approximate the pdf p_a(y^a):

    p_a(y^a) \approx \alpha(y^a)\left\{1 + \frac{\kappa_3^a}{3!} H_3(y^a) + \frac{\kappa_4^a}{4!} H_4(y^a)\right\}        (3)

where \kappa_3^a = m_3^a, \kappa_4^a = m_4^a - 3, m_k^a = E[(y^a)^k] is the k-th order moment of y^a, \alpha(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}, and H_k(y) are the Chebyshev-Hermite polynomials defined by the identity

    \left(-\frac{d}{dy}\right)^k \alpha(y) = H_k(y)\,\alpha(y).

We prefer the Gram-Charlier expansion to the Edgeworth expansion because the former clearly shows how \kappa_3^a and \kappa_4^a affect the approximation of the pdf. The last term in (3) characterizes non-Gaussian distributions. To apply (3) to calculate H(y^a), we need the following integrals:

    -\int \alpha(y) H_2(y) \log \alpha(y) \, dy = 1        (4)

    \int \alpha(y) (H_2(y))^2 H_4(y) \, dy = 24.        (5)

These integrals can be obtained easily from the following results for the moments of a Gaussian random variable N(0,1):

    \int y^{2k+1} \alpha(y) \, dy = 0, \qquad \int y^{2k} \alpha(y) \, dy = 1 \cdot 3 \cdots (2k-1).        (6)

By using the expansion

    \log(1+y) \approx y - \frac{y^2}{2} + O(y^3)

and taking account of the orthogonality relations of the Chebyshev-Hermite polynomials and (4)-(5), the entropy H(y^a) is expanded as

    H(y^a) \approx \frac{1}{2}\log(2\pi e) - \frac{(\kappa_3^a)^2}{2 \cdot 3!} - \frac{(\kappa_4^a)^2}{2 \cdot 4!} + \frac{5}{8}(\kappa_3^a)^2 \kappa_4^a + \frac{1}{16}(\kappa_4^a)^3.        (7)

It is easy to calculate -\int \alpha(y) \log \alpha(y) \, dy = \frac{1}{2}\log(2\pi e).

From y = Wx, we have H(y) = H(x) + \log|\det(W)|. Applying (7) and the above expressions to (2), we have

    D(W) \approx -H(x) - \log|\det(W)| + \frac{n}{2}\log(2\pi e) - \sum_{a=1}^{n}\left[\frac{(\kappa_3^a)^2}{2 \cdot 3!} + \frac{(\kappa_4^a)^2}{2 \cdot 4!} - \frac{5}{8}(\kappa_3^a)^2 \kappa_4^a - \frac{1}{16}(\kappa_4^a)^3\right].

4 A NEW LEARNING ALGORITHM

To obtain the gradient descent algorithm to update W recursively, we need to calculate \partial D / \partial w_k^a, where w_k^a is the (a,k) element of W in the a-th row and k-th column. Let cof(w_k^a) be the cofactor of w_k^a in W. It is not difficult to derive the following:

    \frac{\partial \log|\det(W)|}{\partial w_k^a} = \frac{\mathrm{cof}(w_k^a)}{\det(W)} = (W^{-T})_k^a, \qquad \frac{\partial \kappa_3^a}{\partial w_k^a} = 3E[(y^a)^2 x_k], \qquad \frac{\partial \kappa_4^a}{\partial w_k^a} = 4E[(y^a)^3 x_k]        (8)

where (W^{-T})_k^a denotes the (a,k) element of (W^T)^{-1}. From (8), we obtain

    \frac{\partial D}{\partial w_k^a} \approx -(W^{-T})_k^a + f(\kappa_3^a, \kappa_4^a) E[(y^a)^2 x_k] + g(\kappa_3^a, \kappa_4^a) E[(y^a)^3 x_k]        (9)

where

    f(y,z) = -\frac{1}{2} y + \frac{15}{4} y z, \qquad g(y,z) = -\frac{1}{6} z + \frac{5}{2} y^2 + \frac{3}{4} z^2.

From (9), we obtain the gradient descent algorithm to update W recursively:

    \frac{d w_k^a}{dt} = -\eta(t) \frac{\partial D}{\partial w_k^a} = \eta(t)\left\{(W^{-T})_k^a - f(\kappa_3^a, \kappa_4^a) E[(y^a)^2 x_k] - g(\kappa_3^a, \kappa_4^a) E[(y^a)^3 x_k]\right\}        (10)

where \eta(t) is a learning rate function. Replacing the expectation values in (10) by their instantaneous values, we have the stochastic gradient descent algorithm:

    \frac{d w_k^a}{dt} = \eta(t)\left\{(W^{-T})_k^a - f(\kappa_3^a, \kappa_4^a)(y^a)^2 x_k - g(\kappa_3^a, \kappa_4^a)(y^a)^3 x_k\right\}.        (11)

We need to use the following adaptive algorithm to compute \kappa_3^a and \kappa_4^a in (11):

    \frac{d\kappa_3^a}{dt} = -\mu(t)\left(\kappa_3^a - (y^a)^3\right), \qquad \frac{d\kappa_4^a}{dt} = -\mu(t)\left(\kappa_4^a - (y^a)^4 + 3\right)        (12)

where \mu(t) is another learning rate function.

The performance of the algorithm (11) relies on the estimation of the third and fourth order cumulants performed by the algorithm (12).
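To make the update concrete (an illustrative discrete-time sketch, not the authors' implementation; the Euler step and the variable names are assumptions, and k3, k4 are per-output cumulant estimates initialised, e.g., at zero), one step of (11) together with the cumulant tracker (12) can be written as:

    import numpy as np

    def f(k3, k4):
        # f(kappa3, kappa4) = -1/2 kappa3 + 15/4 kappa3*kappa4, cf. (9)
        return -0.5 * k3 + 3.75 * k3 * k4

    def g(k3, k4):
        # g(kappa3, kappa4) = -1/6 kappa4 + 5/2 kappa3^2 + 3/4 kappa4^2, cf. (9)
        return -k4 / 6.0 + 2.5 * k3**2 + 0.75 * k4**2

    def step(W, k3, k4, x, eta, mu):
        # One Euler step of (11) and (12) for a single observation x.
        y = W @ x
        k3 = k3 - mu * (k3 - y**3)            # adaptive estimate of kappa_3^a
        k4 = k4 - mu * (k4 - y**4 + 3.0)      # adaptive estimate of kappa_4^a
        dW = (np.linalg.inv(W).T
              - np.outer(f(k3, k4) * y**2, x)
              - np.outer(g(k3, k4) * y**3, x))
        return W + eta * dW, k3, k4

The explicit matrix inversion W^{-T} in every step is what the natural gradient form derived below removes.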
Replacing the moments of the random variables in (11) by their instantaneous values, we obtain the following algorithm, which is a direct but coarse implementation of (11):

    \frac{d w_k^a}{dt} = \eta(t)\left\{(W^{-T})_k^a - f(y^a) x_k\right\}        (13)

where the activation function f(y) is defined by

    f(y) = \frac{3}{4} y^{11} + \frac{25}{4} y^{9} - \frac{14}{3} y^{7} - \frac{47}{4} y^{5} + \frac{29}{4} y^{3}.        (14)

Note that the activation function f(y) is an odd function, not a monotonic function. The equation (13) can be written in a matrix form:

    \frac{dW}{dt} = \eta(t)\left\{(W^T)^{-1} - f(y) x^T\right\}.        (15)

This equation can be further simplified as follows by substituting x^T W^T = y^T:

    \frac{dW}{dt} = \eta(t)\left\{I - f(y) y^T\right\}(W^T)^{-1}        (16)

where f(y) = (f(y^1), ..., f(y^n))^T. The above equation is based on the gradient descent algorithm (10) with the following matrix form:

    \frac{dW}{dt} = -\eta(t) \frac{\partial D}{\partial W}.        (17)

From the information geometry perspective [1], since the mixing matrix A is non-singular we had better replace the above algorithm by the following natural gradient descent algorithm:

    \frac{dW}{dt} = -\eta(t) \frac{\partial D}{\partial W} W^T W.        (18)

Applying the previous approximation of the gradient \partial D / \partial W to (18), we obtain the following algorithm:

    \frac{dW}{dt} = \eta(t)\left\{I - f(y) y^T\right\} W        (19)

which has the same "equivariant" property as the algorithms developed in [4, 5].

Although the on-line learning algorithms (16) and (19) look similar to those in [3, 7] and [5] respectively, the selection of the activation function in this paper is rational, not ad hoc. The activation function (14) is determined by the ICA. It is a non-monotonic activation function different from those used in [3, 5, 7].

There is a simple way to justify the stability of the algorithm (19). Let Vec(·) denote an operator on a matrix which cascades the columns of the matrix from the left to the right and forms a column vector. Note that this operator has the following property:

    Vec(ABC) = (C^T \otimes A)\,Vec(B).        (20)

Both the gradient descent algorithm and the natural gradient descent algorithm are special cases of the following general gradient descent algorithm:

    \frac{d\,Vec(W)}{dt} = -\eta(t)\, P\, \frac{\partial D}{\partial\,Vec(W)}        (21)

where P is a symmetric and positive definite matrix. It is trivial that (21) becomes (17) when P = I. When P = W^T W \otimes I, applying (20) to (21), we obtain

    \frac{d\,Vec(W)}{dt} = -\eta(t)\,(W^T W \otimes I)\,\frac{\partial D}{\partial\,Vec(W)} = -\eta(t)\,Vec\!\left(\frac{\partial D}{\partial W} W^T W\right)

and this equation implies (18). So the natural gradient descent algorithm updates W(t) in the direction of decreasing the dependency D(W). The information geometry theory [1] explains why the natural gradient descent algorithm should be used to minimize the MI.

Another on-line learning algorithm for blind separation using a recurrent network was proposed in [2]. For this algorithm, the activation function (14) also works well. In practice, other activation functions such as those proposed in [2]-[6] may also be used in (19). However, the performance of the algorithm for such functions usually depends on the distributions of the sources. The activation function (14) works for relatively general cases in which the pdf of each source can be approximated by the truncated Gram-Charlier expansion.
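As a minimal sketch of the resulting update rule (again illustrative rather than the authors' code; the discrete-time Euler step is an assumption), (19) with the activation (14) reads:

    import numpy as np

    def activation(y):
        # Equation (14): f(y) = 3/4 y^11 + 25/4 y^9 - 14/3 y^7 - 47/4 y^5 + 29/4 y^3
        return (0.75 * y**11 + 6.25 * y**9 - (14.0 / 3.0) * y**7
                - 11.75 * y**5 + 7.25 * y**3)

    def natural_gradient_step(W, x, eta):
        # One Euler step of (19): dW/dt = eta(t) {I - f(y) y^T} W.
        y = W @ x
        return W + eta * (np.eye(W.shape[0]) - np.outer(activation(y), y)) @ W

Writing y = WAs and P = WA, (19) gives dP/dt = \eta(t)\{I - f(y)y^T\}P with y = Ps, so the evolution of the combined system P does not depend on A and W separately; this is one way to read the equivariant property mentioned above.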
\n\n5  SIMULATION \n\nIn order to check the validity and performance of the new on-line learning algorithm \n(19),  we  simulate it  on  the computer using  synthetic source  signals  and  a  random \nmixing matrix.  The extensive computer simulations have fully confirmed the theory \nand  the  validity  of the  algorithm  (19).  Due  to  the  limit  of space  we  present  here \nonly one illustrative example. \n\nExample: \n\nAssume  that  the following  three  unknown sources  are mixed  by a  random mixing \nmatrix A: \n\n[SI (t), S2(t), S3(t)]  =  [n(t), O.lsin( 400t)cos(30t), 0.01sign[sin(500t + 9cos( 40t))] \n\nwhere  net)  is  a  noise  source  uniformly  distributed in  the range  [-1, +1],  and  S2(t) \nand S3(t)  are two deterministic source signals.  The elements  of the mixing matrix \nA  are  randomly  chosen  in  [-1, +1].  The  learning  rate  is  exponentially  decreaSing \nto zero as rJ(t)  =  250exp( -5t). \nA  simulation  result  is  shown  in  Figure  1.  The  first  three  signals  denoted  by  Xl, \nX2  and  X3  represent  mixing  (sensor)  signals:  x l (t),  x2(t)  and  x3(t).  The  last \nthree signals denoted by 01,  02 and 03 represent  the output signals:  yl(t), y2(t), \nand  y3(t).  By  using  the  proposed  learning  algorithm,  the  neural  network  is  able \nto  extract  the deterministic signals  from  the observations after  approximately 500 \nmilliseconds. \nThe performance index El  is  defined by \n\n- 1) \n\nEl = tct  IPijl \n\ni=1  j=1  maxk IPikl \n\n- 1) + tct  IPijl \n\nj=l  i=l maxk IPkjl \n\nwhere  P  =  (Pij) =  WA. \n\n6  CONCLUSION \n\nThe major contribution of this  paper the rigorous  derivation  of the effective  blind \nseparation  algorithm  with  equivariant  property  based  on  the  minimization  of the \nMI  of the  outputs.  The  ICA  is  a  general  principle  to  design  algorithms  for  blind \nsignal  separation.  The  most  difficulties  in  applying  this  principle  are  to  evaluate \nthe  MI  of the  outputs  and  to  find  a  working  algorithm  which  decreases  the  MI. \nDifferent  from  the work  in  [6],  we  use  the  Gram-Charlier expansion instead of the \nEdgeworth expansion to calculate the marginal entropy in evaluating the MI.  Using \n\n\fA New  Learning Algorithm for  Blind Signal  Separation \n\n763 \n\nthe natural gradient method to minimize the MI, we have found an on-line learning \nalgorithm to find  a de-mixing matrix.  The algorithm has equivariant  property and \ncan be easily implemented on a  neural network like  model.  Our approach provides \na rational selection of the activation function for  the formal neurons in the network. \nThe algorithm has been simulated for  separating unknown source signals mixed by \na random mixing matrix.  Our theory and the validity of the new learning algorithm \nare verified  by the simulations. \n\no. \n04 \n0' \no \n\nI \n\nFigure 1:  The mixed and separated signals, and the performance index \n\nAcknowledgment \nWe would like to thank Dr.  Xiao Yan SU for  the proof-reading of the manuscript. \n\nReferences \n\n[1]  S.-I.  Amari.  Differential-Geometrical  Methods  in  Statistics,  Lecture  Notes  in \n\nStatistics  vol.28.  Springer,  1985. \n\n[2]  S.  Amari, A.  Cichocki, and H.  H.  Yang.  Recurrent neural networks for blind sep(cid:173)\n\naration of sources.  In  Proceedings  1995 International Symposium  on  Nonlinear \nTheory  and  Applications,  volume I,  pages 37-42, December 1995. \n\n[3]  A.  J. Bell and T . J . Sejnowski.  
6 CONCLUSION

The major contribution of this paper is the rigorous derivation of an effective blind separation algorithm with the equivariant property, based on the minimization of the MI of the outputs. The ICA is a general principle to design algorithms for blind signal separation. The main difficulties in applying this principle are to evaluate the MI of the outputs and to find a working algorithm which decreases the MI. Different from the work in [6], we use the Gram-Charlier expansion instead of the Edgeworth expansion to calculate the marginal entropy in evaluating the MI. Using the natural gradient method to minimize the MI, we have found an on-line learning algorithm to find a de-mixing matrix. The algorithm has the equivariant property and can be easily implemented on a neural network like model. Our approach provides a rational selection of the activation function for the formal neurons in the network. The algorithm has been simulated for separating unknown source signals mixed by a random mixing matrix. Our theory and the validity of the new learning algorithm are verified by the simulations.

Figure 1: The mixed and separated signals, and the performance index.

Acknowledgment

We would like to thank Dr. Xiao Yan Su for the proof-reading of the manuscript.

References

[1] S.-I. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, vol. 28. Springer, 1985.

[2] S. Amari, A. Cichocki, and H. H. Yang. Recurrent neural networks for blind separation of sources. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications, volume I, pages 37-42, December 1995.

[3] A. J. Bell and T. J. Sejnowski. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159, 1995.

[4] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation. To appear in IEEE Trans. on Signal Processing, 1996.

[5] A. Cichocki, R. Unbehauen, L. Moszczyński, and E. Rummert. A new on-line adaptive learning algorithm for blind separation of source signals. In ISANN94, pages 406-411, Taiwan, December 1994.

[6] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287-314, 1994.

[7] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.

[8] A. Stuart and J. K. Ord. Kendall's Advanced Theory of Statistics. Edward Arnold, 1994.
", "award": [], "sourceid": 1115, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}, {"given_name": "Andrzej", "family_name": "Cichocki", "institution": null}, {"given_name": "Howard", "family_name": "Yang", "institution": null}]}