{"title": "Unsupervised Classification with Non-Gaussian Mixture Models Using ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 508, "page_last": 514, "abstract": null, "full_text": "Unsupervised  Classification with \n\nNon-Gaussian Mixture  Models  using ICA \n\nTe-Won Lee,  Michael S.  Lewicki and Terrence Sejnowski \n\nHoward Hughes  Medical Institute \n\nComputational Neurobiology Laboratory \n\nThe Salk Institute \n\n10010 N.  Torrey Pines Road \n\nLa Jolla,  California 92037,  USA \n\n{tewon,lewicki,terry}Osalk.edu \n\nAbstract \n\nWe  present  an  unsupervised  classification  algorithm  based  on  an \nICA  mixture  model.  The  ICA  mixture  model  assumes  that  the \nobserved  data can  be  categorized  into  several  mutually  exclusive \ndata classes  in  which  the components  in each  class  are generated \nby  a  linear  mixture  of independent  sources.  The  algorithm  finds \nthe independent sources, the mixing matrix for  each class and also \ncomputes  the  class  membership  probability  for  each  data  point. \nThis  approach  extends  the  Gaussian  mixture  model  so  that  the \nclasses  can  have  non-Gaussian  structure.  We  demonstrate  that \nthis method can learn efficient codes to represent images of natural \nscenes and text.  The learned classes of basis functions yield a better \napproximation of the underlying distributions of the data, and thus \ncan provide greater coding efficiency.  We  believe that this method \nis  well  suited to  modeling structure in  high-dimensional  data and \nhas many potential applications. \n\n1 \n\nIntrod uction \n\nRecently,  Blind  Source  Separation  (BSS)  by  Independent  Component  Analysis \n(ICA)  has  shown  promise  in  signal  processing  applications  including  speech  en(cid:173)\nhancement  systems,  telecommunications  and  medical  signal  processing.  ICA  is  a \ntechnique for finding a linear non-orthogonal coordinate system in multivariate data. \nThe directions  of the axes  of this coordinate system are determined by the  data's \nsecond- and higher-order statistics.  The goal of the ICA is to linearly transform the \ndata such that the transformed variables are as statistically independent from each \n\n\fUnsupervised Classification with Non-Gaussian Mixture Models  Using ICA \n\n509 \n\nother as  possible  (Bell  and Sejnowski,  1995;  Cardoso  and Laheld,  1996;  Lee et al., \n1999a).  ICA  generalizes  the  technique  of  Principal  Component  Analysis  (PCA) \nand,  like  PCA, has proven a  useful  tool for  finding  structure in data. \n\nOne limitation  of ICA  is  the assumption  that the sources  are independent.  Here, \nwe  present  an  approach  for  relaxing  this  assumption  using  mixture  models.  In  a \nmixture model  (Duda and  Hart,  1973),  the observed  data can be  categorized into \nseveral  mutually  exclusive  classes.  When  the  class  variables  are  modeled  as  mul(cid:173)\ntivariate  Gaussian densities,  it is  called  a  Gaussian mixture model.  We  generalize \nthe  Gaussian  mixture  model  by  modeling  each  class  with  independent  variables \n(ICA  mixture  model).  This  allows  modeling  of classes  with  non-Gaussian  (e.g., \nplatykurtic or leptokurtic)  structure.  An  algorithm for  learning the  parameters is \nderived using the expectation maximization (EM)  algorithm.  In Lee et al.  (1999c), \nwe  demonstrated  that  this  approach  showed  improved  performance  in  data clas(cid:173)\nsification  problems.  
Here, we apply the algorithm to learning efficient codes for representing different types of images.

2 The ICA Mixture Model

We assume that the data were generated by a mixture density (Duda and Hart, 1973):

    p(x \mid \Theta) = \sum_{k=1}^{K} p(x \mid C_k, \theta_k) \, p(C_k),        (1)

where \Theta = (\theta_1, \ldots, \theta_K) are the unknown parameters for each p(x \mid C_k, \theta_k), called the component densities. We further assume that the number of classes, K, and the a priori probability, p(C_k), of each class are known. In the case of a Gaussian mixture model, p(x \mid C_k, \theta_k) \propto \mathcal{N}(\mu_k, \Sigma_k). Here we assume that the form of the component densities is non-Gaussian and that the data within each class are described by an ICA model:

    x_k = A_k s_k + b_k,        (2)

where A_k is an N \times M matrix (called the basis or mixing matrix) and b_k is the bias vector for class k. The vector s_k is called the source vector (its entries are also the coefficients for each basis vector). It is assumed that the individual sources s_i within each class are mutually independent across the data ensemble. For simplicity, we consider the case where A_k is full rank, i.e., the number of sources (M) is equal to the number of mixtures (N).

Figure 1 shows a simple example of a dataset that can be described by an ICA mixture model. Each class was generated from eq. (2) using a different A and b. Class (o) was generated by two uniformly distributed sources, whereas class (+) was generated by two Laplacian-distributed sources (p(s) \propto \exp(-|s|)). The task is to model the unlabeled data points, determining the parameters A_k and b_k for each class and the probability of each class, p(C_k \mid x, \theta_{1:K}), for each data point.

Figure 1: A simple example of classification with an ICA mixture model. There are two classes, (+) and (o); each class was generated by two independent variables, two bias terms and two basis vectors. Class (o) was generated by two uniformly distributed sources, as indicated next to the data class. Class (+) was generated by two Laplacian-distributed sources with a sharp peak at the bias and heavy tails. The inset graphs show the distributions of the source variables, s_{i,k}, for each basis vector.
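As a concrete illustration of the generative model in eq. (2), the following is a minimal sketch (not code from the paper) of how a Figure 1-style two-class dataset could be produced; the sample count, mixing matrices and bias vectors are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000  # samples per class (arbitrary)

# Class "o": two uniform sources, mixed by A0 and shifted by b0 (eq. 2)
A0 = np.array([[1.0, 0.5],
               [0.2, 1.0]])          # example mixing matrix (assumed)
b0 = np.array([-3.0, 2.0])           # example bias vector (assumed)
s0 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, T))   # unit-variance uniform sources
x0 = A0 @ s0 + b0[:, None]

# Class "+": two Laplacian sources, p(s) proportional to exp(-|s|), with a different A and b
A1 = np.array([[0.8, -0.6],
               [0.6,  0.8]])
b1 = np.array([4.0, -1.0])
s1 = rng.laplace(0.0, 1.0, size=(2, T))
x1 = A1 @ s1 + b1[:, None]

# Unlabeled data presented to the algorithm: the shuffled union of both classes
X = np.concatenate([x0, x1], axis=1)        # shape (N=2, 2T)
perm = rng.permutation(X.shape[1])
X = X[:, perm]
print(X.shape)
```

The unlabeled matrix X is what the mixture algorithm described next would receive as input.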
A learning algorithm can be derived by an expectation maximization approach (Ghahramani, 1994) and implemented in the following steps:

• Compute the log-likelihood of the data for each class:

    \log p(x \mid C_k, \theta_k) = \log p(s_k) - \log |\det A_k|,        (3)

  where \theta_k = \{A_k, b_k, s_k\}.

• Compute the probability of each class given the data vector x:

    p(C_k \mid x, \theta_{1:K}) = \frac{p(x \mid \theta_k, C_k) \, p(C_k)}{\sum_k p(x \mid \theta_k, C_k) \, p(C_k)}.        (4)

• Adapt the basis functions A_k and the bias terms b_k for each class. The basis functions are adapted using gradient ascent:

    \Delta A_k \propto \frac{\partial}{\partial A_k} \log p(x \mid \theta_{1:K}) = p(C_k \mid x, \theta_{1:K}) \, \frac{\partial}{\partial A_k} \log p(x \mid C_k, \theta_k).        (5)

  Note that this simply weights the gradient of any standard ICA algorithm by p(C_k \mid x, \theta_{1:K}). The gradient can also be summed over multiple data points. The bias term is updated according to

    b_k = \frac{\sum_t x_t \, p(C_k \mid x_t, \theta_{1:K})}{\sum_t p(C_k \mid x_t, \theta_{1:K})},        (6)

  where t is the data index (t = 1, \ldots, T).

The three steps in the learning algorithm perform gradient ascent on the total likelihood of the data in eq. (1); a minimal sketch of one such iteration is given below.
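The following is a minimal sketch of one pass through these three steps, assuming the Laplacian source prior mentioned above (p(s) \propto \exp(-|s|)) and a plain batch gradient step; the function names, the learning rate and the use of the simplified W-space update are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def class_log_likelihood(X, A, b):
    """log p(x|C_k) for each column of X under a Laplacian source prior (eq. 3)."""
    S = np.linalg.solve(A, X - b[:, None])                  # source estimates s = A^{-1}(x - b)
    log_p_s = -np.sum(np.abs(S), axis=0)                    # log p(s) up to a constant
    return log_p_s - np.log(np.abs(np.linalg.det(A))), S    # eq. (3)

def em_like_step(X, As, bs, log_priors, lr=0.01):
    """One iteration over all classes: posteriors (eq. 4), basis update (eq. 5), bias update (eq. 6)."""
    K, (N, T) = len(As), X.shape
    log_lik = np.empty((K, T))
    sources = []
    for k in range(K):
        log_lik[k], S_k = class_log_likelihood(X, As[k], bs[k])
        sources.append(S_k)

    # Class posteriors p(C_k | x_t), eq. (4), computed in the log domain for stability
    log_post = log_lik + log_priors[:, None]
    log_post -= np.max(log_post, axis=0, keepdims=True)
    post = np.exp(log_post)
    post /= np.sum(post, axis=0, keepdims=True)

    for k in range(K):
        S, p_k = sources[k], post[k]
        # Posterior-weighted ICA gradient (eq. 5); for a Laplacian prior the natural-gradient
        # update in W-space is [I - sign(u) u^T] W, averaged here over the batch.
        W = np.linalg.inv(As[k])
        grad_W = ((np.eye(N) * np.sum(p_k)) - (np.sign(S) * p_k) @ S.T) @ W / T
        As[k] = np.linalg.inv(W + lr * grad_W)
        # Bias update, eq. (6): posterior-weighted mean of the data
        bs[k] = (X * p_k).sum(axis=1) / p_k.sum()
    return As, bs, post
```

With a two-class dataset such as the one sketched earlier, one might initialize each A_k as a small random perturbation of the identity, each b_k as the data mean, set log_priors = np.log(np.ones(2) / 2), and call em_like_step repeatedly.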
The extended infomax ICA learning rule is able to blindly separate mixed sources with sub- and super-Gaussian distributions. This is achieved by using a simple type of learning rule first derived by Girolami (1998). The learning rule in Lee et al. (1999b) uses the stability analysis of Cardoso and Laheld (1996) to switch between sub- and super-Gaussian regimes. Expressed in terms of the filter matrix W = A^{-1}, the learning rule is:

    \Delta W \propto \left[ I - K \tanh(u) u^T - u u^T \right] W,        (7)

where the k_i are the elements of the N-dimensional diagonal matrix K and u = W x. The unmixed variables u are the estimates of the sources s (Bell and Sejnowski, 1995). The k_i are (Lee et al., 1999b)

    k_i = \mathrm{sign}\left( E[\mathrm{sech}^2 u_i] \, E[u_i^2] - E[u_i \tanh u_i] \right).        (8)

The source distribution is super-Gaussian when k_i = 1 and sub-Gaussian when k_i = -1. For the log-likelihood estimation in eq. (3), the term \log p(s) can be approximated as follows:

    \log p(s) \propto -\sum_n \left( \log\cosh s_n + \tfrac{s_n^2}{2} \right)    (super-Gaussian)
    \log p(s) \propto +\sum_n \left( \log\cosh s_n - \tfrac{s_n^2}{2} \right)    (sub-Gaussian)        (9)

Super-Gaussian densities are approximated by a density model with a heavier tail than the Gaussian density; sub-Gaussian densities are approximated by a bimodal density (Girolami, 1998). Although these source density approximations are crude, it has been demonstrated that they are sufficient for standard ICA problems (Lee et al., 1999b). When learning sparse representations only, a Laplacian prior (p(s) \propto \exp(-|s|)) can be used for the weight update, which simplifies the infomax learning rule to

    \Delta W \propto \left[ I - \mathrm{sign}(u) u^T \right] W,  \qquad  \log p(s) \propto -\sum_n |s_n|    (Laplacian prior).        (10)
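As an illustration of eqs. (7) and (8), here is a minimal sketch of one batch update of the extended infomax rule on a single (non-mixture) ICA problem; the function name, batch averaging and learning rate are assumptions made for illustration.

```python
import numpy as np

def extended_infomax_step(W, X, lr=0.001):
    """One batch update of eq. (7) with the switching criterion of eq. (8)."""
    N, T = X.shape
    U = W @ X                                              # source estimates u = W x
    # eq. (8): choose +1 (super-Gaussian) or -1 (sub-Gaussian) per component
    k = np.sign(np.mean(1.0 / np.cosh(U) ** 2, axis=1) * np.mean(U ** 2, axis=1)
                - np.mean(U * np.tanh(U), axis=1))
    K = np.diag(k)
    # eq. (7), averaged over the batch: dW proportional to [I - K tanh(u)u^T - uu^T] W
    dW = (np.eye(N) - (K @ np.tanh(U)) @ U.T / T - (U @ U.T) / T) @ W
    return W + lr * dW

# Example setup (assumed): unmix a 2-D mixture of one uniform and one Laplacian source
rng = np.random.default_rng(1)
S = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(0, 1, 5000)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S
W = np.eye(2)
for _ in range(2000):
    W = extended_infomax_step(W, X)
```

In the mixture setting, the same per-class update would simply be weighted by the class posterior, as in eq. (5).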
3 Learning efficient codes for images

Recently, several approaches have been proposed to learn image codes that utilize a set of linear basis functions. Olshausen and Field (1996) used a sparseness criterion and found codes that were similar to localized and oriented receptive fields. Similar results were presented by Bell and Sejnowski (1997) using the infomax algorithm and by Lewicki and Olshausen (1998) using a Bayesian approach. By applying the ICA mixture model, we present results that show a higher degree of flexibility in encoding the images. We used images of natural scenes obtained from Olshausen and Field (1996) and text images of scanned newspaper articles. The training set consisted of 12 by 12 pixel patches selected randomly from both image types. Figure 2 illustrates examples of those image patches.

Figure 2: Example of a natural scene and a text image. The 12 by 12 pixel image patches were randomly sampled from the images and used as inputs to the ICA mixture model.

Two complete basis matrices, A_1 and A_2, were randomly initialized. Then, for each gradient in eq. (5), a stepsize was computed as a function of the amplitude of the basis vectors and the number of iterations. The algorithm converged after 100,000 iterations and learned two classes of basis functions, as shown in Figure 3. One class of basis functions corresponds to natural images; these basis functions show Gabor-like structure (Gaussian-modulated sinusoids), as previously reported (Olshausen and Field, 1996; Bell and Sejnowski, 1997; Lewicki and Olshausen, 1998). The other class corresponds to text images; these basis functions resemble bars with different lengths and widths that capture the high-frequency structure present in the text images.

Figure 3: (Left) Basis function class corresponding to natural images. (Right) Basis function class corresponding to text images.

3.1 Comparing coding efficiency

We have compared the coding efficiency of the ICA mixture model and similar models using Shannon's theorem to obtain a lower bound on the number of bits required to encode a pattern:

    \#\mathrm{bits} \ge -\log_2 p(x \mid A) - N \log_2(\sigma_x),        (11)

where N is the dimensionality of the input pattern x and \sigma_x is the coding precision (the standard deviation of the noise introduced by errors in encoding). Table 1 compares the coding efficiency of five different methods. It shows the number of bits required to encode three different test data sets (5000 image patches from natural scenes, 5000 image patches from text images, and 5000 image patches from both image types) using five different encoding methods (the ICA mixture model, nature-trained ICA, text-trained ICA, nature-and-text-trained ICA, and PCA trained on all three test sets). It is clear that ICA basis functions trained on natural scene images exhibit the best encoding when only natural scenes are presented (column: Nature). The same applies to text images (column: Text). Note that text training yields a reasonable basis for both data sets, but nature training gives a good basis only for nature. The ICA mixture model shows the same encoding power on the individual test data sets, and it gives the best encoding when both image types are present. In this case, the encoding difference between the ICA mixture model and PCA is significant (more than 20%). ICA mixtures yielded a small improvement over ICA trained on both image types. We expect the size of the improvement to be greater in situations where there are greater differences among the classes. An advantage of the mixture model is that each image patch is automatically classified.

Table 1: Comparing coding efficiency. Coding efficiency (bits per pixel) of five methods is compared for three test sets. Coding precision was set to 7 bits (Nature: sigma_x = 0.016 and Text: sigma_x = 0.029).

    Training set and model          Nature   Text   Nature and Text
    ICA mixtures                     4.72    5.20        4.96
    Nature trained ICA               4.72    9.57        7.15
    Text trained ICA                 5.00    5.19        5.10
    Nature and text trained ICA      4.83    5.29        5.07
    PCA                              6.22    5.97        6.09
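To make eq. (11) concrete, the following is a minimal sketch of how a bits-per-pixel figure like those in Table 1 could be estimated for a single class, assuming a normalized Laplacian source prior; the default sigma_x is the "Nature" precision quoted in the table caption, and the function and variable names are illustrative.

```python
import numpy as np

def bits_per_pixel(X, A, b, sigma_x=0.016):
    """Average of the eq. (11) lower bound over patches, divided by the patch dimension N.

    Assumes a normalized Laplacian source prior p(s_n) = 0.5 * exp(-|s_n|), so that
    p(x|A, b) = p(s) / |det A| with s = A^{-1}(x - b).  The default sigma_x is the
    'Nature' coding precision from the Table 1 caption.
    """
    N, T = X.shape
    S = np.linalg.solve(A, X - b[:, None])
    # -log2 p(x|A, b) in bits: Laplacian prior term, its normalizer, and the volume term
    neg_log2_p = (np.sum(np.abs(S), axis=0) / np.log(2.0)
                  + N
                  + np.log2(np.abs(np.linalg.det(A))))
    bits = neg_log2_p - N * np.log2(sigma_x)    # eq. (11)
    return np.mean(bits) / N
```

For the ICA mixture model itself, p(x|A) in eq. (11) would be evaluated under the mixture likelihood of eq. (1) rather than under a single class density.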
4 Discussion

The new algorithm for unsupervised classification presented here is based on a maximum likelihood mixture model that uses ICA to model the structure of the classes. We have demonstrated that the algorithm can learn efficient codes to represent different image types such as natural scenes and text images. In this case, the learned classes of basis functions show a 20% improvement over PCA encoding. The ICA mixture model should show better image compression rates than traditional compression algorithms such as JPEG.

The ICA mixture model is a nonlinear model in which each class is modeled as a linear process and the choice of class is modeled using probabilities. This model can therefore be seen as a nonlinear ICA model. Furthermore, it is one way of relaxing the independence assumption over the whole data set. The ICA mixture model is a conditional independence model, i.e., the independence assumption holds only within each class and there may be dependencies among classes. A different view of the ICA mixture model is to think of the classes as forming an overcomplete representation. Compared to the approach of Lewicki and Sejnowski (1998), the main difference is that the basis functions learned here are mutually exclusive, i.e., each class uses its own set of basis functions.

This method is similar to other approaches, including the mixture density networks of Bishop (1994), in which a neural network was used to find arbitrary density functions. The algorithm reduces to the Gaussian mixture model when the source priors are Gaussian. Purely Gaussian structure, however, is rare in real data sets. Here we have used priors in the form of super-Gaussian and sub-Gaussian densities, but these could be extended as proposed by Attias (1999). The proposed model was used for learning a complete set of basis functions without additive noise. However, the method can be extended to take into account additive Gaussian noise and an overcomplete set of basis vectors (Lewicki and Sejnowski, 1998).

In Lee et al. (1999c), we performed several experiments on benchmark data sets for classification problems. The results were comparable to, or improved over, those obtained by AutoClass (Stutz and Cheeseman, 1994), which uses a Gaussian mixture model. Furthermore, we showed that the algorithm can be applied to blind source separation in nonstationary environments: the method can switch automatically between learned mixing matrices in different environments (Lee et al., 1999c). This may prove to be useful in the automatic detection of sleep stages by observing EEG signals, since the method can identify these stages due to the changing source priors and their mixing.

Potential applications of the proposed method include the problem of noise removal and the problem of filling in missing pixels. We believe that this method provides greater flexibility in modeling structure in high-dimensional data and has many potential applications.

References

Attias, H. (1999). Blind separation of noisy mixtures: An EM algorithm for independent factor analysis. Neural Computation, in press.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.

Bell, A. J. and Sejnowski, T. J. (1997). The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338.

Bishop, C. (1994). Mixture density networks. Technical Report NCRG/4288.

Cardoso, J.-F. and Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 45(2):434-444.

Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. Wiley, New York.

Ghahramani, Z. (1994). Solving inverse problems using an EM approach to density estimation. Proceedings of the 1993 Connectionist Models Summer School, pages 316-323.

Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10(8):2103-2114.

Lee, T.-W., Girolami, M., Bell, A. J., and Sejnowski, T. J. (1999a). A unifying framework for independent component analysis. International Journal on Mathematical and Computer Models, in press.

Lee, T.-W., Girolami, M., and Sejnowski, T. J. (1999b). Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation, 11(2):409-433.

Lee, T.-W., Lewicki, M. S., and Sejnowski, T. J. (1999c). ICA mixture models for unsupervised classification and automatic context switching. In International Workshop on ICA, Aussois, in press.

Lewicki, M. and Olshausen, B. (1998). Inferring sparse, overcomplete image codes using an efficient coding framework. In Advances in Neural Information Processing Systems 10, pages 556-562.

Lewicki, M. and Sejnowski, T. J. (1998). Learning nonlinear overcomplete representations for efficient coding. In Advances in Neural Information Processing Systems 10, pages 815-821.

Olshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609.

Stutz, J. and Cheeseman, P. (1994). AutoClass - a Bayesian approach to classification. Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers.
", "award": [], "sourceid": 1592, "authors": [{"given_name": "Te-Won", "family_name": "Lee", "institution": null}, {"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}