{"title": "Learning Nonlinear Overcomplete Representations for Efficient Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 556, "page_last": 562, "abstract": "", "full_text": "Learning nonlinear overcomplete representations for efficient coding\n\nMichael S. Lewicki\nlewicki@salk.edu\n\nTerrence J. Sejnowski\nterry@salk.edu\n\nHoward Hughes Medical Institute\nComputational Neurobiology Lab\nThe Salk Institute\n10010 N. Torrey Pines Rd.\nLa Jolla, CA 92037\n\nAbstract\n\nWe derive a learning algorithm for inferring an overcomplete basis by viewing it as a probabilistic model of the observed data. Overcomplete bases allow for better approximation of the underlying statistical density. Using a Laplacian prior on the basis coefficients removes redundancy and leads to representations that are sparse and are a nonlinear function of the data. This can be viewed as a generalization of the technique of independent component analysis and provides a method for blind source separation of fewer mixtures than sources. We demonstrate the utility of overcomplete representations on natural speech and show that, compared to the traditional Fourier basis, the inferred representations potentially have much greater coding efficiency.\n\nA traditional way to represent real-valued signals is with Fourier or wavelet bases. A disadvantage of these bases, however, is that they are not specialized for any particular dataset. Principal component analysis (PCA) provides one means for finding a basis that is adapted for a dataset, but the basis vectors are restricted to be orthogonal. 
An extension of PCA called independent component analysis (Jutten and Herault, 1991; Comon et al., 1991; Bell and Sejnowski, 1995) allows the learning of non-orthogonal bases. All of these bases are complete in the sense that they span the input space, but they are limited in terms of how well they can approximate the dataset's statistical density.\n\nRepresentations that are overcomplete, i.e., that have more basis vectors than input variables, can provide a better representation, because the basis vectors can be specialized for a larger variety of features present in the entire ensemble of data. A criticism of overcomplete representations is that they are redundant, i.e., a given data point may have many possible representations, but this redundancy is removed by the prior probability of the basis coefficients, which specifies the probability of the alternative representations.\n\nMost of the overcomplete bases used in the literature are fixed in the sense that they are not adapted to the structure in the data. Recently, Olshausen and Field (1996) presented an algorithm that allows an overcomplete basis to be learned. This algorithm relied on an approximation to the desired probabilistic objective that had several drawbacks, including a tendency to break down in the case of low noise levels and when learning bases with higher degrees of overcompleteness. In this paper, we present an improved approximation to the desired probabilistic objective and show that this leads to a simple and robust algorithm for learning optimal overcomplete bases.\n\n1 Inferring the representation\n\nThe data, $x_{1:L}$, are modeled with an overcomplete linear basis plus additive noise:\n\n$$x = As + \\epsilon \\tag{1}$$\n\nwhere $A$ is an $L \\times M$ matrix whose columns are the basis vectors, with $M \\ge L$. 
We assume Gaussian additive noise so that $\\log P(x|A,s) \\propto -\\lambda (x - As)^2 / 2$, where $\\lambda = 1/\\sigma^2$ defines the precision of the noise.\n\nThe redundancy in the overcomplete representation is removed by defining a density for the basis coefficients, $P(s)$, which specifies the probability of the alternative representations. The most probable representation, $\\hat{s}$, is found by maximizing the posterior distribution\n\n$$\\hat{s} = \\arg\\max_s P(s|A,x) = \\arg\\max_s P(s) P(x|A,s) \\tag{2}$$\n\n$P(s)$ influences how the data are fit in the presence of noise and determines the uniqueness of the representation. In this model, the data are a linear function of $s$, but $s$ is not, in general, a linear function of the data. If the basis is complete ($A$ is invertible) then, assuming broad priors and low noise, the most probable internal state can be computed simply by inverting $A$. In the case of an overcomplete basis, however, $A$ cannot be inverted. Figure 1 shows how different priors induce different representations.\n\nUnlike the Gaussian prior, the optimal representation under the Laplacian prior cannot be obtained by a simple linear operation. One approach for optimizing $s$ is to use the gradient of the log posterior in an optimization algorithm. An alternative method for finding the most probable internal state is to view the problem as the linear program: $\\min \\mathbf{1}^T s$ such that $As = x$. This can be generalized to handle both positive and negative $s$ and solved efficiently and exactly with interior point linear programming methods (Chen et al., 1996).\n\nFigure 1: Different priors induce different representations. 
(a) The 2D data distribution has three main axes which form an overcomplete representation. The graphs marked \"L2\" and \"L1\" show the optimal scaled basis vectors for the data point $x$ under the Gaussian and Laplacian prior, respectively. Assuming zero noise, a Gaussian for $P(s)$ is equivalent to finding the exact fitting $s$ with minimum L2 norm, which is given by the pseudoinverse $s = A^+ x$. A Laplacian prior ($P(s_m) \\propto \\exp[-\\theta |s_m|]$) yields the exact fit with minimum L1 norm, which is a nonlinear operation that essentially selects a subset of the basis vectors to represent the data (Chen et al., 1996). The resulting representation is sparse. (b) A 64-sample segment of speech was fit to a 2x overcomplete Fourier representation (128 basis vectors). The plot shows the rank-order distribution of the coefficients of $s$ under a Gaussian prior (dashed) and a Laplacian prior (solid). Far more significantly positive coefficients are required under the Gaussian prior than under the Laplacian prior.\n\n2 Learning\n\nThe learning objective is to adapt $A$ to maximize the probability of the data, which is computed by marginalizing over the internal states\n\n$$P(x|A) = \\int ds\\, P(s) P(x|A,s) \\tag{3}$$\n\nIn general, this integral cannot be evaluated analytically but can be approximated with a Gaussian integral around $\\hat{s}$, yielding\n\n$$\\log P(x|A) \\approx \\text{const.} + \\log P(\\hat{s}) - \\frac{\\lambda}{2} (x - A\\hat{s})^2 - \\frac{1}{2} \\log \\det H \\tag{4}$$\n\nwhere $H$ is the Hessian of the log posterior at $\\hat{s}$, given by $\\lambda A^T A - \\nabla\\nabla \\log P(\\hat{s})$. To avoid a singularity under the Laplacian prior, we use the approximation $(\\log P(s_m))' \\approx -\\theta \\tanh(\\beta s_m)$, which gives the Hessian full rank and positive determinant. For large $\\beta$ this approximates the true Laplacian prior. A learning rule can be obtained by differentiating $\\log P(x|A)$ with respect to $A$.
\n\nIn the following discussion, we present the derivations of the three terms in (4) and the simplifying assumptions that lead to the following simple form of the learning rule\n\n(5)\n\n2.1 Deriving $\\nabla \\log P(\\hat{s})$\n\nThis term specifies how to change $A$ so as to make the representation $\\hat{s}$ more probable. If we assume a Laplacian prior, this component changes $A$ to make the representation more sparse.\n\nWe assume $P(\\hat{s}) = \\prod_m P(\\hat{s}_m)$. In order to obtain $\\partial \\hat{s}_m / \\partial a_{ij}$, we need to describe $\\hat{s}$ as a function of $A$. If the basis is complete (and we assume low noise), then we may simply invert $A$ to obtain $\\hat{s} = A^{-1} x$. When $A$ is overcomplete, however, there is no simple expression, but we may still make an approximation.\n\nUnder sparse priors such as the Laplacian, the most probable solution, $\\hat{s}$, will yield at most $L$ non-zero elements. In effect, this selects a complete basis from $A$. Let $\\check{A}$ represent this reduced basis under $\\hat{s}$. We then have $\\check{s} = \\check{A}^{-1} (x - \\epsilon)$, where $\\check{s}$ is equal to $\\hat{s}$ with the $M - L$ zero-valued elements removed. $\\check{A}$ is obtained by removing the columns of $A$ corresponding to the $M - L$ zero-valued elements of $\\hat{s}$. This allows us to use results obtained for the case when $A$ is invertible. Following MacKay (1996) we obtain\n\n(6)\n\nRewriting in matrix notation we have\n\n$$\\frac{\\partial \\log P(\\check{s})}{\\partial \\check{A}} = -\\check{A}^{-T} \\check{z} \\check{s}^T \\tag{7}$$\n\nWe can use this to obtain an expression in terms of the original variables. We simply invert the mapping $\\hat{s} \\to \\check{s}$ to obtain $z \\leftarrow \\check{z}$ and $W^T \\leftarrow \\check{A}^{-T}$ (row-wise), with $z_m = 0$ and row $m$ of $W^T = 0$ if $\\hat{s}_m = 0$. We then have\n\n$$\\frac{\\partial \\log P(\\hat{s})}{\\partial A} = -W^T z \\hat{s}^T \\tag{8}$$\n\n2.2 Deriving $\\nabla (x - A\\hat{s})^2$\n\nThe second term specifies how to change $A$ so as to minimize the data misfit. 
Letting $e_k = [x - A\\hat{s}]_k$ and using the results and notation from above we have:\n\n$$-\\frac{\\partial}{\\partial a_{ij}} \\frac{\\lambda}{2} \\sum_k e_k^2 = \\lambda e_i \\hat{s}_j + \\lambda \\sum_k e_k \\sum_l a_{kl} \\frac{\\partial \\hat{s}_l}{\\partial a_{ij}} \\tag{9}$$\n\n$$= \\lambda e_i \\hat{s}_j + \\lambda \\sum_k e_k \\sum_l (-a_{kl} w_{li} \\hat{s}_j) \\tag{10}$$\n\n$$= \\lambda e_i \\hat{s}_j - \\lambda e_i \\hat{s}_j = 0 \\tag{11}$$\n\nThus no gradient component arises from the error term.\n\n2.3 Deriving $\\nabla \\log \\det H$\n\nThe third term in the learning rule specifies how to change the weights so as to minimize the width of the posterior distribution $P(s|x,A)$ and thus increase the overall probability of the data. An element of $H$ is defined by $H_{mn} = c_{mn} + b_{mn}$, where $c_{mn} = \\sum_k \\lambda a_{km} a_{kn}$ and $b_{mn} = [-\\nabla\\nabla \\log P(\\hat{s})]_{mn}$. This gives\n\n$$\\frac{\\partial \\log \\det H}{\\partial a_{ij}} = \\sum_{mn} H^{-1}_{nm} \\left[ \\frac{\\partial c_{mn}}{\\partial a_{ij}} + \\frac{\\partial b_{mn}}{\\partial a_{ij}} \\right] \\tag{12}$$\n\nFirst considering $\\partial c_{mn} / \\partial a_{ij}$, we can obtain\n\n$$\\sum_{mn} H^{-1}_{nm} \\frac{\\partial c_{mn}}{\\partial a_{ij}} = \\sum_{m \\neq j} H^{-1}_{mj} \\lambda a_{im} + \\sum_{m \\neq j} H^{-1}_{jm} \\lambda a_{im} + H^{-1}_{jj} 2 \\lambda a_{ij} \\tag{13}$$\n\nUsing the fact that $H^{-1}_{mj} = H^{-1}_{jm}$ due to the symmetry of the Hessian, we have\n\n$$\\sum_{mn} H^{-1}_{nm} \\frac{\\partial c_{mn}}{\\partial a_{ij}} = 2 \\lambda \\sum_m H^{-1}_{mj} a_{im} \\tag{14}$$\n\nNext we derive $\\partial b_{mn} / \\partial a_{ij}$. We have that $\\nabla\\nabla \\log P(\\hat{s})$ is diagonal, because we assume $P(\\hat{s}) = \\prod_m P(\\hat{s}_m)$. Letting $2 y_m = H^{-1}_{mm} \\partial b_{mm} / \\partial \\hat{s}_m$ and using the result under the reduced representation (6) we can obtain\n\n(15)\n\n2.4 Stabilizing and simplifying the learning rule\n\nPutting the terms together yields a problematic expression due to the matrix inverses. This can be alleviated by multiplying the gradient by an appropriate positive definite matrix, which rescales the gradient components but preserves a direction valid for optimization. Noting that $A^T W^T = I$ we have\n\n(16)\n\nIf $\\lambda$ is large (low noise) then the Hessian is dominated by $\\lambda A^T A$ and we have\n\n(17)\n\nThe vector $y$ hides a computation involving the inverse Hessian. 
If the basis vectors in $A$ are randomly distributed, then as the dimensionality of $A$ increases the basis vectors become approximately orthogonal and consequently the Hessian becomes approximately diagonal. It can be shown that if $\\log P(\\hat{s})$ and its derivatives are smooth, $y_m$ vanishes for large $\\lambda$. Combining the remaining terms yields equation (5). Note that this rule contains no matrix inverses and the vector $z$ involves only the derivative of the log prior.\n\nIn the case where $A$ is square, this form of the rule is similar to the natural gradient independent component analysis (ICA) learning rule (Amari et al., 1996). The difference in the more general case where $A$ is rectangular is that $s$ must maximize the posterior distribution $P(s|x,A)$, which cannot be done simply with the filter matrix as in standard ICA algorithms.\n\n3 Examples\n\nMore sources than inputs. In these 2D examples, the bases were initialized to random, normalized vectors. The coefficients were solved using BPMPD, a publicly available interior point linear programming package (Meszaros, 1997), which gives the most probable solution under the Laplacian prior assuming zero noise. The algorithm was run for 30 iterations using equation (5) with a stepsize of 0.001 and a batchsize of 200. Convergence was rapid, typically requiring fewer than 20 iterations. In all cases, the directions of the learned vectors matched those of the true generating distribution; the magnitude was estimated less precisely, possibly due to the approximation of $\\log P(x|A)$. This can be viewed as a source separation problem, but true separation will be limited due to the projection of the sources down to a smaller subspace, which necessarily loses information.\n\nFigure 2: Examples illustrating the fitting of 2D distributions with overcomplete bases. The first example is equivalent to 3 sources mixed into 2 channels; the second to 4 sources mixed into 2 channels. The data in both examples were generated from the true basis $A$ using $x = As$ with the elements of $s$ distributed according to an exponential distribution with unit mean. Identical results were obtained by drawing $s$ from a Laplacian prior (positive and negative coefficients). The overcomplete bases allow the model to capture the true underlying statistical structure in the 2D data space.\n\nOvercomplete representations of speech. Speech data were obtained from the TIMIT database, using a single speaker speaking ten different example sentences with no preprocessing. The basis was initialized to an overcomplete Fourier basis. A conjugate gradient routine was used to obtain the most probable basis coefficients. The stepsize was gradually reduced over 10000 iterations. Figure 3 shows that the learned basis is quite different from the Fourier representation. The power spectrum for the learned basis vectors can be multimodal and/or broadband. The learned basis achieves greater coding efficiency: 2.19 \u00b1 0.59 bits per sample compared to 3.86 \u00b1 0.28 bits per sample for a 2x overcomplete Fourier basis.\n\nFigure 3: An example of fitting a 2x overcomplete representation to segments of natural speech. Each segment consisted of 64 samples, sampled at a frequency of 8000 Hz (8 msecs). The plot shows a random sample of 30 of the 128 basis vectors (each scaled to full range). The right graph shows the corresponding power spectral densities (0 to 4000 Hz).\n\n4 Summary\n\nLearning overcomplete representations allows a basis to better approximate the underlying statistical density of the data and consequently the learned representations have better encoding and denoising properties than generic bases. Unlike the case for complete representations and the standard ICA algorithm, the transformation from the data to the internal representation is nonlinear. The probabilistic formulation of the basis inference problem offers the advantages that assumptions about the prior distribution on the basis coefficients are made explicit and that different models can be compared objectively using $\\log P(x|A)$.\n\nReferences\n\nAmari, S., Cichocki, A., and Yang, H. H. (1996). A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, volume 8, pages 757-763, San Mateo. Morgan Kaufmann.\n\nBell, A. J. and Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159.\n\nChen, S., Donoho, D. L., and Saunders, M. A. (1996). Atomic decomposition by basis pursuit. Technical report, Dept. Stat., Stanford Univ., Stanford, CA.\n\nComon, P., Jutten, C., and Herault, J. (1991). Blind separation of sources, Part II: Problems statement. Signal Processing, 24(1):11-20.\n\nJutten, C. and Herault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1):1-10.\n\nMacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. University of Cambridge, Cavendish Laboratory. Available at ftp://wol.ra.phy.cam.ac.uk/pub/mackay/ica.ps.gz.
\n\nMeszaros, C. (1997). BPMPD: An interior point linear programming solver. Code available at ftp://ftp.netlib.org/opt/bpmpd.tar.gz.\n\nOlshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381:607-609.", "award": [], "sourceid": 1424, "authors": [{"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}