{"title": "Coding Time-Varying Signals Using Sparse, Shift-Invariant Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 730, "page_last": 736, "abstract": null, "full_text": "Coding time-varying signals using sparse, shift-invariant representations\n\nMichael S. Lewicki*\nlewicki@salk.edu\n\nTerrence J. Sejnowski\nterry@salk.edu\n\nHoward Hughes Medical Institute\nComputational Neurobiology Laboratory\nThe Salk Institute\n10010 N. Torrey Pines Rd.\nLa Jolla, CA 92037\n\nAbstract\n\nA common way to represent a time series is to divide it into short-duration blocks, each of which is then represented by a set of basis functions. A limitation of this approach, however, is that the temporal alignment of the basis functions with the underlying structure in the time series is arbitrary. We present an algorithm for encoding a time series that does not require blocking the data. The algorithm finds an efficient representation by inferring the best temporal positions for functions in a kernel basis. These can have arbitrary temporal extent and are not constrained to be orthogonal. This allows the model to capture structure in the signal that may occur at arbitrary temporal positions and preserves the relative temporal structure of underlying events. The model is shown to be equivalent to a very sparse and highly overcomplete basis. Under this model, the mapping from the data to the representation is nonlinear, but can be computed efficiently. This form also allows the use of existing methods for adapting the basis itself to data. This approach is applied to speech data and results in a shift-invariant, spike-like representation that resembles coding in the cochlear nerve.\n\n1 Introduction\n\nTime series are often encoded by first dividing the signal into a sequence of blocks. 
The data within each block are then fit with a standard basis, such as a Fourier or wavelet basis. A limitation of this approach is that the components of the basis are arbitrarily aligned with respect to structure in the time series. Figure 1 shows a short segment of speech data and the boundaries of the blocks. Although the structure in the signal is largely periodic, each large oscillation appears in a different position within the blocks and is sometimes split across blocks. This problem is particularly pronounced for acoustic events with sharp onset, such as plosives in speech. It also presents difficulties for encoding the signal efficiently, because any basis that is adapted to the underlying structure must represent all possible phases. This can be somewhat circumvented by techniques such as windowing or averaging sliding blocks, but it would be more desirable if the representation were shift invariant.\n\n*To whom correspondence should be addressed.\n\nFigure 1: Blocking results in arbitrary phase alignment of the underlying structure.\n\n2 The Model\n\nOur goal is to model a signal by using a small set of kernel functions that can be placed at arbitrary time points. Ultimately, we want to find the minimal set of functions and time points that fit the signal within a given noise level. We expect this type of model to work well for signals composed of events whose onset can occur at arbitrary temporal positions. Examples include musical instrument sounds with sharp attack and plosive sounds in speech.\n\nWe assume the time series $x(t)$ is modeled by\n\n$$x(t) = \\sum_i s_i \\phi_{m[i]}(t - \\tau_i) + \\epsilon(t) \\qquad (1)$$\n\nwhere $\\tau_i$ indicates the temporal position of the $i$th kernel function, $\\phi_{m[i]}$, which is scaled by $s_i$. 
The notation $m[i]$ represents an index function that specifies which of the $M$ kernel functions is present at time $\\tau_i$. A single kernel function can occur at multiple times during the time series. Additive noise at time $t$ is given by $\\epsilon(t)$.\n\nA more general way to express (1) is to assume that the kernel functions exist at all time points during the signal, and let the non-zero coefficients determine the positions of the kernel functions. In this case, the model can be expressed in convolutional form\n\n$$x(t) = \\sum_m \\int s_m(\\tau) \\phi_m(t - \\tau) \\, d\\tau + \\epsilon(t) \\qquad (2)$$\n\n$$= \\sum_m s_m(t) * \\phi_m(t) + \\epsilon(t), \\qquad (3)$$\n\nwhere $s_m(\\tau)$ is the coefficient at time $\\tau$ for kernel function $\\phi_m$.\n\nIt is also helpful to express the model in matrix form using a discrete sampling of the continuous time series:\n\n$$x = As + \\epsilon. \\qquad (4)$$\n\nThe basis matrix, $A$, is defined by\n\n$$A = [\\, C(\\phi_1) \\quad C(\\phi_2) \\quad \\cdots \\quad C(\\phi_M) \\,] \\qquad (5)$$\n\nwhere $C(a)$ is an $N$-by-$N$ circulant matrix parameterized by the vector $a$. This matrix is constructed by replicating the kernel functions at each sample position\n\n$$C(a) = \\begin{bmatrix} a_0 & a_{N-1} & \\cdots & a_2 & a_1 \\\\\\\\ a_1 & a_0 & \\cdots & a_3 & a_2 \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots & \\vdots \\\\\\\\ a_{N-2} & a_{N-3} & \\cdots & a_0 & a_{N-1} \\\\\\\\ a_{N-1} & a_{N-2} & \\cdots & a_1 & a_0 \\end{bmatrix} \\qquad (6)$$\n\nThe kernels are zero padded to be of length $N$. The length of each kernel is typically much less than the length of the signal, making $A$ very sparse. This can be viewed as a special case of a Toeplitz matrix. Note that the size of $A$ is $N$-by-$MN$, and it is thus an example of an overcomplete basis, i.e. a basis with more basis functions than dimensions in the data space (Simoncelli et al., 1992; Coifman and Wickerhauser, 1992; Mallat and Zhang, 1993; Lewicki and Sejnowski, 1998).
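\n\nAs a small worked illustration of the circulant construction (the case $N = 3$ is chosen here only for concreteness and does not appear in the original), multiplying $C(a)$ by a coefficient vector $s = (s_0, s_1, s_2)^T$ gives\n\n$$C(a)\\,s = \\begin{bmatrix} a_0 & a_2 & a_1 \\\\\\\\ a_1 & a_0 & a_2 \\\\\\\\ a_2 & a_1 & a_0 \\end{bmatrix} \\begin{bmatrix} s_0 \\\\\\\\ s_1 \\\\\\\\ s_2 \\end{bmatrix} = \\begin{bmatrix} a_0 s_0 + a_2 s_1 + a_1 s_2 \\\\\\\\ a_1 s_0 + a_0 s_1 + a_2 s_2 \\\\\\\\ a_2 s_0 + a_1 s_1 + a_0 s_2 \\end{bmatrix},$$\n\nso that $(C(a)\\,s)_n = \\sum_k a_{(n-k) \\bmod N}\\, s_k$. Under this reading, each block of $As$ is a circular convolution of one kernel with its coefficient sequence, which is the discrete counterpart of (3).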
\n\n3 A probabilistic formulation\n\nThe optimal coefficient values for a signal are found by maximizing the posterior distribution\n\n$$\\hat{s} = \\arg\\max_s P(s \\mid x, A) = \\arg\\max_s P(x \\mid A, s) P(s) \\qquad (7)$$\n\nwhere $\\hat{s}$ is the most probable representation of the signal. Note that omission of the normalizing constant $P(x \\mid A)$ does not change the location of the maximum. This formulation of the problem offers the advantage that the model can fit more general types of distributions and naturally \"denoises\" the signal. Note that the mapping from $x$ to $\\hat{s}$ is nonlinear with non-zero additive noise and an overcomplete basis (Chen et al., 1996; Lewicki and Sejnowski, 1998). Optimizing (7) essentially selects out the subset of basis functions that best account for the data.\n\nTo define a probabilistic model, we follow previous conventions for linear generative models with additive noise (Cardoso, 1997; Lewicki and Sejnowski, 1998). We assume the noise, $\\epsilon$, to have a Gaussian distribution, which yields a data likelihood for a given representation of\n\n$$\\log P(x \\mid A, s) \\propto -\\frac{1}{2\\sigma^2} (x - As)^2. \\qquad (8)$$\n\nThe function $P(s)$ describes the a priori distribution of the coefficients. Under the assumption that $P(s)$ is sparse (highly peaked around zero), maximizing (7) results in very few non-zero coefficients. A compact representation of $s$ is to describe the values of the non-zero coefficients and their temporal positions\n\n$$P(s) = \\prod_m P(u_m, \\tau_m) = \\prod_{m=1}^{M} \\prod_{i=1}^{n_m} P(u_{m,i}) P(\\tau_{m,i}), \\qquad (9)$$\n\nwhere the prior for the non-zero coefficient values, $u_{m,i}$, is assumed to be Laplacian, and the prior for the temporal positions (or intervals), $\\tau_{m,i}$, is assumed to be a gamma distribution. 
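\n\nTo make the objective concrete, consider the following sketch, which assumes for illustration that every coefficient $s_k$ has a Laplacian prior $P(s_k) \\propto \\exp(-\\theta |s_k|)$ and ignores the prior on temporal positions; the scale parameter $\\theta$ is a hypothetical symbol introduced here, not one used in the original. Combining this prior with (8) gives\n\n$$\\log P(s \\mid x, A) = -\\frac{1}{2\\sigma^2} (x - As)^2 - \\theta \\sum_k |s_k| + \\text{const},$$\n\nso maximizing the posterior trades reconstruction error against the summed magnitude of the coefficients, and the prior term pushes small coefficients toward zero.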
\n\n4 Finding the best encoding\n\nA difficult challenge presented by the proposed model is finding a computationally tractable method for fitting it to the data. The brute-force approach of generating the basis matrix $A$ generates an intractable number of basis functions for signals of any reasonable length, so we need to look for ways of making the optimization of (7) more efficient. The gradient of the log posterior is given by\n\n$$\\frac{\\partial}{\\partial s} \\log P(s \\mid A, x) \\propto A^T (x - As) + z(s), \\qquad (10)$$\n\nwhere $z(s) = (\\log P(s))'$. A basic operation required is $v = A^T u$. We saw that $x = As$ can be computed efficiently using convolution (2). The transpose $A^T$ is also block circulant,\n\n$$A^T = \\begin{bmatrix} C(\\phi'_1) \\\\\\\\ \\vdots \\\\\\\\ C(\\phi'_M) \\end{bmatrix} \\qquad (11)$$\n\nwhere $\\phi'(1:N) = \\phi(N:-1:1)$. Thus, terms involving $A^T$ can also be computed efficiently using convolution\n\n$$v = A^T u = \\begin{bmatrix} \\phi_1(-t) * u(t) \\\\\\\\ \\vdots \\\\\\\\ \\phi_M(-t) * u(t) \\end{bmatrix} \\qquad (12)$$\n\nObtaining an initial representation\n\nAn alternative approach to optimizing (7) is to make use of the fact that if the kernel functions are short enough in length, direct multiplication is faster than convolution, and that, for this highly overcomplete basis, most of the coefficients will be zero after being fit to the data. The central problem in encoding the signal then is to determine which coefficients are non-zero, ideally finding a description of the time series with the minimal number of non-zero coefficients. This is equivalent to determining the best set of temporal positions for each of the kernel functions (1).\n\nA crucial step in this approach is to obtain a good initial estimate of the coefficients. One way to do this is to consider the projection of the signal onto each of the basis functions, i.e. $A^T x$. 
This estimate will be exact (i.e. zero residual error) in the case of zero noise and $A$ orthogonal. For the non-orthogonal, overcomplete case the solution will be approximate, but for certain choices of the basis matrix, an exact representation can still be obtained efficiently (Daubechies, 1990; Simoncelli et al., 1992).\n\nFigure 2 shows examples of convolving two different kernel functions with data. One disadvantage with this initial solution is that the coefficient functions $s'_m(t)$ are not sparse. For example, even though the signal in figure 2a is composed of only three instances of the kernel function, the convolution is mostly non-zero.\n\nA simple procedure for obtaining a better initial estimate of the most probable coefficients is to select the time locations of the maxima (or extrema) in the convolutions. These are positions where the kernel functions capture the greatest amount of signal structure and where the optimal coefficients are likely to be non-zero. This generates a large number of positions, but their number can be reduced further by selecting only those that contribute significantly, i.e. where the average power is greater than some fraction of the noise level. From these, a basis for the entire signal is constructed by replicating the kernel functions at the appropriate time positions.\n\nFigure 2: Convolution using the fast Fourier transform is an efficient way to select an initial solution for the temporal positions of the kernel functions. (a) The convolution of a sawtooth-shaped kernel function, $\\phi(t)$, with a sawtooth waveform, $x(t)$. (b) A single period sine-wave kernel function convolved with a speech segment. 
\n\nOnce an initial estimate and basis are formed, the most probable coefficient values are estimated using a modified conjugate gradient procedure. The size of the generated basis does not pose a problem for optimization, because it has very few non-zero elements (the number of which is roughly constant per unit time). This arises because each column is non-zero only around the position of the kernel function, which is typically much shorter in duration than the data waveform. This structure affords the use of sparse matrix routines for all the key computations in the conjugate gradient routine. After the initial fit, there typically are a large number of basis functions that give a very small contribution. These can be pruned to yield, after refitting, a more probable representation that has significantly fewer coefficients.\n\n5 Properties of the representation\n\nFigure 3 shows the results of fitting a segment of speech with a sine wave kernel. The 64 kernel functions were constructed using a single period of a sine function whose log frequencies were evenly distributed between 0 and Nyquist (4 kHz), which yielded kernel functions that were minimally correlated (they are not orthogonal because each has only one cycle and is zero elsewhere). The kernel function lengths varied between 2 and 64 samples. The plots show the positions of the non-zero coefficients superimposed on the waveform. The residual error curves from the fitted waveforms are shown offset, below each waveform. The right axes indicate the kernel function number, which increases with frequency. The dots show the starting position of the kernels with non-zero coefficients, with the dot size scaled according to the mean power contribution. 
This plot is essentially a time/frequency analysis, similar to a wavelet decomposition, but on a finer temporal scale.\n\nFigure 3a shows that the structure in the coefficients repeats for each oscillation in the waveform. Adding a delay leaves the relative temporal structure of the non-zero coefficients mostly unchanged (figure 3b). The small variations between the two sets of coefficients are due to variations in the fitting of the small-magnitude coefficients. Representing the signal in figure 3b with a standard complete basis would result in a very different representation.\n\nFigure 3: Fitting a shift-invariant model to a segment of speech, $x(t)$. Dots indicate positions of kernels (right axis) with size scaled by the mean power contribution. Fitting error is plotted below the speech signal.\n\n6 Discussion\n\nThe model presented here can be viewed as an extension of the shiftable transforms of Simoncelli et al. (1992). One difference is that here no constraints are placed on the kernel functions. Furthermore, this model accounts for additive noise, which yields automatic signal denoising and provides sensible criteria for selecting significant coefficients. An important unresolved issue is how well the algorithm works for increasingly non-orthogonal kernels.\n\nOne interesting property of this representation is that it results in a spike-like representation. 
In the resulting set of non-zero coefficients, not only are their values important for representing the signal, but also their relative temporal positions, which indicate when an underlying event has occurred. This shares many properties with cochlear models. The model described here also has the capacity to have an overcomplete representation at any given timepoint, e.g. a kernel basis with an arbitrarily large number of frequencies. These properties make this model potentially useful for binaural signal processing applications.\n\nThe effectiveness of this method for efficient coding remains to be proved. A trivial example of a shift-invariant basis is a delta-function model. For a model to encode information efficiently, the representation should be non-redundant. Each basis function should \"grab\" as much structure in the data as possible and achieve the same level of coding efficiency for arbitrary shifts of the data. The matrix form of the model (4) suggests that it is possible to achieve this optimum by adapting the kernel functions themselves using the methods of Lewicki and Sejnowski (1998). Initial results suggest that this approach is promising. Beyond this, it is evident that modeling the higher-order structure in the coefficients themselves will be necessary both to achieve an efficient representation and to capture structure that is relevant to such tasks as speech recognition or auditory stream segmentation. These results are a step toward these goals.\n\nAcknowledgments. We thank Tony Bell, Bruno Olshausen, and David Donoho for helpful discussions.\n\nReferences\n\nCardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4:109-111.\n\nChen, S., Donoho, D. L., and Saunders, M. A. (1996). Atomic decomposition by basis pursuit. 
Technical report, Dept. Stat., Stanford Univ., Stanford, CA.\n\nCoifman, R. R. and Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713-718.\n\nDaubechies, I. (1990). The wavelet transform, time-frequency localization, and signal analysis. IEEE Transactions on Information Theory, 36(5):961-1004.\n\nLewicki, M. S. and Sejnowski, T. J. (1998). Learning overcomplete representations. Neural Computation. Submitted.\n\nMallat, S. G. and Zhang, Z. F. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415.\n\nSimoncelli, E. P., Freeman, W. T., Adelson, E. H., and Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38:587-607.", "award": [], "sourceid": 1514, "authors": [{"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}