{"title": "Temporal Low-Order Statistics of Natural Sounds", "book": "Advances in Neural Information Processing Systems", "page_first": 27, "page_last": 33, "abstract": null, "full_text": "Temporal Low-Order Statistics of Natural Sounds \n\nH. Attias* and C.E. Schreiner\u2020 \n\nSloan Center for Theoretical Neurobiology and \nW.M. Keck Foundation Center for Integrative Neuroscience \nUniversity of California at San Francisco \nSan Francisco, CA 94143-0444 \n\nAbstract \n\nIn order to process incoming sounds efficiently, it is advantageous for the auditory system to be adapted to the statistical structure of natural auditory scenes. As a first step in investigating the relation between the system and its inputs, we study low-order statistical properties in several sound ensembles using a filter bank analysis. Focusing on the amplitude and phase in different frequency bands, we find simple parametric descriptions for their distribution and power spectrum that are valid for very different types of sounds. In particular, the amplitude distribution has an exponential tail and its power spectrum exhibits a modified power-law behavior, which is manifested by self-similarity and long-range temporal correlations. Furthermore, the statistics for different bands within a given ensemble are virtually identical, suggesting translation invariance along the cochlear axis. These results show that natural sounds are highly redundant, and have possible implications for the neural code used by the auditory system. \n\n1 Introduction \n\nThe capacity of the auditory system to represent the auditory scene is restricted by the finite number of cells and by intrinsic noise. This fact limits the ability of the organism to discriminate between different sounds with similar spectro-temporal \n\n*Corresponding author. 
E-mail: hagai@phy.ucsf.edu. \n\u2020E-mail: chris@phy.ucsf.edu. \n\ncharacteristics. However, it is possible to enhance the discrimination ability by a suitable choice of the encoding procedure used by the system, namely of the transformation of sounds reaching the cochlea to neural spike trains generated in successive processing stages in response to these sounds. In general, the choice of a good encoding procedure requires knowledge of the statistical structure of the sound ensemble. \n\nFor the visual system, several investigations of the statistical properties of image ensembles and their relations to neuronal response properties have recently been performed (Field 1987, Atick and Redlich 1990, Ruderman and Bialek 1994). In particular, receptive fields of retinal ganglion and LGN cells were found to be consistent with an optimal-code prediction formulated within information theory (Atick 1992, Dong and Atick 1995), suggesting that the visual periphery may be designed so as to take advantage of simple statistical properties of visual scenes. \n\nIn order to investigate whether the auditory system is similarly adapted to the statistical structure of its own inputs, a good characterization of auditory scenes is necessary. In this paper we take a first step in this direction by studying low-order statistical properties of several sound ensembles. The quantities we focus on are the spectro-temporal amplitude and phase, defined as follows. For the sound $s(t)$, let $s_\\nu(t)$ denote its components at the set of frequencies $\\nu$, obtained by filtering it through a bandpass filter bank centered at those frequencies. 
Then \n\n$$s_\\nu(t) = x_\\nu(t) \\cos(\\nu t + \\phi_\\nu(t)) \\qquad (1)$$ \n\nwhere $x_\\nu(t) \\ge 0$ and $\\phi_\\nu(t)$ are the spectro-temporal amplitude (STA) and phase (STP), respectively. A complete characterization of a sound ensemble with respect to a given filter bank must be given by the joint distribution of amplitudes and phases at all times, $P(x_{\\nu_1}(t_1), \\phi_{\\nu_1}(t_1'), \\ldots, x_{\\nu_n}(t_n), \\phi_{\\nu_n}(t_n'))$. In this paper, however, we restrict ourselves to second-order statistics in the time domain and examine the distribution and power spectrum of the stochastic processes $x_\\nu(t)$ and $\\phi_\\nu(t)$. \n\nNote that the STA and STP are quantities directly relevant to auditory processing. The different stages of the auditory system are organized in topographic frequency maps, so that cells tuned to the same sound frequency $\\nu$ are organized in stripes perpendicular to the direction of frequency progression (see, e.g., Pickles 1988). The neuronal responses are thus determined by $x_\\nu$ and $\\phi_\\nu$, and by $x_\\nu$ alone when phase-locking disappears above 4-5 kHz. \n\n2 Methods \n\nSince it is difficult to obtain a reliable sample of an animal's auditory scene over a sufficiently long time, we chose instead to analyze several different sound ensembles, each consisting of a 15 min sound of a certain type. We used cat vocalizations, bird songs, wolf cries, environmental sounds, symphonic music, jazz, pop music, and speech. The sounds were obtained from commercially available compact discs and from recordings of animal vocalizations in two laboratories. No attempt has been made to manipulate the recorded sounds in any way (e.g., by removing noise). \n\nEach sound ensemble was loaded into the computer by 30 sec segments at a sampling rate of $f_s = 44.1$ kHz. After decimating to $f_s/2$, we performed the following frequency-band analysis. 
Each segment was passed through a bandpass filter bank with impulse responses $h_\\nu(t)$ to get the narrow-band component signals $s_\\nu(t) = s(t) * h_\\nu(t)$. We used square, non-overlapping filters with center frequencies $\\nu$ logarithmically spaced within the range of 100 - 11025 Hz. The filters were usually 1/8-octave wide, but we experimented with larger bandwidths as well. The amplitude and phase in band $\\nu$ were then obtained via the Hilbert transform \n\n$$H[s_\\nu(t)] = s_\\nu(t) + \\frac{i}{\\pi} \\int dt' \\, \\frac{s_\\nu(t')}{t - t'} = x_\\nu(t) e^{i(\\nu t + \\phi_\\nu(t))} . \\qquad (2)$$ \n\nThe frequency content of $x_\\nu$ is bounded by 0 and by the bandwidth of $h_\\nu$ (Flanagan 1980), so keeping the latter below $\\nu$ guarantees that the low frequencies in $s_\\nu$ are all contained in $x_\\nu$, confirming its interpretation as the amplitude modulator of the carrier $\\cos \\nu t$ suggested by (1). The phase $\\phi_\\nu$, being time-dependent, produces frequency modulation. For a given $\\nu$ the results were averaged over all segments. \n\nFigure 1: Amplitude probability distribution in different frequency bands for four sound ensembles. Panels: symphonic music, speech, cat vocalizations, environmental sounds; horizontal axis $a = \\log_{10}(x)$. \n\n3 Amplitude Distribution \n\nWe first examined the STA distribution in different frequency bands $\\nu$. Fig. 
1 presents histograms of $p(\\log_{10} x_\\nu)$ on a logarithmic scale for four different sound ensembles. In order to facilitate a comparison among different bands and ensembles, we normalized the variable to have zero mean and unit variance, $\\langle \\log_{10} x_\\nu(t) \\rangle = 0$, $\\langle (\\log_{10} x_\\nu(t))^2 \\rangle = 1$, corresponding to a linear gain control. \n\nFigure 2: n-point averaged amplitude distributions for $\\nu = 800$ Hz in two sound ensembles, using n = 1, 20, 50, 100, 200. The speech ensemble is different from the one used in Fig. 1. Panels: symphonic music, speech; horizontal axis $a = \\log_{10}(x)$. \n\nAs shown in Fig. 1, within a given ensemble, the histograms corresponding to different bands lie atop one another. Furthermore, although curves from different ensembles are not identical, we found that they could all be fitted accurately to the same parametric functional form, given by \n\n$$p(x_\\nu) \\propto \\frac{e^{-\\gamma x_\\nu}}{(b_0^2 + x_\\nu^2)^{\\beta/2}} \\qquad (3)$$ \n\nwith parameter values roughly in the range of $0.1 \\le \\gamma \\le 1$, $0 \\le \\beta \\le 2.5$, and $0.1 \\le b_0 \\le 0.6$. In some cases, a mixture of two distributions of the form (3) was necessary, suggesting the presence of two types of sound sources; see, e.g., the slight bimodality in the lower parts of Fig. 1. Details of the fitting procedure will be given in a longer paper. We found the form (3) to be preserved as the filter bandwidths increased. 
\n\nWhereas this distribution decays exponentially fast at high amplitudes ($p \\propto e^{-\\gamma x_\\nu} / x_\\nu^\\beta$), it does not vanish at low amplitudes, indicating a finite probability for the occurrence of arbitrarily soft sounds. In contrast, the STA of a Gaussian noise signal can be shown to be distributed according to $p \\propto x_\\nu e^{-\\lambda x_\\nu^2}$, which vanishes at $x_\\nu = 0$ and decays faster than (3) at large $x_\\nu$. Hence, the origin of the large dynamic range usually associated with audio signals can be traced to the abundance of soft sounds rather than of loud ones. \n\nFigure 3: Amplitude power spectrum in different frequency bands for four sound ensembles. Panels: symphonic music, speech, cat vocalizations, environmental sounds; horizontal axis $\\log_{10}(f)$. \n\n4 Amplitude Self-Similarity \n\nAn interesting probe of the STA temporal correlations is the property of scale invariance (also called statistical self-similarity). The process $x_\\nu(t)$ is scale-invariant when any statistical quantity on a given scale (e.g., at a given temporal resolution, determined by the sampling rate) does not change as that scale is varied. To observe this property we examined the STA distribution $p(x_\\nu)$ at different temporal resolutions, by defining the n-point averaged amplitude \n\n$$x_\\nu^{(n)}(t) = \\frac{1}{n} \\sum_{k=0}^{n-1} x_\\nu(t + k\\Delta) \\qquad (4)$$ \n\n($\\Delta = 1/f_s$) and computing its distribution. Fig. 
2 displays the histograms of $p(\\log_{10} x_\\nu^{(n)})$ for the $\\nu = 800$ Hz frequency band in two sound ensembles on a logarithmic scale, using n = 1, 20, 50, 100, 200, which correspond to a temporal resolution range of 0.75 - 150 msec. Remarkably, the histogram remains unmodified even for n = 200. Had the $x_\\nu(t + k\\Delta)$ been statistically independent variables, the central limit theorem would have predicted a Gaussian $p(x_\\nu^{(n)})$ for large n. The fact that this non-Gaussian distribution preserves its form as n increases implies the presence of temporal STA correlations over long periods. \n\nNotice the analogy between the invariance of $p(x_\\nu)$ under a change in filter bandwidth, reported in the previous section, and under a change in temporal resolution. An $x_\\nu$ with a broad bandwidth is essentially an average over the $x_\\nu$'s with narrow bandwidth within the same band; thus bandwidth invariance is a manifestation of STA correlations across frequency bands. \n\n5 Amplitude Power Spectrum \n\nIn order to study the temporal amplitude correlations directly, we computed the STA power spectrum $S_\\nu(\\omega) = \\langle |\\hat{x}_\\nu(\\omega)|^2 \\rangle$ in different bands $\\nu$, where $\\hat{x}_\\nu(\\omega)$ is the Fourier transform of the log-amplitude $\\log_{10} x_\\nu(t)$ obtained by a 512-point FFT. As is well-known, the spectrum $S_\\nu(\\omega)$ is the Fourier transform of the log-amplitude auto-correlation function $c_\\nu(\\tau) = \\langle \\log_{10} x_\\nu(t) \\log_{10} x_\\nu(t + \\tau) \\rangle$. We used the zero-mean, unit-variance normalization of $\\log_{10} x_\\nu$, which implies the normalization $\\int d\\omega \\, S_\\nu(\\omega) = \\text{const.}$ of the spectra. Fig. 3 presents $S_\\nu$ as a function of the modulation frequency $f = \\omega / 2\\pi$ on a logarithmic scale for four different sound ensembles. 
Notice that, as in the case of the STA distribution, the different curves corresponding to different frequency bands within a given ensemble lie atop one another, including individual peaks; and whereas spectra in different ensembles are not identical, we found a simple parametric description valid for all ensembles which is given by \n\n(5) \n\nwith parameter values roughly in the range of $1 \\le \\alpha \\le 2.5$ and $10^{-4} \\le \\omega_0 \\le 1$. This is a modified power-law form (note that $S_\\nu \\to c / \\omega^\\alpha$ at large $\\omega$), implying long-range temporal correlations in the amplitude: these correlations decrease slowly (as a power law in t) on a time scale of $1/\\omega_0$, beyond which they decay exponentially fast. Larger $\\omega_0$ contributes more to the flattening of the spectrum at low frequencies (see especially the speech spectra) and corresponds to a shorter correlation time. Again, in some cases a sum of two such forms was necessary, corresponding to a mixture STA distribution as mentioned above; see, e.g., the environmental sound spectra (lower right part of Fig. 3 and Fig. 1). \n\nThe form (5) persisted as the filter bandwidth increased. In the limit of an allpass filter (not shown) we still observed this form, a fact related to the report of (Voss and Clarke 1975) on 1/f-like power spectra of sound 'loudness' $s(t)^2$ found in several speech and music ensembles. \n\n6 Phase Distribution and Power Spectrum \n\nWhereas the STA is a non-stationary process which is locally stationary and can thus be studied on the appropriate time scale using our methods, the STP is non-stationary even locally. A more suitable quantity to examine is its rate of change $d\\phi_\\nu / dt$, called the instantaneous frequency. 
We studied the statistics of $|d\\phi_\\nu / dt|$ in different ensembles, and found its distribution to be described accurately by the parametric form (3) with $\\gamma = 0$, whereas its power spectrum could be well fitted by the form (5). In addition, those quantities were virtually identical in different bands within a given ensemble. More details on this work will be provided in a longer paper. \n\n7 Implications for Auditory Processing \n\nWe have shown that auditory scenes have several robust low-order statistical properties. The STA power spectrum has a modified power-law behavior, which is manifested in self-similarity and temporal correlations over a few hundred milliseconds. The distribution has an exponential tail and features a finite probability for arbitrarily soft sounds. Both the phase and amplitude statistics can be described by simple parametrized functional forms which are valid for very different types of sounds. These results lead to the conclusion that natural sounds are highly redundant, i.e., they occupy a very small subspace in the space of all possible sounds. It would therefore be beneficial for the auditory system to adapt its sound representation to these statistics, thus improving the animal's discrimination ability. Whether the auditory system actually follows this design principle is an empirical question which can be attacked by suitable experiments. 
\n\nFurthermore, since different frequency bands correspond to different spatial locations on the basilar membrane (Pickles 1988), the fact that the distributions and spectra in different bands within a given ensemble are identical suggests the existence of translation invariance along the cochlear axis, i.e., all the locations in the cochlea 'see' the same statistics. This is analogous to the translation invariance found in natural images. \n\nFinally, a recent theory for peripheral visual processing (Dong and Atick 1995) proposes that, in order to maximize information transmission into cortex, the LGN performs temporal decorrelation of retinal images. Within an analogous auditory model, the decorrelation time for sound ensembles reported here implies that the auditory system should process incoming sounds by a few hundred msec-long segments. The ability of cortical neurons to follow in their response modulation rates near and below 10 Hz but usually not higher (Schreiner and Urbas 1988) may reflect such a process. \n\nAcknowledgements \n\nWe thank B. Bonham, K. Miller, S. Nagarajan, and especially W. Bialek for helpful discussions and suggestions. We also thank F. Theunissen for making his bird song recordings available to us. Supported by The Office of Naval Research (N00014-94-1-0547). H.A. was supported by a Sloan Foundation grant for the Sloan Center for Theoretical Neurobiology. \n\nReferences \n\nJ.J. Atick and N. Redlich (1990), Towards a theory of early visual processing. Neural Comput. 2, 308-320. \n\nJ.J. Atick (1992), Could information theory provide an ecological theory of sensory processing. Network 3, 213-251. \n\nD.W. Dong and J.J. Atick (1995), Temporal decorrelation: a theory of lagged and non-lagged responses in the lateral geniculate nucleus. Network 6, 159-178. \n\nD.J. 
Field (1987), Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. 4, 2379-2394. \n\nJ.L. Flanagan (1980), Parametric coding of speech spectra. J. Acoust. Soc. Am. 68, 412-419. \n\nJ.O. Pickles (1988), An introduction to the physiology of hearing (2nd Ed.). San Diego, CA: Academic Press. \n\nD.L. Ruderman and W. Bialek (1994), Statistics of natural images: scaling in the woods. Phys. Rev. Lett. 73, 814-817. \n\nC.E. Schreiner and J.V. Urbas (1988), Representation of amplitude modulation in the auditory cortex of the cat. II. Comparison between cortical fields. Hear. Res. 32, 49-63. \n\nR.F. Voss and J. Clarke (1975), 1/f noise in music and speech. Nature 258, 317-318. \n", "award": [], "sourceid": 1262, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}, {"given_name": "Christoph", "family_name": "Schreiner", "institution": null}]}