{"title": "Temporal Low-Order Statistics of Natural Sounds", "book": "Advances in Neural Information Processing Systems", "page_first": 27, "page_last": 33, "abstract": null, "full_text": "Temporal Low-Order Statistics of Natural Sounds \n\nH. Attias* and C.E. Schreiner\u2020 \n\nSloan Center for Theoretical Neurobiology and \n\nW.M. Keck Foundation Center for Integrative Neuroscience \n\nUniversity of California at San Francisco \n\nSan Francisco, CA 94143-0444 \n\n*Corresponding author. E-mail: hagai@phy.ucsf.edu. \n\u2020E-mail: chris@phy.ucsf.edu. \n\nAbstract \n\nIn order to process incoming sounds efficiently, it is advantageous \nfor the auditory system to be adapted to the statistical structure of \nnatural auditory scenes. As a first step in investigating the relation \nbetween the system and its inputs, we study low-order statistical \nproperties in several sound ensembles using a filter bank analysis. \nFocusing on the amplitude and phase in different frequency bands, \nwe find simple parametric descriptions for their distribution and \npower spectrum that are valid for very different types of sounds. \nIn particular, the amplitude distribution has an exponential tail \nand its power spectrum exhibits a modified power-law behavior, \nwhich is manifested by self-similarity and long-range temporal \ncorrelations. Furthermore, the statistics for different bands within a \ngiven ensemble are virtually identical, suggesting translation \ninvariance along the cochlear axis. These results show that natural \nsounds are highly redundant, and have possible implications for the \nneural code used by the auditory system. \n\n1 Introduction \n\nThe capacity of the auditory system to represent the auditory scene is restricted by \nthe finite number of cells and by intrinsic noise. This fact limits the ability of the \norganism to discriminate between different sounds with similar spectro-temporal \ncharacteristics. However, it is possible to enhance the discrimination ability by \na suitable choice of the encoding procedure used by the system, namely, the \ntransformation of sounds reaching the cochlea into neural spike trains generated in \nsuccessive processing stages in response to these sounds. In general, the choice of \na good encoding procedure requires knowledge of the statistical structure of the \nsound ensemble. \n\n
For the visual system, several investigations of the statistical properties of image \nensembles and their relations to neuronal response properties have recently been \nperformed (Field 1987, Atick and Redlich 1990, Ruderman and Bialek 1994). In \nparticular, receptive fields of retinal ganglion and LGN cells were found to be \nconsistent with an optimal-code prediction formulated within information theory (Atick \n1992, Dong and Atick 1995), suggesting that the visual periphery may be designed \nso as to take advantage of simple statistical properties of visual scenes. \n\nIn order to investigate whether the auditory system is similarly adapted to the \nstatistical structure of its own inputs, a good characterization of auditory scenes is \nnecessary. In this paper we take a first step in this direction by studying low-order \nstatistical properties of several sound ensembles. The quantities we focus on are \nthe spectro-temporal amplitude and phase, defined as follows. For the sound $s(t)$, \nlet $s_\\nu(t)$ denote its components at the set of frequencies $\\nu$, obtained by filtering it \nthrough a bandpass filter bank centered at those frequencies. Then \n\n$$s_\\nu(t) = x_\\nu(t) \\cos(\\nu t + \\phi_\\nu(t)) \\tag{1}$$ \n\nwhere $x_\\nu(t) \\geq 0$ and $\\phi_\\nu(t)$ are the spectro-temporal amplitude (STA) and phase \n(STP), respectively. A complete characterization of a sound ensemble with respect \nto a given filter bank is given by the joint distribution of amplitudes and \nphases at all times, $p(x_{\\nu_1}(t_1), \\phi_{\\nu_1}(t_1'), \\ldots, x_{\\nu_n}(t_n), \\phi_{\\nu_n}(t_n'))$. In this paper, however, \nwe restrict ourselves to second-order statistics in the time domain and examine the \ndistribution and power spectrum of the stochastic processes $x_\\nu(t)$ and $\\phi_\\nu(t)$. \n\n
Note that the STA and STP are quantities directly relevant to auditory processing. \nThe different stages of the auditory system are organized in topographic frequency \nmaps, so that cells tuned to the same sound frequency $\\nu$ are organized in stripes \nperpendicular to the direction of frequency progression (see, e.g., Pickles 1988). \nThe neuronal responses are thus determined by $x_\\nu$ and $\\phi_\\nu$, and by $x_\\nu$ alone when \nphase-locking disappears above 4-5 kHz. \n\n2 Methods \n\nSince it is difficult to obtain a reliable sample of an animal's auditory scene over a \nsufficiently long time, we chose instead to analyze several different sound ensembles, \neach consisting of a 15 min sound of a certain type. We used cat vocalizations, bird \nsongs, wolf cries, environmental sounds, symphonic music, jazz, pop music, and \nspeech. The sounds were obtained from commercially available compact discs and \nfrom recordings of animal vocalizations in two laboratories. No attempt has been \nmade to manipulate the recorded sounds in any way (e.g., by removing noise). \n\nEach sound ensemble was loaded into the computer in 30 sec segments at a \nsampling rate of $f_s = 44.1$ kHz. After decimating to $f_s/2$, we performed the \nfollowing frequency-band analysis. \n
Each segment was passed through a bandpass filter \nbank with impulse responses $h_\\nu(t)$ to get the narrow-band component signals \n$s_\\nu(t) = s(t) * h_\\nu(t)$. We used square, non-overlapping filters with center \nfrequencies $\\nu$ logarithmically spaced within the range of 100 - 11025 Hz. The filters were \nusually 1/8-octave wide, but we experimented with larger bandwidths as well. The \namplitude and phase in band $\\nu$ were then obtained via the Hilbert transform \n\n$$H[s_\\nu(t)] = s_\\nu(t) + \\frac{i}{\\pi} \\int dt' \\, \\frac{s_\\nu(t')}{t - t'} = x_\\nu(t) \\, e^{i(\\nu t + \\phi_\\nu(t))} \\tag{2}$$ \n\nThe frequency content of $x_\\nu$ is bounded by 0 and by the bandwidth of $h_\\nu$ (Flanagan \n1980), so keeping the latter below $\\nu$ guarantees that the low frequencies in $s_\\nu$ are \nall contained in $x_\\nu$, confirming its interpretation as the amplitude modulator of \nthe carrier $\\cos \\nu t$ suggested by (1). The phase $\\phi_\\nu$, being time-dependent, produces \nfrequency modulation. For a given $\\nu$ the results were averaged over all segments. \n\nFigure 1: Amplitude probability distribution in different frequency bands for four \nsound ensembles. (Panels: symphonic music, speech, cat vocalizations, environmental \nsounds; horizontal axis: a = log10(x).) \n\n3 Amplitude Distribution \n\nWe first examined the STA distribution in different frequency bands $\\nu$. Fig. 1 \npresents histograms of $p(\\log_{10} x_\\nu)$ on a logarithmic scale for four different sound \nensembles. \n
In order to facilitate a comparison among different bands and ensembles, \nwe normalized the variable to have zero mean and unit variance, $\\langle \\log_{10} x_\\nu(t) \\rangle = 0$, \n$\\langle (\\log_{10} x_\\nu(t))^2 \\rangle = 1$, corresponding to a linear gain control. \n\nFigure 2: n-point averaged amplitude distributions for $\\nu = 800$ Hz in two sound \nensembles, using $n = 1, 20, 50, 100, 200$. The speech ensemble is different from the \none used in Fig. 1. (Panels: symphonic music, speech; horizontal axis: a = log10(x).) \n\nAs shown in Fig. 1, within a given ensemble, the histograms corresponding to \ndifferent bands lie atop one another. Furthermore, although curves from different \nensembles are not identical, we found that they could all be fitted accurately to the \nsame parametric functional form, given by \n\n$$p(x_\\nu) \\propto \\frac{e^{-\\gamma x_\\nu}}{(b_0^2 + x_\\nu^2)^{\\beta/2}} \\tag{3}$$ \n\nwith parameter values roughly in the range of $0.1 \\leq \\gamma \\leq 1$, $0 \\leq \\beta \\leq 2.5$, and \n$0.1 \\leq b_0 \\leq 0.6$. In some cases, a mixture of two distributions of the form (3) was \nnecessary, suggesting the presence of two types of sound sources; see, e.g., the slight \nbimodality in the lower parts of Fig. 1. Details of the fitting procedure will be given \nin a longer paper. We found the form (3) to be preserved as the filter bandwidths \nincreased. \n\nWhereas this distribution decays exponentially fast at high amplitudes \n($p \\propto e^{-\\gamma x_\\nu} / x_\\nu^\\beta$), it does not vanish at low amplitudes, indicating a finite probability \nfor the occurrence of arbitrarily soft sounds. In contrast, the STA of a Gaussian \nnoise signal can be shown to be distributed according to $p \\propto x_\\nu e^{-\\lambda x_\\nu^2}$, which \nvanishes at $x_\\nu = 0$ and decays faster than (3) at large $x_\\nu$. \n
Hence, the origin of the large \ndynamic range usually associated with audio signals can be traced to the abundance \nof soft sounds rather than of loud ones. \n\nFigure 3: Amplitude power spectrum in different frequency bands for four sound \nensembles. (Panels: symphonic music, speech, cat vocalizations, environmental \nsounds; horizontal axis: log10(f).) \n\n4 Amplitude Self-Similarity \n\nAn interesting probe of the STA temporal correlations is the property of scale \ninvariance (also called statistical self-similarity). The process $x_\\nu(t)$ is scale-invariant \nwhen any statistical quantity on a given scale (e.g., at a given temporal resolution, \ndetermined by the sampling rate) does not change as that scale is varied. To \nobserve this property we examined the STA distribution $p(x_\\nu)$ at different temporal \nresolutions, by defining the n-point averaged amplitude \n\n$$x_\\nu^{(n)}(t) = \\frac{1}{n} \\sum_{k=0}^{n-1} x_\\nu(t + k\\Delta) \\tag{4}$$ \n\n($\\Delta = 1/f_s$) and computing its distribution. Fig. 2 displays the histograms of \n$p(\\log_{10} x_\\nu^{(n)})$ for the $\\nu = 800$ Hz frequency band in two sound ensembles on a \nlogarithmic scale, using $n = 1, 20, 50, 100, 200$, which correspond to a temporal resolution \nrange of 0.75 - 150 msec. Remarkably, the histogram remains unmodified even for \n$n = 200$. Had the $x_\\nu(t + k\\Delta)$ been statistically independent variables, the central \nlimit theorem would have predicted a Gaussian $p(x_\\nu^{(n)})$ for large $n$. \n
The fact that \nthis non-Gaussian distribution preserves its form as $n$ increases implies the presence \nof temporal STA correlations over long periods. \n\nNotice the analogy between the invariance of $p(x_\\nu)$ under a change in filter \nbandwidth, reported in the previous section, and under a change in temporal resolution. \nAn $x_\\nu$ with a broad bandwidth is essentially an average over the $x_\\nu$'s with narrow \nbandwidth within the same band; thus bandwidth invariance is a manifestation of \nSTA correlations across frequency bands. \n\n5 Amplitude Power Spectrum \n\nIn order to study the temporal amplitude correlations directly, we computed the \nSTA power spectrum $S_\\nu(\\omega) = \\langle |x_\\nu(\\omega)|^2 \\rangle$ in different bands $\\nu$, where $x_\\nu(\\omega)$ is \nthe Fourier transform of the log-amplitude $\\log_{10} x_\\nu(t)$ obtained by a 512-point \nFFT. As is well-known, the spectrum $S_\\nu(\\omega)$ is the Fourier transform of the \nlog-amplitude auto-correlation function $c_\\nu(\\tau) = \\langle \\log_{10} x_\\nu(t) \\, \\log_{10} x_\\nu(t + \\tau) \\rangle$. We used \nthe zero-mean, unit-variance normalization of $\\log_{10} x_\\nu$, which implies the \nnormalization $\\int d\\omega \\, S_\\nu(\\omega) = \\mathrm{const.}$ of the spectra. Fig. 3 presents $S_\\nu$ as a function of \nthe modulation frequency $f = \\omega/2\\pi$ on a logarithmic scale for four different sound \nensembles. Notice that, as in the case of the STA distribution, the different curves \ncorresponding to different frequency bands within a given ensemble lie atop one \nanother, including individual peaks; and whereas spectra in different ensembles are \nnot identical, we found a simple parametric description valid for all ensembles, which \nis given by \n\n$$S_\\nu(\\omega) = \\frac{c}{(\\omega_0^2 + \\omega^2)^{\\alpha/2}} \\tag{5}$$ \n\nwith parameter values roughly in the range of $1 \\leq \\alpha \\leq 2.5$ and $10^{-4} \\leq \\omega_0 \\leq 1$. \n
This \nis a modified power-law form (note that $S_\\nu \\to c/\\omega^\\alpha$ at large $\\omega$), implying \nlong-range temporal correlations in the amplitude: these correlations decrease slowly (as \na power law in $t$) on a time scale of $1/\\omega_0$, beyond which they decay exponentially \nfast. Larger $\\omega_0$ contributes more to the flattening of the spectrum at low frequencies \n(see especially the speech spectra) and corresponds to a shorter correlation time. \nAgain, in some cases a sum of two such forms was necessary, corresponding to a \nmixture STA distribution as mentioned above; see, e.g., the environmental sound \nspectra (lower right part of Fig. 3 and Fig. 1). \n\nThe form (5) persisted as the filter bandwidth increased. In the limit of an allpass filter \n(not shown) we still observed this form, a fact related to the report of Voss and \nClarke (1975) on $1/f$-like power spectra of sound 'loudness' $s(t)^2$ found in several \nspeech and music ensembles. \n\n6 Phase Distribution and Power Spectrum \n\nWhereas the STA is a non-stationary process which is locally stationary and can \nthus be studied on the appropriate time scale using our methods, the STP is \nnon-stationary even locally. A more suitable quantity to examine is its rate of change \n$d\\phi_\\nu/dt$, called the instantaneous frequency. We studied the statistics of $|d\\phi_\\nu/dt|$ \nin different ensembles, and found its distribution to be described accurately by the \nparametric form (3) with $\\gamma = 0$, whereas its power spectrum could be well fitted \nby the form (5). In addition, those quantities were virtually identical in different \nbands within a given ensemble. More details on this work will be provided in a \nlonger paper. \n\n7 Implications for Auditory Processing \n\nWe have shown that auditory scenes have several robust low-order statistical \nproperties. \n
The STA power spectrum has a modified power-law behavior, which is \nmanifested in self-similarity and temporal correlations over a few hundred \nmilliseconds. The STA distribution has an exponential tail and features a finite probability for \narbitrarily soft sounds. Both the phase and amplitude statistics can be described \nby simple parametrized functional forms which are valid for very different types of \nsounds. These results lead to the conclusion that natural sounds are highly \nredundant, i.e., they occupy a very small subspace in the space of all possible sounds. It \nwould therefore be beneficial for the auditory system to adapt its sound \nrepresentation to these statistics, thus improving the animal's discrimination ability. Whether \nthe auditory system actually follows this design principle is an empirical question \nwhich can be attacked by suitable experiments. \n\nFurthermore, since different frequency bands correspond to different spatial \nlocations on the basilar membrane (Pickles 1988), the fact that the distributions and \nspectra in different bands within a given ensemble are identical suggests the \nexistence of translation invariance along the cochlear axis, i.e., all the locations in the \ncochlea 'see' the same statistics. This is analogous to the translation invariance \nfound in natural images. \n\nFinally, a recent theory for peripheral visual processing (Dong and Atick 1995) \nproposes that, in order to maximize information transmission into cortex, the LGN \nperforms temporal decorrelation of retinal images. Within an analogous auditory \nmodel, the decorrelation time for sound ensembles reported here implies that the \nauditory system should process incoming sounds in segments a few hundred msec \nlong. \n
The ability of cortical neurons to follow modulation rates near and below 10 Hz \nin their responses, but usually not higher rates (Schreiner and Urbas 1988), may reflect \nsuch a process. \n\nAcknowledgements \n\nWe thank B. Bonham, K. Miller, S. Nagarajan, and especially W. Bialek for helpful \ndiscussions and suggestions. We also thank F. Theunissen for making his bird song \nrecordings available to us. Supported by The Office of Naval Research \n(N00014-94-1-0547). H.A. was supported by a Sloan Foundation grant for the Sloan Center for \nTheoretical Neurobiology. \n\nReferences \n\nJ.J. Atick and N. Redlich (1990), Towards a theory of early visual processing. Neural \nComput. 2, 308-320. \n\nJ.J. Atick (1992), Could information theory provide an ecological theory of sensory \nprocessing? Network 3, 213-251. \n\nD.W. Dong and J.J. Atick (1995), Temporal decorrelation: a theory of lagged and \nnon-lagged responses in the lateral geniculate nucleus. Network 6, 159-178. \n\nD.J. Field (1987), Relations between the statistics of natural images and the \nresponse properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394. \n\nJ.L. Flanagan (1980), Parametric coding of speech spectra. J. Acoust. Soc. Am. \n68, 412-419. \n\nJ.O. Pickles (1988), An introduction to the physiology of hearing (2nd Ed.). San \nDiego, CA: Academic Press. \n\nD.L. Ruderman and W. Bialek (1994), Statistics of natural images: scaling in the \nwoods. Phys. Rev. Lett. 73, 814-817. \n\nC.E. Schreiner and J.V. Urbas (1988), Representation of amplitude modulation in the \nauditory cortex of the cat. II. Comparison between cortical fields. Hear. Res. 32, \n49-63. \n\nR.F. Voss and J. Clarke (1975), 1/f noise in music and speech. Nature 258, 317-318. \n", "award": [], "sourceid": 1262, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}, {"given_name": "Christoph", "family_name": "Schreiner", "institution": null}]}