{"title": "Natural Sound Statistics and Divisive Normalization in the Auditory System", "book": "Advances in Neural Information Processing Systems", "page_first": 166, "page_last": 172, "abstract": null, "full_text": "Natural sound statistics and divisive normalization in the auditory system

Odelia Schwartz
Center for Neural Science
New York University
odelia@cns.nyu.edu

Eero P. Simoncelli
Howard Hughes Medical Institute
Center for Neural Science, and
Courant Institute of Mathematical Sciences
New York University
eero.simoncelli@nyu.edu

Abstract

We explore the statistical properties of natural sound stimuli preprocessed with a bank of linear filters. The responses of such filters exhibit a striking form of statistical dependency, in which the response variance of each filter grows with the response amplitude of filters tuned for nearby frequencies. These dependencies may be substantially reduced using an operation known as divisive normalization, in which the response of each filter is divided by a weighted sum of the rectified responses of other filters. The weights may be chosen to maximize the independence of the normalized responses for an ensemble of natural sounds. We demonstrate that the resulting model accounts for nonlinearities in the response characteristics of the auditory nerve, by comparing model simulations to electrophysiological recordings. In previous work (NIPS, 1998) we demonstrated that an analogous model derived from the statistics of natural images accounts for non-linear properties of neurons in primary visual cortex. Thus, divisive normalization appears to be a generic mechanism for eliminating a type of statistical dependency that is prevalent in natural signals of different modalities.

Signals in the real world are highly structured. 
For example, natural sounds typically contain both harmonic and rhythmic structure. It is reasonable to assume that biological auditory systems are designed to represent these structures in an efficient manner [e.g., 1, 2]. Specifically, Barlow hypothesized that a role of early sensory processing is to remove redundancy in the sensory input, resulting in a set of neural responses that are statistically independent.

Experimentally, one can test this hypothesis by examining the statistical properties of neural responses under natural stimulation conditions [e.g., 3, 4], or the statistical dependency of pairs (or groups) of neural responses. Due to their technical difficulty, such multi-cellular experiments are only recently becoming possible, and the earliest reports in vision appear consistent with the hypothesis [e.g., 5]. An alternative approach, which we follow here, is to develop a neural model from the statistics of natural signals and show that response properties of this model are similar to those of biological sensory neurons.

A number of researchers have derived linear filter models using statistical criteria. For visual images, this results in linear filters localized in frequency, orientation and phase [6, 7]. Similar work in audition has yielded filters localized in frequency and phase [8]. Although these linear models provide an important starting point for neural modeling, sensory neurons are highly nonlinear. In addition, the statistical properties of natural signals are too complex to expect a linear transformation to result in an independent set of components.

Recent results indicate that nonlinear gain control plays an important role in neural processing. Ruderman and Bialek [9] have shown that division by a local estimate of standard deviation can increase the entropy of responses of center-surround filters to natural images. 
Such a model is consistent with the properties of neurons in the retina and lateral geniculate nucleus. Heeger and colleagues have shown that the nonlinear behaviors of neurons in primary visual cortex may be described using a form of gain control known as divisive normalization [10], in which the response of a linear kernel is rectified and divided by the sum of other rectified kernel responses and a constant. We have recently shown that the responses of oriented linear filters exhibit nonlinear statistical dependencies that may be substantially reduced using a variant of this model, in which the normalization signal is computed from a weighted sum of other rectified kernel responses [11, 12]. The resulting model, with weighting parameters determined from image statistics, accounts qualitatively for physiological nonlinearities observed in primary visual cortex.

In this paper, we demonstrate that the responses of bandpass linear filters to natural sounds exhibit striking statistical dependencies, analogous to those found in visual images. A divisive normalization procedure can substantially remove these dependencies. We show that this model, with parameters optimized for a collection of natural sounds, can account for nonlinear behaviors of neurons at the level of the auditory nerve. Specifically, we show that: 1) the shape of frequency tuning curves varies with sound pressure level, even though the underlying linear filters are fixed; and 2) superposition of a non-optimal tone suppresses the response of a linear filter in a divisive fashion, and the amount of suppression depends on the distance between the frequency of the tone and the preferred frequency of the filter.

1 Empirical observations of natural sound statistics

The basic statistical properties of natural sounds, as observed through a linear filter, have been previously documented by Attias [13]. 
In particular, he showed that, as with visual images, the spectral energy falls roughly according to a power law, and that the histograms of filter responses are more kurtotic than a Gaussian (i.e., they have a sharp peak at zero, and very long tails).

Here we examine the joint statistical properties of a pair of linear filters tuned for nearby temporal frequencies. We choose a fixed set of filters that have been widely used in modeling the peripheral auditory system [14]. Figure 1 shows joint histograms of the instantaneous responses of a particular pair of linear filters to five different types of natural sound, and white noise. First note that the responses are approximately decorrelated: the expected value of the y-axis value is roughly zero for all values of the x-axis variable. The responses are not, however, statistically independent: the width of the distribution of responses of one filter increases with the response amplitude of the other filter. If the two responses were statistically independent, then the response of the first filter should not provide any information about the distribution of responses of the other filter. We have found that this type of variance dependency (sometimes accompanied by linear correlation) occurs in a wide range of natural sounds, ranging from animal sounds to music. We emphasize that this dependency is a property of natural sounds, and is not due purely to our choice of linear filters. For example, no such dependency is observed when the input consists of white noise (see Fig. 1).

The strength of this dependency varies for different pairs of linear filters. 
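The variance dependency can be illustrated in a small simulation. The sketch below is a toy surrogate, not the paper's pipeline: it replaces recorded natural sounds with amplitude-modulated white noise (natural sounds share slow amplitude envelopes across frequency bands), and replaces the auditory filters of [14] with simple windowed-sinc bandpass filters; all filter parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
fs = 22050  # sampling frequency used in the paper

# Toy surrogate for a natural sound: white noise with a slowly varying
# amplitude envelope. Real natural sounds share such envelopes across
# frequency bands, which is what induces the variance dependency.
env = np.abs(np.convolve(rng.normal(size=T), np.ones(500) / 500, mode='same'))
natural = env * rng.normal(size=T)
white = rng.normal(size=T)

def bandpass(x, f0, bw=500.0, n=401):
    # Windowed-sinc bandpass centered at f0 (a stand-in for a gammatone).
    t = np.arange(n) - n // 2
    fl, fh = (f0 - bw / 2) / fs, (f0 + bw / 2) / fs
    h = 2 * fh * np.sinc(2 * fh * t) - 2 * fl * np.sinc(2 * fl * t)
    h *= np.hamming(n)
    return np.convolve(x, h, mode='same')

for name, x in [('natural (toy)', natural), ('white noise', white)]:
    a, b = bandpass(x, 2000.0), bandpass(x, 2840.0)
    # Raw responses are roughly decorrelated, but squared responses are
    # strongly correlated for the modulated signal: a variance dependency.
    print(name, round(np.corrcoef(a, b)[0, 1], 3),
          round(np.corrcoef(a ** 2, b ** 2)[0, 1], 3))
```

As in Figure 1, the dependency appears for the modulated signal but not for white noise, even though the filters are identical in both cases.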
In addition, we see this type of dependency between instantaneous responses of a single filter at two nearby time instants.

[Figure 1 panels: joint conditional histograms for Speech, Cat, Monkey, Drums, Nocturnal nature, and White noise.]

Figure 1: Joint conditional histogram of instantaneous linear responses of two bandpass filters with center frequencies 2000 and 2840 Hz. Pixel intensity corresponds to frequency of occurrence of a given pair of values, except that each column has been independently rescaled to fill the full intensity range. For the natural sounds, responses are not independent: the standard deviation of the ordinate is roughly proportional to the magnitude of the abscissa. Natural sounds were recorded from CDs and converted to a sampling frequency of 22050 Hz.

Since the dependency involves the variance of the responses, we can substantially reduce it by dividing. In particular, the response of each filter is squared and divided by a weighted sum of squared responses of other filters and an additive constant. Specifically:

R_i = L_i^2 / (Σ_j w_ji L_j^2 + σ^2)    (1)

where L_i is the instantaneous linear response of filter i, σ is a constant, and w_ji controls the strength of suppression of filter i by filter j.

We would like to choose the parameters of the model (the weights w_ji, and the constant σ) to optimize the independence of the normalized responses to an ensemble of natural sounds. Such an optimization is quite computationally expensive. We instead assume a Gaussian form for the underlying conditional distribution, as described in [15]:

P(L_i | L_j, j ∈ N_i) ~ N(0; Σ_j w_ji L_j^2 + σ^2)

where N_i is the neighborhood of linear filters that may affect filter i. We then maximize this expression over the sound data at each time t to obtain the parameters:

{w_ji, σ} = arg max Σ_t log P(L_i(t) | L_j(t), j ∈ N_i)    (2)

We solve for the optimal parameters numerically, using conjugate gradient descent. 
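The normalization of equation (1) and the Gaussian conditional likelihood behind equation (2) can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the parameterization (squaring raw parameters so that the weights and σ² stay non-negative) and the synthetic test data are our own simplifications.

```python
import numpy as np

def normalize(L, W, sigma):
    # Equation (1): divide each squared linear response by a weighted sum
    # of the squared responses of the other filters plus a constant.
    # L: (n_filters, T) array of instantaneous linear responses.
    return L ** 2 / (W @ (L ** 2) + sigma ** 2)

def neg_log_likelihood(params, L, i, neighbors):
    # Gaussian conditional model behind equation (2): L_i is zero-mean
    # with variance sum_j w_ji L_j^2 + sigma^2. Raw parameters are
    # squared so that the weights and sigma^2 remain positive.
    w = params[:-1] ** 2
    sigma2 = params[-1] ** 2
    var = w @ (L[neighbors] ** 2) + sigma2
    return np.sum(0.5 * np.log(2 * np.pi * var) + 0.5 * L[i] ** 2 / var)
```

Minimizing neg_log_likelihood over the sound ensemble (for example with scipy.optimize.minimize, which provides a conjugate-gradient method via method='CG') yields the weights and constant used in the simulations below.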
Note that the value of σ depends on the somewhat arbitrary scaling of the input signal (i.e., doubling the input strength would lead to a doubling of σ).

[Figure 2 diagram: a natural sound is passed through two linear filters; each squared filter response is divided by a weighted sum of the other squared filter responses plus a constant.]

Figure 2: Nonlinear whitening of a natural auditory signal with a divisive normalization model. The histogram on the left shows the statistical dependency of the responses of two linear bandpass filters. The joint histogram on the right shows the approximate independence of the normalized coefficients.

Figure 2 depicts our statistically derived neural model. A natural sound is passed through a bank of linear filters (only 2 depicted for readability). The responses of the filters to a natural sound exhibit a strong statistical dependency. Normalization largely removes this dependency, such that vertical cross sections through the joint conditional histogram are all roughly the same.

For the simulations in the next section, we use a set of Gammatone filters as the linear front end [14]. We choose a primary filter with center frequency 2000 Hz. We also choose a neighborhood of filters for the normalization signal: 16 filters with center frequencies 205 to 4768 Hz, and replicas of all filters temporally shifted by 100, 200, and 300 samples. We compute optimal values for σ and the normalization weights w_ji using equation (2), based on statistics of a natural sound ensemble containing 9 animal and speech sounds, each approximately 6 seconds long.

2 Model simulations vs. physiology

We compare the normalized responses of the primary filter in our model (with all parameter values held fixed at the optimal values described above) to data recorded electrophysiologically from auditory nerve. 
Figure 3 shows data from a \"two-tone suppression\" experiment, in which the response to an optimal tone is suppressed by the presence of a second tone of non-optimal frequency. Two-tone suppression is often demonstrated by showing that the rate-level function of the optimal tone alone is shifted to the right in the presence of a non-optimal tone. In both cell and model, we obtain a larger rightward shift when the non-optimal tone is relatively close in frequency to the optimal tone, and almost no rightward shift when the non-optimal tone is more than two times the optimal frequency. In the model, this behavior is due to the fact that the strength of statistical dependency (and thus the strength of normalization weighting) falls with the frequency separation of a pair of filters.

[Figure 3 panels: mean discharge rate vs. sound pressure level (decibels), for cell (Javel et al., 1978) and model, with no mask and with masks at 1.25, 1.55, and 2.00 times the characteristic frequency (CF).]

Figure 3: Two-tone suppression data. Each plot shows neural response as a function of SPL for a single tone (circles), and for a tone in the presence of a secondary suppressive tone at 80 dB SPL (squares). The maximum mean response rate in the model is scaled to fit the cell data. Cell data re-plotted from [16]. 
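The divisive origin of the rightward shift can be seen in a toy calculation. The sketch below is not the fitted model: the weights, criterion, and σ are hypothetical values chosen only to make the effect visible, with a single suppressor tone driving the normalization pool of equation (1).

```python
import numpy as np

def rate(level_db, w=0.0, mask_db=None, sigma=1.0):
    # Toy rate-level function under divisive normalization (equation 1):
    # the primary filter responds to the optimal tone, while a suppressor
    # tone drives only the normalization pool, weighted by w.
    L1 = 10.0 ** (level_db / 20.0)  # linear amplitude of the optimal tone
    L2 = 0.0 if mask_db is None else 10.0 ** (mask_db / 20.0)
    return L1 ** 2 / (w * L2 ** 2 + sigma ** 2)

levels = np.arange(20.0, 81.0)

def threshold(w):
    # Lowest level (dB) at which the response exceeds a fixed criterion,
    # with an 80 dB suppressor tone present.
    return levels[np.argmax(rate(levels, w=w, mask_db=80.0) > 1.0)]

# Hypothetical weights: larger for a suppressor near the preferred
# frequency, smaller for a remote one (weights fall with separation).
print(threshold(1e-4), threshold(1e-6))  # near suppressor shifts the curve further right
```

Because the suppressor enters only the denominator, it rescales the rate-level function, which on a logarithmic level axis appears as a rightward shift whose size grows with the normalization weight.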
[Figure 4 panels: mean discharge rate vs. frequency, for cell (Rose et al., 1971) and model.]

Figure 4: Frequency tuning curves for cell and model for different sound pressure levels. Cell data are re-plotted from [17].

Figure 4 shows frequency tuning for different sound pressure levels. As the sound pressure level (SPL) increases, the frequency tuning becomes broader, developing a \"shoulder\" and a secondary mode. Both cell and model show similar behavior, despite the fact that we are not fitting the model to these data: all parameters in the model are chosen by optimizing the independence of the responses to the ensemble of natural sounds. This result is particularly interesting because the data have been in the literature for many years, and are generally interpreted to mean that the frequency tuning properties of these cells vary with SPL. Our model suggests an alternative interpretation: the fundamental frequency tuning is determined by a fixed linear kernel, and is modulated by a divisive nonlinearity.

3 Discussion

We have developed a weighted divisive normalization model for early auditory processing. Both the form and parameters of the model are determined from natural sound statistics. We have shown that the model can account for some prominent nonlinearities occurring at the level of the auditory nerve. A number of authors have suggested forms of divisive gain control in auditory models. Wang et al. [18] suggest that gain control in early auditory processing is consistent with psychophysical data and might be advantageous for applications of noise removal. Auditory gain control is also a central concept in the work of Lyon (e.g., [19]). Our work may provide theoretical justification for such models of divisive gain control in the auditory system. 
Our model is limited in a number of important ways. The current model lacks a detailed specification of a physiological implementation. In particular, normalization must presumably be implemented using lateral or feedback connections between neurons [e.g., 20]. The normalization signal of the model is computed and applied instantaneously, and thus lacks temporal dynamical properties [e.g., 19]. In addition, we have not made any distinction between nonlinearities that arise mechanically in the cochlea, and nonlinearities that arise at the neural level. It is likely that normalization occurs at least partially in outer hair cells [21, 22].

On a more theoretical level, we have not addressed mechanisms by which the system optimizes itself. Our modeling uses parameters optimized for a fixed ensemble of natural sounds. Biologically, this optimization would presumably occur on multiple time scales through processes of evolution, development, learning, and adaptation. The ultimate question regarding the independence hypothesis underlying our model is: how far can such a bottom-up criterion go toward explaining neural processing? It seems likely that the model can be extended to account for levels of processing beyond the auditory nerve. For example, Nelken et al. [23] suggest that co-modulation masking release in auditory cortex results from the statistical structure of natural sound. But ultimately, it seems likely that one must also consider the auditory tasks, such as localization and recognition, that the organism must perform.

References

[1] F Attneave. Some informational aspects of visual perception. Psych. Rev., 61:183-193, 1954.

[2] H B Barlow. Possible principles underlying the transformation of sensory messages. In W A Rosenblith, editor, Sensory Communications, page 217. MIT Press, Cambridge, MA, 1961.

[3] Y Dan, J J Atick, and R C Reid. 
Efficient coding of natural scenes in the lateral geniculate nucleus: Experimental test of a computational theory. J. Neuroscience, 16:3351-3362, 1996.

[4] H Attias and C E Schreiner. Coding of naturalistic stimuli by auditory midbrain neurons. Adv. in Neural Info. Processing Systems, 10:103-109, 1998.

[5] W E Vinje and J L Gallant. Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, Feb 2000.

[6] B A Olshausen and D J Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7:333-339, 1996.

[7] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.

[8] A J Bell and T J Sejnowski. Learning the higher-order structure of a natural sound. Network: Computation in Neural Systems, 7:261-266, 1996.

[9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814-817, 1994.

[10] D J Heeger. Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9:181-198, 1992.

[11] E P Simoncelli and O Schwartz. Image statistics and cortical normalization models. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 153-159, Cambridge, MA, 1999. MIT Press.

[12] M J Wainwright, O Schwartz, and E P Simoncelli. Natural image statistics and divisive normalization: Modeling nonlinearities and adaptation in cortical neurons. In R Rao, B Olshausen, and M Lewicki, editors, Statistical Theories of the Brain. MIT Press, 2001. To appear.

[13] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv. in Neural Info. Processing Systems, volume 9, pages 27-33. MIT Press, 1997.

[14] M Slaney. 
An efficient implementation of the Patterson and Holdsworth auditory filter bank. Apple Technical Report 35, 1993.

[15] E P Simoncelli. Modeling the joint statistics of images in the wavelet domain. In Proc. SPIE, 44th Annual Meeting, volume 3813, Denver, July 1999. Invited presentation.

[16] E Javel, D Geisler, and A Ravindran. Two-tone suppression in auditory nerve of the cat: Rate-intensity and temporal analyses. J. Acoust. Soc. Am., 63(4):1093-1104, 1978.

[17] J E Rose, D J Anderson, and J F Brugge. Some effects of stimulus intensity on response of auditory nerve fibers in the squirrel monkey. J. Neurophysiol., 34:685-699, 1971.

[18] K Wang and S Shamma. Self-normalization and noise-robustness in early auditory representations. IEEE Trans. Speech and Audio Proc., 2:421-435, 1994.

[19] R F Lyon. Automatic gain control in cochlear mechanics. In P Dallos et al., editors, The Mechanics and Biophysics of Hearing, pages 395-420. Springer-Verlag, 1990.

[20] M Carandini, D J Heeger, and J A Movshon. Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17:8621-8644, 1997.

[21] D Geisler. From Sound to Synapse: Physiology of the Mammalian Ear. Oxford University Press, New York, 1998.

[22] H B Zhao and J Santos-Sacchi. Auditory collusion and a coupled couple of outer hair cells. Nature, 399(6734):359-362, 1999.

[23] I Nelken, Y Rotman, and O Bar Yosef. Responses of auditory-cortex neurons to structural features of natural sounds. Nature, 397(6715):154-157, 1999.", "award": [], "sourceid": 1860, "authors": [{"given_name": "Odelia", "family_name": "Schwartz", "institution": null}, {"given_name": "Eero", "family_name": "Simoncelli", "institution": null}]}