{"title": "Receptive Field Formation in Natural Scene Environments: Comparison of Single Cell Learning Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 423, "page_last": 429, "abstract": null, "full_text": "Receptive field formation in natural scene \n\nenvironments: comparison of single cell \n\nlearning rules \n\nBrian S. Blais \n\nN.lntrator \n\nBrown University Physics Department \n\nSchool of Mathematical Sciences \n\nProvidence, Rl 02912 \n\nTel-Aviv University \n\nRamat-Aviv, 69978 ISRAEL \n\nH. Shouval \n\nInstitute for Brain and Neural Systems \n\nBrown University \n\nProvidence, Rl 02912 \n\nLeon N Cooper \n\nBrown University Physics Department and \n\nInstitute for Brain and Neural Systems \n\nBrown University \n\nProvidence, Rl 02912 \n\nAbstract \n\nWe study several statistically and biologically motivated learning \nrules using the same visual environment, one made up of natural \nscenes, and the same single cell neuronal architecture. This allows \nus to concentrate on the feature extraction and neuronal coding \nproperties of these rules. Included in these rules are kurtosis and \nskewness maximization, the quadratic form of the BCM learning \nrule, and single cell ICA. Using a structure removal method, we \ndemonstrate that receptive fields developed using these rules de(cid:173)\npend on a small portion of the distribution. We find that the \nquadratic form of the BCM rule behaves in a manner similar to a \nkurtosis maximization rule when the distribution contains kurtotic \ndirections, although the BCM modification equations are compu(cid:173)\ntationally simpler. \n\n\f424 \n\nB. S. Blais, N. Intrator, H. Shouval and L N. Cooper \n\n1 \n\nINTRODUCTION \n\nRecently several learning rules that develop simple cell-like receptive fields in a \nnatural image environment have been proposed (Law and Cooper, 1994; Olshausen \nand Field, 1996; Bell and Sejnowski, 1997). 
The details of these rules differ, as does their computational reasoning; however, they all depend on statistics of order higher than two and they all produce sparse distributions.\n\nIn what follows we investigate several specific modification functions that have the general properties of BCM synaptic modification functions (Bienenstock et al., 1982), and study their feature extraction properties in a natural scene environment. Several of the rules we consider are derived from standard statistical measures (Kendall and Stuart, 1977), such as skewness and kurtosis, based on polynomial moments. We compare these with the quadratic form of BCM (Intrator and Cooper, 1992), though one should note that this is not the only form that could be used. By subjecting all of the learning rules to the same input statistics and retina/LGN preprocessing, and by studying in detail the single neuron case, we eliminate possible network/lateral interaction effects and can examine the properties of the learning rules themselves.\n\nWe compare the learning rules and the receptive fields they form, and introduce a procedure for directly measuring the sparsity of the representation a neuron learns. This gives us another way to compare the learning rules, and a more quantitative measure of the concept of sparse representations.\n\n2 MOTIVATION\n\nWe use two methods for motivating the use of the particular rules. One comes from Projection Pursuit (Friedman, 1987) and the other is Independent Component Analysis (Comon, 1994). These methods are related, as we shall see, but they provide two different approaches for the current work.\n\n2.1 EXPLORATORY PROJECTION PURSUIT\n\nDiaconis and Freedman (1984) show that for most high-dimensional clouds (of points), most low-dimensional projections are approximately Gaussian. 
This finding suggests that important information in the data is conveyed in those directions whose single-dimensional projected distribution is far from Gaussian.\n\nIntrator (1990) has shown that a BCM neuron can find structure in the input distribution that exhibits deviation from a Gaussian distribution in the form of multi-modality in the projected distributions. This type of deviation is particularly useful for finding clusters in high dimensional data. In the natural scene environment, however, the structure does not seem to be contained in clusters. In this work we show that the BCM neuron can still find interesting structure in non-clustered data.\n\nThe most common measures for deviation from a Gaussian distribution are skewness and kurtosis, which are functions of the first three and four moments of the distribution respectively. Rules based on these statistical measures satisfy the BCM conditions proposed in Bienenstock et al. (1982), including a threshold-based stabilization. The details of these rules and some of the qualitative features of the stabilization are different, however. In addition, there are some learning rules, such as the ICA rule of Bell and Sejnowski (1997) and the sparse coding algorithm of Olshausen and Field (1996), which have been used with natural scene inputs to produce oriented receptive fields. We do not include these in our comparison because they are not single cell learning rules, and thus detract from our immediate goal of comparing rules with the same input structure and neuronal architecture.\n\n2.2 INDEPENDENT COMPONENT ANALYSIS\n\nRecently it has been claimed that the independent components of natural scenes are the edges found in simple cells (Bell and Sejnowski, 1997). 
This was achieved through the maximization of the mutual entropy of a set of mixed signals. Others (Hyvarinen and Oja, 1996) have claimed that maximizing kurtosis can also lead to the separation of mixed signals into independent components. This alternate connection between kurtosis and receptive fields leads us into a discussion of ICA.\n\nIndependent Component Analysis (ICA) is a statistical signal processing technique whose goal is to express a set of random variables as a linear mixture of statistically independent variables. The problem of ICA is then to find the transformation from the observed mixed signals to the \"unmixed\" independent sources. The search for independent components relies on the fact that a linear mixture of two non-Gaussian distributions will become more Gaussian than either of them. Thus, by seeking projections which maximize deviations from a Gaussian distribution, we recover the original (independent) signals. This explains the connection of ICA to the framework of exploratory projection pursuit.\n\n3 SYNAPTIC MODIFICATION RULES\n\nIn this section we outline the derivation of the learning rules in this study. Neural activity is assumed to be a positive quantity, so for biological plausibility we denote by c the rectified activity σ(d · m), where σ(·) is a smooth monotonic function with a positive output (a slight negative output is also allowed); σ' denotes the derivative of the sigmoidal. The rectification is required for all rules that depend on odd moments because these vanish in symmetric distributions such as natural scenes. We study the following measures (Kendall and Stuart, 1977, for review):\n\nSkewness 1. This measures the deviation from symmetry, and is of the form\n\nS1 = E[c^3] / E^1.5[c^2].   (1)\n\nA maximization of this measure via gradient ascent gives\n\n∇S1 = (1.5 / e_M^1.5) E[c (c - E[c^3]/e_M) σ'd],   (2)\n\nwhere e_M is defined as E[c^2].\n\nSkewness 2. Another skewness measure is given by\n\nS2 = E[c^3] - E^1.5[c^2].   (3)\n\nThis measure requires a stabilization mechanism, which we achieve by requiring that the vector of weights, denoted by m, has norm 1. The gradient of S2 is\n\n∇S2 = 3E[(c^2 - c √e_M) σ'd] = 3E[c (c - √e_M) σ'd],   ||m|| = 1.   (4)\n\nKurtosis 1. Kurtosis measures deviation from a Gaussian distribution, mainly in the tails of the distribution. It has the form\n\nK1 = E[c^4] / E^2[c^2] - 3.   (5)\n\nThis measure has a gradient of the form\n\n∇K1 = (4 / e_M^2) E[c (c^2 - E[c^4]/e_M) σ'd].   (6)\n\nKurtosis 2. As before, there is a similar form which requires some stabilization:\n\nK2 = E[c^4] - 3E^2[c^2].   (7)\n\nThis measure has a gradient of the form\n\n∇K2 = 4E[(c^3 - 3c e_M) σ'd] = 4E[c (c^2 - 3e_M) σ'd],   ||m|| = 1.   (8)\n\nKurtosis 2 and ICA. It has been shown that kurtosis, defined as K2 = E[c^4] - 3E^2[c^2], can be used for ICA (Hyvarinen and Oja, 1996). Thus, finding the extrema of the kurtosis of the projections enables the estimation of the independent components. They obtain the expression\n\nm ∝ E^-1[d d^T] E[d (m \u00b7 d)^3] - 3m,   (9)\n\nwhich leads to an iterative \"fixed-point algorithm\".\n\nQuadratic BCM. The Quadratic BCM (QBCM) measure, as given in (Intrator and Cooper, 1992), is of the form\n\nQBCM = (1/3) E[c^3] - (1/4) E^2[c^2].   (10)\n\nMaximizing this form using gradient ascent gives the learning rule\n\n∇QBCM = E[c (c - e_M) σ'd].   (11)\n\n4 METHODS\n\nWe use 13x13 circular patches from 12 images of natural scenes, presented to the neuron on each iteration of the learning. 
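As an illustration of how such rules run in practice, here is a minimal online sketch of the QBCM update with an iteratively estimated threshold. This is not the authors' implementation: the Laplacian toy inputs stand in for preprocessed image patches, and the rectification and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma(x):
    # smooth monotonic rectification; a slight negative output is allowed
    return np.logaddexp(0.0, x) - 0.1

def sigma_prime(x):
    # derivative of the softplus rectifier above
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))

dim = 169                                  # 13x13 patch, flattened
eta, tau = 5e-4, 0.01                      # learning rate, threshold time constant
patches = rng.laplace(size=(5000, dim))    # toy kurtotic stand-in for patches

m = rng.normal(scale=0.1, size=dim)        # synaptic weight vector
e_M = 1.0                                  # running estimate of E[c^2]

for d in patches:
    u = d @ m
    c = sigma(u)
    e_M += tau * (c**2 - e_M)              # moments estimated iteratively
    # QBCM gradient-ascent step: dm is proportional to c (c - e_M) sigma'(u) d
    m += eta * c * (c - e_M) * sigma_prime(u) * d

# For multiplicative kurtosis K1 the step would instead use
#   c * (c**2 - e_4 / e_M) * sigma_prime(u) * d / e_M**2,
# with e_4 a running estimate of E[c^4].
```

With a threshold that adapts faster than the weights (tau much larger than eta), the c(c - e_M) term stabilizes the weights without an explicit norm constraint, which is part of the computational simplicity of the BCM form noted in the abstract.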
The natural scenes are preprocessed either with a Difference of Gaussians (DOG) filter (Law and Cooper, 1994) or a whitening filter (Oja, 1995; Bell and Sejnowski, 1995), which eliminates the second order correlations. The moments of the output, c, are calculated iteratively, and when it is needed (i.e. for K2 and S2) we also normalize the weights at each iteration.\n\nFor Oja's fixed-point algorithm, the learning was done in batches of 1000 patterns over which the expectation values were computed. However, the covariance matrix was calculated over the entire set of input patterns.\n\n5 RESULTS\n\n5.1 RECEPTIVE FIELDS\n\nThe resulting receptive fields (RFs) formed are shown in Figure 1 for both the DOGed and whitened images. Every learning rule developed oriented receptive fields, though some were more sensitive to the preprocessing than others. The additive versions of kurtosis and skewness, K2 and S2 respectively, developed RFs with a higher spatial frequency, and more orientations, in the whitened environment than in the DOGed environment.\n\nThe multiplicative versions of kurtosis and skewness, K1 and S1 respectively, as well as QBCM, sampled from many orientations regardless of the preprocessing. S1 gives receptive fields with lower spatial frequencies than either QBCM or K1. This difference disappears with the whitened inputs, which implies that the spatial frequency of the RF is related to the dependence of the learning rule on the second moment. Example receptive fields found using Oja's fixed-point ICA algorithm, not surprisingly, look qualitatively similar to those found using the stochastic maximization of K2. The output distributions for all of the rules appear to be double exponential. 
A double-exponential distribution is one which we would consider sparse, but it would be difficult to compare the sparseness of the distributions merely from the appearance of the output distribution alone. In order to determine the sparseness of the code, we introduce a method for measuring it directly.\n\nFigure 1: Receptive fields using DOGed (left) and whitened (right) image input obtained from learning rules maximizing (from top to bottom) the Quadratic BCM objective function, Kurtosis (multiplicative), Kurtosis (additive), Skewness (multiplicative), and Skewness (additive). Shown are three examples (left to right) from each learning rule as well as the log of the normalized output distribution, before the application of the rectifying sigmoid.\n\n5.2 STRUCTURE REMOVAL: SENSITIVITY TO OUTLIERS\n\nLearning rules which depend on large polynomial moments, such as Quadratic BCM and kurtosis, tend to be sensitive to the tails of the distribution. In the case of a sparse code the outliers, or the rare and interesting events, are what is important. Measuring the degree to which the neurons form a sparse code can be done in a straightforward and systematic fashion.\n\nThe procedure involves simply eliminating from the environment those patterns for which the neuron responds strongly. These patterns tend to be the high contrast edges, and are thus the structure found in the image. The percentage of patterns that must be removed in order to cause a change in the receptive field gives a direct measure of the sparsity of the coding. 
The results of this structure removal are shown in Figure 2.\n\nFor Quadratic BCM and kurtosis, one need only eliminate less than one half of a percent of the input patterns to change the receptive field significantly. To make this more precise, we define a normalized difference between two mean-zero vectors as D ≡ (1/2)(1 - cos α), where α is the angle between the two vectors. This measure has a value of zero for identical vectors, one half for orthogonal vectors, and a maximum value of one for anti-parallel vectors.\n\nAlso shown in Figure 2 is the normalized difference as a function of the percentage eliminated, for the different learning rules. RF differences can be seen with as little as a tenth of a percent, which suggests that the neuron is coding the information in a very sparse manner. Changes of around a half a percent and above are visible as significant orientation, phase, or spatial frequency changes. Although both skewness and Quadratic BCM depend primarily on the third moment, QBCM behaves more like kurtosis with regards to sparse coding.\n\nFigure 2: Example receptive fields (left), and normalized difference measure (right), resulting from structure removal using QBCM, K1, and S1. The RFs show the successive deletion of the top 1% of the distribution. On the right is the normalized difference between RFs as a function of the percentage deleted in structure removal. The maximum possible value of the difference is 1. 
\n\n6 DISCUSSION\n\nThis study attempts to compare several learning rules which have some statistical or biological motivation, or both. For a related study discussing projection pursuit and BCM see (Press and Lee, 1996). We have used natural scenes to gain more insight into the statistics underlying natural images. There are several outcomes from this study:\n\n\u2022 All of the rules used found kurtotic distributions.\n\u2022 The single cell ICA rule we considered, which used the subtractive form of kurtosis, achieved receptive fields qualitatively similar to the other rules discussed.\n\u2022 The Quadratic BCM and the multiplicative version of kurtosis are less sensitive to the second moments of the distribution and produce oriented RFs even when the data is not whitened. The subtractive versions of kurtosis and skewness are sensitive, and produce oriented RFs only after sphering the data (Friedman, 1987; Field, 1994).\n\u2022 Both Quadratic BCM and kurtosis are sensitive to the elimination of the upper 1/2% portion of the distribution. This sensitivity to small portions of the distribution represents the other side of the coin of sparse coding.\n\u2022 The skew rules' sensitivity to the upper parts of the distribution is not as strong.\n\u2022 The Quadratic BCM learning rule, which has been advocated as a projection index for finding multi-modality in high dimensional distributions, can find projections emphasizing high kurtosis when no cluster structure is present in the data.\n\nACKNOWLEDGMENTS\n\nThis work was supported by the Office of Naval Research, the DANA Foundation and the National Science Foundation.\n\nReferences\n\nBell, A. J. and Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159.\n\nBell, A. J. and Sejnowski, T. J. 
(1997). The independent components of natural scenes are edge filters. Vision Research. In press.\n\nBienenstock, E. L., Cooper, L. N., and Munro, P. W. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2:32-48.\n\nComon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36:287-314.\n\nDiaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. Annals of Statistics, 12:793-815.\n\nField, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6:559-601.\n\nFriedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82:249-266.\n\nHyvarinen, A. and Oja, E. (1996). A fast fixed-point algorithm for independent component analysis. Int. Journal of Neural Systems, 7(6):671-687.\n\nIntrator, N. (1990). A neural network for feature extraction. In Touretzky, D. S. and Lippmann, R. P., editors, Advances in Neural Information Processing Systems, volume 2, pages 719-726. Morgan Kaufmann, San Mateo, CA.\n\nIntrator, N. and Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5:3-17.\n\nKendall, M. and Stuart, A. (1977). The Advanced Theory of Statistics, volume 1. MacMillan Publishing, New York.\n\nLaw, C. and Cooper, L. (1994). Formation of receptive fields according to the BCM theory in realistic visual environments. Proceedings of the National Academy of Sciences, 91:7797-7801.\n\nOja, E. (1995). The nonlinear PCA learning rule and signal separation - mathematical analysis. Technical Report A26, Helsinki University, CS and Inf. Sci. Lab.\n\nOlshausen, B. A. and Field, D. J. (1996). Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609.\n\nPress, W. and Lee, C. W. (1996). 
Searching for optimal visual codes: Projection pursuit analysis of the statistical structure in natural scenes. In The Neurobiology of Computation: Proceedings of the fifth annual Computation and Neural Systems conference. Plenum Publishing Corporation.\n", "award": [], "sourceid": 1458, "authors": [{"given_name": "Brian", "family_name": "Blais", "institution": null}, {"given_name": "Nathan", "family_name": "Intrator", "institution": null}, {"given_name": "Harel", "family_name": "Shouval", "institution": null}, {"given_name": "Leon", "family_name": "Cooper", "institution": null}]}