{"title": "Evidence Optimization Techniques for Estimating Stimulus-Response Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 317, "page_last": 324, "abstract": null, "full_text": "Evidence Optimization Techniques for Estimating Stimulus-Response Functions\n\nManeesh Sahani\nGatsby Unit, UCL\n17 Queen Sq., London, WC1N 3AR, UK.\nmaneesh@gatsby.ucl.ac.uk\n\nJennifer F. Linden\nKeck Center, UCSF\nSan Francisco, CA 94143\u20130732, USA.\nlinden@phy.ucsf.edu\n\nAbstract\n\nAn essential step in understanding the function of sensory nervous systems is to characterize as accurately as possible the stimulus-response function (SRF) of the neurons that relay and process sensory information. One increasingly common experimental approach is to present a rapidly varying complex stimulus to the animal while recording the responses of one or more neurons, and then to directly estimate a functional transformation of the input that accounts for the neuronal firing. The estimation techniques usually employed, such as Wiener filtering or other correlation-based estimation of the Wiener or Volterra kernels, are equivalent to maximum likelihood estimation in a Gaussian-output-noise regression model. We explore the use of Bayesian evidence-optimization techniques to condition these estimates. We show that by learning hyperparameters that control the smoothness and sparsity of the transfer function it is possible to improve dramatically the quality of SRF estimates, as measured by their success in predicting responses to novel input.\n\n1 Introduction\n\nA common experimental approach to the measurement of the stimulus-response function (SRF) of sensory neurons, particularly in the visual and auditory modalities, is \u201creverse correlation\u201d and its related non-linear extensions [1].
The neural response $r(t)$ to a continuous, rapidly varying stimulus $s(t)$ is measured and used in an attempt to reconstruct the functional mapping $r(t) = F[s(t)]$. In the simplest case, the functional is taken to be a finite impulse response (FIR) linear filter; if the input is white the filter is identified by the spike-triggered average of the stimulus, and otherwise by the Wiener filter. Such linear filter estimates are often called STRFs for spatio-temporal (in the visual case) or spectro-temporal (in the auditory case) receptive fields. More generally, the SRF may also be parameterized on the basis of known or guessed non-linear properties of the neurons, or may be expanded in terms of the Volterra or Wiener integral power series. In the case of the Wiener expansion, the integral kernels are usually estimated by measuring various cross-moments of $r(t)$ and $s(t)$.\n\nIn practice, the stimulus is often a discrete-time process $s_t$. In visual experiments, the discretization may correspond to the frame rate of the display. In the auditory experiments that will be considered below, it is set by the rate of the component tone pulses in a random chord stimulus. On time-scales finer than that set by this discretization rate, the stimulus is strongly autocorrelated. This makes estimation of the SRF at a finer time-scale extremely non-robust.
We therefore lose very little generality by discretizing the response with the same time-step, obtaining a response histogram $r_t$.\n\nIn this discrete-time framework, the estimation of FIR Wiener-Volterra kernels (of any order) corresponds to linear regression. To estimate the first-order kernel up to a given maximum time lag $T$, we construct a set of input lag-vectors $x_t$. If a single stimulus frame, $s_t$, is itself a $K$-dimensional vector (representing, say, pixels in an image or power in different frequency bands) then the lag vectors are formed by concatenating $T$ stimulus frames together into vectors of length $TK$. The Wiener filter is then obtained by least-squares linear regression from the lag vectors $x_t$ to the corresponding observed activities $r_t$.\n\nHigher-order kernels can also be found by linear regression, using augmented versions of the stimulus lag vectors. For example, the second-order kernel is obtained by regression using input vectors formed by all quadratic combinations of the elements of $x_t$ (or, equivalently, by support-vector-like kernel regression using a homogeneous second-order polynomial kernel).
The present paper will be confined to a treatment of the linear case. It should be clear, however, that the basic techniques can be extended to higher orders at the expense of additional computational load, provided only that a sensible definition of smoothness in these higher-order kernels is available.\n\nThe least-squares solution to a regression problem is identical to the maximum likelihood (ML) value of the weight vector $w$ for the probabilistic regression model with Gaussian output noise of constant variance $\\sigma^2$:\n\n$$P(r_t \\mid x_t, w, \\sigma^2) = \\mathcal{N}(w^T x_t, \\sigma^2) \\qquad (1)$$\n\nAs is common with ML learning, weight vectors obtained in this way are often overfit to the training data, and so give poor estimates of the true underlying stimulus-response function. This is the case even for linear models. If the stimulus is uncorrelated, the ML-estimated weight along some input dimension is proportional to the observed correlation between that dimension of the stimulus and the output response. Noise in the output can introduce spurious input-output correlations and thus result in erroneous weight values. Furthermore, if the true relationship between stimulus and response is non-linear, limited sampling of the input space may also lead to observed correlations that would have been absent given unlimited data.\n\nThe statistics and machine learning literatures provide a number of techniques for the containment of overfitting in probabilistic models. Many of these approaches are equivalent to the maximum a posteriori (MAP) estimation of parameters under a suitable prior distribution. Here, we investigate an approach in which these prior distributions are optimized with reference to the data; as such, they cease to be \u201cprior\u201d in a strict sense, and instead become part of a hierarchical probabilistic model.
A distribution on the regression parameters is first specified up to the unknown values of some hyperparameters. These hyperparameters are then adjusted so as to maximize the marginal likelihood or \u201cevidence\u201d, that is, the probability of the data given the hyperparameters, with the parameters themselves integrated out. Finally, the estimate of the parameters is given by the MAP weight vector under the optimized \u201cprior\u201d. Such evidence optimization schemes have previously been used in the context of linear, kernel and Gaussian-process regression. We show that, with realistic data volumes, such techniques provide considerably better estimates of the stimulus-response function than do the unregularized (ML) Wiener estimates.\n\n2 Test data and methods\n\nA diagnostic of overfitting, and therefore divergence from the true stimulus-response relationship, is that the resultant model generalizes poorly; that is, it does not predict actual responses to novel stimuli well. We assessed the generalization ability of parameters chosen by maximum likelihood and by various evidence optimization schemes on a set of responses collected from the auditory cortex of rodents. As will be seen, evidence optimization yielded estimates that generalized far better than those obtained by the more elementary ML techniques, and so provided a more accurate picture of the underlying stimulus-response function.\n\nA total of 205 recordings were collected extracellularly from 68 recording sites in the thalamo-recipient layers of the left primary auditory cortex of anaesthetized rodents (6 CBA/CaJ mice and 4 Long-Evans rats) while a dynamic random chord stimulus (described below) was presented to the right ear.
Recordings often reflected the activity of a number of neurons; single neurons were identified by Bayesian spike-sorting techniques [2, 3] whenever possible. The stimulus consisted of 20 ms tone pulses (ramped up and down with a 5 ms cosine gate) presented at random center frequencies, maximal intensities, and times, such that pulses at more than one frequency might be played simultaneously. This stimulus resembled that used in a previous study [4], except in the variation of pulse intensity. The times, frequencies and sound intensities of all tone pulses were chosen independently within the discretizations of those variables (20 ms bins in time, 1/12 octave bins covering either 2\u201332 or 25\u2013100 kHz in frequency, and 5 dB SPL bins covering 25\u201370 dB SPL in level). At any time point, the stimulus averaged two tone pulses per octave, with an expected loudness of approximately 73 dB SPL for the 2\u201332 kHz stimulus and 70 dB SPL for the 25\u2013100 kHz stimulus. The total duration of each stimulus was 60 s. At each recording site, the 2\u201332 kHz stimulus was repeated for 20 trials, and the 25\u2013100 kHz stimulus for 10 trials.\n\nNeural responses from all 10 or 20 trials were histogrammed in 20 ms bins aligned with stimulus pulse durations. Thus, in the regression framework, the instantaneous input vector $s_t$ comprised the sound amplitudes at each possible frequency at time $t$, and the output $r_t$ was the number of spikes per trial collected into the $t$-th bin. The repetition of the same stimulus made it possible to partition the recorded response power into a stimulus-related (signal) component and a noise component. (For derivation, see Sahani and Linden, \u201cHow Linear are Auditory Cortical Responses?\u201d, this volume.)
Only those 92 recordings in which the signal power was significantly greater than zero were used in this study.\n\nTests of generalization were performed by cross-validation. The total duration of the stimulus was divided 10 times into a training data segment (9/10 of the total) and a test data segment (1/10), such that all 10 test segments were disjoint. Performance was assessed by the predictive power, that is, the test data variance minus the average squared prediction error. The 10 estimates of the predictive power were averaged, and normalized by the estimated signal power to give a number less than 1. Note that the predictive power could be negative in cases where the mean was a better description of the test data than was the model prediction. In graphs of the predictive power as a function of noise level, the estimate of the noise power is also shown after normalization by the estimated signal power.\n\n3 Evidence optimization for linear regression\n\nAs is common in regression problems, it is convenient to collect all the stimulus vectors and observed responses into matrices. Thus, we describe the input by a matrix $X$, the $t$-th column of which is the input lag-vector $x_t$. Similarly, we collect the outputs into a row vector $r$, the $t$-th element of which is $r_t$. The first $T-1$ time-steps are dropped to avoid incomplete lag-vectors.
Then, assuming independent noise in each time bin, we combine the individual probabilities to give:\n\n$$P(r \\mid X, w, \\sigma^2) = \\frac{1}{(2\\pi\\sigma^2)^{N/2}} \\exp\\left(-\\frac{1}{2\\sigma^2}(r - w^T X)(r - w^T X)^T\\right) \\qquad (2)$$\n\nWe now choose the prior distribution on $w$ to be normal with zero mean (having no prior reason to favour either positive or negative weights) and covariance matrix $C$. Then the joint density of $r$ and $w$ is\n\n$$P(r, w \\mid X, C, \\sigma^2) = \\frac{1}{Z} \\exp\\left(-\\frac{1}{2} w^T C^{-1} w - \\frac{1}{2\\sigma^2}(r - w^T X)(r - w^T X)^T\\right) \\qquad (3)$$\n\nwhere the normalizer $Z = (2\\pi\\sigma^2)^{N/2} |2\\pi C|^{1/2}$. Fixing $X$ and $r$ to be the observed values, this implies a normal posterior on $w$ with variance $\\Sigma = (X X^T/\\sigma^2 + C^{-1})^{-1}$ and mean $\\mu = \\Sigma X r^T/\\sigma^2$.\n\nBy integrating this normal density in $w$ we obtain an expression for the evidence:\n\n$$E(C, \\sigma^2) = P(r \\mid X, C, \\sigma^2) = \\int dw\\, P(r, w \\mid X, C, \\sigma^2) = \\frac{|2\\pi\\Sigma|^{1/2}}{Z} \\exp\\left(-\\frac{1}{2} r \\left(\\frac{I}{\\sigma^2} - \\frac{X^T \\Sigma X}{\\sigma^4}\\right) r^T\\right) \\qquad (4)$$\n\nWe seek to optimize this evidence with respect to the hyperparameters in $C$, and the noise variance $\\sigma^2$. To do this we need the respective gradients. If the covariance matrix contains a parameter $\\theta$, then the derivative of the log-evidence with respect to $\\theta$ is given by\n\n$$\\frac{\\partial}{\\partial\\theta} \\log E = \\frac{1}{2} \\mathrm{Tr}\\left[(C - \\Sigma - \\mu\\mu^T) \\frac{\\partial C^{-1}}{\\partial\\theta}\\right] \\qquad (5)$$\n\nwhile the gradient in the noise variance is\n\n$$\\frac{\\partial}{\\partial\\sigma^2} \\log E = \\frac{1}{2\\sigma^2}\\left(-N + \\mathrm{Tr}\\left[I - \\Sigma C^{-1}\\right] + \\frac{1}{\\sigma^2}(r - \\mu^T X)(r - \\mu^T X)^T\\right) \\qquad (6)$$\n\nwhere $N$ is the number of training data points.\n\n4 Automatic relevance determination (ARD)\n\nThe most common evidence optimization scheme for regression is known as automatic relevance determination (ARD).
Originally proposed by MacKay and Neal, it has been used extensively in the literature, notably by MacKay [5] and, in a recent application to kernel regression, by Tipping [6]. The prior covariance on the weights is taken to be of the form $C = A^{-1}$ with $A = \\mathrm{diag}(\\alpha)$. That is, the weights are taken to be independent with potentially different prior precisions $\\alpha_i$. Substitution into (5) yields\n\n$$\\frac{\\partial}{\\partial\\alpha_i} \\log E = \\frac{1}{2}\\left(\\frac{1}{\\alpha_i} - \\Sigma_{ii} - \\mu_i^2\\right) \\qquad (7)$$\n\nPrevious authors have noted that, in comparison to simple gradient methods, iteration of fixed point equations derived from this and from (6) converges more rapidly:\n\n$$\\alpha_i^{\\text{new}} = \\frac{1 - \\alpha_i \\Sigma_{ii}}{\\mu_i^2} \\qquad (8)$$\n\nand\n\n$$(\\sigma^2)^{\\text{new}} = \\frac{(r - \\mu^T X)(r - \\mu^T X)^T}{N - \\sum_i (1 - \\alpha_i \\Sigma_{ii})} \\qquad (9)$$\n\nFigure 1: Comparison of various STRF estimates for the same recording. Panels, left to right: ML, ARD, ASD, ASD/RD; horizontal axis: time (ms), vertical axis: frequency (kHz). Recording: R2001011802G/20010731/pen14loc2poisshical020.\n\nA pronounced general feature of the maxima discovered by this approach is that many of the optimal precisions are infinite (that is, the variances are zero). Since the prior distribution is centered on zero, this forces the corresponding weight to vanish. In practice, as the iterated value of a precision crosses some pre-determined threshold, the corresponding input dimension is eliminated from the regression problem. The results of evidence optimization suggest that such inputs are irrelevant to predicting the output; hence the name given to this technique. The resulting MAP estimates obtained under the optimized ARD prior thus tend to be sparse, with only a small number of non-zero weights often appearing as isolated spots in the STRF.\n\nThe estimated STRFs for one example recording using ML and ARD are shown in the two left-most panels of figure 1 (the other panels show smoothed estimates which will be described below), with the estimated weight vectors rearranged into time-frequency matrices. The sparsity of the ARD solution is evident in the reduction of apparent estimation noise at higher frequencies and longer time lags.
This reduction improves the ability of the estimated model to predict novel data by more than a factor of 2 in this case. Assessed by cross-validation, as described above, the ARD estimate accurately predicted 26% of the signal power in test data, whereas the ML estimate (or Wiener kernel) predicted only 12%.\n\nThis improvement in predictive quality was evident in every one of the 92 recordings with significant signal power, indicating that the optimized prior does improve estimation accuracy. The left-most panel of figure 2 compares the normalized cross-validation predictive power of the two STRF estimates. The other two panels show the difference in predictive powers as a function of noise (in the center) and as a histogram (right). The advantage of the evidence-optimization approach is clearly most pronounced at higher noise levels.\n\nFigure 2: Comparison of ARD and ML predictions. Panels: normalized ARD predictive power against normalized ML predictive power (left); normalized prediction difference (ARD minus ML) against normalized noise power (center); histogram of the difference over recordings (right).\n\n5 Automatic smoothness determination (ASD)\n\nIn many regression problems, such as those for which ARD was developed, the different input dimensions are often unrelated; indeed they may be measured in different units. In such contexts, an independent prior on the weights, as in ARD, is reasonable. By contrast, the weights of an STRF are dimensionally and semantically similar.
Furthermore, we might expect weights that are nearby in either time or frequency (or space) to be similar in value; that is, the STRF is likely to be smooth on the scale at which we model it.\n\nHere we introduce a new evidence optimization scheme, in which the prior covariance matrix is used to favour smoothing of the STRF weights. The appropriate scale (along either the time or the frequency/space axis) cannot be known a priori. Instead, we introduce hyperparameters $\\delta_f$ and $\\delta_t$ that set the scale of smoothness in the spectral (or spatial) and temporal dimensions respectively, and then, for each recording, optimize the evidence to determine their appropriate values.\n\nThe new parameterized covariance matrix, $C$, depends on two square matrices $\\Delta^f$ and $\\Delta^t$. The $(i, j)$th element of each of these gives the squared distance between the weights $w_i$ and $w_j$ in terms of center frequency (or space) and time respectively. We take\n\n$$C = \\exp\\left(-\\rho - \\frac{\\Delta^f}{2\\delta_f^2} - \\frac{\\Delta^t}{2\\delta_t^2}\\right) \\qquad (10)$$\n\nwhere the exponent is taken element by element. In this scheme, the hyperparameters $\\delta_f$ and $\\delta_t$ set the correlation distances for the weights along the spectral (spatial) and temporal dimensions, while the additional hyperparameter $\\rho$ sets their overall scale.\n\nSubstitution of (10) into the general hyperparameter derivative expression (5) gives\n\n$$\\frac{\\partial}{\\partial\\rho} \\log E = \\frac{1}{2} \\mathrm{Tr}\\left[(C - \\Sigma - \\mu\\mu^T) C^{-1}\\right] \\qquad (11)$$\n\nand\n\n$$\\frac{\\partial}{\\partial\\delta_f} \\log E = -\\frac{1}{2} \\mathrm{Tr}\\left[(C - \\Sigma - \\mu\\mu^T) C^{-1} \\frac{C \\circ \\Delta^f}{\\delta_f^3} C^{-1}\\right] \\qquad (12)$$\n\n(where the $\\circ$ denotes the Hadamard or Schur product; i.e., the matrices are multiplied element by element), along with a similar expression for $\\partial \\log E / \\partial \\delta_t$. In this case, optimization is performed by simple gradient methods.\n\nThe third panel of figure 1 shows the ASD-optimized MAP estimate of the STRF for the same example recording discussed previously. Optimization yielded smoothing width estimates of 0.96 (20 ms) bins in time and 2.57 (1/12 octave) bins in frequency; the effect of this smoothing of the STRF estimate is evident. ASD further improved the ability of the linear kernel to predict test data, accounting for 27.5% of the signal power in this example.\n\nIn the population of 92 recordings (figure 3, upper panels) MAP estimates based on the ASD-optimized prior again outperformed ML (Wiener kernel) estimates substantially on every single recording considered, particularly on those with poorer signal-to-noise ratios. They also tended to predict more accurately than the ARD-based estimates (figure 3, lower panels). The improvement over ARD was not quite so pronounced (although it was frequently greater than in the example of figure 1).\n\n6 ARD in an ASD-defined basis\n\nThe two evidence optimization frameworks presented above appear inconsistent. ARD yields a sparse, independent prior, and often leads to isolated non-zero weights in the estimated STRF.
By contrast, ASD is explicitly designed to recover smooth STRF estimates.\n\nFigure 3: Comparison of ASD predictions to ML and ARD. Upper panels: normalized ASD predictive power against normalized ML predictive power, with the difference in normalized predictive power between ASD and ML plotted against normalized noise power and as a histogram over recordings. Lower panels: the same comparison between ASD and ARD.\n\nNonetheless, both frameworks appear to improve the ability of estimated models to generalize to novel data.
We are thus led to consider ways in which features of both methods may be combined.\n\nBy decomposing the prior covariance $C = R^T R$, it is possible to rewrite the joint density of (3) as\n\n$$P(r, w \\mid X, R, \\sigma^2) \\propto \\exp\\left(-\\frac{1}{2} w^T R^{-1} R^{-T} w - \\frac{1}{2\\sigma^2}(r - w^T R^{-1} R X)(r - w^T R^{-1} R X)^T\\right) \\qquad (13)$$\n\nMaking the substitutions $\\tilde{X} = R X$ and $\\tilde{w} = R^{-T} w$, this expression may be recognized as the joint density for a transformed regression problem with unit prior covariance (the normalizing constant, not shown, is appropriately transformed by the Jacobian associated with the change in variables). If now we introduce and optimize a diagonal prior covariance of the ARD form in this transformed problem, we are indirectly optimizing a covariance matrix of the form $R^T A^{-1} R$ in the original basis. Intuitively, the sparseness driven by ARD is applied to basis vectors drawn from the rows of the transformation matrix $R$, rather than to individual weights. If this basis reflects the smoothness prior obtained from ASD then the resulting prior will combine the smoothness and sparseness of the two approaches.\n\nWe choose $R$ to be the (positive branch) matrix square root of the optimal prior matrix $C$ (see (10)) obtained from ASD.
If the eigenvector decomposition of $C$ is $C = U \\Lambda U^T$, then $R = U \\Lambda^{1/2} U^T$, where the diagonal elements of $\\Lambda^{1/2}$ are the positive square roots of the eigenvalues of $C$. The components of $R$, defined in this way, are Gaussian basis vectors slightly narrower than those in $C$ (this is easily seen by noting that the eigenvalue spectrum for the Toeplitz matrix $C$ is given by the Fourier transform, and that the square-root of the Gaussian function in the Fourier space is a Gaussian of larger width, corresponding to a smaller width in the original space). Thus, weight vectors obtained through ARD in this basis will be formed by a superposition of Gaussian components, each of which individually matches the ASD prior on its covariance.\n\nFigure 4: Comparison of ARD in the ASD basis and simple ASD. Panels: normalized ASD/RD predictive power against normalized ASD predictive power; difference in normalized prediction between ASD/RD and ASD against normalized noise power; histogram of the difference over recordings.\n\nThe results of this procedure (labelled ASD/RD) on our example recording are shown in the rightmost panel of figure 1. The combined prior shows a similar degree of smoothing to the ASD-optimized prior alone; in addition, like the ARD prior, it suppresses the apparent background estimation noise at higher frequencies and longer time lags. Predictions made with this estimate are yet more accurate, capturing 30% of the signal power.
This improvement over estimates derived from ASD alone is borne out in the whole population (figure 4), although the gain is smaller than in the previous cases.\n\n7 Conclusions\n\nWe have demonstrated a succession of evidence-optimization techniques which appear to improve the accuracy of STRF estimates from noisy data. The mean improvement in prediction of the ASD/RD method over the Wiener kernel is 40% of the stimulus-related signal power. Considering that the best linear predictor would on average capture no more than 40% of the signal power in these data even in the absence of noise (Sahani and Linden, \u201cHow Linear are Auditory Cortical Responses?\u201d, this volume), this is a dramatic improvement. These results apply to the case of linear models; our current work is directed toward extensions to non-linear SRFs within an augmented linear regression framework.\n\nReferences\n\n[1] Marmarelis, P. Z. & Marmarelis, V. Z. (1978) Analysis of Physiological Systems. (Plenum Press, New York).\n[2] Lewicki, M. S. (1994) Neural Comp. 6, 1005\u20131030.\n[3] Sahani, M. (1999) Ph.D. thesis (California Institute of Technology, Pasadena, CA).\n[4] deCharms, R. C., Blake, D. T., & Merzenich, M. M. (1998) Science 280, 1439\u20131443.\n[5] MacKay, D. J. C. (1994) ASHRAE Transactions 100, 1053\u20131062.\n[6] Tipping, M. E. (2001) J. Machine Learning Res. 1, 211\u2013244.", "award": [], "sourceid": 2294, "authors": [{"given_name": "Maneesh", "family_name": "Sahani", "institution": null}, {"given_name": "Jennifer", "family_name": "Linden", "institution": null}]}