{"title": "Design of Experiments via Information Theory", "book": "Advances in Neural Information Processing Systems", "page_first": 1319, "page_last": 1326, "abstract": "", "full_text": "Design of experiments via information theory (cid:3)\n\nLiam Paninski\n\nCenter for Neural Science\n\nNew York University\nNew York, NY 10003\n\nliam@cns.nyu.edu\n\nAbstract\n\nWe discuss an idea for collecting data in a relatively ef\ufb01cient manner. Our\npoint of view is Bayesian and information-theoretic: on any given trial,\nwe want to adaptively choose the input in such a way that the mutual in-\nformation between the (unknown) state of the system and the (stochastic)\noutput is maximal, given any prior information (including data collected\non any previous trials). We prove a theorem that quanti\ufb01es the effective-\nness of this strategy and give a few illustrative examples comparing the\nperformance of this adaptive technique to that of the more usual nonadap-\ntive experimental design. For example, we are able to explicitly calculate\nthe asymptotic relative ef\ufb01ciency of the \u201cstaircase method\u201d widely em-\nployed in psychophysics research, and to demonstrate the dependence of\nthis ef\ufb01ciency on the form of the \u201cpsychometric function\u201d underlying the\noutput responses.\n\n1\n\nIntroduction\n\nOne simple model of experimental design (we have neurophysiological experiments in\nmind, but our results are general with respect to the identity of the system under study)\nis as follows. We have some set X of input stimuli, and some knowledge of how the\nsystem should respond to every stimulus, x, in X. This knowledge is summarized in the\nform of a prior distribution, p0((cid:18)), on some space (cid:2) of models (cid:18). A model is a set of\nprobabilistic input-output relationships: regular conditional distributions p(yjx; (cid:18)) on Y ,\nthe set of possible output responses, given each x in X. 
Thus the joint probability of stimulus and response is:

p(x, y) = ∫ p(x, y, θ) dθ = ∫ p0(θ) p(x) p(y|θ, x) dθ.

The “design” of an experiment is given by the choice of input probability p(x). We want to design our experiment — choose p(x) — optimally in some sense. One natural idea would be to choose p(x) in such a way that we learn as much as possible about the underlying model, on average. Information theory thus suggests we choose p(x) to optimize the following objective function:

I({x, y}; θ) = ∫_{X×Y×Θ} p(x, y, θ) log [ p(x, y, θ) / ( p(x, y) p(θ) ) ],    (1)

where I(·; ·) denotes mutual information. In other words, we want to maximize the information provided about θ by the pair {x, y}, given our current knowledge of the model as summarized in the posterior distribution given N samples of data:

p_N(θ) = p(θ | {x_i, y_i}_{1≤i≤N}).

Similar ideas have seen application in a wide and somewhat scattered literature; for a partial bibliography, see the longer draft of this paper at http://www.cns.nyu.edu/~liam. Somewhat surprisingly, we have not seen any applications of the information-theoretic objective function (1) to the design of neurophysiological experiments (although see the abstract by [7], who seem to have independently implemented the same idea in a simulation study).

* A longer version of this paper, including proofs, has been submitted and is available at http://www.cns.nyu.edu/~liam.

The primary goal of this paper is to elucidate the asymptotic behavior of the a posteriori density p_N(θ) when we choose x according to the recipe outlined above; in particular, we want to compare the adaptive strategy to the more usual case, in which the stimuli are drawn i.i.d. (non-adaptively) from some fixed distribution p(x).
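The recipe above — update the posterior after each trial, then present whichever input is expected to be most informative about θ — can be sketched numerically for a discretized parameter space. The sketch below is purely illustrative, not part of the model defined above; the logistic likelihood, the grids, and all variable names are assumptions:

```python
import numpy as np

def infomax_stimulus(posterior, thetas, candidates, lik):
    """Return the candidate input x maximizing the conditional mutual
    information I(y; theta | x) under the current posterior over the
    grid `thetas`, for a binary response y in {0, 1}."""
    best_x, best_info = candidates[0], -np.inf
    for x in candidates:
        info = 0.0
        for y in (0, 1):
            p_y_th = np.array([lik(y, x, th) for th in thetas])  # p(y | x, theta)
            p_y = float(np.dot(posterior, p_y_th))               # marginal p(y | x)
            info += float(np.sum(posterior * p_y_th * np.log(p_y_th / p_y)))
        if info > best_info:
            best_x, best_info = x, info
    return best_x

def lik(y, x, th):
    # hypothetical psychometric likelihood: p(y=1 | x, theta) = logistic(x - theta)
    p1 = 1.0 / (1.0 + np.exp(-(x - th)))
    return p1 if y == 1 else 1.0 - p1

thetas = np.linspace(-2.0, 2.0, 81)                  # discretized parameter grid
posterior = np.full(len(thetas), 1.0 / len(thetas))  # flat prior p0(theta)
x_star = infomax_stimulus(posterior, thetas, np.linspace(-3.0, 3.0, 61), lik)
```

After observing the response y to x_star, the posterior would be updated by Bayes' rule (posterior ∝ posterior × p(y|x_star, θ), renormalized) and the selection repeated on the next trial.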
Our main result (section 2) states that, under acceptably weak conditions on the models p(y|θ, x), the information-maximization strategy leads to consistent and efficient estimates of the true underlying model, in a natural sense. We also give a few simple examples to illustrate the applicability of our results (section 3).

2 Main Result

First, we note that the problem as posed in the introduction turns out to be slightly easier than one might have expected, because I({x, y}; θ) is linear in p(x). This, in turn, implies that p(x) must be degenerate, concentrated on the points x where I is maximal. Thus, instead of finding optimal distributions p(x), we need only find optimal inputs x, in the sense of maximizing the conditional information between θ and y, given a single input x:

I(y; θ|x) ≡ ∫_Y ∫_Θ p_N(θ) p(y|θ, x) log [ p(y|x, θ) / ∫_Θ p_N(θ) p(y|x, θ) ].

Our main result is a “Bernstein-von Mises”-type theorem [12]. The classical form of this kind of result says, basically, that if the posterior distributions are consistent (in the sense that p_N(U) → 1 for any neighborhood U of the true parameter θ0) and the relevant likelihood ratios are sufficiently smooth on average, then the posterior distributions p_N(θ) are asymptotically normal, with easily calculable asymptotic mean and variance. We adapt this result to the present case, where x is chosen according to the information-maximization recipe. It turns out that the hard part is proving consistency (c.f. section 4); we give the basic consistency lemma (interesting in its own right) first, from which the main theorem follows fairly easily.

Lemma 1 (Consistency). Assume the following conditions:

1. The parameter space Θ is compact.
2.
The log-likelihood log p(y|x, θ) is Lipschitz in θ, uniformly in x, with respect to some dominating measure on Y.
3. The prior measure p0 assigns positive measure to any neighborhood of θ0.
4. The maximal divergence sup_x D_KL(θ0; θ|x) is positive for all θ ≠ θ0.

Then the posteriors are consistent: p_N(U) → 1 in probability for any neighborhood U of θ0.

Theorem 2 (Asymptotic normality). Assume the conditions of Lemma 1, strengthened as follows:

1. Θ has a smooth, finite-dimensional manifold structure in a neighborhood of θ0.
2. The log-likelihood log p(y|x, θ) is uniformly C² in θ. In particular, the Fisher information matrices

I_θ(x) = ∫_Y ( ṗ(y|x, θ) / p(y|x, θ) )^t ( ṗ(y|x, θ) / p(y|x, θ) ) p(y|θ, x),

where the differential ṗ is taken with respect to θ, are well-defined and continuous in θ, uniformly in (x, θ) in some neighborhood of θ0.
3. The prior measure p0 is absolutely continuous in some neighborhood of θ0, with a continuous positive density at θ0.
4. max_{C ∈ co(I_{θ0}(x))} det(C) > 0, where co(I_{θ0}(x)) denotes the convex closure of the set of Fisher information matrices I_{θ0}(x).

Then

||p_N − N(μ_N, σ²_N)|| → 0

in probability, where ||·|| denotes variation distance, N(μ_N, σ²_N) denotes the normal density with mean μ_N and variance σ²_N, and μ_N is asymptotically normal with mean θ0 and variance σ²_N. Here

(N σ²_N)^{-1} → argmax_{C ∈ co(I_{θ0}(x))} det(C);

the maximum in the above expression is well-defined and unique.

Thus, under these conditions, the information maximization strategy works, and works better than the i.i.d.
x strategy (where the asymptotic variance σ² is inversely related to an average, not a maximum, over x, and is therefore generically larger).

A few words about the assumptions are in order. Most should be fairly self-explanatory: the conditions on the priors, as usual, are there to ensure that the prior becomes irrelevant in the face of sufficient posterior evidence; the smoothness assumptions on the likelihood permit the local expansion which is the source of asymptotic normality; and the condition on the maximal divergence function sup_x D_KL(θ0; θ|x) ensures that distinct models θ0 and θ are identifiable. Finally, some form of monotonicity or compactness on Θ is necessary, mostly to bound the maximal divergence function sup_x D_KL(θ0; θ|x) and its inverse away from zero (the lower bound, again, is to ensure identifiability; the necessity of the upper bound, on the other hand, will become clear in section 4); also, compactness is useful (though not necessary) for adapting certain Glivenko-Cantelli bounds [12] for the consistency proof.

It should also be clear that we have not stated the results as generally as possible; we have chosen instead to use assumptions that are simple to understand and verify, and to leave the technical generalizations to the interested reader. Our assumptions should be weak enough for most neurophysiological and psychophysical situations, for example, by assuming that parameters take values in bounded (though possibly large) sets and that tuning curves are not infinitely steep. The proofs of these three results are basically elaborations on Wald's consistency method and Le Cam's approach to the Bernstein-von Mises theorem [12].

3 Applications

3.1 Psychometric model

As noted in the introduction, psychophysicists have employed versions of the information-maximization procedure for some years [14, 9, 13, 6].
References in [13], for example, go back four decades, and while these earlier investigators usually couched their discussion in terms of variance instead of entropy, the basic idea is the same (note, for example, that minimizing entropy is asymptotically equivalent to minimizing variance, by our main theorem). Our results above allow us to precisely quantify the effectiveness of this strategy.

The standard psychometric model is as follows. The response space Y is binary, corresponding to subjective “yes” or “no” detection responses. Let f be “sigmoidal”: a uniformly smooth, monotonically increasing function on the line, such that f(0) = 1/2, lim_{t→−∞} f(t) = 0 and lim_{t→∞} f(t) = 1 (this function represents the detection probability when the subject is presented with a stimulus of strength t). Let f_{a,θ} = f((t − θ)/a); θ here serves as a location (“threshold”) parameter, while a sets the scale (we assume a is known, for now, although of course this can be relaxed [6]). Finally, let p(x) and p0(θ) be some fixed sampling and prior distributions, respectively, both with smooth densities with respect to Lebesgue measure on some interval Θ.

Now, for any fixed scale a, we want to compare the performance of the information-maximization strategy to that of the i.i.d. p(x) procedure. We have by theorem 2 that the most efficient estimator of θ is asymptotically unbiased with asymptotic variance

σ²_info ≈ ( N sup_x I_{θ0}(x) )^{-1},

while the usual calculations show that the asymptotic variance of any efficient estimator based on i.i.d.
samples from p(x) is given by

σ²_iid ≈ ( N ∫_X dp(x) I_{θ0}(x) )^{-1};

the key point, again, is that σ^{−2}_iid is an average, while σ^{−2}_info is a maximum, and hence σ_iid ≥ σ_info, with equality only in the exceptional case that the Fisher information I_{θ0}(x) is constant almost surely in p(x).

The Fisher information here is easily calculated to be

I_θ = (ḟ_{a,θ})² / ( f_{a,θ} (1 − f_{a,θ}) ).

We can immediately derive two easy but important conclusions. First, there is just one function f* for which the i.i.d. sampling strategy is as asymptotically efficient as the information-maximization strategy; for all other f, information maximization is strictly more efficient. The extremal function f* is obtained by setting σ_iid = σ_info, implying that I_{θ0}(x) is constant a.e. [p(x)], and so f* is the unique solution of the differential equation

df*/dt = c ( f*(t)(1 − f*(t)) )^{1/2},

where the auxiliary constant c = √I_θ uniquely fixes the scale a. After some calculus, we obtain

f*(t) = ( sin(ct) + 1 ) / 2

on the interval [−π/2c, π/2c] (and defined uniquely, by monotonicity, as 0 or 1 outside this interval). Since the support of the derivative of this function is compact, this result is quite dependent on the sampling density p(x); if p(x) places any of its mass outside of the interval [−π/2c, π/2c], then σ²_iid is always strictly greater than σ²_info. This recapitulates a basic theme from the psychophysical literature comparing adaptive and nonadaptive techniques: when the scale of the nonlinearity f is either unknown or smaller than the scale of the i.i.d.
sampling density p(x), adaptive techniques are greatly preferable.

Second, a crude analysis shows that, as the scale a of the nonlinearity shrinks, the ratio σ²_iid/σ²_info grows approximately as 1/a; this gives quantitative support to the intuition that the sharper the nonlinearity with respect to the scale of the sampling distribution p(x), the more we can expect the information-maximization strategy to help.

3.2 Linear-nonlinear cascade model

We now consider a model that has received increasing attention from the neurophysiology community (see, e.g., [8] for some analysis and relevant references). The model is of cascade form, with a linear stage followed by a nonlinear stage: the input space X is a compact subset of d-dimensional Euclidean space (take X to be the unit sphere, for concreteness), and the firing rate of the model cell, given input ~x ∈ X, is given by the simple form

E(y|~x, θ) = f(<~θ, ~x>).

Here the linear filter ~θ is some unit vector in X′, the dual space of X (thus, the model space Θ is isomorphic to X), while the nonlinearity f is some nonconstant, nonnegative function on [−1, 1]. We assume that f is uniformly smooth, to satisfy the conditions of theorem 2; we also assume f is known, although, again, this can be relaxed. The response space Y — the space of possible spike counts, given the stimulus ~x — can be taken to be the nonnegative integers. For simplicity, let the conditional probabilities p(y|~x, θ) be parametrized uniquely by the mean firing rate f(<~θ, ~x>); the most convenient model, as usual, is to assume that p(y|~x, θ) is Poisson with mean f(<~θ, ~x>).
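A toy simulator for this cascade is immediate. In the sketch below the dimension d, the exponential nonlinearity, and all names are arbitrary illustrative assumptions; only the Poisson-with-rate-f(<~θ, ~x>) structure comes from the model just described:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # stimulus dimensionality (arbitrary)

def lnp_response(theta, x, f):
    """Sample a spike count from the cascade model: Poisson with
    mean firing rate f(<theta, x>)."""
    return rng.poisson(f(np.dot(theta, x)))

f = lambda t: np.exp(2.0 * t)            # a smooth, nonnegative nonlinearity on [-1, 1]

theta = np.ones(d) / np.sqrt(d)          # unit-norm linear filter
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                   # stimulus drawn uniformly on the unit sphere

y = lnp_response(theta, x, f)            # nonnegative integer spike count
```

Normalizing a standard Gaussian vector, as above, is one standard way to draw uniformly from the unit sphere.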
Finally, we assume that the sampling density p(x) is uniform on the unit sphere (this choice is natural for several reasons, mainly involving symmetry; see, e.g., [2, 8]), and that the prior p0(θ) is positive and continuous (and is therefore bounded away from zero, by the compactness of Θ).

The Fisher information for this model is easily calculated as

I_θ(x) = [ (f′(<~θ, ~x>))² / f(<~θ, ~x>) ] P_{~x,θ},

where f′ is the usual derivative of the real function f and P_{~x,θ} is the projection operator corresponding to ~x, restricted to the (d − 1)-dimensional tangent space to the unit sphere at θ. Theorem 2 now implies that

σ²_info ≈ ( N max_{t∈[−1,1]} f′(t)² g(t) / f(t) )^{-1},

while

σ²_iid ≈ ( N ∫_{[−1,1]} dp(t) f′(t)² g(t) / f(t) )^{-1},

where g(t) = 1 − t², p(t) denotes the one-dimensional marginal measure induced on the interval [−1, 1] by the uniform measure p(x) on the unit sphere, and σ² in each of these two expressions multiplies the (d − 1)-dimensional identity matrix.

Clearly, the arguments of subsection 3.1 apply here as well: the ratio σ²_iid/σ²_info grows roughly linearly in the inverse of the scale of the nonlinearity. The more interesting asymptotics here, though, are in d. This is because the unit sphere has a measure concentration property [11]: as d → ∞, the measure p(t) becomes exponentially concentrated around 0. In fact, it is easy to show directly that, in this limit, p(t) converges in distribution to the normal measure with mean zero and variance d^{−2}. The most surprising implication of this result is seen for nonlinearities f such that f′(0) = 0, f(0) > 0; we have in mind, for example, symmetric nonlinearities like those often used to model complex cells in visual cortex.
For these nonlinearities,

σ²_info / σ²_iid = O(d^{−2});

that is, the information maximization strategy becomes infinitely more efficient than the usual i.i.d. approach as the dimensionality of the spaces X and Θ grows.

4 A Negative Example

Our next example is more negative and perhaps more surprising: it shows how the information-maximization strategy can fail, in a certain sense, if the conditions of the consistency lemma are not met. Let Θ be multidimensional, with coordinates which are “independent” in a certain sense, and assume the expected information obtained from one coordinate of the parameter remains bounded strictly away from the expected information obtained from one of the other coordinates. For instance, consider the following model:

p(1|x) = .5       for −1 < x ≤ θ_{−1},
         f_{−1}   for θ_{−1} < x ≤ 0,
         .5       for 0 < x ≤ θ_1,
         f_1      for θ_1 < x ≤ 1,

where 0 ≤ f_{−1}, f_1 ≤ 1 and |f_{−1} − .5| > |f_1 − .5| are known, and −1 < θ_{−1} < 0 and 0 < θ_1 < 1 are the parameters we want to learn.

Let the initial prior be absolutely continuous with respect to Lebesgue measure; this implies that all posteriors will have the same property. Then, using the inverse cumulative probability transform and the fact that mutual information is invariant with respect to invertible mappings, it is easy to show that the maximal information we can obtain by sampling from the left is strictly greater than the maximal information obtainable from the right, uniformly in N. Thus the information-maximization strategy will sample from x < 0 forever, leading to a linear information growth rate (and easily-proven consistency) for the left parameter and non-convergence on the right. Compare the performance of the usual i.i.d.
approach for choosing x (using any Lebesgue-dominating measure on the parameter space), which leads to the standard root-N rate for both parameters (i.e., is strongly consistent in posterior probability).

Note that this kind of inconsistency problem does not occur in the case of sufficiently smooth p(y|x, θ), by our main theorem. Thus one way of avoiding this problem would be to fix a finite sampling scale for each coordinate (i.e., discretizing). Below this scale, no information can be extracted; therefore, when the algorithm hits this “floor” for one coordinate, it will switch to the other. However, it is possible to find other examples which show that the lack of consistency is not necessarily tied to the discontinuous nature of the conditional densities.

5 Directions

In this paper, we have presented a rigorous theoretical framework for adaptively designing experiments using an information-theoretic objective function. Most importantly, we have offered some asymptotic results which clarify the effectiveness of adaptive experiment design using the information-theoretic objective function (1); in addition, we expect that our asymptotic approximations should find applications in approximative computational schemes for optimizing stimulus choice during this type of online experiment. For example, our theorem 2 might suggest the use of a mixture-of-Gaussians representation as an efficient approximation for the posteriors p_N(θ) [5].

It should be clear that we have left several important questions open.
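As a one-dimensional illustration of the kind of posterior approximation such asymptotics suggest, here is a single-Gaussian (moment-matching) sketch, a crude special case of the mixture-of-Gaussians idea mentioned above; the gridded log-posterior and all names below are invented for illustration:

```python
import numpy as np

def gaussian_posterior_approx(thetas, log_post):
    """Approximate a posterior given (up to a constant) by `log_post`
    on the grid `thetas` with the normal density matching its mean and
    variance, as asymptotic normality suggests."""
    w = np.exp(log_post - log_post.max())   # unnormalized posterior weights
    w /= w.sum()
    mu = float(np.dot(w, thetas))
    var = float(np.dot(w, (thetas - mu) ** 2))
    return mu, var

thetas = np.linspace(-5.0, 5.0, 1001)
log_post = -0.5 * (thetas - 1.0) ** 2 / 0.25   # hypothetical log-posterior
mu, var = gaussian_posterior_approx(thetas, log_post)
```

In an online setting, updating only (mu, var), or a small mixture of such components, sidesteps storing the full gridded posterior.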
Perhaps the most obvious such question concerns the use of non-information-theoretic objective functions. It turns out that many of our results apply with only modest changes if the experiment is instead designed to minimize something like the Bayes mean-square error (perhaps defined only locally if Θ has a nontrivial manifold structure), for example: in this case, the results in sections 3.1 and 3.2 remain completely unchanged, while the statement of our main theorem requires only slight changes in the asymptotic variance formula (see http://www.cns.nyu.edu/~liam). Thus it seems our results here can add very little to any discussion of what objective function is “best” in general.

We briefly describe a few more open research directions below.

5.1 “Batch mode” and stimulus dependencies

Perhaps our strongest assumption here is that the experimenter will be able to freely choose the stimuli on each trial. This might be inaccurate for a number of reasons: for example, computational demands might require that experiments be run in “batch mode,” with stimulus optimization taking place not after every trial, but perhaps only after each batch of k stimuli, all chosen according to some fixed distribution p(x). Another common situation involves stimuli which vary temporally, for which the system is commonly modelled as responding not just to a given stimulus x(t), but also to all of its time-translates x(t − τ). Finally, if there is some cost C(x0, x1) associated with changing the state of the observational apparatus from the current state x0 to x1, the experimenter may wish to optimize an objective function which incorporates this cost: for example, the information I(y; θ|x1) discounted by a cost term δC(x0, x1).

Each of these situations is clearly ripe for further study.
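For the last of these situations, one plausible selection rule discounts the expected information by the movement cost. The sketch below is purely illustrative; the quadratic cost, the discount weight delta, and the hypothetical information profile are all assumptions:

```python
import numpy as np

def cost_sensitive_choice(candidates, info, x0, delta):
    """Pick the next stimulus x1 maximizing info(x1) - delta * C(x0, x1),
    with an assumed quadratic movement cost C(x0, x1) = (x1 - x0)**2."""
    scores = [info(x) - delta * (x - x0) ** 2 for x in candidates]
    return candidates[int(np.argmax(scores))]

xs = np.linspace(-2.0, 2.0, 41)
info = lambda x: np.exp(-(x - 1.0) ** 2)   # hypothetical information profile, peaked at x = 1
x1 = cost_sensitive_choice(xs, info, x0=-1.0, delta=0.5)
# with this heavy cost the chosen stimulus stays near the current state x0 = -1
```

Varying delta trades off information gain against apparatus movement; delta = 0 recovers the unconstrained information-maximization choice.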
Here we restrict ourselves to the first setting, and give a simple conjecture, based on the asymptotic results presented above and inspired by results like those of [1, 4, 10]. First, we state more precisely the optimization problem inherent in designing a “batch” experiment: we wish to choose some sequence, {x_i}_{1≤i≤k}, to maximize

I({x_i, y_i}_{1≤i≤k}; θ);

the main difference here is that {x_i}_{1≤i≤k} must be chosen nonadaptively, i.e., without sequential knowledge of the responses {y_i}. Clearly, the order of any sequence of optimal {x_i}_{1≤i≤k} is irrelevant to the above objective function; in addition, it should be apparent that if no given piece of data (x, y) is too strong (for example, under Lipschitz conditions like those in lemma 1), then any given elements of such an optimal sequence {x_i}_{1≤i≤k} should be asymptotically independent. (Without such a smoothness condition — for example, if some input x could definitively decide between some given θ0 and θ1 — then no such asymptotic independence statement can hold, since no more than one sample from such an x would be necessary.) Thus, we can hope that we should be able to asymptotically approximate this optimal experiment by sampling in an i.i.d. manner from some well-chosen p(x). Moreover, we can make a guess as to the identity of this putative p(x):

Conjecture (“Batch” mode). Under suitable conditions, the empirical distribution corresponding to any optimal sequence {x_i}_{1≤i≤k},

p̂_k(x) ≡ (1/k) Σ_{i=1}^{k} δ(x_i),

converges weakly as k →
∞ to S, the convex set of maximizers in p(x) of

E_θ log( det( E_x I_θ(x) ) ).    (2)

Expression (2) above is an average over p(θ) of terms proportional to the negative entropy of the asymptotic Gaussian posterior distribution corresponding to each θ, and thus should be maximized by any optimal approximant distribution p(x). (Note also that expression (2) is concave in p(x), ensuring the tractability of the above maximization.) In fact, it is not difficult, using the results of Clarke and Barron [3], to prove the above conjecture under conditions like those of Theorem 2, assuming that X is finite (in which case weak convergence is equivalent to pointwise convergence); we leave generalizations for future work.

Acknowledgements

We thank R. Sussman, E. Simoncelli, C. Machens, and D. Pelli for helpful conversations. This work was partially supported by a predoctoral fellowship from HHMI.

References

[1] J. Berger, J. Bernardo, and M. Mendoza. Bayesian Statistics 4, chapter On priors that maximize expected information, pages 35-60. Oxford University Press, 1989.
[2] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199-213, 2001.
[3] B. Clarke and A. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36:453-471, 1990.
[4] B. Clarke and A. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41:37-60, 1994.
[5] P. Deignan, P. Meckl, M. Franchek, J. Abraham, and S. Jaliwala. Using mutual information to pre-process input data for a virtual sensor. In ACC, number ASME0043 in American Control Conference, 2000.
[6] L. Kontsevich and C. Tyler. Bayesian adaptive estimation of psychometric slope and threshold. Vision Research, 39:2729-2737, 1999.
[7] M. Mascaro and D.
Bradley. Optimized neuronal tuning algorithm for multichannel recording. Unpublished abstract at http://www.compscipreprints.com/, 2002.
[8] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437-464, 2003.
[9] D. Pelli. The ideal psychometric procedure. Investigative Ophthalmology and Visual Science (Supplement), 28:366, 1987.
[10] H. R. Scholl. Shannon optimal priors on i.i.d. statistical experiments converge weakly to Jeffreys' prior. Available at citeseer.nj.nec.com/104699.html, 1998.
[11] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. IHES, 81:73-205, 1995.
[12] A. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.
[13] A. Watson and A. Fitzhugh. The method of constant stimuli is inefficient. Perception and Psychophysics, 47:87-91, 1990.
[14] A. Watson and D. Pelli. QUEST: a Bayesian adaptive psychophysical method. Perception and Psychophysics, 33:113-120, 1983.
", "award": [], "sourceid": 2473, "authors": [{"given_name": "Liam", "family_name": "Paninski", "institution": null}]}