{"title": "Entropy and Inference, Revisited", "book": "Advances in Neural Information Processing Systems", "page_first": 471, "page_last": 478, "abstract": null, "full_text": "Entropy and Inference, Revisited\n\nIlya Nemenman,1,2 Fariel Shafee,3 and William Bialek1,3\n\n1 NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540\n\n2 Institute for Theoretical Physics, University of California, Santa Barbara, CA 93106\n\n3 Department of Physics, Princeton University, Princeton, New Jersey 08544\n\nnemenman@itp.ucsb.edu, {fshafee/wbialek}@princeton.edu\n\nAbstract\n\nWe study properties of popular near\u2013uniform (Dirichlet) priors for learning\nundersampled probability distributions on discrete nonmetric spaces\nand show that they lead to disastrous results. However, an Occam\u2013style\nphase space argument expands the priors into their infinite mixture and\nresolves most of the observed problems. This leads to a surprisingly good\nestimator of entropies of discrete distributions.\n\nLearning a probability distribution from examples is one of the basic problems in data\nanalysis. Common practical approaches introduce a family of parametric models, leading to\nquestions about model selection. In Bayesian inference, computing the total probability of\nthe data arising from a model involves an integration over parameter space, and the resulting\n\u201cphase space volume\u201d automatically discriminates against models with larger numbers of\nparameters\u2014hence the description of these volume terms as Occam factors [1, 2]. 
As we\nmove from finite parameterizations to models that are described by smooth functions, the\nintegrals over parameter space become functional integrals and methods from quantum\nfield theory allow us to do these integrals asymptotically; again the volume in model space\nconsistent with the data is larger for models that are smoother and hence less complex [3].\nFurther, at least under some conditions the relevant degree of smoothness can be determined\nself\u2013consistently from the data, so that we approach something like a model independent\nmethod for learning a distribution [4].\n\nThe results emphasizing the importance of phase space factors in learning prompt us to\nlook back at a seemingly much simpler problem, namely learning a distribution on a\ndiscrete, nonmetric space. Here the probability distribution is just a list of numbers $\\{q_i\\}$,\n$i = 1, 2, \\cdots, K$, where $K$ is the number of bins or possibilities. We do not assume any\nmetric on the space, so that a priori there is no reason to believe that any $q_i$ and $q_j$ should\nbe similar. The task is to learn this distribution from a set of examples, which we can\ndescribe as the number of times $n_i$ each possibility is observed in a set of $N = \\sum_{i=1}^{K} n_i$\nsamples. This problem arises in the context of language, where the index $i$ might label\nwords or phrases, so that there is no natural way to place a metric on the space, nor is it\neven clear that our intuitions about similarity are consistent with the constraints of a\nmetric space. 
Similarly, in bioinformatics the index $i$ might label $n$\u2013mers of the DNA or\namino acid sequence, and although most work in the field is based on metrics for sequence\ncomparison one might like an alternative approach that does not rest on such assumptions.\nIn the analysis of neural responses, once we fix our time resolution the response becomes\na set of discrete \u201cwords,\u201d and estimates of the information content in the response are\ndetermined by the probability distribution on this discrete space. What all of these examples\nhave in common is that we often need to draw some conclusions with data sets that are not\nin the asymptotic limit $N \\gg K$. Thus, while we might use a large corpus to sample the\ndistribution of words in English by brute force (reaching $N \\gg K$ with $K$ the size of the\nvocabulary), we can hardly do the same for three or four word phrases.\n\nIn models described by continuous functions, the infinite number of \u201cpossibilities\u201d can\nnever be overwhelmed by examples; one is saved by the notion of smoothness. Is there\nsome nonmetric analog of this notion that we can apply in the discrete case? Our intuition\nis that information theoretic quantities may play this role. If we have a joint distribution of\ntwo variables, the analog of a smooth distribution would be one which does not have too\nmuch mutual information between these variables. Even more simply, we might say that\nsmooth distributions have large entropy. While the idea of \u201cmaximum entropy inference\u201d\nis common [5], the interplay between constraints on the entropy and the volume in the\nspace of models seems not to have been considered. As we shall explain, phase space\nfactors alone imply that seemingly sensible, more or less uniform priors on the space of\ndiscrete probability distributions correspond to disastrously singular prior hypotheses about\nthe entropy of the underlying distribution. 
We argue that reliable inference outside the\nasymptotic regime $N \\gg K$ requires a more uniform prior on the entropy, and we offer one\nway of doing this. While many distributions are consistent with the data when $N \\le K$,\nwe provide empirical evidence that this flattening of the entropic prior allows us to make\nsurprisingly reliable statements about the entropy itself in this regime.\n\nAt the risk of being pedantic, we state very explicitly what we mean by uniform or nearly\nuniform priors on the space of distributions. The natural \u201cuniform\u201d prior is given by\n\n$$P_u(\\{q_i\\}) = \\frac{1}{Z_u}\\, \\delta\\!\\left(1 - \\sum_{i=1}^{K} q_i\\right), \\qquad Z_u = \\int_A dq_1\\, dq_2 \\cdots dq_K\\; \\delta\\!\\left(1 - \\sum_{i=1}^{K} q_i\\right), \\tag{1}$$\n\nwhere the delta function imposes the normalization, $Z_u$ is the total volume in the space of\nmodels, and the integration domain $A$ is such that each $q_i$ varies in the range $[0, 1]$. Note\nthat, because of the normalization constraint, an individual $q_i$ chosen from this distribution\nin fact is not uniformly distributed\u2014this is also an example of phase space effects, since in\nchoosing one $q_i$ we constrain all the other $\\{q_{j \\ne i}\\}$. What we mean by uniformity is that all\ndistributions that obey the normalization constraint are equally likely a priori.\n\nInference with this uniform prior is straightforward. If our examples come independently\nfrom $\\{q_i\\}$, then we calculate the probability of the model $\\{q_i\\}$ with the usual Bayes rule (see footnote 1):\n\n$$P(\\{q_i\\}|\\{n_i\\}) = \\frac{P(\\{n_i\\}|\\{q_i\\})\\, P_u(\\{q_i\\})}{P_u(\\{n_i\\})}, \\qquad P(\\{n_i\\}|\\{q_i\\}) = \\prod_{i=1}^{K} (q_i)^{n_i}. \\tag{2}$$\n\nIf we want the best estimate of the probability $q_i$ in the least squares sense, then we should\ncompute the conditional mean, and this can be done exactly, so that [6, 7]\n\n$$\\langle q_i \\rangle = \\frac{n_i + 1}{N + K}. \\tag{3}$$\n\nThus we can think of inference with this uniform prior as setting probabilities equal to the\nobserved frequencies, but with an \u201cextra count\u201d in every bin. 
This sensible procedure was\nfirst introduced by Laplace [8]. It has the desirable property that events which have not\nbeen observed are not automatically assigned probability zero.\n\nFootnote 1: If the data are unordered, extra combinatorial factors have to be included in $P(\\{n_i\\}|\\{q_i\\})$.\nHowever, these cancel immediately in later expressions.\n\nA natural generalization of these ideas is to consider priors that have a power\u2013law\ndependence on the probabilities, the so-called Dirichlet family of priors:\n\n$$P_\\beta(\\{q_i\\}) = \\frac{1}{Z(\\beta)}\\, \\delta\\!\\left(1 - \\sum_{i=1}^{K} q_i\\right) \\prod_{i=1}^{K} q_i^{\\beta - 1}. \\tag{4}$$\n\nIt is interesting to see what typical distributions from these priors look like. Even though\ndifferent $q_i$\u2019s are not independent random variables due to the normalizing $\\delta$\u2013function,\ngeneration of random distributions is still easy: one can show that if $q_i$\u2019s are generated\nsuccessively (starting from $i = 1$ and proceeding up to $i = K$) from the Beta\u2013distribution\n\n$$P(q_i) = B\\!\\left(\\frac{q_i}{1 - \\sum_{j<i} q_j};\\, \\beta,\\, (K - i)\\beta\\right), \\qquad B(x; a, b) = \\frac{x^{a-1} (1 - x)^{b-1}}{B(a, b)}, \\tag{5}$$\n\nthen the probability of the whole sequence $\\{q_i\\}$ is $P_\\beta(\\{q_i\\})$. Fig. 1 shows some\ntypical distributions generated this way. They represent different regions of the range of\npossible entropies: low entropy ($\\sim 1$ bit, where only a few bins have observable\nprobabilities), entropy in the middle of the possible range, and entropy in the vicinity\nof the maximum, $\\log_2 K$. When learning an unknown distribution, we usually have\nno a priori reason to expect it to look like only one of these possibilities, but choosing\n$\\beta$ pretty much fixes allowed \u201cshapes.\u201d This will be a focal point of our discussion.\n\nFigure 1: Typical distributions, $K = 1000$. [Panels: $\\beta = 0.0007$, $S = 1.05$ bits; $\\beta = 0.02$, $S = 5.16$ bits; $\\beta = 1$, $S = 9.35$ bits. Axes: $q$ versus bin number.]\n\nEven though distributions look different, inference with all priors Eq. (4) is similar [6, 7]:\n\n$$\\langle q_i \\rangle_\\beta = \\frac{n_i + \\beta}{N + \\kappa}, \\qquad \\kappa = K\\beta. \\tag{6}$$\n\nThis simple modification of Laplace\u2019s rule, Eq. (3), which allows us to vary the\nprobability assigned to the outcomes not yet seen, was first examined by Hardy and Lidstone\n[9, 10]. Together with Laplace\u2019s formula, $\\beta = 1$, this family includes the usual\nmaximum likelihood estimator (MLE), $\\beta \\to 0$, that identifies probabilities with frequencies, as\nwell as the Jeffreys\u2019 or Krichevsky\u2013Trofimov (KT) estimator, $\\beta = 1/2$ [11, 12, 13], the\nSchurmann\u2013Grassberger (SG) estimator, $\\beta = 1/K$ [14], and other popular choices.\nTo understand why inference in the family of priors defined by Eq. (4) is unreliable,\nconsider the entropy of a distribution drawn at random from this ensemble. Ideally we would\nlike to compute this whole a priori distribution of entropies,\n\n$$P_\\beta(S) = \\int dq_1\\, dq_2 \\cdots dq_K\\; P_\\beta(\\{q_i\\})\\; \\delta\\!\\left[S + \\sum_{i=1}^{K} q_i \\log_2 q_i\\right], \\tag{7}$$\n\nbut this is quite difficult. However, as noted by Wolpert and Wolf [6], one can compute\nthe moments of $P_\\beta(S)$ rather easily. 
Transcribing their results to the present notation (and\ncorrecting some small errors), we find:\n\n$$\\xi(\\beta) \\equiv \\langle S[n_i = 0] \\rangle_\\beta = \\psi_0(\\kappa + 1) - \\psi_0(\\beta + 1), \\tag{8}$$\n\n$$\\sigma^2(\\beta) \\equiv \\langle (\\delta S)^2[n_i = 0] \\rangle_\\beta = \\frac{\\beta + 1}{\\kappa + 1}\\, \\psi_1(\\beta + 1) - \\psi_1(\\kappa + 1), \\tag{9}$$\n\nwhere $\\psi_m(x) = (d/dx)^{m+1} \\log_2 \\Gamma(x)$ are the polygamma functions.\n\nThis behavior of the moments is shown in Fig. 2. We are faced with a striking\nobservation: a priori distributions of entropies in the power\u2013law priors are extremely peaked\nfor even moderately large $K$. Indeed, as a simple analysis shows, their maximum\nstandard deviation of approximately 0.61 bits is attained at $\\beta \\approx 1/K$, where\n$\\xi(\\beta) \\approx 1/\\ln 2$ bits. This has to be compared with\nthe possible range of entropies, $[0, \\log_2 K]$,\nwhich is asymptotically large with $K$. Even\nworse, for any fixed $\\beta$ and sufficiently large\n$K$, $\\xi(\\beta) = \\log_2 K - O(K^0)$, and $\\sigma(\\beta) \\propto 1/\\sqrt{\\kappa}$.\nSimilarly, if $K$ is large, but $\\kappa$ is\nsmall, then $\\xi(\\beta) \\propto \\kappa$, and $\\sigma(\\beta) \\propto \\sqrt{\\kappa}$.\nThis paints a lively picture: varying $\\beta$ between\n0 and $\\infty$ results in a smooth variation\nof $\\xi$, the a priori expectation of the entropy,\nfrom 0 to $S_{\\max} = \\log_2 K$. Moreover, for\nlarge $K$, the standard deviation of $P_\\beta(S)$ is always negligible relative to the possible range\nof entropies, and it is negligible even absolutely for $\\xi \\gg 1$ ($\\beta \\gg 1/K$). Thus a seemingly\ninnocent choice of the prior, Eq. (4), leads to a disaster: fixing $\\beta$ specifies the entropy\nalmost uniquely. 
Furthermore, the situation persists even after we observe some data: until\nthe distribution is well sampled, our estimate of the entropy is dominated by the prior!\n\nFigure 2: $\\xi(\\beta)/\\log_2 K$ and $\\sigma(\\beta)$ as functions\nof $\\beta$ and $K$ (curves for $K = 10$, $100$, and $1000$); gray bands are the region\nof $\\pm\\sigma(\\beta)$ around the mean. Note the transition\nfrom the logarithmic to the linear scale\nat $\\beta = 0.25$ in the inset.\n\nThus it is clear that all commonly used estimators mentioned above have a problem. While\nthey may or may not provide a reliable estimate of the distribution $\\{q_i\\}$ (see footnote 2), they are\ndefinitely a poor tool to learn entropies. Unfortunately, often we are interested precisely in\nthese entropies or similar information\u2013theoretic quantities, as in the examples (neural code,\nlanguage, and bioinformatics) we briefly mentioned earlier.\n\nAre the usual estimators really this bad? Consider this: for the MLE ($\\beta = 0$), Eqs. (8, 9) are\nformally wrong since it is impossible to normalize $P_0(\\{q_i\\})$. However, the prediction that\n$P_0(S) = \\delta(S)$ still holds. Indeed, $S_{\\mathrm{ML}}$, the entropy of the ML distribution, is zero even for\n$N = 1$, let alone for $N = 0$. In general, it is well known that $S_{\\mathrm{ML}}$ always underestimates\nthe actual value of the entropy, and the correction\n\n$$S = S_{\\mathrm{ML}} + \\frac{K^*}{2N} + O\\!\\left(\\frac{1}{N^2}\\right) \\tag{10}$$\n\nis usually used (cf. [14]). Here we must set $K^* = K - 1$ to have an asymptotically correct\nresult. Unfortunately in an undersampled regime, $N \\ll K$, this is a disaster. To alleviate\nthe problem, different authors have suggested determining the dependence $K^* = K^*(K)$ by\nvarious (rather ad hoc) empirical [15] or pseudo\u2013Bayesian techniques [16]. However, then\nthere is no principled way to estimate both the residual bias and the error of the estimator.\n\nThe situation is even worse for Laplace\u2019s rule, $\\beta = 1$. 
We were unable to find any\nresults in the literature that would show a clear understanding of the effects of the prior\non the entropy estimate, $S_L$. And these effects are enormous: the a priori distribution of\nthe entropy has $\\sigma(1) \\sim 1/\\sqrt{K}$ and is almost $\\delta$-like. This translates into a very certain,\nbut nonetheless possibly wrong, estimate of the entropy. We believe that this type of error\n(cf. Fig. 3) has been overlooked in some previous literature.\n\nFootnote 2: In any case, the answer to this question depends mostly on the \u201cmetric\u201d chosen to measure\nreliability. Minimization of bias, variance, or information cost (Kullback\u2013Leibler divergence between\nthe target distribution and the estimate) leads to very different \u201cbest\u201d estimators.\n\nThe Schurmann\u2013Grassberger estimator, $\\beta = 1/K$, deserves special attention. The\nvariance of $P_\\beta(S)$ is maximized near this value of $\\beta$ (cf. Fig. 2). Thus the SG estimator results\nin the most uniform a priori expectation of $S$ possible for the power\u2013law priors, and\nconsequently in the least bias. We suspect that this feature is responsible for a remark in Ref. [14]\nthat this $\\beta$ was empirically the best for studying printed texts. But even the SG estimator is\nflawed: it is biased towards (roughly) $1/\\ln 2$, and it is still a priori rather narrow.\nSummarizing, we conclude that simple power\u2013law priors, Eq. (4), must not be used\nto learn entropies when there is no strong a priori knowledge to back them up. On\nthe other hand, they are the only priors we know of that allow one to calculate $\\langle q_i \\rangle$,\n$\\langle S \\rangle$, $\\langle \\chi^2 \\rangle$, . . . exactly [6]. Is there a way\nto resolve the problem of peakedness of $P_\\beta(S)$ without throwing away their\nanalytical ease? 
One approach would be to use\n$P^{\\mathrm{flat}}_\\beta(\\{q_i\\}) = \\frac{P_\\beta(\\{q_i\\})}{P_\\beta(S[q_i])}\\, P^{\\mathrm{actual}}(S[q_i])$ as\na prior on $\\{q_i\\}$. This has the feature that the a priori distribution of $S$ deviates from\nuniformity only due to our actual knowledge $P^{\\mathrm{actual}}(S[q_i])$, but not in the way $P_\\beta(S)$\ndoes. However, as we already mentioned, $P_\\beta(S[q_i])$ is yet to be calculated.\n\nFigure 3: Learning the $\\beta = 0.02$ distribution from Fig. 1 with $\\beta = 0.001, 0.02, 1$. The\nactual error of the estimators, $\\langle S \\rangle_\\beta - S$, is plotted against $N$; the\nerror bars are the standard deviations of the posteriors. The \u201cwrong\u201d estimators are very\ncertain but nonetheless incorrect.\n\nAnother way to a flat prior is to write\n$P(S) = 1 = \\int \\delta(S - \\xi)\\, d\\xi$. If we find a family of priors $P(\\{q_i\\}; \\mathrm{parameters})$ that result in\na $\\delta$-function over $S$, and if changing the parameters moves the peak across the whole range\nof entropies uniformly, we may be able to use this. Luckily, $P_\\beta(S)$ is almost a $\\delta$-function (see footnote 3)!\nIn addition, changing $\\beta$ results in changing $\\xi(\\beta) = \\langle S[n_i = 0] \\rangle_\\beta$ across the whole range\n$[0, \\log_2 K]$. So we may hope that the prior (see footnote 4)\n\n$$P(\\{q_i\\}; \\beta) = \\frac{1}{Z}\\, \\delta\\!\\left(1 - \\sum_{i=1}^{K} q_i\\right) \\prod_{i=1}^{K} q_i^{\\beta - 1}\\; \\frac{d\\xi(\\beta)}{d\\beta}\\, P(\\beta) \\tag{11}$$\n\nmay do the trick and estimate entropy reliably even for small $N$, and even for distributions\nthat are atypical for any one $\\beta$. We have less reason, however, to expect that this will give\nan equally reliable estimator of the atypical distributions themselves (see footnote 2). Note the term $d\\xi/d\\beta$\nin Eq. (11). It is there because $\\xi$, not $\\beta$, measures the position of the entropy density peak.\n\nFootnote 3: The approximation becomes not so good as $\\beta \\to 0$ since $\\sigma(\\beta)$ becomes $O(1)$ before dropping\nto zero. Even worse, $P_\\beta(S)$ is skewed at small $\\beta$. This accumulates an extra weight at $S = 0$. Our\napproach to dealing with these problems is to ignore them while the posterior integrals are dominated\nby $\\beta$\u2019s that are far away from zero. This was always the case in our simulations, but is an open\nquestion for the analysis of real data.\n\nFootnote 4: Priors that are formed as weighted sums of the different members of the Dirichlet family are\nusually called Dirichlet mixture priors. They have been used to estimate probability distributions of,\nfor example, protein sequences [17]. Equation (11), an infinite mixture, is a further generalization,\nand, to our knowledge, it has not been studied before.\n\nInference with the prior, Eq. (11), involves additional averaging over $\\beta$ (or, equivalently,\n$\\xi$), but is nevertheless straightforward. The a posteriori moments of the entropy are\n\n$$\\widehat{S^m} = \\frac{\\int d\\xi\\; \\rho(\\xi, [n_i])\\, \\langle S^m[n_i] \\rangle_{\\beta(\\xi)}}{\\int d\\xi\\; \\rho(\\xi, [n_i])}, \\quad \\text{where} \\tag{12}$$\n\n$$\\rho(\\xi, [n_i]) = P(\\beta(\\xi))\\, \\frac{\\Gamma(\\kappa(\\xi))}{\\Gamma(N + \\kappa(\\xi))} \\prod_{i=1}^{K} \\frac{\\Gamma(n_i + \\beta(\\xi))}{\\Gamma(\\beta(\\xi))}. \\tag{13}$$\n\nHere the moments $\\langle S^m[n_i] \\rangle_{\\beta(\\xi)}$ are calculated at fixed $\\beta$ according to the (corrected)\nformulas of Wolpert and Wolf [6]. We can view this inference scheme as follows: first, one\nsets the value of $\\beta$ and calculates the expectation value (or other moments) of the entropy\nat this $\\beta$. For small $N$, the expectations will be very close to their a priori values due to the\npeakedness of $P_\\beta(S)$. 
Afterwards, one integrates over $\\beta(\\xi)$ with the density $\\rho(\\xi)$, which\nincludes our a priori expectations about the entropy of the distribution we are studying\n[$P(\\beta(\\xi))$], as well as the evidence for a particular value of $\\beta$ [$\\Gamma$-terms in Eq. (13)].\nThe crucial point is the behavior of the evidence. If it has a pronounced peak at some $\\beta_{\\mathrm{cl}}$,\nthen the integrals over $\\beta$ are dominated by the vicinity of the peak, $\\hat{S}$ is close to $\\xi(\\beta_{\\mathrm{cl}})$, and\nthe variance of the estimator is small. In other words, the data \u201cselect\u201d some value of $\\beta$, much\nin the spirit of Refs. [1] \u2013 [4]. However, this scenario may fail in two ways. First, there\nmay be no peak in the evidence; this will result in a very wide posterior and poor inference.\nSecond, the posterior density may be dominated by $\\beta \\to 0$, which corresponds to MLE,\nthe best possible fit to the data, and is a discrete analog of overfitting. While all these\nsituations are possible, we claim that generically the evidence is well\u2013behaved. Indeed,\nwhile small $\\beta$ increases the fit to the data, it also increases the phase space volume of all\nallowed distributions and thus decreases the probability of each particular one [remember that\n$\\langle q_i \\rangle_\\beta$ has $\\beta$ extra counts in each bin, thus distributions with $q_i < \\beta/(N + \\kappa)$ are strongly\nsuppressed]. The fight between the \u201cgoodness of fit\u201d and the phase space volume should\nthen result in some non\u2013trivial $\\beta_{\\mathrm{cl}}$, set by factors $\\propto N$ in the exponent of the integrand.\nFigure 4 shows how the prior, Eq. (11), performs on some of the many distributions\nwe tested. The left panel describes learning of distributions that are typical in the prior\n$P_\\beta(\\{q_i\\})$ and, therefore, are also likely in $P(\\{q_i\\}; \\beta)$. 
Thus we may expect a reasonable\nperformance, but the real results exceed all expectations: for all three cases, the actual\nrelative error drops to the 10% level at $N$ as low as 30 (recall that $K = 1000$, so we only\nhave $\\sim 0.03$ data points per bin on average)! To put this in perspective, simple estimates\nlike fixed $\\beta$ ones, MLE, and MLE corrected as in Eq. (10) with $K^*$ equal to the number of\nnonzero $n_i$\u2019s produce an error so big that it puts them off the axes until $N > 100$ (see footnote 5). Our\nresults have two more nice features: the estimator seems to know its error pretty well, and\nit is almost completely unbiased.\n\nOne might be puzzled at how it is possible to estimate anything in a 1000\u2013bin distribution\nwith just a few samples: the distribution is completely unspecified for low $N$! The point is\nthat we are not trying to learn the distribution \u2014 in the absence of additional prior\ninformation this would, indeed, take $N \\gg K$ \u2014 but to estimate just one of its characteristics. It is\nless surprising that one number can be learned well with only a handful of measurements.\nIn practice the algorithm builds its estimate based on the number of coinciding samples\n(multiple coincidences are likely only for small $\\beta$), as in Ma\u2019s approach to entropy\nestimation from simulations of physical systems [18].\n\nFootnote 5: More work is needed to compare our estimator to more complex techniques, like in Ref. [15, 16].\n\nFigure 4: Learning entropies with the prior Eq. (11) and $P(\\beta) = 1$. The actual relative\nerrors of the estimator are plotted; the error bars are the relative widths of the posteriors.\n(a) Distributions from Fig. 1. (b) Distributions atypical in the prior. Note that while $\\hat{S}$ may\nbe safely calculated as just $\\langle S \\rangle_{\\beta_{\\mathrm{cl}}}$, one has to do an honest integration over $\\beta$ to get $\\widehat{S^2}$ and\nthe error bars. Indeed, since $P_\\beta(S)$ is almost a $\\delta$-function, the uncertainty at any fixed $\\beta$ is\nvery small (see Fig. 3).\n[Panels (a): $\\beta = 0.0007$, $S = 1.05$ bits; $\\beta = 0.02$, $S = 5.16$ bits; $\\beta = 1.0$, $S = 9.35$ bits. Panels (b): $\\beta = 0.02$, $K = 2000$ (half empty), $S = 5.16$ bits; Zipf\u2019s law, $q_i \\sim 1/i$, $K = 1000$, $S = 7.49$ bits; $q_i \\sim 50 - 4 (\\ln i)^2$, $K = 1000$, $S = 4.68$ bits. Axes: $(\\hat{S} - S)/S$ versus $N$.]\n\nWhat will happen if the algorithm is fed with data from a distribution $\\{\\tilde{q}_i\\}$ that is strongly\natypical in $P(\\{q_i\\}; \\beta)$? Since there is no $\\{\\tilde{q}_i\\}$ in our prior, its estimate may suffer.\nNonetheless, for any $\\{\\tilde{q}_i\\}$, there is some $\\beta$ which produces distributions with the same mean entropy\nas $S[\\tilde{q}_i]$. Such $\\beta$ should be determined in the usual fight between the \u201cgoodness of fit\u201d and\nthe Occam factors, and the correct value of entropy will follow. However, there will be an\nimportant distinction from the \u201ccorrect prior\u201d cases. The value of $\\beta$ indexes available phase\nspace volumes, and thus the smoothness (complexity) of the model class [19]. In the case\nof discrete distributions, smoothness is the absence of high peaks. Thus data with faster\ndecaying Zipf plots (plots of bins\u2019 occupancy vs. occupancy rank $i$) are rougher. 
The priors\n$P_\\beta(\\{q_i\\})$ cannot account for all possible roughnesses. Indeed, they only generate\ndistributions for which the expected number of bins $\\nu$ with the probability mass less than some $q$\nis given by $\\nu(q) = K B(q; \\beta, \\kappa - \\beta)$, where $B$ is the familiar incomplete Beta function, as\nin Eq. (5). This means that the expected rank ordering for small and large ranks is\n\n$$q_i \\approx 1 - \\left[\\frac{\\beta B(\\beta, \\kappa - \\beta)(K - 1)\\, i}{K}\\right]^{1/(\\kappa - \\beta)}, \\qquad i \\ll K, \\tag{14}$$\n\n$$q_i \\approx \\left[\\frac{\\beta B(\\beta, \\kappa - \\beta)(K - i + 1)}{K}\\right]^{1/\\beta}, \\qquad K - i + 1 \\ll K. \\tag{15}$$\n\nIn an undersampled regime we can observe only the first of the behaviors. Therefore,\nany distribution with $q_i$ decaying faster (rougher) or slower (smoother) than Eq. (14) for\nsome $\\beta$ cannot be explained well with fixed $\\beta_{\\mathrm{cl}}$ for different $N$. So, unlike in the cases of\nlearning data that are typical in $P_\\beta(\\{q_i\\})$, we should expect to see $\\beta_{\\mathrm{cl}}$ growing (falling) for\nqualitatively smoother (rougher) cases as $N$ grows.\nFigure 4(b) and Table 1 illustrate these points. First, we study\nthe $\\beta = 0.02$ distribution from Fig. 1. However, we added\n1000 extra bins, each with $q_i = 0$. Our estimator performs\nremarkably well, and $\\beta_{\\mathrm{cl}}$ does not drift because the ranking\nlaw remains the same. Then we turn to the famous Zipf\u2019s\ndistribution, so common in Nature. It has $n_i \\propto 1/i$, which\nis qualitatively smoother than our prior allows. Correspondingly,\nwe get an upwards drift in $\\beta_{\\mathrm{cl}}$. Finally, we analyze\na \u201crough\u201d distribution, which has $q_i \\propto 50 - 4 (\\ln i)^2$, and\n$\\beta_{\\mathrm{cl}}$ drifts downwards. 
Clearly, one would want to predict\nthe dependence $\\beta_{\\mathrm{cl}}(N)$ analytically, but this requires\ncalculation of the predictive information (complexity) for the\ninvolved distributions [19] and is work for the future. Notice that the entropy estimator\nfor atypical cases is almost as good as for typical ones. A possible exception is the 100\u20131000\npoints for the Zipf distribution\u2014they are about two standard deviations off. We saw\nsimilar effects in some other \u201csmooth\u201d cases also. This may be another manifestation of\nan observation made in Ref. [4]: smooth priors can easily adapt to rough distributions, but\nthere is a limit to the smoothness beyond which rough priors become inaccurate.\n\nTable 1: $\\beta_{\\mathrm{cl}}$ for solutions shown on Fig. 4(b).\n\nN      | 1/2 full (units of $10^{-2}$) | Zipf (units of $10^{-1}$) | rough (units of $10^{-3}$)\n10     | 1.7 | 1907 | 16.8\n30     | 2.2 | 0.99 | 11.5\n100    | 2.4 | 0.86 | 12.9\n300    | 2.2 | 1.36 | 8.3\n1000   | 2.1 | 2.24 | 6.4\n3000   | 1.9 | 3.36 | 5.4\n10000  | 2.0 | 4.89 | 4.5\n\nTo summarize, an analysis of a priori entropy statistics in common power\u2013law Bayesian\nestimators revealed some very undesirable features. We are fortunate, however, that these\nminuses can be easily turned into pluses, and the resulting estimator of entropy is precise,\nknows its own error, and gives amazing results for a very large class of distributions.\n\nAcknowledgements\n\nWe thank Vijay Balasubramanian, Curtis Callan, Adrienne Fairhall, Tim Holy, Jonathan\nMiller, Vipul Periwal, Steve Strong, and Naftali Tishby for useful discussions. I. N. was\nsupported in part by NSF Grant No. PHY99-07949 to the Institute for Theoretical Physics.\n\nReferences\n\n[1] D. MacKay, Neural Comp. 4, 415\u2013448 (1992).\n[2] V. Balasubramanian, Neural Comp. 9, 349\u2013368 (1997).\n[3] W. Bialek, C. Callan, and S. Strong, Phys. Rev. Lett. 77, 4693\u20134697 (1996).\n[4] I. Nemenman and W. Bialek, Advances in Neural Inf. 
Processing Systems 13, 287\u2013293 (2001).\n[5] J. Skilling, in Maximum entropy and Bayesian methods, J. Skilling ed. (Kluwer Academic\nPubl., Amsterdam, 1989), pp. 45\u201352.\n[6] D. Wolpert and D. Wolf, Phys. Rev. E 52, 6841\u20136854 (1995).\n[7] I. Nemenman, Ph.D. Thesis, Princeton, (2000), ch. 3, http://arXiv.org/abs/physics/0009032.\n[8] P. de Laplace, marquis de, Essai philosophique sur les probabilit\u00e9s (Courcier, Paris, 1814),\ntrans. by F. Truscott and F. Emory, A philosophical essay on probabilities (Dover, New York,\n1951).\n[9] G. Hardy, Insurance Record (1889), reprinted in Trans. Fac. Actuaries 8 (1920).\n[10] G. Lidstone, Trans. Fac. Actuaries 8, 182\u2013192 (1920).\n[11] H. Jeffreys, Proc. Roy. Soc. (London) A 186, 453\u2013461 (1946).\n[12] R. Krichevskii and V. Trofimov, IEEE Trans. Inf. Thy. 27, 199\u2013207 (1981).\n[13] F. Willems, Y. Shtarkov, and T. Tjalkens, IEEE Trans. Inf. Thy. 41, 653\u2013664 (1995).\n[14] T. Schurmann and P. Grassberger, Chaos 6, 414\u2013427 (1996).\n[15] S. Strong, R. Koberle, R. de Ruyter van Steveninck, and W. Bialek, Phys. Rev. Lett. 80, 197\u2013200 (1998).\n[16] S. Panzeri and A. Treves, Network: Comput. in Neural Syst. 7, 87\u2013107 (1996).\n[17] K. Sj\u00f6lander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian, and D. Haussler,\nComputer Applications in the Biosciences (CABIOS) 12, 327\u2013345 (1996).\n[18] S. Ma, J. Stat. Phys. 26, 221 (1981).\n[19] W. Bialek, I. Nemenman, N. Tishby, Neural Comp. 13, 2409\u20132463 (2001).\n", "award": [], "sourceid": 1965, "authors": [{"given_name": "Ilya", "family_name": "Nemenman", "institution": null}, {"given_name": "F.", "family_name": "Shafee", "institution": null}, {"given_name": "William", "family_name": "Bialek", "institution": null}]}