{"title": "Finite-Sample Analysis of Fixed-k Nearest Neighbor Density Functional Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 1217, "page_last": 1225, "abstract": "We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires k \u2192 \u221e as the sample size n \u2192 \u221e) into the functional of interest, the estimators we consider fix k and perform a bias correction. This can be more efficient computationally, and, as we show, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.", "full_text": "Finite-Sample Analysis of Fixed-k Nearest Neighbor\n\nDensity Functional Estimators\n\nShashank Singh\n\nStatistics & Machine Learning Departments\n\nCarnegie Mellon University\nsss1@andrew.cmu.edu\n\nBarnab\u00e1s P\u00f3czos\n\nMachine Learning Departments\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nAbstract\n\nWe provide \ufb01nite-sample analysis of a general framework for using k-nearest neigh-\nbor statistics to estimate functionals of a nonparametric continuous probability\ndensity, including entropies and divergences. Rather than plugging a consistent\ndensity estimate (which requires k \u2192 \u221e as the sample size n \u2192 \u221e) into the\nfunctional of interest, the estimators we consider \ufb01x k and perform a bias cor-\nrection. This is more ef\ufb01cient computationally, and, as we show in certain cases,\nstatistically, leading to faster convergence rates. 
Our framework unifies several previous estimators, for most of which ours are the first finite-sample guarantees.

1 Introduction

Estimating entropies and divergences of probability distributions in a consistent manner is of importance in a number of problems in machine learning. Entropy estimators have applications in goodness-of-fit testing [13], parameter estimation in semi-parametric models [51], studying fractal random walks [3], and texture classification [14, 15]. Divergence estimators have been used to generalize machine learning algorithms for regression, classification, and clustering from inputs in R^D to sets and distributions [40, 33].

Divergences also include mutual informations as a special case; mutual information estimators have applications in feature selection [35], clustering [2], causality detection [16], optimal experimental design [26, 38], fMRI data analysis [7], prediction of protein structures [1], and boosting and facial expression recognition [41]. Both entropy estimators and mutual information estimators have been used for independent component and subspace analysis [23, 47, 37, 17], as well as for image registration [14, 15]. Further applications can be found in [25].

This paper considers the more general problem of estimating functionals of the form

    F(P) := E_{X∼P} [f(p(X))],    (1)

using n IID samples from P, where P is an unknown probability measure with smooth density function p and f is a known smooth function. We are interested in analyzing a class of nonparametric estimators based on k-nearest neighbor (k-NN) distance statistics.
Rather than plugging a consistent estimator of p into (1), which requires k → ∞ as n → ∞, these estimators derive a bias correction for the plug-in estimator with fixed k; hence, we refer to this type of estimator as a fixed-k estimator. Compared to plug-in estimators, fixed-k estimators are faster to compute. As we show, fixed-k estimators can also exhibit superior rates of convergence.

As shown in Table 1, several authors have derived the bias corrections necessary for fixed-k estimators of entropies and divergences, including, most famously, the Shannon entropy estimator of [20].¹ The estimators in Table 1 are known to be weakly consistent,² but, except for Shannon entropy, no finite-sample bounds are known. The main goal of this paper is to provide finite-sample analysis of these estimators, via a unified analysis of the estimator after bias correction. Specifically, we show conditions under which, for β-Hölder continuous (β ∈ (0, 2]) densities on D-dimensional space, the bias of fixed-k estimators decays as O(n^{−β/D}) and the variance decays as O(n^{−1}), giving a mean squared error of O(n^{−2β/D} + n^{−1}). Hence, the estimators converge at the parametric O(n^{−1}) rate when β ≥ D/2, and at the slower rate O(n^{−2β/D}) otherwise. A modification of the estimators would be necessary to leverage additional smoothness for β > 2, but we do not pursue this here. Along the way, we prove a finite-sample version of the useful fact [25] that (normalized) k-NN distances have an Erlang asymptotic distribution, which may be of independent interest.

We present our results for distributions P supported on the unit cube in R^D because this significantly simplifies the statements of our results, but, as we discuss in the supplement, our results generalize fairly naturally, for example to distributions supported on smooth compact manifolds. In this context, it is worth noting that our results scale with the intrinsic dimension of the manifold. As we discuss later, we believe deriving finite-sample rates for distributions with unbounded support may require a truncated modification of the estimators we study (as in [49]), but we do not pursue this here.

Functional Name    Functional Form        Bias Correction                                       Ref.
Shannon Entropy    E[log p(X)]            Additive constant: ψ(n) − ψ(k) + log(k/n)             [20][13]
Rényi-α Entropy    E[p^{α−1}(X)]          Multiplicative constant: Γ(k+1−α)/Γ(k)                [25, 24]
KL Divergence      E[log(p(X)/q(X))]      None*                                                 [50]
α-Divergence       E[(p(X)/q(X))^{α−1}]   Multiplicative constant: Γ²(k)/(Γ(k−α+1)Γ(k+α−1))     [39]

Table 1: Functionals with known bias-corrected k-NN estimators, their bias corrections, and references. All expectations are over X ∼ P. Γ(t) = ∫₀^∞ x^{t−1} e^{−x} dx is the gamma function, and ψ(x) = (d/dx) log Γ(x) is the digamma function. α ∈ R\{1} is a free parameter. *For KL divergence, the bias corrections for p and q cancel.

¹ MATLAB code for these estimators is in the ITE toolbox https://bitbucket.org/szzoli/ite/ [48].
² Several of these proofs contain errors regarding the use of integral convergence theorems when their conditions do not hold, as described in [39].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Problem statement and notation

Let X := [0, 1]^D denote the unit cube in R^D, and let µ denote the Lebesgue measure.
Suppose P is an unknown µ-absolutely continuous Borel probability measure supported on X, and let p : X → [0, ∞) denote the density of P. Consider a (known) differentiable function f : (0, ∞) → R. Given n samples X_1, ..., X_n drawn IID from P, we are interested in estimating the functional

    F(P) := E_{X∼P} [f(p(X))].

Somewhat more generally (as in divergence estimation), we may have a function f : (0, ∞)² → R of two variables and a second unknown probability measure Q, with density q and n IID samples Y_1, ..., Y_n. Then, we are interested in estimating

    F(P, Q) := E_{X∼P} [f(p(X), q(X))].

Fix r ∈ [1, ∞] and a positive integer k. We will work with distances induced by the r-norm

    ‖x‖_r := ( Σ_{i=1}^D |x_i|^r )^{1/r}

and define

    c_{D,r} := (2Γ(1 + 1/r))^D / Γ(1 + D/r) = µ(B(0, 1)),

where B(x, ε) := {y ∈ R^D : ‖x − y‖_r < ε} denotes the open radius-ε ball centered at x. Our estimators use k-nearest neighbor (k-NN) distances:

Definition 1. (k-NN distance): Given n IID samples X_1, ..., X_n from P, for x ∈ R^D, we define the k-NN distance ε_k(x) by ε_k(x) = ‖x − X_i‖_r, where X_i is the kth-nearest element (in ‖·‖_r) of the set {X_1, ..., X_n} to x. For divergence estimation, given n samples Y_1, ..., Y_n from Q, we similarly define δ_k(x) by δ_k(x) = ‖x − Y_i‖_r, where Y_i is the kth-nearest element of {Y_1, ..., Y_n} to x.

µ-absolute continuity of P precludes the existence of atoms (i.e., ∀x ∈ R^D, P({x}) = µ({x}) = 0). Hence, each ε_k(x) > 0 a.s.
We will require this to study quantities such as log ε_k(x) and 1/ε_k(x).

3 Estimator

3.1 k-NN density estimation and plug-in functional estimators

The k-NN density estimator

    p̂_k(x) = (k/n) / µ(B(x, ε_k(x))) = (k/n) / (c_D ε_k^D(x))

is a well-studied nonparametric density estimator [28], motivated by noting that, for small ε > 0,

    p(x) ≈ P(B(x, ε)) / µ(B(x, ε)),

and that P(B(x, ε_k(x))) ≈ k/n. One can show that, for x ∈ R^D at which p is continuous, if k → ∞ and k/n → 0 as n → ∞, then p̂_k(x) → p(x) in probability ([28], Theorem 3.1). Thus, a natural approach for estimating F(P) is the plug-in estimator

    F̂_PI := (1/n) Σ_{i=1}^n f(p̂_k(X_i)).    (2)

Since p̂_k → p in probability pointwise as k, n → ∞ and f is smooth, one can show that F̂_PI is consistent, and in fact derive finite-sample convergence rates (depending on how k → ∞). For example, [44] show a convergence rate of O(n^{−min{2β/(β+D), 1}}) for β-Hölder continuous densities (after sample splitting and boundary correction) by setting k ≍ n^{β/(β+D)}.

Unfortunately, while necessary to ensure V[p̂_k(x)] → 0, the requirement k → ∞ is computationally burdensome. Furthermore, increasing k can increase the bias of p̂_k due to over-smoothing (see (5) below), suggesting that this may be sub-optimal for estimating F(P). Indeed, similar work based on kernel density estimation [42] suggests that, for plug-in functional estimators, under-smoothing may be preferable, since the empirical mean results in additional smoothing.

3.2 Fixed-k functional estimators

An alternative approach is to fix k as n → ∞.
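Both the plug-in estimator F̂_PI above and the fixed-k estimators developed next are built on the same k-NN density estimate p̂_k. As a concrete illustration, here is a minimal Python sketch (our own, not code from the paper; the helper name `knn_density` and the use of SciPy's k-d tree are our choices), specialized to Euclidean (r = 2) distance:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma


def knn_density(x_query, samples, k):
    """k-NN density estimate p_hat_k(x) = (k/n) / (c_D * eps_k(x)^D),
    using Euclidean (r = 2) balls, so c_D = pi^(D/2) / Gamma(D/2 + 1)."""
    n, D = samples.shape
    c_D = np.pi ** (D / 2) / gamma(D / 2 + 1)  # volume of the unit ball
    tree = cKDTree(samples)
    # Distance from each query point to its kth-nearest sample; reshape
    # handles the squeezed 1-D output that query() returns when k = 1.
    dists = np.reshape(tree.query(x_query, k=k)[0], (len(x_query), -1))
    eps_k = dists[:, -1]
    return (k / n) / (c_D * eps_k ** D)
```

To evaluate p̂_k at the sample points themselves, as in (2), one would query with k + 1 neighbors and take the last distance, since each point is its own nearest neighbor at distance zero.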
Since F̂_PI is itself an empirical mean, unlike V[p̂_k(x)], V[F̂_PI] → 0 as n → ∞. The more critical complication of fixing k is bias. Since f is typically non-linear, the non-vanishing variance of p̂_k translates into asymptotic bias. A solution adopted by several papers is to derive a bias correction function B (depending only on known factors) such that

    E_{X_1,...,X_n} [ B( f( (k/n) / µ(B(x, ε_k(x))) ) ) ] = E_{X_1,...,X_n} [ f( P(B(x, ε_k(x))) / µ(B(x, ε_k(x))) ) ].    (3)

For continuous p, the quantity

    p_{ε_k(x)}(x) := P(B(x, ε_k(x))) / µ(B(x, ε_k(x)))    (4)

is a consistent estimate of p(x) with k fixed, but it is not computable, since P is unknown. The bias correction B gives us an asymptotically unbiased estimator

    F̂_B(P) := (1/n) Σ_{i=1}^n B(f(p̂_k(X_i))) = (1/n) Σ_{i=1}^n B( f( (k/n) / µ(B(X_i, ε_k(X_i))) ) )

that uses k/n in place of P(B(x, ε_k(x))). This estimate extends naturally to divergences:

    F̂_B(P, Q) := (1/n) Σ_{i=1}^n B(f(p̂_k(X_i), q̂_k(X_i))).

As an example, if f = log (as in Shannon entropy), then it can be shown that, for any continuous p,

    E [log P(B(x, ε_k(x)))] = ψ(k) − ψ(n).

Hence, for B_{n,k} := ψ(k) − ψ(n) + log(n) − log(k),

    E_{X_1,...,X_n} [ f( (k/n) / µ(B(x, ε_k(x))) ) ] + B_{n,k} = E_{X_1,...,X_n} [ f( P(B(x, ε_k(x))) / µ(B(x, ε_k(x))) ) ],

giving the estimator of [20].
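For f = log, combining log p̂_k with the additive correction B_{n,k} above gives the classical Kozachenko-Leonenko fixed-k estimator of Shannon entropy. The following is a minimal Python sketch (ours, not the authors' code; it assumes NumPy/SciPy and Euclidean distance, and the function name is our choice). Substituting p̂_k into f = log and adding B_{n,k} simplifies to the closed form in the last line:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma


def kl_entropy(samples, k=3):
    """Kozachenko-Leonenko fixed-k estimate of Shannon entropy H(p).

    Applies the additive correction B_{n,k} = psi(k) - psi(n) + log(n/k)
    to the plug-in estimate (1/n) sum_i log p_hat_k(X_i), then negates,
    since F(P) = E[log p(X)] is the negative entropy."""
    n, D = samples.shape
    c_D = np.pi ** (D / 2) / gamma(D / 2 + 1)  # volume of Euclidean unit ball
    tree = cKDTree(samples)
    # k + 1 because each sample is its own nearest neighbor at distance 0
    eps_k = tree.query(samples, k=k + 1)[0][:, -1]
    # H = psi(n) - psi(k) + log(c_D) + (D/n) * sum_i log eps_k(X_i)
    return digamma(n) - digamma(k) + np.log(c_D) + (D / n) * np.log(eps_k).sum()
```

Note that this sketch returns the entropy Ĥ = −F̂_B(P); for a standard normal sample it should be close to ½ log(2πe) ≈ 1.419.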
Other examples of functionals for which the bias correction is known are given in Table 1. In general, deriving an appropriate bias correction can be quite a difficult problem specific to the functional of interest, and it is not our goal presently to study this problem; rather, we are interested in bounding the error of F̂_B(P), assuming the bias correction is known. Hence, our results apply to all of the estimators in Table 1, as well as any estimators of this form that may be derived in the future.

4 Related work

4.1 Estimating information theoretic functionals

Recently, there has been much work on analyzing estimators for entropy, mutual information, divergences, and other functionals of densities. Besides bias-corrected fixed-k estimators, most of this work has taken one of three approaches. One series of papers [27, 42, 43] studied a boundary-corrected plug-in approach based on under-smoothed kernel density estimation. This approach has strong finite-sample guarantees, but requires prior knowledge of the support of the density, and can have a slow rate of convergence. A second approach [18, 22] uses von Mises expansion to partially correct the bias of optimally smoothed density estimates. This is statistically more efficient, but can require computationally demanding numerical integration over the support of the density. A final line of work [30, 31, 44, 46] studied plug-in estimators based on consistent, boundary-corrected k-NN density estimates (i.e., with k → ∞ as n → ∞). [32] study a divergence estimator based on convex risk minimization, but this relies on the context of an RKHS, making the results difficult to compare.

Rates of Convergence: For densities over R^D satisfying a Hölder smoothness condition parametrized by β ∈ (0, ∞), the minimax mean squared error rate for estimating functionals of the form ∫ f(p(x)) dx has been known since [6] to be O(n^{−min{8β/(4β+D), 1}}). [22] recently derived identical minimax rates for divergence estimation.

Most of the above estimators have been shown to converge at the rate O(n^{−min{2β/(β+D), 1}}). Only the von Mises approach [22] is known to achieve the minimax rate for general β and D, but due to its computational demand (O(2^D n³)),³ the authors suggest using other statistically less efficient estimators for moderate sample size. Here, we show that, for β ∈ (0, 2], bias-corrected fixed-k estimators converge at the relatively fast rate O(n^{−min{2β/D, 1}}). For β > 2, modifications are needed for the estimator to leverage the additional smoothness of the density. Notably, this rate is adaptive; that is, it does not require selecting a smoothing parameter depending on the unknown β; our results (Theorem 5) imply the above rate is achieved for any fixed choice of k. On the other hand, since no empirical error metric is available for cross-validation, parameter selection is an obstacle for competing estimators.

4.2 Prior analysis of fixed-k estimators

As of writing this paper, the only finite-sample results for F̂_B(P) were those of [5] for the Kozachenko-Leonenko (KL)⁴ Shannon entropy estimator [20]. Theorem 7.1 of [5] shows that, if the density p has compact support, then the variance of the KL estimator decays as O(n^{−1}). They also claim (Theorem 7.2) to bound the bias of the KL estimator by O(n^{−β}), under the assumptions that p is β-Hölder continuous (β ∈ (0, 1]), bounded away from 0, and supported on the interval [0, 1].
However, in their proof, [5] neglect to bound the additional bias incurred near the boundaries of [0, 1], where the density cannot simultaneously be bounded away from 0 and continuous. In fact, because the KL estimator does not attempt to correct for boundary bias, it is not clear that the bias should decay as O(n^{−β}) under these conditions; we require additional conditions at the boundary of X.

³ Fixed-k estimators can be computed in O(Dn²) time, or O(2^D n log n) using k-d trees for small D.
⁴ Not to be confused with Kullback-Leibler (KL) divergence, for which we also analyze an estimator.

[49] studied a closely related entropy estimator for which they prove √n-consistency. Their estimator is identical to the KL estimator, except that it truncates k-NN distances at √n, replacing ε_k(x) with min{ε_k(x), √n}. This sort of truncation may be necessary for certain fixed-k estimators to satisfy finite-sample bounds for densities of unbounded support, though consistency can be shown regardless.

Finally, two very recent papers [12, 4] have analyzed the KL estimator. In this case, [12] generalize the results of [5] to D > 1, and [4] weaken the regularity and boundary assumptions required by our bias bound, while deriving the same rate of convergence. Moreover, they show that, if k increases with n at the rate k ≍ log⁵ n, the KL estimator is asymptotically efficient (i.e., asymptotically normal, with optimal asymptotic variance). As explained in Section 8, together with our results this elucidates the role of k in the KL estimator: fixing k optimizes the convergence rate of the estimator, but increasing k slowly can further improve error by constant factors.

5 Discussion of assumptions

The lack of finite-sample results for fixed-k estimators is due to several technical challenges.
Here, we discuss some of these challenges, motivating the assumptions we make to overcome them.

First, these estimators are sensitive to regions of low probability (i.e., p(x) small), for two reasons:

1. Many functions f of interest (e.g., f = log or f(z) = z^α, α < 0) have singularities at 0.
2. The k-NN estimate p̂_k(x) of p(x) is highly biased when p(x) is small. For example, for p β-Hölder continuous (β ∈ (0, 2]), one has ([29], Theorem 2)

    Bias(p̂_k(x)) ≍ ( k/(np(x)) )^{β/D}.    (5)

For these reasons, it is common in the analysis of k-NN estimators to assume the following [5, 39]:

(A1) p is bounded away from zero on its support. That is, p_* := inf_{x∈X} p(x) > 0.

Second, unlike many functional estimators (see, e.g., [34, 45, 42]), the fixed-k estimators we consider do not attempt to correct for boundary bias (i.e., bias incurred due to discontinuity of p on the boundary ∂X of X).⁵ The boundary bias of the density estimate p̂_k(x) does vanish at x in the interior X° of X as n → ∞, but additional assumptions are needed to obtain finite-sample rates. Either of the following assumptions would suffice:

(A2) p is continuous not only on X° but also on ∂X (i.e., p(x) → 0 as dist(x, ∂X) → 0).
(A3) p is supported on all of R^D. That is, the support of p has no boundary. This is the approach of [49], but we reiterate that, to handle an unbounded domain, they require truncating ε_k(x).

Unfortunately, both assumptions (A2) and (A3) are inconsistent with (A1).
Our approach is to assume (A2) and replace assumption (A1) with a much milder assumption that p is locally lower bounded on its support in the following sense:

(A4) There exist ρ > 0 and a function p_* : X → (0, ∞) such that, for all x ∈ X and r ∈ (0, ρ],

    p_*(x) ≤ P(B(x, r)) / µ(B(x, r)).

We show in Lemma 2 that assumption (A4) is in fact very mild; in a metric measure space of positive dimension D, as long as p is continuous on X, such a p_* exists for any desired ρ > 0. For simplicity, we will use ρ = √D = diam(X).

As hinted by (5) and the fact that F(P) is an expectation, our bounds will contain terms of the form

    E_{X∼P} [ 1/(p_*(X))^{β/D} ] = ∫_X p(x)/(p_*(x))^{β/D} dµ(x)

(with an additional f′(p_*(x)) factor if f has a singularity at zero). Hence, our key assumption is that these quantities are finite. This depends primarily on how quickly p approaches zero near ∂X. For many functionals, Lemma 6 gives a simple sufficient condition.

⁵ This complication was omitted in the bias bound (Theorem 7.2) of [5] for entropy estimation.

6 Preliminary lemmas

Here, we present some lemmas, both as a means of summarizing our proof techniques and also because they may be of independent interest for proving finite-sample bounds for other k-NN methods. Due to space constraints, all proofs are given in the appendix. Our first lemma states that, if p is continuous, then it is locally lower bounded as described in the previous section.

Lemma 2.
(Existence of Local Bounds) If p is continuous on X and strictly positive on the interior X° of X, then, for ρ := √D = diam(X), there exists a continuous function p_* : X° → (0, ∞) and a constant p^* ∈ (0, ∞) such that

    0 < p_*(x) ≤ P(B(x, r)) / µ(B(x, r)) ≤ p^* < ∞,    ∀x ∈ X, r ∈ (0, ρ].

We now use these local lower and upper bounds to prove that k-NN distances concentrate around a term of order (k/(np(x)))^{1/D}. Related lemmas, also based on multiplicative Chernoff bounds, are used by [21, 9] and [8, 19] to prove finite-sample bounds on k-NN methods for cluster tree pruning and classification, respectively. For cluster tree pruning, the relevant inequalities bound the error of the k-NN density estimate, and, for classification, they lower bound the probability of nearby samples of the same class. Unlike in cluster tree pruning, we are not using a consistent density estimate, and, unlike in classification, our estimator is a function of the k-NN distances themselves (rather than their ordering). Thus, our statement is somewhat different, bounding the k-NN distances themselves:

Lemma 3. (Concentration of k-NN Distances) Suppose p is continuous on X and strictly positive on X°. Let p_* and p^* be as in Lemma 2. Then, for any x ∈ X°,

1. if r > ( k/(p_*(x)n) )^{1/D}, then P[ε_k(x) > r] ≤ e^{−p_*(x) r^D n} ( e p_*(x) r^D n / k )^k,

2. if r ∈ ( 0, ( k/(p^* n) )^{1/D} ), then P[ε_k(x) < r] ≤ e^{−p_*(x) r^D n} ( e p^* r^D n / k )^{k p_*(x)/p^*}.

It is worth noting an asymmetry in the above bounds: counter-intuitively, the lower bound depends on p^*.
This asymmetry is related to the large bias of k-NN density estimators when p is small (as in (5)). The next lemma uses Lemma 3 to bound expectations of monotone functions of the ratio p̂_k/p_*. As suggested by the form of integrals (6) and (7), this is essentially a finite-sample statement of the fact that (appropriately normalized) k-NN distances have Erlang asymptotic distributions; this asymptotic statement is key to the consistency proofs of [25] and [39] for α-entropy and divergence estimators.

Lemma 4. Let p be continuous on X and strictly positive on X°. Define p_* and p^* as in Lemma 2. Suppose f : (0, ∞) → R is continuously differentiable and f′ > 0. Then, we have the upper bound⁶

    sup_{x∈X°} E [ f₊( p_*(x)/p̂_k(x) ) ] ≤ f₊(1) + e√k ∫₀^∞ (e^{−y} y^k / Γ(k + 1)) f₊(y/k) dy,    (6)

and, for all x ∈ X°, with κ(x) := k p_*(x)/p^*, the lower bound

    E [ f₋( p_*(x)/p̂_k(x) ) ] ≤ f₋(1) + e√κ(x) ∫₀^{κ(x)} (e^{−y} y^{κ(x)} / Γ(κ(x) + 1)) f₋(y/k) dy.    (7)

Note that plugging the function z ↦ f( (k/(c_{D,r} n z))^{1/D} ) into Lemma 4 gives bounds on E[f(ε_k(x))]. As one might guess from Lemma 3 and the assumption that f is smooth, this bound is roughly of the order ( k/(c_{D,r} n p_*(x)) )^{1/D}. For example, for any α > 0, a simple calculation from (6) gives

    E [ε_k^α(x)] ≤ ( 1 + α/D ) ( k/(c_{D,r} n p_*(x)) )^{α/D}.    (8)

(8) is used for our bias bound, and more direct applications of Lemma 4 are used in the variance bound.

⁶ f₊(x) = max{0, f(x)} and f₋(x) = −min{0, f(x)} denote the positive and negative parts of f. Recall that E[f(X)] = E[f₊(X)] − E[f₋(X)].

7 Main results

Here, we present our main results on the bias and variance of F̂_B(P). Again, due to space constraints, all proofs are given in the appendix. We begin with bounding the bias:

Theorem 5. (Bias Bound) Suppose that, for some β ∈ (0, 2], p is β-Hölder continuous with constant L > 0 on X, and p is strictly positive on X°. Let p_* and p^* be as in Lemma 2. Let f : (0, ∞) → R be differentiable, and define M_{f,p} : X → [0, ∞) by

    M_{f,p}(x) := sup_{z ∈ [p_*(x), p^*]} | (d/dz) f(z) |.

Assume

    C_f := E_{X∼p} [ M_{f,p}(X) / (p_*(X))^{β/D} ] < ∞.

Then,

    | E[F̂_B(P)] − F(P) | ≤ C_f L (k/n)^{β/D}.

The statement for divergences is similar, assuming that q is also β-Hölder continuous with constant L and strictly positive on X°.
Specifically, we get the same bound if we replace M_{f,p} with

    M_{f,p}(x) := sup_{(w,z) ∈ [p_*(x), p^*] × [q_*(x), q^*]} | (∂/∂w) f(w, z) |,

and define M_{f,q} similarly (i.e., with ∂/∂z), and we assume that

    C_f := E_{X∼p} [ M_{f,p}(X) / (p_*(X))^{β/D} ] + E_{X∼p} [ M_{f,q}(X) / (q_*(X))^{β/D} ] < ∞.

As an example of the applicability of Theorem 5, consider estimating the Shannon entropy. Then, f(z) = log(z), so M_{f,p}(x) = 1/p_*(x), and we need C_f = ∫_X p(x) (p_*(x))^{−1−β/D} dµ(x) < ∞.

The assumption C_f < ∞ is not immediately transparent. For the functionals in Table 1, C_f has the form ∫_X p(x) (p_*(x))^{−c} dµ(x), for some c > 0, and hence C_f < ∞ intuitively means p(x) cannot approach zero too quickly as dist(x, ∂X) → 0. The following lemma gives a formal sufficient condition:

Lemma 6. (Boundary Condition) Let c > 0. Suppose there exist b_∂ ∈ (0, 1/c) and c_∂, ρ_∂ > 0 such that, for all x ∈ X with ε(x) := dist(x, ∂X) < ρ_∂, p(x) ≥ c_∂ ε^{b_∂}(x). Then, ∫_X p(x) (p_*(x))^{−c} dµ(x) < ∞.

In the supplement, we give examples showing that this condition is fairly general, satisfied by densities proportional to x^{b_∂} near ∂X (i.e., those with at least b_∂ nonzero one-sided derivatives on the boundary).

We now bound the variance. The main obstacle here is that the fixed-k estimator is an empirical mean of dependent terms (functions of k-NN distances). We generalize the approach used by [5] to bound the variance of the KL estimator of Shannon entropy. The key insight is the geometric fact that, in (R^D, ‖·‖_p), there exists a constant N_{k,D} (independent of n) such that any sample X_i can be amongst the k-nearest neighbors of at most N_{k,D} other samples. Hence, at most N_{k,D} + 1 of the terms in (2) can change when a single X_i is added, suggesting a variance bound via the Efron-Stein inequality [10], which bounds the variance of a function of random variables in terms of its expected change when its arguments are resampled. [11] originally used this approach to prove a general Law of Large Numbers (LLN) for nearest-neighbor statistics. Unfortunately, this LLN relies on bounded kurtosis assumptions that are difficult to justify for the log or negative power statistics we study.

Theorem 7. (Variance Bound) Suppose B ∘ f is continuously differentiable and strictly monotone. Assume C_{f,p} := E_{X∼P} [B²(f(p_*(X)))] < ∞ and C_f := ∫₀^∞ e^{−y} y^k f(y) dy < ∞. Then, for

    C_V := 2 (1 + N_{k,D}) (3 + 4k) (C_{f,p} + C_f),

we have V[F̂_B(P)] ≤ C_V / n.

As an example, if f = log (as in Shannon entropy), then, since B is an additive constant, we simply require ∫_X p(x) log²(p_*(x)) dµ(x) < ∞. In general, N_{k,D} is of the order k2^{cD}, for some c > 0. Our bound is likely quite loose in k; in practice, V[F̂_B(P)] typically decreases somewhat with k.

8 Conclusions and discussion

In this paper, we gave finite-sample bias and variance error bounds for a class of fixed-k estimators of functionals of probability density functions, including the entropy and divergence estimators in Table 1. The bias and variance bounds in turn imply a bound on the mean squared error (MSE) of the bias-corrected estimator via the usual decomposition into squared bias and variance:

Corollary 8.
(MSE Bound) Under the conditions of Theorems 5 and 7,

    E [ ( Ĥ_k(X) − H(X) )² ] ≤ C_f² L² (k/n)^{2β/D} + C_V / n.    (9)

Choosing k: Contrary to the name, fixing k is not required for “fixed-k” estimators. [36] empirically studied the effect of changing k with n and found that fixing k = 1 gave the best results for estimating F(P). However, there has been no theoretical justification for fixing k. Assuming tightness of our bias bound in k, we provide this in a worst-case sense: since our bias bound is nondecreasing in k and our variance bound is no larger than the minimax MSE rate for these estimation problems, reducing variance (i.e., increasing k) does not improve the (worst-case) convergence rate. On the other hand, [4] recently showed that slowly increasing k can improve the asymptotic variance of the estimator, with the rate k ≍ log⁵ n leading to asymptotic efficiency. In view of these results, we suggest that increasing k can improve error by constant factors, but cannot improve the convergence rate.

Finally, we note that [36] found that increasing k quickly (e.g., k = n/2) was best for certain hypothesis tests based on these estimators. Intuitively, this is because, in testing problems, bias is less problematic than variance (e.g., an asymptotically biased estimator can still lead to a consistent test).

Acknowledgments

This material is based upon work supported by a National Science Foundation Graduate Research Fellowship to the first author under Grant No. DGE-1252522.

References

[1] C. Adami. Information theory in molecular biology. Physics of Life Reviews, 1:3-22, 2004.

[2] M. Aghagolzadeh, H. Soltanian-Zadeh, B. Araabi, and A. Aghagolzadeh. A hierarchical clustering based on mutual information maximization. In Proc. of IEEE International Conf. on Image Processing, 2007.

[3] P. A. Alemany and D. H.
Zanette. Fractal random walks from a variational formalism for Tsallis entropies. Phys. Rev. E, 49(2):R956–R958, Feb 1994. doi: 10.1103/PhysRevE.49.R956.
[4] Thomas B. Berrett, Richard J. Samworth, and Ming Yuan. Efficient multivariate entropy estimation via k-nearest neighbour distances. arXiv preprint arXiv:1606.00304, 2016.
[5] Gérard Biau and Luc Devroye. Entropy estimation. In Lectures on the Nearest Neighbor Method, pages 75–91. Springer, 2015.
[6] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics, 23:11–29, 1995.
[7] B. Chai, D. B. Walther, D. M. Beck, and L. Fei-Fei. Exploring functional connectivity of the human brain using multivariate information analysis. In NIPS, 2009.
[8] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
[9] Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Trans. on Information Theory, 60(12):7900–7912, 2014.
[10] Bradley Efron and Charles Stein. The jackknife estimate of variance. Ann. of Stat., pages 586–596, 1981.
[11] D. Evans. A law of large numbers for nearest neighbor statistics. In Proceedings of the Royal Society, volume 464, pages 3175–3192, 2008.
[12] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators. arXiv preprint arXiv:1604.03006, 2016.
[13] M. N. Goria, N. N. Leonenko, V. V. Mergel, and P. L. Novi Inverardi. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. J. Nonparametric Stat., 17:277–297, 2005.
[14] A. O. Hero, B. Ma, O. Michel, and J. Gorman. Alpha-divergence for classification, indexing and retrieval, 2002.
Communications and Signal Processing Laboratory Technical Report CSPL-328.
[15] A. O. Hero, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95, 2002.
[16] K. Hlaváčková-Schindler, M. Paluš, M. Vejmelka, and J. Bhattacharya. Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441:1–46, 2007.
[17] M. M. Van Hulle. Constrained subspace ICA based on mutual information optimization directly. Neural Computation, 20:964–973, 2008.
[18] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, et al. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, pages 397–405, 2015.
[19] Aryeh Kontorovich and Roi Weiss. A Bayes consistent 1-NN classifier. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 480–488, 2015.
[20] L. F. Kozachenko and N. N. Leonenko. A statistical estimate for the entropy of a random vector. Problems of Information Transmission, 23:9–16, 1987.
[21] Samory Kpotufe and Ulrike V. Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225–232, 2011.
[22] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. Wasserman. Nonparametric estimation of Rényi divergence and friends. In International Conference on Machine Learning (ICML), 2014.
[23] E. G. Learned-Miller and J. W. Fisher. ICA using spacings estimates of entropy. J. Machine Learning Research, 4:1271–1295, 2003.
[24] N. Leonenko and L. Pronzato. Correction of 'A class of Rényi information estimators for multidimensional densities', Ann. Statist., 36 (2008) 2153–2182, 2010.
[25] N. Leonenko, L. Pronzato, and V. Savani.
A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153–2182, 2008.
[26] J. Lewi, R. Butera, and L. Paninski. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In Advances in Neural Information Processing Systems, volume 19, 2007.
[27] H. Liu, J. Lafferty, and L. Wasserman. Exponential concentration inequality for mutual information estimation. In Neural Information Processing Systems (NIPS), 2012.
[28] D. O. Loftsgaarden and C. P. Quesenberry. A nonparametric estimate of a multivariate density function. Ann. Math. Statist., 36:1049–1051, 1965.
[29] Y. P. Mack and M. Rosenblatt. Multivariate k-nearest neighbor density estimates. J. Multivar. Analysis, 1979.
[30] Kevin Moon and Alfred Hero. Multivariate f-divergence estimation with confidence. In Advances in Neural Information Processing Systems, pages 2420–2428, 2014.
[31] Kevin R. Moon and Alfred O. Hero. Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356–360. IEEE, 2014.
[32] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, to appear, 2010.
[33] J. Oliva, B. Poczos, and J. Schneider. Distribution to distribution regression. In International Conference on Machine Learning (ICML), 2013.
[34] D. Pál, B. Póczos, and Cs. Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proceedings of the Neural Information Processing Systems, 2010.
[35] H. Peng and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27, 2005.
[36] F. Pérez-Cruz.
Estimation of information theoretic measures for continuous random variables. In Advances in Neural Information Processing Systems 21, 2008.
[37] B. Póczos and A. Lőrincz. Independent subspace analysis using geodesic spanning trees. In ICML, 2005.
[38] B. Póczos and A. Lőrincz. Identification of recurrent neural networks by Bayesian interrogation techniques. J. Machine Learning Research, 10:515–554, 2009.
[39] B. Poczos and J. Schneider. On the estimation of alpha-divergences. In International Conference on AI and Statistics (AISTATS), volume 15 of JMLR Workshop and Conference Proceedings, pages 609–617, 2011.
[40] B. Poczos, L. Xiong, D. Sutherland, and J. Schneider. Nonparametric kernel estimators for image classification. In 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[41] C. Shan, S. Gong, and P. W. McOwan. Conditional mutual information based boosting for facial expression recognition. In British Machine Vision Conference (BMVC), 2005.
[42] S. Singh and B. Poczos. Exponential concentration of a density functional estimator. In Neural Information Processing Systems (NIPS), 2014.
[43] S. Singh and B. Poczos. Generalized exponential concentration inequality for Rényi divergence estimation. In International Conference on Machine Learning (ICML), 2014.
[44] Kumar Sricharan, Raviv Raich, and Alfred O. Hero. k-nearest neighbor estimation of entropies with confidence. In IEEE International Symposium on Information Theory, pages 1205–1209. IEEE, 2011.
[45] Kumar Sricharan, Raviv Raich, and Alfred O. Hero III. Estimation of nonlinear functionals of densities with confidence. Information Theory, IEEE Transactions on, 58(7):4135–4159, 2012.
[46] Kumar Sricharan, Dennis Wei, and Alfred O. Hero.
Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374–4388, 2013.
[47] Z. Szabó, B. Póczos, and A. Lőrincz. Undercomplete blind subspace deconvolution. J. Machine Learning Research, 8:1063–1095, 2007.
[48] Zoltán Szabó. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283–287, 2014. (https://bitbucket.org/szzoli/ite/).
[49] A. B. Tsybakov and E. C. van der Meulen. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian J. Statistics, 23:75–83, 1996.
[50] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5), 2009.
[51] E. Wolsztynski, E. Thierry, and L. Pronzato. Minimum-entropy estimation in semi-parametric models. Signal Process., 85(5):937–949, 2005. ISSN 0165-1684.
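For readers who want to experiment with the class of fixed-k estimators discussed above, the following is a minimal sketch (not the authors' code) of the classic Kozachenko–Leonenko entropy estimator [20], the prototypical member of this class: k stays fixed and the digamma terms ψ(n) − ψ(k) supply the bias correction. It assumes SciPy's cKDTree and digamma/gammaln; the function name kl_entropy is our own.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln


def kl_entropy(X, k=1):
    """Fixed-k Kozachenko-Leonenko estimate of differential entropy, in nats.

    H_hat = psi(n) - psi(k) + log(c_D) + (D/n) * sum_i log eps_k(i),
    where eps_k(i) is the distance from X_i to its k-th nearest neighbor
    among the other samples and c_D is the volume of the unit ball in R^D.
    """
    n, D = X.shape
    tree = cKDTree(X)
    # Query k+1 neighbors: each point is its own 0th nearest neighbor.
    eps = tree.query(X, k=k + 1)[0][:, k]
    # log volume of the unit D-ball: log(pi^(D/2) / Gamma(D/2 + 1))
    log_c_D = (D / 2) * np.log(np.pi) - gammaln(D / 2 + 1)
    return digamma(n) - digamma(k) + log_c_D + (D / n) * np.sum(np.log(eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 1))
    # True entropy of N(0,1) is 0.5 * log(2*pi*e), about 1.419 nats.
    print(kl_entropy(X, k=1))
```

Note that even with k = 1 the digamma correction removes the asymptotic bias of the plug-in log-density term, in line with the point above that k → ∞ is not needed for consistency.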