{"title": "Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations", "book": "Advances in Neural Information Processing Systems", "page_first": 397, "page_last": 405, "abstract": "We propose and analyse estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are derived from the von Mises expansion and are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties. We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.", "full_text": "Nonparametric von Mises Estimators for Entropies, Divergences and Mutual Informations\n\nKirthevasan Kandasamy\nCarnegie Mellon University\nkandasamy@cs.cmu.edu\n\nAkshay Krishnamurthy\nMicrosoft Research, NY\nakshaykr@cs.cmu.edu\n\nBarnabás Póczos, Larry Wasserman\nCarnegie Mellon University\nbapoczos@cs.cmu.edu, larry@stat.cmu.edu\n\nJames M. Robins\nHarvard University\nrobins@hsph.harvard.edu\n\nAbstract\n\nWe propose and analyse estimators for statistical functionals of one or more distributions under nonparametric assumptions. Our estimators are derived from the von Mises expansion and are based on the theory of influence functions, which appear in the semiparametric statistics literature. We show that estimators based either on data-splitting or a leave-one-out technique enjoy fast rates of convergence and other favorable theoretical properties.
We apply this framework to derive estimators for several popular information theoretic quantities, and via empirical evaluation, show the advantage of this approach over existing estimators.\n\n1 Introduction\n\nEntropies, divergences, and mutual informations are classical information-theoretic quantities that play fundamental roles in statistics, machine learning, and across the mathematical sciences. In addition to their use as analytical tools, they arise in a variety of applications including hypothesis testing, parameter estimation, feature selection, and optimal experimental design. In many of these applications, it is important to estimate these functionals from data so that they can be used in downstream algorithmic or scientific tasks. In this paper, we develop a recipe for estimating statistical functionals of one or more nonparametric distributions based on the notion of influence functions.\n\nEntropy estimators are used in applications ranging from independent components analysis [15] and intrinsic dimension estimation [4] to several signal processing applications [9]. Divergence estimators are useful in statistical tasks such as two-sample testing. Recently they have also gained popularity as they are used to measure (dis)-similarity between objects that are modeled as distributions, in what is known as the \u201cmachine learning on distributions\u201d framework [5, 28]. Mutual information estimators have been used in learning tree-structured Markov random fields [19], feature selection [25], clustering [18] and neuron classification [31]. In the parametric setting, conditional divergence and conditional mutual information estimators are used for conditional two-sample testing or as building blocks for structure learning in graphical models. Nonparametric estimators for these quantities could potentially allow us to generalise several of these algorithms to the nonparametric domain.
Our approach gives sample-efficient estimators for all these quantities (and many others), which often outperform the existing estimators both theoretically and empirically.\n\nOur approach to estimating these functionals is based on post-hoc correction of a preliminary estimator using the von Mises expansion [7, 36]. This idea has been used before in the semiparametric statistics literature [3, 30]. However, most studies are restricted to functionals of one distribution and have focused on a \u201cdata-split\u201d approach which splits the samples for density estimation and functional estimation. While the data-split (DS) estimator is known to achieve the parametric convergence rate for sufficiently smooth densities [3, 14], in practical settings, as we show in our simulations, splitting the data results in poor empirical performance.\n\nIn this paper we introduce the method of influence function based nonparametric estimators to the machine learning community and expand on this technique in several novel and important ways. The main contributions of this paper are:\n\n1. We propose a \u201cleave-one-out\u201d (LOO) technique to estimate functionals of a single distribution. We prove that it has the same convergence rates as the DS estimator. However, the LOO estimator has better empirical performance in our simulations since it makes efficient use of the data.\n\n2. We extend both DS and LOO methods to functionals of multiple distributions and analyse their convergence. Under sufficient smoothness both estimators achieve the parametric rate and the DS estimator has a limiting normal distribution.\n\n3. We prove a lower bound for estimating functionals of multiple distributions. We use this to establish minimax optimality of the DS and LOO estimators under sufficient smoothness.\n\n4.
We use the approach to construct and implement estimators for various entropy, divergence and mutual information quantities and their conditional versions. A subset of these functionals are listed in Table 1 in the Appendix. Our software is publicly available at github.com/kirthevasank/if-estimators.\n\n5. We compare our estimators against several other approaches in simulation. Despite the generality of our approach, our estimators are competitive with and in many cases superior to existing specialised approaches for specific functionals. We also demonstrate how our estimators can be used in machine learning applications via an image clustering task.\n\nOur focus on information theoretic quantities is due to their relevance in machine learning applications, rather than a limitation of our approach. Indeed our techniques apply to any smooth functional.\n\nHistory: We provide a brief history of the post-hoc correction technique and influence functions. We defer a detailed discussion of other approaches to estimating functionals to Section 5. To our knowledge, the first paper using a post-hoc correction estimator was that of Bickel and Ritov [2]. The line of work following this paper analysed integral functionals of a single one dimensional density of the form ∫ ν(p) [2, 3, 11, 14]. A recent paper by Krishnamurthy et al. [12] also extends this line to functionals of multiple densities, but only considers polynomial functionals of the form ∫ p^α q^β for densities p and q. All approaches above use data splitting. Our work contributes to this line of research in two ways: we extend the technique to a more general class of functionals and study the empirically superior LOO estimator.\n\nA fundamental quantity in the design of our estimators is the influence function, which appears both in robust and semiparametric statistics. Indeed, our work is inspired by that of Robins et al. [30] and Emery et al.
[6] who propose a (data-split) influence-function based estimator for functionals of a single distribution. Their analysis for nonparametric problems relies on ideas from semiparametric statistics: they define influence functions for parametric models and then analyse estimators by looking at all parametric submodels through the true parameter.\n\n2 Preliminaries\n\nLet X be a compact metric space equipped with a measure µ, e.g. the Lebesgue measure. Let F and G be measures over X that are absolutely continuous w.r.t µ. Let f, g ∈ L2(X) be the Radon-Nikodym derivatives with respect to µ. We focus on estimating functionals of the form:\n\nT(F) = T(f) = φ(∫ ν(f) dµ)   or   T(F, G) = T(f, g) = φ(∫ ν(f, g) dµ),   (1)\n\nwhere φ, ν are real valued Lipschitz functions that are twice differentiable. Our framework permits more general functionals (e.g. functionals based on the conditional densities), but we will focus on this form for ease of exposition. To facilitate presentation of the main definitions, it is easiest to work with functionals of one distribution T(F). Define M to be the set of all measures that are absolutely continuous w.r.t µ, whose Radon-Nikodym derivatives belong to L2(X).\n\nCentral to our development is the von Mises expansion (VME), which is the distributional analog of the Taylor expansion. For this we introduce the Gâteaux derivative, which imposes a notion of differentiability in topological spaces. We then introduce the influence function.\n\nDefinition 1. Let P, H ∈ M and U : M → R be any functional. The map U′ : M → R where U′(H; P) = ∂U(P + tH)/∂t |_{t=0} is called the Gâteaux derivative at P if the derivative exists and is linear and continuous in H.
U is Gâteaux differentiable at P if the Gâteaux derivative exists at P.\n\nDefinition 2. Let U be Gâteaux differentiable at P. A function ψ(·; P) : X → R which satisfies U′(Q − P; P) = ∫ ψ(x; P) dQ(x) is the influence function of U w.r.t the distribution P.\n\nBy the Riesz representation theorem, the influence function exists uniquely since the domain of U is in bijection with L2(X) and consequently a Hilbert space. The classical work of Fernholz [7] defines the influence function in terms of the Gâteaux derivative by,\n\nψ(x; P) = U′(δ_x − P; P) = ∂U((1 − t)P + tδ_x)/∂t |_{t=0},   (2)\n\nwhere δ_x is the Dirac delta function at x. While our functionals are defined only on non-atomic distributions, we can still use (2) to compute the influence function. The function computed this way can be shown to satisfy Definition 2.\n\nBased on the above, the first order VME is,\n\nU(Q) = U(P) + U′(Q − P; P) + R2(P, Q) = U(P) + ∫ ψ(x; P) dQ(x) + R2(P, Q),   (3)\n\nwhere R2 is the second order remainder. Gâteaux differentiability alone will not be sufficient for our purposes. In what follows, we will assign Q → F and P → F̂, where F, F̂ are the true and estimated distributions. We would like to bound the remainder in terms of a distance between F and F̂. For functionals T of the form (1), we restrict the domain to be only measures with continuous densities. Then, we can control R2 using the L2 metric of the densities. This essentially means that our functionals satisfy a stronger form of differentiability called Fréchet differentiability [7, 36] in the L2 metric.
Consequently, we can write all derivatives in terms of the densities, and the VME reduces to a functional Taylor expansion on the densities (Lemmas 9, 10 in Appendix A):\n\nT(q) = T(p) + φ′(∫ ν(p)) ∫ (q − p) ν′(p) + R2(p, q) = T(p) + ∫ ψ(x; p) q(x) dµ(x) + O(‖p − q‖_2^2).   (4)\n\nThis expansion will be the basis for our estimators.\n\nThese ideas generalise to functionals of multiple distributions and to settings where the functional involves quantities other than the density. A functional T(P, Q) of two distributions has two Gâteaux derivatives, T′_i(·; P, Q) for i = 1, 2, formed by perturbing the i-th argument with the other fixed. The influence functions ψ_1, ψ_2 satisfy, ∀ P_1, P_2 ∈ M,\n\nT′_1(Q_1 − P_1; P_1, P_2) = ∂T(P_1 + t(Q_1 − P_1), P_2)/∂t |_{t=0} = ∫ ψ_1(u; P_1, P_2) dQ_1(u),\nT′_2(Q_2 − P_2; P_1, P_2) = ∂T(P_1, P_2 + t(Q_2 − P_2))/∂t |_{t=0} = ∫ ψ_2(u; P_1, P_2) dQ_2(u).   (5)\n\nThe VME can be written as,\n\nT(q_1, q_2) = T(p_1, p_2) + ∫ ψ_1(x; p_1, p_2) q_1(x) dx + ∫ ψ_2(x; p_1, p_2) q_2(x) dx + O(‖p_1 − q_1‖_2^2) + O(‖p_2 − q_2‖_2^2).   (6)\n\n3 Estimating Functionals\n\nFirst consider estimating a functional of a single distribution, T(f) = φ(∫ ν(f) dµ), from samples X_1^n ∼ f. We wish to find an estimator T̂ with low expected mean squared error (MSE) E[(T̂ − T)^2]. Using the VME (4), Emery et al. [6] and Robins et al. [30] suggest a natural estimator.
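As an illustrative sanity check (ours, not from the paper), expansion (4) can be verified numerically for the Shannon entropy T(p) = −∫ p log p, whose influence function via (2) works out to ψ(x; p) = −log p(x) − T(p): the first-order term should track T(q) − T(p) up to a remainder of order ‖p − q‖².

```python
import numpy as np

# Numerical check of the first-order VME (4) for Shannon entropy (our sketch,
# not from the paper): T(q) - T(p) = ∫ ψ(x; p) q(x) dx + O(‖p - q‖²).
grid = np.linspace(0.0, 1.0, 20001)

def integrate(w):
    """Trapezoidal integration of grid values w over [0, 1]."""
    return float(np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(grid)))

def T(p):
    """Shannon entropy T(p) = -∫ p log p."""
    return -integrate(p * np.log(p))

def psi(p):
    """Influence function ψ(x; p) = -log p(x) - T(p), evaluated on the grid."""
    return -np.log(p) - T(p)

p = (2.0 - grid) / 1.5                               # a fixed density on [0, 1]
remainders = []
for eps in (0.2, 0.1, 0.05):
    q = p + eps * np.sin(2 * np.pi * grid)           # perturbed density (still integrates to 1)
    first_order = integrate(psi(p) * q)              # ∫ ψ(x; p) q(x) dx
    remainders.append(T(q) - T(p) - first_order)     # second-order remainder R2(p, q)
```

Halving the perturbation size shrinks the remainder by roughly a factor of four, consistent with the quadratic remainder in (4).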
If we use half of the data X_1^{n/2} to construct an estimate f̂^(1) of the density f, then by (4):\n\nT(f) − T(f̂^(1)) = ∫ ψ(x; f̂^(1)) f(x) dµ + O(‖f − f̂^(1)‖_2^2).\n\nAs the influence function does not depend on (the unknown) F, the first term on the right hand side is simply an expectation of ψ(X; f̂^(1)) w.r.t F. We can use the second half of the data X_{n/2+1}^n to estimate this expectation with its sample mean. This leads to the following preliminary estimator:\n\nT̂_DS^(1) = T(f̂^(1)) + (1/(n/2)) Σ_{i=n/2+1}^{n} ψ(X_i; f̂^(1)).   (7)\n\nWe can similarly construct an estimator T̂_DS^(2) by using X_{n/2+1}^n for density estimation and X_1^{n/2} for averaging. Our final estimator is obtained via T̂_DS = (T̂_DS^(1) + T̂_DS^(2))/2. In what follows, we shall refer to this estimator as the Data-Split (DS) estimator. The DS estimator for functionals of one distribution has appeared before in the statistics literature [2, 3, 30].\n\nThe rate of convergence of this estimator is determined by the O(‖f − f̂^(1)‖_2^2) error in the VME and the n^{-1} rate for estimating an expectation. Lower bounds from the literature [3, 14] confirm minimax optimality of the DS estimator when f is sufficiently smooth. The data splitting trick is a common approach [3, 12, 14] as the analysis is straightforward.
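To make the recipe concrete: for the Shannon entropy T(f) = −∫ f log f the influence function is ψ(x; f) = −log f(x) − T(f), so in (7) the two T(f̂^(1)) terms cancel and each half-estimate reduces to the held-out sample mean of −log f̂^(1)(X_i). A minimal numerical sketch (our illustration with a 1-d Gaussian KDE and a Silverman-type bandwidth, not the paper's Legendre-kernel implementation):

```python
import numpy as np

def gaussian_kde(train, h):
    """Return a function evaluating a 1-d Gaussian KDE fit on `train`."""
    def pdf(x):
        z = (x[:, None] - train[None, :]) / h
        return np.exp(-0.5 * z**2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
    return pdf

def entropy_ds(x):
    """Data-split (DS) influence-function estimate of the Shannon entropy.

    For T(f) = -∫ f log f, ψ(x; f) = -log f(x) - T(f), so T(f̂) + mean ψ(X_i; f̂)
    simplifies to the mean of -log f̂(X_i) over the held-out half.
    """
    n = len(x)
    half1, half2 = x[: n // 2], x[n // 2:]
    h = 1.06 * np.std(x) * n ** (-1 / 5)                  # normal-reference bandwidth (illustrative)
    t1 = -np.log(gaussian_kde(half1, h)(half2)).mean()    # estimate density on half1, average on half2
    t2 = -np.log(gaussian_kde(half2, h)(half1)).mean()    # swap the roles of the two halves
    return (t1 + t2) / 2                                  # symmetrised DS estimate

rng = np.random.default_rng(0)
est = entropy_ds(rng.normal(size=4000))   # true N(0,1) entropy: 0.5*log(2*pi*e) ≈ 1.4189
```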
While in theory DS estimators enjoy good rates of convergence, data splitting is unsatisfying from a practical standpoint since using only half the data each for estimation and averaging invariably decreases the accuracy.\n\nTo make more effective use of the sample, we propose a Leave-One-Out (LOO) version of the above estimator,\n\nT̂_LOO = (1/n) Σ_{i=1}^{n} ( T(f̂_{-i}) + ψ(X_i; f̂_{-i}) ),   (8)\n\nwhere f̂_{-i} is a density estimate using all the samples X_1^n except for X_i. We prove that the LOO estimator achieves the same rate of convergence as the DS estimator but empirically performs much better. Our analysis is specialised to the case where f̂_{-i} is a kernel density estimate (Section 4).\n\nWe can extend this method to estimate functionals of two distributions. Say we have n i.i.d. samples X_1^n from f and m samples Y_1^m from g. Akin to the one distribution case, we propose the following DS and LOO versions.\n\nT̂_DS^(1) = T(f̂^(1), ĝ^(1)) + (1/(n/2)) Σ_{i=n/2+1}^{n} ψ_f(X_i; f̂^(1), ĝ^(1)) + (1/(m/2)) Σ_{j=m/2+1}^{m} ψ_g(Y_j; f̂^(1), ĝ^(1)).   (9)\n\nT̂_LOO = (1/max(n, m)) Σ_{i=1}^{max(n,m)} ( T(f̂_{-i}, ĝ_{-i}) + ψ_f(X_i; f̂_{-i}, ĝ_{-i}) + ψ_g(Y_i; f̂_{-i}, ĝ_{-i}) ).   (10)\n\nHere, ĝ^(1), ĝ_{-i} are defined similar to f̂^(1), f̂_{-i}. For the DS estimator, we swap the samples to compute T̂_DS^(2) and average. For the LOO estimator, if n > m we cycle through the points Y_1^m until we have summed over all X_1^n, or vice versa. T̂_LOO is asymmetric when n ≠ m. A seemingly natural alternative would be to sum over all nm pairings of the X_i's and Y_j's. However, this is computationally more expensive.
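The same cancellation gives a compact form of the LOO estimator (8) for the Shannon entropy: T(f̂_{-i}) + ψ(X_i; f̂_{-i}) collapses to −log f̂_{-i}(X_i), and all n leave-one-out KDEs can be read off a single kernel matrix by zeroing its diagonal. A sketch under our illustrative Gaussian-KDE assumptions (not the paper's implementation):

```python
import numpy as np

def entropy_loo(x):
    """LOO influence-function estimate of Shannon entropy with a 1-d Gaussian KDE.

    For T(f) = -∫ f log f, the summand T(f̂₋ᵢ) + ψ(Xᵢ; f̂₋ᵢ) collapses to
    -log f̂₋ᵢ(Xᵢ), so the estimator is the mean of -log f̂₋ᵢ(Xᵢ) over all samples.
    """
    n = len(x)
    h = 1.06 * np.std(x) * n ** (-1 / 5)           # normal-reference bandwidth (illustrative)
    z = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)                        # drop Xᵢ from its own density estimate
    f_loo = k.sum(axis=1) / (n - 1)                 # f̂₋ᵢ(Xᵢ) for every i at once
    return -np.log(f_loo).mean()

rng = np.random.default_rng(1)
est = entropy_loo(rng.normal(size=2000))            # true N(0,1) entropy ≈ 1.4189
```

Unlike the DS version, every sample is used both for density estimation and for averaging, which is the source of the empirical gains reported below.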
Moreover, a straightforward modification of our proof in Appendix D.2 shows that both approaches converge at the same rate if n and m are of the same order.\n\nExamples: We demonstrate the generality of our framework by presenting estimators for several entropies, divergences, mutual informations and their conditional versions in Table 1 (Appendix H). For many functionals in the table, these are the first computationally efficient estimators proposed. We hope this table will serve as a good reference for practitioners. For several functionals (e.g. conditional and unconditional Rényi-α divergence, conditional Tsallis-α mutual information) the estimators are not listed only because the expressions are too long to fit into the table. Our software implements a total of 17 functionals which include all the estimators in the table. In Appendix F we illustrate how to apply our framework to derive an estimator for any functional via an example.\n\nAs will be discussed in Section 5, when compared to other alternatives, our technique has several favourable properties: the computational complexity of our method is O(n^2) when compared to O(n^3) of other methods; for several functionals we do not require numeric integration; unlike most other methods [28, 32], we do not require any tuning of hyperparameters.\n\n4 Analysis\n\nSome smoothness assumptions on the densities are warranted to make estimation tractable. We use the Hölder class, which is now standard in the nonparametrics literature.\n\nDefinition 3. Let X ⊂ R^d be a compact space. For any r = (r_1, ..., r_d), r_i ∈ N, define |r| = Σ_i r_i and D^r = ∂^{|r|} / (∂x_1^{r_1} ... ∂x_d^{r_d}). The Hölder class Σ(s, L) is the set of functions on L2(X) satisfying,\n\n|D^r f(x) − D^r f(y)| ≤ L ‖x − y‖^{s−|r|},\n\nfor all r s.t. |r| ≤ ⌊s⌋ and for all x, y ∈ X. Moreover, define the Bounded Hölder Class Σ(s, L, B′, B) to be {f ∈ Σ(s, L) : B′ < f < B}.\n\nNote that large s implies higher smoothness. Given n samples X_1^n from a d-dimensional density f, the kernel density estimator (KDE) with bandwidth h is f̂(t) = (1/(nh^d)) Σ_{i=1}^{n} K((t − X_i)/h). Here K : R^d → R is a smoothing kernel [35]. When f ∈ Σ(s, L), by selecting h ∈ Θ(n^{-1/(2s+d)}) the KDE achieves the minimax rate of O_P(n^{-2s/(2s+d)}) in mean squared error. Further, if f is in the bounded Hölder class Σ(s, L, B′, B) one can truncate the KDE from below at B′ and from above at B and achieve the same convergence rate [3]. In our analysis, the density estimators f̂^(1), f̂_{-i}, ĝ^(1), ĝ_{-i} are formed by either a KDE or a truncated KDE, and we will make use of these results.\n\nWe will also need the following regularity condition on the influence function. This is satisfied for smooth functionals including those in Table 1. We demonstrate this in our example in Appendix F.\n\nAssumption 4.
For a functional T(f) of one distribution, the influence function ψ satisfies,\n\nE[(ψ(X; f′) − ψ(X; f))^2] ∈ O(‖f − f′‖_2) as ‖f − f′‖_2 → 0.\n\nFor a functional T(f, g) of two distributions, the influence functions ψ_f, ψ_g satisfy,\n\nE_f[(ψ_f(X; f′, g′) − ψ_f(X; f, g))^2] ∈ O(‖f − f′‖_2 + ‖g − g′‖_2) as ‖f − f′‖_2, ‖g − g′‖_2 → 0,\nE_g[(ψ_g(Y; f′, g′) − ψ_g(Y; f, g))^2] ∈ O(‖f − f′‖_2 + ‖g − g′‖_2) as ‖f − f′‖_2, ‖g − g′‖_2 → 0.\n\nUnder the above assumptions, Emery et al. [6], Robins et al. [30] show that the DS estimator on a single distribution achieves MSE E[(T̂_DS − T(f))^2] ∈ O(n^{-4s/(2s+d)} + n^{-1}) and further is asymptotically normal when s > d/2. Their analysis in the semiparametric setting contains the nonparametric setting as a special case. In Appendix B we review these results with a simpler self contained analysis that directly uses the VME and has more interpretable assumptions. An attractive property of our proof is that it is agnostic to the density estimator used, provided it achieves the correct rates.\n\nFor the LOO estimator (Equation (8)), we establish the following result.\n\nTheorem 5 (Convergence of LOO Estimator for T(f)). Let f ∈ Σ(s, L, B, B′) and ψ satisfy Assumption 4. Then, E[(T̂_LOO − T(f))^2] is O(n^{-4s/(2s+d)}) when s < d/2 and O(n^{-1}) when s ≥ d/2.\n\nThe key technical challenge in analysing the LOO estimator (when compared to the DS estimator) is in bounding the variance, as there are several correlated terms in the summation.
The bounded difference inequality is a popular trick used in such settings, but this requires a supremum on the influence functions which leads to significantly worse rates. Instead we use the Efron-Stein inequality, which provides an integrated version of bounded differences that can recover the correct rate when coupled with Assumption 4. Our proof is contingent on the use of the KDE as the density estimator. While our empirical studies indicate that T̂_LOO's limiting distribution is normal (Fig 2(c)), the proof seems challenging due to the correlation between terms in the summation. We conjecture that T̂_LOO is indeed asymptotically normal but for now leave it to future work.\n\nWe reiterate that while the convergence rates are the same for both DS and LOO estimators, the data splitting degrades the empirical performance of T̂_DS as we show in our simulations.\n\nNow we turn our attention to functionals of two distributions. When analysing asymptotics we will assume that as n, m → ∞, n/(n + m) → ζ ∈ (0, 1). Denote N = n + m. For the DS estimator (9) we generalise our analysis for one distribution to establish the theorem below.\n\nTheorem 6 (Convergence/Asymptotic Normality of DS Estimator for T(f, g)). Let f, g ∈ Σ(s, L, B, B′) and ψ_f, ψ_g satisfy Assumption 4. Then, E[(T̂_DS − T(f, g))^2] is O(n^{-4s/(2s+d)} + m^{-4s/(2s+d)}) when s < d/2 and O(n^{-1} + m^{-1}) when s ≥ d/2. Further, when s > d/2 and when ψ_f, ψ_g ≠ 0, T̂_DS is asymptotically normal,\n\n√N (T̂_DS − T(f, g)) →_D N( 0, (1/ζ) V_f[ψ_f(X; f, g)] + (1/(1 − ζ)) V_g[ψ_g(Y; f, g)] ).   (11)\n\nThe convergence rate is analogous to the one distribution case, with the estimator achieving the parametric rate under similar smoothness conditions. The asymptotic normality result allows us to construct asymptotic confidence intervals for the functional. Even though the asymptotic variance of the influence function is not known, by Slutsky's theorem any consistent estimate of the variance gives a valid asymptotic confidence interval. In fact, we can use an influence function based estimator for the asymptotic variance, since it is also a differentiable functional of the densities. We demonstrate this in our example in Appendix F.\n\nThe condition ψ_f, ψ_g ≠ 0 is somewhat technical. When both ψ_f and ψ_g are zero, the first order term vanishes and the estimator converges very fast (at rate 1/n^2). However, the asymptotic behavior of the estimator is unclear. While this degeneracy occurs only on a meagre set, it does arise for important choices, such as the null hypothesis f = g in two-sample testing problems.\n\nFinally, for the LOO estimator (10) on two distributions we have the following result. Convergence is analogous to the one distribution setting and the parametric rate is achieved when s > d/2.\n\nTheorem 7 (Convergence of LOO Estimator for T(f, g)). Let f, g ∈ Σ(s, L, B, B′) and ψ_f, ψ_g satisfy Assumption 4. Then, E[(T̂_LOO − T(f, g))^2] is O(n^{-4s/(2s+d)} + m^{-4s/(2s+d)}) when s < d/2 and O(n^{-1} + m^{-1}) when s ≥ d/2.\n\nFor many functionals, a Hölderian assumption (Σ(s, L)) alone is sufficient to guarantee the rates in Theorems 5, 6 and 7. However, for some functionals (such as the α-divergences) we require f̂, ĝ, f, g to be bounded above and below. Existing results [3, 12] demonstrate that estimating such quantities is difficult without this assumption.\n\nNow we turn our attention to the question of statistical difficulty. Via lower bounds given by Birgé and Massart [3] and Laurent [14] we know that the DS and LOO estimators are minimax optimal when s > d/2 for functionals of one distribution. In the following theorem, we present a lower bound for estimating functionals of two distributions.\n\nTheorem 8 (Lower Bound for T(f, g)). Let f, g ∈ Σ(s, L) and T̂ be any estimator for T(f, g). Define τ = min{8s/(4s + d), 1}. Then there exists a strictly positive constant c such that,\n\nlim inf_{n→∞} inf_{T̂} sup_{f,g ∈ Σ(s,L)} E[(T̂ − T(f, g))^2] ≥ c (n^{-τ} + m^{-τ}).\n\nOur proof, given in Appendix E, is based on LeCam's method [35] and generalises the analysis of Birgé and Massart [3] for functionals of one distribution. This establishes minimax optimality of the DS/LOO estimators for functionals of two distributions when s ≥ d/2. However, when s < d/2 there is a gap between our upper and lower bounds. It is natural to ask if it is possible to improve on our rates in this regime. A series of work [3, 11, 14] shows that, for integral functionals of one distribution, one can achieve the n^{-1} rate when s > d/4 by estimating the second order term in the functional Taylor expansion. This second order correction was also done for polynomial functionals of two distributions with similar statistical gains [12].
While we believe this is possible here, these estimators are conceptually complicated and computationally expensive – requiring O(n^3 + m^3) running time compared to the O(n^2 + m^2) running time for our estimator. The first order estimator has a favorable balance between statistical and computational efficiency. Further, not much is known about the limiting distribution of second order estimators.\n\nFigure 1: Comparison of DS/LOO estimators against alternatives on different functionals. The y-axis is the error |T̂ − T(f, g)| and the x-axis is the number of samples. All curves were produced by averaging over 50 experiments. Discretisation in hyperparameter selection may explain some of the unsmooth curves.\n\n5 Comparison with Other Approaches\n\nEstimation of statistical functionals under nonparametric assumptions has received considerable attention over the last few decades. A large body of work has focused on estimating the Shannon entropy – Beirlant et al. [1] gives a nice review of results and techniques. More recent work in the single-distribution setting includes estimation of Rényi and Tsallis entropies [17, 24]. There are also several papers extending some of these techniques to divergence estimation [10, 12, 26, 27, 37].\n\nMany of the existing methods can be categorised as plug-in methods: they are based on estimating the densities either via a KDE or using k-Nearest Neighbors (k-NN) and evaluating the functional on these estimates. Plug-in methods are conceptually simple but unfortunately suffer several drawbacks. First, they typically have a worse convergence rate than our approach, achieving the parametric rate only when s ≥ d as opposed to s ≥ d/2 [19, 32]. Secondly, using either the KDE or k-NN, obtaining the best rates for plug-in methods requires undersmoothing the density estimate, and we are not aware of principled approaches for selecting this smoothing parameter.
In contrast, the bandwidth used in our estimators is the optimal bandwidth for density estimation, so we can select it using a number of approaches, e.g. cross validation. This is convenient from a practitioner's perspective as the bandwidth can be selected automatically, a convenience that other estimators do not enjoy. Thirdly, plug-in methods based on the KDE always require computationally burdensome numeric integration. In our approach, numeric integration can be avoided for many functionals of interest (see Table 1).\n\nAnother line of work focuses more specifically on estimating f-divergences. Nguyen et al. [22] estimate f-divergences by solving a convex program and analyse the method when the likelihood ratio of the densities belongs to an RKHS. Comparing the theoretical results is not straightforward as it is not clear how to port the RKHS assumption to our setting. Further, the size of the convex program increases with the sample size, which is problematic for large samples. Moon and Hero [21] use a weighted ensemble estimator for f-divergences. They establish asymptotic normality and the parametric convergence rate only when s ≥ d, which is a stronger smoothness assumption than is required by our technique. Both these works only consider f-divergences, whereas our method has wider applicability and includes f-divergences as a special case.\n\n6 Experiments\n\nWe compare the estimators derived using our methods on a series of synthetic examples. We compare against the methods in [8, 20, 23, 26–29, 33].
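In the experiments below, KDE bandwidths are chosen by 5-fold cross validation. A minimal sketch of likelihood-based bandwidth selection (our illustrative 1-d Gaussian-KDE version, not the paper's Legendre-kernel implementation):

```python
import numpy as np

def cv_bandwidth(x, candidates, n_folds=5):
    """Pick a KDE bandwidth by k-fold held-out log-likelihood (illustrative sketch)."""
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = []
    for h in candidates:
        ll = 0.0
        for k in range(n_folds):
            test = x[folds[k]]
            train = np.concatenate([x[folds[j]] for j in range(n_folds) if j != k])
            z = (test[:, None] - train[None, :]) / h
            dens = np.exp(-0.5 * z**2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
            ll += np.log(np.maximum(dens, 1e-300)).sum()   # held-out log-likelihood
        scores.append(ll)
    return candidates[int(np.argmax(scores))]              # bandwidth with best held-out fit

x = np.random.default_rng(2).normal(size=500)
h = cv_bandwidth(x, candidates=np.array([0.05, 0.1, 0.2, 0.4, 0.8]))
```

Because the same bandwidth is optimal for density estimation and for the first-order-corrected estimator, no separate undersmoothing parameter needs to be tuned.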
Software for the estimators was obtained either directly from the papers or from Szabó [34]. For the DS/LOO estimators, we estimate the density via a KDE with the smoothing kernels constructed using Legendre polynomials [35]. In both cases and for the plug-in estimator we choose the bandwidth by performing 5-fold cross validation. The integration for the plug-in estimator is approximated numerically.\n\n[Figure 1 plots: panels for Shannon Entropy 1D, Shannon Entropy 2D, KL Divergence, Renyi-0.75 Divergence, Hellinger Divergence and Tsallis-0.75 Divergence, each showing the error |T̂ − T| against n for the Plug-in, DS, LOO, kNN and, where applicable, KDP, Vasicek-KDE and Voronoi estimators.]\n\nFigure 2: Fig (a): Comparison of the LOO vs DS estimator on estimating the conditional Tsallis divergence in 4 dimensions. Note that the plug-in estimator is intractable due to numerical integration. There are no other known estimators for the conditional Tsallis divergence. Figs (b), (c): QQ plots obtained using 4000 samples for Hellinger divergence estimation in 4 dimensions using the DS and LOO estimators respectively.\n\nWe test the estimators on a series of synthetic datasets in 1–4 dimensions. The specifics of the densities used in the examples and the methods compared to are given in Appendix G. The results are shown in Figures 1 and 2. We make the following observations. In most cases the LOO estimator performs best. The DS estimator approaches the LOO estimator when there are many samples but is generally inferior to the LOO estimator with few samples. This, as we have explained before, is because data splitting does not make efficient use of the data. The k-NN estimator for divergences [28] requires choosing a k.
For this estimator, we used the default setting for k given in the software. As performance is sensitive to the choice of k, it performs well in some cases but poorly in others. We reiterate that the hyper-parameter of our estimator (the bandwidth of the kernel) can be selected automatically using cross validation.

Next, we test the DS and LOO estimators for asymptotic normality on a 4-dimensional Hellinger divergence estimation problem. We use 4000 samples for estimation. We repeat this experiment 200 times and compare the empirical asymptotic distribution (i.e. the √4000 (T̂ − T(f, g))/Ŝ values, where Ŝ is the estimated asymptotic variance) to a N(0, 1) distribution on a QQ plot. The results in Figure 2 suggest that both estimators are asymptotically normal.

Image clustering: We demonstrate the use of our nonparametric divergence estimators in an image clustering task on the ETH-80 dataset [16]. Using our Hellinger divergence estimator we achieved an accuracy of 92.47%, whereas a naive spectral clustering approach achieved only 70.18%. When we used a k-NN estimator for the Hellinger divergence [28] we achieved 90.04%, which attests to the superiority of our method. Since this is not the main focus of this work, we defer the details to Appendix G.

7 Conclusion

We generalise existing results in von Mises estimation by proposing an empirically superior LOO technique for estimating functionals and by extending the framework to functionals of two distributions. We also prove a lower bound for the latter setting. We demonstrate the practical utility of our technique via comparisons against other alternatives and an image clustering application. An open problem arising out of our work is to derive the limiting distribution of the LOO estimator.

Acknowledgements

This work is supported in part by NSF Big Data grant IIS-1247658 and DOE grant DESC0011114.

References

[1] Jan Beirlant, Edward J.
Dudewicz, László Györfi, and Edward C. Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 1997.
[2] Peter J. Bickel and Ya'acov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, 1988.
[3] Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. Ann. of Stat., 1995.
[4] Kevin M. Carter, Raviv Raich, and Alfred O. Hero. On local intrinsic dimension estimation and its applications. IEEE Transactions on Signal Processing, 2010.
[5] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification. J. Mach. Learn. Res., 2003.
[6] M. Emery, A. Nemirovski, and D. Voiculescu. Lectures on Prob. Theory and Stat. Springer, 1998.
[7] Luisa Fernholz. Von Mises calculus for statistical functionals. Lecture notes in statistics. Springer, 1983.
[8] Mohammed Nawaz Goria, Nikolai N. Leonenko, Victor V. Mergel, and Pier Luigi Novi Inverardi. A new class of random vector entropy estimators and its applications. Nonparametric Statistics, 2005.
[9] Alfred O. Hero, Bing Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19, 2002.
[10] David Källberg and Oleg Seleznjev. Estimation of entropy-type integral functionals. arXiv, 2012.
[11] Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets.
Ann. of Stat., 1996.
[12] Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabás Póczos, and Larry Wasserman. Nonparametric Estimation of Rényi Divergence and Friends. In ICML, 2014.
[13] Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabás Póczos, and Larry Wasserman. On Estimating L₂² Divergence. In Artificial Intelligence and Statistics, 2015.
[14] Béatrice Laurent. Efficient estimation of integral functionals of a density. Ann. of Stat., 1996.
[15] Erik Learned-Miller and John Fisher III. ICA using spacings estimates of entropy. J. Mach. Learn. Res., 2003.
[16] Bastian Leibe and Bernt Schiele. Analyzing Appearance and Contour Based Methods for Object Categorization. In CVPR, 2003.
[17] Nikolai Leonenko and Oleg Seleznjev. Statistical inference for the epsilon-entropy and the quadratic Rényi entropy. Journal of Multivariate Analysis, 2010.
[18] Jeremy Lewi, Robert Butera, and Liam Paninski. Real-time adaptive information-theoretic optimization of neurophysiology experiments. In NIPS, 2006.
[19] Han Liu, Larry Wasserman, and John D. Lafferty. Exponential concentration for mutual information estimation with application to forests. In NIPS, 2012.
[20] Erik G. Miller. A new class of Entropy Estimators for Multi-dimensional Densities. In ICASSP, 2003.
[21] Kevin Moon and Alfred Hero. Multivariate f-divergence Estimation With Confidence. In NIPS, 2014.
[22] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.
[23] Havva Alizadeh Noughabi and Reza Alizadeh Noughabi. On the Entropy Estimators. Journal of Statistical Computation and Simulation, 2013.
[24] Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of Rényi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs.
In NIPS, 2010.
[25] Hanchuan Peng, Fulmi Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE PAMI, 2005.
[26] Fernando Pérez-Cruz. KL divergence estimation of continuous distributions. In IEEE ISIT, 2008.
[27] Barnabás Póczos and Jeff Schneider. On the estimation of alpha-divergences. In AISTATS, 2011.
[28] Barnabás Póczos, Liang Xiong, and Jeff G. Schneider. Nonparametric Divergence Estimation with Applications to Machine Learning on Distributions. In UAI, 2011.
[29] David Ramírez, Javier Vía, Ignacio Santamaría, and Pedro Crespo. Entropy and Kullback-Leibler Divergence Estimation based on Szegő's Theorem. In EUSIPCO, 2009.
[30] James Robins, Lingling Li, Eric Tchetgen, and Aad W. van der Vaart. Quadratic semiparametric Von Mises Calculus. Metrika, 2009.
[31] Elad Schneidman, William Bialek, and Michael J. Berry II. An Information Theoretic Approach to the Functional Classification of Neurons. In NIPS, 2002.
[32] Shashank Singh and Barnabás Póczos. Exponential Concentration of a Density Functional Estimator. In NIPS, 2014.
[33] Dan Stowell and Mark D. Plumbley. Fast Multidimensional Entropy Estimation by k-d Partitioning. IEEE Signal Process. Lett., 2009.
[34] Zoltán Szabó. Information Theoretical Estimators Toolbox. J. Mach. Learn. Res., 2014.
[35] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
[36] Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[37] Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances.
IEEE Transactions on Information Theory, 2009.