{"title": "Exponential Concentration for Mutual Information Estimation with Application to Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 2537, "page_last": 2545, "abstract": "We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph.", "full_text": "Exponential Concentration for Mutual Information\n\nEstimation with Application to Forests\n\nHan Liu\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University, NJ 08544\n\nJohn Lafferty\n\nDepartment of Computer Science\n\nDepartment of Statistics\n\nUniversity of Chicago, IL 60637\n\nhanliu@princeton.edu\n\nlafferty@galton.uchicago.edu\n\nLarry Wasserman\n\nDepartment of Statistics\n\nMachine Learning Department\n\nCarnegie Mellon University, PA 15231\n\nlarry@stat.cmu.edu\n\nAbstract\n\nWe prove a new exponential concentration inequality for a plug-in estimator of the\nShannon mutual information. Previous results on mutual information estimation\nonly bounded expected error. The advantage of having the exponential inequality\nis that, combined with the union bound, we can guarantee accurate estimators of\nthe mutual information for many pairs of random variables simultaneously. As an\napplication, we show how to use such a result to optimally estimate the density\nfunction and graph of a distribution which is Markov to a forest graph.\n\n(cid:90)\n\n(cid:90)\n\n(cid:18) p(x1, x2)\n\n(cid:19)\n\nIntroduction\n\n1\nWe consider the problem of nonparametrically estimating the Shannon mutual information between\ntwo random variables. Let X1 \u2208 X1 and X2 \u2208 X2 be two random variables with domains X1 and\nX2 and joint density p(x1, x2). The mutual information between X1 and X2 is\n\n(cid:90) (cid:90)\n\ndx1 dx2 = H(X1) + H(X2) \u2212 H(X1, X2),\n\nI(X1; X2) :=\n\np(x1, x2) log\n\nX1\n\np(x1)p(x2)\n\nX2\nwhere H(X1, X2) = \u2212\np(x1, x2) log p(x1, x2)dx1 dx2 (and similarly for H(X1) and H(X2))\nare the corresponding Shannon entropies [4]. The mutual information is a measure of dependence\nbetween X1 and X2. To estimate I(X1; X2) well, it suf\ufb01ces to estimate H(X1, X2) := H(p).\nA simple way to estimate the Shannon entropy is to use a kernel density estimator (KDE) [22, 1, 9,\n5, 20, 7], i.e., the densities p(x, y), p(x), and p(y) are separately estimated from samples and the\nestimated densities are used to calculate the entropy. Alternative methods involve estimation of the\nentropies using spacings [25, 26, 23], k-nearest neighbors [11, 12], the Edgeworth expansion [24],\nand convex optimization [17]. More discussions can be found in the survey articles [2, 19]. There\nhave been many recent developments in the problem of estimating Shannon entropy and related\nquantities as well as application of these results to machine learning problems [18, 21, 8, 6]. 
Under weak conditions, it has been shown that there are estimators that achieve the parametric √n-rate of convergence in mean squared error (MSE), where n is the sample size.\n\n
In this paper, we construct an estimator with this rate, but we also prove an exponential concentration inequality for the estimator. More specifically, we show that our estimator Ĥ of H(p) satisfies\n\n
sup_{p∈Σ} P( |Ĥ − H(p)| > ε ) ≤ 2 exp( −nε²/(36κ²) ),   (1.1)\n\n
where Σ is a nonparametric class of distributions defined in Section 2 and κ is a constant. To the best of our knowledge, this is the first such exponential inequality for nonparametric Shannon entropy and mutual information estimation. The advantage of this result, over the usual results which state that E( |Ĥ − H(p)|² ) = O(n^{-1}), is that we can apply the union bound and thus guarantee accurate mutual information estimation for many pairs of random variables simultaneously. As an application, we consider forest density estimation [15], which, in a d-dimensional problem, requires estimating d(d − 1)/2 mutual informations in order to apply the Chow-Liu algorithm. As long as (log d)/n → 0 as n → ∞, we can estimate the forest graph well, even if d = d(n) increases with n exponentially fast.\n\n
The rest of this paper is organized as follows. The assumptions and estimator are given in Section 2. The main theoretical analysis is in Section 3. In Section 4 we show how to apply the result to forest density estimation. Some discussion and possible extensions are provided in the last section.\n\n
2 Estimator and Main Result\n\n
Let X = (X1, X2) ∈ R² be a random vector with density p(x) := p(x1, x2) and let x^1, . . . , x^n ∈ X ⊂ R² be a random sample from p. In this paper, we only consider the case of the bounded domain X = [0, 1]². We want to estimate the Shannon entropy\n\n
H(p) = −∫_X p(x) log p(x) dx.   (2.1)\n\n
We start with some assumptions on the density function p(x1, x2).\n\n
Assumption 2.1 (Density assumption). We assume the density p(x1, x2) belongs to a 2nd-order Hölder class Σκ(2, L) and is bounded away from zero and infinity. In particular, there exist constants κ1, κ2 with\n\n
0 < κ1 ≤ min_{x∈X} p(x) ≤ max_{x∈X} p(x) ≤ κ2 < ∞,   (2.2)\n\n
and there exists a constant L such that, for any (x1, x2)ᵀ ∈ X and any (u, v)ᵀ with (x1 + u, x2 + v)ᵀ ∈ X,\n\n
| p(x1 + u, x2 + v) − p(x1, x2) − (∂p(x1, x2)/∂x1) u − (∂p(x1, x2)/∂x2) v | ≤ L(u² + v²).   (2.3)\n\n
Assumption 2.2 (Boundary assumption). If {x_n} ⊂ X is any sequence converging to a boundary point x*, we require the density p(x) to have vanishing first order partial derivatives:\n\n
lim_{n→∞} ∂p(x_n)/∂x1 = lim_{n→∞} ∂p(x_n)/∂x2 = 0.   (2.4)\n\n
To efficiently estimate the entropy in (2.1), we use a KDE based “plug-in” estimator. Bias at the boundaries turns out to be very important in this problem; see [10] for a discussion of boundary bias.\n
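\nThe boundary effect is easy to see numerically. In the following minimal sketch (Python with numpy; the uniform sample, the Epanechnikov kernel, and the bandwidth are arbitrary illustrative choices), a standard, uncorrected KDE is nearly unbiased at an interior point but underestimates the density by a factor of about two at the edge of [0, 1], where half of the kernel mass falls outside the support:\n\n
import numpy as np\n
\n
rng = np.random.default_rng(0)\n
n, h = 20000, 0.05\n
x = rng.uniform(size=n)          # sample from the uniform density, p = 1 on [0, 1]\n
\n
def K(u):                        # Epanechnikov kernel, supported on [-1, 1]\n
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)\n
\n
def kde(t):                      # standard (uncorrected) kernel density estimate at t\n
    return K((t - x) / h).sum() / (n * h)\n
\n
print(f'interior: p_hat(0.5) = {kde(0.5):.3f} (target 1.0)')\n
print(f'boundary: p_hat(0.0) = {kde(0.0):.3f} (about 0.5: half the kernel mass is lost)')\n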
To correct the boundary effects, we use the following “mirror image” kernel density estimator:\n\n
p̃h(x1, x2) := (1/(nh²)) Σ_{i=1}^n { K((x1 − x^i_1)/h) K((x2 − x^i_2)/h)\n
+ K((x1 + x^i_1)/h) K((x2 − x^i_2)/h) + K((x1 − x^i_1)/h) K((x2 + x^i_2)/h)\n
+ K((x1 + x^i_1)/h) K((x2 + x^i_2)/h) + K((x1 − 2 + x^i_1)/h) K((x2 − x^i_2)/h)\n
+ K((x1 − x^i_1)/h) K((x2 − 2 + x^i_2)/h) + K((x1 − 2 + x^i_1)/h) K((x2 + x^i_2)/h)\n
+ K((x1 + x^i_1)/h) K((x2 − 2 + x^i_2)/h) + K((x1 − 2 + x^i_1)/h) K((x2 − 2 + x^i_2)/h) }.   (2.5)\n\n
Here h is the bandwidth and K(·) is a univariate kernel function. We denote by K2(u, v) := K(u)K(v) the bivariate product kernel. This estimator has nine terms; one corresponds to the original data in the unit square [0, 1]², and each of the remaining eight corresponds to reflecting the data across one of the four sides or four corners of the square.\n\n
Assumption 2.3 (Kernel assumption). The kernel K(·) is nonnegative and has a bounded support [−1, 1], with\n\n
∫_{−1}^{1} K(u) du = 1   and   ∫_{−1}^{1} u K(u) du = 0.\n\n
By Assumption 2.1, the values of the true density lie in the interval [κ1, κ2]. We propose a clipped KDE estimator\n\n
p̂h(x) = T_{κ1,κ2}( p̃h(x) ),   (2.6)\n\n
where T_{κ1,κ2}(a) = κ1 · I(a < κ1) + a · I(κ1 ≤ a ≤ κ2) + κ2 · I(a > κ2), so that the estimated density also has this property. Letting g(u) = u log u, we propose the following plug-in entropy estimator:\n\n
H(p̂h) := −∫_X g(p̂h(x)) dx = −∫_X p̂h(x) log p̂h(x) dx.   (2.7)\n\n
Remark 2.1. The clipped estimator p̂h requires the knowledge of κ1 and κ2. In applications, we do not need to know the exact values of κ1 and κ2; lower and upper bounds are sufficient.\n\n
Our main technical result is the following exponential concentration inequality on H(p̂h) around the population quantity H(p). Our proof is given in Section 3.\n\n
Theorem 2.1. Under Assumptions 2.1, 2.2, and 2.3, if we choose the bandwidth according to h ≍ n^{-1/4}, then there exists a constant N0 such that for all n > N0,\n\n
sup_{p∈Σκ(2,L)} P( |H(p̂h) − H(p)| > ε ) ≤ 2 exp( −nε²/(36κ²) ),   (2.8)\n\n
where κ = max{ |log κ1|, |log κ2| } + 1.\n\n
To the best of our knowledge, this is the first time an exponential inequality like (2.8) has been established for Shannon entropy estimation over the Hölder class.\n
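\nFor concreteness, the following is a direct numerical transcription of (2.5)-(2.7) (a sketch in Python with numpy, not the authors' code; the Epanechnikov kernel satisfies Assumption 2.3, while the toy sample, the clipping bounds κ1, κ2, and the Riemann grid used to evaluate the integral in (2.7) are illustrative assumptions):\n\n
import numpy as np\n
\n
rng = np.random.default_rng(0)\n
\n
def K(u):   # Epanechnikov kernel: nonnegative, support [-1, 1], unit mass, zero first moment\n
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)\n
\n
def mirror_kde(grid, data, h):\n
    # Nine-term mirror-image estimator (2.5): the original points plus their\n
    # reflections across the four sides and four corners of [0, 1]^2.\n
    n = data.shape[0]\n
    dens = np.zeros(grid.shape[0])\n
    for r1 in (lambda t: t, lambda t: -t, lambda t: 2.0 - t):\n
        for r2 in (lambda t: t, lambda t: -t, lambda t: 2.0 - t):\n
            u1 = (grid[:, None, 0] - r1(data[None, :, 0])) / h\n
            u2 = (grid[:, None, 1] - r2(data[None, :, 1])) / h\n
            dens += (K(u1) * K(u2)).sum(axis=1)\n
    return dens / (n * h * h)\n
\n
n = 2000\n
h = n ** -0.25                    # the undersmoothed bandwidth of Theorem 2.1\n
data = rng.uniform(size=(n, 2))   # toy sample; the true density is 1, so H(p) = 0\n
kap1, kap2 = 0.5, 2.0             # assumed lower/upper bounds for the clipping operator T\n
\n
m = 40                            # midpoint Riemann grid for the integral in (2.7)\n
g = (np.arange(m) + 0.5) / m\n
gx, gy = np.meshgrid(g, g)\n
grid = np.column_stack([gx.ravel(), gy.ravel()])\n
p_hat = np.clip(mirror_kde(grid, data, h), kap1, kap2)   # clipped estimator (2.6)\n
H_hat = -np.mean(p_hat * np.log(p_hat))   # each of the m^2 cells has area 1/m^2\n
print(f'plug-in entropy estimate H(p_hat): {H_hat:.4f} (true H(p) = 0)')\n\n
Note that the nine reflections are generated as all combinations of the three maps t → t, t → −t, and t → 2 − t applied coordinatewise.\n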
It is easy to see that (2.8) implies the parametric √n-rate of convergence, E( |Ĥ − H(p)| ) = O(n^{-1/2}). The bandwidth h ≍ n^{-1/4} in the above theorem is different from the usual choice for optimal bivariate density estimation, which is h_P ≍ n^{-1/6} for the 2nd-order Hölder class. By using h ≍ n^{-1/4}, we undersmooth the density estimate. As we show in the next section, such a bandwidth choice is important for achieving the optimal rate for entropy estimation.\n\n
Let I(p) := I(X1; X2) be the Shannon mutual information, and define\n\n
I(p̂h) := ∫_{X1} ∫_{X2} p̂h(x1, x2) log( p̂h(x1, x2) / ( p̂h(x1) p̂h(x2) ) ) dx1 dx2.   (2.9)\n\n
The next corollary provides an exponential inequality for Shannon mutual information estimation.\n\n
Corollary 2.1. Under the same conditions as in Theorem 2.1, if we choose h ≍ n^{-1/4}, then there exists a constant N1 such that for all n > N1,\n\n
sup_{p∈Σκ(2,L)} P( |I(p̂h) − I(p)| > ε ) ≤ 6 exp( −nε²/(324κ²) ),   (2.10)\n\n
where κ = max{ |log κ1|, |log κ2| } + 1.\n\n
Proof. Using the same proof as for Theorem 2.1, we can show that (2.8) also holds for estimating the univariate entropies H(X1) and H(X2). The desired result then follows from the union bound, since I(p) := I(X1; X2) = H(X1) + H(X2) − H(X1, X2): if |I(p̂h) − I(p)| > ε, then at least one of the three entropy estimates must err by more than ε/3, and applying (2.8) at level ε/3 to each term gives 3 × 2 exp( −n(ε/3)²/(36κ²) ) = 6 exp( −nε²/(324κ²) ).\n\n
Remark 2.2. We use the same bandwidth h ≍ n^{-1/4} to estimate the bivariate density p(x1, x2) and the univariate densities p(x1), p(x2). A related result is presented in [15]. They consider the same problem setting as ours and also use a KDE based plug-in estimator to estimate the mutual information. However, unlike our proposal, they advocate the use of different bandwidths for bivariate and univariate entropy estimations. For the bivariate case they use h2 ≍ n^{-1/6}; for the univariate case they use h1 ≍ n^{-1/5}. Such bandwidths h1 and h2 are useful for optimally estimating the density functions. However, such a choice achieves a suboptimal rate in terms of mutual information estimation: sup_{p∈Σκ(2,L)} P( |I(p̂h) − I(p)| > ε ) ≤ c1 exp( −c2 n^{2/3} ε² ), where c1 and c2 are two constants. Our method achieves the faster parametric rate.\n\n
3 Theoretical Analysis\n\n
Here we present the detailed proof of Theorem 2.1. To analyze the error |H(p̂h) − H(p)|, we first decompose it into a bias or approximation error term, and a “variance” or estimation error term:\n\n
|H(p̂h) − H(p)| ≤ |H(p̂h) − E H(p̂h)| + |E H(p̂h) − H(p)|,   (3.1)\n\n
where the first term on the right-hand side is the variance and the second is the bias. We are going to show that\n\n
sup_{p∈Σκ(2,L)} P( |H(p̂h) − E H(p̂h)| > ε ) ≤ 2 exp( −nε²/(32κ²) ),   (3.2)\n\n
sup_{p∈Σκ(2,L)} |E H(p̂h) − H(p)| ≤ c1h² + c3/(nh²),   (3.3)\n\n
where c1 and c3 are two constants. Since the bound on the variance in (3.2) does not depend on h, to optimize the rate we only need to choose h to minimize the right-hand side of (3.3).\n
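\nTo make this trade-off explicit (a routine calculus step, written with the constants c1, c3 of (3.3); the minimizer is where the two terms balance):\n\n
d/dh [ c1h² + c3/(nh²) ] = 2c1h − 2c3/(nh³) = 0  ⇔  h⁴ = c3/(c1·n)  ⇔  h ≍ n^{-1/4},\n\n
at which point both terms are of order n^{-1/2}, so the bias bound (3.3) becomes O(n^{-1/2}), matching the variance tail (3.2).\n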
Therefore h ≍ n^{-1/4} achieves the optimal rate. In the rest of this section, we bound the bias and variance terms separately.\n\n
3.1 Analyzing the Bias Term\n\n
Here we prove (3.3). For a vector u, we denote the sup norm by ‖u‖∞. The next lemma bounds the integrated squared bias of the kernel density estimator over the support X := [0, 1]².\n\n
Lemma 3.1. Under Assumptions 2.1, 2.2, and 2.3, there exists a constant c > 0 such that\n\n
sup_{p∈Σκ(2,L)} ∫_X ( E p̃h(x) − p(x) )² dx ≤ ch⁴.   (3.4)\n\n
Proof. We partition the support X := [0, 1]² into three regions X = B ∪ C ∪ I, the boundary area B, the corner area C, and the interior area I:\n\n
C = { x : ‖x − u‖∞ ≤ h for u = (0, 0)ᵀ, (0, 1)ᵀ, (1, 0)ᵀ, or (1, 1)ᵀ },   (3.5)\n
B = { x : x is within distance h of an edge of X but does not belong to C },   (3.6)\n
I = X \ (C ∪ B).   (3.7)\n\n
We have the decomposition\n\n
∫_X ( E p̃h(x) − p(x) )² dx = ∫_I + ∫_B + ∫_C = T_I + T_B + T_C.\n\n
From standard results on kernel density estimation, we know that sup_{p∈Σκ(2,L)} T_I ≤ ch⁴. In the next two subsections, we bound T_B := ∫_B ( E p̃h(x) − p(x) )² dx and T_C := ∫_C ( E p̃h(x) − p(x) )² dx.\n\n
3.1.1 Analyzing T_B\n\n
Let A := { x : 0 ≤ x1 ≤ h and h ≤ x2 ≤ 1 − h }. By the symmetry of the four boundary strips, we have\n\n
T_B = ∫_B ( E p̃h(x) − p(x) )² dx ≤ c ∫_A ( E p̃h(x) − p(x) )² dx.   (3.8)\n\n
For x ∈ A, only two of the nine terms in (2.5) are nonzero, since K is supported on [−1, 1]:\n\n
p̃h(x) = (1/(nh²)) Σ_{i=1}^n [ K((x1 − x^i_1)/h) K((x2 − x^i_2)/h) + K((x1 + x^i_1)/h) K((x2 − x^i_2)/h) ].   (3.9)\n\n
Therefore, for x ∈ A we have\n\n
E p̃h(x) = (1/h²) ∫_0^1 ∫_0^1 [ K((x1 − t1)/h) + K((x1 + t1)/h) ] K((x2 − t2)/h) p(t1, t2) dt1 dt2\n
= ∫_{−x1/h}^{1} ∫_{−1}^{1} K(u1) K(u2) p(x1 + u1h, x2 + u2h) du1 du2 + ∫_{−1}^{−x1/h} ∫_{−1}^{1} K(u1) K(u2) p(−x1 − u1h, x2 − u2h) du1 du2.   (3.10)\n\n
Since p ∈ Σκ(2, L) and 0 ≤ x1 ≤ h, we have\n\n
| p(x1 + u1h, x2 + u2h) − p(x1, x2) − ⟨∇p(x), u⟩ h | ≤ L ‖u‖₂² h²,\n
| p(−x1 − u1h, x2 − u2h) − p(x1, x2) + (∂p(x)/∂x1)(2x1 + u1h) + (∂p(x)/∂x2)(u2h) | ≤ L [ (2 + u1)² + u2² ] h².\n\n
Since |u1|, |u2| ≤ 1, it follows that | p(x1 + u1h, x2 + u2h) − p(x1, x2) | ≤ |∂p(x)/∂x1| h + |∂p(x)/∂x2| h + 2Lh². Similarly, | p(−x1 − u1h, x2 − u2h) − p(x1, x2) | ≤ 3 |∂p(x)/∂x1| h + |∂p(x)/∂x2| h + 10Lh².\n\n
For any x ∈ A, since the two kernel integrals in (3.10) carry total mass one, we can bound the bias term\n\n
| E p̃h(x) − p(x) | ≤ ∫∫ K(u1) K(u2) | p(x1 + u1h, x2 + u2h) − p(x1, x2) | du1 du2 + ∫∫ K(u1) K(u2) | p(−x1 − u1h, x2 − u2h) − p(x1, x2) | du1 du2\n
≤ 10 |∂p(x)/∂x1| h + 2 |∂p(x)/∂x2| h + 12Lh²\n
≤ 12Lh² + 12Lh² = 24Lh²,\n\n
where the last inequality follows from the fact that |∂p(x)/∂x1|, |∂p(x)/∂x2| ≤ Lh for x ∈ A, by the Hölder condition and the assumption that the density p(x) has vanishing first order partial derivatives at the boundary. Since the area of A is at most h, we therefore have T_B ≤ ch⁵.\n\n
3.1.2 Analyzing T_C\n\n
Let A1 := { x : 0 ≤ x1, x2 ≤ h }. By the symmetry of the four corners, we now analyze the term T_C:\n\n
T_C = ∫_C ( E p̃h(x) − p(x) )² dx ≤ c ∫_{A1} ( E p̃h(x) − p(x) )² dx.   (3.15)\n\n
For notational simplicity, we write\n\n
U_{x,h}(a, b) = K((x1 − a)/h) K((x2 − b)/h).   (3.16)\n\n
For x ∈ A1, only four of the nine terms in (2.5) are nonzero, and we have\n\n
p̃h(x) = (1/(nh²)) Σ_{i=1}^n [ U_{x,h}(x^i_1, x^i_2) + U_{x,h}(−x^i_1, x^i_2) + U_{x,h}(x^i_1, −x^i_2) + U_{x,h}(−x^i_1, −x^i_2) ].   (3.17)\n\n
Therefore, for x ∈ A1 we have\n\n
E p̃h(x) = (1/h²) ∫_0^1 ∫_0^1 [ U_{x,h}(t1, t2) + U_{x,h}(−t1, t2) + U_{x,h}(t1, −t2) + U_{x,h}(−t1, −t2) ] p(t1, t2) dt1 dt2\n
= ∫_{−x1/h}^{1} ∫_{−x2/h}^{1} K(u1) K(u2) p(x1 + u1h, x2 + u2h) du1 du2 + ∫_{x1/h}^{1} ∫_{−x2/h}^{1} K(u1) K(u2) p(u1h − x1, x2 + u2h) du1 du2\n
+ ∫_{−x1/h}^{1} ∫_{x2/h}^{1} K(u1) K(u2) p(x1 + u1h, u2h − x2) du1 du2 + ∫_{x1/h}^{1} ∫_{x2/h}^{1} K(u1) K(u2) p(u1h − x1, u2h − x2) du1 du2.\n\n
Since K(·) is a symmetric kernel on [−1, 1], we have ∫_{−1}^{−x1/h} K(u1) du1 = ∫_{x1/h}^{1} K(u1) du1 and ∫_{−1}^{−x2/h} K(u2) du2 = ∫_{x2/h}^{1} K(u2) du2. Therefore the four ranges of integration above carry total kernel mass one, so that for x = (x1, x2)ᵀ ∈ A1,\n\n
p(x1, x2) = ( ∫_{−x1/h}^{1} ∫_{−x2/h}^{1} + ∫_{x1/h}^{1} ∫_{−x2/h}^{1} + ∫_{−x1/h}^{1} ∫_{x2/h}^{1} + ∫_{x1/h}^{1} ∫_{x2/h}^{1} ) p(x1, x2) K(u1) K(u2) du1 du2.\n\n
Using the fact that p ∈ Σκ(2, L), 0 ≤ x1, x2 ≤ h, and −1 ≤ u1, u2 ≤ 1, we have\n\n
| p(x1 + u1h, x2 + u2h) − p(x1, x2) | ≤ 4Lh²,\n
| p(u1h − x1, x2 + u2h) − p(x1, x2) | ≤ 20Lh²,\n
| p(x1 + u1h, u2h − x2) − p(x1, x2) | ≤ 20Lh²,\n
| p(u1h − x1, u2h − x2) − p(x1, x2) | ≤ 36Lh².\n\n
For x ∈ A1, subtracting the displayed identity for p(x1, x2) from E p̃h(x) term by term, we can then bound the bias as\n\n
| E p̃h(x) − p(x) | ≤ (4 + 20 + 20 + 36) Lh² = 80Lh².\n\n
Since the area of A1 is h², we therefore have T_C ≤ ch⁶.\n\n
Combining the analysis of T_B, T_C, and T_I, we obtain ∫_X ( E p̃h(x) − p(x) )² dx ≤ ch⁴ uniformly over Σκ(2, L); in particular, the mirror image kernel density estimator is free of boundary bias. Thus the desired result of Lemma 3.1 is proved.\n\n
3.1.3 Analyzing the Bias of the Entropy Estimator\n\n
Lemma 3.2. Under Assumptions 2.1, 2.2, and 2.3, there exists a universal constant C* that does not depend on the true density p, such that\n\n
sup_{p∈Σκ(2,L)} | E H(p̂h) − H(p) | ≤ C*/√n.   (3.35)\n\n
Proof. Recalling that g(u) = u log u, by Taylor's theorem we have\n\n
g(p̂h(x)) − g(p(x)) = ( log p(x) + 1 ) · ( p̂h(x) − p(x) ) + (1/(2ξ(x))) · ( p̂h(x) − p(x) )²,   (3.36)\n\n
where ξ(x) lies in between p̂h(x) and p(x). It is obvious that κ1 ≤ ξ(x) ≤ κ2. Let κ = max{ |log κ1|, |log κ2| } + 1 be as defined in the statement of Theorem 2.1, so that sup_{x∈X} |log p(x) + 1| ≤ κ. Using Fubini's theorem, Hölder's inequality, and the fact that the Lebesgue measure of X is 1, we have\n\n
| E H(p̂h) − H(p) | = | E ∫_X [ g(p̂h(x)) − g(p(x)) ] dx |\n
≤ | ∫_X ( log p(x) + 1 ) · E[ p̂h(x) − p(x) ] dx | + ∫_X (1/(2ξ(x))) · E[ p̂h(x) − p(x) ]² dx\n
≤ κ · √( ∫_X ( E p̂h(x) − p(x) )² dx ) + (1/(2κ1)) ∫_X E[ p̂h(x) − p(x) ]² dx\n
≤ κ · √( ∫_X ( E p̃h(x) − p(x) )² dx ) + (1/(2κ1)) ∫_X E[ p̃h(x) − p(x) ]² dx\n
≤ c1h² + c2h⁴ + c3/(nh²).\n\n
The replacement of p̂h by p̃h uses the fact that T_{κ1,κ2} is a contraction onto the interval [κ1, κ2], which contains p(x); the last inequality follows from standard results on kernel density estimation and Lemma 3.1, where c1, c2, c3 are three constants. We get the desired result by setting h ≍ n^{-1/4}, which makes all three terms of order at most n^{-1/2}.\n
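\nAs a quick numerical companion to Lemma 3.1 (a sketch in Python with numpy, using a one-dimensional analogue of (2.5) for brevity; the uniform target density, which satisfies Assumption 2.2, and the bandwidths are illustrative choices), one can check that the mirror-image correction removes the O(1) boundary bias exhibited by an uncorrected estimator:\n\n
import numpy as np\n
\n
rng = np.random.default_rng(1)\n
n = 100000\n
x = rng.uniform(size=n)          # p = 1 on [0, 1]; its boundary derivatives vanish\n
\n
def K(u):                        # Epanechnikov kernel, support [-1, 1]\n
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)\n
\n
def vanilla(t, h):\n
    return K((t - x) / h).sum() / (n * h)\n
\n
def mirrored(t, h):              # 1-D analogue of (2.5): reflect the sample across 0 and 1\n
    z = np.concatenate([x, -x, 2.0 - x])\n
    return K((t - z) / h).sum() / (n * h)\n
\n
for h in (0.2, 0.1, 0.05):\n
    print(f'h = {h:4.2f}: vanilla p_hat(0) = {vanilla(0.0, h):.3f}, '\n
          f'mirrored p_hat(0) = {mirrored(0.0, h):.3f} (target 1.0)')\n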
3.2 Analyzing the Variance Term\n\n
Lemma 3.3. Under Assumptions 2.1, 2.2, and 2.3, we have\n\n
sup_{p∈Σκ(2,L)} P( |H(p̂h) − E H(p̂h)| > ε ) ≤ 2 exp( −nε²/(32κ²) ).   (3.43)\n\n
Proof. Let p̂'h(x) be the kernel density estimator defined as in (2.6) but with the jth data point x^j replaced by an arbitrary value (x^j)'. Since g'(u) = log u + 1, by Assumption 2.1 we have\n\n
max[ |g'(p̂h(x))|, |g'(p̂'h(x))| ] ≤ κ.   (3.44)\n\n
For notational simplicity, we write the product kernel as K2 = K · K. Using the mean-value theorem and the fact that T_{κ1,κ2}(·) is a contraction, we have\n\n
sup_{x^1,...,x^n,(x^j)'} | H(p̂h) − H(p̂'h) | = sup | ∫_X [ g(p̂h(x)) − g(p̂'h(x)) ] dx |\n
≤ κ sup ∫_X | p̂h(x) − p̂'h(x) | dx\n
= κ sup ∫_X | T_{κ1,κ2}[p̃h(x)] − T_{κ1,κ2}[p̃'h(x)] | dx\n
≤ 4κ sup ∫_X | (1/(nh²)) K2((x^j − x)/h) − (1/(nh²)) K2(((x^j)' − x)/h) | dx\n
≤ (8κ/(nh²)) sup_y ∫ K2((y − x)/h) dx = (8κ/n) ∫ K2(u) du = 8κ/n,\n\n
where the factor 4 accounts for the at most four mirror terms of (2.5) whose kernels can intersect the unit square for a given data point. Therefore, changing one data point changes H(p̂h) by at most 8κ/n, and McDiarmid's inequality [16] gives\n\n
P( |H(p̂h) − E H(p̂h)| > ε ) ≤ 2 exp( −2ε² / ( n · (8κ/n)² ) ) = 2 exp( −nε²/(32κ²) ),\n\n
which is the desired inequality (3.43). The uniformity result holds since the constant does not depend on the true density p.\n\n
4 Application to Forest Density Estimation\n\n
We apply the concentration inequality (2.10) to analyze an algorithm for learning high dimensional forest graph models [15]. In a forest density estimation problem, we observe n data points x^1, . . . , x^n ∈ R^d from a d-dimensional random vector X. We have two learning tasks: (i) we want to estimate an acyclic undirected graph F = (V, E), where V is the vertex set containing all the random variables and E is the edge set, with an edge (j, k) ∈ E if and only if the corresponding random variables Xj and Xk are not conditionally independent given the other variables X_{V∖{j,k}}; (ii) once we have an estimated graph F̂, we want to estimate the density function p(x).\n\n
Using the negative log-likelihood loss, Liu et al. [15] show that the graph estimation problem can be recast as the problem of finding the maximum weight spanning forest of a weighted graph, where the weight of the edge connecting nodes j and k is I(Xj; Xk), the mutual information between these two variables. Empirically, we replace I(Xj; Xk) by its estimate Î(Xj; Xk) from (2.9). The forest graph can then be obtained by the Chow-Liu algorithm [3, 13], which is an iterative algorithm: at each iteration it adds the edge connecting the pair of variables with maximum estimated mutual information among all pairs not yet visited, provided that doing so does not form a cycle. When stopped early, after s < d − 1 edges have been added, it yields the best s-edge weighted forest.\n
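\nThe following is a minimal sketch of this maximum weight spanning forest step (Python with numpy; the union-find helper and the toy weight matrix mi are illustrative assumptions, with the pairwise estimates Î(Xj; Xk) from (2.9) assumed already computed):\n\n
import numpy as np\n
\n
def chow_liu_forest(mi, s):\n
    # Kruskal's algorithm on mutual-information weights: repeatedly add the\n
    # heaviest remaining edge that does not create a cycle; mi is a symmetric\n
    # d x d matrix of estimated mutual informations.\n
    d = mi.shape[0]\n
    parent = list(range(d))\n
\n
    def find(a):                      # union-find with path compression\n
        while parent[a] != a:\n
            parent[a] = parent[parent[a]]\n
            a = parent[a]\n
        return a\n
\n
    pairs = [(mi[j, k], j, k) for j in range(d) for k in range(j + 1, d)]\n
    forest = []\n
    for w, j, k in sorted(pairs, reverse=True):\n
        rj, rk = find(j), find(k)\n
        if rj != rk:                  # adding (j, k) keeps the graph acyclic\n
            parent[rj] = rk\n
            forest.append((j, k))\n
            if len(forest) == s:      # early stopping: best s-edge forest\n
                break\n
    return forest\n
\n
# toy weights: a chain 0-1-2 plus a weaker pair (3, 4); vertex 5 stays isolated\n
rng = np.random.default_rng(0)\n
mi = rng.uniform(0.0, 0.01, size=(6, 6))\n
mi = (mi + mi.T) / 2\n
mi[0, 1] = mi[1, 0] = 0.9\n
mi[1, 2] = mi[2, 1] = 0.8\n
mi[3, 4] = mi[4, 3] = 0.3\n
print(chow_liu_forest(mi, s=3))       # -> [(0, 1), (1, 2), (3, 4)]\n\n
Sorting all candidate pairs once and adding them acyclically is exactly Kruskal's algorithm [13]; stopping after s accepted edges yields the best s-edge forest.\n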
Once a forest graph F̂ = (V, Ê) is estimated, we propose to estimate the forest density as\n\n
p̂_F̂(x) = Π_{(j,k)∈Ê} [ p̃_{h2}(x_j, x_k) / ( p̃_{h2}(x_j) p̃_{h2}(x_k) ) ] · Π_{ℓ∈V∖Û} p̃_{h2}(x_ℓ) · Π_{u∈Û} p̃_{h1}(x_u),   (4.1)\n\n
where Û is the set of isolated vertices in the estimated forest F̂. Our estimator is different from the estimator proposed by [15]: once the graph F̂ is given, we treat the isolated variables differently than the connected variables, estimating the former with the univariate bandwidth h1 and all factors in the connected components with the common bandwidth h2. As will be shown in Theorem 4.2, such a choice leads to minimax optimal forest density estimation, while the rate obtained in [15] is suboptimal.\n\n
Let F^s_d denote the set of forest graphs with d nodes and no more than s edges. Let D(·‖·) be the Kullback-Leibler divergence. We define the s-oracle forest F*_s := (V, E*) and its corresponding oracle density estimator p_{F*} to be\n\n
F*_s = argmin_{F∈F^s_d} D(p ‖ p_F)   and   p_{F*} := Π_{(j,k)∈E*} [ p(x_j, x_k) / ( p(x_j) p(x_k) ) ] · Π_{ℓ∈V} p(x_ℓ).   (4.2)\n\n
Let Σκ(2, L) be defined as in Assumption 2.1. We define a density class Pκ as\n\n
Pκ := { p : p is a d-dimensional density with p(x_j, x_k) ∈ Σκ(2, L) for any j ≠ k }.   (4.3)\n\n
The next two theorems show that the above forest density estimation procedure is minimax optimal for both graph recovery and density estimation. Their proofs are provided in a technical report [14].\n\n
Theorem 4.1 (Graph Recovery). Let F̂ be the estimated s-edge forest graph using the Chow-Liu algorithm. Under the same condition as Theorem 12 in [15], if we choose h ≍ n^{-1/4} for the mutual information estimator in (2.9), then\n\n
sup_{p∈Pκ} P( F̂ ≠ F*_s ) = O( √(s/n) )   (4.4)\n\n
whenever (log d)/n → 0.\n\n
Theorem 4.2 (Density Estimation). Once the s-edge forest graph F̂ of Theorem 4.1 has been obtained, we calculate the density estimator (4.1) by choosing h1 ≍ n^{-1/5} and h2 ≍ n^{-1/6}. Then\n\n
sup_{p∈Pκ} E ∫_X | p̂_F̂(x) − p_{F*}(x) | dx ≤ C · √( s/n^{2/3} + (d − s)/n^{4/5} ).   (4.5)\n\n
5 Discussions and Conclusions\n\n
Theorem 4.1 allows d to increase exponentially fast as n increases and still guarantees graph recovery consistency. Theorem 4.2 provides the rate of convergence for the L1-risk; the obtained rate is minimax optimal over the class Pκ. The term s·n^{-2/3} corresponds to the price paid to estimate bivariate densities, while the term (d − s)·n^{-4/5} corresponds to the price paid to estimate univariate densities. In this way, we see that the exponential concentration inequality for Shannon mutual information leads to significantly improved theoretical analysis of forest density estimation, in terms of both graph estimation and density estimation.\n\n
This research was supported by NSF grant IIS-1116730 and AFOSR contract FA9550-09-1-0373.\n\n
References\n\n
[1] Ibrahim A. Ahmad and Pi-Erh Lin. A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.).
IEEE Transactions on Information Theory, 22(3):372–375, 1976.\n
[2] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.\n
[3] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.\n
[4] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.\n
[5] Paul P. B. Eggermont and Vincent N. LaRiccia. Best asymptotic normality of the kernel density entropy estimator for smooth densities. IEEE Transactions on Information Theory, 45(4):1321–1326, 1999.\n
[6] A. Gretton, R. Herbrich, and A. J. Smola. The kernel mutual information. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 4, pages IV-880. IEEE, 2003.\n
[7] Peter Hall and Sally Morton. On the estimation of entropy. Annals of the Institute of Statistical Mathematics, 45(1):69–88, 1993.\n
[8] A. O. Hero III, B. Ma, O. J. J. Michel, and J. Gorman. Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95, 2002.\n
[9] Harry Joe. Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics, 41(4):683–697, December 1989.\n
[10] M. C. Jones, O. Linton, and J. P. Nielsen. A simple bias reduction method for density estimation. Biometrika, 82(2):327–338, 1995.\n
[11] Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, Vladimir Protopopescu, and George Ostrouchov. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76:026209, August 2007.\n
[12] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, June 2004.\n
[13] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.\n
[14] Han Liu, John Lafferty, and Larry Wasserman. Optimal forest density estimation. Technical report, 2012.\n
[15] Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John D. Lafferty, and Larry A. Wasserman. Forest density estimation. Journal of Machine Learning Research, 12:907–951, 2011.\n
[16] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, number 141 in London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, August 1989.\n
[17] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.\n
[18] D. Pál, B. Póczos, and C. Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. arXiv preprint arXiv:1003.1954, 2010.\n
[19] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.\n
[20] Liam Paninski and Masanao Yajima. Undersmoothed kernel entropy estimators.
IEEE Transactions on Information Theory, 54(9):4384–4388, 2008.\n
[21] Barnabás Póczos and Jeff G. Schneider. Nonparametric estimation of conditional information and divergences. Journal of Machine Learning Research - Proceedings Track, 22:914–923, 2012.\n
[22] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York, NY, 1986.\n
[23] A. B. Tsybakov and E. C. van der Meulen. Root-n Consistent Estimators of Entropy for Densities with Unbounded Support, volume 23. Université catholique de Louvain, Institut de statistique, 1994.\n
[24] Marc M. Van Hulle. Edgeworth approximation of multivariate differential entropy. Neural Computation, 17(9):1903–1910, September 2005.\n
[25] O. Vasicek. A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38(1):54–59, 1976.\n
[26] Bert van Es. Estimating functionals related to a density by a class of statistics based on spacings. Scandinavian Journal of Statistics, 19(1):61–72, 1992.\n", "award": [], "sourceid": 1214, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}