{"title": "Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1096, "abstract": null, "full_text": "Estimating divergence functionals and the likelihood\n\nratio by penalized convex risk minimization\n\nXuanLong Nguyen\n\nSAMSI & Duke University\n\nMartin J. Wainwright\n\nUC Berkeley\n\nMichael I. Jordan\n\nUC Berkeley\n\nAbstract\n\nWe develop and analyze an algorithm for nonparametric estimation of divergence\nfunctionals and the density ratio of two probability distributions. Our method is\nbased on a variational characterization of f-divergences, which turns the estima-\ntion into a penalized convex risk minimization problem. We present a derivation\nof our kernel-based estimation algorithm and an analysis of convergence rates for\nthe estimator. Our simulation results demonstrate the convergence behavior of the\nmethod, which compares favorably with existing methods in the literature.\n\n1 Introduction\n\nAn important class of \u201cdistances\u201d between multivariate probability distributions P and Q are the Ali-\nSilvey or f-divergences [1, 6]. These divergences, to be de\ufb01ned formally in the sequel, are all of the\n\nform D\u03c6(P, Q) = R \u03c6(dQ/dP)dP, where \u03c6 is a convex function of the likelihood ratio. This family,\n\nincluding the Kullback-Leibler (KL) divergence and the variational distance as special cases, plays\nan important role in various learning problems, including classi\ufb01cation, dimensionality reduction,\nfeature selection and independent component analysis. For all of these problems, if f-divergences\nare to be used as criteria of merit, one has to be able to estimate them ef\ufb01ciently from data.\n\nWith this motivation, the focus of paper is the problem of estimating an f-divergence based on i.i.d.\nsamples from each of the distributions P and Q. 
Our starting point is a variational characterization of f-divergences, which allows our problem to be tackled via an M-estimation procedure. Specifically, the likelihood ratio function dP/dQ and the divergence functional D_φ(P, Q) can be estimated by solving a convex minimization problem over a function class. In this paper, we estimate the likelihood ratio and the KL divergence by optimizing a penalized convex risk. In particular, we restrict the estimate to a bounded subset of a reproducing kernel Hilbert space (RKHS) [17]. The RKHS is sufficiently rich for many applications, and also allows for computationally efficient optimization procedures. The resulting estimator is nonparametric, in that it entails no strong assumptions on the form of P and Q, except that the likelihood ratio function is assumed to belong to the RKHS.

The bulk of this paper is devoted to the derivation of the algorithm and a theoretical analysis of the performance of our estimator. The key to our analysis is a basic inequality relating a performance metric (the Hellinger distance) of our estimator to the suprema of two empirical processes (with respect to P and Q) defined on a function class of density ratios. Convergence rates are then obtained using techniques for analyzing nonparametric M-estimators from empirical process theory [20].

Related work. The variational representation of divergences has been derived independently and exploited by several authors [5, 11, 14]. Broniatowski and Keziou [5] studied testing and estimation problems based on dual representations of f-divergences, but working in a parametric setting as opposed to the nonparametric framework considered here. Nguyen et al. 
[14] established a one-to-one correspondence between the family of f-divergences and the family of surrogate loss functions [2], through which the (optimum) \"surrogate risk\" is equal to the negative of an associated f-divergence. Another link is to the problem of estimating integral functionals of a single density, the Shannon entropy being a well-known example, which has been studied extensively, dating back to early work [9, 13] as well as the more recent work [3, 4, 12]. See also [7, 10, 8] for the problem of (Shannon) entropy functional estimation. In another branch of related work, Wang et al. [22] proposed an algorithm for estimating the KL divergence of continuous distributions, which exploits histogram-based estimation of the likelihood ratio by building data-dependent partitions of equivalent (empirical) Q-measure. Their estimator was empirically shown to outperform direct plug-in methods, but no theoretical results on its convergence rate were provided.

This paper is organized as follows. Sec. 2 provides background on f-divergences. In Sec. 3, we describe an estimation procedure based on penalized risk minimization and the accompanying convergence rate analysis. In Sec. 4, we derive and implement efficient algorithms for solving these problems using an RKHS. Sec. 5 outlines the proof of the analysis. In Sec. 6, we illustrate the behavior of our estimator and compare it to other methods via simulations.

2 Background

We begin by defining f-divergences, and then provide a variational representation of the f-divergence, which we later exploit to develop an M-estimator. Consider two distributions P and Q, both assumed to be absolutely continuous with respect to Lebesgue measure μ, with positive densities p0 and q0, respectively, on some compact domain X ⊂ R^d. 
The class of Ali-Silvey or f-divergences [6, 1] are \"distances\" of the form:

D_φ(P, Q) = ∫ p0 φ(q0/p0) dμ,    (1)

where φ : R → R̄ is a convex function. Different choices of φ result in many divergences that play important roles in information theory and statistics, including the variational distance, Hellinger distance, KL divergence and so on (see, e.g., [19]). As an important example, the Kullback-Leibler (KL) divergence between P and Q is given by D_K(P, Q) = ∫ p0 log(p0/q0) dμ, corresponding to the choice φ(t) = −log(t) for t > 0 and +∞ otherwise.

Variational representation: Since φ is a convex function, by Legendre-Fenchel convex duality [16] we can write φ(u) = sup_{v∈R} (uv − φ*(v)), where φ* is the convex conjugate of φ. As a result,

D_φ(P, Q) = ∫ p0 sup_f (f q0/p0 − φ*(f)) dμ = sup_f ( ∫ f dQ − ∫ φ*(f) dP ),

where the supremum is taken over all measurable functions f : X → R, and ∫ f dP denotes the expectation of f under the distribution P. Denoting by ∂φ the subdifferential [16] of the convex function φ, it can be shown that the supremum is achieved for functions f such that q0/p0 ∈ ∂φ*(f), where q0, p0 and f are evaluated at any x ∈ X. By convex duality [16], this is true if f ∈ ∂φ(q0/p0) for any x ∈ X. Thus, we have proved [15, 11]:

Lemma 1. Letting F be any class of functions X → R, there holds:

D_φ(P, Q) ≥ sup_{f∈F} ∫ f dQ − ∫ φ*(f) dP,    (2)

with equality if F ∩ ∂φ(q0/p0) ≠ ∅.

To illustrate this result in the special case of the KL divergence, here the function φ has the form φ(u) = −log(u) for u > 0 and +∞ for u ≤ 0. 
The convex dual of φ is φ*(v) = sup_u (uv − φ(u)) = −1 − log(−v) if v < 0 and +∞ otherwise. By Lemma 1,

D_K(P, Q) = sup_{f<0} ∫ f dQ − ∫ (−1 − log(−f)) dP = sup_{g>0} ∫ log g dP − ∫ g dQ + 1.    (3)

In addition, the supremum is attained at g = p0/q0.

3 Penalized M-estimation of the KL divergence and the density ratio

Let X1, . . . , Xn be a collection of n i.i.d. samples from the distribution Q, and let Y1, . . . , Yn be n i.i.d. samples drawn from the distribution P. Our goal is to develop an estimator of the KL divergence and the density ratio g0 = p0/q0 based on the samples {Xi}_{i=1}^n and {Yi}_{i=1}^n.

The variational representation in Lemma 1 motivates the following estimator of the KL divergence. First, let G be a class of functions X → R+. We then compute

D̂_K = sup_{g∈G} ∫ log g dPn − ∫ g dQn + 1,    (4)

where ∫ dPn and ∫ dQn denote expectations under the empirical measures Pn and Qn, respectively. If the supremum is attained at ĝn, then ĝn serves as an estimator of the density ratio g0 = p0/q0.

In practice, the \"true\" size of G is not known. Accordingly, our approach in this paper is an alternative one based on controlling the size of G by using penalties. More precisely, let I(g) be a non-negative measure of complexity for g such that I(g0) < ∞. We decompose the function class G as follows:

G = ∪_{1≤M≤∞} G_M,    (5)

where G_M := {g | I(g) ≤ M} is a ball determined by I(·). The estimation procedure involves solving the following program:

ĝn = argmin_{g∈G} ∫ g dQn − ∫ log g dPn + (λn/2) I²(g),    (6)

where λn > 0 is a regularization parameter. The minimizing argument ĝn is plugged into (4) to obtain an estimate of the KL divergence D_K.

For the KL divergence, the difference |D̂_K − D_K(P, Q)| is a natural performance measure. 
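As a quick numerical illustration of the variational representation (3) (a hypothetical check, not one of the paper's experiments): for P = N(1, 1) and Q = N(0, 1), the true ratio g0(x) = p0(x)/q0(x) = exp(x − 1/2) attains the supremum, recovering D_K(P, Q) = 1/2, while any other g yields a strictly smaller objective value, in accordance with Lemma 1. A minimal Python sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.normal(1.0, 1.0, n)  # samples from P = N(1, 1)
x = rng.normal(0.0, 1.0, n)  # samples from Q = N(0, 1)

def lower_bound(g, x, y):
    """Monte Carlo value of the objective in (3): E_P[log g] - E_Q[g] + 1."""
    return np.mean(np.log(g(y))) - np.mean(g(x)) + 1.0

g_true = lambda t: np.exp(t - 0.5)         # p0/q0 for N(1,1) vs N(0,1)
g_mis = lambda t: np.exp(0.5 * t - 0.125)  # misspecified ratio (that of N(0.5,1) vs N(0,1))

b_true = lower_bound(g_true, x, y)  # concentrates near KL(P, Q) = 0.5
b_mis = lower_bound(g_mis, x, y)    # concentrates near 0.375 < 0.5: bound not attained
print(b_true, b_mis)
```

With 2·10^5 samples the first value lands close to 0.5 and the second close to 0.375, illustrating that the objective is maximized only at the true density ratio.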
For estimating the density ratio, various metrics are possible. Viewing g0 = p0/q0 as a density function with respect to the measure Q, one useful metric is the (generalized) Hellinger distance:

h²_Q(g0, g) := (1/2) ∫ (g0^{1/2} − g^{1/2})² dQ.    (7)

For the analysis, several assumptions are in order. First, assume that g0 (though not all of G) is bounded from above and below:

0 < η0 ≤ g0 ≤ η1 for some constants η0, η1.    (8)

Next, the uniform norm on G_M is Lipschitz with respect to the penalty measure I(g), i.e.:

sup_{g∈G_M} |g|_∞ ≤ cM for any M ≥ 1.    (9)

Finally, a condition on the bracket entropy of G [21]: for some 0 < γ < 2,

H^B_δ(G_M, L2(Q)) = O(M/δ)^γ for any δ > 0.    (10)

The following is our main theoretical result, whose proof is given in Section 5:

Theorem 2. (a) Under assumptions (8), (9) and (10), and letting λn → 0 so that

λn^{−1} = O_P(n^{2/(2+γ)})(1 + I(g0)),    (11)

then under P:

h_Q(g0, ĝn) = O_P(λn^{1/2})(1 + I(g0)),   I(ĝn) = O_P(1 + I(g0)).

(b) If, in addition to (8), (9) and (10), there holds inf_{g∈G} g(x) ≥ η0 for any x ∈ X, then

|D̂_K − D_K(P, Q)| = O_P(λn^{1/2})(1 + I(g0)).

4 Algorithm: Optimization and dual formulation

G is an RKHS. Our algorithm involves solving program (6) for some choice of function class G. In our implementation, the relevant function classes are taken to be a reproducing kernel Hilbert space induced by a Gaussian kernel. RKHSs are chosen because they are sufficiently rich [17] and, as in many learning tasks, quite amenable to efficient optimization procedures [18].

Let K : X × X → R be a Mercer kernel function [17]. Thus, K is associated with a feature map Φ : X → H, where H is a Hilbert space with inner product ⟨·, ·⟩ and, for all x, x′ ∈ X, K(x, x′) = ⟨Φ(x), Φ(x′)⟩. 
As a reproducing kernel Hilbert space, any function g ∈ H can be expressed as an inner product g(x) = ⟨w, Φ(x)⟩, where ‖g‖_H = ‖w‖_H. A kernel used in our simulations is the Gaussian kernel:

K(x, y) := e^{−‖x−y‖²/σ},

where ‖·‖ is the Euclidean metric in R^d, and σ > 0 is a parameter of the function class. Let G := H, and let the complexity measure be I(g) = ‖g‖_H. Thus, Eq. (6) becomes:

min_w J := min_w (1/n) Σ_{i=1}^n ⟨w, Φ(xi)⟩ − (1/n) Σ_{j=1}^n log⟨w, Φ(yj)⟩ + (λn/2) ‖w‖²_H,    (12)

where {xi} and {yj} are realizations of empirical data drawn from Q and P, respectively. The log function is extended to take the value −∞ for negative arguments.

Lemma 3. min_w J has the following dual form:

−min_{α>0} Σ_{j=1}^n (−1/n − (1/n) log nαj) + (1/(2λn)) Σ_{i,j} αi αj K(yi, yj) + (1/(2λn n²)) Σ_{i,j} K(xi, xj) − (1/(λn n)) Σ_{i,j} αj K(xi, yj).

Proof. Let ψi(w) := (1/n)⟨w, Φ(xi)⟩, φj(w) := −(1/n) log⟨w, Φ(yj)⟩, and Ω(w) := (λn/2)‖w‖²_H. We have

min_w J = −max_w (⟨0, w⟩ − J(w)) = −J*(0) = −min_{ui,vj} Σ_{i=1}^n ψi*(ui) + Σ_{j=1}^n φj*(vj) + Ω*(−Σ_{i=1}^n ui − Σ_{j=1}^n vj),

where the last line is due to the inf-convolution theorem [16]. 
Simple calculations yield:

φj*(v) = −1/n − (1/n) log nαj if v = −αj Φ(yj), and +∞ otherwise;
ψi*(u) = 0 if u = (1/n) Φ(xi), and +∞ otherwise;
Ω*(v) = (1/(2λn)) ‖v‖²_H.

So min_w J = −min_{αi} Σ_{j=1}^n (−1/n − (1/n) log nαj) + (1/(2λn)) ‖Σ_{j=1}^n αj Φ(yj) − (1/n) Σ_{i=1}^n Φ(xi)‖²_H, which implies the lemma immediately.

If α̂ is the solution of the dual formulation, it is not difficult to show that the optimal ŵ is attained at ŵ = (1/λn) (Σ_{j=1}^n α̂j Φ(yj) − (1/n) Σ_{i=1}^n Φ(xi)).

For an RKHS based on a Gaussian kernel, the entropy condition (10) holds for any γ > 0 [23]. Furthermore, (9) trivially holds via the Cauchy-Schwarz inequality: |g(x)| = |⟨w, Φ(x)⟩| ≤ ‖w‖_H ‖Φ(x)‖_H ≤ I(g) √K(x, x) ≤ I(g). Thus, by Theorem 2(a), ‖ŵ‖_H = ‖ĝn‖_H = O_P(‖g0‖_H), so the penalty term λn‖ŵ‖² vanishes at the same rate as λn. We have arrived at the following estimator for the KL divergence:

D̂_K = 1 + Σ_{j=1}^n (−1/n − (1/n) log nα̂j) = Σ_{j=1}^n −(1/n) log nα̂j.

log G is an RKHS. Alternatively, we could set log G to be the RKHS, letting g(x) = exp⟨w, Φ(x)⟩, and letting I(g) = ‖log g‖_H = ‖w‖_H. 
Theorem 2 is not applicable in this case, because condition (9) no longer holds, but this choice nonetheless seems reasonable and worth investigating, because in effect we have a far richer function class, which might improve the bias of our estimator when the true density ratio is not very smooth.

A derivation similar to the previous case yields the following convex program:

min_w J := min_w (1/n) Σ_{i=1}^n e^{⟨w, Φ(xi)⟩} − (1/n) Σ_{j=1}^n ⟨w, Φ(yj)⟩ + (λn/2) ‖w‖²_H
= −min_{α>0} Σ_{i=1}^n αi log(nαi) − αi + (1/(2λn)) ‖Σ_{i=1}^n αi Φ(xi) − (1/n) Σ_{j=1}^n Φ(yj)‖²_H.

Letting α̂ be the solution of the above convex program, the KL divergence can be estimated by:

D̂_K = 1 + Σ_{i=1}^n α̂i log α̂i + α̂i log(n/e).

5 Proof of Theorem 2

We now sketch the proof of the main theorem. The key to our analysis is the following lemma:

Lemma 4. If ĝn is an estimate of g using (6), then:

(1/4) h²_Q(g0, ĝn) + (λn/2) I²(ĝn) ≤ −∫ (ĝn − g0) d(Qn − Q) + 2 ∫ log((ĝn + g0)/(2g0)) d(Pn − P) + (λn/2) I²(g0).

Proof. 
Define d_l(g0, g) = ∫ (g − g0) dQ − ∫ log(g/g0) dP. Note that for x > 0, (1/2) log x ≤ √x − 1. Thus, ∫ log(g/g0) dP ≤ 2 ∫ (g^{1/2} g0^{−1/2} − 1) dP. As a result, for any g, d_l is related to h_Q as follows:

d_l(g0, g) ≥ ∫ (g − g0) dQ − 2 ∫ (g^{1/2} g0^{−1/2} − 1) dP = ∫ (g − g0) dQ − 2 ∫ (g^{1/2} g0^{1/2} − g0) dQ = ∫ (g^{1/2} − g0^{1/2})² dQ = 2 h²_Q(g0, g).

By the definition (6) of our estimator, we have:

∫ ĝn dQn − ∫ log ĝn dPn + (λn/2) I²(ĝn) ≤ ∫ g0 dQn − ∫ log g0 dPn + (λn/2) I²(g0).

Both sides (modulo the regularization term I²) are convex functionals of g. By Jensen's inequality, if F is a convex function, then F((u + v)/2) − F(v) ≤ (F(u) − F(v))/2. We obtain:

∫ ((ĝn + g0)/2) dQn − ∫ log((ĝn + g0)/2) dPn + (λn/4) I²(ĝn) ≤ ∫ g0 dQn − ∫ log g0 dPn + (λn/4) I²(g0).

Rearranging,

∫ ((ĝn − g0)/2) d(Qn − Q) − ∫ log((g0 + ĝn)/(2g0)) d(Pn − P) + (λn/4) I²(ĝn)
≤ −∫ ((ĝn − g0)/2) dQ + ∫ log((g0 + ĝn)/(2g0)) dP + (λn/4) I²(g0)
= −d_l(g0, (g0 + ĝn)/2) + (λn/4) I²(g0)
≤ −2 h²_Q(g0, (g0 + ĝn)/2) + (λn/4) I²(g0)
≤ −(1/8) h²_Q(g0, ĝn) + (λn/4) I²(g0),

where the last inequality is a standard result for the (generalized) Hellinger distance (cf. [20]); multiplying through by 2 and rearranging yields the claim.

Let us now proceed to part (a) of the theorem. 
Define fg := log((g + g0)/(2g0)), and let F_M := {fg | g ∈ G_M}. Since fg is a Lipschitz function of g, conditions (8) and (10) imply that

H^B_δ(F_M, L2(P)) = O(M/δ)^γ.    (13)

Applying Lemma 5.14 of [20] with the distance metric d2(g0, g) = ‖g − g0‖_{L2(Q)}, the following holds under Q (and hence under P as well, since dP/dQ is bounded from above):

sup_{g∈G} |∫ (g − g0) d(Qn − Q)| / [ n^{−1/2} d2(g0, g)^{1−γ/2} (1 + I(g) + I(g0))^{γ/2} ∨ n^{−2/(2+γ)} (1 + I(g) + I(g0)) ] = O_P(1).    (14)

In the same vein, we obtain that under the P measure:

sup_{g∈G} |∫ fg d(Pn − P)| / [ n^{−1/2} d2(g0, g)^{1−γ/2} (1 + I(g) + I(g0))^{γ/2} ∨ n^{−2/(2+γ)} (1 + I(g) + I(g0)) ] = O_P(1).    (15)

By condition (9), we have d2(g0, g) = ‖g − g0‖_{L2(Q)} ≤ 2c^{1/2} (1 + I(g) + I(g0))^{1/2} h_Q(g0, g). Combining Lemma 4 and Eqs. (14) and (15), we obtain the following:

(1/4) h²_Q(g0, ĝn) + (λn/2) I²(ĝn) ≤ λn I(g0)²/2 + O_P( n^{−1/2} h_Q(g0, ĝn)^{1−γ/2} (1 + I(ĝn) + I(g0))^{1/2+γ/4} ∨ n^{−2/(2+γ)} (1 + I(ĝn) + I(g0)) ).    (16)

From this point, the proof involves simple algebraic manipulation of (16). To simplify notation, let ĥ = h_Q(g0, ĝn), Î = I(ĝn), and I0 = I(g0). There are four possibilities:

Case a. ĥ ≥ n^{−1/(2+γ)}(1 + Î + I0)^{1/2} and Î ≥ 1 + I0. From (16), either

ĥ²/4 + λn Î²/2 ≤ O_P(n^{−1/2}) ĥ^{1−γ/2} Î^{1/2+γ/4}   or   ĥ²/4 + λn Î²/2 ≤ λn I0²/2,

which implies, respectively, either

ĥ ≤ λn^{−1/2} O_P(n^{−2/(2+γ)}), Î ≤ λn^{−1} O_P(n^{−2/(2+γ)})   or   ĥ ≤ O_P(λn^{1/2} I0), Î ≤ O_P(I0).

Both scenarios conclude the proof if we set λn^{−1} = O_P(n^{2/(γ+2)}(1 + I0)).

Case b. ĥ ≥ n^{−1/(2+γ)}(1 + Î + I0)^{1/2} and Î < 1 + I0. 
From (16), either

ĥ²/4 + λn Î²/2 ≤ O_P(n^{−1/2}) ĥ^{1−γ/2} (1 + I0)^{1/2+γ/4}   or   ĥ²/4 + λn Î²/2 ≤ λn I0²/2,

which implies, respectively, either

ĥ ≤ (1 + I0)^{1/2} O_P(n^{−1/(γ+2)}), Î ≤ 1 + I0   or   ĥ ≤ O_P(λn^{1/2} I0), Î ≤ O_P(I0).

Both scenarios conclude the proof if we set λn^{−1} = O_P(n^{2/(γ+2)}(1 + I0)).

Case c. ĥ ≤ n^{−1/(2+γ)}(1 + Î + I0)^{1/2} and Î ≥ 1 + I0. From (16),

ĥ²/4 + λn Î²/2 ≤ O_P(n^{−2/(2+γ)}) Î,

which implies that ĥ ≤ O_P(n^{−1/(2+γ)}) Î^{1/2} and Î ≤ λn^{−1} O_P(n^{−2/(2+γ)}). This means that ĥ ≤ O_P(λn^{1/2})(1 + I0) and Î ≤ O_P(1 + I0) if we set λn^{−1} = O_P(n^{2/(2+γ)})(1 + I0).

Case d. ĥ ≤ n^{−1/(2+γ)}(1 + Î + I0)^{1/2} and Î ≤ 1 + I0. Part (a) of the theorem is immediate.

Finally, part (b) is a simple consequence of part (a), using the same argument as in Thm. 9 of [15].

6 Simulation results

In this section, we describe the results of various simulations that demonstrate the practical viability of our estimators, as well as their convergence behavior. We experimented with our estimators using various choices of P and Q, including Gaussian, beta, mixture of Gaussians, and multivariate Gaussian distributions. Here we report results in terms of KL estimation error. For each of the eight estimation problems described here, we experiment with increasing sample sizes (the sample size n ranges from 100 to 10^4 or more). 
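As a concrete reference point for the procedures compared below, the M1 estimator can be implemented in a few dozen lines by solving the dual of Lemma 3 numerically. The sketch below is an illustrative reimplementation, not the authors' code: it substitutes αj = exp(βj) to enforce α > 0, minimizes with scipy's L-BFGS-B, and returns D̂_K = −(1/n) Σj log(n α̂j); the kernel width σ and rate λn = 1/n follow the paper's conventions, but the specific values here are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def gauss_kernel(a, b, sigma):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / sigma) for 1-D samples."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma)

def kl_estimate_m1(x, y, sigma=1.0, lam=None):
    """Estimate KL(P, Q) from y ~ P and x ~ Q via the dual of program (12)."""
    n = len(x)
    lam = 1.0 / n if lam is None else lam
    Kyy = gauss_kernel(y, y, sigma)
    Kxy = gauss_kernel(x, y, sigma)              # rows: x_i, columns: y_j
    const = gauss_kernel(x, x, sigma).sum() / (2 * lam * n * n)
    kx = Kxy.sum(axis=0) / (lam * n)             # (1/(lam n)) sum_i K(x_i, y_j)

    def obj_grad(beta):
        a = np.exp(beta)                         # alpha_j > 0 by construction
        Ka = Kyy @ a
        f = (-1.0 - np.log(n * a).sum() / n
             + a @ Ka / (2 * lam) + const - kx @ a)
        grad = -1.0 / n + a * (Ka / lam - kx)    # chain rule through alpha = e^beta
        return f, grad

    res = minimize(obj_grad, np.full(n, -np.log(n)), jac=True,
                   method="L-BFGS-B", bounds=[(-30.0, 30.0)] * n)
    alpha = np.exp(res.x)
    return -np.mean(np.log(n * alpha))           # estimator derived in Section 4

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)   # Q = N(0, 1)
y = rng.normal(1.0, 1.0, 200)   # P = N(1, 1); true KL(P, Q) = 0.5
dk = kl_estimate_m1(x, y)
print(dk)
```

The box bounds on β are only a numerical safeguard against overflow; at an interior optimum, stationarity gives ĝ(yj) = 1/(n α̂j), matching the estimator above.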
Error bars are obtained by replicating each set-up 250 times. For all simulations, we report our estimator's performance using the simple fixed rate λn ∼ 1/n, noting that this may be a suboptimal rate. We set the kernel width to be relatively small (σ = .1) for one-dimensional data, and larger for higher dimensions. We use M1 to denote the method in which G is the RKHS, and M2 for the method in which log G is the RKHS. Our methods are compared to algorithm A in Wang et al. [22], which was shown empirically to be one of the best methods in the literature. Their method, denoted by WKV, is based on data-dependent partitioning of the covariate space. Naturally, the performance of WKV is critically dependent on the amount s of data allocated to each partition; here we report results with s ∼ n^γ, where γ = 1/3, 1/2, 2/3.

[Figure 1 here: eight panels of estimated KL versus sample size, comparing M1, M2 and WKV (for several choices of s) for KL(Beta(1,2), Unif[0,1]), KL(1/2 Nt(0,1) + 1/2 Nt(1,1), Unif[−5,5]), KL(Nt(0,1), Nt(4,2)), KL(Nt(4,2), Nt(0,1)), KL(Nt(0,I2), Unif[−3,3]²), KL(Nt(1,I2), Nt(0,I2)), KL(Nt(0,I3), Unif[−3,3]³), and KL(Nt(1,I3), Nt(0,I3)).]

Figure 1. Results of estimating KL divergences for various choices of probability distributions. In all plots, the X-axis is the number of data points plotted on a log scale, and the Y-axis is the estimated value. The error bar is obtained by replicating the experiment 250 times. Nt(a, Ik) denotes a truncated normal distribution of k dimensions with mean (a, . . . , a) and identity covariance matrix.

The first four plots present results for univariate distributions. In the first two, our estimators M1 and M2 appear to have a faster convergence rate than WKV. The WKV estimator performs very well in the third example, but rather badly in the fourth example. The next four plots present results with two- and three-dimensional data. Again, M1 has the best convergence rate in all examples. The M2 estimator does not converge in the last example, suggesting that the underlying function class exhibits very strong bias. 
The WKV method converges slowly despite the different choices of partition size. It is worth noting that as one increases the number of dimensions, histogram-based methods such as WKV become increasingly difficult to implement, whereas increasing dimension has only a mild effect on our method.

References

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. Series B, 28:131–142, 1966.
[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
[3] P. Bickel and Y. Ritov. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā Ser. A, 50:381–393, 1988.
[4] L. Birgé and P. Massart. Estimation of integral functionals of a density. Ann. Statist., 23(1):11–29, 1995.
[5] M. Broniatowski and A. Keziou. Parametric estimation and tests through divergences. Technical report, LSTA, Université Pierre et Marie Curie, 2004.
[6] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299–318, 1967.
[7] L. Györfi and E. C. van der Meulen. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis, 5:425–436, 1987.
[8] P. Hall and S. Morton. On estimation of entropy. Ann. Inst. Statist. Math., 45(1):69–88, 1993.
[9] I. A. Ibragimov and R. Z. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41–52, 1978.
[10] H. Joe. Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math., 41:683–697, 1989.
[11] A. Keziou. Dual representation of φ-divergences and applications. C. R. 
Acad. Sci. Paris, Ser. I 336, pages 857–862, 2003.
[12] B. Laurent. Efficient estimation of integral functionals of a density. Ann. Statist., 24(2):659–681, 1996.
[13] B. Ya. Levit. Asymptotically efficient estimation of nonlinear functionals. Problems Inform. Transmission, 14:204–209, 1978.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On divergences, surrogate losses and decentralized detection. Technical Report 695, Dept. of Statistics, UC Berkeley, October 2005.
[15] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Nonparametric estimation of the likelihood ratio and divergence functionals. In International Symposium on Information Theory (ISIT), 2007.
[16] G. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
[17] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman, Harlow, UK, 1988.
[18] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[19] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602–1609, 2000.
[20] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
[21] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.
[22] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.
[23] D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739–767, 2002.
", "award": [], "sourceid": 782, "authors": [{"given_name": "XuanLong", "family_name": "Nguyen", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}