{"title": "A General Framework for Symmetric Property Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 12447, "page_last": 12457, "abstract": "In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties we identify the {\\em easy} region where empirical estimation works and the {\\em difficult} region where more complex estimators are required. We show that by approximately computing the profile maximum likelihood (PML) distribution \\cite{ADOS16} in this difficult region we obtain a symmetric property estimation framework that is sample complexity optimal for many properties in a broader parameter regime than previous universal estimation approaches based on PML. The resulting algorithms based on these \\emph{pseudo PML distributions} are also more practical.", "full_text": "A General Framework for Symmetric Property\n\nEstimation\n\nMoses Charikar\nStanford University\n\nmoses@cs.stanford.edu\n\nKirankumar Shiragur\n\nStanford University\n\nshiragur@stanford.edu\n\nAaron Sidford\n\nStanford University\n\nsidford@stanford.edu\n\nAbstract\n\nIn this paper we provide a general framework for estimating symmetric properties\nof distributions from i.i.d. samples. For a broad class of symmetric properties we\nidentify the easy region where empirical estimation works and the dif\ufb01cult region\nwhere more complex estimators are required. We show that by approximately\ncomputing the pro\ufb01le maximum likelihood (PML) distribution [ADOS16] in this\ndif\ufb01cult region we obtain a symmetric property estimation framework that is\nsample complexity optimal for many properties in a broader parameter regime than\nprevious universal estimation approaches based on PML. 
The resulting algorithms based on these pseudo PML distributions are also more practical.

1 Introduction

Symmetric property estimation is a fundamental and well studied problem in machine learning and statistics. In this problem, we are given $n$ i.i.d. samples from an unknown distribution$^1$ $p$ and asked to estimate $f(p)$, where $f$ is a symmetric property (i.e. it does not depend on the labels of the symbols). Over the past few years, the computational and sample complexities for estimating many symmetric properties have been extensively studied. Estimators with optimal sample complexities have been obtained for several properties including entropy [VV11b, WY16a, JVHW15], distance to uniformity [VV11a, JHW16], and support [VV11b, WY15].

All the aforementioned estimators were property specific, and therefore a natural question is to design a universal estimator. In [ADOS16], the authors showed that the distribution that maximizes the profile likelihood, i.e. the likelihood of the multiset of frequencies of elements in the sample, referred to as the profile maximum likelihood (PML) distribution, can be used as a universal plug-in estimator. [ADOS16] showed that computing the symmetric property on the PML distribution is sample complexity optimal for estimating support, support coverage, entropy and distance to uniformity within accuracy $\epsilon > 1/n^{0.2499}$. Further, this also holds for distributions that approximately optimize the PML objective, with the approximation factor affecting the values of $\epsilon$ for which it holds.

Acharya et al. [ADOS16] posed two important and natural open questions. The first was to give an efficient algorithm for finding an approximate PML distribution, which was recently resolved in [CSS19]. The second open question is whether PML is sample competitive in all regimes of the accuracy parameter $\epsilon$.
In this work, we make progress towards resolving this open question. Firstly, we show that the PML distribution based plug-in estimator achieves optimal sample complexity for all $\epsilon$ for the problem of estimating support size. Next, we introduce a variation of the PML distribution that we call the pseudo PML distribution. Using this, we give a general framework for estimating a symmetric property. For entropy and distance to uniformity, this pseudo PML based framework achieves optimal sample complexity for a broader regime of the accuracy parameter than was known for the vanilla PML distribution.

$^1$Throughout the paper, distribution refers to discrete distribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We provide a general framework that could, in principle, be applied to estimate any separable symmetric property $f$, meaning $f(p)$ can be written in the form $\sum_{x \in \mathcal{D}} f(p_x)$. The motivation behind this framework is the following: for any symmetric property $f$ that is separable, the estimate for $f(p)$ can be split into two parts, $f(p) = \sum_{x \in B} f(p_x) + \sum_{x \in G} f(p_x)$, where $B$ and $G$ are a (property dependent) disjoint partition of the domain $\mathcal{D}$. We refer to $G$ as the good set and $B$ as the bad set. Intuitively, $G$ is the subset of domain elements whose contribution to $f(p)$ is easy to estimate, i.e. a simple estimator such as the empirical estimate (with correction bias) works. For many symmetric properties, finding an appropriate partition of the domain is often easy, and many estimators in the literature [JVHW15, JHW16, WY16a] make such a distinction between domain elements. The more interesting and difficult case is estimating the contribution of the bad set, $\sum_{x \in B} f(p_x)$. Much of the work in these estimators is dedicated towards estimating this contribution using sophisticated techniques such as polynomial approximation. Our work gives a unified approach to estimating the contribution of the bad set.
We propose a PML based estimator for estimating $\sum_{x \in B} f(p_x)$. We show that computing the PML distribution only on the set $B$ is sample competitive for entropy and distance to uniformity for almost all interesting parameter regimes, thus (partially) handling the open problem posed in [ADOS16]. Additionally, requiring that the PML distribution be computed on a subset $B \subseteq \mathcal{D}$ reduces the input size for the PML subroutine and results in practical algorithms (see Section 6).

To summarize, the main contributions of our work are:

• We make progress on an open problem of [ADOS16] on broadening the range of the error parameter $\epsilon$ that one can obtain for universal symmetric property estimation via PML.
• We give a general framework for applying PML to new symmetric properties.
• As a byproduct of our framework, we obtain more practical algorithms that invoke PML on smaller inputs (see Section 6).

1.1 Related Work

For many natural properties, there has been extensive work on designing efficient estimators both with respect to computational time and sample complexity [HJWW17, HJM17, AOST14, RVZ17, ZVV+16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a]. We define and state the optimal sample complexity for estimating support, entropy and distance to uniformity. For entropy, we also discuss the regime in which the empirical distribution is sample optimal.

Entropy: For any distribution $p \in \Delta_{\mathcal{D}}$, the entropy $H(p) \stackrel{\text{def}}{=} -\sum_{x \in \mathcal{D}} p_x \log p_x$.
For $\epsilon \geq \frac{\log N}{N}$ (the interesting regime), where $N \stackrel{\text{def}}{=} |\mathcal{D}|$, the optimal sample complexity for estimating $H(p)$ within additive accuracy $\epsilon$ is $O(\frac{N}{\epsilon \log N})$ [WY16a]. Further, if $\epsilon < \frac{\log N}{N}$, then [WY16a] showed that the empirical distribution is optimal.

Distance to uniformity: For any distribution $p \in \Delta_{\mathcal{D}}$, the distance to uniformity is $\|p - u\|_1 \stackrel{\text{def}}{=} \sum_{x \in \mathcal{D}} |p_x - \frac{1}{N}|$, where $u$ is the uniform distribution over $\mathcal{D}$. The optimal sample complexity for estimating $\|p - u\|_1$ within additive accuracy $\epsilon$ is $O(\frac{N}{\epsilon^2 \log N})$ [VV11a, JHW16].

Support: For any distribution $p \in \Delta_{\mathcal{D}}$, the support of the distribution is $S(p) \stackrel{\text{def}}{=} |\{x \in \mathcal{D} \mid p_x > 0\}|$. Estimating support is difficult in general because we need a sufficiently large number of samples to observe elements with small probability values. Suppose $p_x \in \{0\} \cup [\frac{1}{k}, 1]$ for all $x \in \mathcal{D}$; then [WY15] showed that the optimal sample complexity for estimating support within additive accuracy $\epsilon k$ is $O(\frac{k}{\log k} \log^2 \frac{1}{\epsilon})$.

PML was introduced by Orlitsky et al. [OSS+04] in 2004. The connection between PML and universal estimators was first studied in [ADOS16]. As discussed in the introduction, the PML based plug-in estimator applies to a restricted regime of the error parameter $\epsilon$. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b] adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by [ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely, entropy and support size). Recent work of Han et al. [HJW18] applied a local moment matching based approach in designing efficient universal symmetric property estimators for a single distribution.
[HJW18] achieves the optimal sample complexity in restricted error regimes for estimating the power sum function, support and entropy. Recently, [YOSW18] gave a different unified approach to property estimation. They devised an estimator that uses $n$ samples and achieves the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples for a wide class of properties and for all underlying distributions. This result is further strengthened to $n \log n$ samples for Shannon entropy and a broad class of other properties including $\ell_1$-distance in [HO19b]. Independently of our work, the authors of [HO19a] propose truncated PML, which is slightly different from, but similar in spirit to, our idea of pseudo PML; see [HO19a] for further details.

1.2 Organization of the Paper

In Section 2 we provide basic notation and definitions. We present our general framework in Section 3 and state all our main results. In Section 4, we provide proofs of the main results of our general framework. In Section 5, we use these results to establish the sample complexity of our estimator in the case of entropy (Section 5.1) and distance to uniformity (Section 5.2). Due to space constraints, many proofs are deferred to the appendix. In Section 6, we provide experimental results for estimating entropy using pseudo PML and other state-of-the-art estimators. Here we also demonstrate the practicality of our approach.

2 Preliminaries

Let $[a]$ denote all integers in the interval $[1, a]$. Let $\Delta_{\mathcal{D}} \subset [0, 1]^{\mathcal{D}}$ be the set of all distributions supported on domain $\mathcal{D}$ and let $N$ be the size of the domain. Throughout this paper we restrict our attention to discrete distributions and assume that we receive a sequence of $n$ independent samples from an underlying distribution $p \in \Delta_{\mathcal{D}}$. Let $\mathcal{D}^n$ be the set of all length $n$ sequences and $y^n \in \mathcal{D}^n$ be one such sequence, with $y^n_i$ denoting its $i$th element.
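Before the formal definitions below, it may help to see the central object concretely: the profile of a sample records, for each frequency $j \geq 1$, how many distinct symbols occur exactly $j$ times, and the $S$-pseudo profile is the same count restricted to a subset $S$. A minimal sketch (the function names here are ours, for illustration only):

```python
from collections import Counter

def profile(sample):
    """Profile of a sample: maps each frequency j >= 1 to the number of
    distinct symbols appearing exactly j times. The number of unseen
    symbols, phi(0), is not part of the profile."""
    multiplicities = Counter(sample)                 # symbol -> frequency
    return dict(Counter(multiplicities.values()))    # frequency -> count

def pseudo_profile(sample, S):
    """S-pseudo profile: the same count, restricted to symbols in S."""
    multiplicities = Counter(x for x in sample if x in S)
    return dict(Counter(multiplicities.values()))

# The profile forgets symbol labels: both samples below have the
# profile {1: 2, 2: 1} (two symbols seen once, one symbol seen twice).
print(profile("abac"))
print(profile("xyzz"))
print(pseudo_profile("abac", {"a", "b"}))  # 'a' twice, 'b' once -> {2: 1, 1: 1}
```

The profile is a sufficient statistic for symmetric properties, which is why the PML objective depends on the sample only through it.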
The probability of observing sequence $y^n$ is:

$$\mathbb{P}(p, y^n) \stackrel{\text{def}}{=} \prod_{x \in \mathcal{D}} p_x^{\mathrm{f}(y^n, x)}$$

where $\mathrm{f}(y^n, x) = |\{i \in [n] \mid y^n_i = x\}|$ is the frequency/multiplicity of symbol $x$ in sequence $y^n$ and $p_x$ is the probability of domain element $x \in \mathcal{D}$. We next formally define the profile, PML distribution and approximate PML distribution.

Definition 2.1 (Profile). For a sequence $y^n \in \mathcal{D}^n$, its profile, denoted $\phi = \phi(y^n) \in \mathbb{Z}^n_+$, is $\phi \stackrel{\text{def}}{=} (\phi(j))_{j \in [n]}$, where $\phi(j) \stackrel{\text{def}}{=} |\{x \in \mathcal{D} \mid \mathrm{f}(y^n, x) = j\}|$ is the number of domain elements with frequency $j$ in $y^n$. We call $n$ the length of profile $\phi$ and use $\Phi^n$ to denote the set of all profiles of length $n$.$^2$

For any distribution $p \in \Delta_{\mathcal{D}}$, the probability of a profile $\phi \in \Phi^n$ is defined as:

$$\mathbb{P}(p, \phi) \stackrel{\text{def}}{=} \sum_{\{y^n \in \mathcal{D}^n \mid \phi(y^n) = \phi\}} \mathbb{P}(p, y^n) \qquad (1)$$

The distribution that maximizes the probability of a profile is the profile maximum likelihood distribution, which we formally define next.

Definition 2.2 (Profile maximum likelihood distribution). For any profile $\phi \in \Phi^n$, a Profile Maximum Likelihood (PML) distribution $p_{pml, \phi} \in \Delta_{\mathcal{D}}$ is $p_{pml, \phi} \in \arg\max_{p \in \Delta_{\mathcal{D}}} \mathbb{P}(p, \phi)$, and $\mathbb{P}(p_{pml, \phi}, \phi)$ is the maximum PML objective value. Further, a distribution $p^{\beta}_{pml, \phi} \in \Delta_{\mathcal{D}}$ is a $\beta$-approximate PML distribution if $\mathbb{P}(p^{\beta}_{pml, \phi}, \phi) \geq \beta \cdot \mathbb{P}(p_{pml, \phi}, \phi)$.

We next provide formal definitions for a separable symmetric property and an estimator.

Definition 2.3 (Separable Symmetric Property). A symmetric property $f : \Delta_{\mathcal{D}} \to \mathbb{R}$ is separable if for any $p \in \Delta_{\mathcal{D}}$, $f(p) \stackrel{\text{def}}{=} \sum_{x \in \mathcal{D}} g(p_x)$ for some function $g : \mathbb{R} \to \mathbb{R}$. Further, for any subset $S \subseteq \mathcal{D}$, we define $f_S(p) \stackrel{\text{def}}{=} \sum_{x \in S} g(p_x)$.

$^2$The profile does not contain $\phi(0)$, the number of unseen domain elements.

Definition 2.4. A property estimator is a function $\hat{f} : \mathcal{D}^n \to \mathbb{R}$ that takes as input $n$ samples and returns the estimated property value.
The sample complexity of $\hat{f}$ for estimating a symmetric property $f(p)$ is the number of samples needed to estimate $f$ up to accuracy $\epsilon$ and with constant probability. The optimal sample complexity of a property $f$ is the minimum sample complexity of any estimator.

3 Main Results

As discussed in the introduction, one of our motivations was to provide a better analysis for the PML distribution based plug-in estimator. In this direction, we first show that the PML distribution is sample complexity optimal for estimating support in all parameter regimes. Estimating support is difficult in general and all previous works make the assumption that the minimum non-zero probability value of the distribution is at least $\frac{1}{k}$. In our next result, we show that the PML distribution under this constraint is sample complexity optimal for estimating support.

Theorem 3.1. The PML distribution$^3$ based plug-in estimator is sample complexity optimal in estimating support for all regimes of the error parameter $\epsilon$.

For support, we show that an approximate PML distribution is sample complexity optimal as well.

Theorem 3.2. For any constant $\alpha > 0$, an $\exp(-\epsilon^2 n^{1-\alpha})$-approximate PML distribution$^3$ based plug-in estimator is sample complexity optimal in estimating support for all regimes of the error $\epsilon$.

We defer the proofs of both these theorems to Appendix A.

For entropy and distance to uniformity, we study a variation of the PML distribution we call the pseudo PML distribution and present a general framework for symmetric property estimation based on it. We show that this pseudo PML based general approach gives an estimator that is sample complexity optimal for estimating entropy and distance to uniformity in broader parameter regimes. To motivate and understand this general framework we first define new generalizations of the profile, PML and approximate PML distributions.

Definition 3.3 (S-pseudo Profile).
For any sequence $y^n \in \mathcal{D}^n$ and $S \subseteq \mathcal{D}$, its $S$-pseudo profile, denoted $\phi_S = \phi_S(y^n)$, is $\phi_S \stackrel{\text{def}}{=} (\phi_S(j))_{j \in [n]}$, where $\phi_S(j) \stackrel{\text{def}}{=} |\{x \in S \mid \mathrm{f}(y^n, x) = j\}|$ is the number of domain elements in $S$ with frequency $j$ in $y^n$. We call $n$ the length of $\phi_S$ as it represents the length of the sequence $y^n$ from which this pseudo profile was constructed. Let $\Phi^n_S$ denote the set of all $S$-pseudo profiles of length $n$.

For any distribution $p \in \Delta_{\mathcal{D}}$, the probability of an $S$-pseudo profile $\phi_S \in \Phi^n_S$ is defined as:

$$\mathbb{P}(p, \phi_S) \stackrel{\text{def}}{=} \sum_{\{y^n \in \mathcal{D}^n \mid \phi_S(y^n) = \phi_S\}} \mathbb{P}(p, y^n) \qquad (2)$$

We next define the $S$-pseudo PML and $(\beta, \phi_S)$-approximate pseudo PML distributions that are analogous to the PML and approximate PML distributions.

Definition 3.4 ($S$-pseudo PML distribution). For any $S$-pseudo profile $\phi_S \in \Phi^n_S$, a distribution $p_{\phi_S} \in \Delta_{\mathcal{D}}$ is an $S$-pseudo PML distribution if $p_{\phi_S} \in \arg\max_{p \in \Delta_{\mathcal{D}}} \mathbb{P}(p, \phi_S)$.

Definition 3.5 ($(\beta, \phi_S)$-approximate pseudo PML distribution). For any pseudo profile $\phi_S \in \Phi^n_S$, a distribution $p^{\beta}_{\phi_S} \in \Delta_{\mathcal{D}}$ is a $(\beta, \phi_S)$-approximate pseudo PML distribution if $\mathbb{P}(p^{\beta}_{\phi_S}, \phi_S) \geq \beta \cdot \mathbb{P}(p_{\phi_S}, \phi_S)$.

For notational convenience, we also define the following function.

Definition 3.6. For any subset $S \subseteq \mathcal{D}$, the function $\mathrm{Freq} : \Phi^n_S \to 2^{\mathbb{Z}_+}$ takes as input an $S$-pseudo profile and returns the set of all distinct frequencies in $\phi_S$.

Using the definitions above, we next give an interesting generalization of Theorem 3 in [ADOS16].

Theorem 3.7. For a symmetric property $f$ and $S \subseteq \mathcal{D}$, suppose there is an estimator $\hat{f} : \Phi^n_S \to \mathbb{R}$ such that for any $p$ and $\phi_S \sim p$ the following holds:

$$\mathbb{P}\left(|f_S(p) - \hat{f}(\phi_S)| \geq \epsilon\right) \leq \delta,$$

$^3$Under the constraint that its minimum non-zero probability value is at least $\frac{1}{k}$.
This assumption is also necessary for the results in [ADOS16] to hold.

Then for any $F \in 2^{\mathbb{Z}_+}$, a $(\beta, \phi_S)$-approximate pseudo PML distribution $p^{\beta}_{\phi_S}$ satisfies:

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon\right) \leq \frac{\delta\, n^{|F|}}{\beta}\, \mathbb{P}\left(\mathrm{Freq}(\phi_S) \subseteq F\right) + \mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F\right).$$

Note that in the theorem above, the error probability with respect to a pseudo PML distribution based estimator has a dependency on $\frac{\delta\, n^{|F|}}{\beta}$ and $\mathbb{P}(\mathrm{Freq}(\phi_S) \not\subseteq F)$. In contrast, Theorem 3 in [ADOS16] has error probability $\frac{\delta\, e^{\sqrt{n}}}{\beta}$. This is the bottleneck in showing that PML works for all parameter regimes, and it is where pseudo PML wins over the vanilla PML based estimator, yielding non-trivial results for entropy and distance to uniformity. We next state our general framework for estimating symmetric properties. We use the idea of sample splitting, which is now standard in the literature [WY16a, JVHW15, JHW16, CL11, Nem03].

Algorithm 1 General Framework for Symmetric Property Estimation
1: procedure PROPERTY ESTIMATION($x^{2n}, f, F$)
2:   Let $x^{2n} = (x^n_1, x^n_2)$, where $x^n_1$ and $x^n_2$ represent the first and last $n$ samples of $x^{2n}$ respectively.
3:   Define $S \stackrel{\text{def}}{=} \{y \in \mathcal{D} \mid \mathrm{f}(x^n_1, y) \in F\}$.
4:   Construct the profile $\phi_S$, where $\phi_S(j) \stackrel{\text{def}}{=} |\{y \in S \mid \mathrm{f}(x^n_2, y) = j\}|$.
5:   Find a $(\beta, \phi_S)$-approximate pseudo PML distribution $p^{\beta}_{\phi_S}$ and the empirical distribution $\hat{p}$ on $x^n_2$.
6:   return $f_S(p^{\beta}_{\phi_S}) + f_{\bar{S}}(\hat{p})$ + correction bias with respect to $f_{\bar{S}}(\hat{p})$.
7: end procedure

In the above general framework, the choice of $F$ depends on the symmetric property of interest. Later, in the case of entropy and distance to uniformity, we will choose $F$ to be the region where the empirical estimate fails; it is also the region that is difficult to estimate.
One of the important properties of the above general framework is that $f_S(p^{\beta}_{\phi_S})$ (recall that $p^{\beta}_{\phi_S}$ is a $(\beta, \phi_S)$-approximate pseudo PML distribution and $f_S(p^{\beta}_{\phi_S})$ is the property value of distribution $p^{\beta}_{\phi_S}$ on the subset of domain elements $S \subseteq \mathcal{D}$) is close to $f_S(p)$ with high probability. Below we state this result formally.

Theorem 3.8. For any symmetric property $f$, let $G \subseteq \mathcal{D}$ and $F, F' \in 2^{\mathbb{Z}_+}$. If for all $S' \in 2^G$ there exists an estimator $\hat{f} : \Phi^n_{S'} \to \mathbb{R}$ such that for any $p$ and $\phi_{S'} \sim p$,

$$\mathbb{P}\left(|f_{S'}(p) - \hat{f}(\phi_{S'})| \geq 2\epsilon\right) \leq \delta \quad \text{and} \quad \mathbb{P}\left(\mathrm{Freq}(\phi_{S'}) \not\subseteq F'\right) \leq \gamma, \qquad (3)$$

then for any sequence $x^{2n} = (x^n_1, x^n_2)$,

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 4\epsilon\right) \leq \frac{\delta\, n^{|F'|}}{\beta} + \gamma + \mathbb{P}\left(S \notin 2^G\right),$$

where $S$ is the random set $S \stackrel{\text{def}}{=} \{y \in \mathcal{D} \mid \mathrm{f}(x^n_1, y) \in F\}$ and $\phi_S \stackrel{\text{def}}{=} \phi_S(x^n_2)$.

Using the theorem above, we already have a good estimate for $f_S(p)$ for appropriately chosen frequency subsets $F, F'$ and $G \subseteq \mathcal{D}$. Further, we choose these subsets $F, F'$ and $G$ carefully so that the empirical estimate $f_{\bar{S}}(\hat{p})$ plus the correction bias with respect to $f_{\bar{S}}$ is close to $f_{\bar{S}}(p)$. Combining these together, we get the following results for entropy and distance to uniformity.

Theorem 3.9. If the error parameter satisfies $\epsilon > \Omega(\frac{\log N}{N^{1-\alpha}})$ for any constant $\alpha > 0$, then for estimating entropy, the estimator of Algorithm 1 with $\beta = n^{-\log n}$ is sample complexity optimal.

For entropy, we already know from [WY16a] that the empirical distribution is sample complexity optimal if $\epsilon < c\,\frac{\log N}{N}$ for some constant $c > 0$. Therefore the interesting regime for entropy estimation is when $\epsilon > \Omega(\frac{\log N}{N})$, and our estimator works for almost all such $\epsilon$.

Theorem 3.10.
Let $\alpha > 0$ and the error parameter satisfy $\epsilon > \Omega(\frac{1}{N^{1-8\alpha}})$; then for estimating distance from uniformity, the estimator of Algorithm 1 with $\beta = n^{-\sqrt{n \log n / N}}$ is sample complexity optimal.

Note that the estimator in [JHW17] also requires that the error parameter satisfies $\epsilon \geq \frac{1}{N^C}$, where $C > 0$ is some constant.

4 Analysis of General Framework for Symmetric Property Estimation

Here we provide proofs of the main results for our general framework (Theorems 3.7 and 3.8). These results depend only weakly on the property and generalize results in [ADOS16]. The PML based estimator in [ADOS16] is sample competitive only for a restricted error parameter regime, and this stems from the large number of possible profiles of length $n$. Our next lemma will be useful in addressing this issue; later we show how to use this result to prove Theorems 3.7 and 3.8.

Lemma 4.1. For any subset $S \subseteq \mathcal{D}$ and $F \in 2^{\mathbb{Z}_+}$, if the set $B$ is defined as $B \stackrel{\text{def}}{=} \{\phi_S \in \Phi^n_S \mid \mathrm{Freq}(\phi_S) \subseteq F\}$, then the cardinality of $B$ is upper bounded by $(n+1)^{|F|}$.

Proof of Theorem 3.7. Using the law of total probability we have,

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon\right) = \mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon,\ \mathrm{Freq}(\phi_S) \subseteq F\right) + \mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon,\ \mathrm{Freq}(\phi_S) \not\subseteq F\right)$$
$$\leq \mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon,\ \mathrm{Freq}(\phi_S) \subseteq F\right) + \mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F\right).$$

Consider any $\phi_S \sim p$. If $\mathbb{P}(p, \phi_S) > \delta/\beta$, then we know that $\mathbb{P}(p^{\beta}_{\phi_S}, \phi_S) \geq \beta\, \mathbb{P}(p, \phi_S) > \delta$. For $\beta \leq 1$, we also have $\mathbb{P}(p, \phi_S) > \delta$, which implies $|f_S(p) - \hat{f}(\phi_S)| \leq \epsilon$ (the failure event of $\hat{f}$ has probability at most $\delta$, so it cannot contain a profile of probability greater than $\delta$).
Further, $\mathbb{P}(p^{\beta}_{\phi_S}, \phi_S) > \delta$ implies $|f_S(p^{\beta}_{\phi_S}) - \hat{f}(\phi_S)| \leq \epsilon$. Using the triangle inequality we get $|f_S(p) - f_S(p^{\beta}_{\phi_S})| \leq |f_S(p) - \hat{f}(\phi_S)| + |f_S(p^{\beta}_{\phi_S}) - \hat{f}(\phi_S)| \leq 2\epsilon$. Note that we wish to upper bound the probability of the set $B_{F,S,\hat{f}} \stackrel{\text{def}}{=} \{\phi_S \in \Phi^n_S \mid \mathrm{Freq}(\phi_S) \subseteq F \text{ and } |f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon\}$. From the previous discussion, we get $\mathbb{P}(p, \phi_S) \leq \delta/\beta$ for all $\phi_S \in B_{F,S,\hat{f}}$. Therefore,

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| \geq 2\epsilon,\ \mathrm{Freq}(\phi_S) \subseteq F\right) = \sum_{\phi_S \in B_{F,S,\hat{f}}} \mathbb{P}(p, \phi_S) \leq \frac{\delta}{\beta}\, |B_{F,S,\hat{f}}| \leq \frac{\delta}{\beta}\, (n+1)^{|F|}.$$

In the final inequality, we use $B_{F,S,\hat{f}} \subseteq \{\phi_S \in \Phi^n_S \mid \mathrm{Freq}(\phi_S) \subseteq F\}$ and invoke Lemma 4.1.

Proof of Theorem 3.8. Using Bayes rule we have:

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon\right) = \sum_{S' \subseteq \mathcal{D}} \mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon \mid S = S'\right) \mathbb{P}(S = S') \leq \sum_{S' \in 2^G} \mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon \mid S = S'\right) \mathbb{P}(S = S') + \mathbb{P}(S \notin 2^G). \qquad (4)$$

In the first upper bound, we removed the randomness associated with the random set $S$ and used $\sum_{S' \notin 2^G} \mathbb{P}(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon \mid S = S')\, \mathbb{P}(S = S') \leq \mathbb{P}(S \notin 2^G)$. Consider the first term on the right side of the above expression, and note that $\mathbb{P}(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon \mid S = S') = \mathbb{P}(|f_{S'}(p) - f_{S'}(p^{\beta}_{\phi_{S'}})| > 2\epsilon)$. It is upper bounded by

$$\sum_{S' \in 2^G} \left[\frac{\delta\, n^{|F'|}}{\beta} + \mathbb{P}\left(\mathrm{Freq}(\phi_{S'}) \not\subseteq F'\right)\right] \mathbb{P}(S = S') \leq \frac{\delta\, n^{|F'|}}{\beta} + \gamma.$$

In the first inequality above, we invoke Theorem 3.7 using the conditions from Equation (3). In the second inequality, we use $\sum_{S' \in 2^G} \mathbb{P}(S = S') \leq 1$ and $\mathbb{P}(\mathrm{Freq}(\phi_{S'}) \not\subseteq F', S = S') \leq \gamma$. The theorem follows by combining all the analysis together.

5 Applications of the General Framework

Here we provide applications of our general framework (defined in Section 3) using the results from the previous section. We apply our general framework to estimate entropy and distance to uniformity.
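Before turning to the property-specific analyses, the two-part estimator of Algorithm 1 can be sketched in code. This is only an illustrative skeleton: the callables `pml_estimate`, `bias_correction` and the single-element function `g` are placeholders of our own, not the paper's implementation, and in practice the PML step would call an approximate (pseudo) PML solver as a black box.

```python
import math
from collections import Counter

def split_estimate(samples, g, freq_set, pml_estimate, bias_correction):
    """Sketch of Algorithm 1: empirical plug-in on the easy symbols,
    a (pseudo) PML black box on the difficult ones."""
    n = len(samples) // 2
    x1, x2 = samples[:n], samples[n:]          # sample splitting
    f1, f2 = Counter(x1), Counter(x2)
    # Difficult set S: symbols whose first-half frequency falls in F.
    S = {y for y in f1 if f1[y] in freq_set}
    # S-pseudo profile built from the second half of the samples.
    phi_S = dict(Counter(f2[y] for y in S if f2[y] > 0))
    est_bad = pml_estimate(phi_S, n)           # plug-in via pseudo PML
    # Easy part: empirical plug-in on the complement of S.
    est_good = sum(g(f2[y] / n) for y in f2 if y not in S)
    return est_bad + est_good + bias_correction(S, f2, n)

# Entropy example with trivial placeholder PML and bias terms:
# x1 = "aabc" puts 'b' and 'c' (frequency 1) into S; the empirical part
# then only covers 'a', which appears twice in the second half x2 = "aabb".
g = lambda p: -p * math.log(p)
est = split_estimate(list("aabcaabb"), g, {1},
                     pml_estimate=lambda phi, n: 0.0,
                     bias_correction=lambda S, f2, n: 0.0)
```

The point of the split is that `pml_estimate` only ever sees the pseudo profile of the difficult set, which is much smaller than the full profile.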
In Section 5.1 and Section 5.2 we analyze the performance of our estimator for entropy and distance to uniformity estimation respectively.

5.1 Entropy estimation

In order to prove our main result for entropy (Theorem 3.9), we first need the existence of an estimator for entropy with some desired properties. The existence of such an estimator will be crucial to bound the failure probability of our estimator. A result analogous to this is already known in [ADOS16] (Lemma 2) and the proof of our result follows from a careful observation of [ADOS16, WY16a]. We state this result here but defer the proof to the appendix.

Lemma 5.1. Let $\alpha > 0$, $\epsilon > \Omega(\frac{\log N}{N^{1-\alpha}})$ and $S \subseteq \mathcal{D}$. Then for entropy on the subset $S$ ($\sum_{y \in S} p_y \log \frac{1}{p_y}$) there exists an $S$-pseudo profile based estimator that uses the optimal number of samples, has bias less than $\epsilon$, and, if we change any one sample, changes by at most $c \cdot \frac{n^{\alpha}}{n}$, where $c$ is a constant.

Combining the above lemma with Theorem 3.8, we next prove that our estimator defined in Algorithm 1 is sample complexity optimal for estimating entropy in a broader regime of the error $\epsilon$.

Proof of Theorem 3.9. Let $f(p)$ represent the entropy of distribution $p$ and $\hat{f}$ be the estimator in Lemma 5.1. Define $F \stackrel{\text{def}}{=} [0, c_1 \log n]$ for a constant $c_1 \geq 40$. Given the sequence $x^{2n}$, the random set $S$ is defined as $S \stackrel{\text{def}}{=} \{y \in \mathcal{D} \mid \mathrm{f}(x^n_1, y) \leq c_1 \log n\}$. Let $F' \stackrel{\text{def}}{=} [0, 8 c_1 \log n]$; then by the derivation in Lemma 6 of [ADOS16] (or by a simple application of Chernoff bounds$^4$) we have,

$$\mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F'\right) = \mathbb{P}\left(\exists\, y \in \mathcal{D} \text{ such that } \mathrm{f}(x^n_1, y) \leq c_1 \log n \text{ and } \mathrm{f}(x^n_2, y) > 8 c_1 \log n\right) \leq \frac{1}{n^5}.$$

Further, let $G \stackrel{\text{def}}{=} \{x \in \mathcal{D} \mid p_x \leq \frac{2 c_1 \log N}{n}\}$; then by Equation 48 in [WY16a] we have $\mathbb{P}(S \notin 2^G) \leq \frac{1}{n^4}$.
Further, for all $S' \in 2^G$ we have,

$$\mathbb{P}\left(\mathrm{Freq}(\phi_{S'}) \not\subseteq F'\right) = \mathbb{P}\left(\exists\, y \in S' \text{ such that } \mathrm{f}(x^n_2, y) > 8 c_1 \log n\right) \leq \gamma \quad \text{for } \gamma = \frac{1}{n^5}.$$

Note that for all $x \in S'$, $p_x \leq \frac{2 c_1 \log N}{n}$, and the above inequality also follows from Chernoff bounds. All that remains now is to upper bound $\delta$. Using the estimator constructed in Lemma 5.1, combined with McDiarmid's inequality, we have,

$$\mathbb{P}\left(|f_{S'}(p) - \hat{f}(\phi_{S'})| \geq 2\epsilon\right) \leq 2 \exp\left(-\frac{2\epsilon^2}{n\, (c\, \frac{n^{\alpha}}{n})^2}\right) \leq \delta \quad \text{for } \delta = \exp\left(-2 \epsilon^2 n^{1 - 2\alpha}\right).$$

Substituting all these parameters together in Theorem 3.8 we have,

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon\right) \leq \frac{\delta\, n^{|F'|}}{\beta} + \mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F'\right) + \mathbb{P}(S \notin 2^G) \leq \exp\left(-2 \epsilon^2 n^{1 - 2\alpha}\right) n^{9 c_1 \log n} + \frac{1}{n^5} + \frac{1}{n^4} \leq \frac{2}{n^4}. \qquad (5)$$

In the first inequality, we use Theorem 3.8. In the second inequality, we substituted the values for $\delta$, $\beta$, $\gamma$ and $\mathbb{P}(S \notin 2^G)$. In the final inequality we used $n = \Theta(\frac{N}{\epsilon \log N})$ and $\epsilon > \Omega(\frac{\log^3 N}{N^{1-4\alpha}})$.

Our final goal is to estimate $f(p)$, and to complete the proof we need to argue that $f_{\bar{S}}(\hat{p})$ plus the correction bias with respect to $f_{\bar{S}}$ is close to $f_{\bar{S}}(p)$, where recall that $\hat{p}$ is the empirical distribution on the sequence $x^n_2$. The proof of this follows immediately from [WY16a] (Case 2 in the proof of Proposition 4). [WY16a] bound the bias and variance of the empirical estimator with a correction bias, and applying Markov's inequality to their result we get $\mathbb{P}(|f_{\bar{S}}(p) - (f_{\bar{S}}(\hat{p}) + \frac{|\bar{S}|}{n})| > 2\epsilon) \leq \frac{1}{3}$, where $\frac{|\bar{S}|}{n}$ is the correction bias in [WY16a]. Using the triangle inequality, our estimator fails only if either $|f_{\bar{S}}(p) - (f_{\bar{S}}(\hat{p}) + \frac{|\bar{S}|}{n})| > 2\epsilon$ or $|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon$.
Further, by a union bound the failure probability is at most $\frac{1}{3} + \frac{2}{n^4}$, which is a constant.

$^4$Note that the probabilities of many events in this proof can be easily bounded by an application of Chernoff bounds. These bounds are also shown in [ADOS16, WY16a], and we use these inequalities while omitting the details.

5.2 Distance to Uniformity estimation

Here we prove our main result for distance to uniformity estimation (Theorem 3.10). First, we show the existence of an estimator for distance to uniformity with certain desired properties. Similar to entropy, a result analogous to this is shown in [ADOS16] (Lemma 2) and the proof of our result follows from a careful observation of [ADOS16, JHW17]. We state this result here but defer the proof to Appendix C.

Lemma 5.2. Let $\alpha > 0$ and $S \subseteq \mathcal{D}$. Then for distance to uniformity on $S$ ($\sum_{y \in S} |p_y - \frac{1}{N}|$) there exists an $S$-pseudo profile based estimator that uses the optimal number of samples, has bias at most $\epsilon$, and, if we change any one sample, changes by at most $c \cdot \frac{n^{\alpha}}{n}$, where $c$ is a constant.

Combining the above lemma with Theorem 3.8, we provide the proof of Theorem 3.10.

Proof of Theorem 3.10. Let $f(p)$ represent the distance to uniformity of distribution $p$ and $\hat{f}$ be the estimator in Lemma 5.2. Define $F = [\frac{n}{N} - \sqrt{\frac{c_1 n \log n}{N}}, \frac{n}{N} + \sqrt{\frac{c_1 n \log n}{N}}]$ for some constant $c_1 \geq 40$.
Given the sequence $x^{2n}$, the random set $S$ is defined as $S \stackrel{\text{def}}{=} \{y \in \mathcal{D} \mid \mathrm{f}(x^n_1, y) \in F\}$. Let $F' = [\frac{n}{N} - \sqrt{\frac{8 c_1 n \log n}{N}}, \frac{n}{N} + \sqrt{\frac{8 c_1 n \log n}{N}}]$; then by the derivation in Lemma 7 of [ADOS16] (also shown in [JHW17]$^5$) we have,

$$\mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F'\right) = \mathbb{P}\left(\exists\, y \in \mathcal{D} \text{ such that } \mathrm{f}(x^n_1, y) \in F \text{ and } \mathrm{f}(x^n_2, y) \notin F'\right) \leq \frac{1}{n^4}.$$

Further, let $G \stackrel{\text{def}}{=} \{x \in \mathcal{D} \mid p_x \in [\frac{1}{N} - \sqrt{\frac{2 c_1 \log n}{n N}}, \frac{1}{N} + \sqrt{\frac{2 c_1 \log n}{n N}}]\}$; then using Lemma 2 in [JHW17] we get,

$$\mathbb{P}(S \notin 2^G) = \mathbb{P}\left(\exists\, y \in \mathcal{D} \text{ such that } \mathrm{f}(x^n_1, y) \in F \text{ and } p_y \notin G\right) \leq \frac{\log n}{n^{1-\epsilon}}.$$

Further, for all $S' \in 2^G$ we have,

$$\mathbb{P}\left(\mathrm{Freq}(\phi_{S'}) \not\subseteq F'\right) = \mathbb{P}\left(\exists\, y \in S' \text{ such that } \mathrm{f}(x^n_2, y) \notin F'\right) \leq \gamma \quad \text{for } \gamma = \frac{1}{n}.$$

Note that for all $x \in S'$, $p_x \in G$, and the above result follows from [JHW17] (Lemma 1). All that remains now is to upper bound $\delta$. Using the estimator constructed in Lemma 5.2, combined with McDiarmid's inequality, we have,

$$\mathbb{P}\left(|f_{S'}(p) - \hat{f}(\phi_{S'})| \geq 2\epsilon\right) \leq 2 \exp\left(-\frac{2\epsilon^2}{n\, (c\, \frac{n^{\alpha}}{n})^2}\right) \leq \delta \quad \text{for } \delta = \exp\left(-2 \epsilon^2 n^{1 - 2\alpha}\right).$$

Substituting all these parameters in Theorem 3.8 we get,

$$\mathbb{P}\left(|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon\right) \leq \frac{\delta\, n^{|F'|}}{\beta} + \mathbb{P}\left(\mathrm{Freq}(\phi_S) \not\subseteq F'\right) + \mathbb{P}(S \notin 2^G) \leq \exp\left(-2 \epsilon^2 n^{1 - 2\alpha}\right)\, n^{2 \sqrt{\frac{8 c_1 n \log n}{N}}} + \frac{\log n}{n^{1-\epsilon}} + \frac{1}{n} \leq o(1). \qquad (6)$$

In the first inequality, we use Theorem 3.8. In the second inequality, we substituted the values for $\delta$, $\beta$, $\gamma$ and $\mathbb{P}(S \notin 2^G)$. In the final inequality we used $n = \Theta(\frac{N}{\epsilon^2 \log N})$ and $\epsilon > \Omega(\frac{1}{N^{1-8\alpha}})$.

Our final goal is to estimate $f(p)$, and to complete the proof we argue that $f_{\bar{S}}(\hat{p})$ plus the correction bias with respect to $f_{\bar{S}}$ is close to $f_{\bar{S}}(p)$, where recall that $\hat{p}$ is the empirical distribution on the sequence $x^n_2$. The proof for this case follows immediately from [JHW17] (proof of Theorem 2).
[JHW17] define three kinds of events $E_1, E_2$ and $E_3$; the proof for our empirical case follows from the analysis of the bias and variance of events $E_1$ and $E_2$. Further, combining results in [JHW17] with Markov's inequality we get $\mathbb{P}(|f_{\bar{S}}(p) - f_{\bar{S}}(\hat{p})| > 2\epsilon) \leq \frac{1}{3}$, and the correction bias here is zero. Using the triangle inequality, our estimator fails only if either $|f_{\bar{S}}(p) - f_{\bar{S}}(\hat{p})| > 2\epsilon$ or $|f_S(p) - f_S(p^{\beta}_{\phi_S})| > 2\epsilon$. Further, by a union bound the failure probability is upper bounded by $\frac{1}{3} + o(1)$, which is a constant.

$^5$Similar to entropy, the probabilities of many events can be bounded by a simple application of Chernoff bounds and have already been shown in [ADOS16, JHW17]. We omit the details for these inequalities.

6 Experiments

We performed two different sets of experiments for entropy estimation: one to compare performance guarantees and the other to compare running times. In our pseudo PML approach, we divide the samples into two parts. We run the empirical estimate on one (this is easy) and the PML estimate on the other. For the PML estimate, any algorithm to compute an approximate PML distribution can be used in a black box fashion. An advantage of the pseudo PML approach is that it can use any algorithm to estimate the PML distribution as a black box, providing both competitive performance and running time efficiency. In our experiments, we use the heuristic algorithm in [PJW17] to compute an approximate PML distribution. In the first set of experiments detailed below, we compare the performance of the pseudo PML approach with raw [PJW17] and other state-of-the-art estimators for estimating entropy.
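The frequency-threshold split used in these experiments can be sketched as follows. Here `pml_entropy` is a stand-in for the black-box approximate PML subroutine (the experiments use [PJW17], which we do not reproduce), and the correction term on the frequent symbols is a Miller-Madow style bias correction shown purely for illustration; none of this is the exact experimental code.

```python
import math
from collections import Counter

def entropy_split(sample, pml_entropy, threshold=18):
    """Entropy estimate: hand rare symbols (frequency <= threshold) to a
    PML black box, and use a bias-corrected empirical plug-in on the rest."""
    n = len(sample)
    freqs = Counter(sample)
    rare = {x: c for x, c in freqs.items() if c <= threshold}
    frequent = {x: c for x, c in freqs.items() if c > threshold}
    # Difficult part: only the profile of the rare symbols reaches PML.
    phi = dict(Counter(rare.values()))
    h_rare = pml_entropy(phi, n)
    # Easy part: empirical plug-in plus a Miller-Madow style correction bias.
    h_frequent = sum(-(c / n) * math.log(c / n) for c in frequent.values())
    h_frequent += len(frequent) / (2 * n)
    return h_rare + h_frequent
```

Because the PML subroutine only receives the (typically tiny) profile of the rare symbols, its input shrinks as the sample grows, which is the source of the speedups reported below.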
Our code is available at https://github.com/shiragur/CodeForPseudoPML.git.

[Figure: Root mean square error (RMSE) versus sample size for entropy estimation on three distributions, with panels "Entropy - Mix 2 Uniforms", "Entropy - Zipf(0.5)", and "Entropy - Zipf(1)"; estimators compared: Our Work, PJW17, MLE, VV11b, JVHW15.]

Each plot depicts the performance of various algorithms for estimating the entropy of different distributions with domain size $N = 10^5$. Each data point represents 50 random trials. "Mix 2 Uniforms" is a mixture of two uniform distributions with half the probability mass on the first $N/10$ symbols, and Zipf($\alpha$) assigns probability proportional to $1/i^{\alpha}$ for $i \in [N]$. MLE is the naive approach of using the empirical distribution with correction bias; all remaining algorithms are denoted by their bibliographic citations. In our algorithm we pick threshold 18 (same as [WY16a]) and our set $F = [0, 18]$ (the input of Algorithm 1), i.e. we use the PML estimate on frequencies $\le 18$ and the empirical estimate on the rest. Unlike Algorithm 1, we do not perform sample splitting in the experiments; we believe this requirement is an artifact of our analysis. For estimating entropy, the error achieved by our estimator is competitive with [PJW17] and other state-of-the-art entropy estimators.
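To make the split structure of the estimator concrete, here is a minimal sketch in Python: frequencies above the threshold get the empirical (plug-in) estimate with a correction bias, while the low-frequency part is delegated to any approximate-PML routine as a black box. The function name `split_entropy_estimate` is ours, the Miller-Madow style correction term is our illustrative choice of correction bias, and the plug-in fallback for the PML part is a placeholder, not the [PJW17] computation.

```python
import math
from collections import Counter

def split_entropy_estimate(sample, threshold=18, pml_black_box=None):
    """Pseudo PML style split estimator for entropy (natural log).

    Symbols with frequency > threshold are handled by the empirical
    plug-in estimate plus a correction bias; symbols with frequency
    <= threshold are delegated to `pml_black_box`, an arbitrary
    approximate-PML entropy routine taking (low_freqs, n). Falls back
    to a plain plug-in stand-in when no black box is supplied
    (placeholder only -- this is NOT a PML computation).
    """
    n = len(sample)
    counts = Counter(sample)
    high = [c for c in counts.values() if c > threshold]
    low = [c for c in counts.values() if c <= threshold]

    # Empirical estimate on frequent symbols, plus a Miller-Madow
    # style correction bias proportional to the number of symbols.
    est = sum(-(c / n) * math.log(c / n) for c in high)
    est += len(high) / (2 * n)

    if low:
        if pml_black_box is not None:
            est += pml_black_box(low, n)
        else:
            # Placeholder: plug-in on the rare symbols (no PML here).
            est += sum(-(c / n) * math.log(c / n) for c in low)
    return est
```

On a sample with two symbols appearing 50 times each, both frequencies clear the threshold, so the output is the empirical entropy $\ln 2$ plus the $2/(2n)$ correction term; whether that correction matches the exact bias correction in our experiments is an assumption of this sketch.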
Note that our results match [PJW17] for small sample sizes because few domain elements cross the threshold, so for a large fraction of the samples we simply run the [PJW17] algorithm.
In the second set of experiments we demonstrate the running time efficiency of our approach. Here we compare the running time of our algorithm, using [PJW17] as a subroutine, to that of the raw [PJW17] algorithm on the Zipf(1) distribution. In the table below, the second row (EmpFrac) is the fraction of samples on which our algorithm uses the empirical estimate (plus correction bias), and the third row (Speedup) is the ratio of the running time of [PJW17] to that of our algorithm. For large sample sizes, the entries in the EmpFrac row are high, i.e. our algorithm applies the simple empirical estimate to a large fraction of the samples, thereby enabling roughly a 10x speedup in running time.

Sample size   10^3    5*10^3   10^4    5*10^4   10^5    5*10^5   10^6     5*10^6
EmpFrac       0.184   0.317    0.372   0.505    0.562   0.695    0.752    0.886
Speedup       0.824   1.205    1.669   3.561    4.852   9.552    13.337   12.196

Acknowledgments
We thank the reviewers for their helpful comments, great suggestions, and positive feedback. Moses Charikar was supported by a Simons Investigator Award, a Google Faculty Research Award and an Amazon Research Award. Aaron Sidford was partially supported by NSF CAREER Award CCF-1844855.

References
[ADOS16] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. CoRR, abs/1611.02960, 2016.

[AOST14] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1855-1869, 2014.

[CL11] T. Tony Cai and Mark G. Low.
Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Annals of Statistics, 39(2):1012-1041, April 2011.

[CSS19] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximum likelihood for universal symmetric property estimation. arXiv e-prints, arXiv:1905.08448, May 2019.

[ET76] Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435-447, 1976.

[HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On estimation of $L_{r}$-norms in Gaussian white noise models. arXiv e-prints, arXiv:1710.03863, October 2017.

[HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. arXiv preprint arXiv:1802.08405, 2018.

[HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv e-prints, arXiv:1711.02141, November 2017.

[HO19a] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood, 2019.

[HO19b] Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation, 2019.

[JHW16] J. Jiao, Y. Han, and T. Weissman. Minimax estimation of the L1 distance. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 750-754, July 2016.

[JHW17] Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the L1 distance. arXiv e-prints, arXiv:1705.00807, May 2017.

[JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835-2885, May 2015.

[Nem03] Arkadi Nemirovski. On tractable approximations of randomly perturbed convex constraints.
In 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475), volume 3, pages 2419-2422. IEEE, 2003.

[OSS+04] A. Orlitsky, S. Sajama, N. P. Santhanam, K. Viswanathan, and Junan Zhang. Algorithms for modeling distributions over large alphabets. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pages 304-304, 2004.

[OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283-13288, 2016.

[PJW17] D. S. Pavlichin, J. Jiao, and T. Weissman. Approximate profile maximum likelihood. arXiv e-prints, December 2017.

[RRSS07] S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 559-569, October 2007.

[RVZ17] Aditi Raghunathan, Gregory Valiant, and James Zou. Estimating the unseen from multiple populations. CoRR, abs/1707.03854, 2017.

[Tim14] Aleksandr Filippovich Timan. Theory of Approximation of Functions of a Real Variable, volume 34. Elsevier, 2014.

[VV11a] G. Valiant and P. Valiant. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403-412, October 2011.

[VV11b] Gregory Valiant and Paul Valiant. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, STOC '11, pages 685-694, New York, NY, USA, 2011. ACM.

[WY15] Y. Wu and P. Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. arXiv e-prints, April 2015.

[WY16a] Y. Wu and P. Yang.
Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702-3720, June 2016.

[WY16b] Yihong Wu and Pengkun Yang. Sample complexity of the distinct elements problem. arXiv e-prints, arXiv:1612.03375, December 2016.

[YOSW18] Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Data amplification: A unified and competitive approach to property estimation. In Advances in Neural Information Processing Systems, pages 8834-8843, 2018.

[ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7:13293, 2016.