{"title": "Learning Populations of Parameters", "book": "Advances in Neural Information Processing Systems", "page_first": 5778, "page_last": 5787, "abstract": "Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \\in [0,1]$, and we observe $n$ independent random variables, $X_1,\\ldots,X_n$, with $X_i \\sim $ Binomial$(t, p_i)$. How accurately can one recover the ``histogram'' (i.e. cumulative density function) of the $p_i$'s? While the empirical estimates would recover the histogram to earth mover distance $\\Theta(\\frac{1}{\\sqrt{t}})$ (equivalently, $\\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\\frac{1}{t})$ which is information theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.", "full_text": "Learning Populations of Parameters\n\nKevin Tian, Weihao Kong, and Gregory Valiant\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA, 94305\n\n(kjtian, whkong, valiant)@stanford.edu\n\nAbstract\n\nConsider the following estimation problem: there are n entities, each with an\nunknown parameter pi \u2208 [0, 1], and we observe n independent random variables,\nX1, . . . , Xn, with Xi \u223c Binomial(t, pi). How accurately can one recover the\n\u201chistogram\u201d (i.e. cumulative density function) of the pi\u2019s? 
While the empirical estimates would recover the histogram to earth mover distance Θ(1/√t) (equivalently, ℓ1 distance between the CDFs), we show that, provided n is sufficiently large, we can achieve error O(1/t), which is information theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.

1 Introduction

In many domains, from medical records, to the outcomes of political elections, performance in sports, and a number of biological studies, we have enormous datasets that reflect properties of a large number of entities/individuals. Nevertheless, for many of these datasets, the amount of information that we have about each entity is relatively modest—often too little to accurately infer properties about that entity. In this work, we consider the extent to which we can accurately recover an estimate of the population or set of property values of the entities, even in the regime in which there is insufficient data to resolve properties of each specific entity.

To give a concrete example, suppose we have a large dataset representing 1M people, that records whether each person had the flu in each of the past 5 years. Suppose each person has some underlying probability of contracting the flu in a given year, with pi representing the probability that the ith person contracts the flu each year (and assuming independence between years). With 5 years of data, the empirical estimates p̂i for each person are quite noisy (and the estimates will all be multiples of 1/5).
Despite this, to what extent can we hope to accurately recover the population or set of pi's? An accurate recovery of this population of parameters might be very useful—is it the case that most people have similar underlying probabilities of contracting the flu, or is there significant variation between people? Additionally, such an estimate of this population could be fruitfully leveraged as a prior in making concrete predictions about individuals' pi's, as a type of empirical Bayes method. The following example motivates the hope for significantly improving upon the empirical estimates:

Example 1. Consider a set of n biased coins, with the ith coin having an unknown bias pi. Suppose we flip each coin twice (independently), and observe that the number of coins where both flips landed heads is roughly n/4, and similarly for the number of coins that landed HT, TH, and TT. We can safely conclude that almost all of the pi's are almost exactly 1/2. The reasoning proceeds in two steps: first, since the average outcome is balanced between heads and tails, the average pi must be very close to 1/2. Given this, if there was any significant amount of variation in the pi's, one would expect to see significantly more HHs and TTs than the HT and TH outcomes, simply because Pr[Binomial(2, p) = 1] = 2p(1 − p) attains a maximum for p = 1/2.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Furthermore, suppose we now consider the ith coin, and see that it landed heads twice.
The empirical estimate of pi would be 1, but if we observe close to n/4 coins with each pair of outcomes, using the above reasoning that argues that almost all of the p's are likely close to 1/2, we could safely conclude that pi is likely close to 1/2.

This ability to "denoise" the empirical estimate of a parameter based on the observations of a number of independent random variables (in this case, the outcomes of the tosses of the other coins) was first pointed out by Charles Stein in the setting of estimating the means of a set of Gaussians and is known as "Stein's phenomenon" [14]. We discuss this further in Section 1.1. Example 1 was chosen to be an extreme illustration of the ability to leverage the large number of entities being studied, n, to partially compensate for the small amount of data reflecting each entity (the 2 tosses of each coin, in the above example).

Our main result, stated below, demonstrates that even for worst-case sets of p's, significant "denoising" is possible. While we cannot hope to always accurately recover each pi, we show that we can accurately recover the set or histogram of the p's, as measured in the ℓ1 distance between the cumulative distribution functions, or equivalently, the "earth mover's distance" (also known as 1-Wasserstein distance) between the set of p's regarded as a distribution P that places mass 1/n at each pi, and the distribution Q returned by our estimator. Equivalently, our returned distribution Q can also be represented as a set of n values q1, . . . , qn, in which case this earth mover's distance is precisely 1/n times the ℓ1 distance between the vector of sorted pi's and the vector of sorted qi's.

Theorem 1. Consider a set of n probabilities, p1, . . . , pn with pi ∈ [0, 1], and suppose we observe the outcome of t independent flips of each coin, namely X1, . . .
, Xn, with Xi ∼ Binomial(t, pi). There is an algorithm that produces a distribution Q supported on [0, 1], such that with probability at least 1 − δ over the randomness of X1, . . . , Xn,

||P − Q||_W ≤ π/t + 3^t Σ_{k=1}^{t} sqrt(3 ln(2t/δ) / n) ≤ π/t + O_δ(3^t t √(ln t) / √n),

where P denotes the distribution that places mass 1/n at value pi, and || · ||_W denotes the Wasserstein distance.

The above theorem applies to the setting where we hope to recover a set of arbitrary pi's. In some practical settings, we might think of each pi as being sampled independently from some underlying distribution Ppop over probabilities, and the goal is to recover this population distribution Ppop. Since the empirical distribution of n draws from a distribution Ppop over [0, 1] converges to Ppop in Wasserstein distance at a rate of O(1/√n), the above theorem immediately yields the analogous result in this setting:

Corollary 1. Consider a distribution Ppop over [0, 1], and suppose we observe X1, . . . , Xn where Xi is obtained by first drawing pi independently from Ppop, and then drawing Xi from Binomial(t, pi). There is an algorithm that will output a distribution Q such that with probability at least 1 − δ, ||Ppop − Q||_W ≤ π/t + O_δ(3^t t √(ln t) / √n).

The inverse linear dependence on t of Theorem 1 and Corollary 1 is information theoretically optimal, and is attained asymptotically for sufficiently large n:

Proposition 1. Let Ppop denote a distribution over [0, 1], and for positive integers t and n, let X1, . . . , Xn denote random variables with Xi distributed as Binomial(t, pi) where pi is drawn independently according to Ppop. An estimator f maps X1, . . . , Xn to a distribution f(X1, . . .
, Xn). Then, for every fixed t, the following lower bound on the accuracy of any estimator holds for all n:

inf_f sup_{Ppop} E[ ||f(X1, . . . , Xn) − Ppop||_W ] > 1/(4t).

Our estimation algorithm, whose performance is characterized by Theorem 1, proceeds via the method of moments. Given X1, . . . , Xn with Xi ∼ Binomial(t, pi), and sufficiently large n, we can obtain accurate estimates of the first t moments of the distribution/histogram P defined by the pi's. Accurate estimates of the first t moments can then be leveraged to recover an estimate of P that is accurate to error 1/t plus a factor that depends (exponentially on t) on the error in the recovered moments.

The intuition for the lower bound, Proposition 1, is that the realizations of Binomial(t, pi) give no information beyond the first t moments. Additionally, there exist distributions P and Q whose first t moments agree exactly, but which differ in their (t + 1)st moment, and have ||P − Q||_W ≥ 1/(2t). Putting these two pieces together establishes the lower bound.

We also extend our results to the practically relevant multi-parameter analog of the setting described above, where the ith datapoint corresponds to a pair, or d-tuple, of hidden parameters, p(i,1), . . . , p(i,d), and we observe independent random variables X(i,1), . . . , X(i,d) with X(i,j) ∼ Binomial(t(i,j), p(i,j)). In this setting, the goal is to recover the multivariate set of d-tuples {p(i,1), . . . , p(i,d)}, again in an earth mover's sense. This setting corresponds to recovering an approximation of an underlying joint distribution over these d-tuples of parameters.

To give one concrete motivation for this problem, consider a hypothetical setting where we have n genotypes (sets of genetic features), with ti people of the ith genotype.
Let X(i,1) denote the number of people with the ith genotype who exhibit disease 1, and X(i,2) denote the number of people with genotype i who exhibit disease 2. The interpretation of the hidden parameters pi,1 and pi,2 are the respective probabilities of people with the ith genotype of developing each of the two diseases. Our results imply that provided n is large, one can accurately recover an approximation to the underlying set or two-dimensional joint distribution of {(pi,1, pi,2)} pairs, even in settings where there are too few people of each genotype to accurately determine which of the genotypes are responsible for elevated disease risk. Recovering this set of pairs would allow one to infer whether there are common genetic drivers of the two diseases—even in the regime where there is insufficient data to resolve which genotypes are the common drivers.

Our multivariate analog of Theorem 1 is also formulated in terms of a multivariate analog of earth mover's distance (see Definition 1 for a formal definition):

Theorem 2. Let {pi,j} denote a set of n d-tuples of hidden parameters in [0, 1]^d, with i ∈ {1, . . . , n} and j ∈ {1, . . . , d}, and suppose we observe random variables Xi,j, with Xi,j ∼ Binomial(t, pi,j). There is an algorithm that produces a distribution Q supported on [0, 1]^d, such that with probability at least 1 − δ over the randomness of the Xi,j's,

||P − Q||_W ≤ C1/t + C2 Σ_{|α|=1}^{t} d (2t)^{d+1} 2^t 3^{|α|} sqrt(ln(1/δ) / n) ≤ C1/t + O_{δ,t,d}(1/√n),

for absolute constants C1, C2, where α is a d-dimensional multi-index consisting of all d-tuples of nonnegative integers summing to at most t, P denotes the distribution that places mass 1/n at value pi = (pi,1, . . .
, pi,d) ∈ [0, 1]^d, and || · ||_W denotes the d-dimensional Wasserstein distance between P and Q.

1.1 Related Work

The seminal paper of Charles Stein [14] was one of the earliest papers to identify the surprising possibility of leveraging the availability of independent data reflecting a large number of parameters of interest, to partially compensate for having little information about each parameter. The specific setting examined considered the problem of estimating a list of unknown means, µ1, . . . , µn, given access to n independent Gaussian random variables, X1, . . . , Xn, with Xi ∼ N(µi, 1). Stein showed, perhaps surprisingly, that there is an estimator for the list of parameters µ1, . . . , µn that has smaller expected squared error than the naive unbiased empirical estimates µ̂i = Xi. This improved estimator "shrinks" the empirical estimates towards the average of the Xi's. In our setting, the process of recovering the set/histogram of unknown pi's and then leveraging this recovered set as a prior to correct the empirical estimates of each pi can be viewed as an analog of Stein's "shrinkage", and will have the property that the empirical estimates are shifted (in a non-linear fashion) towards the average of the pi's.

More closely related to the problem considered in this paper is the work on recovering an approximation to the unlabeled set of probabilities of domain elements, given independent draws from a distribution of large discrete support (see e.g. [11, 2, 15, 16, 1]). Instead of learning the distribution, these works considered the alternate goal of simply returning an approximation to the multiset of probabilities with which the domain elements arise, but without specifying which element occurs with which probability.
Such a multiset can be used to estimate useful properties of the distribution that do not depend on the labels of the domain of the distribution, such as the entropy or support size of the distribution, or the number of elements likely to be observed in a new, larger sample [12, 17]. The benefit of pursuing this weaker goal of returning the unlabeled multiset is that it can be learned to significantly higher accuracy for a given sample size—essentially as accurate as the empirical distribution of a sample that is a logarithmic factor larger [15, 17].

Building on the above work, the recent work [18] considered the problem of recovering the "frequency spectrum" of rare genetic variants. This problem is similar to the problem we consider, but focuses on a rather different regime. Specifically, the model considered posits that each location i = 1, . . . , n in the genome has some probability pi of being mutated in a given individual. Given the sequences of t individuals, the goal is to recover the set of pi's. The work [18] focused on the regime in which many of the pi's are significantly less than 1/(nt), and hence correspond to mutations that have never been observed; one conclusion of that work was that one can accurately estimate the number of such rare mutations that would be discovered in larger sequencing cohorts. Our work, in contrast, focuses on the regime where the pi's are constant, and do not scale as a function of n, and the results are incomparable.

Also related to the current work are the works [9, 10] on testing whether certain properties of collections of distributions hold.
The results of these works show that specific properties, such as whether most of the distributions are identical versus have significant variation, can be decided based on a sample size that is significantly sublinear in the number of distributions.

Finally, the papers [5, 6] consider the related but more difficult setting of learning "Poisson Binomials," namely a sum of independent non-identical Bernoulli random variables, given access to samples. In contrast to our work, in the setting they consider, each "sample" consists of only the sum of these n random variables, rather than observing the outcome of each random variable.

1.2 Organization of paper

In Section 2 we describe the two components of our algorithm for recovering the population of Bernoulli parameters: obtaining accurate estimates of the low-order moments (Section 2.1), and leveraging those moments to recover the set of parameters (Section 2.3). The complete algorithm is presented in Section 2.2, and a discussion of the multi-dimensional extension to which Theorem 2 applies is described in Section 2.4. In Section 3 we validate the empirical performance of our approach on synthetic data, as well as illustrate its potential applications to several real-world settings.

2 Learning a population of binomial parameters

Our approach to recovering the underlying distribution or set of pi's proceeds via the method of moments. In the following section we show that, given ≥ t samples from each Bernoulli distribution, we can accurately estimate each of the first t moments. In Section 2.3 we explain how these first t moments can then be leveraged to recover the set of pi's, to earth mover's distance O(1/t).

2.1 Moment estimation

Our method-of-moments approach proceeds by estimating the first t moments of P, namely (1/n) Σ_{i=1}^{n} pi^k, for each integer k between 1 and t.
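Such moments can be estimated without observing the pi's directly; as a preview of the estimator developed below, here is a minimal numerical sanity check (the parameters and the uniform choice of pi's are purely illustrative):

```python
import numpy as np
from scipy.special import comb

rng = np.random.default_rng(0)
n, t = 200_000, 5
p = rng.uniform(0.2, 0.8, size=n)      # hidden parameters (synthetic)
X = rng.binomial(t, p)                 # observed counts, X_i ~ Binomial(t, p_i)

def moment_estimates(X, t):
    """Unbiased estimates of (1/n) * sum_i p_i^k for k = 1..t,
    using the identity E[C(X_i, k)] = C(t, k) * p_i^k."""
    return np.array([np.mean(comb(X, k) / comb(t, k)) for k in range(1, t + 1)])

beta = moment_estimates(X, t)
true_moments = np.array([np.mean(p ** k) for k in range(1, t + 1)])
print(np.round(beta, 3))
print(np.round(true_moments, 3))       # the two vectors agree closely for large n
```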
The estimator we describe is unbiased, and also applies in the setting of Corollary 1 where each pi is drawn i.i.d. from a distribution Ppop. In this case, we will obtain an unbiased estimator for E_{p←Ppop}[p^k]. We limit ourselves to estimating the first t moments because, as shown in the proof of the lower bound, Proposition 1, the distribution of the Xi's is determined by the first t moments, and hence no additional information can be gleaned regarding the higher moments.

For 1 ≤ k ≤ t, our estimate for the kth moment is βk = (1/n) Σ_{i=1}^{n} C(Xi, k) / C(t, k). The motivation for this unbiased estimator is the following: Note that given any k i.i.d. samples of a variable distributed according to Bernoulli(pi), an unbiased estimator for pi^k is their product, namely the estimator which is 1 if all the tosses come up heads, and otherwise is 0. Thus, if we average over all C(t, k) subsets of size k, and then average over the population, we still derive an unbiased estimator.

Lemma 1. Given {p1, . . . , pn}, let Xi denote the random variable distributed according to Binomial(t, pi). For k ∈ {1, . . . , t}, let αk = (1/n) Σ_{i=1}^{n} pi^k denote the kth true moment, and βk = (1/n) Σ_{i=1}^{n} C(Xi, k) / C(t, k) denote our estimate of the kth moment. Then E[βk] = αk, and Pr(|βk − αk| ≥ ε) ≤ 2e^{−(1/3) n ε²}.

Given the above lemma, we obtain the fact that, with probability at least 1 − δ, the events |αk − βk| ≤ sqrt(3 ln(2t/δ) / n) simultaneously occur for all k ∈ {1, . . .
, t}.

2.2 Distribution recovery from moment estimates

Given the estimates of the moments of the distribution P, as described above, our algorithm will recover a distribution, Q, whose moments are close to the estimated moments. We propose two algorithms, whose distribution recoveries proceed via standard linear programming or quadratic programming approaches and which recover a distribution Q supported on some (sufficiently fine) ε-net of [0, 1]: the variables of the linear (or quadratic) program correspond to the amount of probability mass that Q assigns to each element of the ε-net, the constraints correspond to ensuring that the amount of mass at each element is nonnegative and that the total amount of mass is 1, and the objective function corresponds to the (possibly weighted) sum of the discrepancies between the estimated moments and the moments of the distribution represented by Q.

To see why it suffices to solve this program over an ε-net of the unit interval, note that any distribution over [0, 1] can be rounded so as to be supported on an ε-net, while changing the distribution by at most ε/2 in Wasserstein distance. Additionally, such a rounding alters each moment by at most O(ε), because the rounding alters the individual contributions of point masses to the kth moment by only O(εk) < O(ε). As our goal is to recover a distribution with distance O(1/t), it suffices to choose an ε-net with ε ≪ 1/t so that the additional error due to this discretization is negligible. As this distribution recovery program has O(1/ε) variables and O(t) constraints, both of which are independent of n, this program can be solved extremely efficiently both in theory and in practice.

We formally describe this algorithm below, which takes as input X1, . . .
, Xn, binomial parameter t, an integer m corresponding to the size of the ε-net, and a weight vector w.

Algorithms 1 and 2: Distribution Recovery with Linear / Quadratic Objectives
Input: Integers X1, . . . , Xn, integers t and m, and weight vector w ∈ R^t.
Output: Vector q = (q0, . . . , qm) of length m + 1, representing a distribution with probability mass qi at value i/m.

• For each k ∈ {1, . . . , t}, compute βk = (1/n) Σ_i C(Xi, k) / C(t, k).
• (Algorithm 1) Solve the linear program over variables q0, . . . , qm:
  minimize: Σ_{k=1}^{t} |β̂k − βk| wk, where β̂k = Σ_{i=0}^{m} qi (i/m)^k,
  subject to: Σ_i qi = 1, and for all i, qi ≥ 0.
• (Algorithm 2) Solve the quadratic program over variables q0, . . . , qm:
  minimize: Σ_{k=1}^{t} (β̂k − βk)² wk², where β̂k = Σ_{i=0}^{m} qi (i/m)^k,
  subject to: Σ_i qi = 1, and for all i, qi ≥ 0.

2.2.1 Practical considerations

Our theoretical results, Theorem 1 and Corollary 1, apply to the setting where the weight vector w in the above linear program objective function has wk = 1 for all k. It makes intuitive sense to penalize the discrepancy in the kth moment inversely proportionally to the empirically estimated standard deviation of the kth moment estimate, and our empirical results are based on such a weighted objective.

Additionally, in some settings we observed an empirical improvement in the robustness and quality of the recovered distribution if one averages the results of running Algorithm 1 or 2 on several random subsamples of the data.
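A minimal end-to-end sketch of the linear-programming variant (Algorithm 1) with uniform weights wk = 1, using SciPy's LP solver; the absolute values in the objective are handled with the standard slack-variable construction, and the data, grid size, and parameter distribution are illustrative assumptions:

```python
import numpy as np
from scipy.special import comb
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, t, m = 100_000, 5, 100
p = rng.uniform(0.2, 0.8, size=n)               # hidden parameters (synthetic)
X = rng.binomial(t, p)                          # one Binomial(t, p_i) draw each

# Unbiased moment estimates beta_k = (1/n) sum_i C(X_i, k) / C(t, k).
beta = np.array([np.mean(comb(X, k) / comb(t, k)) for k in range(1, t + 1)])

# Moment matrix of the grid {0, 1/m, ..., 1}: V[k-1, i] = (i/m)^k.
grid = np.arange(m + 1) / m
V = np.vstack([grid ** k for k in range(1, t + 1)])

# LP variables: q_0..q_m, then slacks e_1..e_t with e_k >= |hat(beta)_k - beta_k|.
c = np.concatenate([np.zeros(m + 1), np.ones(t)])
A_ub = np.block([[V, -np.eye(t)], [-V, -np.eye(t)]])
b_ub = np.concatenate([beta, -beta])
A_eq = np.concatenate([np.ones(m + 1), np.zeros(t)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], method="highs")
q = res.x[: m + 1]                              # recovered histogram on the grid

# Earth mover's distance to the truth, as the l1 distance between CDFs.
F_q = np.cumsum(q)
F_p = np.searchsorted(np.sort(p), grid, side="right") / n
print(res.success, np.sum(np.abs(F_q - F_p)) / m)
```

The quadratic variant (Algorithm 2) replaces the slack construction with a least-squares objective under the same simplex constraints.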
In our empirical section, Section 3, we refer to this as a bootstrapped version of our algorithm.

2.3 Close moments imply close distributions

In this section we complete the high-level proof that Algorithm 1 accurately recovers P, the distribution corresponding to the set of pi's, establishing Theorem 1 and Corollary 1. The guarantees of Lemma 1 ensure that, with high probability, the estimated moments will be close to the true moments. Together with the observation that discretizing P to be supported on an ε-net of [0, 1] alters the moments by O(ε), it follows that there is a solution to the linear program in the second step of Algorithm 1 corresponding to a distribution whose moments are close to the true moments of P, and hence with high probability Algorithm 1 will return such a distribution.

To conclude the proof, all that remains is to show that, provided the distribution Q returned by Algorithm 1 has similar first t moments to the true distribution P, then P and Q will be close in Wasserstein (earth mover's) distance. We begin by formally defining the Wasserstein (earth mover's) distance between two distributions P and Q:

Definition 1. The Wasserstein, or earth mover's, distance between distributions P, Q is ||P − Q||_W := inf_{γ∈Γ(P,Q)} ∫_{[0,1]^{2d}} d(x, y) dγ(x, y), where Γ(P, Q) is the set of all couplings of P and Q, namely distributions whose marginals agree with the two distributions. The equivalent dual definition is ||P − Q||_W := sup_{g∈Lip(1)} ∫ g(x) d(P − Q)(x), where the supremum is taken over Lipschitz functions g.

As its name implies, this distance metric can be thought of as the cost of the optimal scheme of "moving" the probability mass from P to create Q, where the cost per unit mass of moving from probability x to y is |x − y|.
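A quick numerical check of this metric for distributions on the real line, assuming SciPy is available (the two sample sets are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 1000)
y = rng.beta(2, 2, 1000)

# Direct computation of the 1-Wasserstein (earth mover's) distance.
w = wasserstein_distance(x, y)

# Equivalent l1 distance between the empirical CDFs, via a fine Riemann sum.
grid = np.linspace(0, 1, 10_001)
Fx = np.searchsorted(np.sort(x), grid, side="right") / x.size
Fy = np.searchsorted(np.sort(y), grid, side="right") / y.size
w_cdf = np.sum(np.abs(Fx - Fy)[:-1]) * (grid[1] - grid[0])

print(w, w_cdf)  # the two quantities agree up to grid discretization
```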
For distributions over R, it is not hard to see that this distance is exactly the ℓ1 distance between the associated cumulative distribution functions.

The following slightly stronger version of Proposition 1 in [7] bounds the Wasserstein distance between any pair of distributions in terms of the discrepancies in their low-order moments:

Theorem 3. For two distributions P and Q supported on [0, 1] whose first t moments are α and β respectively, the Wasserstein distance ||P − Q||_W is bounded by π/t + 3^t Σ_{k=1}^{t} |αk − βk|.

The formal proof of this theorem is provided in Appendix A, and we conclude this section with an intuitive sketch of this proof. For simplicity, first consider the setting where the two distributions P, Q have the exact same first t moments. This immediately implies that for any polynomial f of degree at most t, the expectation of f with respect to P is equal to the expectation of f with respect to Q. Namely, ∫ f(x)(P(x) − Q(x))dx = 0. Leveraging the definition of Wasserstein distance, ||P − Q||_W = sup_{g∈Lip} ∫ g(x)(P(x) − Q(x))dx, the theorem now follows from the standard fact that, for any Lipschitz function g, there exists a degree t polynomial fg that approximates it to within ℓ∞ distance O(1/t) on the interval [0, 1].

If there is nonzero discrepancy between the first t moments of P and Q, the above proof continues to hold, with an additional error term of Σ_{k=1}^{t} ck(αk − βk), where ck is the coefficient of the degree k term in the polynomial approximation fg.
Leveraging the fact that any Lipschitz function g can be approximated to ℓ∞ distance O(1/t) on the unit interval using a polynomial with coefficients bounded by 3^t, we obtain Theorem 3.

2.4 Extension: multivariate distribution estimation

We also consider the natural multivariate extension of the problem of recovering a population of Bernoulli parameters. Suppose, for example, that every member i of a population of size n has two associated binomial parameters p(i,1), p(i,2), as in Theorem 2. One could estimate the marginal distribution of the p(i,1) and p(i,2) separately using Algorithm 1, but it is natural to also want to estimate the joint distribution up to small Wasserstein distance in the 2-d sense. Similarly, one can consider the analogous d-dimensional distribution recovery question.

The natural idea underlying our extension to this setting is to include estimates of the multivariate moments represented by multi-indices α with |α| ≤ t. For example, in a 2-d setting, the moments for members i of the population would look like E_{pi∼P}[p(i,1)^a p(i,2)^b]. Again, it remains to bound how close an interpolating polynomial can get to any d-dimensional Lipschitz function, and bound the size of the coefficients of such a polynomial. To this end, we use the following theorem from [3]:

Lemma 2. Given any Lipschitz function f supported on [0, 1]^d, there is a degree t polynomial p(x) such that sup_{x∈[0,1]^d} |p(x) − f(x)| ≤ Cd/t, where Cd is a constant that depends on d.

In Appendix D, we prove the following bound on the magnitude of the coefficients of the interpolating polynomial: |cα| ≤ (2t)^d 2^t 3^{|α|}, where cα is the coefficient of the α multinomial term.
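A sketch of how such mixed moments could be estimated unbiasedly in the 2-d case, multiplying the per-coordinate unbiased estimators (valid because X(i,1) and X(i,2) are independent given the parameters); the correlated parameter model below is an illustrative assumption, not data from the paper:

```python
import numpy as np
from itertools import product
from scipy.special import comb

rng = np.random.default_rng(2)
n, t = 100_000, 5
# Hypothetical correlated pair of parameters per individual.
p1 = rng.uniform(0.1, 0.9, n)
p2 = np.clip(p1 + rng.normal(0, 0.05, n), 0, 1)
X1, X2 = rng.binomial(t, p1), rng.binomial(t, p2)

def mixed_moment(X1, X2, a, b, t):
    """Unbiased estimate of (1/n) * sum_i p_{i,1}^a * p_{i,2}^b, as the product
    of the per-coordinate estimators C(X, k) / C(t, k)."""
    return np.mean(comb(X1, a) / comb(t, a) * comb(X2, b) / comb(t, b))

for a, b in product(range(3), repeat=2):
    if 1 <= a + b <= t:
        est = mixed_moment(X1, X2, a, b, t)
        truth = np.mean(p1 ** a * p2 ** b)
        print((a, b), round(est, 3), round(truth, 3))
```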
Together with the concentration bound on the αth moment of the distribution, we obtain Theorem 2, the multivariate analog of Theorem 1.

3 Empirical performance

3.1 Recovering distributions with known ground truth

We begin by demonstrating the effectiveness of our algorithm on several synthetic datasets. We considered three different choices for an underlying distribution Ppop over [0, 1], then drew n independent samples p1, . . . , pn ← Ppop. For a parameter t, for each i ∈ {1, . . . , n}, we then drew Xi ← Binomial(t, pi), ran our population estimation algorithm on the set X1, . . . , Xn, and measured the extent to which we recovered the distribution Ppop. In all settings, n was sufficiently large that there was little difference between the histogram corresponding to the set {p1, . . . , pn} and the distribution Ppop. Figure 1 depicts the error of the recovered distribution as t takes on all even values from 2 to 14, for three choices of Ppop: the "3-spike" distribution with equal mass at the values 1/4, 1/2, and 3/4, a Normal distribution truncated to be supported on [0, 1], and the uniform distribution over [0, 1].

Figure 1: Earth mover's distance (EMD) between the true underlying distribution Ppop and the distribution recovered by Algorithm 2 for three choices of Ppop: (a) the distribution consisting of equally weighted point masses at locations 1/4, 1/2, and 3/4; (b) the normal distribution with mean 0.5 and standard deviation 0.15, truncated to be supported on [0, 1]; and (c) the uniform distribution over [0, 1]. For each underlying distribution, we plot the EMD (median over 20 trials) between Ppop and the distribution recovered with Algorithm 2 as t, the number of samples from each of the n Bernoulli random variables, takes on all even values from 2 to 14. These results are given for n = 10,000 (green) and n = 100,000 (blue). For comparison, the distance between Ppop and the histogram of the empirical probabilities for n = 100,000 is also shown (red).

Figure 2 shows representative plots of the CDFs of the recovered histograms and empirical histograms for each of the three choices of Ppop considered above.

Figure 2: CDFs of the true distribution P (green), the histogram recovered by Algorithm 2 (blue) for P, and the empirical histogram (red) corresponding to t = 10 samples and n = 100,000. Note that the empirical distribution is only supported on multiples of 1/10.

We also considered recovering the distribution of probabilities that different flights are delayed (i.e. each flight—for example Delta Airlines 123—corresponds to a parameter p ∈ [0, 1] representing the probability that flight is delayed on a given day). Our algorithm was able to recover this non-parametric distribution of flight delay parameters extremely well based on few (≤ 10) data points per flight. In this setting, we had access to a dataset with > 50 datapoints per flight, and hence could compare the recovered distribution to a close approximation of the ground truth distribution. These results are included in the appendix.

3.2 Distribution of offspring sex ratios

One of the motivating questions for this work was the following naive sounding question: do all members of a given species have the same propensity of giving birth to a male vs female child, or is there significant variation in this probability across individuals? For a population of n individuals, letting pi represent the probability that a future child of the ith individual is male, this question is precisely the question of characterizing the histogram or set of the pi's.
This question of the uniformity of the pi's has been debated both by the popular science community (e.g. the recent BBC article "Why Billionaires Have More Sons"), and more seriously by the biology community. Meiosis ensures that each male produces the same number of spermatozoa carrying the X chromosome as carrying the Y chromosome. Nevertheless, some studies have suggested that the difference in the amounts of genetic material in these chromosomes results in (slight) morphological differences between the corresponding spermatozoa, which in turn result in differences in their motility (speed of movement), etc. (see e.g. [4, 13]). Such studies have led to a chorus of speculation that the relative timing of ovulation and intercourse correlates with the sex of offspring.
While it is problematic to tackle this problem in humans (for a number of reasons, including sex-selective abortions), we instead consider this question for dogs. Letting pi denote the probability that each puppy in the ith litter is male, we could hope to recover the distribution of the pi's. If this sex ratio varies significantly according to the specific parents involved, or according to the relative timing of ovulation and intercourse, then such variation would be evident in the pi's. Conveniently, a typical dog litter consists of 4–8 puppies, allowing our approach to recover this distribution based on accurate estimates of the first few moments.
Based on a dataset of n ≈ 8,000 litters, compiled by the Norwegian Kennel Club, we produced estimates of the first 10 moments of the distribution of pi's by considering only litters consisting of at least 10 puppies. Our algorithm suggests that the distribution of the pi's is indistinguishable from a spike at 1/2, given the size of the dataset.
Indeed, this conclusion is evident based even on the estimates of the first two moments: (1/n) Σi pi ≈ 0.497 and (1/n) Σi pi² ≈ 0.249, since among distributions over [0, 1] with expectation 1/2, the distribution consisting of a point mass at 1/2 has minimal second moment, equal to 0.25, and these two moments robustly characterize this distribution. (For example, any distribution supported on [0, 1] with mean 1/2 and for which > 10% of the mass lies outside the range (0.45, 0.55) must have second moment at least 0.25025, though reliably resolving such small variation would require a slightly larger dataset.)

3.3 Political tendencies on a county level

We performed a case study on the political leanings of counties. We assumed the following model: each of the n = 3,116 counties in the US has an intrinsic "political-leaning" parameter pi denoting its likelihood of voting Republican in a given election. We observe t = 8 independent samples of each parameter, corresponding to whether each county went Democratic or Republican in the 8 presidential elections from 1976 to 2004.

(a) CDF recovered from 6 moments (blue), empirical CDF (red)

(b) CDF recovered from 8 moments (blue), empirical CDF (red)

Figure 3: Output of bootstrapping Algorithm 2 on political data for n = 3,116 counties over t = 8 elections.

3.4 Game-to-game shooting of NBA players

We performed a case study on the scoring probabilities of two NBA players.
One can think of this experiment as asking whether NBA players, game-to-game, have differences in their intrinsic ability to score field goals (in the sports analytics world, this is the idea of "hot/cold" shooting nights). The model for each player is as follows: for the ith basketball game there is some parameter pi representing the player's latent shooting percentage for that game, perhaps varying according to the opposing team's defensive strategy. The empirical shooting percentage of a player varies significantly from game to game—recovering the underlying distribution or histogram of the pi's allows one to directly estimate the consistency of a player. Additionally, such a distribution could be used as a prior for making decisions during games. For example, conditioned on the performance during the first half of a game, one could update the expected fraction of subsequent shots that are successful.
The dataset used was the per-game 3-point shooting percentage of players, with sufficient statistics of "3-pointers made" and "3-pointers attempted" for each game. To generate estimates of the kth moment, we considered games where at least k 3-pointers were attempted.
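The kth-moment estimates referred to above (hence the restriction to games with at least k attempts) can be formed without bias via the binomial identity E[C(X, k)] = C(t, k)·p^k for X ∼ Binomial(t, p). A minimal sketch, with a hypothetical synthetic population standing in for the NBA data:

```python
import numpy as np
from math import comb

def kth_moment_estimate(xs, ts, k):
    # Unbiased estimate of the k-th moment E[p^k] from X_i ~ Binomial(t_i, p_i),
    # using E[C(X, k)] = C(t, k) * p^k. Only entities with t_i >= k contribute,
    # mirroring the restriction to games with at least k attempts.
    vals = [comb(int(x), k) / comb(int(t), k) for x, t in zip(xs, ts) if t >= k]
    return sum(vals) / len(vals)

rng = np.random.default_rng(1)
n = 50_000
p = rng.beta(8, 8, size=n)           # hypothetical per-game shooting probabilities
t = rng.integers(4, 12, size=n)      # number of attempts varies by game
x = rng.binomial(t, p)

estimates = [kth_moment_estimate(x, t, k) for k in range(1, 5)]
truth = [float(np.mean(p ** k)) for k in range(1, 5)]
```

Note that dropping games with fewer than k attempts is harmless here because t is drawn independently of p; on real data, any correlation between attempts and shooting ability would need separate consideration.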
The players chosen were Stephen Curry of the Golden State Warriors (who is considered a very consistent shooter) and Danny Green of the San Antonio Spurs (whose nickname "Icy Hot" gives a good idea of his suspected consistency).

(a) Estimated CDF of Curry's game-to-game shooting percentage (blue), empirical CDF (red), n = 457 games.

(b) Estimated CDF of Green's game-to-game shooting percentage (blue), empirical CDF (red), n = 524 games.

Figure 4: Estimates produced by the bootstrapped version of Algorithm 2 on the NBA dataset, using 8 moments.

Acknowledgments

We thank Kaja Borge and Ane Nødtvedt for sharing an anonymized dataset on sex composition of dog litters, based on data collected by the Norwegian Kennel Club. This research was supported by NSF CAREER Award CCF-1351108, ONR Award N00014-17-1-2562, NSF Graduate Fellowship DGE-1656518, and a Google Faculty Fellowship.

References

[1] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for optimal distribution property estimation. arXiv preprint arXiv:1611.02960, 2016.

[2] Jayadev Acharya, Alon Orlitsky, and Shengjun Pan. Recent results on pattern maximum likelihood. In IEEE Information Theory Workshop (ITW), pages 251–255. IEEE, 2009.

[3] Thomas Bagby, Len Bos, and Norman Levenberg. Multivariate simultaneous approximation. Constructive Approximation, 18(4), 2002.

[4] P. Barlow and C.G. Vosa. The Y chromosome in human spermatozoa. Nature, 226:961–962, 1970.

[5] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson binomial distributions. Algorithmica, 72(1):316–357, 2015.

[6] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Properly learning Poisson binomial distributions in almost polynomial time.
In Conference on Learning Theory, pages 850–878, 2016.

[7] Weihao Kong and Gregory Valiant. Spectrum estimation from samples. arXiv preprint arXiv:1602.00061, 2016.

[8] Nikolai Pavlovich Korneichuk. Exact constants in approximation theory, volume 38. Cambridge University Press, 1991.

[9] Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(8):295–347, 2013.

[10] Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing similar means. SIAM J. Discrete Math., 28(4):1699–1724, 2014.

[11] Alon Orlitsky, Narayana P. Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 426–435. AUAI Press, 2004.

[12] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, page 201607774, 2016.

[13] L.M. Penfold, C. Holt, W.V. Holt, G.R. Welch, D.G. Cran, and L.A. Johnson. Comparative motility of X and Y chromosome-bearing bovine sperm separated on the basis of DNA content by flow sorting. Molecular Reproduction and Development, 50(3):323–327, 1998.

[14] Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 197–206, Berkeley, Calif., 1956. University of California Press.

[15] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.

[16] Gregory Valiant and Paul Valiant.
Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

[17] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 142–155. ACM, 2016.

[18] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7, 2016.