{"title": "Density Estimation via Discrepancy Based Adaptive Sequential Partition", "book": "Advances in Neural Information Processing Systems", "page_first": 1091, "page_last": 1099, "abstract": "Given $iid$ observations from an unknown continuous distribution defined on some domain $\\Omega$, we propose a nonparametric method to learn a piecewise constant function to approximate the underlying probability density function. Our density estimate is a piecewise constant function defined on a binary partition of $\\Omega$. The key ingredient of the algorithm is to use discrepancy, a concept originates from Quasi Monte Carlo analysis, to control the partition process. The resulting algorithm is simple, efficient, and has provable convergence rate. We demonstrate empirically its efficiency as a density estimation method. We also show how it can be utilized to find good initializations for k-means.", "full_text": "Density Estimation via Discrepancy Based\n\nAdaptive Sequential Partition\n\nDangna Li\n\nICME,\n\nStanford University\nStanford, CA 94305\n\ndangna@stanford.edu\n\nKun Yang\n\nGoogle\n\nMountain View, CA 94043\nkunyang@stanford.edu\n\nWing Hung Wong\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\nwhwong@stanford.edu\n\nAbstract\n\nGiven iid observations from an unknown absolute continuous distribution de\ufb01ned\non some domain \u2126, we propose a nonparametric method to learn a piecewise\nconstant function to approximate the underlying probability density function. Our\ndensity estimate is a piecewise constant function de\ufb01ned on a binary partition of\n\u2126. The key ingredient of the algorithm is to use discrepancy, a concept originates\nfrom Quasi Monte Carlo analysis, to control the partition process. The resulting\nalgorithm is simple, ef\ufb01cient, and has a provable convergence rate. We empirically\ndemonstrate its ef\ufb01ciency as a density estimation method. 
We also show how it can\nbe utilized to \ufb01nd good initializations for k-means.\n\n1\n\nIntroduction\n\nDensity estimation is one of the fundamental problems in statistics. Once an explicit estimate of the\ndensity function is constructed, various kinds of statistical inference tasks follow naturally. Given iid\nobservations, our goal in this paper is to construct an estimate of their common density function via a\nnonparametric domain partition approach.\nAs pointed out in [1], for density estimation, the bias due to the limited approximation power of a\nparametric family will become dominant in the over all error as the sample size grows. Hence it is\nnecessary to adopt a nonparametric approach to handle this bias. The kernel density estimation [2]\nis a popular nonparametric density estimation method. Although in theory it can achieve optimal\nconvergence rate when the kernel and the bandwidth are appropriately chosen, its result can be\nsensitive to the choice of bandwidth, especially in high dimension. In practice, kernel density\nestimation is typically not applicable to problems of dimension higher than 6.\nAnother widely used nonparametric density estimation method in low dimension is the histogram. But\nsimilarly with kernel density estimation, it can not be scaled easily to higher dimensions. Motivated\nby the usefulness of histogram and the need for a method to handle higher dimensional cases, we\npropose a novel nonparametric density estimation method which learns a piecewise constant density\nfunction de\ufb01ned on a binary partition of domain \u2126.\nA key ingredient for any partition based method is the decision for stopping. Based on the observation\nthat for any piecewise constant density, the distribution conditioned on each sub-region is uniform,\nwe propose to use star discrepancy, which originates from analysis of Quasi-Monte Carlo methods,\nto formally measure the degree of uniformity. 
We will see in section 4 that this allows our density estimator to have a near optimal convergence rate.

In summary, we highlight our contributions as follows:

• To the best of our knowledge, our method is the first density estimation method that utilizes Quasi-Monte Carlo techniques.
• We provide an error analysis of binary partition based density estimation methods. We establish an $O(n^{-1/2})$ error bound for the density estimator. The result is optimal in the sense that essentially all Monte Carlo methods have the same convergence rate. Our simulation results support the tightness of this bound.
• One of the advantages of our method over existing ones is its efficiency. We demonstrate in section 5 that our method has comparable accuracy to other methods in terms of Hellinger distance while achieving an approximately $10^2$-fold speed up.
• Our method is a general data exploration tool and is readily applicable to many important learning tasks. Specifically, we demonstrate in section 5.3 how it can be used to find good initializations for k-means.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Related work

Existing domain partition based density estimators can be divided into two categories. The first category belongs to the Bayesian nonparametric framework. Optional Pólya Tree (OPT) [3] is a class of nonparametric conjugate priors on the set of piecewise constant density functions defined on some partition of Ω. Bayesian Sequential Partitioning (BSP) [1] was introduced as a computationally more attractive alternative to OPT. Inference for both methods is performed by sampling from the posterior distribution of density functions. Our improvement over these two methods is two-fold. First, we no longer restrict the binary partition to be always at the middle. 
By introducing a new statistic called the "gap", we allow the partitions to be adaptive to the data. Second, our method does not stem from a Bayesian origin and proceeds in a top down, greedy fashion. This makes our method computationally much more attractive than OPT and BSP, whose inference can be quite computationally intensive.

The second category is tree based density estimators [4] [5]. As an example, Density Estimation Trees [5] are a generalization of classification trees and regression trees to the task of density estimation. Their tree based origin has led to a loss minimization perspective: the learning of the tree is done by minimizing the integrated squared error. However, the true loss function can only be approximated by a surrogate, and the optimization problem is difficult to solve. The objective of our method is much simpler and leads to an intuitive and efficient algorithm.

3 Main algorithm

3.1 Notation and definitions

In this paper we consider the problem of estimating a joint density function f from a given set of observations. Without loss of generality, we assume the data domain is $\Omega = [0, 1]^d$, a hyper-rectangle in $\mathbb{R}^d$. We use the shorthand notation $[a, b] = \prod_{j=1}^{d} [a_j, b_j]$ to denote a hyper-rectangle in $\mathbb{R}^d$, where $a = (a_1, \cdots, a_d), b = (b_1, \cdots, b_d) \in [0, 1]^d$. Each $(a_j, b_j)$ pair specifies the lower and upper bounds of the hyper-rectangle along dimension j.

We restrict our attention to the class of piecewise constant functions after balancing the trade-off between simplicity and representational power: ideally, we would like the function class to have a concise representation while at the same time allowing for efficient evaluation. On the other hand, we would like to be able to approximate any continuous density function arbitrarily well (at least as the sample size goes to infinity). 
This trade-off has led us to choose the set of piecewise constant functions supported on binary partitions. First, we only need $2d + 1$ floating point numbers to uniquely define a sub-rectangle ($2d$ for its location and 1 for its density value). Second, it is well known that the set of positive, integrable, piecewise constant functions is dense in $L^p$ for $p \in [1, \infty)$.

The binary partition we consider can be defined in the following recursive way: start with $\mathcal{P}_0 = \Omega$. Suppose we have a binary partition $\mathcal{P}_t = \{\Omega^{(1)}, \cdots, \Omega^{(t)}\}$ at level t, where $\cup_{i=1}^{t} \Omega^{(i)} = \Omega$ and $\Omega^{(i)} \cap \Omega^{(j)} = \emptyset$ for $i \neq j$. A level $t + 1$ partition $\mathcal{P}_{t+1}$ is obtained by dividing one sub-rectangle $\Omega^{(i)}$ in $\mathcal{P}_t$ along one of its coordinates, with the cut parallel to one of the dimensions. See Figure 1 for an illustration.

3.2 Adaptive partition and discrepancy control

The above recursive build up has two key steps. The first is to decide whether to further split a sub-rectangle. One helpful intuition is that for piecewise constant densities, the distribution conditioned on each sub-rectangle is uniform. Therefore the partition should stop when the points inside a sub-rectangle are approximately uniformly scattered. In other words, we stop the partition when further partitioning does not reveal much additional information about the underlying density landscape.

Figure 1: Left: a sequence of binary partitions and the corresponding tree representation; if we encode partitioning information (e.g., the location where the split occurs) in the nodes, there is a one to one mapping between the tree representations and the partitions. Right: the gaps with m = 3; we split the rectangle at location D, which corresponds to the largest gap (assuming it does not satisfy (2); see the text for more details).
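The recursive construction above is easy to mirror in code; a minimal sketch, where the `Rect` container and `split` helper are our own illustrative names, not part of the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rect:
    """Axis-aligned hyper-rectangle [a, b] = prod_j [a_j, b_j]."""
    a: List[float]  # lower bounds
    b: List[float]  # upper bounds

    def volume(self) -> float:
        v = 1.0
        for lo, hi in zip(self.a, self.b):
            v *= hi - lo
        return v

def split(r: Rect, dim: int, loc: float) -> Tuple[Rect, Rect]:
    """Divide r along coordinate `dim` at position `loc`: one level-(t+1) step."""
    assert r.a[dim] < loc < r.b[dim]
    left = Rect(list(r.a), r.b[:dim] + [loc] + r.b[dim + 1:])
    right = Rect(r.a[:dim] + [loc] + r.a[dim + 1:], list(r.b))
    return left, right

# Level-1 partition of Omega = [0,1]^2, then refine one piece: a level-3 partition.
omega = Rect([0.0, 0.0], [1.0, 1.0])
p1 = list(split(omega, dim=0, loc=0.5))
p2 = p1[:1] + list(split(p1[1], dim=1, loc=0.25))
```

At every level the sub-rectangles remain disjoint and still tile Ω, which is what makes the piecewise constant estimate a proper density once each piece is assigned its empirical mass.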
We propose to use star discrepancy, a concept originating from the analysis of Quasi-Monte Carlo methods, to formally measure the degree of uniformity of points in a sub-rectangle. Star discrepancy is defined as:

Definition 1. Given n points $X_n = \{x_1, ..., x_n\}$ in $[0, 1]^d$, the star discrepancy $D^*(X_n)$ is defined as:

$$D^*(X_n) = \sup_{a \in [0,1]^d} \Big| \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \in [0, a)\} - \prod_{j=1}^{d} a_j \Big| \quad (1)$$

The supremum is taken over all d-dimensional sub-rectangles $[0, a)$. Given the star discrepancy $D^*(X_n)$, we have the following error bound for Monte Carlo integration (see [6] for a proof):

Theorem 2 (Koksma-Hlawka inequality). Let $X_n = \{x_1, x_2, ..., x_n\}$ be a set of points in $[0, 1]^d$ with discrepancy $D^*(X_n)$, and let f be a function on $[0, 1]^d$ of bounded variation $V(f)$. Then,

$$\Big| \int_{[0,1]^d} f(x)\,dx - \frac{1}{n} \sum_{i=1}^{n} f(x_i) \Big| \le V(f) D^*(X_n)$$

where $V(f)$ is the total variation in the sense of Hardy and Krause (see [7] for its precise definition).

The above theorem implies that if the star discrepancy $D^*(X_n)$ is under control, the empirical distribution will be a good approximation to the true distribution. Therefore, we may decide to keep partitioning a sub-rectangle until its discrepancy is lower than some threshold. We shall see in section 4 that this provably guarantees our density estimate is a good approximation to the true density function.

Another important ingredient of all partition based methods is the choice of splitting point. 
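For small point sets, the supremum in (1) can be evaluated by scanning only anchors a whose coordinates come from the sample itself (or 1), counting points with both strict and non-strict comparisons to capture the two one-sided limits. A brute-force sketch, exponential in d and meant only for illustration (the function name is ours):

```python
from itertools import product

def star_discrepancy(points):
    """Brute-force D*(X_n) for points in [0,1]^d, following Definition (1).

    The sup over boxes [0, a) is approached at anchors built from the sample
    coordinates (plus 1); comparing the box volume against both the open
    count (x_ij < a_j) and the closed count (x_ij <= a_j) brackets the sup.
    """
    n, d = len(points), len(points[0])
    candidates = [sorted({p[j] for p in points} | {1.0}) for j in range(d)]
    best = 0.0
    for a in product(*candidates):
        vol = 1.0
        for aj in a:
            vol *= aj
        open_cnt = sum(all(x[j] < a[j] for j in range(d)) for x in points)
        closed_cnt = sum(all(x[j] <= a[j] for j in range(d)) for x in points)
        best = max(best, abs(open_cnt / n - vol), abs(closed_cnt / n - vol))
    return best
```

For the centered 1-D set {(2i-1)/(2n)} this returns 1/(2n), the smallest discrepancy achievable by n points in one dimension, matching the closed-form expression quoted in section 5.1.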
In order to find a good location to split $[a, b] = \prod_{j=1}^{d} [a_j, b_j]$, we divide the jth dimension into m equal-sized bins and keep track of the gaps at $a_j + (b_j - a_j)k/m$ for $k = 1, ..., m - 1$, where the gap $g_{jk}$ is defined as

$$g_{jk} = \Big| \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(x_{ij} < a_j + (b_j - a_j)k/m) - k/m \Big|, \quad k = 1, ..., m - 1;$$

in total $(m - 1)d$ gaps are recorded (Figure 1). Here m is a hyper-parameter chosen by the user. $[a, b]$ is split into two sub-rectangles along the dimension and location corresponding to the maximum gap (Figure 1). The pseudocode for the complete algorithm is given in Algorithm 1. We refer to this algorithm as DSP in the sequel. One distinct feature of DSP is that it only requires the user to specify two parameters, m and θ, where m is the number of bins along each dimension and θ is the parameter for discrepancy control (see condition (2) in Theorem 4 for more details). In some applications, the user may prefer putting an upper bound on the total number of partitions. In that case, there is typically no need to specify θ. Choices for these parameters are discussed in section 5.

The resulting density estimate $\hat{p}$ is a piecewise constant function defined on a binary partition of Ω: $\hat{p}(x) = \sum_{i=1}^{L} d(r_i) \mathbf{1}\{x \in r_i\}$, where $\mathbf{1}$ is the indicator function, L is the total number of sub-rectangles in the final partition, and $\{r_i, d(r_i)\}_{i=1}^{L}$ are the sub-rectangle and density pairs. We demonstrate in section 5 how $\hat{p}(x)$ can be leveraged to find good initializations for k-means. 
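The gap statistic of section 3.2 and the max-gap split choice transcribe directly; a sketch in plain Python (the function name and the first-encountered tie-breaking rule are our choices):

```python
def best_split(points, a, b, m):
    """Return (dim, location, gap) for the max-gap split of [a, b].

    gap g_jk = |(1/n) * #{i : x_ij < a_j + (b_j - a_j) * k / m} - k / m|,
    for j = 1..d and k = 1..m-1, as in Section 3.2.
    """
    n, d = len(points), len(a)
    best = (0, None, -1.0)  # (dim, loc, gap)
    for j in range(d):
        for k in range(1, m):
            cut = a[j] + (b[j] - a[j]) * k / m
            frac = sum(1 for x in points if x[j] < cut) / n
            gap = abs(frac - k / m)
            if gap > best[2]:
                best = (j, cut, gap)
    return best

# All mass sits in the left half of dimension 0: the largest gap is at its midpoint.
pts = [[0.1, u / 10] for u in range(10)]
dim, loc, gap = best_split(pts, a=[0.0, 0.0], b=[1.0, 1.0], m=2)
```

A large gap means the empirical mass on one side of a candidate cut deviates strongly from what a uniform distribution would put there, so cutting there is most informative about the density landscape.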
In the following section, we establish a convergence result for our density estimator.

[Figure 1 graphic: the recorded gaps are A: 1/60, B: 1/60, C: 2/60, D: 7/60.]

Algorithm 1 Density Estimation via Discrepancy Based Sequential Partition (DSP)
Input: $X_N$, m, θ
Output: A piecewise constant function Pr(·) defined on a binary partition R
Let Pr(r) denote the probability mass of region r ⊂ Ω; let $X_N(r)$ denote the points in $X_N$ that lie within r, where r ⊂ Ω. $n_i$ denotes the size of set $X^{(i)}$.
1: procedure DSP($X_N$, m, θ)
2:   R = {$[0, 1]^d$}, Pr($[0, 1]^d$) = 1
3:   while true do
4:     R' = ∅
5:     for each $r_i = [a^{(i)}, b^{(i)}]$ in R do
6:       Calculate the gaps $\{g_{jk}\}$, j = 1, ..., d, k = 1, ..., m − 1
7:       Scale $X(r_i) = \{x_{il}\}_{l=1}^{n_i}$ to $\tilde{X}^{(i)} = \{\tilde{x}_{il} = (\frac{x_{il,1} - a_1^{(i)}}{b_1^{(i)} - a_1^{(i)}}, ..., \frac{x_{il,d} - a_d^{(i)}}{b_d^{(i)} - a_d^{(i)}})\}_{l=1}^{n_i}$
8:       if $X(r_i) \neq \emptyset$ and $D^*(\tilde{X}^{(i)}) > \theta\sqrt{N}/n_i$ then    ▷ Condition (2) in Theorem 4
9:         Split $r_i$ into $r_{i1} = [a^{(i1)}, b^{(i1)}]$ and $r_{i2} = [a^{(i2)}, b^{(i2)}]$ along the max gap (Figure 1)
10:        Pr($r_{i1}$) = Pr($r_i$) $|X_N(r_{i1})| / n_i$, Pr($r_{i2}$) = Pr($r_i$) − Pr($r_{i1}$)
11:        R' = R' ∪ {$r_{i1}$, $r_{i2}$}
12:      else R' = R' ∪ {$r_i$}
13:    if R' ≠ R then R = R'
14:    else return R, Pr(·)

4 Theoretical results

Before we establish our main theorem, we need the following lemma:¹

Lemma 3. Let $D^*_n = \inf_{\{x_1,...,x_n\} \in [0,1]^d} D^*(x_1, ..., x_n)$. Then we have

$$D^*_n \le c \sqrt{\frac{d}{n}}$$

for all $n, d \in \mathbb{R}^+$, where c is some positive constant.

¹The proof for Lemma 3 can be found in [8]. Theorem 4 and Corollary 5 are proved in the supplementary material.

We now state our main theorem:

Theorem 4. Let f be a function defined on $\Omega = [0, 1]^d$ with bounded variation. Let $X_N = \{x_1, ..., x_N\} \subset \Omega$ and let $\{[a^{(i)}, b^{(i)}], i = 1, \cdots, L\}$ be a level L binary partition of Ω. Further denote by $X^{(i)} = \{x_j = (x_{j1}, ..., x_{jd}) \in X_N : x_j \in [a^{(i)}, b^{(i)}]\}$ the part of $X_N$ in sub-rectangle i, and let $n_i = |X^{(i)}|$. Suppose in each sub-rectangle $[a^{(i)}, b^{(i)}]$, $X^{(i)}$ satisfies

$$D^*(\tilde{X}^{(i)}) \le \alpha^{(i)} D^*_{n_i} \quad (2)$$

where $\tilde{X}^{(i)} = \{\tilde{x}_j = (\frac{x_{j1} - a_1^{(i)}}{b_1^{(i)} - a_1^{(i)}}, ..., \frac{x_{jd} - a_d^{(i)}}{b_d^{(i)} - a_d^{(i)}}) : x_j \in X^{(i)}\}$, $\alpha^{(i)} = \frac{\theta}{c} \sqrt{\frac{N}{n_i d}}$ for some positive constant θ, and $D^*_{n_i}$ is defined as in Lemma 3. Then

$$\Big| \int_{[0,1]^d} f(x) \hat{p}(x) dx - \frac{1}{N} \sum_{i=1}^{N} f(x_i) \Big| \le \frac{\theta}{\sqrt{N}} V(f) \quad (3)$$

where $\hat{p}(x)$ is the piecewise constant density estimator given by

$$\hat{p}(x) = \sum_{i=1}^{L} d_i \mathbf{1}\{x \in [a^{(i)}, b^{(i)}]\}$$

with $d_i = (\prod_{j=1}^{d} (b_j^{(i)} - a_j^{(i)}))^{-1} n_i / N$, i.e., the empirical density.

In the above theorem, $\alpha^{(i)}$ controls the relative uniformity of the points and is adaptive to $X^{(i)}$. It imposes more restrictive constraints on regions containing a larger proportion of the sample ($n_i/N$). Although our density estimate is not the only estimator which satisfies (3) (for example, both the empirical distribution in the asymptotic limit and a kernel density estimator with sufficiently small bandwidth meet the criterion), one advantage of our density estimator is that it provides a very concise summary of the data while at the same time capturing the landscape of the underlying distribution. In addition, the piecewise constant function does not suffer from having too many "local bumps", which is a common problem for kernel density estimators. Moreover, under certain regularity conditions (e.g. bounded second moments), the convergence rate of Monte Carlo methods for $\frac{1}{N} \sum_{i=1}^{N} f(x_i)$ to $\int_{[0,1]^d} f(x) p(x) dx$ is of order $O(N^{-1/2})$. Our density estimate is optimal in the sense that it achieves the same rate of convergence. Given Theorem 4, we have the following convergence result:

Corollary 5. Let $\hat{p}(x)$ be the estimated density function as in Theorem 4. For any hyper-rectangle $A = [a, b] \subset [0, 1]^d$, let $\hat{P}(A) = \int_A \hat{p}(x) dx$ and $P(A) = \int_A p(x) dx$. Then

$$\sup_{A \subset [0,1]^d} |\hat{P}(A) - P(A)| \to 0$$

at the order $O(n^{-1/2})$.

Remark 4.1. It is worth pointing out that the total variation distance between two probability measures $\hat{P}$ and $P$ is defined as $\delta(\hat{P}, P) = \sup_{A \in \mathcal{B}} |\hat{P}(A) - P(A)|$, where $\mathcal{B}$ is the Borel σ-algebra of $[0, 1]^d$. In contrast, Corollary 5 restricts A to be hyper-rectangles.

5 Experimental results

5.1 Implementation details

In some applications, we find it helpful to first estimate the marginal densities of the component variables $x_{\cdot j}$ (j = 1, ..., d), then make a copula transformation $z_{\cdot j} = \hat{F}_j(x_{\cdot j})$, where $\hat{F}_j$ is the estimated cdf of $x_{\cdot j}$. After such a transformation, we can take the domain to be $[0, 1]^d$. We also find this can reduce the number of partitions needed by DSP. 
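A minimal sketch of such a copula transform, using the empirical cdf as the marginal estimate (the function name is ours; the text leaves the choice of marginal estimator open):

```python
from bisect import bisect_right

def copula_transform(points):
    """Map each coordinate through an estimated marginal cdf F_j,
    so the transformed sample lives in [0, 1]^d.

    Here F_j is the empirical cdf  F_j(t) = #{i : x_ij <= t} / n,
    one simple choice of marginal estimate.
    """
    n, d = len(points), len(points[0])
    sorted_cols = [sorted(x[j] for x in points) for j in range(d)]
    return [
        [bisect_right(sorted_cols[j], x[j]) / n for j in range(d)]
        for x in points
    ]

z = copula_transform([[3.0, 10.0], [1.0, 30.0], [2.0, 20.0]])
```

After the transform each marginal is close to uniform, so DSP spends its splits on the dependence structure rather than on the marginal shapes, which is why fewer partitions are needed.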
Unless otherwise stated, we use the copula transform in our experiments whenever the dimension exceeds 3.

We make the following observations to improve the efficiency of DSP. 1) First, observe that $\max_{j=1,...,d} D^*(\{x_{ij}\}_{i=1}^n) \le D^*(\{x_i\}_{i=1}^n)$. Let $x_{(i)j}$ be the ith smallest element in $\{x_{ij}\}_{i=1}^n$; then $D^*(\{x_{ij}\}_{i=1}^n) = \frac{1}{2n} + \max_i |x_{(i)j} - \frac{2i-1}{2n}|$ [9], which has complexity $O(n \log n)$. Hence $\max_{j=1,...,d} D^*(\{x_{ij}\}_{i=1}^n)$ can be compared against $\theta\sqrt{N}/n$ first, before calculating $D^*(\{x_i\}_{i=1}^n)$. 2) $\theta\sqrt{N}/n$ is large when n is small, while $D^*(\{x_i\}_{i=1}^n)$ is bounded above by 1, so when $\theta\sqrt{N}/n \ge 1$ the splitting condition can never hold. 3) $\theta\sqrt{N}/n$ is tiny when n is large, while $D^*(\{x_i\}_{i=1}^n)$ is bounded below by $c_d (\log n)^{(d-1)/2} n^{-1}$ for some constant $c_d$ depending on d [10]; thus we can keep splitting without checking (2) when $\theta\sqrt{N}/n \le \epsilon$, where ε is a small positive constant (say 0.001) specified by the user. This strategy has proved to be effective in decreasing the runtime significantly, at the cost of introducing a few more sub-rectangles.

Another approximation that works well in practice replaces the star discrepancy with the computationally more attractive $L^2$ star discrepancy, i.e., $D^{(2)}(X_n) = (\int_{[0,1]^d} |\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \in [0, a)\} - \prod_{j=1}^d a_j|^2 \, da)^{1/2}$; in fact, several statistics to test the uniformity hypothesis based on $D^{(2)}$ are proposed in [11]; however, the theoretical guarantee in Theorem 4 then no longer holds. By Warnock's formula [9],

$$[D^{(2)}(X_n)]^2 = \frac{1}{3^d} - \frac{2^{1-d}}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} (1 - x_{ij}^2) + \frac{1}{n^2} \sum_{i,l=1}^{n} \prod_{j=1}^{d} \min\{1 - x_{ij}, 1 - x_{lj}\}$$

$D^{(2)}$ can be computed in $O(n \log^{d-1} n)$ time by K. Frank and S. Heinrich's algorithm [9]. 
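Warnock's formula can also be evaluated directly in $O(n^2 d)$ time; a sketch (the function name is ours; the $O(n \log^{d-1} n)$ algorithm cited in the text is the faster but far more involved alternative):

```python
import math

def l2_star_discrepancy(points):
    """D^{(2)}(X_n) via Warnock's formula; O(n^2 d) time."""
    n, d = len(points), len(points[0])
    s1 = sum(math.prod(1.0 - xj * xj for xj in x) for x in points)
    s2 = sum(
        math.prod(min(1.0 - x[j], 1.0 - y[j]) for j in range(d))
        for x in points
        for y in points
    )
    sq = 3.0 ** (-d) - (2.0 ** (1 - d) / n) * s1 + s2 / (n * n)
    return math.sqrt(max(sq, 0.0))

# Single point at 0.5 in d = 1: [D2]^2 = 1/3 - (1 - 0.25) + 0.5 = 1/12.
val = l2_star_discrepancy([[0.5]])
```

The `max(sq, 0.0)` guard only protects against tiny negative values from floating point cancellation; the formula itself is a non-negative squared norm.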
At each scan of R in Algorithm 1, the total complexity is at most $\sum_{i=1}^{L} O(n_i \log^{d-1} n_i) \le \sum_{i=1}^{L} O(n_i \log^{d-1} N) \le O(N \log^{d-1} N)$.

There are no closed form formulas for calculating $D^*(X_n)$ and $D^*_n$ except in low dimensions. If we substitute the definition of $\alpha^{(i)}$ into (2) and apply Lemma 3, what we are actually trying to do is to control $D^*(\tilde{X}^{(i)})$ by $\theta\sqrt{N}/n_i$. There is much existing work on ways to approximate $D^*(X_n)$. In particular, a new randomized algorithm based on threshold accepting was developed in [12]. Comprehensive numerical tests indicate that it improves upon other algorithms, especially when $20 \le d \le 50$. We used this algorithm in our experiments. Interested readers are referred to the original paper for more details.

5.2 DSP as a density estimate

1) To demonstrate the method and visualize the results, we apply it to several 2-dimensional data sets simulated from 3 distributions with different geometry:

1. Gaussian: $x \sim N(\mu, \Sigma) \mathbf{1}\{x \in [0, 1]^2\}$, with $\mu = (.5, .5)^T$, $\Sigma = [0.08, 0.02; 0.02, 0.02]$;
2. Mixture of Gaussians: $x \sim \frac{1}{2} \sum_{i=1}^{2} N(\mu_i, \Sigma_i) \mathbf{1}\{x \in [0, 1]^2\}$, with $\mu_1 = (.50, .25)^T$, $\mu_2 = (.50, .75)^T$, $\Sigma_1 = \Sigma_2 = [0.04, 0.01; 0.01, 0.01]$;
3. Mixture of Betas: $x \sim \frac{1}{3} (beta(2, 5)beta(5, 2) + beta(4, 2)beta(2, 4) + beta(1, 3)beta(3, 1))$;

where $N(\mu, \Sigma)$ denotes the multivariate Gaussian distribution and $beta(\alpha, \beta)$ denotes the beta distribution. We simulated $10^5$ points for each distribution. See the first row of Figure 2 for visualizations of the estimated densities. The figure shows DSP accurately estimates the true density landscape in these three toy examples.

Figure 2: First row: estimated densities for 3 simulated 2D datasets. The modes are marked with stars. The corresponding contours of the true densities are embedded for comparison. 
Second row: simulation of the 2, 5 and 10 dimensional cases (from left to right) with reference functions $f_1$, $f_2$, $f_3$. x-axis: sample size n. y-axis: error between the true integral and the estimated integral. The vertical bars are standard error bars obtained from 10 replications. See section 5.2 2) for more details.

2) To evaluate the theoretical bound (3), we choose three reference functions $f_1$, $f_2$ and $f_3$ with dimension d = 2, 5 and 10 respectively. We generate $n \in \{10^2, 10^3, 10^4, 10^5, 10^6\}$ samples from $p(x) = \frac{1}{2} (\prod_{j=1}^{d} beta(x_j, 15, 5) + \prod_{j=1}^{d} beta(x_j, 5, 15))$, where $beta(\cdot, \alpha, \beta)$ is the density function of the beta distribution. The error $|\int_{[0,1]^d} f_k(x)p(x)dx - \int_{[0,1]^d} f_k(x)\hat{p}(x)dx|$ is bounded by $|\int_{[0,1]^d} f_k(x)p(x)dx - \frac{1}{n}\sum_{j=1}^{n} f_k(x_j)| + |\frac{1}{n}\sum_{j=1}^{n} f_k(x_j) - \int_{[0,1]^d} f_k(x)\hat{p}(x)dx|$, where $\hat{p}(x)$ is the estimated density. For almost all Monte Carlo methods, the first term is of order $O(n^{-1/2})$; the second term is controlled by (3). Thus in total the error is of order $O(n^{-1/2})$. We plot the error against the sample size on a log-log scale for each dimension in the second row of Figure 2. The linear trends in the plots corroborate the bound in (3).

3) To show the efficiency and scalability of DSP, we compare it with KDE, OPT and BSP in terms of estimation error and running time. 
We simulate samples from $x \sim (\sum_{i=1}^{4} \pi_i N(\mu_i, \Sigma_i)) \mathbf{1}\{x \in [0, 1]^d\}$ with $d = \{2, 3, \cdots, 6\}$ and $N = \{10^3, 10^4, 10^5\}$ respectively. The estimation error measured in terms of Hellinger distance is summarized in Table 1. We set m = 10, θ = 0.01 in our experiments. We found the resulting Hellinger distance to be quite robust as m ranges from 3 to 20 (equally spaced). The supplementary material includes the exact details about the parameters of the simulating distributions, the estimation of Hellinger distance, and other implementation details for the algorithms. The table shows DSP achieves comparable accuracy with the best of the other three methods. As mentioned at the beginning of this paper, one major advantage of DSP is its speed. Table 2 shows our method achieves a significant speed up over the other three algorithms.

Table 1: Error in Hellinger distance between the true density and the KDE, OPT, BSP and DSP estimates for each (d, n) pair. The numbers in parentheses are standard errors from 20 replicas. 
The best of the four methods is highlighted in bold. Note that the simulations, being based on mixtures of Gaussians, are unfavorable for methods based on domain partitions.

Hellinger distance (n = 10^3):
d   KDE              OPT              BSP              DSP
2   0.2331 (0.0421)  0.2147 (0.0172)  0.2533 (0.0163)  0.2634 (0.0207)
3   0.2893 (0.0227)  0.3279 (0.0128)  0.2983 (0.0133)  0.3072 (0.0265)
4   0.3913 (0.0325)  0.3839 (0.0136)  0.3872 (0.0117)  0.3895 (0.0191)
5   0.4522 (0.0317)  0.4748 (0.009)   0.4435 (0.0167)  0.4307 (0.0302)
6   0.5511 (0.0318)  0.5508 (0.0307)  0.5515 (0.0354)  0.5527 (0.0381)

Hellinger distance (n = 10^4):
d   KDE              OPT              BSP              DSP
2   0.1104 (0.0102)  0.0957 (0.0036)  0.1222 (0.0043)  0.0803 (0.0013)
3   0.2003 (0.0199)  0.1722 (0.0028)  0.1717 (0.0083)  0.1721 (0.0073)
4   0.2466 (0.0113)  0.2726 (0.0031)  0.2882 (0.0047)  0.2955 (0.0065)
5   0.3599 (0.0199)  0.3562 (0.0025)  0.3987 (0.0022)  0.3563 (0.0031)
6   0.4833 (0.0255)  0.4015 (0.0023)  0.4093 (0.0046)  0.3911 (0.0037)

Hellinger distance (n = 10^5):
d   KDE              OPT              BSP              DSP
2   0.0305 (0.0021)  0.0376 (0.0021)  0.0345 (0.0025)  0.0312 (0.0027)
3   0.1466 (0.0047)  0.1117 (0.0008)  0.1323 (0.0009)  0.1020 (0.004)
4   0.1900 (0.0057)  0.1880 (0.0006)  0.2100 (0.0006)  0.1827 (0.0059)
5   0.2817 (0.0088)  0.2822 (0.0005)  0.2916 (0.0003)  0.2910 (0.0002)
6   0.3697 (0.0122)  0.3409 (0.0005)  0.3693 (0.0004)  0.3701 (0.0002)

Table 2: Average CPU time in seconds of KDE, OPT, BSP and our method for each (d, n) pair. The numbers in parentheses are standard errors from 20 replicas. The speed-up is the fold speed-up computed as the ratio between the minimum run time of the other three methods and the run time of DSP. All methods are implemented in C++. 
See the supplementary material for more details.

Running time (n = 10^3):
d   KDE             OPT             BSP            DSP            speed-up
2   2.445 (0.191)   9.484 (0.029)   0.833 (0.006)  0.020 (0.002)  41
3   2.655 (0.085)   25.073 (0.056)  1.054 (0.010)  0.019 (0.002)  55
4   3.540 (0.116)   32.112 (0.072)  1.314 (0.014)  0.019 (0.002)  69
5   4.107 (0.110)   37.599 (0.088)  1.713 (0.019)  0.020 (0.002)  85
6   4.986 (0.214)   41.565 (0.147)  2.749 (0.024)  0.020 (0.001)  137

Running time (n = 10^4):
d   KDE             OPT             BSP             DSP            speed-up
2   21.903 (1.905)  31.561 (0.079)  1.445 (0.014)   0.033 (0.002)  43
3   26.964 (1.089)  36.683 (0.076)  2.819 (0.036)   0.044 (0.001)  64
4   37.141 (2.244)  39.219 (0.221)  5.861 (0.076)   0.049 (0.002)  119
5   45.580 (2.124)  44.520 (0.587)  12.220 (0.154)  0.078 (0.002)  157
6   53.291 (2.767)  43.032 (0.413)  21.696 (0.213)  0.127 (0.004)  170

Running time (n = 10^5):
d   KDE                OPT             BSP              DSP            speed-up
2   230.179 (130.572)  44.561 (0.639)  7.750 (0.178)    0.242 (0.015)  33
3   278.075 (10.576)   56.329 (0.911)  21.104 (0.576)   0.378 (0.011)  55
4   347.501 (14.676)   67.366 (3.018)  53.620 (2.917)   0.485 (0.018)  108
5   412.828 (16.252)   77.776 (2.215)  115.869 (6.872)  0.706 (0.051)  110
6   519.298 (29.276)   81.023 (3.703)  218.999 (6.046)  0.896 (0.071)  90

5.3 DSP-kmeans

In addition to being a competitive density estimator, we demonstrate in this section how DSP can be used to get good initializations for k-means. The resulting algorithm is referred to as DSP-kmeans. Recall that given a fixed number of clusters K, the goal of k-means is to minimize the following objective function:

$$J_K \triangleq \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - m_k\|_2^2 \quad (4)$$

where $C_k$ denotes the set of points in cluster k and $\{m_k\}_{k=1}^{K}$ denote the cluster means. 
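Objective (4) transcribes directly into code; a small sketch (names ours):

```python
def kmeans_objective(points, assign, means):
    """J_K = sum_k sum_{i in C_k} ||x_i - m_k||_2^2, as in (4)."""
    return sum(
        sum((xj - mj) ** 2 for xj, mj in zip(x, means[k]))
        for x, k in zip(points, assign)
    )

# Two 1-D clusters with means 0.5 and 10.0: J_K = 4 * 0.5^2 = 1.0.
pts = [[0.0], [1.0], [9.5], [10.5]]
J = kmeans_objective(pts, assign=[0, 0, 1, 1], means=[[0.5], [10.0]])
```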
The original k-means algorithm proceeds by alternating between assigning points to centers and recomputing the means. As a result, the final clustering is usually only a local optimum and can be sensitive to the initialization. Finding a good initialization has attracted a lot of attention over the past decade, and there is now a decent number of existing methods, each with its own perspective. Below we review a few representative types.

One type of method looks for good initial centers sequentially. The idea is that once the first center is picked, the second should be far away from the one already chosen; a similar argument applies to the rest of the centers. [13] [14] fall under this category. Several studies [15] [16] borrow ideas from hierarchical agglomerative clustering (HAC) to look for good initializations. In our experiments we used the algorithm described in [15]. One essential ingredient of this type of algorithm is the inter-cluster distance, which can be problem dependent. Last but not least, there is a class of methods that attempt to utilize the relationship between PCA and k-means. [17] proposes a PCA-guided search for initial centers. [18] exploits the relationship between PCA and k-means to look for good initializations; the general idea is to recursively split a cluster along its first principal component. We refer to this algorithm as PCA-REC.

DSP-kmeans differs from previous methods in that it tackles the initialization problem from a density estimation point of view. The idea behind DSP-kmeans is that cluster centers should be close to the modes of the underlying probability density function. If a density estimator can accurately locate the modes of the underlying true density function, it should also be able to find good cluster centers. 
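This intuition can be turned into an initializer: average the points inside each sub-rectangle of a K-region partition and use those averages as the starting centers. A minimal sketch, assuming the partition is already given as a point-to-region assignment (function and variable names are ours):

```python
def init_centers_from_partition(points, region_of):
    """Average the points of each sub-rectangle: I_j = (1/|S_j|) * sum_{i in S_j} Y_i.

    `region_of[i]` is the index (0..K-1) of the sub-rectangle containing
    point i, e.g. produced by a DSP partition with K non-empty regions.
    """
    d = len(points[0])
    K = max(region_of) + 1
    sums = [[0.0] * d for _ in range(K)]
    counts = [0] * K
    for x, j in zip(points, region_of):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += x[t]
    return [[s / counts[j] for s in sums[j]] for j in range(K)]

# Two regions: the averages land near the two density modes.
pts = [[0.1, 0.1], [0.3, 0.1], [0.8, 0.9], [0.6, 0.9]]
centers = init_centers_from_partition(pts, region_of=[0, 0, 1, 1])
```

The resulting centers are then handed to any standard k-means implementation in place of random starting points.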
Due to its concise representation, DSP can be used to find initializations for k-means in the following way. Suppose we are trying to cluster a dataset Y into K clusters. We first apply DSP on Y to find a partition with K non-empty sub-rectangles, i.e., sub-rectangles containing at least one point from Y. The output of DSP is then K sub-rectangles. Denote the set of indices of the points in sub-rectangle j by S_j, j = 1, ..., K, and let $I_j = \frac{1}{|S_j|} \sum_{i \in S_j} Y_i$, i.e., I_j is the sample average of the points that fall into sub-rectangle j. We then use {I_1, ..., I_K} to initialize k-means. We also explored the following two-phase procedure: first over-partition the space to build a more accurate density estimate, treating points in different sub-rectangles as belonging to different clusters; then merge the sub-rectangles hierarchically based on some measure of between-cluster distance. We have found this to be helpful when the number of clusters K is relatively small. For completeness, we include the details of this two-phase DSP-kmeans in the supplementary material.

We test DSP-kmeans on 4 real-world datasets of various sizes and dimensions. Two of them are taken from the UCI machine learning repository [19]; the stem cell dataset is taken from the FlowCAP challenges [20]; the mouse bone marrow dataset is a recently published single-cell dataset measured using mass cytometry [21]. We use random initialization as the base case and compare it with DSP-kmeans, k-means++, PCA-REC and HAC. The numbers in Table 3 are the improvements in the k-means objective function of a method over random initialization. The results show that when the number of clusters is relatively large, DSP-kmeans achieves the lowest objective value on these four datasets. Although in theory almost any density estimator could be used to find good

Table 3: Comparison of different initialization methods.
The number for method j is the improvement relative to random initialization, $(J_{K,0} - J_{K,j})/J_{K,0}$, where J_{K,j} is the k-means objective value of method j at convergence and index 0 denotes random initialization. A negative number means the method performs worse than random initialization.

Road network (n = 4.3e+04, d = 3), improvement over random initialization:
k    k-means++   PCA-REC   HAC     DSP-kmeans
4    0.0         -0.02     0.01    0.0
10   0.0         -0.12     0.25    0.08
20   0.43        -0.46     1.68    2.04
40   11.7        -2.52     2.27    13.62
60   19.78       -3.45     18.69   20.91

Stem cell (n = 9.9e+03, d = 6), improvement over random initialization:
k    k-means++   PCA-REC   HAC     DSP-kmeans
4    3.45        -2.1      3.67    3.96
10   3.82        -4.2      3.79    3.6
20   9.96        -3.59     9.91    9.39
40   9.95        -6.39     10.11   12.49
60   6.12        -7.29     8.19    13.7

Mouse bone marrow (n = 8.7e+04, d = 39), improvement over random initialization:
k    k-means++   PCA-REC   HAC     DSP-kmeans
4    1.51        0.03      1.25    0.4
10   0.45        0.24      0.77    0.83
20   0.63        -1.2      0.68    0.79
40   1.99        -3.56     2.06    2.55
60   2.48        -5.25     2.57    2.65

US census (n = 2.4e+06, d = 68), improvement over random initialization:
k    k-means++   PCA-REC   HAC     DSP-kmeans
4    47.44       -2.33     46.72   40.44
10   40.52       -1.9      41.48   39.52
20   32.63       -1.97     29.49   32.55
40   32.66       -5.15     33.41   34.61
60   21.7        -1.19     16.28   21.68

initializations. Based on the comparison of Hellinger distances in Table 1, we would expect them to have similar performance. However, for OPT and BSP, the runtime would be a major bottleneck for their applicability. The situation for KDE is slightly more complicated: not only is it computationally quite intensive, but its output cannot be represented as concisely as that of partition-based methods. Here we see that the efficiency of DSP makes it possible to utilize it for other machine learning tasks.

6 Conclusion

In this paper we propose a novel density estimation method based on ideas from Quasi-Monte Carlo analysis. We prove it achieves an $O(n^{-1/2})$ error rate.
By comparing it with other density estimation methods, we show that DSP has comparable performance in terms of Hellinger distance while achieving a significant speed-up. We also show how DSP can be used to find good initializations for k-means. Due to space limitations, we were unable to include other interesting applications, including mode seeking, data visualization via level set trees, and data compression [22].

Acknowledgements. This work was supported by NIH-R01GM109836, NSF-DMS1330132 and NSF-DMS1407557. The second author's work was done when the author was a graduate student at Stanford University.

References

[1] Luo Lu, Hui Jiang, and Wing H Wong. Multivariate density estimation by Bayesian sequential partitioning. Journal of the American Statistical Association, 108(504):1402-1410, 2013.

[2] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.

[3] Wing H Wong and Li Ma. Optional Pólya tree and Bayesian inference. The Annals of Statistics, 38(3):1433-1459, 2010.

[4] Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, and Larry Wasserman. Forest density estimation. The Journal of Machine Learning Research, 12:907-951, 2011.

[5] Parikshit Ram and Alexander G Gray. Density estimation trees. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 627-635. ACM, 2011.

[6] Lauwerens Kuipers and Harald Niederreiter. Uniform Distribution of Sequences. Courier Dover Publications, 2012.

[7] Art B Owen. Multidimensional variation for quasi-Monte Carlo. In International Conference on Statistics in Honour of Professor Kai-Tai Fang's 65th Birthday, pages 49-74, 2005.

[8] Stefan Heinrich, Erich Novak, Grzegorz W Wasilkowski, and Henryk Wozniakowski. The inverse of the star-discrepancy depends linearly on the dimension.
Acta Arithmetica, 96(3):279-302, 2000.

[9] Carola Doerr, Michael Gnewuch, and Magnus Wahlström. Calculation of discrepancy measures and applications. Preprint, 2013.

[10] Michael Gnewuch. Entropy, randomization, derandomization, and discrepancy. In Monte Carlo and Quasi-Monte Carlo Methods 2010, pages 43-78. Springer, 2012.

[11] Jia-Juan Liang, Kai-Tai Fang, Fred Hickernell, and Runze Li. Testing multivariate uniformity and its applications. Mathematics of Computation, 70(233):337-355, 2001.

[12] Michael Gnewuch, Magnus Wahlström, and Carola Winzen. A new randomized algorithm to approximate the star discrepancy based on threshold accepting. SIAM Journal on Numerical Analysis, 50(2):781-807, 2012.

[13] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.

[14] Ioannis Katsavounidis, C-C Jay Kuo, and Zhen Zhang. A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10):144-146, 1994.

[15] Chris Fraley. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20(1):270-281, 1998.

[16] Stephen J Redmond and Conor Heneghan. A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognition Letters, 28(8):965-973, 2007.

[17] Qin Xu, Chris Ding, Jinpei Liu, and Bin Luo. PCA-guided search for k-means. Pattern Recognition Letters, 54:50-55, 2015.

[18] Ting Su and Jennifer G Dy. In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intelligent Data Analysis, 11(4):319-338, 2007.

[19] Manohar Kaul, Bin Yang, and Christian S Jensen.
Building accurate 3D spatial networks to enable next generation intelligent transportation systems. In Mobile Data Management (MDM), 2013 IEEE 14th International Conference on, volume 1, pages 137-146. IEEE, 2013.

[20] Nima Aghaeepour, Greg Finak, Holger Hoos, Tim R Mosmann, Ryan Brinkman, Raphael Gottardo, Richard H Scheuermann, FlowCAP Consortium, DREAM Consortium, et al. Critical assessment of automated flow cytometry data analysis techniques. Nature Methods, 10(3):228-238, 2013.

[21] Matthew H Spitzer, Pier Federico Gherardini, Gabriela K Fragiadakis, Nupur Bhattacharya, Robert T Yuan, Andrew N Hotson, Rachel Finck, Yaron Carmi, Eli R Zunder, Wendy J Fantl, et al. An interactive reference framework for modeling a dynamic immune system. Science, 349(6244):1259425, 2015.

[22] Robert M Gray and Richard A Olshen. Vector quantization and density estimation. In Compression and Complexity of Sequences 1997. Proceedings, pages 172-193. IEEE, 1997.