{"title": "Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method", "book": "Advances in Neural Information Processing Systems", "page_first": 929, "page_last": 936, "abstract": "We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions.", "full_text": "Fast High-dimensional Kernel Summations Using the\n\nMonte Carlo Multipole Method\n\nDongryeol Lee\n\nComputational Science and Engineering\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\ndongryel@cc.gatech.edu\n\nAlexander Gray\n\nComputational Science and Engineering\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nagray@cc.gatech.edu\n\nAbstract\n\nWe propose a new fast Gaussian summation algorithm for high-dimensional\ndatasets with high accuracy. First, we extend the original fast multipole-type meth-\nods to use approximation schemes with both hard and probabilistic error. Second,\nwe utilize a new data structure called subspace tree which maps each data point in\nthe node to its lower dimensional mapping as determined by any linear dimension\nreduction method such as PCA. 
This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions.\n\n1 Fast Gaussian Kernel Summation\n\nIn this paper, we propose new computational techniques for efficiently approximating the following sum for each query point $q_i \\in \\mathcal{Q}$:\n\n$$\\Phi(q_i, \\mathcal{R}) = \\sum_{r_j \\in \\mathcal{R}} e^{-\\|q_i - r_j\\|^2 / (2h^2)} \\qquad (1)$$\n\nwhere $\\mathcal{R}$ is the reference set; each reference point is associated with a Gaussian function with a smoothing parameter $h$ (the \u2018bandwidth\u2019). This form of summation is ubiquitous in many statistical learning methods, including kernel density estimation, kernel regression, Gaussian process regression, radial basis function networks, spectral clustering, support vector machines, and kernel PCA [1, 4]. Cross-validation in all of these methods requires evaluating Equation 1 for multiple values of $h$. Kernel density estimation, for example, requires $|\\mathcal{R}|$ density estimates, each based on $|\\mathcal{R}| - 1$ points, yielding a brute-force computational cost that scales quadratically (that is, $O(|\\mathcal{R}|^2)$).\n\nError bounds. Because exact evaluation is computationally expensive, many algorithms approximate the Gaussian kernel sums at the expense of reduced precision. Therefore, it is natural to discuss error bound criteria which measure the quality of the approximations with respect to their corresponding true values. The following error bound criteria are common in the literature:\n\nDefinition 1.1. An algorithm guarantees $\\epsilon$ absolute error bound, if for each exact value $\\Phi(q_i, \\mathcal{R})$ for $q_i \\in \\mathcal{Q}$, it computes $\\widetilde{\\Phi}(q_i, \\mathcal{R})$ such that $|\\widetilde{\\Phi}(q_i, \\mathcal{R}) - \\Phi(q_i, \\mathcal{R})| \\le \\epsilon$.\n\nDefinition 1.2. An algorithm guarantees $\\epsilon$ relative error bound, if for each exact value $\\Phi(q_i, \\mathcal{R})$ for $q_i \\in \\mathcal{Q}$, it computes $\\widetilde{\\Phi}(q_i, \\mathcal{R}) \\in \\mathbb{R}$ such that $|\\widetilde{\\Phi}(q_i, \\mathcal{R}) - \\Phi(q_i, \\mathcal{R})| \\le \\epsilon |\\Phi(q_i, \\mathcal{R})|$.\n\nBounding the relative error (e.g., the percentage deviation) is much harder because the error bound criterion is in terms of the initially unknown exact quantity. As a result, many previous methods [7] have focused on bounding the absolute error. The relative error bound criterion is preferred to the absolute error bound criterion in statistical applications in which high accuracy is desired. Our new algorithm will enforce the following \u201crelaxed\u201d form of the relative error bound criterion, whose motivation will be discussed shortly.\n\nDefinition 1.3. An algorithm guarantees $(1 - \\alpha)$ probabilistic $\\epsilon$ relative error bound, if for each exact value $\\Phi(q_i, \\mathcal{R})$ for $q_i \\in \\mathcal{Q}$, it computes $\\widetilde{\\Phi}(q_i, \\mathcal{R}) \\in \\mathbb{R}$ such that, with probability at least $1 - \\alpha$ where $0 < 1 - \\alpha < 1$, $|\\widetilde{\\Phi}(q_i, \\mathcal{R}) - \\Phi(q_i, \\mathcal{R})| \\le \\epsilon |\\Phi(q_i, \\mathcal{R})|$.\n\nPrevious work. The most successful class of acceleration methods employs \u201chigher-order divide and conquer\u201d or generalized N-body algorithms (GNA) [4]. This approach can use any spatial partitioning tree such as kd-trees or ball-trees for both the query set $\\mathcal{Q}$ and reference data $\\mathcal{R}$ and performs a simultaneous recursive descent on both trees.\n\nGNA with relative error bounds (Definition 1.2) [5, 6, 11, 10] utilized bounding boxes and additional cached sufficient statistics such as higher-order moments needed for series expansion. 
[5, 6] utilized bounding-box based error bounds which tend to be very loose, resulting in slow empirical performance around suboptimally small and large bandwidths. [11, 10] extended GNA-based Gaussian summations with series expansions which provided tighter bounds; this showed enormous performance improvements, but only in low-dimensional settings (up to $D = 5$), since the number of required terms in the series expansion increases exponentially with respect to $D$.\n\n[9] introduces an iterative sampling-based GNA for accelerating the computation of nested sums (a related, easier problem). Its speedup is achieved by replacing the pessimistic error bounds provided by bounding boxes with normal-based confidence intervals from Monte Carlo sampling. [9] demonstrates speedups of many orders of magnitude over the previous state of the art in the context of computing aggregates over the queries (such as the LSCV score for selecting the optimal bandwidth). However, the authors did not discuss the sampling-based approach for computations that require per-query estimates, such as those required for kernel density estimation.\n\nNone of the previous approaches for kernel summations addresses the issue of reducing the computational cost of each distance computation, which incurs $O(D)$ cost. However, the intrinsic dimensionality $d$ of most high-dimensional datasets is much smaller than the explicit dimension $D$ (that is, $d \\ll D$). [12] proposed tree structures using a global dimension reduction method, such as random projection, as a preprocessing step for efficient $(1 + \\epsilon)$ approximate nearest neighbor search. Similarly, we develop a new data structure for kernel summations; our new data structure is constructed in a top-down fashion to perform the initial spatial partitioning in the original input space $\\mathbb{R}^D$, and then performs a local dimension reduction on a localized subset of the data in a bottom-up fashion.\n\nThis paper. 
We propose a new fast Gaussian summation algorithm that enables speedup in higher dimensions. Our approach utilizes: 1) probabilistic relative error bounds (Definition 1.3) on kernel sums provided by Monte Carlo estimates; 2) a new tree structure called subspace tree for reducing the computational cost of each distance computation. The former can be seen as relaxing the strict requirement of guaranteeing hard relative bound on very small quantities, as done in [5, 6, 11, 10]. The latter was mentioned as a possible way of ameliorating the effects of the curse of dimensionality in [14], a pioneering paper in this area.\n\nNotations. Each query point and reference point (a $D$-dimensional vector) is indexed by natural numbers $i, j \\in \\mathbb{N}$, and denoted $q_i$ and $r_j$ respectively. For any set $S$, $|S|$ denotes the number of elements in $S$. The entities related to the left and the right child are denoted with superscripts $L$ and $R$; an internal node $N$ has the child nodes $N^L$ and $N^R$.\n\n2 Gaussian Summation by Monte Carlo Sampling\n\nHere we describe the extension needed for probabilistic computation of kernel summation satisfying Definition 1.3. 
The main routine for the probabilistic kernel summation is shown in Algorithm 1. The function MCMM takes the query node $Q$ and the reference node $R$ (initially the roots of the query tree and the reference tree, $Q^{root}$ and $R^{root}$) and $\\beta$ (initially set to $\\alpha$, which controls the probability guarantee that each kernel sum is within $\\epsilon$ relative error).\n\nAlgorithm 1 The core dual-tree routine for probabilistic Gaussian kernel summation.\n\n 1: MCMM($Q, R, \\beta$)\n 2:   if CANSUMMARIZEEXACT($Q, R, \\epsilon$) then\n 3:     SUMMARIZEEXACT($Q, R$)\n 4:   else if CANSUMMARIZEMC($Q, R, \\epsilon, \\beta$) then\n 5:     SUMMARIZEMC($Q, R, \\epsilon, \\beta$)\n 6:   else\n 7:     if $Q$ is a leaf node then\n 8:       if $R$ is a leaf node then\n 9:         MCMMBASE($Q, R$)\n10:       else\n11:         MCMM($Q, R^L, \\beta/2$), MCMM($Q, R^R, \\beta/2$)\n12:     else\n13:       if $R$ is a leaf node then\n14:         MCMM($Q^L, R, \\beta$), MCMM($Q^R, R, \\beta$)\n15:       else\n16:         MCMM($Q^L, R^L, \\beta/2$), MCMM($Q^L, R^R, \\beta/2$)\n17:         MCMM($Q^R, R^L, \\beta/2$), MCMM($Q^R, R^R, \\beta/2$)\n\nThe idea of Monte Carlo sampling used in the new algorithm is similar to the one in [9], except that the sampling is done per query and we also use approximations that provide hard error bounds (i.e. finite difference and the exhaustive base case MCMMBASE). This means that the approximation has less variance than the pure Monte Carlo approach used in [9]. Algorithm 1 first attempts approximations with hard error bounds, which are computationally cheaper than sampling-based approximations. For example, the finite-difference scheme [5, 6] can be used for the CANSUMMARIZEEXACT and SUMMARIZEEXACT functions in any general dimension.\n\nThe CANSUMMARIZEMC function takes two parameters that specify the accuracy, the relative error and its probability guarantee, and decides whether to use Monte Carlo sampling for the given pair of nodes. 
If the reference node $R$ contains too few points, it may be more efficient to process it using exact methods that use error bounds based on bounding primitives on the node pair or exhaustive pairwise evaluations. This is determined by the condition $\\tau \\cdot m_{initial} \\le |R|$, where $\\tau > 1$ controls the minimum number of reference points needed for Monte Carlo sampling to proceed.\n\nIf the reference node does contain enough points, then for each query point $q \\in Q$, the SAMPLE routine samples $m_{initial}$ terms from the summation $\\Phi(q, R) = \\sum_{r_{j_n} \\in R} K_h(\\|q - r_{j_n}\\|)$, where $\\Phi(q, R)$ denotes the exact contribution of $R$ to $q$\u2019s kernel sum. Basically, we are interested in estimating $\\Phi(q, R)$ by $\\widetilde{\\Phi}(q, R) = |R| \\mu_S$, where $\\mu_S$ is the sample mean of $S$. From the Central Limit Theorem, given a large enough number of samples $m$, $\\mu_S \\sim N(\\mu, \\sigma_S^2 / m)$, where $\\Phi(q, R) = |R| \\mu$ (i.e. $\\mu$ is the average of the kernel value between $q$ and any reference point $r \\in R$); this implies that $|\\mu_S - \\mu| \\le z_{\\beta/2} \\sigma_S / \\sqrt{m}$ with probability $1 - \\beta$. Writing $\\mathcal{R}$ for the entire reference set and $R$ for the current reference node, the pruning rule we have to enforce for each query point for the contribution of $R$ is:\n\n$$z_{\\beta/2} \\frac{\\sigma_S}{\\sqrt{m}} \\le \\frac{\\epsilon \\Phi(q, \\mathcal{R})}{|\\mathcal{R}|}$$\n\nwhere $\\sigma_S$ is the sample standard deviation of $S$. Since $\\Phi(q, \\mathcal{R})$ is one of the unknown quantities we want to compute, we instead enforce the following:\n\n$$z_{\\beta/2} \\frac{\\sigma_S}{\\sqrt{m}} \\le \\frac{\\epsilon \\left( \\Phi^l(q, \\mathcal{R}) + |R| \\left( \\mu_S - z_{\\beta/2} \\frac{\\sigma_S}{\\sqrt{m}} \\right) \\right)}{|\\mathcal{R}|} \\qquad (2)$$\n\nwhere $\\Phi^l(q, \\mathcal{R})$ is the currently running lower bound on the sum computed using exact methods and $|R| \\left( \\mu_S - z_{\\beta/2} \\frac{\\sigma_S}{\\sqrt{m}} \\right)$ is the probabilistic component contributed by $R$. 
Denoting $\\Phi^{l,new}(q, \\mathcal{R}) = \\Phi^l(q, \\mathcal{R}) + |R| \\left( \\mu_S - z_{\\beta/2} \\frac{\\sigma_S}{\\sqrt{|S|}} \\right)$, the minimum number of samples for $q$ needed to achieve the target error on the right side of the inequality in Equation 2 with probability at least $1 - \\beta$ is:\n\n$$m \\ge z_{\\beta/2}^2 \\sigma_S^2 \\frac{(|\\mathcal{R}| + \\epsilon |R|)^2}{\\epsilon^2 (\\Phi^l(q, \\mathcal{R}) + |R| \\mu_S)^2}$$\n\nIf the given query node and reference node pair cannot be pruned using either non-probabilistic or probabilistic approximations, then we recurse on smaller subsets of the two sets. In particular, when dividing over the reference node $R$, we recurse with half of the $\\beta$ value\u00b9. We now state the probabilistic error guarantee of our algorithm as a theorem.\n\nTheorem 2.1. After calling MCMM with $Q = Q^{root}$, $R = R^{root}$, and $\\beta = \\alpha$, Algorithm 1 approximates each $\\Phi(q, \\mathcal{R})$ with $\\widetilde{\\Phi}(q, \\mathcal{R})$ such that Definition 1.3 holds.\n\nProof. For a query/reference $(Q, R)$ pair and $0 < \\beta < 1$, MCMMBASE and SUMMARIZEEXACT compute estimates for $q \\in Q$ such that $|\\widetilde{\\Phi}(q, R) - \\Phi(q, R)| < \\epsilon \\frac{\\Phi(q, \\mathcal{R}) |R|}{|\\mathcal{R}|}$ with probability $1 > 1 - \\beta$. By Equation 2, SUMMARIZEMC computes estimates for $q \\in Q$ such that $|\\widetilde{\\Phi}(q, R) - \\Phi(q, R)| < \\epsilon \\frac{\\Phi(q, \\mathcal{R}) |R|}{|\\mathcal{R}|}$ with probability $1 - \\beta$.\n\nWe now induct on $|Q \\cup R|$. Line 11 of Algorithm 1 divides over the reference, whose subcalls compute estimates that satisfy $|\\widetilde{\\Phi}(q, R^L) - \\Phi(q, R^L)| \\le \\epsilon \\frac{\\Phi(q, \\mathcal{R}) |R^L|}{|\\mathcal{R}|}$ and $|\\widetilde{\\Phi}(q, R^R) - \\Phi(q, R^R)| \\le \\epsilon \\frac{\\Phi(q, \\mathcal{R}) |R^R|}{|\\mathcal{R}|}$, each with probability at least $1 - \\beta/2$ by the induction hypothesis. For $q \\in Q$, $\\widetilde{\\Phi}(q, R) = \\widetilde{\\Phi}(q, R^L) + \\widetilde{\\Phi}(q, R^R)$, which means $|\\widetilde{\\Phi}(q, R) - \\Phi(q, R)| \\le \\epsilon \\frac{\\Phi(q, \\mathcal{R}) |R|}{|\\mathcal{R}|}$ with probability at least $1 - \\beta$.\n\nLine 14 divides over the query, and each subcall computes estimates that hold with probability at least $1 - \\beta$ for $q \\in Q^L$ and $q \\in Q^R$. Lines 16 and 17 divide over both the query and the reference, and the correctness can be proven similarly. Therefore, MCMM($Q^{root}, R^{root}, \\alpha$) computes estimates satisfying Definition 1.3.\n\n\u201cReclaiming\u201d probability. We note that the assigned probability $\\beta$ for the query/reference pair computed with exact bounds (SUMMARIZEEXACT and MCMMBASE) is not used. This portion of the probability can be \u201creclaimed\u201d in a similar fashion as done in [10] and re-used to prune more aggressively in the later stages of the algorithm. All experiments presented in this paper benefited from this simple modification.\n\n3 Subspace Tree\n\nA subspace tree is basically a space-partitioning tree with a set of orthogonal bases associated with each node $N$: $N.\\Omega = (\\mu, U, \\Lambda, d)$, where $\\mu$ is the mean, $U$ is a $D \\times d$ matrix whose columns consist of $d$ eigenvectors, and $\\Lambda$ the corresponding eigenvalues. The orthogonal basis set is constructed using a linear dimension reduction method such as PCA. The tree is constructed in a top-down manner using the PARTITIONSET function, which divides the given set of points into two (for example, along the dimension with the highest variance in the case of a kd-tree), with the subspace in each node formed in a bottom-up manner. Algorithm 3 gives the construction routine for a PCA tree (a subspace tree using PCA as the dimension reduction method); Figure 1 (left) shows a PCA tree for a 3-D dataset. 
The subspace of each leaf node is computed using PCABASE, which can use the exact PCA [3] or a stochastic one [2]. For an internal node, the subspaces of the child nodes, $N^L.\\Omega = (\\mu^L, U^L, \\Lambda^L, d^L)$ and $N^R.\\Omega = (\\mu^R, U^R, \\Lambda^R, d^R)$, are approximately merged using the MERGESUBSPACES function, which involves solving a $(d^L + d^R + 1) \\times (d^L + d^R + 1)$ eigenvalue problem [8]; this runs in $O((d^L + d^R + 1)^3) \\ll O(D^3)$ given that the dataset is sparse. In addition, each data point $x$ in each node $N$ is mapped to its new lower-dimensional coordinate using the orthogonal basis set of $N$: $x_{proj} = U^T (x - \\mu)$. The $L_2$ norm reconstruction error is given by: $\\|x_{recon} - x\\|_2^2 = \\|(U x_{proj} + \\mu) - x\\|_2^2$.\n\n\u00b9 We could also divide $\\beta$ such that the node that may be harder to approximate gets a lower value.\n\nAlgorithm 2 Monte Carlo sampling based approximation routines.\n\nSAMPLE($q, R, \\epsilon, \\alpha, S, m$)\n  for $k = 1$ to $m$ do\n    $r \\leftarrow$ random point in $R$\n    $S \\leftarrow S \\cup \\{K_h(\\|q - r\\|)\\}$\n  $\\mu_S \\leftarrow$ MEAN($S$), $\\sigma_S^2 \\leftarrow$ VARIANCE($S$)\n  $\\Phi^{l,new}(q, \\mathcal{R}) \\leftarrow \\Phi^l(q, \\mathcal{R}) + |R| \\left( \\mu_S - z_{\\alpha/2} \\frac{\\sigma_S}{\\sqrt{|S|}} \\right)$\n  $m_{thresh} \\leftarrow z_{\\alpha/2}^2 \\sigma_S^2 \\frac{(|\\mathcal{R}| + \\epsilon |R|)^2}{\\epsilon^2 (\\Phi^l(q, \\mathcal{R}) + |R| \\mu_S)^2}$\n  $m \\leftarrow m_{thresh} - |S|$\n\nCANSUMMARIZEMC($Q, R, \\epsilon, \\alpha$)\n  return $\\tau \\cdot m_{initial} \\le |R|$\n\nSUMMARIZEMC($Q, R, \\epsilon, \\alpha$)\n  for $q_i \\in Q$ do\n    $S \\leftarrow \\emptyset$, $m \\leftarrow m_{initial}$\n    repeat\n      SAMPLE($q_i, R, \\epsilon, \\alpha, S, m$)\n    until $m \\le 0$\n    $\\Phi(q_i, \\mathcal{R}) \\leftarrow \\Phi(q_i, \\mathcal{R}) + |R| \\cdot$ MEAN($S$)\n\nMonte Carlo sampling using a subspace tree. Consider the CANSUMMARIZEMC function in Algorithm 2. The \u201couter loop\u201d of this algorithm is over the query set $Q$, and it would make sense to project each query point $q \\in Q$ to the subspace owned by the reference node $R$. Let $U$ and $\\mu$ define the orthogonal basis system for $R$, consisting of $d$ basis vectors. For each $q \\in Q$, consider the squared distance $\\|(q - \\mu) - r_{proj}\\|^2$ (where $(q - \\mu)$ is $q$\u2019s coordinates expressed in terms of the coordinate system of $R$) as shown in Figure 1. For the Gaussian kernel, each pairwise kernel value is approximated as:\n\n$$e^{-\\|q - r\\|^2 / (2h^2)} \\approx e^{-\\|q - q_{recon}\\|^2 / (2h^2)} \\, e^{-\\|q_{proj} - r_{proj}\\|^2 / (2h^2)} \\qquad (3)$$\n\nwhere $q_{recon} = U q_{proj} + \\mu$ and $q_{proj} = U^T (q - \\mu)$. For a fixed query point $q$, $e^{-\\|q - q_{recon}\\|^2 / (2h^2)}$ can be precomputed (which takes $d$ dot products between two $D$-dimensional vectors) and re-used for every distance computation between $q$ and any reference point $r \\in R$, whose cost is now $O(d) \\ll O(D)$. Therefore, we can take more samples efficiently. For a sufficiently large total of $m$ samples, the computational cost is $O(d(D + m)) \\ll O(D \\cdot m)$ for each query point.\n\nHowever, the inexact distance computations come at the cost of increased variance. Each distance computation incurs an error of at most the squared $L_2$ norm $\\|r_{recon} - r\\|_2^2$. That is, $\\left| \\|q - r_{recon}\\|_2^2 - \\|q - r\\|_2^2 \\right| \\le \\|r_{recon} - r\\|_2^2$. Nevertheless, the sample variance for each query point plus the inexactness due to dimension reduction, $\\tau_S$, can be shown to be bounded for the Gaussian kernel as follows (where each $s = e^{-\\|q - r_{recon}\\|^2 / (2h^2)}$):\n\n$$\\frac{1}{m - 1} \\left( \\sum_{s \\in S} s^2 - m \\cdot \\mu_S^2 \\right) + \\tau_S \\le \\frac{1}{m - 1} \\left( \\left( \\sum_{s \\in S} s^2 \\right) \\min\\left\\{ 1, \\max_{r \\in R} e^{\\|r_{recon} - r\\|_2^2 / h^2} \\right\\} - m \\left( \\mu_S \\min_{r \\in R} e^{-\\|r_{recon} - r\\|_2^2 / (2h^2)} \\right)^2 \\right)$$\n\nExhaustive computations using a subspace tree. Now suppose we have built subspace trees for the query and the reference sets. We can project either each query point onto the reference subspace, or each reference point onto the query subspace, depending on which subspace has a smaller dimension and on the number of points in each node. The subspaces formed in the leaf nodes are usually highly numerically accurate, since each leaf contains very few points compared to the extrinsic dimensionality $D$.\n\n4 Experimental Results\n\nWe empirically evaluated the runtime performance of our algorithm on seven real-world datasets, scaled to fit in the $[0, 1]^D$ hypercube, for approximating the Gaussian sum at every query point with a range of bandwidths. This experiment is motivated by many kernel methods that require computing the Gaussian sum at different bandwidth values (according to the standard least-squares cross-validation scores [15]). Nevertheless, we emphasize that the acceleration results are applicable to other kernel methods that require efficient Gaussian summation.\n\nIn this paper, the reference set equals the query set. All datasets have 50K points so that the exact exhaustive method can be tractably computed. All times are in seconds and include the time needed to build the trees. 
The code is written in C/C++ and run on a dual Intel Xeon 3GHz with 8 GB of main memory. The measurements in the second to eighth columns are obtained by running the algorithms at the bandwidth $k h^*$, where $10^{-3} \\le k \\le 10^3$ is the constant in the corresponding column header. The last column denotes the total time needed to run on all seven bandwidth values.\n\nEach table has results for five algorithms: the naive algorithm and four tree-based algorithms. The algorithms with $p = 1$ denote the previous state-of-the-art (finite-difference with error redistribution) [10], while those with $p < 1$ denote our probabilistic version. Each entry has the running time and the percentage of the query points that did not satisfy the relative error $\\epsilon$.\n\nAlgorithm 3 PCA tree building routine.\n\nBUILDPCATREE($P$)\n  if CANPARTITION($P$) then\n    $\\{P^L, P^R\\} \\leftarrow$ PARTITIONSET($P$)\n    $N \\leftarrow$ empty node\n    $N^L \\leftarrow$ BUILDPCATREE($P^L$)\n    $N^R \\leftarrow$ BUILDPCATREE($P^R$)\n    $N.S \\leftarrow$ MERGESUBSPACES($N^L.S, N^R.S$)\n  else\n    $N \\leftarrow$ BUILDPCATREEBASE($P$)\n    $N.S \\leftarrow$ PCABASE($P$)\n  $N.P_{proj} \\leftarrow$ PROJECT($P, N.S$)\n  return $N$\n\nAnalysis. Readers should focus on the last column, which contains the total time needed for evaluating the Gaussian sum at all points for seven different bandwidth values. This is indicated by boldfaced numbers for our probabilistic algorithm. As expected, on low-dimensional datasets (below 6 dimensions), the algorithm using series-expansion based bounds gives a two to three times speedup compared to our approach that uses Monte Carlo sampling. Multipole moments are an effective form of compression in low dimensions with analytical error bounds that can be evaluated; our Monte Carlo-based method has an asymptotic error bound which must be \u201clearned\u201d through sampling.\n\nFrom 7 dimensions onward, series expansion cannot be done efficiently because of its slow convergence. 
Our probabilistic algorithm ($p = 0.9$) using Monte Carlo sampling consistently performs better than the algorithm using exact bounds ($p = 1$) by at least a factor of two. Compared to the naive algorithm, it achieves a maximum speedup of about nine times on a 16-dimensional dataset; on an 89-dimensional dataset, it is at least three times as fast as the naive algorithm. Note that all the datasets contain only 50K points, and the speedup will be more dramatic as we increase the number of points.\n\n5 Conclusion\n\nWe presented an extension to fast multipole methods to use approximation methods with both hard and probabilistic bounds. Our experimental results show speedup over the previous state-of-the-art on high-dimensional datasets. Our future work will include possible improvements inspired by recent work in the FMM community using a matrix-factorization formulation [13].\n\nFigure 1: Left: A PCA-tree for a 3-D dataset. Right: The squared Euclidean distance between a given query point and a reference point projected onto a subspace can be decomposed into two components: the orthogonal component and the component in the subspace.\n\nAlgorithm / scale\n\n$\\Sigma$\nmockgalaxy-D-1M-rnd (cosmology: positions), $D = 3$, $N = 50000$, $h^* = 0.000768201$\n\n0.001\n\n0.01\n\n1000\n\n100\n\n0.1\n\n10\n\n1\n\nNaive\nMCMM\n($\\epsilon = 0.1$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.1$, $p = 1$)\nMCMM\n($\\epsilon = 0.01$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.01$, $p = 1$)\nAlgorithm / scale\n\nNaive\nMCMM\n($\\epsilon = 0.1$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.1$, $p = 1$)\nMCMM\n($\\epsilon = 0.01$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.01$, $p = 1$)\nAlgorithm / scale\n\nNaive\nMCMM\n($\\epsilon = 0.1$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.1$, $p = 1$)\nMCMM\n($\\epsilon = 0.01$, $p = 0.9$)\nDFGT\n($\\epsilon = 0.01$, $p = 1$)\nAlgorithm / scale\n\n182\n5\n1 %\n2\n0 %\n4\n1 %\n2\n0 %\n0.1\n\n182\n26\n1 %\n6\n0 %\n27\n1 %\n7\n0 %\n10\n\n182\n48\n1 %\n19\n0 %\n58\n1 %\n30\n0 %\n100\n\n214\n4\n0 %\n4\n0 %\n4\n0 %\n4\n0 %\n0.01\n\n182\n3\n1 %\n2\n0 %\n3\n0 %\n2\n0 
%\n0.01\n\n182\n3\n1 %\n2\n0 %\n3\n0 %\n2\n0 %\n0.001\n\n214\n4\n0 %\n4\n0 %\n4\n0 %\n4\n0 %\n0.001\n\n182\n2\n5 %\n3\n0 %\n21\n7 %\n5\n0 %\n1000\nbio5-rnd (biology: drug activity), D = 5; N = 50000; h(cid:3) = 0:000567161\n214\n1\n1 %\n2\n0 %\n1\n1 %\n4\n0 %\n1000\n\n214\n65\n0 %\n65\n0 %\n126\n0 %\n126\n0 %\n100\npall7 (cid:0) rnd ; D = 7; N = 50000; h(cid:3) = 0:00131865\n327\n327\n224\n< 1\n12 % 0 %\n223\n263\n0 %\n0 %\n265\n5\n8 %\n1 %\n374\n299\n0 %\n0 %\n100\n1000\ncovtype (cid:0) rnd ; D = 10; N = 50000; h(cid:3) = 0:0154758\n380\n380\n< 1\n< 1\n0 %\n0 %\n244\n< 1\n0 %\n0 %\n2\n< 1\n10 % 0 %\n416\n< 1\n0 %\n0 %\n1000\n100\n\n327\n3\n0 %\n10\n0 %\n3\n0 %\n10\n0 %\n0.001\n\n214\n6\n0 %\n5\n0 %\n6\n0 %\n5\n0 %\n0.1\n\n327\n3\n0 %\n11\n0 %\n3\n0 %\n11\n0 %\n0.1\n\n214\n149\n1 %\n96\n0 %\n165\n1 %\n139\n0 %\n10\n\n327\n63\n1 %\n84\n0 %\n70\n2 %\n85\n0 %\n10\n\n380\n11\n0 %\n26\n0 %\n11\n0 %\n26\n0 %\n0.001\n\n182\n10\n1 %\n2\n0 %\n11\n1 %\n2\n0 %\n1\n\n214\n144\n0 %\n24\n0 %\n148\n0 %\n25\n0 %\n1\n\n327\n3\n1 %\n14\n0 %\n3\n1 %\n14\n0 %\n1\n\n380\n39\n1 %\n177\n0 %\n77\n1 %\n180\n0 %\n1\n\n327\n3\n0 %\n10\n0 %\n3\n0 %\n10\n0 %\n0.01\n\n380\n11\n0 %\n27\n0 %\n11\n0 %\n27\n0 %\n0.01\n\n380\n318\n0 %\n390\n0 %\n362\n1 %\n427\n0 %\n10\n\nCoocTexture (cid:0) rnd ; D = 16; N = 50000; h(cid:3) = 0:0263958\n\n1274\n97\n\n36\n\n127\n\n50\n\n(cid:6)\n\n1498\n373\n\n200\n\n454\n\n307\n\n(cid:6)\n\n2289\n300\n\n615\n\n352\n\n803\n\n(cid:6)\n\n2660\n381\n\n903\n\n477\n\n1115\n\n(cid:6)\n\n3304\n343\n\n889\n\n534\n\n1159\n\nNaive\nMCMM\n((cid:15) = 0:1; p = 0:9)\nDFGT\n((cid:15) = 0:1; p = 1)\nMCMM\n((cid:15) = 0:01; p = 0:9)\nDFGT\n((cid:15) = 0:01; p = 1)\nAlgorithm n scale\n\nNaive\nMCMM\n((cid:15) = 0:1; p = 0:9)\nDFGT\n((cid:15) = 0:1; p = 1)\nMCMM\n((cid:15) = 0:01; p = 0:9)\nDFGT\n((cid:15) = 0:01; p = 1)\n\n472\n10\n0 %\n22\n0 %\n10\n0 %\n22\n0 %\n\n472\n11\n0 %\n26\n0 %\n11\n0 %\n26\n0 %\n\n380\n13\n0 %\n38\n0 %\n13\n0 %\n38\n0 %\n0.1\n\n472\n22\n0 %\n82\n0 %\n22\n1 
%\n83\n0 %\n\n472\n189\n1 %\n240\n0 %\n204\n1 %\n254\n0 %\n\n7\n\n472\n472\n109\n< 1\n0 %\n8 %\n66\n452\n0 %\n0 %\n285\n< 1\n10 % 4 %\n543\n230\n0 %\n0 %\n\n472\n< 1\n0 %\n< 1\n0 %\n< 1\n0 %\n< 1\n0 %\n\n\fAlgorithm n scale\n\n0.001\n\n0.01\n\n0.1\n\n1\n\n10\n\n100\n\n1000\n\n(cid:6)\n\nLayoutHistogram (cid:0) rnd ; D = 32; N = 50000; h(cid:3) = 0:0609892\n\n757\n583\n1 %\n849\n0 %\n858\n1 %\n888\n0 %\n10\n\n757\n8\n0 %\n212\n0 %\n8\n0 %\n659\n0 %\n100\n\n1716\n1716\n1679\n17\n10 % 0 %\n836\n1772\n0 %\n0 %\n17\n1905\n2 %\n0 %\n1649\n1794\n0 %\n0 %\n\n757\n8\n0 %\n< 1\n0 %\n8\n0 %\n< 1\n0 %\n1000\n\n1716\n17\n0 %\n17\n0 %\n17\n0 %\n17\n0 %\n\n5299\n885\n\n2087\n\n1246\n\n2585\n\n(cid:6)\n\n12012\n3518\n\n6205\n\n3771\n\n7086\n\nNaive\nMCMM\n((cid:15) = 0:1; p = 0:9)\nDFGT\n((cid:15) = 0:1; p = 1)\nMCMM\n((cid:15) = 0:01; p = 0:9)\nDFGT\n((cid:15) = 0:01; p = 1)\nAlgorithm n scale\n\nNaive\nMCMM\n((cid:15) = 0:1; p = 0:9)\nDFGT\n((cid:15) = 0:1; p = 1)\nMCMM\n((cid:15) = 0:01; p = 0:9)\nDFGT\n((cid:15) = 0:01; p = 1)\n\n757\n32\n0 %\n153\n0 %\n32\n0 %\n153\n0 %\n0.001\n\n1716\n384\n0 %\n659\n0 %\n401\n0 %\n659\n0 %\n\n757\n32\n0 %\n159\n0 %\n45\n0 %\n159\n0 %\n0.01\n\n1716\n418\n0 %\n677\n0 %\n419\n0 %\n677\n0 %\n\n757\n54\n1 %\n221\n0 %\n60\n1 %\n222\n0 %\n0.1\n\n1716\n575\n0 %\n864\n0 %\n575\n0 %\n865\n0 %\n\n757\n168\n1 %\n492\n0 %\n183\n6 %\n503\n0 %\n1\n\n1716\n428\n1 %\n1397\n0 %\n437\n1 %\n1425\n0 %\n\nCorelCombined (cid:0) rnd ; D = 89; N = 50000; h(cid:3) = 0:0512583\n\nReferences\n[1] Nando de Freitas, Yang Wang, Maryam Mahdaviani, and Dustin Lang. Fast krylov methods for n-body\nlearning. In Y. Weiss, B. Sch\u00a4olkopf, and J. Platt, editors, Advances in Neural Information Processing\nSystems 18, pages 251(cid:150)258. MIT Press, Cambridge, MA, 2006.\n\n[2] P. Drineas, R. Kannan, and M. Mahoney. Fast monte carlo algorithms for matrices iii: Computing a\n\ncompressed approximate matrix decomposition, 2004.\n\n[3] G. Golub. 
Matrix Computations, Third Edition. The Johns Hopkins University Press, 1996.\n\n[4] A. Gray and A. W. Moore. N-Body Problems in Statistical Learning. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing Systems 13 (December 2000). MIT Press, 2001.\n\n[5] Alexander G. Gray and Andrew W. Moore. Nonparametric Density Estimation: Toward Computational Tractability. In SIAM International Conference on Data Mining 2003, 2003.\n\n[6] Alexander G. Gray and Andrew W. Moore. Very Fast Multivariate Kernel Density Estimation via Computational Geometry. In Joint Statistical Meeting 2003, 2003. to be submitted to JASA.\n\n[7] L. Greengard and J. Strain. The Fast Gauss Transform. SIAM Journal of Scientific and Statistical Computing, 12(1):79\u201394, 1991.\n\n[8] Peter Hall, David Marshall, and Ralph Martin. Merging and splitting eigenspace models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1042\u20131049, 2000.\n\n[9] Michael Holmes, Alexander Gray, and Charles Isbell. Ultrafast Monte Carlo for statistical summations. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 673\u2013680. MIT Press, Cambridge, MA, 2008.\n\n[10] Dongryeol Lee and Alexander Gray. Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence. 2006.\n\n[11] Dongryeol Lee, Alexander Gray, and Andrew Moore. Dual-tree fast Gauss transforms. In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 747\u2013754. MIT Press, Cambridge, MA, 2006.\n\n[12] Ting Liu, Andrew W. Moore, and Alexander Gray. Efficient exact k-nn and nonparametric classification in high dimensions. 
In Sebastian Thrun, Lawrence Saul, and Bernhard Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n[13] P. G. Martinsson and Vladimir Rokhlin. An accelerated kernel-independent fast multipole method in one dimension. SIAM J. Scientific Computing, 29(3):1160\u20131178, 2007.\n\n[14] A. W. Moore, J. Schneider, and K. Deng. Efficient locally weighted polynomial regression predictions. In D. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning, pages 196\u2013204, San Francisco, 1997. Morgan Kaufmann.\n\n[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC, 1986.\n", "award": [], "sourceid": 270, "authors": [{"given_name": "Dongryeol", "family_name": "Lee", "institution": null}, {"given_name": "Alexander", "family_name": "Gray", "institution": null}]}