{"title": "Space and Time Efficient Kernel Density Estimation in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 15799, "page_last": 15808, "abstract": "Recently, Charikar and Siminelakis (2017) presented a framework for kernel density estimation in provably sublinear query time, for kernels that possess a certain hashing-based property. However, their data structure requires a significantly increased super-linear storage space, as well as super-linear preprocessing time. These limitations inhibit the practical applicability of their approach on large datasets.\n\nIn this work, we present an improvement to their framework that retains the same query time, while requiring only linear space and linear preprocessing time. We instantiate our framework with the Laplacian and Exponential kernels, two popular kernels which possess the aforementioned property. Our experiments on various datasets verify that our approach attains accuracy and query time similar to Charikar and Siminelakis (2017), with significantly improved space and preprocessing time.", "full_text": "Space and Time Efficient Kernel Density Estimation in High Dimensions

Arturs Backurs* (TTIC, backurs@ttic.edu), Piotr Indyk (MIT, indyk@mit.edu), Tal Wagner (MIT, talw@mit.edu)

Abstract

Recently, Charikar and Siminelakis (2017) presented a framework for kernel density estimation in provably sublinear query time, for kernels that possess a certain hashing-based property. However, their data structure requires a significantly increased super-linear storage space, as well as super-linear preprocessing time. These limitations inhibit the practical applicability of their approach on large datasets.

In this work, we present an improvement to their framework that retains the same query time, while requiring only linear space and linear preprocessing time. 
We instantiate our framework with the Laplacian and Exponential kernels, two popular kernels which possess the aforementioned property. Our experiments on various datasets verify that our approach attains accuracy and query time similar to Charikar and Siminelakis (2017), with significantly improved space and preprocessing time.

1 Introduction

Kernel density estimation is a fundamental problem with many applications in statistics, machine learning and scientific computing. For a kernel function k : Rd × Rd → [0, 1], and a set of points X ⊂ Rd, the kernel density function of X at a point y ∈ Rd is defined as:²

KDE_X(y) = (1/|X|) Σ_{x∈X} k(x, y).

Typically the density function is evaluated on multiple queries y from an input set Y. Unfortunately, a naïve exact algorithm for this problem runs in O(|X||Y|) time, which makes it inefficient for large datasets X and Y. Because of this, most of the practical algorithms for this problem report approximate answers. Tree-based techniques [GS91, GM01, GB17] lead to highly efficient approximate algorithms in low-dimensional spaces, but their running times are exponential in d. In high-dimensional spaces, until recently, the best approximation/runtime tradeoff was provided by simple uniform random sampling. Specifically, for parameters τ, ε ∈ (0, 1), it can be seen that if X′ is a random sample of O((1/τ) · (1/ε²)) points from X, then KDE_{X′}(y) = (1 ± ε) KDE_X(y) with constant probability,³ as long as KDE_X(y) ≥ τ.

This approximation/runtime tradeoff was recently improved in [CS17], who proposed a framework based on Hashing-Based Estimators (HBE). The framework utilizes locality-sensitive hash (LSH) functions [IM98], i.e., randomly selected functions h : Rd → U with the property that for any x, y ∈ Rd, the collision probability Pr_h[h(x) = h(y)] is “roughly” related to the kernel value k(x, y). HBE reduces the evaluation time to (about) O(1/(√τ · ε²)). A recent empirical evaluation of this algorithm [SRB+19] showed that it is competitive with other state-of-the-art methods, while providing significant (up to one order of magnitude) runtime reduction in many scenarios.

*Authors ordered alphabetically.
²We note that all algorithms discussed in this paper easily extend to the case where each term k(x, y) is multiplied by a positive weight w_x ≥ 0, see e.g., [CS17].
³The probability of correct estimation can be amplified to 1 − δ for any δ > 0 at the cost of increasing the sample size by a factor of log(1/δ). Since the same observation applies to all algorithms considered in this paper, we will ignore the dependence on δ from now on.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Comparison of runtime and space bounds. Notation: τ ∈ (0, 1) denotes a lower bound for KDE values; d denotes the dimension; ε ∈ (0, 1) denotes the approximation error.⁴

Algorithm        | Query Time         | # Stored hashes
Random Sampling  | O(d/τ · 1/ε²)      | O(1/τ · 1/ε²)
HBE              | O(d/√τ · 1/ε²)     | O(1/τ^{3/2} · 1/ε⁴)
This paper       | O(d/√τ · 1/ε²)     | O(1/τ · 1/ε²)

One drawback of the HBE approach, however, is its space usage, which is super-linear in the dataset size. Specifically, the algorithm constructs O(1/(√τ · ε²)) hash tables, and stores the hash of each data point in each table. Consequently, the additional storage required for the hashes is proportional to the number of tables times the number of data points. As mentioned above, we can uniformly subsample the dataset down to O(1/(τ ε²)) points, leading to an overall space usage of O(1/(τ^{3/2} ε⁴)), which is O(1/(√τ ε²)) times that of the simple random sampling approach. The increase in storage also affects the preprocessing time of the HBE data structure, which requires O(1/(τ^{3/2} ε⁴)) hash computations due to having to store every point in every table. As τ and ε can be very close to zero in practice, these drawbacks may pose a substantial bottleneck in dealing with large datasets.

Our results. In this paper we show that the super-linear amount of storage is in fact not needed to achieve the runtime bound guaranteed by the HBE algorithm. Specifically, we modify the HBE algorithm in a subtle but crucial way, and show that this modification reduces the storage to (roughly) O(1/(τ ε²)), i.e., the same as simple random sampling. Table 1 summarizes the performance of the respective algorithms. Our main result is the following theorem.

Theorem 1. 
Let k(x, y) be a kernel function, for which there exists a distribution H of hash functions and M ≥ 1 such that for every x, y ∈ Rd,

M⁻¹ · k(x, y)^{1/2} ≤ Pr_{h∼H}[h(x) = h(y)] ≤ M · k(x, y)^{1/2}.    (1)

There exists a data structure for Kernel Density Estimation with the following properties:

• Given a dataset X ⊂ Rd and parameters τ, ε ∈ (0, 1), we preprocess it in O((1/τ) · T_H M³/ε²) time, where T_H is the time to compute a hash value h(x).

• The space usage of the data structure is O((1/τ) · (S_X + S_H) M³/ε²), where S_X is the space needed to store a point x ∈ X, and S_H is the space needed to store a hash value h(x).

• Given a query point y such that KDE_X(y) ≥ τ, we can return with constant probability a (1 ± ε)-approximation of KDE_X(y) in O((1/√τ) · (T_k + T_H) M³/ε²) time, where T_k is the time to compute a kernel value k(x, y).

⁴For simplicity, the bounds in the table assume that the kernel takes O(d) time to compute, and that a hash value takes O(d) time to compute. The kernels we consider have these properties (for bandwidth σ = Ω(1)). See Theorem 1 for the full parameter dependence.

We empirically evaluate our approach on the Laplacian kernel k(x, y) = e^{−‖x−y‖₁/σ} and the exponential kernel k(x, y) = e^{−‖x−y‖₂/σ}. Both are commonly used kernels, and they fit into the framework as they satisfy the requirements of Theorem 1 with M = O(1), T_k = O(d), T_H = O(min{d, d/σ}) and S_H = O(min{d log(1/σ), d/σ}), with high probability (over h ∼ H). 
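As an illustrative aside (our own sketch, not code from the paper), the collision-probability property in Equation (1) with M = 1 can be checked empirically for the Laplacian kernel, using the σ ≥ 1 hashing family described later in Section 3.1; all names, constants, and the Monte-Carlo setup below are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 10, 2.0
x, y = rng.random(d), rng.random(d)   # coordinates assumed to lie in [0, 1]

def hash_pair_collides(x, y, d, sigma, rng):
    # One draw from the sigma >= 1 family of Section 3.1:
    # rho ~ Poisson(d / (2 sigma)) random (coordinate, threshold) bit tests.
    rho = rng.poisson(d / (2 * sigma))
    coords = rng.integers(0, d, size=rho)
    thresholds = rng.random(rho)
    return np.array_equal(x[coords] > thresholds, y[coords] > thresholds)

trials = 10000
empirical = float(np.mean([hash_pair_collides(x, y, d, sigma, rng)
                           for _ in range(trials)]))
# Claimed collision probability: exp(-||x - y||_1 / (2 sigma)) = k(x, y) ** 0.5.
predicted = float(np.exp(-np.abs(x - y).sum() / (2 * sigma)))
```

Over enough trials, `empirical` should concentrate around `predicted`, matching the claim that the family achieves M = O(1) for the Laplacian kernel.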
Our experiments confirm the analytic bounds and show that our approach attains a similar query time to approximation tradeoff as HBE, while using significantly less space and preprocessing time.

Our techniques. Our algorithm builds on the HBE approach of [CS17]. Recall that the algorithm selects L = Θ(√(1/τ) · 1/ε²) LSH functions h1, ..., hL, and creates L hash tables, such that for each j = 1, ..., L, each point x ∈ X is placed in the jth table in bin hj(x). To estimate KDE_X(y), the algorithm selects one point from each bin h1(y), ..., hL(y), and uses those points for estimation. To achieve the performance as in Table 1, the algorithm is applied to a random sample of size s = O(1/τ · 1/ε²). The total space is therefore bounded by O(sL) = O(1/τ^{3/2} · 1/ε⁴).

A natural approach to improving the space bound would be to run HBE on a smaller sample. Unfortunately, it is easy to observe that any algorithm must use at least Ω(1/τ · 1/ε²) samples to guarantee a (1 ± ε)-approximation. Therefore, instead of sub-sampling the whole input to the HBE algorithm, we sub-sample the content of each hash table independently for each hash function hj, j = 1, ..., L. Specifically, for each hash function hj, we include a point x ∈ X in the jth hash table with probability 1/(s√τ). This reduces the expected number of stored hashes to O(L/√τ). If we start from a sample of size s = Θ(1/τ · 1/ε²), then L/√τ = O(s), yielding the desired space bound; at the same time, each point is included in at least one hash table with constant probability, which means that at least Ω(1/τ · 1/ε²) points will be included in the union of the hash tables with high probability. 
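To make the construction concrete, the following is a minimal NumPy sketch of the subsampled data structure, following the pseudocode of Algorithm 1 in Section 3 and the σ ≥ 1 Laplacian LSH family of Lemma 4. This is our own illustrative sketch, not the authors' released code; all function names, the per-table inclusion probability L/n (as printed in Algorithm 1), and the demo constants are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian_kernel(x, y, sigma):
    return float(np.exp(-np.abs(x - y).sum() / sigma))

def sample_hash(d, sigma, rng):
    # One hash function with Pr[h(x) = h(y)] = exp(-||x - y||_1 / (2 sigma))
    # for coordinates in [0, 1] (the sigma >= 1 family of Lemma 4).
    rho = rng.poisson(d / (2.0 * sigma))     # number of hash bits
    coords = rng.integers(0, d, size=rho)    # random coordinates
    thresholds = rng.random(rho)             # random thresholds in [0, 1]
    return lambda x, c=coords, t=thresholds: tuple(x[c] > t)

def build(X, L, sigma, rng):
    # L hash tables; each stores an independent subsample of X that keeps
    # every point with probability L / n (per-table subsampling).
    n = len(X)
    tables = []
    for _ in range(L):
        h = sample_hash(X.shape[1], sigma, rng)
        kept = np.flatnonzero(rng.random(n) < L / n)
        table = {}
        for i in kept:
            table.setdefault(h(X[i]), []).append(i)
        tables.append((h, table))
    return tables

def kde_estimate(X, tables, y, sigma, rng):
    # Average the per-table estimators Z_j; empty bins contribute 0.
    L = len(tables)
    total = 0.0
    for h, table in tables:
        bucket = table.get(h(y), [])
        if not bucket:
            continue
        i = bucket[rng.integers(len(bucket))]            # uniform point in bin
        k = laplacian_kernel(X[i], y, sigma)
        total += k * len(bucket) / (L * np.sqrt(k))      # collision prob = sqrt(k)
    return total / L

# Tiny demo: the averaged estimate should track the exact KDE.
n, d, sigma, L = 200, 5, 1.0, 20
X = rng.random((n, d))
y = rng.random(d)
truth = float(np.mean([laplacian_kernel(x, y, sigma) for x in X]))
estimate = float(np.mean([kde_estimate(X, build(X, L, sigma, rng), y, sigma, rng)
                          for _ in range(40)]))
```

Note that each Z_j divides by the exact collision probability √k(x, y) and by L, which compensates for the L/n subsampling and keeps every per-table estimate unbiased.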
Perhaps surprisingly, we show that this increases the variance of the overall estimator by only a constant factor.

For an intuition of why subsampling by a factor √τ does not distort the kernel values by much, consider a simple setting where ε is a constant, n = 1/τ, and there is only one data point x that is very close to the query y (contributing ≈ 1) while all other points are far from y (contributing ≈ 0). In this case, the original HBE algorithm would collect the point x from every bin h1(y), ..., hL(y), where L = √(1/τ). In contrast, if we subsample by a factor √τ, then x is expected to survive in one table, and thus our algorithm is still likely to identify one such bin in expectation. Conditioned on this event, the estimate of the algorithm is approximately correct. See more details in Section 3.

1.1 Related work

There is a vast amount of work on fast kernel density estimation in low dimensions, including the seminal Fast Gauss Transform [GS91] and other tree-based methods [GM01, GB17]. However, as mentioned above, they entail an exponential dependence on the input dimension. The tree-based ASKIT algorithm [MXB15] avoids this dependence and is suitable for the high-dimensional regime. However, it lacks rigorous guarantees on the approximation quality. The empirical evaluation in [SRB+19] showed that HBE is consistently competitive with ASKIT, and in some settings outperforms it by an order of magnitude.

Another important line of research has focused on sparsifying (reducing the size of) the input pointset while preserving kernel density function values. This can be accomplished by constructing coresets [Phi13, ZJPL13, PT18] or related approaches [CWS12, SRB+19]. 
Although effective in low dimensions, in high dimensions such approaches require Ω(1/ε²) samples (for an additive error of ε > 0 [PT18]), which is the same as the simple random sampling approach.⁵ We note that the sparsification approach can be combined with our improvement, as we can run our algorithm on a coreset instead of the original data set, and retain the coreset size while speeding up the query time.

In addition to the aforementioned works of [CS17, SRB+19], LSH-based estimators have been applied in [CXS18, LS18b, WCN18, LS18a] to a variety of machine learning tasks.

⁵However, coresets preserve all KDE values with high probability, while simple random sampling only preserves the KDE of any individual query with high probability.

2 Preliminaries

Kernel Density Estimation. Consider a kernel map k : Rd × Rd → [0, 1]. The kernel density estimation problem can be formally stated as follows.

Definition 2. Let X = {x1, ..., xn} ⊂ Rd be an input dataset, and ε, τ ∈ (0, 1) input parameters. Our goal is to construct a data structure such that for every query point y ∈ Rd that satisfies KDE_X(y) ≥ τ, we can return an estimate K̂DE(y) such that with constant probability,

(1 − ε) KDE_X(y) ≤ K̂DE(y) ≤ (1 + ε) KDE_X(y).

An exact computation of KDE_X(y) performs n kernel evaluations. By standard concentration inequalities, the above approximation can be achieved by evaluating the kernel at y with only O(1/τ · 1/ε²) points chosen uniformly at random from X, and returning the average. As a result, we can assume without loss of generality (and up to scaling ε by a constant) that n = O(1/τ · 1/ε²).

LSHable kernels. 
Locality-Sensitive Hashing (LSH) is a widely used framework for hashing metric datasets in a way that relates the collision probability of each pair of points to their geometric similarity. Kernel maps for which such hashing families exist are called “LSHable” [CK15]. The precise variant we will need is defined as follows.

Definition 3. The kernel k is called (1/2, M)-LSHable if there exists a family H of hash functions h : Rd → {0, 1}*, such that for every x, y ∈ Rd, Equation (1) holds.⁶

Laplacian and Exponential kernels. The Laplacian kernel is k(x, y) = e^{−‖x−y‖₁/σ}, where σ > 0 is the bandwidth parameter. The exponential kernel is defined similarly as k(x, y) = e^{−‖x−y‖₂/σ} (the difference is in the use of the ℓ2-norm instead of the ℓ1-norm). For our purposes the two are essentially equivalent, as they give the same analytic and empirical results. We will mostly focus on the Laplacian kernel, since as we will see, it is (1/2, 1)-LSHable. As a corollary, a random rotation of the dataset [DIIM04, CS17] can be used to show that the Exponential kernel is (1/2, O(1))-LSHable.

3 The Data Structure

We begin by recalling the HBE-based KDE data structure of [CS17]. For simplicity consider the case M = 1. During preprocessing, they sample L = O(1/(√τ ε²)) hash functions h1, ..., hL from the LSH family H, and store hj(xi) for every i = 1, ..., n and j = 1, ..., L. The preprocessing time is O(T_H · n/(√τ ε²)), and the space usage (in addition to the dataset) is O(S_H · n/(√τ ε²)), where T_H is the time needed to evaluate the hash value of a point, and S_H is the space needed to store it. 
Recalling that we have assumed n = O(1/(τ ε²)), these become O(T_H/(τ^{3/2} ε⁴)) and O(S_H/(τ^{3/2} ε⁴)), respectively.

Given a query point y, let bj(y) := {xi : hj(xi) = hj(y)} be the set (“bin”) of data points whose hj-hash is the same as that of y. The estimator picks a uniformly random data point x from bj(y) and computes Zj = (1/n) · |bj(y)| · √k(x, y). If bj(y) is empty, then Zj = 0. The final KDE estimate is K̂DE(y) = (1/L) Σ_{j=1}^{L} Zj. The query time is O((T_H + T_k)/(√τ ε²)), where T_k is the time it takes to evaluate k on a single pair.

Our data structure is similar, except that for every hj, we store the hash of every data point only with probability δ = 1/(n√τ). Therefore, on average we only compute and store a constant number of hashes of each data point, yielding expected preprocessing time of O(T_H/(τ · ε²)) and space usage of O(S_H/(τ · ε²)). The exact algorithm is given in Algorithm 1. Theorem 1, whose proof appears in the appendix, shows this still returns a sufficiently good estimate of KDE_X(y).

Example. Let us give an illustration of the different approaches on the setting mentioned in the introduction. Suppose ε = Θ(1) and n ≈ 1/τ. Consider a setting in which the query point is very close to a unique data point and very far from the rest of the data points. Concretely, k(x1, y) ≈ 1, while k(xi, y) ≈ 0 for every i > 1. The KDE value is KDE_X(y) ≈ τ. Naïve random sampling would have to sample Ω(1/τ) points in order to pick up x1 and return a correct estimate.

⁶The HBE framework of [CS17] accommodates (β, M)-LSHable kernels, which satisfy M⁻¹ · k(x, y)^β ≤ Pr_{h∼H}[h(x) = h(y)] ≤ M · k(x, y)^β, where β can take any value in [1/2, 1), and lower β is better. 
Since the kernels we consider attain the optimal setting β = 1/2, we fix this value throughout.

⁷This can be implemented in expected time O(L) by sampling L̃ ∼ Binomial(n, L/n), and then sampling a uniformly random subset of size L̃.

Algorithm 1: Space-Efficient HBE

Preprocessing:
  Input: Dataset X ⊂ Rd of n points; kernel k(·, ·); LSH family H; integer 1 ≤ L ≤ n.
  For j = 1, ..., L:
    Sample a random hash function hj from H.
    Let Xj ⊂ X be a random subset that includes each point with independent probability L/n.⁷
    For every x ∈ Xj, evaluate and store hj(x).

Query:
  Input: Query point y ∈ Rd.
  For j = 1, ..., L:
    Sample a uniformly random point x(j) from bj(y) = {x ∈ Xj : hj(x) = hj(y)}.
    Let Zj ← k(x(j), y) · |bj(y)| / (L · Pr_{h∼H}[h(x(j)) = h(y)]).
  Return (1/L) Σ_{j=1}^{L} Zj.

In the HBE algorithm of [CS17], essentially in all hash tables x1 would be the unique data point in the same bin as y, leading to a correct estimate (1/L) Σ_{j=1}^{L} (1/n) |bj(y)| √k(x1, y) ≈ τ. However, note that all terms in the sum are equal (to τ), which seems to be somewhat wasteful. Indeed, it would suffice to pick up x1 in just one hash table instead of all of them.

In our method, x1 would be stored in δL ≈ 1 hash tables in expectation, say only in h1, and in that table it would be the unique data point in b1(y). In the other tables (j > 1), bj(y) would be empty, which means the estimator evaluates to zero. 
The resulting KDE estimate is (1/L) · ((1/(nδ)) |b1(y)| √k(x1, y) + Σ_{j=2}^{L} 0) ≈ τ, which is still correct, while we have stored a hash of x1 just once instead of L times.

3.1 LSH for the Laplacian Kernel

The Laplacian kernel k(x, y) = e^{−‖x−y‖₁/σ} is a popular kernel, which fits naturally into the above framework since it is (1/2, 1)-LSHable. For simplicity, let us assume w.l.o.g. that all point coordinates in the dataset we need to hash are in [0, 1]. This does not limit the generality, since the Laplacian kernel is shift-invariant, and the coordinates can be scaled by inversely scaling σ.

The LSHability of the Laplacian kernel follows from the Random Binning Features construction of Rahimi and Recht [RR07] (see details in the appendix). The expected hash size is O(d log(1/σ)), and the hash evaluation time is O(d). We also give a variant (described below) with better hash size and evaluation time for σ ≥ 1. Together, the following lemma holds.

Lemma 4. There is an LSH family Hσ such that for every x, y ∈ Rd, Pr_{h∼Hσ}[h(x) = h(y)] = e^{−‖x−y‖₁/(2σ)}. The expected hash size is S_{Hσ} = O(min{d log(1/σ), d/σ}) bits. The expected hashing time is T_{Hσ} = O(min{d, d/σ}).

The hashing family for the case σ ≥ 1 is given as follows. Sample ρ ∼ Poisson(d/(2σ)). Then sample ξ1, ..., ξρ ∈ {1, ..., d} independently and uniformly at random, and ζ1, ..., ζρ ∈ [0, 1] independently and uniformly at random. These random choices determine the hash function h, which we now describe. Given a point x to hash, for every i = 1, ..., ρ set bi = 1 if x_{ξi} > ζi and bi = 0 otherwise. The hash h(x) is the concatenation of b1, ..., bρ. It is not hard to verify (see appendix) 
It is not hard to verify (see appendix)\nthat Prh[h(x) = h(y)] = e\u2212(cid:107)x\u2212y(cid:107)1/(2\u03c3).\nUsing the LSH family from Lemma 4 in Theorem 1 yields the following concrete data structure.\nCorollary 5 (Data structure for Laplacian KDE). For the Laplacian kernel, there is a data structure\nfor the KDE problem with expected space overhead O(min{d log(1/\u03c3), d/\u03c3}/(\u03c4 \u00012)), expected\npreprocessing time O(min{d, d/\u03c3}/(\u03c4 \u00012)), and query time O(d/(\n\n\u03c4 \u00012)).\n\n\u221a\n\n5\n\n\fTable 2: Properties of the datasets used in our experiments.\n\nName\nCovertype\nCensus\nGloVe\nMNIST\n\nDescription\nforest cover type\nU.S. census\nword embeddings\nhand-written digits\n\nNumber of points Dimension\n581, 012\n2, 458, 285\n1, 183, 514\n60, 000\n\n55\n68\n100\n784\n\n4 Empirical Evaluation\n\nWe empirically evaluate our data structure for the Laplacian kernel.8 For brevity, we will refer to\nthe random sampling method as RS. The experimental results presented in this section are for the\nthe Laplacian kernel k(x, y) = e\u2212(cid:107)x\u2212y(cid:107)1/\u03c3. The results for the Exponential kernel are qualitatively\nsimilar are included in the appendix.\nChoice of datasets. While the worst-case analysis shows that the HBE approach has asymptotically\nbetter query time than RS, it is important to note that RS can still attain superior performance in\nsome practical settings. Indeed, the recent paper [SRB+19] found this to be the case on various\nstandard benchmark datasets, such as GloVe word embeddings [PSM14]. To re\ufb02ect this in our\nexperiments, we choose two datasets on which [SRB+19] found HBE to be superior to RS as well\nas to state-of-the-art methods, and two datasets on which RS was found to be superior. The former\ntwo are Covertype [BD99] and Census9, and the latter two are GloVe [PSM14] and MNIST [LC98].\nTheir properties are summarized in Table 2.\nExperimental setting. 
We implement and evaluate Algorithm 1. Note that it is parameterized by the number of hash tables L, while its analysis in Theorem 1 is parameterized in terms of τ, ε, where we recall that L = Θ(1/(√τ ε²)). For a practical implementation, parameterizing by L is more natural since it acts as a smooth handle on the resources-to-accuracy tradeoff: larger L yields better KDE estimates at the expense of using more time and space. τ, ε need not be specified explicitly; instead, for any τ, ε that satisfy L = Ω(1/(√τ ε²)), the guarantee of Theorem 1 holds (namely, for every query whose true KDE is at least τ, the KDE estimate has up to ε relative error with high probability).

We compare our method to the HBE method of [CS17], as well as to RS as a baseline. The plots for HBE and our method are generated by varying the number of hash functions L. The plots for RS are generated by varying the sample size. Note that neither method has any additional parameters to set. For each method and each parameter setting, we report the median result of 3 trials. For each dataset we choose two bandwidth settings, one which yields median KDE values of order 10⁻², and the other of order 10⁻³.¹⁰ The bandwidth values and their precise method of choice are specified in the appendix. The appendix also includes accuracy results for varying bandwidth values (Fig. 9).

Evaluation metrics. We evaluate the query time, space usage and preprocessing time. In all of the plots, the y-axis measures the average relative error (which directly corresponds to ε) of the KDE estimate, over 100 query points randomly chosen from the dataset. In the query time plots, the x-axis counts the number of kernel evaluations per query, which dominates and serves as a proxy for the running time. In the space usage plots, the x-axis counts the number of stored hashes. 
We use this measure for the space usage rather than the actual size in bits, since there are various efficient ways to store each hash, and they apply equally to all algorithms. We also note that the plots do not account for the space needed to store the sampled dataset itself, which is the same for all methods. RS is not displayed on these plots since it has no additional space usage. In all three methods the preprocessing time is proportional to the additional space usage, and is qualitatively captured by the same plots.

Results. The query time plots consistently show that the query-time-to-approximation-quality tradeoff of our method is essentially the same as that of [CS17], on all datasets. At the same time, the space usage plots show that we achieve a significantly smaller space overhead, with the gap from [CS17] substantially increasing as the target relative error becomes smaller. These findings affirm the direct advantage of our method as specified in Table 1.

⁸Our code is available at https://github.com/talwagner/efficient_kde.
⁹Available at https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990).
¹⁰In all the considered settings, the average KDE value is within a factor of at most 2 from the median KDE.

Figure 1: Covertype dataset, typical KDE values of order 10⁻².
Figure 2: Covertype dataset, typical KDE values of order 10⁻³.
Figure 3: Census dataset, typical KDE values of order 10⁻².
Figure 4: Census dataset, typical KDE values of order 10⁻³.
Figure 5: MNIST dataset, typical KDE values of order 10⁻².
Figure 6: MNIST dataset, typical KDE values of order 10⁻³.
Figure 7: GloVe dataset, typical KDE values of order 10⁻².
Figure 8: GloVe dataset, typical KDE values of order 10⁻³.

Acknowledgments

We thank the anonymous reviewers for useful suggestions. 
Piotr Indyk was supported by NSF TRIPODS award #1740751 and by a Simons Investigator Award.

References

[AI06] Alexandr Andoni and Piotr Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Foundations of Computer Science (FOCS), IEEE, 2006, pp. 459-468.
[BD99] Jock A. Blackard and Denis J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture 24 (1999), no. 3, 131-151.
[CK15] Flavio Chierichetti and Ravi Kumar, LSH-preserving functions and their applications, Journal of the ACM (JACM) 62 (2015), no. 5, 33.
[CS17] Moses Charikar and Paris Siminelakis, Hashing-based-estimators for kernel density in high dimensions, Foundations of Computer Science (FOCS), 2017.
[CWS12] Yutian Chen, Max Welling, and Alex Smola, Super-samples from kernel herding, 2012.
[CXS18] Beidi Chen, Yingchen Xu, and Anshumali Shrivastava, LSH-sampling breaks the computational chicken-and-egg loop in adaptive stochastic gradient estimation, 2018.
[DIIM04] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the Twentieth Annual Symposium on Computational Geometry, ACM, 2004, pp. 253-262.
[GB17] Edward Gan and Peter Bailis, Scalable kernel density classification via threshold-based pruning, Proceedings of the 2017 ACM International Conference on Management of Data, ACM, 2017, pp. 945-959.
[GM01] Alexander G. Gray and Andrew W. Moore, 'N-body' problems in statistical learning, Advances in Neural Information Processing Systems, 2001, pp. 521-527.
[GS91] Leslie Greengard and John Strain, The fast Gauss transform, SIAM Journal on Scientific and Statistical Computing 12 (1991), no. 1, 79-94.
[IM98] Piotr Indyk and Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ACM, 1998, pp. 604-613.
[JDH99] Tommi S. Jaakkola, Mark Diekhans, and David Haussler, Using the Fisher kernel method to detect remote protein homologies, ISMB, vol. 99, 1999, pp. 149-158.
[LC98] Yann LeCun and Corinna Cortes, The MNIST database of handwritten digits, 1998.
[LS18a] Chen Luo and Anshumali Shrivastava, Arrays of (locality-sensitive) count estimators (ACE): Anomaly detection on the edge, Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1439-1448.
[LS18b] Chen Luo and Anshumali Shrivastava, Scaling-up split-merge MCMC with locality sensitive sampling (LSS), Preprint, 2018.
[MXB15] William B. March, Bo Xiao, and George Biros, ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions, SIAM Journal on Scientific Computing 37 (2015), no. 2, A1089-A1110.
[Phi13] Jeff M. Phillips, ε-samples for kernels, Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2013, pp. 1622-1632.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[PT18] Jeff M. Phillips and Wai Ming Tai, Near-optimal coresets of kernel density estimates, arXiv preprint arXiv:1802.01751 (2018).
[RR07] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems, 2007, pp. 1177-1184.
[SRB+19] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis, Rehashing kernel evaluation in high dimensions, International Conference on Machine Learning, 2019.
[WCN18] Xian Wu, Moses Charikar, and Vishnu Natchu, Local density estimation in high dimensions, arXiv preprint arXiv:1809.07471 (2018).
[ZJPL13] Yan Zheng, Jeffrey Jestes, Jeff M. Phillips, and Feifei Li, Quality and efficiency for kernel density estimates in large data, Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, 2013, pp. 433-444.
", "award": [], "sourceid": 9251, "authors": [{"given_name": "Arturs", "family_name": "Backurs", "institution": "MIT"}, {"given_name": "Piotr", "family_name": "Indyk", "institution": "MIT"}, {"given_name": "Tal", "family_name": "Wagner", "institution": "MIT"}]}