{"title": "Nearest-Neighbor Sample Compression: Efficiency, Consistency, Infinite Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 1573, "page_last": 1583, "abstract": "We examine the Bayes-consistency of a recently proposed 1-nearest-neighbor-based multiclass learning algorithm. This algorithm is derived from sample compression bounds and enjoys the statistical advantages of tight, fully empirical generalization bounds, as well as the algorithmic advantages of a faster runtime and memory savings. We prove that this algorithm is strongly Bayes-consistent in metric spaces with finite doubling dimension --- the first consistency result for an efficient nearest-neighbor sample compression scheme. Rather surprisingly, we discover that this algorithm continues to be Bayes-consistent even in a certain infinite-dimensional setting, in which the basic measure-theoretic conditions on which classic consistency proofs hinge are violated. This is all the more surprising, since it is known that k-NN is not Bayes-consistent in this setting. We pose several challenging open problems for future research.", "full_text": "Nearest-Neighbor Sample Compression:\nEf\ufb01ciency, Consistency, In\ufb01nite Dimensions\n\nAryeh Kontorovich\n\nDepartment of Computer Science\nBen-Gurion University of the Negev\n\nkaryeh@cs.bgu.ac.il\n\nSivan Sabato\n\nDepartment of Computer Science\nBen-Gurion University of the Negev\n\nsabatos@bgu.ac.il\n\nDepartment of Computer Science and Applied Mathematics\n\nRoi Weiss\n\nWeizmann Institute of Science\nroiw@weizmann.ac.il\n\nAbstract\n\nWe examine the Bayes-consistency of a recently proposed 1-nearest-neighbor-based\nmulticlass learning algorithm. This algorithm is derived from sample compression\nbounds and enjoys the statistical advantages of tight, fully empirical generalization\nbounds, as well as the algorithmic advantages of a faster runtime and memory\nsavings. 
We prove that this algorithm is strongly Bayes-consistent in metric\nspaces with \ufb01nite doubling dimension \u2014 the \ufb01rst consistency result for an ef\ufb01cient\nnearest-neighbor sample compression scheme. Rather surprisingly, we discover\nthat this algorithm continues to be Bayes-consistent even in a certain in\ufb01nite-\ndimensional setting, in which the basic measure-theoretic conditions on which\nclassic consistency proofs hinge are violated. This is all the more surprising, since\nit is known that k-NN is not Bayes-consistent in this setting. We pose several\nchallenging open problems for future research.\n\n1\n\nIntroduction\n\nThis paper deals with Nearest-Neighbor (NN) learning algorithms in metric spaces. Initiated by\nFix and Hodges in 1951 [16], this seemingly naive learning paradigm remains competitive against\nmore sophisticated methods [8, 46] and, in its celebrated k-NN version, has been placed on a solid\ntheoretical foundation [11, 44, 13, 47].\nAlthough the classic 1-NN is well known to be inconsistent in general, in recent years a series of\npapers has presented variations on the theme of a regularized 1-NN classi\ufb01er, as an alternative to the\nBayes-consistent k-NN. Gottlieb et al. [18] showed that approximate nearest neighbor search can\nact as a regularizer, actually improving generalization performance rather than just injecting noise.\nIn a follow-up work, [27] showed that applying Structural Risk Minimization to (essentially) the\nmargin-regularized data-dependent bound in [18] yields a strongly Bayes-consistent 1-NN classi\ufb01er.\nA further development has seen margin-based regularization analyzed through the lens of sample\ncompression: a near-optimal nearest neighbor condensing algorithm was presented [20] and later\nextended to cover semimetric spaces [21]; an activized version also appeared [25]. 
As detailed in\n[27], margin-regularized 1-NN methods enjoy a number of statistical and computational advantages\nover the traditional k-NN classi\ufb01er. Salient among these are explicit data-dependent generalization\nbounds, and considerable runtime and memory savings. Sample compression affords additional\nadvantages, in the form of tighter generalization bounds and increased ef\ufb01ciency in time and space.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn this work we study the Bayes-consistency of a compression-based 1-NN multiclass learning\nalgorithm, in both \ufb01nite-dimensional and in\ufb01nite-dimensional metric spaces. The algorithm is\nessentially the passive component of the active learner proposed by Kontorovich, Sabato, and Urner\nin [25], and we refer to it in the sequel as KSU; for completeness, we present it here in full (Alg. 1).\nWe show that in \ufb01nite-dimensional metric spaces, KSU is both computationally ef\ufb01cient and Bayes-\nconsistent. This is the \ufb01rst compression-based multiclass 1-NN algorithm proven to possess both of\nthese properties. We further exhibit a surprising phenomenon in in\ufb01nite-dimensional spaces, where\nwe construct a distribution for which KSU is Bayes-consistent while k-NN is not.\n\nMain results. Our main contributions consist of analyzing the performance of KSU in \ufb01nite and\nin\ufb01nite dimensional settings, and comparing it to the classical k-NN learner. Our key \ufb01ndings are\nsummarized below.\n\n\u2022 In Theorem 2, we show that KSU is computationally ef\ufb01cient and strongly Bayes-consistent\nin metric spaces with a \ufb01nite doubling dimension. This is the \ufb01rst (strong or otherwise)\nBayes-consistency result for an ef\ufb01cient sample compression scheme for a multiclass (or\neven binary)1 1-NN algorithm. 
This result should be contrasted with the one in [27], where margin-based regularization was employed, but not compression; the proof techniques from [27] do not carry over to the compression-based scheme. Instead, novel arguments are required, as we discuss below. The new sample compression technique provides a Bayes-consistency proof for multiple (even countably many) labels; this is contrasted with the multiclass 1-NN algorithm in [28], which is not compression-based, and requires solving a minimum vertex cover problem, thereby imposing a 2-approximation factor whenever there are more than two labels.

• In Theorem 4, we make the surprising discovery that KSU continues to be Bayes-consistent in a certain infinite-dimensional setting, even though this setting violates the basic measure-theoretic conditions on which classic consistency proofs hinge, including our own proof of Theorem 2. This is all the more surprising, since it is known that k-NN is not Bayes-consistent for this construction [9]. We are currently unaware of any separable² metric probability space on which KSU fails to be Bayes-consistent; this is posed as an intriguing open problem.

Our results indicate that in finite dimensions, an efficient, compression-based, Bayes-consistent multiclass 1-NN algorithm exists, and hence can be offered as an alternative to k-NN, which is well known to be Bayes-consistent in finite dimensions [12, 41]. In contrast, in infinite dimensions, our results show that the condition characterizing the Bayes-consistency of k-NN does not extend to all NN algorithms. It is an open problem to characterize the necessary and sufficient conditions for the existence of a Bayes-consistent NN-based algorithm in infinite dimensions.

Related work. Following the pioneering work of [11] on nearest-neighbor classification, it was shown by [13, 47, 14] that the k-NN classifier is strongly Bayes-consistent in Rd. 
These results\nmade extensive use of the Euclidean structure of Rd, but in [41] a weak Bayes-consistency result was\nshown for metric spaces with a bounded diameter and a bounded doubling dimension, and additional\ndistributional smoothness assumptions. More recently, some of the classic results on k-NN risk\ndecay rates were re\ufb01ned by [10] in an analysis that captures the interplay between the metric and the\nsampling distribution. The worst-case rates have an exponential dependence on the dimension (i.e.,\nthe so-called curse of dimensionality), and Pestov [33, 34] examines this phenomenon closely under\nvarious distributional and structural assumptions.\nConsistency of NN-type algorithms in more general (and in particular in\ufb01nite-dimensional) metric\nspaces was discussed in [1, 5, 6, 9, 30]. In [1, 9], characterizations of Bayes-consistency were\ngiven in terms of Besicovitch-type conditions (see Eq. (3)). In [1], a generalized \u201cmoving window\u201d\nclassi\ufb01cation rule is used and additional regularity conditions on the regression function are imposed.\nThe \ufb01ltering technique (i.e., taking the \ufb01rst d coordinates in some basis representation) was shown to\nbe universally consistent in [5]. However, that algorithm suffers from the cost of cross-validating\nover both the dimension d and number of neighbors k. Also, the technique is only applicable in\n\n1 An ef\ufb01cient sample compression algorithm was given in [20] for the binary case, but no Bayes-consistency\n\nguarantee is known for it.\n\n2C\u00e9rou and Guyader [9] gave a simple example of a nonseparable metric on which all known nearest-neighbor\n\nmethods, including k-NN and KSU, obviously fail.\n\n2\n\n\fHilbert spaces (as opposed to more general metric spaces) and provides only asymptotic consistency,\nwithout \ufb01nite-sample bounds such as those provided by KSU. 
The insight of [5] is extended to the\nmore general Banach spaces in [6] under various regularity assumptions.\nNone of the aforementioned generalization results for NN-based techniques are in the form of\nfully empirical, explicitly computable sample-dependent error bounds. Rather, they are stated in\nterms of the unknown Bayes-optimal rate, and some involve additional parameters quantifying the\nwell-behavedness of the unknown distribution (see [27] for a detailed discussion). As such, these\nguarantees do not enable a practitioner to compute a numerical generalization error estimate for a\ngiven training sample, much less allow for a data-dependent selection of k, which must be tuned via\ncross-validation. The asymptotic expansions in [43, 37, 23, 40] likewise do not provide a computable\n\ufb01nite-sample bound. The quest for such bounds was a key motivation behind the series of works\n[18, 28, 20], of which KSU [25] is the latest development.\nThe work of Devroye et al. [14, Theorem 21.2] has implications for 1-NN classi\ufb01ers in Rd that\nare de\ufb01ned based on data-dependent majority-vote partitions of the space. It is shown that under\nsome conditions, a \ufb01xed mapping from each sample size to a data-dependent partition rule induces a\nstrongly Bayes-consistent algorithm. This result requires the partition rule to have a bounded VC\ndimension, and since this rule must be \ufb01xed in advance, the algorithm is not fully adaptive. Theorem\n19.3 ibid. proves weak consistency for an inef\ufb01cient compression-based algorithm, which selects\namong all the possible compression sets of a certain size, and maintains a certain rate of compression\nrelative to the sample size. The generalizing power of sample compression was independently\ndiscovered by [31], and later elaborated upon by [22]. 
In the context of NN classification, [14] lists various condensing heuristics (which have no known performance guarantees) and leaves open the algorithmic question of how to minimize the empirical risk over all subsets of a given size.

The first compression-based 1-NN algorithm with provable optimality guarantees was given in [20]; it was based on constructing γ-nets in spaces with a finite doubling dimension. The compression size of this construction was shown to be nearly unimprovable by an efficient algorithm unless P=NP. With γ-nets as its algorithmic engine, KSU inherits this near-optimality. The compression-based 1-NN paradigm was later extended to semimetrics in [21], where it was shown to survive violations of the triangle inequality, while the hierarchy-based search methods that have become standard for metric spaces (such as [4, 18] and related approaches) all break down.

It was shown in [27] that a margin-regularized 1-NN learner (essentially, the one proposed in [18], which, unlike [20], did not involve sample compression) becomes strongly Bayes-consistent when the margin is chosen optimally in an explicitly prescribed sample-dependent fashion. The margin-based technique developed in [18] for the binary case was extended to multiclass in [28]. Since that algorithm relied on computing a minimum vertex cover, it was not possible to make it both computationally efficient and Bayes-consistent when the number of labels exceeds two. A further improvement over [28] is that the generalization bounds presented there had an explicit (logarithmic) dependence on the number of labels, whereas our compression scheme extends seamlessly to countable label spaces.

Paper outline. After fixing the notation and setup in Sec. 2, in Sec. 3 we present KSU, the compression-based 1-NN algorithm we analyze in this work. Sec. 4 discusses our main contributions regarding KSU, together with some open problems. 
High-level proof sketches are given in Sec. 5 for the finite-dimensional case, and in Sec. 6 for the infinite-dimensional case. Full detailed proofs can be found in [26].

2 Setting and Notation

Our instance space is the metric space (X, ρ), where X is the instance domain and ρ is the metric. (See Appendix A in [26] for relevant background on metric measure spaces.) We consider a countable label space Y. The unknown sampling distribution is a probability measure μ̄ over X × Y, with marginal μ over X. Denote by (X, Y) ∼ μ̄ a pair drawn according to μ̄. The generalization error of a classifier f : X → Y is given by err_μ̄(f) := P_μ̄(Y ≠ f(X)), and its empirical error with respect to a labeled set S′ ⊆ X × Y is given by êrr(f, S′) := (1/|S′|) Σ_{(x,y)∈S′} 1[y ≠ f(x)]. The optimal Bayes risk of μ̄ is R*_μ̄ := inf err_μ̄(f), where the infimum is taken over all measurable classifiers f : X → Y. We say that μ̄ is realizable when R*_μ̄ = 0. We omit the overline in μ̄ in the sequel when there is no ambiguity.

For a finite labeled set S ⊆ X × Y and any x ∈ X, let Xnn(x, S) be the nearest neighbor of x with respect to S and let Ynn(x, S) be the nearest-neighbor label of x with respect to S:

    (Xnn(x, S), Ynn(x, S)) := argmin_{(x′,y′)∈S} ρ(x, x′),

where ties are broken arbitrarily. The 1-NN classifier induced by S is denoted by h_S(x) := Ynn(x, S). The set of points in S, denoted by X = {X_1, ..., X_{|S|}} ⊆ X, induces a Voronoi partition of X, V(X) := {V_1(X), ..., V_{|S|}(X)}, where each Voronoi cell is V_i(X) := {x ∈ X : argmin_{j∈{1,...,|S|}} ρ(x, X_j) = i}. By definition, ∀x ∈ V_i(X), h_S(x) = Y_i.

A 1-NN algorithm is a mapping from an i.i.d. labeled sample S_n ∼ μ̄^n to a labeled set S′_n ⊆ X × Y, yielding the 1-NN classifier h_{S′_n}. While the classic 1-NN algorithm sets S′_n := S_n, in this work we study a compression-based algorithm which sets S′_n adaptively, as discussed further below.

A 1-NN algorithm is strongly Bayes-consistent on μ̄ if err(h_{S′_n}) converges to R* almost surely, that is, P[lim_{n→∞} err(h_{S′_n}) = R*] = 1. An algorithm is weakly Bayes-consistent on μ̄ if err(h_{S′_n}) converges to R* in expectation, lim_{n→∞} E[err(h_{S′_n})] = R*. Obviously, the former implies the latter. We say that an algorithm is Bayes-consistent on a metric space if it is Bayes-consistent on all distributions in the metric space.

A convenient property that is used when studying the Bayes-consistency of algorithms in metric spaces is the doubling dimension. Denote the open ball of radius r around x by B_r(x) := {x′ ∈ X : ρ(x, x′) < r} and let B̄_r(x) denote the corresponding closed ball. The doubling dimension of a metric space (X, ρ) is defined as follows. Let n be the smallest number such that every ball in X can be covered by n balls of half its radius, where all balls are centered at points of X. Formally,

    n := min{n ∈ ℕ : ∀x ∈ X, r > 0, ∃x_1, ..., x_n ∈ X s.t. B_r(x) ⊆ ∪_{i=1}^{n} B_{r/2}(x_i)}.

Then the doubling dimension of (X, ρ) is defined by ddim(X, ρ) := log₂ n.

For an integer n, let [n] := {1, ..., n}. Denote the set of all index vectors of length d by I_{n,d} := [n]^d. Given a labeled set S_n = (X_i, Y_i)_{i∈[n]} and any i = {i_1, ..., i_d} ∈ I_{n,d}, denote the sub-sample of S_n indexed by i by S_n(i) := {(X_{i_1}, Y_{i_1}), ..., (X_{i_d}, Y_{i_d})}. Similarly, for a vector Y′ = {Y′_1, ..., Y′_d} ∈ Y^d, denote by S_n(i, Y′) := {(X_{i_1}, Y′_1), ..., (X_{i_d}, Y′_d)} the sub-sample of S_n as determined by i where the labels are replaced with Y′. Lastly, for i, j ∈ I_{n,d}, we denote S_n(i; j) := {(X_{i_1}, Y_{j_1}), ..., (X_{i_d}, Y_{j_d})}.

3 1-NN majority-based compression

In this work we consider the 1-NN majority-based compression algorithm proposed in [25], which we refer to as KSU. This algorithm is based on constructing γ-nets at different scales; for γ > 0 and A ⊆ X, a set X ⊆ A is said to be a γ-net of A if ∀a ∈ A, ∃x ∈ X : ρ(a, x) ≤ γ, and for all x ≠ x′ ∈ X, ρ(x, x′) > γ.³

The algorithm (see Alg. 1) operates as follows. Given an input sample S_n, whose set of points is denoted X_n = {X_1, ..., X_n}, KSU considers all possible scales γ > 0. For each such scale it constructs a γ-net of X_n. Denote this γ-net by X(γ) := {X_{i_1}, ..., X_{i_m}}, where m ≡ m(γ) denotes its size and i ≡ i(γ) := {i_1, ..., i_m} ∈ I_{n,m} denotes the indices selected from S_n for this γ-net. For every such γ-net, the algorithm attaches the labels Y′ ≡ Y′(γ) ∈ Y^m, which are the empirical majority-vote labels in the respective Voronoi cells of the partition V(X(γ)) = {V_1, ..., V_m}. Formally, for i ∈ [m],

    Y′_i ∈ argmax_{y∈Y} |{j ∈ [n] | X_j ∈ V_i, Y_j = y}|,    (1)

where ties are broken arbitrarily. This procedure creates a labeled set S′_n(γ) := S_n(i(γ), Y′(γ)) for every relevant γ ∈ {ρ(X_i, X_j) | i, j ∈ [n]} \ {0}. The algorithm then selects a single γ, denoted γ* ≡ γ*_n, and outputs h_{S′_n(γ*_n)}. The scale γ* is selected so as to minimize a generalization error bound, which upper bounds err(h_{S′_n(γ)}) with high probability. This error bound, denoted Q in the algorithm, can be derived using a compression-based analysis, as described below.

³ For technical reasons, having to do with the construction in Sec. 6, we depart slightly from the standard definition of a γ-net X ⊆ A. The classic definition requires that (i) ∀a ∈ A, ∃x ∈ X : ρ(a, x) < γ and (ii) ∀x ≠ x′ ∈ X : ρ(x, x′) ≥ γ. In our definition, the relations < and ≥ in (i) and (ii) are replaced by ≤ and >.

Algorithm 1 KSU: 1-NN compression-based algorithm
Require: Sample S_n = (X_i, Y_i)_{i∈[n]}, confidence δ
Ensure: A 1-NN classifier
1: Let Γ := {ρ(X_i, X_j) | i, j ∈ [n]} \ {0}
2: for γ ∈ Γ do
3:   Let X(γ) be a γ-net of {X_1, ..., X_n}
4:   Let m(γ) := |X(γ)|
5:   For each i ∈ [m(γ)], let Y′_i be the majority label in V_i(X(γ)) as defined in Eq. (1)
6:   Set S′_n(γ) := (X(γ), Y′(γ))
7: end for
8: Set α(γ) := êrr(h_{S′_n(γ)}, S_n)
9: Find γ*_n ∈ argmin_{γ∈Γ} Q(n, α(γ), 2m(γ), δ), where Q is, e.g., as in Eq. (2)
10: Set S′_n := S′_n(γ*_n)
11: return h_{S′_n}

We say that a mapping S_n ↦ S′_n is a compression scheme if there is a function C : ∪_{m=0}^∞ (X × Y)^m → 2^{X×Y}, from sub-samples to subsets of X × Y, such that for every S_n there exists an m and a sequence i ∈ I_{n,m} such that S′_n = C(S_n(i)). 
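To make the procedure concrete, the following is a minimal, brute-force Python sketch of Alg. 1 (not from the paper): `gamma_net` is a greedy stand-in for the net construction of [20, Algorithm 1] (it respects the (≤, >) covering/packing convention used here), and `compression_bound` instantiates a bound in the spirit of Eq. (2) with all constants set to 1, purely for illustration.

```python
import math
from collections import Counter

def gamma_net(points, gamma, dist):
    # Greedy gamma-net: every input point lies within gamma of some net point,
    # and net points are pairwise more than gamma apart.
    net = []
    for p in points:
        if all(dist(p, q) > gamma for q in net):
            net.append(p)
    return net

def nearest(x, centers, dist):
    # Index of the nearest center (ties broken by lowest index).
    return min(range(len(centers)), key=lambda i: dist(x, centers[i]))

def compression_bound(n, alpha, m, delta):
    # A bound in the spirit of Eq. (2), with all constants set to 1 (illustrative).
    if m >= n:
        return float("inf")
    t1 = n / (n - m) * alpha
    t2 = (m * math.log(n) + math.log(1.0 / delta)) / (n - m)
    t3 = math.sqrt((n * m / (n - m) * alpha * math.log(n)
                    + math.log(1.0 / delta)) / (n - m))
    return t1 + t2 + t3

def ksu(sample, dist, delta=0.05):
    # Sketch of Alg. 1: for each candidate scale gamma, build a gamma-net,
    # attach empirical majority-vote labels as in Eq. (1), and keep the net
    # whose induced 1-NN classifier minimizes the compression bound.
    xs = [x for x, _ in sample]
    n = len(sample)
    scales = sorted({dist(a, b) for a in xs for b in xs} - {0})
    best, best_q = None, float("inf")
    for g in scales:
        net = gamma_net(xs, g, dist)
        m = len(net)
        votes = [Counter() for _ in net]          # votes per Voronoi cell
        for x, y in sample:
            votes[nearest(x, net, dist)][y] += 1
        labels = [v.most_common(1)[0][0] for v in votes]
        # empirical error alpha(gamma) of the induced 1-NN classifier
        alpha = sum(labels[nearest(x, net, dist)] != y for x, y in sample) / n
        q = compression_bound(n, alpha, 2 * m, delta)
        if q < best_q:
            best, best_q = (net, labels), q
    return best
```

On a toy one-dimensional sample with `dist = |a - b|`, the selected net together with its majority-vote labels defines the output classifier via `nearest`; the real algorithm additionally benefits from the faster net constructions cited in the text.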
Given a compression scheme S_n ↦ S′_n and a matching function C, we say that a specific S′_n is an (α, m)-compression of a given S_n if S′_n = C(S_n(i)) for some i ∈ I_{n,m} and êrr(h_{S′_n}, S_n) ≤ α. The generalization power of compression was recognized by [17] and [22]. Specifically, it was shown in [21, Theorem 8] that if the mapping S_n ↦ S′_n is a compression scheme, then with probability at least 1 − δ, for any S′_n which is an (α, m)-compression of S_n ∼ μ̄^n, we have (omitting the constants, explicitly provided therein, which do not affect our analysis)

    err(h_{S′_n}) ≤ (n/(n − m))·α + O((m·log(n) + log(1/δ))/(n − m)) + O(√(((nm/(n − m))·α·log(n) + log(1/δ))/(n − m))).    (2)

Defining Q(n, α, m, δ) as the RHS of Eq. (2) provides KSU with a compression bound. The following proposition shows that KSU is a compression scheme, which enables us to use Eq. (2) with the appropriate substitution.⁴

Proposition 1. The mapping S_n ↦ S′_n defined by Alg. 1 is a compression scheme whose output S′_n is an (êrr(h_{S′_n}), 2|S′_n|)-compression of S_n.

Proof. Define the function C by C((X̄_i, Ȳ_i)_{i∈[2m]}) = (X̄_i, Ȳ_{i+m})_{i∈[m]}, and observe that for all S_n, we have S′_n = C(S_n(i(γ); j(γ))), where i(γ) is the γ-net index set as defined above, and j(γ) = {j_1, ..., j_{m(γ)}} ∈ I_{n,m(γ)} is some index vector such that Y′_i = Y_{j_i} for every i ∈ [m(γ)]. Since Y′_i is an empirical majority vote, clearly such a j exists. Under this scheme, the output S′_n of this algorithm is an (êrr(h_{S′_n}), 2|S′_n|)-compression.

KSU is efficient, for any countable Y. Indeed, Alg. 1 has a naive runtime complexity of O(n⁴), since O(n²) values of γ are considered and a γ-net is constructed for each one in time O(n²) (see [20, Algorithm 1]). Improved runtimes can be obtained, e.g., using the methods in [29, 18]. In this work we focus on the Bayes-consistency of KSU, rather than optimizing its computational complexity. Our Bayes-consistency results below hold for KSU whenever the generalization bound Q(n, α, m, δ_n) satisfies the following properties:

Property 1. For any integer n and δ ∈ (0, 1), with probability 1 − δ over the i.i.d. random sample S_n ∼ μ̄^n, for all α ∈ [0, 1] and m ∈ [n]: if S′_n is an (α, m)-compression of S_n, then err(h_{S′_n}) ≤ Q(n, α, m, δ).

Property 2. Q is monotonically increasing in α and in m.

Property 3. There is a sequence {δ_n}_{n=1}^∞, δ_n ∈ (0, 1), such that Σ_{n=1}^∞ δ_n < ∞ and, for all m, lim_{n→∞} sup_{α∈[0,1]} (Q(n, α, m, δ_n) − α) = 0.

⁴ In [25] the analysis was based on compression with side information, and does not extend to infinite Y.

The compression bound in Eq. (2) clearly satisfies these properties. Note that Property 3 is satisfied by Eq. (2) using any convergent series Σ_{n=1}^∞ δ_n < ∞ such that δ_n = e^{−o(n)}; in particular, the decay of δ_n cannot be too rapid.

4 Main results

In this section we describe our main results. The proofs appear in subsequent sections. First, we show that KSU is Bayes-consistent if the instance space has a finite doubling dimension. 
This contrasts with classical 1-NN, which is only Bayes-consistent if the distribution is realizable.

Theorem 2. Let (X, ρ) be a metric space with a finite doubling dimension. Let Q be a generalization bound that satisfies Properties 1-3, and let δ_n be as stipulated by Property 3 for Q. If the input confidence δ for input size n is set to δ_n, then the 1-NN classifier h_{S′_n(γ*_n)} calculated by KSU is strongly Bayes-consistent on (X, ρ): P(lim_{n→∞} err(h_{S′_n(γ*_n)}) = R*) = 1.

The proof, provided in Sec. 5, closely follows the line of reasoning in [27], where the strong Bayes-consistency of an adaptive margin-regularized 1-NN algorithm was proved, but with several crucial differences. In particular, the generalization bounds used by KSU are purely compression-based, as opposed to the Rademacher-based generalization bounds used in [27]. The former can be much tighter in practice and guarantee Bayes-consistency of KSU even for countably many labels. This, however, requires novel technical arguments, which are discussed in detail in Appendix B.1 in [26]. Moreover, since the compression-based bounds do not explicitly depend on ddim, they can be used even when ddim is infinite, as we do in Theorem 4 below. To underscore the subtle nature of Bayes-consistency, we note that the proof technique given here does not carry over to an earlier algorithm, suggested in [20, Theorem 4], which also uses γ-nets. It is an open question whether the latter is Bayes-consistent.

Next, we study Bayes-consistency of KSU in infinite dimensions (i.e., with ddim = ∞), in particular in a setting where k-NN was shown by [9] not to be Bayes-consistent. Indeed, a straightforward application of [9, Lemma A.1] yields the following result.

Theorem 3 (Cérou and Guyader [9]). 
There exists an infinite-dimensional separable metric space (X, ρ) and a realizable distribution μ̄ over X × {0, 1} such that no k_n-NN learner satisfying k_n/n → 0 as n → ∞ is Bayes-consistent under μ̄. In particular, this holds for any space and realizable distribution μ̄ that satisfy the following condition: the set C of points labeled 1 by μ̄ satisfies

    μ(C) > 0   and   ∀x ∈ C,  lim_{r→0} μ(C ∩ B̄_r(x)) / μ(B̄_r(x)) = 0.    (3)

Since μ(C) > 0, Eq. (3) constitutes a violation of the Besicovitch covering property. In doubling spaces, the Besicovitch covering theorem precludes such a violation [15]. In contrast, as [35, 36] show, in infinite-dimensional spaces this violation can in fact occur. Moreover, this is not an isolated pathology, as this property is shared by Gaussian Hilbert spaces [45].

At first sight, Eq. (3) might appear to thwart any 1-NN algorithm applied to such a distribution. However, the following result shows that this is not the case: KSU is Bayes-consistent on a distribution with this property.

Theorem 4. There is a metric space equipped with a realizable distribution for which KSU is weakly Bayes-consistent, while any k-NN classifier necessarily is not.

The proof relies on a classic construction of Preiss [35] which satisfies Eq. (3). We show that the structure of the construction, combined with the packing and covering properties of γ-nets, implies that the majority-vote classifier induced by any γ-net with a sufficiently small γ approaches the Bayes error. To contrast with Theorem 4, we next show that on the same construction, not all majority-vote Voronoi partitions succeed. Indeed, if the packing property of γ-nets is relaxed, partition sequences obstructing Bayes-consistency exist.

Theorem 5. 
For the example constructed in Theorem 4, there exists a sequence of Voronoi partitions with a vanishing diameter such that the induced true majority-vote classifiers are not Bayes-consistent.

The above result also stands in contrast to [14, Theorem 21.2], showing that, unlike in finite dimensions, the partitions' vanishing diameter is insufficient to establish consistency when ddim = ∞. We conclude the main results by posing intriguing open problems.

Open problem 1. Does there exist a metric probability space on which some k-NN algorithm is consistent while KSU is not? Does there exist any separable metric space on which KSU fails?

Open problem 2. Cérou and Guyader [9] distill a certain Besicovitch condition which is necessary and sufficient for k-NN to be Bayes-consistent in a metric space. Our Theorem 4 shows that the Besicovitch condition is not necessary for KSU to be Bayes-consistent. Is it sufficient? What is a necessary condition?

5 Bayes-consistency of KSU in finite dimensions

In this section we give a high-level proof of Theorem 2, showing that KSU is strongly Bayes-consistent in finite-dimensional metric spaces. A fully detailed proof is given in Appendix B in [26].

Recall the optimal empirical error α*_n ≡ α(γ*_n) and the optimal compression size m*_n ≡ m(γ*_n) computed by KSU. As shown in Proposition 1, the sub-sample S′_n(γ*_n) is an (α*_n, 2m*_n)-compression of S_n. Abbreviate the compression-based generalization bound used in KSU by

    Q_n(α, m) := Q(n, α, 2m, δ_n).

To show Bayes-consistency, we start with a standard decomposition of the excess error over the optimal Bayes risk into two terms:

    err(h_{S′_n(γ*_n)}) − R* = (err(h_{S′_n(γ*_n)}) − Q_n(α*_n, m*_n)) + (Q_n(α*_n, m*_n) − R*) =: T_I(n) + T_II(n),

and show that each term decays to zero with probability one. For the first term, Property 1 for Q, together with the Borel-Cantelli lemma, readily implies lim sup_{n→∞} T_I(n) ≤ 0 with probability one. The main challenge is showing that lim sup_{n→∞} T_II(n) ≤ 0 with probability one. We do so in several stages:

1. Loosely speaking, we first show (Lemma 10) that the Bayes error R* can be well approximated using 1-NN classifiers defined by the true (as opposed to empirical) majority-vote labels over fine partitions of X. In particular, this holds for any partition induced by a γ-net of X with a sufficiently small γ > 0. This approximation guarantee relies on the fact that in finite-dimensional spaces, the class of continuous functions with compact support is dense in L1(μ) (Lemma 9).

2. Fix γ̃ > 0 sufficiently small such that any true majority-vote classifier induced by a γ̃-net has a true error close to R*, as guaranteed by stage 1. Since for bounded subsets of finite-dimensional spaces the size of any γ-net is finite, the empirical error of any majority-vote γ-net almost surely converges to its true majority-vote error as the sample size n → ∞. Let n(γ̃) be sufficiently large such that Q_{n(γ̃)}(α(γ̃), m(γ̃)), as computed by KSU for a sample of size n(γ̃), is a reliable estimate for the true error of h_{S′_{n(γ̃)}(γ̃)}.

3. Let γ̃ and n(γ̃) be as in stage 2. Given a sample of size n = n(γ̃), recall that KSU selects an optimal γ* such that Q_n(α(γ), m(γ)) is minimized over all γ > 0. 
For margins γ ≪ γ̃, which are prone to over-fitting, Qn(α(γ), m(γ)) is not a reliable estimate for hS′n(γ), since compression may not yet have taken place for samples of size n. Nevertheless, these margins are discarded by KSU due to the penalty term in Q. On the other hand, for γ-nets with margin γ ≫ γ̃, which are prone to under-fitting, the true error is well estimated by Qn(α(γ), m(γ)). It follows that KSU selects γ∗n ≈ γ̃ and Qn(α∗n, m∗n) ≈ R∗, implying lim supn→∞ TII(n) ≤ 0 with probability one.

As one can see, the assumption that X is finite-dimensional plays a major role in the proof. A simple argument shows that the family of continuous functions with compact support is no longer dense in L1 in infinite-dimensional spaces. In addition, γ-nets of bounded subsets in infinite-dimensional spaces need no longer be finite.

6 On Bayes-consistency of NN algorithms in infinite dimensions

In this section we study the Bayes-consistency properties of 1-NN algorithms on a classic infinite-dimensional construction of Preiss [35], which we describe below in detail.

Figure 1: Preiss's construction. Encircled is the closed ball B̄γk−1(z) for some z ∈ C.

This construction was first introduced as a concrete example showing that in infinite-dimensional spaces the Besicovitch covering theorem [15] can be strongly violated, as manifested in Eq. (3).

Example 1 (Preiss's construction).
The construction (see Figure 1) defines an infinite-dimensional metric space (X, ρ) and a realizable measure µ̄ over X × Y with the binary label set Y = {0, 1}. It relies on two sequences: a sequence of natural numbers {Nk}k∈N and a sequence of positive numbers {ak}k∈N. The two sequences should satisfy the following:

∑∞k=1 akN1⋯Nk = 1;  limk→∞ akN1⋯Nk+1 = ∞;  and  limk→∞ Nk = ∞.  (4)

These properties are satisfied, for instance, by setting Nk := k! and ak := 2−k/∏i∈[k] Ni. Let Z0 be the set of all finite sequences (z1, . . . , zk), k ∈ N, of natural numbers such that zi ≤ Ni, and let Z∞ be the set of all infinite sequences (z1, z2, . . . ) of natural numbers such that zi ≤ Ni.

Define the example space X := Z0 ∪ Z∞ and denote γk := 2−k, where γ∞ := 0. The metric ρ over X is defined as follows: for x, y ∈ X, denote by x ∧ y their longest common prefix. Then,

ρ(x, y) = (γ|x∧y| − γ|x|) + (γ|x∧y| − γ|y|).

It can be shown (see [35]) that ρ is a metric; in fact, it embeds isometrically into the square-norm metric of a Hilbert space.

To define µ, the marginal measure over X, let ν∞ be the uniform product distribution measure over Z∞, that is: for all i ∈ N, each zi in the sequence z = (z1, z2, . . . ) ∈ Z∞ is independently drawn from a uniform distribution over [Ni]. Let ν0 be an atomic measure on Z0 such that for all z ∈ Z0, ν0(z) = a|z|. Clearly, the first condition in Eq. (4) implies ν0(Z0) = 1.
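The sequences and metric just defined can be checked numerically. The following sketch (a minimal illustration, not the authors' code) verifies the conditions of Eq. (4) for the suggested choice Nk := k! and ak := 2−k/∏i∈[k] Ni, implements ρ on finite sequences, and spot-checks the triangle inequality:

```python
import math

# Example 1's suggested sequences: N_k = k!, a_k = 2^{-k} / (N_1 ... N_k).
def N(k):
    return math.factorial(k)

# Condition (4), checked up to a finite truncation:
# a_k * N_1...N_k = 2^{-k}, so the series sums to 1;
# a_k * N_1...N_{k+1} = 2^{-k} * (k+1)! grows without bound; N_k -> infinity.
total = sum(2.0 ** -k for k in range(1, 60))            # close to 1
growth = [2.0 ** -k * N(k + 1) for k in range(1, 10)]   # strictly increasing

# The metric rho over finite sequences in Z_0 (gamma_k = 2^{-k}):
def rho(x, y):
    lcp = 0                      # length of the longest common prefix
    for xi, yi in zip(x, y):
        if xi != yi:
            break
        lcp += 1
    gamma = lambda k: 2.0 ** -k
    return (gamma(lcp) - gamma(len(x))) + (gamma(lcp) - gamma(len(y)))

# Spot-check the triangle inequality on a few points of Z_0
# (coordinates respect z_i <= N_i: N_1 = 1, N_2 = 2, N_3 = 6).
pts = [(1,), (1, 1), (1, 2), (1, 2, 3), (1, 1, 4)]
triangle_ok = all(rho(x, z) <= rho(x, y) + rho(y, z) + 1e-12
                  for x in pts for y in pts for z in pts)
```

The first condition holds because akN1⋯Nk = 2−k, and the second because 2−k(k + 1)! diverges, exactly as the text asserts.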
De\ufb01ne the marginal\nprobability measure \u00b5 over X by\n\n\u2200A \u2286 Z0 \u222a Z\u221e, \u00b5(A) := \u03b1\u03bd\u221e(A) + (1 \u2212 \u03b1)\u03bd0(A).\n\nIn words, an in\ufb01nite sequence is drawn with probability \u03b1 (and all such sequences are equally likely),\nor else a \ufb01nite sequence is drawn (and all \ufb01nite sequences of the same length are equally likely).\nDe\ufb01ne the realizable distribution \u00af\u00b5 over X \u00d7 Y by setting the marginal over X to \u00b5, and by setting\nthe label of z \u2208 Z\u221e to be 1 with probability 1 and the label of z \u2208 Z0 to be 0 with probability 1.\nAs shown in [35], this construction satis\ufb01es Eq. (3) with C = Z\u221e and \u00b5(C) = \u03b1 > 0. It follows\nfrom Theorem 3 that no k-NN algorithm is Bayes-consistent on it. In contrast, the following theorem\nshows that KSU is weakly Bayes-consistent on this distribution. Theorem 4 immediately follows\nfrom the this result.\nTheorem 6. Assume (X , \u03c1),Y and \u00af\u00b5 as in Example 1. KSU is weakly Bayes-consistent on \u00af\u00b5.\nThe proof, provided in Appendix C in [26], \ufb01rst characterizes the Voronoi cells for which the true\nmajority-vote yields a signi\ufb01cant error for the cell (Lemma 15). In \ufb01nite-dimensional spaces, the total\nmeasure of all such \u201cbad\u201d cells can be made arbitrarily close to zero by taking \u03b3 to be suf\ufb01ciently\nsmall, as shown in Lemma 10 of Theorem 2. However, it is not immediately clear whether this can\nbe achieved for the in\ufb01nite dimensional construction above.\nIndeed, we expect such bad cells, due to the unintuitive property that for any x \u2208 C, we have\n\u00b5( \u00afB\u03b3(x) \u2229 C)/\u00b5( \u00afB\u03b3(x)) \u2192 0 when \u03b3 \u2192 0, and yet \u00b5(C) > 0. 
Thus, if, for example, a significant portion of the set C (whose label is 1) is covered by Voronoi cells of the form V = B̄γ(x) with x ∈ C, then for all sufficiently small γ, each one of these cells will have a true majority-vote label of 0, and a significant portion of C would be misclassified. However, we show that by the structure of the construction, combined with the packing and covering properties of γ-nets, the total measure of all these "bad" cells in any γ-net goes to 0 as γ → 0, thus yielding a consistent classifier.

Lastly, the following theorem shows that on the same construction, when the Voronoi partitions are allowed to violate the packing property of γ-nets, Bayes-consistency does not necessarily hold. Theorem 5 immediately follows from the following result.

Theorem 7. Assume (X, ρ), Y and µ̄ as in Example 1. There exists a sequence of Voronoi partitions (Pk)k∈N of X with maxV∈Pk diam(V) ≤ γk such that the sequence of true majority-vote classifiers (hPk)k∈N induced by these partitions is not Bayes-consistent: lim infk→∞ err(hPk) = α > 0.

The proof, provided in Appendix D, constructs a sequence of Voronoi partitions, where each partition Pk has all of its impure Voronoi cells (those containing both 0 and 1 labels) being bad. In this case, C is incorrectly classified by hPk, yielding a significant error. Thus, in infinite-dimensional metric spaces, the shape of the Voronoi cells plays a fundamental role in the consistency of the partition.

Acknowledgments. We thank Frédéric Cérou for the numerous fruitful discussions and helpful feedback on an earlier draft. Aryeh Kontorovich was supported in part by the Israel Science Foundation (grant No. 755/15), Paypal and IBM. Sivan Sabato was supported in part by the Israel Science Foundation (grant No.
555/15).

References

[1] Christophe Abraham, Gérard Biau, and Benoît Cadre. On the kernel rule for function classification. Ann. Inst. Statist. Math., 58(3):619–633, 2006.

[2] Daniel Berend and Aryeh Kontorovich. The missing mass problem. Statistics & Probability Letters, 82(6):1102–1110, 2012.

[3] Daniel Berend and Aryeh Kontorovich. On the concentration of the missing mass. Electronic Communications in Probability, 18(3):1–7, 2013.

[4] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 97–104, New York, NY, USA, 2006. ACM.

[5] Gérard Biau, Florentina Bunea, and Marten H. Wegkamp. Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory, 51(6):2163–2172, 2005.

[6] Gérard Biau, Frédéric Cérou, and Arnaud Guyader. Rates of convergence of the functional k-nearest neighbor estimate. IEEE Trans. Inform. Theory, 56(4):2034–2040, 2010.

[7] V. I. Bogachev. Measure Theory. Vol. I, II. Springer-Verlag, Berlin, 2007.

[8] Oren Boiman, Eli Shechtman, and Michal Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.

[9] Frédéric Cérou and Arnaud Guyader. Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics, 10:340–355, 2006.

[10] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In NIPS, 2014.

[11] Thomas M. Cover and Peter E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.

[12] Luc Devroye. On the inequality of Cover and Hart in nearest neighbor discrimination. IEEE Trans. Pattern Anal. Mach. Intell., 3(1):75–78, 1981.

[13] Luc Devroye and László Györfi. Nonparametric Density Estimation: The L1 View. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons, Inc., New York, 1985.

[14] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

[15] Herbert Federer. Geometric Measure Theory. Die Grundlehren der mathematischen Wissenschaften, Band 153. Springer-Verlag New York Inc., New York, 1969.

[16] Evelyn Fix and J. L. Hodges, Jr. Discriminatory analysis. Nonparametric discrimination: consistency properties. International Statistical Review / Revue Internationale de Statistique, 57(3):238–247, 1989.

[17] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik–Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.

[18] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.

[19] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105–118, 2016.

[20] Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Near-optimal sample compression for nearest neighbors. In Neural Information Processing Systems (NIPS), 2014.

[21] Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics (extended abstract AISTATS 2016). Journal of Machine Learning Research, 2017.

[22] Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning, 59(1):55–76, 2005.

[23] Peter Hall and Kee-Hoon Kang. Bandwidth choice for nonparametric classification. Ann. Statist., 33(1):284–306, 2005.

[24] Olav Kallenberg. Foundations of Modern Probability. Second edition. Probability and its Applications. Springer-Verlag, 2002.

[25] Aryeh Kontorovich, Sivan Sabato, and Ruth Urner. Active nearest-neighbor learning in metric spaces. In Advances in Neural Information Processing Systems, pages 856–864, 2016.

[26] Aryeh Kontorovich, Sivan Sabato, and Roi Weiss. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. CoRR, abs/1705.08184, 2017.

[27] Aryeh Kontorovich and Roi Weiss. A Bayes consistent 1-NN classifier. In Artificial Intelligence and Statistics (AISTATS 2015), 2014.

[28] Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In International Conference on Machine Learning (ICML 2014), 2014.

[29] Robert Krauthgamer and James R. Lee. Navigating nets: Simple algorithms for proximity search. In 15th Annual ACM–SIAM Symposium on Discrete Algorithms, pages 791–801, January 2004.

[30] Sanjeev R. Kulkarni and Steven E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Trans. Inform. Theory, 41(4):1028–1039, 1995.

[31] Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Unpublished, 1986.

[32] James R. Munkres. Topology: A First Course. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1975.

[33] Vladimir Pestov. On the geometry of similarity search: dimensionality curse and concentration of measure. Inform. Process. Lett., 73(1-2):47–51, 2000.

[34] Vladimir Pestov. Is the k-NN classifier in high dimensions affected by the curse of dimensionality? Comput. Math. Appl., 65(10):1427–1437, 2013.

[35] David Preiss. Invalid Vitali theorems. Abstracta. 7th Winter School on Abstract Analysis, pages 58–60, 1979.

[36] David Preiss. Gaussian measures and the density theorem. Comment. Math. Univ. Carolin., 22(1):181–193, 1981.

[37] Demetri Psaltis, Robert R. Snapp, and Santosh S. Venkatesh. On the finite sample performance of the nearest neighbor classifier. IEEE Transactions on Information Theory, 40(3):820–837, 1994.

[38] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill Book Co., New York, third edition, 1976. International Series in Pure and Applied Mathematics.

[39] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1987.

[40] Richard J. Samworth. Optimal weighted nearest neighbour classifiers. Ann. Statist., 40(5):2733–2763, 2012.

[41] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[42] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[43] Robert R. Snapp and Santosh S. Venkatesh. Asymptotic expansions of the k nearest neighbor risk. Ann. Statist., 26(3):850–878, 1998.

[44] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

[45] Jaroslav Tišer. Vitali covering theorem in Hilbert space. Trans. Amer. Math. Soc., 355(8):3277–3289, 2003.

[46] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.

[47] Lin Cheng Zhao. Exponential bounds of mean error for the nearest neighbor estimates of regression functions. J. Multivariate Anal., 21(1):168–178, 1987.