{"title": "Rates of Convergence for Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 3437, "page_last": 3445, "abstract": "We analyze the behavior of nearest neighbor classification in metric spaces and provide finite-sample, distribution-dependent rates of convergence under minimal assumptions. These are more general than existing bounds, and enable us, as a by-product, to establish the universal consistency of nearest neighbor in a broader range of data spaces than was previously known. We illustrate our upper and lower bounds by introducing a new smoothness class customized for nearest neighbor classification. We find, for instance, that under the Tsybakov margin condition the convergence rate of nearest neighbor matches recently established lower bounds for nonparametric classification.", "full_text": "Rates of convergence for nearest neighbor\n\nclassi\ufb01cation\n\nKamalika Chaudhuri\n\nComputer Science and Engineering\nUniversity of California, San Diego\n\nkamalika@cs.ucsd.edu\n\nSanjoy Dasgupta\n\nComputer Science and Engineering\nUniversity of California, San Diego\n\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nWe analyze the behavior of nearest neighbor classi\ufb01cation in metric spaces and\nprovide \ufb01nite-sample, distribution-dependent rates of convergence under minimal\nassumptions. These are more general than existing bounds, and enable us, as a\nby-product, to establish the universal consistency of nearest neighbor in a broader\nrange of data spaces than was previously known. We illustrate our upper and lower\nbounds by introducing a new smoothness class customized for nearest neighbor\nclassi\ufb01cation. We \ufb01nd, for instance, that under the Tsybakov margin condition the\nconvergence rate of nearest neighbor matches recently established lower bounds\nfor nonparametric classi\ufb01cation.\n\n1\n\nIntroduction\n\nIn this paper, we deal with binary prediction in metric spaces. 
A classi\ufb01cation problem is de\ufb01ned\nby a metric space (X , \u03c1) from which instances are drawn, a space of possible labels Y = {0, 1},\nand a distribution P over X \u00d7 Y. The goal is to \ufb01nd a function h : X \u2192 Y that minimizes the\nprobability of error on pairs (X, Y ) drawn from P; this error rate is the risk R(h) = P(h(X) (cid:54)= Y ).\nThe best such function is easy to specify: if we let \u00b5 denote the marginal distribution of X and \u03b7\nthe conditional probability \u03b7(x) = P(Y = 1|X = x), then the predictor 1(\u03b7(x) \u2265 1/2) achieves\nthe minimum possible risk, R\u2217 = EX [min(\u03b7(X), 1\u2212 \u03b7(X))]. The trouble is that P is unknown and\nthus a prediction rule must instead be based only on a \ufb01nite sample of points (X1, Y1), . . . , (Xn, Yn)\ndrawn independently at random from P.\nNearest neighbor (NN) classi\ufb01ers are among the simplest prediction rules. The 1-NN classi\ufb01er\nassigns each point x \u2208 X the label Yi of the closest point in X1, . . . , Xn (breaking ties arbitrarily,\nsay). For a positive integer k, the k-NN classi\ufb01er assigns x the majority label of the k closest points\nin X1, . . . , Xn. In the latter case, it is common to let k grow with n, in which case the sequence\n(kn : n \u2265 1) de\ufb01nes a kn-NN classi\ufb01er.\nThe asymptotic consistency of nearest neighbor classi\ufb01cation has been studied in detail, starting\nwith the work of Fix and Hodges [7]. The risk of the NN classi\ufb01er, henceforth denoted Rn, is a\nrandom variable that depends on the data set (X1, Y1), . . . , (Xn, Yn); the usual order of business is\nto \ufb01rst determine the limiting behavior of the expected value ERn and to then study stronger modes\nof convergence of Rn. Cover and Hart [2] studied the asymptotics of ERn in general metric spaces,\nunder the assumption that every x in the support of \u00b5 is either a continuity point of \u03b7 or has \u00b5({x}) >\n0. 
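The 1-NN and k-NN rules just defined are simple enough to state in a few lines of code. The following sketch is our own illustration (the function name and the Euclidean default metric are ours, not the paper's):

```python
import math

def knn_classify(query, points, labels, k, metric=math.dist):
    """k-NN prediction: majority vote over the k nearest training points,
    with distance ties broken by preferring earlier points in the sequence."""
    order = sorted(range(len(points)),
                   key=lambda i: (metric(query, points[i]), i))
    votes = sum(labels[i] for i in order[:k])
    # Predict 1 when at least half of the k nearest labels are 1 (k = 1 gives 1-NN).
    return int(votes >= k / 2)
```

Breaking distance ties by sample order matches the convention adopted in the formal definition later in the paper.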
For the 1-NN classifier, they found that ERn → EX[2η(X)(1 − η(X))] ≤ 2R*(1 − R*); for kn-NN with kn ↑ ∞ and kn/n ↓ 0, they found ERn → R*. For points in Euclidean space, a series of results starting with Stone [15] established consistency without any distributional assumptions. For kn-NN in particular, Rn → R* almost surely [5].\nThese consistency results place nearest neighbor methods in a favored category of nonparametric estimators. But for a fuller understanding it is important to also have rates of convergence. For instance, part of the beauty of nearest neighbor is that it appears to adapt automatically to different distance scales in different regions of space. It would be helpful to have bounds that encapsulate this property.\nRates of convergence are also important in extending nearest neighbor classification to settings such as active learning, semisupervised learning, and domain adaptation, in which the training data is not a fully-labeled data set obtained by i.i.d. sampling from the future test distribution. For instance, in active learning, the starting point is a set of unlabeled points X1, . . . , Xn, and the learner requests the labels of just a few of these, chosen adaptively to be as informative as possible about η. There are many natural schemes for deciding which points to label: for instance, one could repeatedly pick the point furthest away from the labeled points so far, or one could pick the point whose k nearest labeled neighbors have the largest disagreement among their labels. The asymptotics of such selective sampling schemes have been considered in earlier work [4], but ultimately the choice of scheme must depend upon finite-sample behavior. 
The starting point for understanding this behavior is to first obtain a characterization in the non-active setting.\n\n1.1 Previous work on rates of convergence\n\nThere is a large body of work on convergence rates of nearest neighbor estimators. Here we outline some of the types of results that have been obtained, and give representative sources for each.\nThe earliest rates of convergence for nearest neighbor were distribution-free. Cover [3] studied the 1-NN classifier in the case X = R, under the assumption of class-conditional densities with uniformly-bounded third derivatives. He showed that ERn converges at a rate of O(1/n^2). Wagner [18] and later Fritz [8] also looked at 1-NN, but in higher dimension X = R^d. The latter obtained an asymptotic rate of convergence for Rn under the milder assumption of non-atomic µ and lower semi-continuous class-conditional densities.\nDistribution-free results are valuable, but do not characterize which properties of a distribution most influence the performance of nearest neighbor classification. More recent work has investigated different approaches to obtaining distribution-dependent bounds, in terms of the smoothness of the distribution.\nA simple and popular smoothness parameter is the Hölder constant. Kulkarni and Posner [12] obtained a fairly general result of this kind for 1-NN and kn-NN. They assumed that for some constants K and α, and for all x1, x2 ∈ X,\n\n|η(x1) − η(x2)| ≤ K ρ(x1, x2)^{2α}.\n\nThey then gave bounds in terms of the Hölder parameter α as well as covering numbers for the marginal distribution µ. 
Györfi [9] looked at the case X = R^d, under the weaker assumption that for some function K : R^d → R and some α, and for all z ∈ R^d and all r > 0,\n\n|η(z) − (1/µ(B(z, r))) ∫_{B(z,r)} η(x) µ(dx)| ≤ K(z) r^α.\n\nThe integral denotes the average η value in a ball of radius r centered at z; hence, this α is similar in spirit to the earlier Hölder parameter, but does not require η to be continuous. Györfi obtained asymptotic rates in terms of α. Another generalization of standard smoothness conditions was proposed recently [17] in a "probabilistic Lipschitz" assumption, and in this setting rates were obtained for NN classification in bounded spaces X ⊂ R^d.\nThe literature leaves open several basic questions that have motivated the present paper. (1) Is it possible to give tight finite-sample bounds for NN classification in metric spaces, without any smoothness assumptions? What aspects of the distribution must be captured in such bounds? (2) Are there simple notions of smoothness that are especially well-suited to nearest neighbor? Roughly speaking, we consider a notion suitable if it is possible to sharply characterize the convergence rate of nearest neighbor for all distributions satisfying this notion. As we discuss further below, the Hölder constant is lacking in this regard. (3) A recent trend in nonparametric classification has been to study rates of convergence under "margin conditions" such as that of Tsybakov. The best achievable rates under these conditions are now known: does nearest neighbor achieve these rates?\n\nFigure 1: One-dimensional distributions. 
In each case, the class-conditional densities are shown.\n\n1.2 Some illustrative examples\n\nWe now look at a couple of examples to get a sense of what properties of a distribution most critically\naffect the convergence rate of nearest neighbor. In each case, we study the k-NN classi\ufb01er.\nTo start with, consider a distribution over X = R in which the two classes (Y = 0, 1) have class-\nconditional densities \u00b50 and \u00b51. Assume that these two distributions have disjoint support, as on the\nleft side of Figure 1. The k-NN classi\ufb01er will make a mistake on a speci\ufb01c query x only if x is near\nthe boundary between the two classes. To be precise, consider an interval around x of probability\nmass k/n, that is, an interval B = [x\u2212r, x+r] with \u00b5(B) = k/n. Then the k nearest neighbors will\nlie roughly in this interval, and there will likely be an error only if the interval contains a substantial\nportion of the wrong class. Whether or not \u03b7 is smooth, or the \u00b5i are smooth, is irrelevant.\nIn a general metric space, the k nearest neighbors of any query point x are likely to lie in a ball\ncentered at x of probability mass roughly k/n. Thus the central objects in analyzing k-NN are balls\nof mass \u2248 k/n near the decision boundary, and it should be possible to give rates of convergence\nsolely in terms of these.\nNow let\u2019s turn to notions of smoothness. Figure 1, right, shows a variant of the previous example\nin which it is no longer the case that \u03b7 \u2208 {0, 1}. Although one of the class-conditional densities in\nthe \ufb01gure is highly non-smooth, this erratic behavior occurs far from the decision boundary and thus\ndoes not affect nearest neighbor performance. And in the vicinity of the boundary, what matters is\nnot how much \u03b7 varies within intervals of any given radius r, but rather within intervals of probability\nmass k/n. 
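The "ball of probability mass k/n" picture can be made concrete with a short computation (our own illustration, not code from the paper): the distance from a query x to its k-th nearest sample point is exactly the radius of the smallest ball around x containing empirical mass k/n.

```python
def knn_radius(x, sample, k):
    """Distance from x to its k-th nearest sample point: the radius of the
    smallest ball around x that contains empirical probability mass k/n."""
    return sorted(abs(s - x) for s in sample)[k - 1]

# 100 evenly spaced points on [0, 1); the 10-NN ball around x = 0.503
# captures exactly k/n = 10/100 of the sample.
sample = [i / 100 for i in range(100)]
r = knn_radius(0.503, sample, k=10)
mass = sum(abs(s - 0.503) <= r for s in sample) / len(sample)  # = 0.1
```

This is the empirical analogue of the probability-radius r_p(x) defined formally in Section 2.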
Smoothness notions such as Lipschitz and Hölder constants, which measure changes in η with respect to x, are therefore not entirely suitable: what we need to measure are changes in η with respect to the underlying marginal µ on X.\n\n1.3 Results of this paper\n\nLet us return to our earlier setting of pairs (X, Y), where X takes values in a metric space (X, ρ) and has distribution µ, while Y ∈ {0, 1} has conditional probability function η(x) = Pr(Y = 1|X = x). We obtain rates of convergence for k-NN by attempting to make precise the intuitions discussed above. This leads to a somewhat different style of analysis than has been used in earlier work.\nOur main result is an upper bound on the misclassification rate of k-NN that holds for any sample size n and for any metric space, with no distributional assumptions. The bound depends on a novel notion of the effective boundary for k-NN; for the moment, denote this set by An,k ⊂ X.\n\n• We show that with high probability over the training data, the misclassification rate of the k-NN classifier (with respect to the Bayes-optimal classifier) is bounded above by µ(An,k) plus a small additional term that can be made arbitrarily small (Theorem 5).\n\n• We lower-bound the misclassification rate using a related notion of effective boundary (Theorem 6).\n\n• We identify a general condition under which, as n and k grow, An,k approaches the actual decision boundary {x | η(x) = 1/2}. This yields universal consistency in a wider range of metric spaces than just R^d (Theorem 1), thus broadening our understanding of the asymptotics of nearest neighbor.\n\nWe then specialize our generalization bounds to smooth distributions.\n\n• We introduce a novel smoothness condition that is tailored to nearest neighbor. We compare our upper and lower bounds under this kind of smoothness (Theorem 3).\n\n• We obtain risk bounds under the margin condition of Tsybakov that match the best known results for nonparametric classification (Theorem 4).\n\n• We look at additional specific cases of interest: when η is bounded away from 1/2, and the even more extreme scenario where η ∈ {0, 1} (zero Bayes risk).\n\n2 Definitions and results\n\nLet (X, ρ) be any separable metric space. For any x ∈ X, let\n\nB°(x, r) = {x′ ∈ X | ρ(x, x′) < r} and B(x, r) = {x′ ∈ X | ρ(x, x′) ≤ r}\n\ndenote the open and closed balls, respectively, of radius r centered at x.\nLet µ be a Borel regular probability measure on this space (that is, open sets are measurable, and every set is contained in a Borel set of the same measure) from which instances X are drawn. The label of an instance X = x is Y ∈ {0, 1} and is distributed according to the measurable conditional probability function η : X → [0, 1] as follows: Pr(Y = 1|X = x) = η(x).\nGiven a data set S = ((X1, Y1), . . . , (Xn, Yn)) and a query point x ∈ X, we use the notation X^(i)(x) to denote the i-th nearest neighbor of x in the data set, and Y^(i)(x) to denote its label. Distances are calculated with respect to the given metric ρ, and ties are broken by preferring points earlier in the sequence. The k-NN classifier is defined by\n\ngn,k(x) = 1 if Y^(1)(x) + ··· + Y^(k)(x) ≥ k/2, and gn,k(x) = 0 otherwise.\n\nWe analyze the performance of gn,k by comparing it with g(x) = 1(η(x) ≥ 1/2), the omniscient Bayes-optimal classifier. Specifically, we obtain bounds on PrX(gn,k(X) ≠ g(X)) that hold with high probability over the choice of data S, for any n. It is worth noting that convergence results for nearest neighbor have traditionally studied the excess risk Rn,k − R*, where Rn,k = Pr(Y ≠ gn,k(X)). If we define the pointwise quantities
Rn,k(x) = Pr(Y ≠ gn,k(x) | X = x) and R*(x) = min(η(x), 1 − η(x)) for all x ∈ X, we see that\n\nRn,k(x) − R*(x) = |1 − 2η(x)| · 1(gn,k(x) ≠ g(x)). (1)\n\nTaking expectation over X, we then have Rn,k − R* ≤ PrX(gn,k(X) ≠ g(X)), and so we also obtain upper bounds on the excess risk.\nThe technical core of this paper is the finite-sample generalization bound of Theorem 5. We begin, however, by discussing some of its implications since these relate directly to common lines of inquiry in the statistical literature. All proofs appear in the appendix.\n\n2.1 Universal consistency\n\nA series of results, starting with [15], has shown that kn-NN is strongly consistent (Rn = Rn,kn → R* almost surely) when X is a finite-dimensional Euclidean space and µ is a Borel measure. A consequence of the bounds we obtain in Theorem 5 is that this phenomenon holds quite a bit more generally. In fact, strong consistency holds in any metric measure space (X, ρ, µ) for which the Lebesgue differentiation theorem is true: that is, spaces in which, for any bounded measurable f,\n\nlim_{r↓0} (1/µ(B(x, r))) ∫_{B(x,r)} f dµ = f(x) (2)\n\nfor almost all (µ-a.e.) x ∈ X.\nFor more details on this differentiation property, see [6, 2.9.8] and [10, 1.13]. It holds, for instance:\n\n• When (X, ρ) is a finite-dimensional normed space [10, 1.15(a)].\n\n• When (X, ρ, µ) is doubling [10, 1.8], that is, when there exists a constant C(µ) such that µ(B(x, 2r)) ≤ C(µ)µ(B(x, r)) for every ball B(x, r).\n\n• When µ is an atomic measure on X.\n\nFor the following theorem, recall that the risk of the kn-NN classifier, Rn = Rn,kn, is a function of the data set (X1, Y1), . . . , (Xn, Yn).\nTheorem 1. 
Suppose metric measure space (X, ρ, µ) satisfies differentiation condition (2). Pick a sequence of positive integers (kn), and for each n, let Rn = Rn,kn be the risk of the kn-NN classifier gn,kn.\n\n1. If kn → ∞ and kn/n → 0, then for all ε > 0, lim_{n→∞} Prn(Rn − R* > ε) = 0. Here Prn denotes probability over the data set (X1, Y1), . . . , (Xn, Yn).\n\n2. If in addition kn/(log n) → ∞, then Rn → R* almost surely.\n\n2.2 Smooth measures\n\nBefore stating our finite-sample bounds in full generality, we provide a glimpse of them under smooth probability distributions. We begin with a few definitions.\n\nThe support of µ. The support of distribution µ is defined as\n\nsupp(µ) = {x ∈ X | µ(B(x, r)) > 0 for all r > 0}.\n\nIt was shown by [2] that in separable metric spaces, µ(supp(µ)) = 1. For the interested reader, we reproduce their brief proof in the appendix (Lemma 24).\n\nThe conditional probability function for a set. The conditional probability function η is defined for points x ∈ X, and can be extended to measurable sets A ⊂ X with µ(A) > 0 as follows:\n\nη(A) = (1/µ(A)) ∫_A η dµ. (3)\n\nThis is the probability that Y = 1 for a point X chosen at random from the distribution µ restricted to set A. We exclusively consider sets A of the form B(x, r), in which case η is defined whenever x ∈ supp(µ).\n\n2.2.1 Smoothness with respect to the marginal distribution\n\nFor the purposes of nearest neighbor, it makes sense to define a notion of smoothness with respect to the marginal distribution on instances. 
For α, L > 0, we say the conditional probability function η is (α, L)-smooth in metric measure space (X, ρ, µ) if for all x ∈ supp(µ) and all r > 0,\n\n|η(B(x, r)) − η(x)| ≤ L µ(B°(x, r))^α.\n\n(As might be expected, we only need to apply this condition locally, so it is enough to restrict attention to balls of probability mass up to some constant p_o.) One feature of this notion is that it is scale-invariant: multiplying all distances by a fixed amount leaves α and L unchanged. Likewise, if the distribution has several well-separated clusters, smoothness is unaffected by the distance-scales of the individual clusters.\nIt is common to analyze nonparametric classifiers under the assumption that X = R^d and that η is α_H-Hölder continuous for some α_H > 0, that is,\n\n|η(x) − η(x′)| ≤ L∥x − x′∥^{α_H}\n\nfor some constant L. These bounds typically also require µ to have a density that is uniformly bounded (above and/or below). We now relate these standard assumptions to our notion of smoothness.\n\nLemma 2. Suppose that X ⊂ R^d, and η is α_H-Hölder continuous, and µ has a density with respect to Lebesgue measure that is ≥ µ_min on X. Then there is a constant L such that for any x ∈ supp(µ) and r > 0 with B(x, r) ⊂ X, we have |η(x) − η(B(x, r))| ≤ L µ(B°(x, r))^{α_H/d}.\n(To remove the requirement that B(x, r) ⊂ X, we would need the boundary of X to be well-behaved, for instance by requiring that X contains a constant fraction of every ball centered in it. This is a familiar assumption in nonparametric classification, including the seminal work of [1] that we discuss shortly.)\nOur smoothness condition for nearest neighbor problems can thus be seen as a generalization of the usual Hölder conditions. 
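As a sanity check on this definition, here is a small numerical sketch of ours (the choice of η and µ is illustrative, not from the paper): take µ uniform on [0, 1] and η(x) = x², which is Lipschitz (α_H = 1, d = 1). In the spirit of Lemma 2, η should then be (1, L)-smooth over interior balls for a modest constant L; averaging η over balls confirms this with L = 1/3.

```python
def eta(x):
    # Illustrative conditional probability: η(x) = x², Lipschitz on [0, 1].
    return x * x

def eta_ball(x, r, m=2001):
    """Numerical average of η over B(x, r) ∩ [0, 1] under the uniform marginal."""
    lo, hi = max(0.0, x - r), min(1.0, x + r)
    return sum(eta(lo + (hi - lo) * i / (m - 1)) for i in range(m)) / m

# Check |η(B(x, r)) − η(x)| ≤ L · µ(B(x, r))^α with α = 1 and L = 1/3,
# over interior balls B(x, r) ⊂ [0, 1] (the setting of Lemma 2).
smooth = all(
    abs(eta_ball(x, r) - eta(x)) <= (1 / 3) * (2 * r)
    for x in (0.2, 0.35, 0.5, 0.65, 0.8)
    for r in (0.01, 0.05, 0.1, 0.2)
)
```

For an interior ball the exact ball average is x² + r²/3, so the deviation r²/3 is comfortably below the bound 2r/3; balls that hit the boundary of [0, 1] are excluded, matching the caveat after Lemma 2.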
It applies in a broader range of settings, for example for discrete µ.\n\n2.2.2 Generalization bounds for smooth measures\n\nUnder smoothness, our general finite-sample convergence rates (Theorems 5 and 6) take on an easily interpretable form. Recall that gn,k(x) is the k-NN classifier, while g(x) is the Bayes-optimal prediction.\n\nTheorem 3. Suppose η is (α, L)-smooth in (X, ρ, µ). The following hold for any n and k.\n(Upper bound on misclassification rate.) Pick any δ > 0 and suppose that k ≥ 16 ln(2/δ). Then\n\nPrX(gn,k(X) ≠ g(X)) ≤ δ + µ({x ∈ X : |η(x) − 1/2| ≤ √((1/k) ln(2/δ)) + L(2k/n)^α}).\n\n(Lower bound on misclassification rate.) Conversely, there is an absolute constant c_o such that\n\nEn PrX(gn,k(X) ≠ g(X)) ≥ c_o µ({x ∈ X : η(x) ≠ 1/2, |η(x) − 1/2| ≤ 1/√k − L(2k/n)^α}).\n\nHere En is expectation over the data set.\nThe optimal choice of k is ∼ n^{2α/(2α+1)}, and with this setting the upper and lower bounds are directly comparable: they are both of the form µ({x : |η(x) − 1/2| ≤ Õ(k^{−1/2})}), the probability mass of a band of points around the decision boundary η = 1/2.\nIt is noteworthy that these upper and lower bounds have a pleasing resemblance for every distribution in the smoothness class. 
This is in contrast to the usual minimax style of analysis, in which a bound on an estimator's risk is described as "optimal" for a class of distributions if there exists even a single distribution in that class for which it is tight.\n\n2.2.3 Margin bounds\n\nAn achievement of statistical theory in the past two decades has been margin bounds, which give fast rates of convergence for many classifiers when the underlying data distribution P (given by µ and η) satisfies a large margin condition stipulating, roughly, that η moves gracefully away from 1/2 near the decision boundary.\nFollowing [13, 16, 1], for any β ≥ 0, we say P satisfies the β-margin condition if there exists a constant C > 0 such that\n\nµ({x : |η(x) − 1/2| ≤ t}) ≤ C t^β.\n\nLarger β implies a larger margin. We now obtain bounds for the misclassification rate and the excess risk of k-NN under smoothness and margin conditions.\n\nTheorem 4. Suppose η is (α, L)-smooth in (X, ρ, µ) and satisfies the β-margin condition (with constant C), for some α, β, L, C ≥ 0. In each of the two following statements, k_o and C_o are constants depending on α, β, L, C.\n\n(a) For any 0 < δ < 1, set k = k_o n^{2α/(2α+1)} (log(1/δ))^{1/(2α+1)}. With probability at least 1 − δ over the choice of training data,\n\nPrX(gn,k(X) ≠ g(X)) ≤ δ + C_o (log(1/δ)/n)^{αβ/(2α+1)}.\n\n(b) Set k = k_o n^{2α/(2α+1)}. Then EnRn,k − R* ≤ C_o n^{−α(β+1)/(2α+1)}.\n\nIt is instructive to compare these bounds with the best known rates for nonparametric classification under the margin assumption. 
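To read these exponents concretely, here is a small helper of ours (for illustration only) that computes the optimal-k exponent and the excess-risk exponent appearing in Theorem 4(b); substituting α = α_H/d recovers the familiar nonparametric rate α_H(β+1)/(2α_H + d).

```python
def nn_rate_exponents(alpha, beta):
    """Theorem 4(b): with k of order n^(2α/(2α+1)), the excess risk of k-NN
    decays as n^(−α(β+1)/(2α+1)). Returns (k exponent, risk exponent)."""
    return 2 * alpha / (2 * alpha + 1), alpha * (beta + 1) / (2 * alpha + 1)

# Substituting α = α_H/d turns the risk exponent into α_H(β+1)/(2α_H + d),
# the standard smoothness/margin rate for nonparametric classification.
a_H, d, beta = 1.0, 2.0, 1.0
_, risk_exp = nn_rate_exponents(a_H / d, beta)
assert abs(risk_exp - a_H * (beta + 1) / (2 * a_H + d)) < 1e-12
```

For example, α = β = 1 gives k of order n^(2/3) and an excess-risk rate of n^(−2/3).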
The work of Audibert and Tsybakov [1] (Theorems 3.3 and 3.5) shows that when (X, ρ) = (R^d, ∥·∥), and η is α_H-Hölder continuous, and µ lies in the range [µ_min, µ_max] for some µ_max > µ_min > 0, and the β-margin condition holds (along with some other assumptions), an excess risk of n^{−α_H(β+1)/(2α_H+d)} is achievable and is also the best possible. This is exactly the rate we obtain for nearest neighbor classification, once we translate between the different notions of smoothness as per Lemma 2.\nWe discuss other interesting scenarios in Section C.4 in the appendix.\n\n2.3 A general upper bound on the misclassification error\n\nWe now get to our most general finite-sample bound. It requires no assumptions beyond the basic measurability conditions stated at the beginning of Section 2, and it is the basis of all the results described so far. We begin with some key definitions.\n\nThe radius and probability-radius of a ball. When dealing with balls, we will primarily be interested in their probability mass. To this end, for any x ∈ X and any 0 ≤ p ≤ 1, define\n\nr_p(x) = inf{r | µ(B(x, r)) ≥ p}.\n\nThus µ(B(x, r_p(x))) ≥ p (Lemma 23), and r_p(x) is the smallest radius for which this holds.\n\nThe effective interiors of the two classes, and the effective boundary. When asked to make a prediction at point x, the k-NN classifier finds the k nearest neighbors, which can be expected to lie in B(x, r_p(x)) for p ≈ k/n. It then takes an average over these k labels, which has a standard deviation of ∆ ≈ 1/√k. 
With this in mind, there is a natural definition for the effective interior of the Y = 1 region: the points x with η(x) > 1/2 on which the k-NN classifier is likely to be correct:\n\nX+_{p,∆} = {x ∈ supp(µ) | η(x) > 1/2, η(B(x, r)) ≥ 1/2 + ∆ for all r ≤ r_p(x)}.\n\nThe corresponding definition for the Y = 0 region is\n\nX−_{p,∆} = {x ∈ supp(µ) | η(x) < 1/2, η(B(x, r)) ≤ 1/2 − ∆ for all r ≤ r_p(x)}.\n\nThe remainder of X is the effective boundary,\n\n∂_{p,∆} = X \ (X+_{p,∆} ∪ X−_{p,∆}).\n\nObserve that ∂_{p′,∆′} ⊂ ∂_{p,∆} whenever p′ ≤ p and ∆′ ≤ ∆. Under mild conditions, as p and ∆ tend to zero, the effective boundary tends to the actual decision boundary {x | η(x) = 1/2} (Lemma 14), which we shall denote ∂_o.\nThe misclassification rate of the k-NN classifier can be bounded by the probability mass of the effective boundary:\n\nTheorem 5. Pick any 0 < δ < 1 and positive integers k < n. Let gn,k denote the k-NN classifier based on n training points, and g(x) the Bayes-optimal classifier. With probability at least 1 − δ over the choice of training data,\n\nPrX(gn,k(X) ≠ g(X)) ≤ δ + µ(∂_{p,∆}),\n\nwhere\n\np = (k/n) · 1/(1 − √((4/k) ln(2/δ))), and ∆ = min(1/2, √((1/k) ln(2/δ))).\n\n2.4 A general lower bound on the misclassification error\n\nFinally, we give a counterpart to Theorem 5 that lower-bounds the expected probability of error of gn,k. For any positive integers k < n, we identify a region close to the decision boundary in which a k-NN classifier has a constant probability of making a mistake. 
This high-error set is E_{n,k} = E+_{n,k} ∪ E−_{n,k}, where\n\nE+_{n,k} = {x ∈ supp(µ) | η(x) > 1/2, η(B(x, r)) ≤ 1/2 + 1/√k for all r_{k/n}(x) ≤ r ≤ r_{(k+√(k+1))/n}(x)},\n\nE−_{n,k} = {x ∈ supp(µ) | η(x) < 1/2, η(B(x, r)) ≥ 1/2 − 1/√k for all r_{k/n}(x) ≤ r ≤ r_{(k+√(k+1))/n}(x)}.\n\n(Recall the definition (3) of η(A) for sets A.) For smooth η this region turns out to be comparable to the effective decision boundary ∂_{k/n, 1/√k}. Meanwhile, here is a lower bound that applies to any (X, ρ, µ).\n\nTheorem 6. For any positive integers k < n, let gn,k denote the k-NN classifier based on n training points. There is an absolute constant c_o such that the expected misclassification rate satisfies\n\nEn PrX(gn,k(X) ≠ g(X)) ≥ c_o µ(E_{n,k}),\n\nwhere En is expectation over the choice of training set.\n\nAcknowledgements\n\nThe authors are grateful to the National Science Foundation for support under grant IIS-1162581.\n\nReferences\n\n[1] J.-Y. Audibert and A.B. Tsybakov. Fast learning rates for plug-in classifiers. Annals of Statistics, 35(2):608–633, 2007.\n\n[2] T. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.\n\n[3] T.M. Cover. Rates of convergence for nearest neighbor procedures. In Proceedings of The Hawaii International Conference on System Sciences, 1968.\n\n[4] S. Dasgupta. Consistency of nearest neighbor classification under selective sampling. In Twenty-Fifth Conference on Learning Theory, 2012.\n\n[5] L. Devroye, L. Györfi, A. Krzyzak, and G. Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371–1385, 1994.\n\n[6] H. Federer. 
Geometric Measure Theory. Springer, 1969.\n\n[7] E. Fix and J. Hodges. Discriminatory analysis, nonparametric discrimination. USAF School of Aviation Medicine, Randolph Field, Texas, Project 21-49-004, Report 4, Contract AD41(128)-31, 1951.\n\n[8] J. Fritz. Distribution-free exponential error bound for nearest neighbor pattern classification. IEEE Transactions on Information Theory, 21(5):552–557, 1975.\n\n[9] L. Györfi. The rate of convergence of kn-NN regression estimates and classification rules. IEEE Transactions on Information Theory, 27(3):362–364, 1981.\n\n[10] J. Heinonen. Lectures on Analysis on Metric Spaces. Springer, 2001.\n\n[11] R. Kaas and J.M. Buhrman. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980.\n\n[12] S. Kulkarni and S. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.\n\n[13] E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.\n\n[14] E. Slud. Distribution inequalities for the binomial law. Annals of Probability, 5:404–412, 1977.\n\n[15] C. Stone. Consistent nonparametric regression. Annals of Statistics, 5:595–645, 1977.\n\n[16] A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.\n\n[17] R. Urner, S. Ben-David, and S. Shalev-Shwartz. Access to unlabeled data can speed up prediction time. In International Conference on Machine Learning, 2011.\n\n[18] T.J. Wagner. Convergence of the nearest neighbor rule. IEEE Transactions on Information Theory, 17(5):566–571, 1971.\n", "award": [], "sourceid": 1788, "authors": [{"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": "UC San Diego"}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": "UC San Diego"}]}