{"title": "k-NN Regression Adapts to Local Intrinsic Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 729, "page_last": 737, "abstract": "Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g. a manifold). We show that $k$-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query $x$ and depend only on the way masses of balls centered at $x$ vary with radius. Furthermore, we show a simple way to choose $k = k(x)$ locally at any $x$ so as to nearly achieve the minimax rate at $x$ in terms of the unknown intrinsic dimension in the vicinity of $x$. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure.", "full_text": "k-NN Regression Adapts to Local Intrinsic Dimension\n\nSamory Kpotufe\n\nMax Planck Institute for Intelligent Systems\n\nsamory@tuebingen.mpg.de\n\nAbstract\n\nMany nonparametric regressors were recently shown to converge at rates that de-\npend only on the intrinsic dimension of data. These regressors thus escape the\ncurse of dimension when high-dimensional data has low intrinsic dimension (e.g.\na manifold). We show that k-NN regression is also adaptive to intrinsic dimen-\nsion. In particular our rates are local to a query x and depend only on the way\nmasses of balls centered at x vary with radius.\nFurthermore, we show a simple way to choose k = k(x) locally at any x so as to\nnearly achieve the minimax rate at x in terms of the unknown intrinsic dimension\nin the vicinity of x. 
We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure.\n\n1 Introduction\n\nWe derive new rates of convergence in terms of dimension for the popular approach of k-Nearest Neighbor (k-NN) regression. Our motivation is that, for good performance, k-NN regression can require a number of samples exponential in the dimension of the input space X. This is the so-called \u201ccurse of dimension\u201d. Formally stated, the curse of dimension is the fact that, for any nonparametric regressor there exists a distribution in R^D such that, given a training size n, the regressor converges at a rate no better than n^{\u22121/O(D)} (see e.g. [1, 2]).\nFortunately it often occurs that high-dimensional data has low intrinsic dimension: typical examples are data lying near low-dimensional manifolds [3, 4, 5]. We would hope that in these cases nonparametric regressors can escape the curse of dimension, i.e. their performance should only depend on the intrinsic dimension of the data, appropriately formalized. In other words, if the data in R^D has intrinsic dimension d \u226a D, we would hope for a better convergence rate of the form n^{\u22121/O(d)} instead of n^{\u22121/O(D)}. This has recently been shown to indeed be the case for methods such as kernel regression [6], tree-based regression [7] and variants of these methods [8]. In the case of k-NN regression however, it is only known that 1-NN regression (where k = 1) converges at a rate that depends on intrinsic dimension [9]. Unfortunately 1-NN regression is not consistent. For consistency, it is well known that we need k to grow as a function of the sample size n [10].\nOur contributions are the following. 
We assume throughout that the target function f is Lipschitz. First we show that, for a wide range of values of k ensuring consistency, k-NN regression converges at a rate that only depends on the intrinsic dimension in a neighborhood of a query x. Our local notion of dimension in a neighborhood of a point x relies on the well-studied notion of doubling measure (see Section 2.3). In particular our dimension quantifies how the mass of balls varies with radius, and this captures standard examples of data with low intrinsic dimension. Our second, and perhaps most important contribution, is a simple procedure for choosing k = k(x) so as to nearly achieve the minimax rate of O(n^{\u22122/(2+d)}) in terms of the unknown dimension d in a neighborhood of x. Our final contribution is in showing that this minimax rate holds for any metric space and doubling measure. In other words the hardness of the regression problem is not tied to a particular choice of metric space X or doubling measure \u03bc, but depends only on how the doubling measure \u03bc expands on a metric space X. Thus, for any marginal \u03bc on X with expansion constant \u0398(2^d), the minimax rate for the measure space (X, \u03bc) is \u03a9(n^{\u22122/(2+d)}).\n\n1.1 Discussion\n\nIt is desirable to express regression rates in terms of a local notion of dimension rather than a global one because the complexity of data can vary considerably over regions of space. Consider for example a dataset made up of a collection of manifolds of various dimensions. The global complexity is necessarily of a worst case nature, i.e. it is affected by the most complex regions of the space while we might happen to query x from a less complex region. Worse, it can be the case that the data is not complex locally anywhere, but globally the data is more complex. 
A simple example of this is a so-called space filling curve, where a low-dimensional manifold curves enough that globally it seems to fill up space. We will see that the global complexity does not affect the behavior of k-NN regression, provided k/n is sufficiently small. The behavior of k-NN regression is rather controlled by the often smaller local dimension in a neighborhood B(x, r) of x, where the neighborhood size r shrinks with k/n.\nGiven such a neighborhood B(x, r) of x, how does one choose k = k(x) optimally relative to the unknown local dimension in B(x, r)? This is nontrivial as standard methods of (global) parameter selection do not easily apply. For instance, it is unclear how to choose k by cross-validation over possible settings: we do not know reliable surrogates for the true errors at x of the various estimators {f_{n,k}(x)}_{k\u2208[n]}. Another possibility is to estimate the dimension of the data in the vicinity of x, and use this estimate to set k. However, for optimal rates, we have to estimate the dimension exactly, and we know of no finite sample result that guarantees the exact estimate of intrinsic dimension. Our method consists of finding a value of k that balances quantities which control estimator variance and bias at x, namely 1/k and distances to x\u2019s k nearest neighbors. The method guarantees, uniformly over all x \u2208 X, a near optimal rate of \u00d5(n^{\u22122/(2+d)}), where d = d(x) is exactly the unknown local dimension on a neighborhood B(x, r) of x, where r \u2192 0 as n \u2192 \u221e.\n\n2 Setup\n\nWe are given n i.i.d. samples (X, Y) = {(X_i, Y_i)}_{i=1}^n from some unknown distribution where the input variable X belongs to a metric space (X, \u03c1), and the output Y is a real number. We assume that the class B of balls on (X, \u03c1) has finite VC dimension V_B. This is true for instance for any subset X of a Euclidean space, e.g. the low-dimensional spaces discussed in Section 2.3. 
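Section 2.3 below formalizes intrinsic dimension through how the masses of balls centered at a point grow with radius. As a rough, hypothetical illustration (not part of the paper's method), the empirical analogue of that quantity can be computed directly for Euclidean data: comparing the empirical mass of B(x, r) to that of B(x, \u03b5r) gives the crude local dimension estimate d \u2248 log(\u03bc_n(B(x, r))/\u03bc_n(B(x, \u03b5r)))/log(1/\u03b5). The function name and interface here are illustrative only.

```python
import numpy as np

def local_dimension_estimate(x, X, r, eps=0.5):
    """Hypothetical empirical local-dimension estimate at a query x.

    Definition 2 (Section 2.3) says mu(B(x, r)) <= C eps^{-d} mu(B(x, eps r)).
    Replacing mu by the empirical measure mu_n and ignoring the constant C
    gives d ~ log(mu_n(B(x, r)) / mu_n(B(x, eps r))) / log(1/eps).
    Assumes the smaller ball B(x, eps*r) contains at least one sample point.
    """
    dist = np.linalg.norm(X - x, axis=1)   # rho(x, X_i), Euclidean sketch
    mass_r = np.mean(dist <= r)            # mu_n(B(x, r))
    mass_er = np.mean(dist <= eps * r)     # mu_n(B(x, eps*r))
    return np.log(mass_r / mass_er) / np.log(1.0 / eps)
```

On a roughly uniform sample from a one-dimensional set, the estimate is close to 1 regardless of the ambient dimension, which is the behavior the local notion of dimension is meant to capture.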
The VC assumption is however irrelevant to the minimax result of Theorem 3.\nWe denote the marginal distribution on X by \u03bc and the empirical distribution on X by \u03bc_n.\n\n2.1 Regression function and noise\n\nThe regression function f(x) = E[Y | X = x] is assumed to be \u03bb-Lipschitz, i.e. there exists \u03bb > 0 such that for all x, x\u2032 \u2208 X, |f(x) \u2212 f(x\u2032)| \u2264 \u03bb\u03c1(x, x\u2032).\nWe assume a simple but general noise model: the distributions of the noise at points x \u2208 X have uniformly bounded tails and variance. In particular, Y is allowed to be unbounded. Formally:\n\nfor all \u03b4 > 0 there exists t > 0 such that sup_{x\u2208X} P_{Y|X=x}(|Y \u2212 f(x)| > t) \u2264 \u03b4.\n\nWe denote by t_Y(\u03b4) the infimum over all such t. Also, we assume that the variance of (Y | X = x) is upper-bounded by a constant \u03c3_Y^2 uniformly over all x \u2208 X.\nTo illustrate our noise assumptions, consider for instance the standard assumption of bounded noise, i.e. |Y \u2212 f(x)| is uniformly bounded by some M > 0; then for all \u03b4 > 0, t_Y(\u03b4) \u2264 M, and t_Y(\u03b4) can thus be replaced by M in all our results. Another standard assumption is that where the noise distribution has exponentially decreasing tail; in this case for all \u03b4 > 0, t_Y(\u03b4) \u2264 O(ln 1/\u03b4). As a last example, in the case of Gaussian (or sub-Gaussian) noise, it\u2019s not hard to see that for all \u03b4 > 0, t_Y(\u03b4) \u2264 O(\u221a(ln 1/\u03b4)).\n\n\f2.2 Weighted k-NN regression estimate\n\nWe assume a kernel function K : R\u208a \u2192 R\u208a, non-increasing, such that K(1) > 0, and K(\u03c1) = 0 for \u03c1 > 1. 
For x \u2208 X, let r_{k,n}(x) denote the distance from x to its k\u2019th nearest neighbor in the sample X. The regression estimate at x given the n-sample (X, Y) is then defined as\n\nf_{n,k}(x) = \u03a3_i [ K(\u03c1(x, X_i)/r_{k,n}(x)) / \u03a3_j K(\u03c1(x, X_j)/r_{k,n}(x)) ] Y_i = \u03a3_i w_{i,k}(x) Y_i.\n\n2.3 Notion of dimension\n\nWe start with the following definition of doubling measure which will lead to the notion of local dimension used in this work. We stay informal in developing the motivation and refer the reader to [?, 11, 12] for thorough overviews of the topic of metric space dimension and doubling measures.\nDefinition 1. The marginal \u03bc is a doubling measure if there exists C_db > 0 such that for any x \u2208 X and r \u2265 0, we have \u03bc(B(x, r)) \u2264 C_db \u03bc(B(x, r/2)). The quantity C_db is called an expansion constant of \u03bc.\nAn equivalent definition is that \u03bc is doubling if there exist C and d such that for any x \u2208 X, for any r \u2265 0 and any 0 < \u03b5 < 1, we have \u03bc(B(x, r)) \u2264 C\u03b5^{\u2212d} \u03bc(B(x, \u03b5r)). Here d acts as a dimension. It is not hard to show that d can be chosen as log_2 C_db and C as C_db (see e.g. [?]).\nA simple example of a doubling measure is the Lebesgue volume in the Euclidean space R^d. For any x \u2208 R^d and r > 0, vol(B(x, r)) = vol(B(x, 1)) r^d. Thus vol(B(x, r))/vol(B(x, \u03b5r)) = \u03b5^{\u2212d} for any x \u2208 R^d, r > 0 and 0 < \u03b5 < 1. Building upon the doubling behavior of volumes in R^d, we can construct various examples of doubling probability measures. The following ingredients are sufficient. Let X \u2282 R^D be a subset of a d-dimensional hyperplane, and let X satisfy for all balls B(x, r) with x \u2208 X, vol(B(x, r) \u2229 X) = \u0398(r^d), the volume being with respect to the containing hyperplane. 
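The weighted k-NN estimate f_{n,k} defined in Section 2.2 above translates directly into code. A minimal sketch, assuming a Euclidean metric \u03c1 and the box kernel K(u) = 1[u \u2264 1] (which satisfies K(1) > 0 and K(u) = 0 for u > 1); the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def knn_estimate(x, X, Y, k, kernel=lambda u: (u <= 1.0).astype(float)):
    """Weighted k-NN regression estimate f_{n,k}(x) of Section 2.2.

    A minimal sketch: rho is taken to be the Euclidean metric and the
    default kernel is the box kernel K(u) = 1[u <= 1].
    """
    d = np.linalg.norm(X - x, axis=1)   # rho(x, X_i) for all samples
    r_kn = np.sort(d)[k - 1]            # r_{k,n}(x): distance to k'th neighbor
    w = kernel(d / r_kn)                # unnormalized kernel weights
    w = w / w.sum()                     # normalized weights w_{i,k}(x)
    return float(w @ Y)                 # sum_i w_{i,k}(x) Y_i
```

With the box kernel this reduces to averaging Y over the sample points within distance r_{k,n}(x) of the query.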
Now let \u03bc be approximately uniform, that is, \u03bc satisfies for all such balls B(x, r), \u03bc(B(x, r) \u2229 X) = \u0398(vol(B(x, r) \u2229 X)). We then have \u03bc(B(x, r))/\u03bc(B(x, \u03b5r)) = \u0398(\u03b5^{\u2212d}).\nUnfortunately a global notion of dimension such as the above definition of d is rather restrictive as it requires the same complexity globally and locally. However a data space can be complex globally and have small complexity locally. Consider for instance a d-dimensional submanifold X of R^D, and let \u03bc have an upper and lower bounded density on X. The manifold might be globally complex but the restriction of \u03bc to a ball B(x, r), x \u2208 X, is doubling with local dimension d, provided r is sufficiently small and certain conditions on curvature hold. This is because, under such conditions (see e.g. the Bishop-Gromov theorem [13]), the volume (in X) of B(x, r) \u2229 X is \u0398(r^d).\nThe above example motivates the following definition of local dimension d.\nDefinition 2. Fix x \u2208 X, and r > 0. Let C \u2265 1 and d \u2265 1. The marginal \u03bc is (C, d)-homogeneous on B(x, r) if we have \u03bc(B(x, r\u2032)) \u2264 C\u03b5^{\u2212d} \u03bc(B(x, \u03b5r\u2032)) for all r\u2032 \u2264 r and 0 < \u03b5 < 1.\nThe above definition covers cases other than manifolds. In particular, another space with small local dimension is a sparse data space X \u2282 R^D where each vector x has at most d non-zero coordinates, i.e. X is a collection of finitely many hyperplanes of dimension at most d. More generally suppose the data distribution \u03bc is a mixture \u03a3_i \u03c0_i \u03bc_i of finitely many distributions \u03bc_i with potentially different low-dimensional supports. Then if all \u03bc_i supported on a ball B are (C_i, d)-homogeneous on B, i.e. 
all have local dimension d on B, then \u03bc is also (C, d)-homogeneous on B for some C.\nWe want rates of convergence which hold uniformly over all regions where \u03bc is doubling. We therefore also require (Definition 3) that C and d from Definition 2 are uniformly upper bounded. This will be the case in many situations including the above examples.\nDefinition 3. The marginal \u03bc is (C_0, d_0)-maximally-homogeneous for some C_0 \u2265 1 and d_0 \u2265 1, if the following holds for all x \u2208 X and r > 0: suppose there exist C \u2265 1 and d \u2265 1 such that \u03bc is (C, d)-homogeneous on B(x, r); then \u03bc is (C_0, d_0)-homogeneous on B(x, r).\nWe note that, rather than assuming as in Definition 3 that all local dimensions are at most d_0, we can express our results in terms of the subset of X where local dimensions are at most d_0. In this case d_0 would be allowed to grow with n. The less general assumption of Definition 3 allows for a clearer presentation which still captures the local behavior of k-NN regression.\n\n\f3 Overview of results\n\n3.1 Local rates for fixed k\n\nThe first result below establishes the rates of convergence for any k \u2273 ln n in terms of the (unknown) complexity on B(x, r), where r is any r satisfying \u03bc(B(x, r)) > \u03a9(k/n) (we need at least \u03a9(k) samples in the relevant neighborhoods of x).\nTheorem 1. Suppose \u03bc is (C_0, d_0)-maximally-homogeneous, and B has finite VC dimension V_B. Let 0 < \u03b4 < 1. With probability at least 1 \u2212 2\u03b4 over the choice of (X, Y), the following holds simultaneously for all x \u2208 X and k satisfying n > k \u2265 V_B ln(2n) + ln(8/\u03b4).\nPick any x \u2208 X. Let r > 0 satisfy \u03bc(B(x, r)) > 3C_0 k/n. Suppose \u03bc is (C, d)-homogeneous on B(x, r), with 1 \u2264 C \u2264 C_0 and 1 \u2264 d \u2264 d_0. 
We have\n\n|f_{n,k}(x) \u2212 f(x)|^2 \u2264 (2K(0)/K(1)) \u00b7 (V_B \u00b7 t_Y^2(\u03b4/2n) \u00b7 ln(2n/\u03b4) + \u03c3_Y^2)/k + 2\u03bb^2 r^2 (3Ck/(n\u03bc(B(x, r))))^{2/d}.\n\nNote that the above rates hold uniformly over x, k \u2273 ln n, and any r where \u03bc(B(x, r)) \u2265 \u03a9(k/n). The rate also depends on \u03bc(B(x, r)) and suggests that the best scenario is that where x has a small neighborhood of large mass and small dimension d.\n\n3.2 Minimax rates for a doubling measure\n\nOur next result shows that the hardness of the regression problem is not tied to a particular choice of the metric X or the doubling measure \u03bc. The result relies mainly on the fact that \u03bc is doubling on X. We however assume that \u03bc has the same expansion constant everywhere and that this constant is tight. This does not however make the lower-bound less expressive, as it still tells us which rates to expect locally. Thus if \u03bc is (C, d)-homogeneous near x, we cannot expect a better rate than O(n^{\u22122/(2+d)}) (assuming a Lipschitz regression function f).\nTheorem 2. Let \u03bc be a doubling measure on a metric space (X, \u03c1) of diameter 1, and suppose \u03bc satisfies, for all x \u2208 X, for all r > 0 and 0 < \u03b5 < 1,\n\nC_1 \u03b5^{\u2212d} \u03bc(B(x, \u03b5r)) \u2264 \u03bc(B(x, r)) \u2264 C_2 \u03b5^{\u2212d} \u03bc(B(x, \u03b5r)),\n\nwhere C_1, C_2 and d are positive constants independent of x, r, and \u03b5. Let Y be a subset of R and let \u03bb > 0. Define D_{\u03bc,\u03bb} as the class of distributions on X \u00d7 Y such that X \u223c \u03bc and the output Y = f(X) + N(0, 1) where f is any \u03bb-Lipschitz function from X to Y. Fix a sample size n > 0 and let f_n denote any regressor on samples (X, Y) of size n, i.e. f_n maps any such sample to a function f_n|(X,Y)(\u00b7) : X \u2192 Y in L_2(\u03bc). 
There exists a constant C independent of n and \u03bb such that\n\ninf_{f_n} sup_{D_{\u03bc,\u03bb}} E_{X,Y,x} |f_n|(X,Y)(x) \u2212 f(x)|^2 / (\u03bb^{2d/(2+d)} n^{\u22122/(2+d)}) \u2265 C.\n\n3.3 Choosing k for near-optimal rates at x\n\nOur last result shows a practical and simple way to choose k locally so as to nearly achieve the minimax rate at x, i.e. a rate that depends on the unknown local dimension in a neighborhood B(x, r) of x, where again, r satisfies \u03bc(B(x, r)) > \u03a9(k/n) for good choices of k. It turns out that we just need \u03bc(B(x, r)) > \u03a9(n^{\u22121/3}).\nAs we will see, the choice of k simply consists of monitoring the distances from x to its nearest neighbors. The intuition, similar to that of a method for tree-pruning in [7], is to look for a k that balances the variance (roughly 1/k) and the square bias (roughly r_{k,n}^2(x)) of the estimate. The procedure is as follows:\n\nChoosing k at x: Pick \u0394 \u2265 max_i \u03c1(x, X_i), and pick \u03b8_{n,\u03b4} \u2265 ln(n/\u03b4).\nLet k_1 be the highest integer in [n] such that \u0394^2 \u00b7 \u03b8_{n,\u03b4}/k_1 \u2265 r_{k_1,n}^2(x).\nDefine k_2 = k_1 + 1 and choose k as argmin_{k_i, i\u2208[2]} (\u03b8_{n,\u03b4}/k_i + r_{k_i,n}^2(x)).\n\nThe parameter \u03b8_{n,\u03b4} guesses how the noise in Y affects the risk. This will soon be clearer. Performance guarantees for the above procedure are given in the following theorem.\nTheorem 3. Suppose \u03bc is (C_0, d_0)-maximally-homogeneous, and B has finite VC dimension V_B. Assume k is chosen for each x \u2208 X using the above procedure, and let f_{n,k}(x) be the corresponding estimate. Let 0 < \u03b4 < 1 and suppose n^{4/(6+3d_0)} > (V_B ln(2n) + ln(8/\u03b4))/\u03b8_{n,\u03b4}. With probability at least 1 \u2212 2\u03b4 over the choice of (X, Y), the following holds simultaneously for all x \u2208 X.\nPick any x \u2208 X. 
Let 0 < r < \u0394 satisfy \u03bc(B(x, r)) > 6C_0 n^{\u22121/3}. Suppose \u03bc is (C, d)-homogeneous on B(x, r), with 1 \u2264 C \u2264 C_0 and 1 \u2264 d \u2264 d_0. We have\n\n|f_{n,k}(x) \u2212 f(x)|^2 \u2264 (2C_{n,\u03b4}/\u03b8_{n,\u03b4} + 2\u03bb^2)(1 + 4\u0394^2) (3C\u03b8_{n,\u03b4}/(n\u03bc(B(x, r))))^{2/(2+d)},\n\nwhere C_{n,\u03b4} = (V_B \u00b7 t_Y^2(\u03b4/2n) \u00b7 ln(2n/\u03b4) + \u03c3_Y^2) \u00b7 K(0)/K(1).\nSuppose we set \u03b8_{n,\u03b4} = ln^2(n/\u03b4). Then, as per the discussion in Section 2.1, if the noise in Y is Gaussian, we have t_Y^2(\u03b4/2n) = O(ln(n/\u03b4)), and therefore the factor C_{n,\u03b4}/\u03b8_{n,\u03b4} = O(1). Thus ideally we want to set \u03b8_{n,\u03b4} to the order of t_Y^2(\u03b4/2n) \u00b7 ln(n/\u03b4).\nJust as in Theorem 1, the rates of Theorem 3 hold uniformly for all x \u2208 X, and all 0 < r < \u0394 where \u03bc(B(x, r)) > \u03a9(n^{\u22121/3}). For any such r, let us call B(x, r) an admissible neighborhood. It is clear that, as n grows to infinity, w.h.p. any neighborhood B(x, r) of x, 0 < r < sup_{x\u2032\u2208X} \u03c1(x, x\u2032), becomes admissible. Once a neighborhood B(x, r) is admissible for some n, our procedure nearly attains the minimax rates in terms of the local dimension on B(x, r), provided \u03bc is doubling on B(x, r). Again, the mass of an admissible neighborhood affects the rate, and the bound in Theorem 3 is best for an admissible neighborhood with large mass \u03bc(B(x, r)) and small dimension d.\n\n\f4 Analysis\n\nDefine f\u0303_{n,k}(x) = E_{Y|X} f_{n,k}(x) = \u03a3_i w_{i,k}(x) f(X_i). 
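Before the error terms are bounded, it may help to see the \u201cChoosing k at x\u201d procedure of Section 3.3 in code. A minimal sketch, assuming a Euclidean metric, with \u0394 = max_i \u03c1(x, X_i) and \u03b8_{n,\u03b4} = ln(n/\u03b4) taken at their smallest admissible values; the guard excluding k_2 > n is an implementation detail not spelled out in the paper.

```python
import numpy as np

def choose_k(x, X, delta=0.05):
    """Sketch of the local choice of k at a query x (Section 3.3).

    Assumes Euclidean rho; Delta and theta_{n,delta} are set to their
    smallest admissible values, Delta = max_i rho(x, X_i) and
    theta = ln(n/delta).
    """
    n = len(X)
    r = np.sort(np.linalg.norm(X - x, axis=1))  # r[k-1] = r_{k,n}(x)
    Delta = r[-1]                               # Delta >= max_i rho(x, X_i)
    theta = np.log(n / delta)                   # theta_{n,delta} >= ln(n/delta)

    # k1: highest k in [n] with Delta^2 * theta / k >= r_{k,n}(x)^2
    ks = np.arange(1, n + 1)
    valid = Delta**2 * theta / ks >= r**2
    k1 = int(ks[valid].max())

    # choose between k1 and k2 = k1 + 1 by the variance + squared-bias surrogate
    candidates = [k for k in (k1, k1 + 1) if k <= n]
    return min(candidates, key=lambda k: theta / k + r[k - 1]**2)
```

The surrogate \u03b8_{n,\u03b4}/k + r_{k,n}^2(x) mirrors the variance and squared-bias terms that the analysis below bounds separately.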
We will bound the error of the estimate at a point x in a standard way as\n\n|f_{n,k}(x) \u2212 f(x)|^2 \u2264 2|f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 + 2|f\u0303_{n,k}(x) \u2212 f(x)|^2.    (1)\n\nTheorem 1 is therefore obtained by combining bounds on the above two r.h.s. terms (variance and bias). These terms are bounded separately in Lemma 2 and Lemma 3 below.\n\n4.1 Local rates for fixed k: bias and variance at x\n\nIn this section we bound the bias and variance terms of equation (1) with high probability, uniformly over x \u2208 X. We will need the following lemma which follows easily from standard VC theory [14] results. The proof is given in the long version [15].\nLemma 1. Let B denote the class of balls on X, with VC-dimension V_B. Let 0 < \u03b4 < 1, and define \u03b1_n = (V_B ln(2n) + ln(8/\u03b4))/n. The following holds with probability at least 1 \u2212 \u03b4 for all balls in B. Pick any a \u2265 \u03b1_n. Then \u03bc(B) \u2265 3a \u27f9 \u03bc_n(B) \u2265 a, and \u03bc_n(B) \u2265 3a \u27f9 \u03bc(B) \u2265 a.\nWe start with the bias which is simpler to handle: it is easy to show that the bias of the estimate at x depends on the radius r_{k,n}(x). This radius can then be bounded, first in expectation using the doubling assumption on \u03bc, then by calling on the above lemma to relate this expected bound to r_{k,n}(x) with high probability.\nLemma 2 (Bias). Suppose \u03bc is (C_0, d_0)-maximally-homogeneous. Let 0 < \u03b4 < 1. With probability at least 1 \u2212 \u03b4 over the choice of X, the following holds simultaneously for all x \u2208 X and k satisfying n > k \u2265 V_B ln(2n) + ln(8/\u03b4).\nPick any x \u2208 X. Let r > 0 satisfy \u03bc(B(x, r)) > 3C_0 k/n. Suppose \u03bc is (C, d)-homogeneous on B(x, r), with 1 \u2264 C \u2264 C_0 and 1 \u2264 d \u2264 d_0. We have:\n\n|f\u0303_{n,k}(x) \u2212 f(x)|^2 \u2264 \u03bb^2 r^2 (3Ck/(n\u03bc(B(x, r))))^{2/d}.\n\nProof. 
First fix X, x \u2208 X and k \u2208 [n]. We have:\n\n|f\u0303_{n,k}(x) \u2212 f(x)| = |\u03a3_i w_{i,k}(x)(f(X_i) \u2212 f(x))| \u2264 \u03a3_i w_{i,k}(x)|f(X_i) \u2212 f(x)| \u2264 \u03a3_i w_{i,k}(x) \u03bb\u03c1(X_i, x) \u2264 \u03bb r_{k,n}(x).    (2)\n\nWe therefore just need to bound r_{k,n}(x). We proceed as follows.\nFix x \u2208 X and k, and pick any r > 0 such that \u03bc(B(x, r)) > 3C_0 k/n. Suppose \u03bc is (C, d)-homogeneous on B(x, r), with 1 \u2264 C \u2264 C_0 and 1 \u2264 d \u2264 d_0. Define\n\n\u03b5 := (3Ck/(n\u03bc(B(x, r))))^{1/d},\n\nso that \u03b5 < 1 by the bound on \u03bc(B(x, r)); then by the local doubling assumption on B(x, r), we have \u03bc(B(x, \u03b5r)) \u2265 C^{\u22121} \u03b5^d \u03bc(B(x, r)) \u2265 3k/n. Let \u03b1_n be as defined in Lemma 1, and assume k/n \u2265 \u03b1_n (this is exactly the assumption on k in the lemma statement). By Lemma 1, it follows that with probability at least 1 \u2212 \u03b4 uniformly over x, r and k thus chosen, we have \u03bc_n(B(x, \u03b5r)) \u2265 k/n, implying that r_{k,n}(x) \u2264 \u03b5r. We then conclude with the lemma statement by using equation (2).\nLemma 3 (Variance). Let 0 < \u03b4 < 1. With probability at least 1 \u2212 2\u03b4 over the choice of (X, Y), the following then holds simultaneously for all x \u2208 X and k \u2208 [n]:\n\n|f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 \u2264 (K(0)/K(1)) \u00b7 (V_B \u00b7 t_Y^2(\u03b4/2n) \u00b7 ln(2n/\u03b4) + \u03c3_Y^2)/k.\n\nProof. First, condition on X fixed. For any x \u2208 X, k \u2208 [n], let Y_{x,k} denote the subset of Y corresponding to points from X falling in B(x, r_{k,n}(x)). 
For X fixed, the number of such subsets Y_{x,k} is therefore at most the number of ways we can intersect balls in B with the sample X; this is in turn upper-bounded by n^{V_B}, as is well-known in VC theory.\nLet \u03c8(Y_{x,k}) := |f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|. We\u2019ll proceed by showing that with high probability, for all x \u2208 X, \u03c8(Y_{x,k}) is close to its expectation; then we bound this expectation.\nLet \u03b4_0 \u2264 1/2n. We further condition on the event Y_{\u03b4_0} that for all n samples Y_i, |Y_i \u2212 f(X_i)| \u2264 t_Y(\u03b4_0). By definition of t_Y(\u03b4_0), the event Y_{\u03b4_0} happens with probability at least 1 \u2212 n\u03b4_0 \u2265 1/2. It follows that for any x \u2208 X\n\nE \u03c8(Y_{x,k}) \u2265 P(Y_{\u03b4_0}) \u00b7 E_{Y_{\u03b4_0}} \u03c8(Y_{x,k}) \u2265 (1/2) E_{Y_{\u03b4_0}} \u03c8(Y_{x,k}),\n\nwhere E_{Y_{\u03b4_0}}[\u00b7] denotes conditional expectation under the event. Let \u03b5 > 0; we in turn have\n\nP(\u2203x, k : \u03c8(Y_{x,k}) > 2E \u03c8(Y_{x,k}) + \u03b5) \u2264 P(\u2203x, k : \u03c8(Y_{x,k}) > E_{Y_{\u03b4_0}} \u03c8(Y_{x,k}) + \u03b5) \u2264 P_{Y_{\u03b4_0}}(\u2203x, k : \u03c8(Y_{x,k}) > E_{Y_{\u03b4_0}} \u03c8(Y_{x,k}) + \u03b5) + n\u03b4_0.\n\nThis last probability can be bounded by applying McDiarmid\u2019s inequality: changing any Y_i value changes \u03c8(Y_{x,k}) by at most w_{i,k} \u00b7 t_Y(\u03b4_0) when we condition on the event Y_{\u03b4_0}. This, followed by a union-bound, yields\n\nP_{Y_{\u03b4_0}}(\u2203x, k : \u03c8(Y_{x,k}) > E_{Y_{\u03b4_0}} \u03c8(Y_{x,k}) + \u03b5) \u2264 n^{V_B} exp{\u22122\u03b5^2 / (t_Y^2(\u03b4_0) \u03a3_i w_{i,k}^2)}.\n\nCombining with the above we get\n\nP(\u2203x, k : \u03c8(Y_{x,k}) > 2E \u03c8(Y_{x,k}) + \u03b5) \u2264 n^{V_B} exp{\u22122\u03b5^2 / (t_Y^2(\u03b4_0) \u03a3_i w_{i,k}^2)} + n\u03b4_0.\n\nIn other words, let \u03b4_0 = \u03b4/2n; with probability at least 1 \u2212 \u03b4, for all x \u2208 X and k \u2208 [n]\n\n|f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 \u2264 8 (E_{Y|X} |f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|)^2 + t_Y^2(\u03b4/2n) \u00b7 V_B ln(2n/\u03b4) \u00b7 \u03a3_i w_{i,k}^2 \u2264 8 E_{Y|X} |f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 + t_Y^2(\u03b4/2n) \u00b7 V_B ln(2n/\u03b4) \u00b7 \u03a3_i w_{i,k}^2,\n\nwhere the second inequality is an application of Jensen\u2019s.\nWe bound the above expectation on the r.h.s. next. In what follows (second equality below) we use the fact that for i.i.d. random variables z_i with zero mean, E|\u03a3_i z_i|^2 = \u03a3_i E|z_i|^2. We have\n\nE_{Y|X} |f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 = E_{Y|X} |\u03a3_i w_{i,k}(x)(Y_i \u2212 f(X_i))|^2 = \u03a3_i w_{i,k}^2(x) E_{Y|X} |Y_i \u2212 f(X_i)|^2 \u2264 \u03a3_i w_{i,k}^2(x) \u03c3_Y^2.    (3)\n\nCombining with the previous bound we get that, with probability at least 1 \u2212 \u03b4, for all x and k,\n\n|f_{n,k}(x) \u2212 f\u0303_{n,k}(x)|^2 \u2264 (V_B \u00b7 t_Y^2(\u03b4/2n) \u00b7 ln(2n/\u03b4) + \u03c3_Y^2) \u03a3_i w_{i,k}^2(x).\n\nWe can now bound \u03a3_i w_{i,k}^2(x) as follows:\n\n\u03a3_i w_{i,k}^2(x) \u2264 max_{i\u2208[n]} w_{i,k}(x) = max_{i\u2208[n]} K(\u03c1(x, X_i)/r_{k,n}(x)) / \u03a3_{x_j\u2208B(x, r_{k,n}(x))} K(\u03c1(x, X_j)/r_{k,n}(x)) \u2264 K(0)/(K(1)k).\n\nPlug this back into equation (3) and conclude.\n\n\f4.2 Minimax rates for a doubling measure\n\nThe minimax rates of Theorem 2 (proved in the long version [15]) are obtained as is commonly done by constructing a regression problem that reduces to the problem of binary classification (see e.g. [1, 2, 10]). Intuitively the problem of classification is hard in those instances where labels (say \u22121, +1) vary wildly over the space X, i.e. close points can have different labels. We make the regression problem similarly hard. We will consider a class of candidate regression functions such that each function f alternates between positive and negative in neighboring regions (f is depicted as the dashed line in the figure below).\nThe reduction relies on the simple observation that for a regressor f_n to approximate the right f from data it needs to at least identify the sign of f in the various regions of space. The more we can make each such f change between positive and negative, the harder the problem. We are however constrained in how much f changes since we also have to ensure that each f is Lipschitz continuous.\n\n[Figure omitted: a dashed curve alternating between + and \u2212 over neighboring regions.]\n\n4.3 Choosing k for near-optimal rates at x\n\nProof of Theorem 3. Fix x and let r, d, C be as defined in the theorem statement. 
Define\n\n\u03ba := \u03b8_{n,\u03b4}^{d/(2+d)} (n\u03bc(B(x, r))/(3C))^{2/(2+d)} and \u03b5 := (3C\u03ba/(n\u03bc(B(x, r))))^{1/d}.\n\nNote that, by our assumptions,\n\n\u03bc(B(x, r)) > 6C\u03b8_{n,\u03b4} n^{\u22121/3} \u2265 6C\u03b8_{n,\u03b4} n^{\u2212d/(2+d)} = 6C\u03b8_{n,\u03b4} n^{2/(2+d)}/n.    (4)\n\nThe above equation (4) implies \u03b5 < 1. Thus, by the homogeneity assumption on B(x, r), \u03bc(B(x, \u03b5r)) \u2265 C^{\u22121} \u03b5^d \u03bc(B(x, r)) \u2265 3\u03ba/n. Now by the first inequality of (4) we also have\n\n\u03ba/n \u2265 (\u03b8_{n,\u03b4}/n) n^{4/(6+3d)} \u2265 (\u03b8_{n,\u03b4}/n) n^{4/(6+3d_0)} \u2265 \u03b1_n,\n\nwhere \u03b1_n = (V_B ln(2n) + ln(8/\u03b4))/n is as defined in Lemma 1. We can thus apply Lemma 1 to get that, with probability at least 1 \u2212 \u03b4, \u03bc_n(B(x, \u03b5r)) \u2265 \u03ba/n. In other words, for any k \u2264 \u03ba, r_{k,n}(x) \u2264 \u03b5r. It follows that if k \u2264 \u03ba,\n\n\u0394^2 \u00b7 \u03b8_{n,\u03b4}/k \u2265 \u0394^2 \u00b7 \u03b8_{n,\u03b4}/\u03ba = \u0394^2 (3C\u03ba/(n\u03bc(B(x, r))))^{2/d} \u2265 (\u03b5r)^2 \u2265 r_{k,n}^2(x).\n\nRemember that the above inequality is exactly the condition on the choice of k_1 in the procedure. Therefore, suppose k_1 \u2264 \u03ba; it must be that k_2 > \u03ba, since otherwise k_2 would be a higher integer satisfying the condition, contradicting our choice of k_1. Thus we have (i) \u03b8_{n,\u03b4}/k_2 < \u03b8_{n,\u03b4}/\u03ba = \u03b5^2. We also have (ii) r_{k_2,n}(x) \u2264 2^{1/d} \u03b5 r. 
To see this, notice that since k_1 \u2264 \u03ba < k_2 = k_1 + 1 we have k_2 \u2264 2\u03ba; now by repeating the sort of argument above, we have \u03bc(B(x, 2^{1/d}\u03b5r)) \u2265 6\u03ba/n, which by Lemma 1 implies that \u03bc_n(B(x, 2^{1/d}\u03b5r)) \u2265 2\u03ba/n \u2265 k_2/n.\nNow suppose instead that k_1 > \u03ba; then by definition of k_1, we have (iii)\n\nr_{k_1,n}^2(x) \u2264 \u0394^2 \u00b7 \u03b8_{n,\u03b4}/k_1 \u2264 \u0394^2 \u00b7 \u03b8_{n,\u03b4}/\u03ba = (\u0394\u03b5)^2.\n\nThe following holds by (i), (ii), and (iii). Let k be chosen as in the theorem statement. Then, whether k_1 > \u03ba or not, it is true that\n\n\u03b8_{n,\u03b4}/k + r_{k,n}^2(x) \u2264 (1 + 4\u0394^2) \u03b5^2 = (1 + 4\u0394^2) (3C\u03b8_{n,\u03b4}/(n\u03bc(B(x, r))))^{2/(2+d)}.\n\nNow combine Lemma 3 with equation (2) and we have that with probability at least 1 \u2212 2\u03b4 (accounting for all events discussed)\n\n|f_{n,k}(x) \u2212 f(x)|^2 \u2264 (2C_{n,\u03b4}/\u03b8_{n,\u03b4}) \u00b7 \u03b8_{n,\u03b4}/k + 2\u03bb^2 r_{k,n}^2(x) \u2264 (2C_{n,\u03b4}/\u03b8_{n,\u03b4} + 2\u03bb^2)(\u03b8_{n,\u03b4}/k + r_{k,n}^2(x)) \u2264 (2C_{n,\u03b4}/\u03b8_{n,\u03b4} + 2\u03bb^2)(1 + 4\u0394^2)(3C\u03b8_{n,\u03b4}/(n\u03bc(B(x, r))))^{2/(2+d)}.\n\n5 Final remark\n\nThe problem of choosing k = k(x) optimally at x is similar to the problem of local bandwidth selection for kernel-based methods (see e.g. [16, 17]), and our method for choosing k might yield insights into bandwidth selection, since k-NN and kernel regression methods only differ in their notion of neighborhood of a query x.\n\nAcknowledgments\n\nI am grateful to David Balduzzi for many useful discussions.\n\n\fReferences\n\n[1] C. J. Stone. 
Optimal rates of convergence for non-parametric estimators. Ann. Statist., 8:1348\u20131360, 1980.\n[2] C. J. Stone. Optimal global rates of convergence for non-parametric estimators. Ann. Statist., 10:1340\u20131353, 1982.\n[3] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2000.\n[4] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.\n[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373\u20131396, 2003.\n[6] P. Bickel and B. Li. Local polynomial regression on unknown manifolds. Tech. Rep., Dept. of Statistics, UC Berkeley, 2006.\n[7] S. Kpotufe. Escaping the curse of dimensionality with a tree-based regressor. Conference on Learning Theory, 2009.\n[8] S. Kpotufe. Fast, smooth, and adaptive regression in metric spaces. Neural Information Processing Systems, 2009.\n[9] S. Kulkarni and S. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41, 1995.\n[10] L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, NY, 2002.\n[11] C. Cutler. A review of the theory and estimation of fractal dimension. Nonlinear Time Series and Chaos, Vol. I: Dimension Estimation and Models, 1993.\n[12] K. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, 2005.\n[13] M. do Carmo. Riemannian Geometry. Birkhauser, 1992.\n[14] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264\u2013280, 1971.\n[15] S. Kpotufe. k-NN regression adapts to local intrinsic dimension. 
arXiv:1110.4300, 2011.\n[16] J. G. Staniswalis. Local bandwidth selection for kernel estimates. Journal of the American Statistical Association, 84:284\u2013288, 1989.\n[17] R. Cao-Abad. Rate of convergence for the wild bootstrap in nonparametric regression. Annals of Statistics, 19:2226\u20132231, 1991.\n", "award": [], "sourceid": 498, "authors": [{"given_name": "Samory", "family_name": "Kpotufe", "institution": null}]}