{"title": "Adaptivity to Local Smoothness and Dimension in Kernel Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3075, "page_last": 3083, "abstract": "We present the first result for kernel regression where the procedure adapts locally at a point $x$ to both the unknown local dimension of the metric and the unknown H\\{o}lder-continuity of the regression function at $x$. The result holds with high probability simultaneously at all points $x$ in a metric space of unknown structure.\"", "full_text": "Adaptivity to Local Smoothness and Dimension in\n\nKernel Regression\n\nSamory Kpotufe\n\nToyota Technological Institute-Chicago(cid:3)\n\nsamory@ttic.edu\n\nVikas K Garg\n\nToyota Technological Institute-Chicago\n\nvkg@ttic.edu\n\nAbstract\n\nWe present the \ufb01rst result for kernel regression where the procedure adapts locally\nat a point x to both the unknown local dimension of the metric space X and the\nunknown H\u00a8older-continuity of the regression function at x. The result holds with\nhigh probability simultaneously at all points x in a general metric space X of\nunknown structure.\n\n1 Introduction\n\nContemporary statistical procedures are making inroads into a diverse range of applications in the\nnatural sciences and engineering. However it is dif\ufb01cult to use those procedures \u201doff-the-shelf\u201d\nbecause they have to be properly tuned to the particular application. Without proper tuning their\nprediction performance can suffer greatly. This is true in nonparametric regression (e.g. tree-based,\nk-NN and kernel regression) where regression performance is particularly sensitive to how well the\nmethod is tuned to the unknown problem parameters.\nIn this work, we present an adaptive kernel regression procedure, i.e. a procedure which self-tunes,\noptimally, to the unknown parameters of the problem at hand.\nWe consider regression on a general metric space X of unknown metric dimension, where the output\nY is given as f (x) + noise. We are interested in adaptivity at any input point x 2 X : the algorithm\nmust self-tune to the unknown local parameters of the problem at x. The most important such\nparameters (see e.g. [1, 2]), are (1) the unknown smoothness of f, and (2) the unknown intrinsic\ndimension, both de\ufb01ned over a neighborhood of x. Existing results on adaptivity have typically\ntreated these two problem parameters separately, resulting in methods that solve only part of the\nself-tuning problem.\nIn kernel regression, the main algorithmic parameter to tune is the bandwidth h of the kernel. The\nproblem of (local) bandwidth selection at a point x 2 X has received considerable attention in both\nthe theoretical and applied literature (see e.g. [3, 4, 5]). In this paper we present the \ufb01rst method\nwhich provably adapts to both the unknown local intrinsic dimension and the unknown H\u00a8older-\ncontinuity of the regression function f at any point x in a metric space of unknown structure. The\nintrinsic dimension and H\u00a8older-continuity are allowed to vary with x in the space, and the algorithm\nmust thus choose the bandwidth h as a function of the query x, for all possible x 2 X.\nIt is unclear how to extend global bandwidth selection methods such as cross-validation to the local\nbandwidth selection problem at x. The main dif\ufb01culty is that of evaluating the regression error at x\nsince the ouput Y at x is unobserved. 
The result combines various insights from previous work on regression. In particular, to adapt to Hölder-continuity, we build on acclaimed results of Lepski et al. [6, 7, 8]. Such Lepski-type adaptive methods consist of monitoring the change in regression estimates f_{n,h}(x) as the bandwidth h is varied; the selected estimate has to meet a stability criterion. The stability criterion is designed to ensure that the selected f_{n,h}(x) is sufficiently close to a target estimate f_{n,h̃}(x) for a bandwidth h̃ known to yield an optimal regression rate. These methods, however, are generally instantiated for regression in R, but extend to high-dimensional regression if the dimension of the input space X is known. In this work the dimension of X is unknown, and in fact X is allowed to be a general metric space with significantly less regularity than the usual Euclidean spaces.

To adapt to local dimension we build on recent insights of [9], where a k-NN procedure is shown to adapt locally to intrinsic dimension. The general idea for selecting k = k(x) is to balance surrogates of the unknown bias and variance of the estimate. As a surrogate for the bias, nearest-neighbor distances are used, assuming f is globally Lipschitz. Since Lipschitz-continuity is a special case of Hölder-continuity, the work of [9] corresponds in the present context to knowing the smoothness of f everywhere. In this work we do not assume knowledge of the smoothness of f, but simply that f is locally Hölder-continuous with unknown Hölder parameters.

Suppose we knew the smoothness of f at x; then we could derive an approach for selecting h(x), similar to that of [9], by balancing the proper surrogates for the bias and variance of a kernel estimate. Let h̄ be the hypothetical bandwidth so obtained. Since we do not actually know the local smoothness of f, our approach, similar to Lepski's, is to monitor the change in estimates f_{n,h}(x) as h varies, and pick the estimate f_{n,ĥ}(x) which is deemed close to the hypothetical estimate f_{n,h̄}(x) under some stability condition.

We prove nearly optimal local rates Õ(λ^{2d/(2α+d)} n^{-2α/(2α+d)}) in terms of the local dimension d at any point x and Hölder parameters λ, α, which also depend on x. Furthermore, the result holds with high probability, simultaneously at all x ∈ X, for n sufficiently large. Note that we cannot union-bound over all x ∈ X, so the uniform result relies on proper conditioning on particular events in our variance bounds on the estimates f_{n,h}(·).

We start with definitions and theoretical setup in Section 2. The procedure is given in Section 3, followed by a technical overview of the result in Section 4. The analysis follows in Section 5.

2 Setup and Notation

2.1 Distribution and sample

We assume the input X belongs to a metric space (X, ρ) of bounded diameter Δ_X ≥ 1. The output Y belongs to a space Y of bounded diameter Δ_Y. We let μ denote the marginal measure on X and μ_n denote the corresponding empirical distribution on an i.i.d. sample of size n.
We assume for simplicity that Δ_X and Δ_Y are known.

The algorithm runs on an i.i.d. training sample {(X_i, Y_i)}_{i=1}^n of size n. We use the notation X := {X_i}_1^n and Y := {Y_i}_1^n.

Regression function

We assume the regression function f(x) := E[Y | X = x] satisfies local Hölder assumptions: for every x ∈ X and r > 0, there exist λ, α > 0 depending on x and r such that f is (λ, α)-Hölder at x on B(x, r):

∀x' ∈ B(x, r),  |f(x) - f(x')| ≤ λ ρ(x, x')^α.

We note that the α parameter is usually assumed to be in the interval (0, 1] for global definitions of Hölder continuity, since a global α > 1 implies that f is constant (for differentiable f). Here however, the definition being given relative to x, we can simply assume α > 0. For instance, the function f(x) = x^α is clearly locally α-Hölder at x = 0 with constant λ = 1 for any α > 0. With higher α = α(x), f gets flatter locally at x, and regression gets easier.

Notion of dimension

We use the following notion of metric dimension, also employed in [9]. This notion extends some global notions of metric dimension to local regions of space; thus it allows the intrinsic dimension of the data to vary over space. As argued in [9] (see also [10] for a more general theory), it often coincides with other natural measures of dimension such as manifold dimension.

Definition 1. Fix x ∈ X and r > 0. Let C ≥ 1 and d ≥ 1. The marginal μ is (C, d)-homogeneous on B(x, r) if we have μ(B(x, r')) ≤ C ε^{-d} μ(B(x, εr')) for all r' ≤ r and 0 < ε < 1.

In the above definition, d will be viewed as the local dimension at x. We will require a general upper bound d_0 on the local dimension d(x) over any x in the space. This is defined below and can be viewed as the worst-case intrinsic dimension over regions of space.

Assumption 1. The marginal μ is (C_0, d_0)-maximally-homogeneous for some C_0 ≥ 1 and d_0 ≥ 1, i.e. the following holds for all x ∈ X and r > 0: suppose there exist C ≥ 1 and d ≥ 1 such that μ is (C, d)-homogeneous on B(x, r); then μ is (C_0, d_0)-homogeneous on B(x, r).

Notice that if μ is (C, d)-homogeneous on some B(x, r), then it is (C_0, d_0)-homogeneous on B(x, r) for any C_0 > C and d_0 > d. Thus C_0, d_0 can be viewed as global upper bounds on the local homogeneity constants. By the definition, it can be the case that μ is (C_0, d_0)-maximally-homogeneous without being (C_0, d_0)-homogeneous on the entire space X.

The algorithm is assumed to know the upper bound d_0. This is a minor assumption: in many situations where X is a subset of a Euclidean space R^D, D can be used in place of d_0; more generally, the global metric entropy (log of covering numbers) of X can be used in place of d_0 (using known relations between the present notion of dimension and metric entropies [9, 10]). The metric entropy is relatively easy to estimate since it is a global quantity independent of any particular query x.

Finally we require that the local dimension is tight in small regions. This is captured by the following assumption.

Assumption 2. There exist r̄ > 0 and C' > 0 such that if μ is (C, d)-homogeneous on some B(x, r) where r < r̄, then for any r' ≤ r, μ(B(x, r')) ≤ C' r'^d.

This last assumption extends (to local regions of space) the common assumption that μ has an upper-bounded density (relative to Lebesgue). It is however more general in that μ is not required to have a density.

2.2 Kernel Regression

We consider a positive kernel K on [0, 1], highest at 0, decreasing on [0, 1], and 0 outside [0, 1]. The kernel estimate is defined as follows: if B(x, h) ∩ X ≠ ∅,

f_{n,h}(x) = Σ_i w_i(x) Y_i,  where  w_i(x) = K(ρ(x, X_i)/h) / Σ_j K(ρ(x, X_j)/h).

We set w_i(x) = 1/n for all i ∈ [n] if B(x, h) ∩ X = ∅.
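For concreteness, the estimator can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes a Euclidean metric in place of a general ρ, and the box kernel K(u) = 1{u ≤ 1} used later in the paper's experiments.

import numpy as np

def kernel_estimate(x, X, Y, h):
    """Kernel regression estimate f_{n,h}(x) at bandwidth h.

    x: (D,) query point, X: (n, D) training inputs, Y: (n,) training outputs.
    A box kernel and the Euclidean norm (standing in for the metric rho) are assumed.
    """
    dists = np.linalg.norm(X - x, axis=1)      # rho(x, X_i)
    k = (dists <= h).astype(float)             # K(rho(x, X_i)/h): box kernel, 0 outside B(x, h)
    if k.sum() == 0:                           # B(x, h) contains no sample point:
        w = np.full(len(Y), 1.0 / len(Y))      # fall back to uniform weights w_i(x) = 1/n
    else:
        w = k / k.sum()                        # w_i(x) = K(rho(x, X_i)/h) / sum_j K(rho(x, X_j)/h)
    return float(w @ Y)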
3 Procedure for Bandwidth Selection at x

Definition 2 (Global cover size). Let ε > 0. Let N_ρ(ε) denote an upper bound on the size of the smallest ε-cover of (X, ρ).

We assume the global quantity N_ρ(ε) is known or pre-estimated. Recall that, as discussed in Section 2, d_0 can be picked to satisfy ln(N_ρ(ε)) = O(d_0 log(Δ_X/ε)); in other words, the procedure requires only knowledge of upper bounds N_ρ(ε) on global cover sizes.

The procedure is given as follows. Fix ε = Δ_X/n. For any x ∈ X, the set of admissible bandwidths is given as

Ĥ_x = { h ≥ 16ε : μ_n(B(x, h/32)) ≥ 32 ln(N_ρ(ε/2)/δ)/n } ∩ { Δ_X/2^i }_{i=0}^{⌈log(Δ_X/ε)⌉}.

Let C_{n,δ} ≥ 2 (K(0)/K(1)) (4 ln(N_ρ(ε/2)/δ) + 9 C_0 4^{d_0}). For any h ∈ Ĥ_x, define

σ̂_h = 2 Δ_Y² C_{n,δ} / (n · μ_n(B(x, h/2)))  and  D_h = [ f_{n,h}(x) - √(2σ̂_h), f_{n,h}(x) + √(2σ̂_h) ].

At every x ∈ X, select the bandwidth

ĥ = max { h ∈ Ĥ_x : ∩_{h' ∈ Ĥ_x, h' < h} D_{h'} ≠ ∅ }.

The main difference with Lepski-type methods is in the parameter σ̂_h. In Lepski's method, since d is assumed known, a better surrogate depending on d would be used.
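The selection rule can be read as a single pass over the candidate grid, since the set {h : ∩_{h' < h} D_{h'} ≠ ∅} is a prefix of the grid (adding intervals can only shrink the intersection). Below is a hedged Python sketch of this rule, not the authors' implementation: the Euclidean metric again stands in for ρ, log_cover is the assumed-known upper bound on ln N_ρ(·), C_ndelta plays the role of C_{n,δ}, and kernel_estimate is the sketch from Section 2.2.

import numpy as np

def select_bandwidth(x, X, Y, diam_X, diam_Y, C_ndelta, log_cover, delta=0.1):
    """Pointwise bandwidth selection at query x (sketch of the rule of Section 3).

    diam_X, diam_Y play the roles of Delta_X, Delta_Y; log_cover(eps) should return
    an upper bound on ln N_rho(eps). Returns the selected bandwidth, or None if the
    admissible set is empty.
    """
    n = len(Y)
    eps = diam_X / n
    dists = np.linalg.norm(X - x, axis=1)                  # rho(x, X_i), Euclidean here
    mu_n = lambda r: float(np.mean(dists <= r))            # empirical mass of B(x, r)

    # candidate grid {Delta_X / 2^i} and admissible set H_hat
    grid = [diam_X / 2 ** i for i in range(int(np.ceil(np.log2(diam_X / eps))) + 1)]
    threshold = 32.0 * (log_cover(eps / 2) + np.log(1.0 / delta)) / n
    H_hat = sorted(h for h in grid if h >= 16 * eps and mu_n(h / 32) >= threshold)

    h_hat, lo, hi = None, -np.inf, np.inf                  # running intersection of the D_{h'}
    for h in H_hat:                                        # increasing h
        if lo > hi:                                        # intersection of the D_{h'}, h' < h, is empty
            break
        h_hat = h                                          # h qualifies: all D_{h'}, h' < h, share a point
        sigma2 = 2.0 * diam_Y ** 2 * C_ndelta / (n * mu_n(h / 2))
        f = kernel_estimate(x, X, Y, h)                    # estimate from the sketch in Section 2.2
        lo = max(lo, f - np.sqrt(2 * sigma2))
        hi = min(hi, f + np.sqrt(2 * sigma2))
    return h_hat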
4 Discussion of Results

We have the following main theorem.

Theorem 1. Let 0 < δ < 1/e. Fix ε = Δ_X/n. Let C_{n,δ} ≥ 2 (K(0)/K(1)) (9 C_0 4^{d_0} + 4 ln(N_ρ(ε/2)/δ)). Define C_2 = 4^{-d_0}/(6 C_0). There exists N such that, for n > N, the following holds with probability at least 1 - 2δ over the choice of (X, Y), simultaneously for all x ∈ X and all r satisfying

r̄ > r > r_n := ( 2^{2d_0} C_0² Δ_X^{d_0} / (C_2 λ²) )^{1/(2α+d_0)} ( Δ_Y² C_{n,δ} / n )^{1/(2α+d_0)}.

Let x ∈ X, and suppose f is (λ, α)-Hölder at x on B(x, r). Suppose μ is (C, d)-homogeneous on B(x, r). Let C_r := r^{d_0-d} / (C C_0 Δ_X^{d_0}). We have

|f_{n,ĥ}(x) - f(x)|² ≤ 96 C_0 2^{d_0} · λ^{2d/(2α+d)} ( 2^d Δ_Y² C_{n,δ} / (C_2 C_r λ² n) )^{2α/(2α+d)}.

The result holds with high probability for all x ∈ X and for all r̄ > r > r_n, where r_n → 0 as n → ∞. Thus, as n grows, the procedure is eventually adaptive to the Hölder parameters in any neighborhood of x.

Note that the dimension d is the same for all r < r̄ by definition of r̄. As previously discussed, the definition of r̄ corresponds to a requirement that the intrinsic dimension is tight in small enough regions. We believe this is a technical requirement due to our proof technique, and we hope this requirement might be removed in a longer version of the paper.

Notice that r enters the upper bound as a factor (through C_r). Since the result holds simultaneously for all r̄ > r > r_n, the best tradeoff in terms of smoothness and size of r is achieved. A similar tradeoff is observed in the result of [9].

As previously mentioned, the main idea behind the proof is to introduce hypothetical bandwidths h̄ and h̃ which balance, respectively, σ̂_h against λ²h^{2α}, and O(Δ_Y²/(nh^d)) against λ²h^{2α} (see Figure 1). In the figure, d and α are the unknown parameters in some neighborhood of the point x.

The first part of the proof consists in showing that the variance of the estimate using a bandwidth h is at most σ̂_h. With high probability, σ̂_h is bounded above by O(Δ_Y²/(nh^d)). Thus, by balancing O(Δ_Y²/(nh^d)) and λ²h^{2α}, using h̃ we would achieve a rate of n^{-2α/(2α+d)}. We then have to show that the error of f_{n,h̄} cannot be too far from that of f_{n,h̃}. Finally, the error of f_{n,ĥ}, ĥ being selected by the procedure, is related to that of f_{n,h̄}.

The argument is a bit more nuanced than just described above and in Figure 1: the respective curves O(Δ_Y²/(nh^d)) and λ²h^{2α} change with h, since dimension and smoothness at x depend on the size of the region considered. Special care has to be taken in the analysis to handle this technicality.
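For intuition, the balance that defines h̃ can be carried out explicitly if d and α are treated as fixed over the relevant neighborhood; this is only a sketch of where the exponents in Theorem 1 come from, ignoring the constants and log factors hidden in C_{n,δ}:

    λ² h̃^{2α} = Δ_Y² / (n h̃^d)   ⟺   h̃ = ( Δ_Y² / (λ² n) )^{1/(2α+d)},

and substituting h̃ back into either side of the balance gives

    λ² h̃^{2α} = λ^{2d/(2α+d)} ( Δ_Y² / n )^{2α/(2α+d)},

which matches the rate Õ(λ^{2d/(2α+d)} n^{-2α/(2α+d)}) announced in the introduction.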
Figure 1: (Left) The proof argues over h̄ and h̃, which balance, respectively, σ̂_h against λ²h^{2α}, and O(Δ_Y²/(nh^d)) against λ²h^{2α}. The estimate under the ĥ selected by the procedure is shown to be close to that under h̄, which in turn is shown to be close to that under h̃, which is of the right adaptive form. (Right) Simulation results comparing the error (normalized MSE, as a function of training size) of the proposed method to that of a global h selected by cross-validation. The test size is 1000 for all experiments. X ⊂ R^70 has diameter 1 and is a collection of 3 disjoint flats (clusters) of dimension d_1 = 2, d_2 = 5, d_3 = 10, and equal mass 1/3. For each x from cluster i we have the output Y = (sin ‖x‖)^{k_i} + N(0, 1), where k_1 = 0.8, k_2 = 0.6, k_3 = 0.4. For the implementation of the proposed method, we set σ̂_h(x) = var_Y / (n μ_n(B(x, h))), where var_Y is the empirical variance of Y on the training sample. For both our method and cross-validation, we use a box kernel, and we vary h on an equidistant 100-knot grid on the interval from the smallest to the largest interpoint distance on the training sample.
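A rough Python sketch of this experimental setup follows. The text does not specify how the three flats are placed and scaled, so the choices below (coordinate-block flats, an offset along the last coordinate to keep them disjoint, rescaling by the bounding-box diagonal so the diameter is at most 1) are illustrative assumptions, as is the guard against empty balls in the practical surrogate.

import numpy as np

def make_clusters(n, D=70, dims=(2, 5, 10), ks=(0.8, 0.6, 0.4), seed=0):
    """Three disjoint flats in R^D of dimensions dims, equal mass, with
    Y = (sin ||x||)^{k_i} + N(0, 1) on cluster i (illustrative construction)."""
    rng = np.random.default_rng(seed)
    m = n // len(dims)
    blocks, labels = [], []
    for i, d in enumerate(dims):
        P = np.zeros((m, D))
        P[:, :d] = rng.uniform(0.0, 1.0, size=(m, d))       # a d-dimensional flat
        P[:, -1] = 2.0 * i                                   # offset: the flats are disjoint
        blocks.append(P)
        labels.append(np.full(m, i))
    X = np.vstack(blocks)
    labels = np.concatenate(labels)
    X = X / np.linalg.norm(X.max(axis=0) - X.min(axis=0))    # diameter <= 1 (bounding-box bound),
                                                             # which also keeps sin(||x||) >= 0
    k = np.asarray(ks)[labels]
    Y = np.sin(np.linalg.norm(X, axis=1)) ** k + rng.standard_normal(len(X))
    return X, Y

def sigma_hat_practical(x, X, Y, h):
    """Practical surrogate used in the experiments: var_Y / (n * mu_n(B(x, h)))."""
    n = len(Y)
    mass = float(np.mean(np.linalg.norm(X - x, axis=1) <= h))    # mu_n(B(x, h))
    return np.var(Y) / (n * max(mass, 1.0 / n))                  # 1/n guard against empty balls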
5 Analysis

We will make use of the following bias-variance decomposition throughout the analysis. For any x ∈ X and bandwidth h, define the expected regression estimate

f̃_{n,h}(x) := E_{Y|X} f_{n,h}(x) = Σ_i w_i(x) f(X_i).

We have

|f_{n,h}(x) - f(x)|² ≤ 2 |f_{n,h}(x) - f̃_{n,h}(x)|² + 2 |f̃_{n,h}(x) - f(x)|².   (1)

The bias term above is easily bounded in a standard way. This is stated in the lemma below.

Lemma 1 (Bias). Let x ∈ X, and suppose f is (λ, α)-Hölder at x on B(x, h). For any h > 0, we have

|f̃_{n,h}(x) - f(x)|² ≤ λ² h^{2α}.

Proof. We have |f̃_{n,h}(x) - f(x)| ≤ Σ_i w_i(x) |f(X_i) - f(x)| ≤ λ h^α.

The rest of this section is dedicated to the analysis of the variance term of (1). We will need various supporting lemmas relating the empirical mass of balls to their true mass. This is done in the next subsection. The variance results follow in the subsequent subsection.

5.1 Supporting Lemmas

We often argue over the following distributional counterpart to Ĥ_x(ε).

Definition 3. Let x ∈ X and ε > 0. Define

H_x(ε) = { h ≥ 8ε : μ(B(x, h/8)) ≥ 12 ln(N_ρ(ε/2)/δ)/n } ∩ { Δ_X/2^i }_{i=0}^{⌈log(Δ_X/ε)⌉}.

Lemma 2. Fix ε > 0, let Z denote an ε/2-cover of X, and let S_ε = { Δ_X/2^i }_{i=0}^{⌈log(Δ_X/ε)⌉}. Define γ_n := 4 ln(N_ρ(ε/2)/δ)/n. With probability at least 1 - δ, for all z ∈ Z and h ∈ S_ε we have

μ_n(B(z, h)) ≤ μ(B(z, h)) + √(γ_n · μ(B(z, h))) + γ_n/3,   (2)
μ(B(z, h)) ≤ μ_n(B(z, h)) + √(γ_n · μ_n(B(z, h))) + γ_n/3.   (3)

Proof idea. Apply Bernstein's inequality followed by a union bound over Z and S_ε.

The following two lemmas result from the above Lemma 2.

Lemma 3. Fix ε > 0 and 0 < δ < 1. With probability at least 1 - δ, for all x ∈ X and h ∈ H_x(ε), we have, for C_1 = 3 C_0 4^{d_0} and C_2 = 4^{-d_0}/(6 C_0),

C_2 μ(B(x, h/2)) ≤ μ_n(B(x, h/2)) ≤ C_1 μ(B(x, h/2)).

Lemma 4. Let 0 < δ < 1 and ε > 0. With probability at least 1 - δ, for all x ∈ X, Ĥ_x(ε) ⊂ H_x(ε).

Proof. Again, let Z be an ε/2-cover and define S_ε and γ_n as in Lemma 2. Assume (2) in the statement of Lemma 2. Let h > 16ε; we have for any z ∈ Z and x within ε/2 of z,

μ_n(B(x, h/32)) ≤ μ_n(B(z, h/16)) ≤ 2 μ(B(z, h/16)) + 2γ_n ≤ 2 μ(B(x, h/8)) + 2γ_n,

and we therefore have μ(B(x, h/8)) ≥ (1/2) μ_n(B(x, h/32)) - γ_n. Pick h ∈ Ĥ_x and conclude.

5.2 Bound on the variance

The following two results, Lemmas 5 and 6, serve to bound the variance of the kernel estimate. These results are standard and are included here for completeness. The main result of this section is the variance bound of Lemma 7. This last lemma bounds the variance term of (1) with high probability, simultaneously for all x ∈ X and for the values of h relevant to the algorithm.

Lemma 5. For any x ∈ X and h > 0:

E_{Y|X} |f_{n,h}(x) - f̃_{n,h}(x)|² ≤ Σ_i w_i²(x) Δ_Y².

Lemma 6. Suppose that for some x ∈ X and h > 0, μ_n(B(x, h)) ≠ 0. We then have:

Σ_i w_i²(x) ≤ max_i w_i(x) ≤ K(0) / (K(1) · n μ_n(B(x, h))).

Lemma 7 (Variance bound). Let 0 < δ < 1/2 and ε > 0. Define C_{n,δ} := 2 (K(0)/K(1)) (9 C_0 4^{d_0} + 4 ln(N_ρ(ε/2)/δ)). With probability at least 1 - 3δ over the choice of (X, Y), for all x ∈ X and all h ∈ Ĥ_x(ε),

|f_{n,h}(x) - f̃_{n,h}(x)|² ≤ Δ_Y² C_{n,δ} / (n μ_n(B(x, h/2))).

Proof. We prove the lemma statement for h ∈ H_x(ε). The result then follows for h ∈ Ĥ_x(ε) with the same probability since, by Lemma 4, Ĥ_x(ε) ⊂ H_x(ε) under the same event of Lemma 2.

Consider any ε/2-cover Z of X. Define γ_n as in Lemma 2 and assume statement (3). Let x ∈ X and z ∈ Z within distance ε/2 of x. Let h ∈ H_x(ε). We have

μ(B(x, h/8)) ≤ μ(B(z, h/4)) ≤ 2 μ_n(B(z, h/4)) + 2γ_n ≤ 2 μ_n(B(x, h/2)) + 2γ_n,

and we therefore have μ_n(B(x, h/2)) ≥ (1/2) μ(B(x, h/8)) - γ_n ≥ (1/2) γ_n. Now define H_z to denote the union of the sets H_x(ε) over x ∈ B(z, ε/2). With probability at least 1 - δ, for all z ∈ Z, x ∈ B(z, ε/2) and h ∈ H_z, the sets B(z, h) ∩ X and B(x, h) ∩ X are all nonempty, since they all contain B(x', h/2) ∩ X for some x' such that h ∈ H_{x'}(ε). The corresponding kernel estimates are therefore well defined. Assume w.l.o.g. that Z is a minimal cover, i.e. every B(z, ε/2) contains some x ∈ X.

We first condition on X fixed and argue over the randomness in Y. For any x ∈ X and h > 0, let Y_{x,h} denote the subset of Y corresponding to points of X falling in B(x, h). We define φ(Y_{x,h}) := |f_{n,h}(x) - f̃_{n,h}(x)|.

We note that changing any Y_i value changes φ(Y_{z,h}) by at most Δ_Y w_i(z). Applying McDiarmid's inequality and taking a union bound over z ∈ Z and h ∈ H_z, we get

P( ∃z ∈ Z, ∃h ∈ S_ε : φ(Y_{z,h}) > E φ(Y_{z,h}) + t ) ≤ N_ρ²(ε/2) exp( -2t² / (Δ_Y² Σ_i w_i²(z)) ).

We then have, with probability at least 1 - 2δ, for all z ∈ Z and h ∈ H_z,

|f_{n,h}(z) - f̃_{n,h}(z)|² ≤ 2 E_{Y|X}( |f_{n,h}(z) - f̃_{n,h}(z)| )² + 2 ln(N_ρ(ε/2)/δ) · Δ_Y² Σ_i w_i²(z) ≤ ( 4 ln(N_ρ(ε/2)/δ) + 2 ) · K(0) Δ_Y² / (K(1) · n μ_n(B(z, h))),   (4)

where we apply Lemmas 5 and 6 for the last inequality.

Now fix any z ∈ Z, h ∈ H_z and x ∈ B(z, ε/2). We have |φ(Y_{x,h}) - φ(Y_{z,h})| ≤ max{φ(Y_{x,h}), φ(Y_{z,h})} since both quantities are positive. Thus |φ(Y_{x,h}) - φ(Y_{z,h})| changes by at most max_{i,j} {w_i(z), w_j(x)} · Δ_Y if we change any Y_i value out of the contributing Y values. By Lemma 6,

max_{i,j} {w_i(z), w_j(x)} ≤ β_{n,h}(x, z) := K(0) / (n K(1) min(μ_n(B(x, h)), μ_n(B(z, h)))).

Thus define ψ_h(x, z) := (1/β_{n,h}(x, z)) |φ(Y_{x,h}) - φ(Y_{z,h})| and ψ_h(z) := sup_{x : ρ(x,z) ≤ ε/2} ψ_h(x, z).
By what we just argued, changing any Y_i makes ψ_h(z) vary by at most Δ_Y. We can therefore apply McDiarmid's inequality to obtain that, with probability at least 1 - 3δ, for all z ∈ Z and h ∈ H_z,

ψ_h(z) ≤ E_{Y|X} ψ_h(z) + Δ_Y √( 2 ln(N_ρ(ε/2)/δ) / (2n) ).   (5)

To bound the above expectation for any z and h ∈ H_z, consider a sequence {x_i}_1^∞, x_i ∈ B(z, ε/2), such that ψ_h(x_i, z) → ψ_h(z) as i → ∞. Fix any such x_i. Using Hölder's inequality and invoking Lemmas 5 and 6, we have

E_{Y|X} ψ_h(x_i, z) = (1/β_{n,h}(x_i, z)) E_{Y|X} |φ(Y_{x_i,h}) - φ(Y_{z,h})| ≤ (1/β_{n,h}(x_i, z)) √( E_{Y|X}(φ(Y_{x_i,h}) - φ(Y_{z,h}))² )
  ≤ (1/β_{n,h}(x_i, z)) √( 2 E_{Y|X} φ(Y_{x_i,h})² + 2 E_{Y|X} φ(Y_{z,h})² ) ≤ (1/β_{n,h}(x_i, z)) √( 4 Δ_Y² β_{n,h}(x_i, z) ) = 2 Δ_Y / √(β_{n,h}(x_i, z)) ≤ 2 Δ_Y √( n K(1) μ_n(B(z, h)) / K(0) ).

Since ψ_h(x_i, z) is bounded for all x_i ∈ B(z, ε), the Dominated Convergence Theorem yields

E_{Y|X} ψ_h(z) = lim_{i→∞} E_{Y|X} ψ_h(x_i, z) ≤ 2 Δ_Y √( n K(1) μ_n(B(z, h)) / K(0) ).

Therefore, using (5), we have for any z ∈ Z, any h ∈ H_z, and any x ∈ B(z, ε/2) that, with probability at least 1 - 3δ,

|φ(Y_{x,h}) - φ(Y_{z,h})| ≤ Δ_Y β_{n,h}(x, z) ( 2 √( n K(1) μ_n(B(z, h)) / K(0) ) + √( 2 ln(N_ρ(ε/2)/δ) / (2n) ) ).   (6)

Now notice that β_{n,h}(x, z) ≤ K(0) / (n K(1) μ_n(B(x, h/2))), so by Lemma 3,

μ_n(B(z, h)) ≤ μ_n(B(x, 2h)) ≤ C_1 μ(B(x, 2h)) ≤ C_1 C_0 4^{d_0} μ(B(x, h/2)) ≤ C_2 C_1 C_0 4^{d_0} μ_n(B(x, h/2)) ≤ C_0 4^{d_0} μ_n(B(x, h/2)).

Hence, (6) becomes |φ(Y_{x,h}) - φ(Y_{z,h})| ≤ 3 Δ_Y √( C_0 4^{d_0} K(0) / (n K(1) μ_n(B(x, h/2))) ). Combine with (4), using again the fact that μ_n(B(z, h)) ≥ μ_n(B(x, h/2)), to obtain

|f_{n,h}(x) - f̃_{n,h}(x)|² ≤ 2 |f_{n,h}(z) - f̃_{n,h}(z)|² + 2 |φ(Y_{x,h}) - φ(Y_{z,h})|² ≤ (9 C_0 4^{d_0} + 4 ln(N_ρ(ε/2)/δ)) · 2 K(0) Δ_Y² / (K(1) · n μ_n(B(x, h/2))) ≤ Δ_Y² C_{n,δ} / (n μ_n(B(x, h/2))).

Figure 2: Illustration of the selection procedure. The intervals D_h are shown containing f(x). We will argue that f_{n,ĥ}(x) cannot be too far from f_{n,h̄}(x).

5.3 Adaptivity

The proof of Theorem 1 is given in the appendix. As previously discussed, the main part of the argument consists of relating the error of f_{n,h̄}(x) to that of f_{n,h̃}(x), which is of the right form for B(x, r) appropriately defined as in the theorem statement.

To relate the error of f_{n,ĥ}(x) to that of f_{n,h̄}(x), we employ a simple argument inspired by Lepski's adaptivity work. Notice that, by definition of h̄ (see Figure 1 (Left)), for any h ≤ h̄ we have σ̂_h ≥ λ² h^{2α}. It therefore follows from Lemmas 1 and 7 that, for any h < h̄, |f_{n,h}(x) - f(x)|² ≤ 2 σ̂_h, so the intervals D_h must all contain f(x) and therefore must intersect. By the same argument, ĥ ≥ h̄ and D_ĥ and D_h̄ must intersect. Now, since σ̂_h is decreasing in h, we can infer that f_{n,ĥ}(x) cannot be too far from f_{n,h̄}(x), so their errors must be similar. This is illustrated in Figure 2.
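In symbols, a back-of-the-envelope version of this last step (on the events of Lemmas 1 and 7; the appendix tracks the constants more carefully) reads as follows. Since D_ĥ and D_h̄ intersect and σ̂_ĥ ≤ σ̂_h̄,

    |f_{n,ĥ}(x) - f_{n,h̄}(x)| ≤ √(2σ̂_ĥ) + √(2σ̂_h̄) ≤ 2√(2σ̂_h̄),

and since f(x) ∈ D_h̄,

    |f_{n,ĥ}(x) - f(x)| ≤ |f_{n,ĥ}(x) - f_{n,h̄}(x)| + |f_{n,h̄}(x) - f(x)| ≤ 3√(2σ̂_h̄),

so the error of the selected estimate is, up to constants, at most the error attainable with the hypothetical bandwidth h̄.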
Now since ^(cid:27)h is decreasing, we can infer that fn;^h(x) cannot be too far from fn;(cid:22)h(x), so\ntheir errors must be similar. This is illustrated in Figure 2.\n\nReferences\n[1] C. J. Stone. Optimal rates of convergence for non-parametric estimators. Ann. Statist., 8:1348\u2013\n\n1360, 1980.\n\n[2] C. J. Stone. Optimal global rates of convergence for non-parametric estimators. Ann. Statist.,\n\n10:1340\u20131353, 1982.\n\n[3] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Sta-\n\ntistical theory and computational aspects of smoothing, 1049, 1996.\n\n[4] L. Gyor\ufb01, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric\n\nRegression. Springer, New York, NY, 2002.\n\n8\n\n\f[5] J. Lafferty and L. Wasserman. Rodeo: Sparse nonparametric regression in high dimensions.\n\nArxiv preprint math/0506342, 2005.\n\n[6] O. V. Lepski, E. Mammen, and V. G. Spokoiny. Optimal spatial adaptation to inhomogeneous\nsmoothness: an approach based on kernel estimates with variable bandwidth selectors. The\nAnnals of Statistics, pages 929\u2013947, 1997.\n\n[7] O. V. Lepski and V. G. Spokoiny. Optimal pointwise adaptive methods in nonparametric esti-\n\nmation. The Annals of Statistics, 25(6):2512\u20132546, 1997.\n\n[8] O. V. Lepski and B. Y. Levit. Adaptive minimax estimation of in\ufb01nitely differentiable func-\n\ntions. Mathematical Methods of Statistics, 7(2):123\u2013156, 1998.\n\n[9] S. Kpotufe. k-NN Regression Adapts to Local Intrinsic Dimension. NIPS, 2011.\n[10] K. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor\n\nMethods for Learning and Vision: Theory and Practice, 2005.\n\n9\n\n\f", "award": [], "sourceid": 1402, "authors": [{"given_name": "Samory", "family_name": "Kpotufe", "institution": "TTI Chicago"}, {"given_name": "Vikas", "family_name": "Garg", "institution": "TTI Chicago"}]}