{"title": "Derivative Estimation in Random Design", "book": "Advances in Neural Information Processing Systems", "page_first": 3445, "page_last": 3454, "abstract": "We propose a nonparametric derivative estimation method for random design without\nhaving to estimate the regression function. The method is based on a variance-reducing linear combination of symmetric difference quotients. First, we discuss\nthe special case of uniform random design and establish the estimator\u2019s asymptotic\nproperties. Secondly, we generalize these results for any distribution of the dependent variable and compare the proposed estimator with popular estimators for\nderivative estimation such as local polynomial regression and smoothing splines.", "full_text": "Derivative Estimation in Random Design\n\nYu Liu1, Kris De Brabanter1,2\u2217\n\n1Department of Computer Science, 2Department of Statistics\n\nAbstract\n\nWe propose a nonparametric derivative estimation method for random design\nwithout having to estimate the regression function. The method is based on a\nvariance-reducing linear combination of symmetric difference quotients. First, we\ndiscuss the special case of uniform random design and establish the estimator\u2019s\nasymptotic properties. Secondly, we generalize these results for any distribution of\nthe dependent variable and compare the proposed estimator with popular estima-\ntors for derivative estimation such as local polynomial regression and smoothing\nsplines.\n\n1\n\nIntroduction\n\nIn the area of statistics, nonparametric regression is often of great interest due to its \ufb02exibility\nand different regression methods have been fully explored [1, 2, 3]. Derivative estimation has\nreceived less attention than regression and it is often treated as the \u201cby-product\u201d of nonparametric\nregression problems e.g. local polynomial regression [1] and smoothing splines [4]. 
Derivatives are widely used in different areas, for example in analyzing human growth data [5, 6]. Other applications include exploring the structure of curves [7, 8], analyzing significant trends [9], comparing regression curves [10], characterization of nanoparticles [11], neural network pruning [12], estimating the leading bias term in the construction of confidence intervals [13, 14] and bandwidth selection methods [15].
In general, derivative estimators can be divided into three categories: local polynomial regression, regression/smoothing splines and difference quotients [16]. Local polynomial regression relies on a Taylor expansion, and the coefficients, obtained by solving a weighted least squares problem, provide estimates of the derivatives. Asymptotic properties for the regression as well as the derivatives are given in [1]. Derivative estimation via smoothing splines is obtained by differentiating the spline basis [17]. These estimators are shown to achieve the optimal L2 convergence rate [18], and their asymptotic properties are discussed in [19]. For the latter, smoothing parameter selection is quite difficult, especially for smoothing splines, whose parameter depends on the order of the derivative [4].
Difference quotient based derivative estimators [16, 20] provide a noisy version of the derivative which can be smoothed by any nonparametric regression method. The difference estimator proposed by [16] is quasi unbiased, but its variance is O(n^2), where n is the sample size. In order to reduce the variance, [21, 22] proposed a variance-reducing linear combination of k symmetric difference quotients, where k is considered to be a tuning parameter.
More recently, [23] proposed a sequence of approximate linear regression representations in which the derivative is just the intercept term. Although their results are very appealing, they rely on rather stringent assumptions on the regression function.
These assumptions are relaxed in [24], where a linear combination of the dependent variables is used to obtain the derivatives. The variance-reducing weights are obtained by solving a constrained optimization problem for which a closed form solution is derived. They also showed that the symmetric form used in [21] and [22] reduces the order of the estimation variance without significantly increasing the estimation bias in the interior, and they proposed an asymmetric estimator for the derivatives at the boundaries. All results from [23] and [24] assume the equispaced design, and neither paper mentions the extension to the random design setting.
In this paper we extend the difference quotient based estimator to the random design and provide a framework that can be used to extend other difference based estimators to the random design. Further, we show that the extension from equispaced to random design for higher order derivatives is not trivial: simply using the estimator from [21] and [22] in random design leads to an inconsistent estimator. In the simulation study, we show that the new estimator has similar performance compared to local polynomial regression and penalized smoothing splines. All proofs of the lemmas and theorems can be found in the Supplementary Material accompanying the paper.

∗Liu and De Brabanter are with Iowa State University, Ames, IA 50011, USA. Corresponding authors: yuliu@iastate.edu, kbrabant@iastate.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1.1 Equispaced design vs. random design

Consider the data (X_1, Y_1), ..., (X_n, Y_n), which form an independent and identically distributed (i.i.d.) sample from a population (X, Y), where X_i ∈ X = [a, b] ⊆ R and Y_i ∈ R for all i = 1, ..., n. In the equispaced design case, the response variables are assumed to satisfy

Y_i = m(x_i) + e_i,   i = 1, ..., n,   (1)

where x_1, ..., x_n are nonrandom and x_{i+1} − x_i = (b − a)/(n − 1) is constant for all i. In this setting, the regression function is given by m(x) = E[Y] and we assume that E[e] = 0 and Var[e] = σ_e^2 < ∞.
In contrast to the equispaced design, the design points X in random design are random variables generated from an unknown distribution F. Consider the following model

Y_i = m(X_i) + e_i,   i = 1, ..., n,   (2)

where the regression function is given by m(x) = E[Y | X = x], and assume that E[e] = 0, Var[e] = σ_e^2 < ∞, and that X and e are independent. The derivative estimators discussed in [21, 22, 23, 24] use the symmetry property x_{i+j} − x_i = x_i − x_{i−j}, since they assume equispaced design. In the random design this property no longer holds, which presents extra theoretical difficulties, as we will show in the next sections.

2 Difference based derivative estimators based on order statistics

The first difference quotients were proposed by [16] for fixed design. Extending their estimator to random design yields

q̂_i^(1) = (Y_i − Y_{i−1}) / (X_i − X_{i−1}),   (3)

which is a noise corrupted version of the first order derivative in X_i. Although this estimator is quasi unbiased, two problems immediately emerge: (i) no simple expression for the difference X_i − X_{i−1} is available to study its asymptotic properties; (ii) the variance is proportional to n^2 (see next section). In order to discuss the asymptotic properties of this difference quotient, we need an asymptotic expression for the difference X_i − X_{i−1}, which is not trivial in the random design setting. However, in a special case, i.e., X = U ∼ U(0, 1), and after arranging the random variables in order of magnitude according to U (order statistics), the asymptotic properties of the difference can be obtained using order statistics (see Lemma 1).
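As a concrete illustration, estimator (3) amounts to sorting the sample by the design points and forming first order difference quotients. The following minimal sketch (written in Python with NumPy as an illustrative choice, not the authors' code; the paper's experiments rely on R packages) implements this naive estimator:

```python
import numpy as np

def naive_first_derivative(x, y):
    """Naive difference quotients (Y_i - Y_{i-1}) / (X_i - X_{i-1}) of
    estimator (3), computed after sorting the pairs by the design points
    (i.e., on the order statistics). Returns the sorted design points
    x_(2), ..., x_(n) and the n - 1 difference quotients."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)                 # arrange by order of magnitude of X
    xs, ys = x[order], y[order]
    return xs[1:], np.diff(ys) / np.diff(xs)
```

On noiseless data the quotients track the derivative, but with even moderate noise they explode, because the random gaps X_i − X_{i−1} can be arbitrarily small; this is the variance problem quantified in the next section.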
In what follows, U(0, 1) denotes the uniform distribution between 0 and 1.

2.1 Approach based on order statistics

Consider n bivariate data forming an i.i.d. sample from a population (U, Y) and further assume U ∼ U(0, 1). Arrange the bivariate data (U, Y) in order of magnitude according to U, i.e., U_(1) < U_(2) < ... < U_(n), where U_(i), i = 1, ..., n, is the ith order statistic. Consider the following model:

Y_i = r(U_(i)) + e_i,   i = 1, ..., n,   (4)

where r(u) = E[Y | U = u] is the regression function, and assume E[e] = 0, Var[e] = σ_e^2 < ∞, and that U and e are independent. Our goal is to obtain a smoothed version of the first order derivative of r. Since the estimator (3) suffers from high variance, [21] and [22] proposed a variance-reducing linear combination of symmetric difference quotients. Our proposed extension to the random design involving uniform order statistics is

Ŷ_i^(1) = Σ_{j=1}^{k} w_{i,j} (Y_{i+j} − Y_{i−j}) / (U_(i+j) − U_(i−j)),   (5)

where the weights w_{i,1}, ..., w_{i,k} sum up to one. To avoid division by zero we require no ties, i.e., U_(l) ≠ U_(m) for l ≠ m. Note that (5) is valid for k + 1 ≤ i ≤ n − k and hence k ≤ (n − 1)/2. For the boundary regions, i.e., for 2 ≤ i ≤ k and n − k + 1 ≤ i ≤ n − 1, the estimator (5) needs to be modified; this is discussed in Section 2.5. A minor point is that the estimator (5) does not provide results for Ŷ_1^(1) and Ŷ_n^(1). One can ignore these two points from consideration. Proposition 1 shows the optimal weights w_{i,j} which minimize the variance of (5).

Proposition 1. For k + 1 ≤ i ≤ n − k and under model (4), the weights w_{i,j} that minimize the variance of (5), satisfying Σ_{j=1}^{k} w_{i,j} = 1, are given by

w_{i,j} = (U_(i+j) − U_(i−j))^2 / Σ_{l=1}^{k} (U_(i+l) − U_(i−l))^2,   j = 1, ..., k.   (6)

At first sight, these weights seem to be different from the weights obtained by [21] and [22] for the equispaced design case. However, for the equispaced design case, plugging the difference u_{i+j} − u_{i−j} = 2j(b − a)/(n − 1) into the weights obtained in Proposition 1 gives

w_{i,j} = (u_{i+j} − u_{i−j})^2 / Σ_{l=1}^{k} (u_{i+l} − u_{i−l})^2 = [4j^2/(n − 1)^2] / [4/(n − 1)^2 Σ_{l=1}^{k} l^2] = 6j^2 / (k(k + 1)(2k + 1)),

which are exactly the weights used in the equispaced design. This shows that the weights for equispaced design are a special case of the weights in Proposition 1. To reduce the variance, for a fixed i, the jth weight (6) is proportional to the inverse variance of (Y_{i+j} − Y_{i−j})/(U_(i+j) − U_(i−j)) in (5).
Next, we need to find an asymptotic expression for the differences U_(i+l) − U_(i−l). From [25, p. 14], the difference of uniform order statistics is distributed as

U_(s) − U_(r) ∼ Beta(s − r, n − s + r + 1)   for s > r.

This result immediately leads to the lemma below.

Lemma 1. Let U_1, ..., U_n be i.i.d. ∼ U(0, 1) and arrange the random variables in order of magnitude U_(1) < U_(2) < ... < U_(n). Then, for i > j,

U_(i+j) − U_(i−j) = 2j/(n + 1) + O_p(√(j/n^2)),
U_(i) − U_(i−j) = j/(n + 1) + O_p(√(j/n^2)),
U_(i+j) − U_(i) = j/(n + 1) + O_p(√(j/n^2)).

We briefly show why (3) suffers from high variance.
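Before turning to the formal argument, the variance-reduced estimator (5) with the weights of Proposition 1 can be sketched numerically (an illustrative NumPy implementation, not the authors' code):

```python
import numpy as np

def symmetric_derivative(u, y, k):
    """Estimator (5) with the variance-minimizing weights (6) of
    Proposition 1, evaluated at interior order statistics.
    u must be sorted ascending with no ties; estimates are returned for
    the interior indices i = k, ..., n - k - 1 (0-based), NaN elsewhere."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(u)
    est = np.full(n, np.nan)
    j = np.arange(1, k + 1)
    for i in range(k, n - k):
        gaps = u[i + j] - u[i - j]            # U_(i+j) - U_(i-j)
        w = gaps**2 / np.sum(gaps**2)         # Proposition 1: weights sum to one
        est[i] = np.sum(w * (y[i + j] - y[i - j]) / gaps)
    return est
```

For a fixed i, the loop body reproduces (5)-(6): the squared symmetric gaps are normalized to sum to one and weight the k symmetric difference quotients, down-weighting quotients with small (high-variance) denominators.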
Assume r(·) is twice continuously differentiable on [0, 1]. A Taylor expansion of r(U_(i±j)) in a neighborhood of U_(i) gives

r(U_(i±j)) = r(U_(i)) + r^(1)(U_(i)) (U_(i±j) − U_(i)) + O_p(j^2/n^2).   (7)

Applying Lemma 1 and (7) to (3), then for n → ∞ we have

E[q̂_i^(1) | U_(i−1), U_(i)] = E[(Y_i − Y_{i−1})/(U_(i) − U_(i−1)) | U_(i−1), U_(i)] = r^(1)(U_(i)) + o_p(1)

and

Var[q̂_i^(1) | U_(i−1), U_(i)] = Var[(Y_i − Y_{i−1})/(U_(i) − U_(i−1)) | U_(i−1), U_(i)] = O_p(n^2).

It is immediately clear that the first order difference quotient proposed by [16] is an asymptotically unbiased estimator of r^(1)(U_(i)). The variance of this estimator, however, can be arbitrarily large, severely complicating derivative estimation; reducing it is the main goal addressed in this paper.

2.2 Asymptotic properties of the first order derivative estimator

The following theorems establish the asymptotic bias and variance of our proposed estimator (5) for interior points, i.e., k + 1 ≤ i ≤ n − k. In what follows we denote U = (U_(i−j), ..., U_(i+j)) for i > j and j = 1, ..., k.

Theorem 1. Under model (4), assume r(·) is twice continuously differentiable on [0, 1] and k → ∞ as n → ∞.
Then, for uniform random design on [0, 1] and for the weights in Proposition 1, the conditional (absolute) bias and conditional variance of (5) are given by

|bias[Ŷ_i^(1) | U]| ≤ sup_{u∈[0,1]} |r^(2)(u)| · 3k(k + 1) / (4(n + 1)(2k + 1)) + o_p(n^{−1}k)

and

Var[Ŷ_i^(1) | U] = 3σ_e^2 (n + 1)^2 / (k(k + 1)(2k + 1)) + o_p(n^2 k^{−3}),

uniformly for k + 1 ≤ i ≤ n − k.
From Theorem 1, pointwise consistency immediately follows.

Corollary 1. Under the assumptions of Theorem 1, let k → ∞ as n → ∞ such that n^{−1}k → 0 and n^2 k^{−3} → 0. Then, for σ_e^2 < ∞ and the weights given in Proposition 1, we have for any ε > 0

P(|Ŷ_i^(1) − r^(1)(U_(i))| ≥ ε) → 0

for k + 1 ≤ i ≤ n − k.
According to Theorem 1 and Corollary 1, the conditional bias and variance of (5) go to zero as n → ∞ and k → ∞ faster than O(n^{2/3}) but slower than O(n). In the next section, we propose a practical selection method for k such that k = O(n^{4/5}). The fastest possible rate at which E[(Ŷ_i^(1) − r^(1)(U_(i)))^2 | U] → 0 (L2 rate of convergence) is O_p(n^{−2/5}). Using Jensen's inequality, a similar result can be shown for the L1 rate of convergence, i.e.,

E[|Ŷ_i^(1) − r^(1)(U_(i))| | U] ≤ |bias[Ŷ_i^(1) | U]| + √(Var[Ŷ_i^(1) | U]) = O_p(n^{−1/5}).

2.3 Selection method for k

Crucial to the estimator (5) is the parameter k, which controls the bias-variance trade-off. Based on Theorem 1, we choose the k that minimizes the asymptotic upper bound of the mean integrated squared error (MISE). The result is given in Corollary 2.
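One ingredient of this selection rule can be computed directly from the data: the error variance σ_e^2 is estimated with the difference-based estimator of Hall et al. [26], restated in the next paragraph. It admits a short implementation (again NumPy, as an illustrative choice):

```python
import numpy as np

def hall_variance(y):
    """Hall, Kay & Titterington's sqrt(n)-consistent difference-based
    estimator of the error variance sigma_e^2:
    (1/(n-2)) * sum_i (0.809 Y_i - 0.5 Y_{i+1} - 0.309 Y_{i+2})^2.
    The responses y are assumed sorted by the design points."""
    y = np.asarray(y, dtype=float)
    d = 0.809 * y[:-2] - 0.5 * y[1:-1] - 0.309 * y[2:]
    return np.sum(d**2) / (len(y) - 2)
```

The three weights sum to zero (removing the local trend) and their squares sum to one, so for a smooth regression function the squared differences average to roughly σ_e^2.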
However, a closed form for k_opt cannot be obtained.

Corollary 2. Under the assumptions of Theorem 1 and denoting B = sup_{u∈[0,1]} |r^(2)(u)|, the k that minimizes the asymptotic upper bound of the MISE is

k_opt = argmin_{k∈N, k≥1} { B^2 · 9k^2(k + 1)^2 / (16(n + 1)^2(2k + 1)^2) + 3σ_e^2 (n + 1)^2 / (k(k + 1)(2k + 1)) } = O(n^{4/5}).

The only two unknown quantities here are σ_e^2 and B. The error variance can be estimated by Hall's √n-consistent estimator [26]

σ̂_e^2 = (1/(n − 2)) Σ_{i=1}^{n−2} (0.809 Y_i − 0.5 Y_{i+1} − 0.309 Y_{i+2})^2.

The second unknown quantity B can be (roughly) estimated with a local polynomial regression estimator of order p = 3. The performance of our proposed method is not very sensitive to the accuracy of B, so a rough estimate of the second order derivative is sufficient. By plugging these two estimators for σ_e^2 and B into Corollary 2, the optimal value k_opt can be obtained using a grid search (or any other optimization method) over the integer set [1, ⌈(n − 1)/2⌉].

2.4 Asymptotic order of the bias and continuous differentiability of r

In Theorem 1, we bounded the conditional bias from above. From a theoretical point of view, it is helpful to derive an exact expression for the conditional bias. Assume that the first q + 1 derivatives of r(·) exist on [0, 1]. A Taylor series of r(U_(i±j)) in a neighborhood of U_(i) yields

r(U_(i±j)) = r(U_(i)) + Σ_{l=1}^{q} (1/l!) (U_(i±j) − U_(i))^l r^(l)(U_(i)) + O_p{(j/n)^{q+1}}.

Using Lemma 1, assuming k → ∞ as n → ∞, and for the weights in Proposition 1, we obtain the asymptotic order of the exact conditional bias for different values of q:

bias[Ŷ_i^(1) | U] = O_p(k/n),   q = 1;
bias[Ŷ_i^(1) | U] = O_p{max(k^{1/2}/n, k^2/n^2)},   q ≥ 2.

For q = 1 (i.e., r(·) is twice continuously differentiable), the leading order of the exact conditional bias is the same as that of the bias upper bound given in Theorem 1. For q = 2, r(·) is three times continuously differentiable on [0, 1] and the exact bias achieves a smaller order than O_p(k/n). Adding further assumptions on the differentiability of r(·), i.e., q > 2, will no longer improve the asymptotic rate of the bias. This can be seen as follows: for q ≥ 2, the bias is

bias[Ŷ_i^(1) | U] = Σ_{j=1}^{k} (U_(i+j) − U_(i−j)) [ Σ_{l=2}^{q} r^(l)(U_(i)){(U_(i+j) − U_(i))^l − (U_(i−j) − U_(i))^l}/l! + O_p{(j/n)^{q+1}} ] / Σ_{p=1}^{k} (U_(i+p) − U_(i−p))^2.

This can be split into two terms, for odd and even l ≥ 2:

bias[Ŷ_i^(1) | U] = bias_odd[Ŷ_i^(1) | U] + bias_even[Ŷ_i^(1) | U]

with

bias_odd[Ŷ_i^(1) | U] = O_p(k^2/n^2)   and   bias_even[Ŷ_i^(1) | U] = O_p(k^{1/2}/n),

resulting in

bias[Ŷ_i^(1) | U] = O_p{max(k^2/n^2, k^{1/2}/n)}.

In fixed design, bias_even = 0 due to the symmetry u_(i+j) − u_(i) = u_(i) − u_(i−j). Unfortunately, in the random design we cannot remove bias_even. It is this fact that leads to the inconsistency of third and higher order derivatives if these estimators are defined in a fully recursive way as in [21]. Due to page limitations we do not elaborate further on higher order derivative estimation; more information and theoretical results can be obtained by contacting the first author.

2.5 Boundary correction

We already discussed the proposed estimator at the interior points; in this section we provide a simple but effective correction for the boundary region.
Points with index i < k + 1 and i > n − k are located at the left and right boundary, respectively. Since there are not enough pairs of neighbors at the boundary, we instead use a weighted linear combination of k(i) pairs at the point U_(i), where k(i) = i − 1 for the left boundary and k(i) = n − i for the right boundary. The first order derivative estimator at the boundary is obtained by replacing k with k(i) in the estimator (5) and in the weights of Proposition 1. From Section 2.4, assuming r(·) is three times continuously differentiable on [0, 1], the asymptotic order of the bias at the boundary is O_p{max(k(i)^2/n^2, k(i)^{1/2}/n)}, which is smaller than at the interior points. However, the asymptotic order of the variance is O_p{3σ_e^2 (n + 1)^2 / (k(i)(k(i) + 1)(2k(i) + 1))}, which attains O_p(n^2) as i approaches either 2 or n − 1.
In order to reduce the variance at the boundary we propose the following modification to (5). For points at the left boundary, i < k + 1, the estimator becomes

Ŷ_i^(1) = Σ_{j=1}^{k(i)} w_{i,j} (Y_{i+j} − Y_{i−j}) / (U_(i+j) − U_(i−j)) + Σ_{j=k(i)+1}^{k} w_{i,j} (Y_{i+j} − Y_i) / (U_(i+j) − U_(i))   (8)

with

w_{i,j} = (U_(i+j) − U_(i−j))^2 / [ Σ_{l=1}^{k(i)} (U_(i+l) − U_(i−l))^2 + Σ_{l=k(i)+1}^{k} (U_(i+l) − U_(i))^2 ],   1 ≤ j ≤ k(i);
w_{i,j} = (U_(i+j) − U_(i))^2 / [ Σ_{l=1}^{k(i)} (U_(i+l) − U_(i−l))^2 + Σ_{l=k(i)+1}^{k} (U_(i+l) − U_(i))^2 ],   k(i) < j ≤ k.

This modification leads to

bias[Ŷ_i^(1) | U] = O_p{max( k(i)^{7/2}/(k^3 n), k(i)^5/(k^3 n^2), (k − k(i))/n )}

and

Var[Ŷ_i^(1) | U] = O_p{max( n^2/k^3, n^2 (k − k(i))^2 / k^4 )}.

The bias[Ŷ_i^(1) | U] → 0 as n → ∞, indicating that (8) is still asymptotically unbiased at the boundary. In the worst case, the variance is of order O_p(n^2/k^2), which is smaller than O_p(n^2). A similar result can be obtained for the right boundary.

2.6 Smoothed first order derivative estimation

The estimators (5) and (8) generate first order derivatives which still contain the noise coming from the unknown errors e_i, i = 1, ..., n, in model (4), and they can only be evaluated at the design points U_(i), i = 1, ..., n. In order to evaluate the derivative at an arbitrary point we propose smoothing the newly generated data set. However, from (5) it is clear that for the generated derivatives Ŷ_i^(1), i = 1, ..., n, the i.i.d. assumption is no longer valid, since each is a weighted sum of differences of the original data set. Hence, bandwidth selection for any nonparametric smoothing method becomes increasingly difficult. We rewrite estimator (5) in the form of the smoothed first order derivative

Ŷ_i^(1) = r_2^(1)(U_(i)) + ε_i,   i = 1, ..., n,

where Ŷ_i^(1) is the first order derivative given by (5), r_2^(1)(U_(i)) = r^(1)(U_(i)) + bias[Ŷ_i^(1) | U] is the target of the smoothing step (its estimate is our final smoothed first order derivative estimate), and ε_i = Σ_{j=1}^{k} w_{i,j} (e_{i+j} − e_{i−j}) / (U_(i+j) − U_(i−j)). Based on model (4), E[ε | U] = 0 and Cov(ε_i, ε_j | U_(i), U_(j)) = σ_e^2 ρ_n(U_(i) − U_(j)) for i ≠ j, with σ_e^2 < ∞, where ρ_n is a stationary correlation function satisfying ρ_n(0) = 1, ρ_n(u) = ρ_n(−u) and |ρ_n(u)| ≤ 1 for all u. The subscript n allows the correlation function ρ_n to shrink as n → ∞ [27].
Under mild assumptions on the (unknown) correlation function, [27] showed that, by using a bimodal kernel K such that K(0) = 0, the residual sum of squares (RSS) approximates the asymptotic squared error uniformly over a set of bandwidths. Consequently, choosing the bandwidth ĥ_b (of the bimodal kernel) that minimizes the RSS yields a bandwidth that asymptotically minimizes the asymptotic squared error. As bimodal kernels introduce extra error in the estimation due to their non-optimality, we overcome this issue by using ĥ_b as a pilot bandwidth and relating it to the bandwidth ĥ of a more optimal (unimodal) kernel, say the Gaussian kernel. As shown in [27], this can be achieved without any extra smoothing step. For local cubic regression, the relation between the bimodal and unimodal bandwidths is

ĥ = 1.01431 ĥ_b

when using K(u) = (2/√π) u^2 exp(−u^2) as the bimodal kernel and K(u) = (1/√(2π)) exp(−u^2/2) as the unimodal kernel, respectively. Following the proof of [1, p. 101-103], it can be shown that r̂_2^(1)(·) is a consistent estimator of r^(1)(·). In what follows, we denote r̂_2^(1)(·) by r̂^(1)(·).

2.7 Generalizing first order derivatives to any continuous distribution

It is possible to find a closed form expression for the distribution of the differences X_(i+j) − X_(i−j) with X i.i.d. ∼ F, where F is unknown and continuous [25], such that the density function f(x) = F'(x) exists. Since this result is quite unattractive from a theoretical point of view, we advocate the use of the probability integral transform, stating that

F(X) ∼ U(0, 1).   (9)

By using the probability integral transform, we know that the new data set (F(X_(1)), Y_1), ..., (F(X_(n)), Y_n) is the same as (U_(1), Y_1), ..., (U_(n), Y_n). This leads back to the original setting of uniform order statistics discussed earlier.
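The forward transform can be sketched as follows; for brevity this illustration replaces the kernel estimate of F used in the paper by the empirical CDF, an assumption made only for this sketch:

```python
import numpy as np

def to_uniform_design(x):
    """Probability integral transform U = F(X), with F replaced by an
    empirical-CDF surrogate (the paper instead uses a kernel estimate
    of F). Maps the design points to (ranks + 1) / (n + 1), i.e., to
    approximately uniform order statistics strictly inside (0, 1)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.argsort(np.argsort(x))     # rank of each X_i, 0-based
    return (ranks + 1) / (n + 1)
```

The transformed points are monotone in X and approximately uniform order statistics, so estimator (5) can be applied to the pairs (U_(i), Y_i) before transforming the derivative back to the original space.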
The final step is to transform back to the original space. In order for this step to work we need the existence of a density f. Since m(X) = r(F(X)), the chain rule gives

dm(X)/dX = (dr(U)/dU)(dU/dX) = f(X) dr(U)/dU,   (10)

yielding m^(1)(X) = f(X) r^(1)(U), which is the smoothed version of the first order derivative in the original space. In practice the distribution F and density f need to be estimated, giving m̂^(1)(X) = f̂(X) r̂^(1)(U). In this paper we use the kernel density estimator [28, 29] to estimate the density f and the distribution F.

3 Simulations

Consider the following function

m(X) = cos^2(2πX)   for X ∼ beta(2, 2),   (11)

with sample size n = 700 and e ∼ N(0, 0.2^2). We pretend we do not know the underlying distribution of X in model (11), since this is what occurs in applications. We use the kernel density estimator [30] to estimate the density f and cumulative distribution F. The tuning parameter k is selected over the integer set [1, ⌈(n − 1)/2⌉] according to Corollary 2. We use local cubic regression (p = 3) with a bimodal kernel to initially smooth the data. Bandwidths ĥ_b for the bimodal kernel are selected from the set {0.1, 0.105, 0.11, ..., 0.2} and corrected for a unimodal Gaussian kernel. Figure 1a shows the first order noisy derivatives (blue dots) and the smoothed first order derivative (dashed line) of r(·) after using the probability integral transform. Using (10), the smoothed first order derivative m̂^(1)(·) (dashed line) in the original space is shown in Figure 1b. Figure 1b also shows the true first order derivative (full line) and the derivative estimated by local quadratic regression [31] (dash-dotted line) for comparison purposes.
Compared to the local polynomial derivative estimator in Figure 1b, the proposed estimator is slightly better in the interior for model (11). However, both methods suffer from boundary effects. Next, we compare the proposed methodology to several popular methods for nonparametric derivative estimation, i.e., the local slope of the local polynomial regression estimator with p = 2 and penalized cubic smoothing splines [32]. The order of the local polynomial is set to p = 2, since p minus the order of the derivative should be odd [1], and penalized cubic smoothing splines are used for the spline derivative estimator. For the Monte Carlo study, we constructed data sets of size n = 700 and generated the functions

m(X) = √(X(1 − X)) sin{(2.1π)/(X + 0.05)}   for X ∼ U(0.25, 1)   (12)

and

m(X) = sin(2X) + 2 exp(−16X^2)   for X ∼ U(−1, 1)   (13)

100 times according to model (2), with e ∼ N(0, 0.2^2) for model (12) and e ∼ N(0, 0.3^2) for model (13). Bandwidths are selected from the set {0.04, 0.045, ..., 0.08} and corrected for a unimodal Gaussian kernel. In order to remove boundary effects for all three methods, we use the adjusted mean absolute error as a performance measure, which we define as

MAE_adjusted = (1/670) Σ_{i=16}^{685} |m̂'(X_i) − m'(X_i)|.

The results are shown in Figure 2. The proposed method loses some performance due to estimating f and F; if F(X) and f(X) were known, the proposed method would perform better.
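The adjusted criterion is straightforward to compute; a small sketch (trim = 15 points per side reproduces the i = 16, ..., 685 range used for n = 700):

```python
import numpy as np

def adjusted_mae(d_hat, d_true, trim=15):
    """Mean absolute error with `trim` points dropped at each boundary,
    mirroring the MAE_adjusted criterion of the Monte Carlo study."""
    err = np.abs(np.asarray(d_hat, dtype=float) - np.asarray(d_true, dtype=float))
    return err[trim:len(err) - trim].mean()
```

Trimming both ends removes the boundary region, where all three compared estimators are known to deteriorate, so the criterion reflects interior accuracy only.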
In general, the proposed method has performance comparable to local quadratic regression and cubic penalized smoothing splines.

Figure 1: Illustration of the proposed methodology: (a) illustration of the transformation U = F̂(X): first order noisy derivatives (dots) of model (11), after the probability integral transform of the original data, based on k = 22, with the smoothed derivative based on local cubic regression (dashed line); (b) back transform according to m̂^(1)(X) = f̂(X) r̂^(1)(U): true derivative (full line), smoothed derivative based on local cubic regression (dashed line) and local polynomial derivative with p = 2 (dash-dotted line) in the original space. Boundary points are not shown for visual purposes.

Figure 2: Results of the Monte Carlo study on (a) model (12) and (b) model (13) for the proposed methodology, local quadratic regression and penalized smoothing splines for first order derivative estimation.

4 Conclusions

In this paper we proposed a theoretical framework for first order derivative estimation based on a variance-reducing linear combination of symmetric difference quotients for random design. Although this is a popular estimator in the equispaced design case, we showed that in the random design some difficulties occur and extra estimation of unknown quantities is needed. It is also possible to extend these types of estimators to higher order derivatives, and similar theoretical results can be established.

References
[1] J. Fan and I. Gijbels. Local polynomial modelling and its applications: monographs on statistics and applied probability. Chapman & Hall/CRC Press, 1996.
[2] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
[3] A. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 2008.
[4] G. Wahba and Y.
Wang. When is the optimal regularization parameter insensitive to the choice of the loss function? Communications in Statistics-Theory and Methods, 19(5):1685-1700, 1990.
[5] H-G. Müller. Nonparametric regression analysis of longitudinal data, volume 46. Springer Science & Business Media, 2012.
[6] J. Ramsay and B. Silverman. Applied functional data analysis: methods and case studies. Springer, 2007.
[7] I. Gijbels and A-C. Goderniaux. Data-driven discontinuity detection in derivatives of a regression function. Communications in Statistics-Theory and Methods, 33(4):851-871, 2005.
[8] P. Chaudhuri and J. Marron. SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94(447):807-823, 1999.
[9] V. Rondonotti, J. Marron, and C. Park. SiZer for time series: a new approach to the analysis of trends. Electronic Journal of Statistics, 1:268-289, 2007.
[10] C. Park and K-H. Kang. SiZer analysis for the comparison of regression curves. Computational Statistics & Data Analysis, 52(8):3954-3970, 2008.
[11] R. Charnigo, M. Francoeur, M. Pinar Mengüç, A. Brock, M. Leichter, and C. Srinivasan. Derivatives of scattering profiles: tools for nanoparticle characterization. Journal of the Optical Society of America A, 24(9):2578-2589, 2007.
[12] B. Hassibi and D. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164-171, 1993.
[13] R. Eubank and P. Speckman. Confidence bands in nonparametric regression. Journal of the American Statistical Association, 88(424):1287-1301, 1993.
[14] Y. Xia. Bias-corrected confidence bands in nonparametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(4):797-811, 1998.
[15] D. Ruppert and M. Wand.
Multivariate locally weighted least squares regression. The Annals of Statistics, pages 1346-1370, 1994.
[16] H-G. Müller, U. Stadtmüller, and T. Schmitt. Bandwidth choice and confidence intervals for derivatives of noisy data. Biometrika, 74(4):743-749, 1987.
[17] N. Heckman and J. Ramsay. Penalized regression with model-based penalties. Canadian Journal of Statistics, 28(2):241-258, 2000.
[18] C. Stone. Additive regression and other nonparametric models. The Annals of Statistics, pages 689-705, 1985.
[19] S. Zhou and D. Wolfe. On derivative estimation in spline regression. Statistica Sinica, pages 93-108, 2000.
[20] W. Härdle. Applied nonparametric regression. Cambridge University Press, 1990.
[21] R. Charnigo, B. Hall, and C. Srinivasan. A generalized Cp criterion for derivative estimation. Technometrics, 53(3):238-253, 2011.
[22] K. De Brabanter, J. De Brabanter, B. De Moor, and I. Gijbels. Derivative estimation with local polynomial fitting. Journal of Machine Learning Research, 14(1):281-301, 2013.
[23] W. Wang and L. Lin. Derivative estimation based on difference sequence via locally weighted least squares regression. Journal of Machine Learning Research, 16:2617-2641, 2015.
[24] W. Dai, T. Tong, and M.G. Genton. Optimal estimation of derivatives in nonparametric regression. Journal of Machine Learning Research, 117:1-25, 2016.
[25] H.A. David and H.N. Nagaraja. Order Statistics, Third Edn. John Wiley & Sons, 2003.
[26] P. Hall, J. Kay, and D. Titterington. Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika, 77(3):521-528, 1990.
[27] K. De Brabanter, F. Cao, I. Gijbels, and J. Opsomer. Local polynomial regression with correlated errors in random design and unknown correlation structure, in press. Biometrika, 2018.
[28] M. Rosenblatt.
Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832-837, 1956.
[29] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[30] T. Duong. ks: Kernel smoothing v1.11.1, https://cran.r-project.org/web/packages/ks/index.html, 2018.
[31] J.L. Ojeda Cabrera. locpol: Kernel local polynomial regression v0.6, https://cran.r-project.org/web/packages/locpol/index.html, 2012.
[32] B. Ripley. pspline: Penalized smoothing splines v1.0-18, https://cran.r-project.org/web/packages/pspline/index.html, 2017.
", "award": [], "sourceid": 1768, "authors": [{"given_name": "Yu", "family_name": "Liu", "institution": "Iowa State University"}, {"given_name": "Kris", "family_name": "De Brabanter", "institution": "ISU"}]}