{"title": "Data driven estimation of Laplace-Beltrami operator", "book": "Advances in Neural Information Processing Systems", "page_first": 3963, "page_last": 3971, "abstract": "Approximations of Laplace-Beltrami operators on manifolds through graph Laplacians have become popular tools in data analysis and machine learning. These discretized operators usually depend on bandwidth parameters whose tuning remains a theoretical and practical problem. In this paper, we address this problem for the unormalized graph Laplacian by establishing an oracle inequality that opens the door to a well-founded data-driven procedure for the bandwidth selection. Our approach relies on recent results by Lacour and Massart (2015) on the so-called Lepski's method.", "full_text": "Data driven estimation of Laplace-Beltrami operator\n\nFr\u00e9d\u00e9ric Chazal\n\nInria Saclay\n\nPalaiseau France\n\nfrederic.chazal@inria.fr\n\nIlaria Giulini\nInria Saclay\n\nPalaiseau France\n\nilaria.giulini@me.com\n\nBertrand Michel\n\nEcole Centrale de Nantes\n\nLaboratoire de Math\u00e9matiques Jean Leray (UMR 6629 CNRS)\n\nNantes France\n\nbertrand.michel@ec-nantes.fr\n\nAbstract\n\nApproximations of Laplace-Beltrami operators on manifolds through graph Lapla-\ncians have become popular tools in data analysis and machine learning. These\ndiscretized operators usually depend on bandwidth parameters whose tuning re-\nmains a theoretical and practical problem. In this paper, we address this problem for\nthe unnormalized graph Laplacian by establishing an oracle inequality that opens\nthe door to a well-founded data-driven procedure for the bandwidth selection. Our\napproach relies on recent results by Lacour and Massart [LM15] on the so-called\nLepski\u2019s method.\n\n1\n\nIntroduction\n\nThe Laplace-Beltrami operator is a fundamental and widely studied mathematical tool carrying a\nlot of intrinsic topological and geometric information about the Riemannian manifold on which it is\nde\ufb01ned. 
Its various discretizations, through graph Laplacians, have inspired many applications in data analysis and machine learning and led to popular tools such as Laplacian EigenMaps [BN03] for dimensionality reduction, spectral clustering [VL07], or semi-supervised learning [BN04], just to name a few.\nDuring the last fifteen years, many efforts, leading to a vast literature, have been made to understand the convergence of graph Laplacian operators built on top of (random) finite samples to Laplace-Beltrami operators. For example, pointwise convergence results have been obtained in [BN05] (see also [BN08]) and [HAL07], and a (uniform) functional central limit theorem has been established in [GK06]. Spectral convergence results have also been proved by [BN07] and [VLBB08]. More recently, [THJ11] analyzed the asymptotics of a large family of graph Laplacian operators by taking the diffusion process approach previously proposed in [NLCK06].\nGraph Laplacians depend on scale or bandwidth parameters whose choice is often left to the user. Although many convergence results for various metrics have been established, little is known about how to rigorously and efficiently tune these parameters in practice. In this paper we address this problem in the case of the unnormalized graph Laplacian. More precisely, given a Riemannian manifold M of known dimension d and a function f : M \to R, we consider the standard unnormalized graph Laplacian operator defined by\n\n\hat\Delta_h f(y) = \frac{1}{n h^{d+2}} \sum_i K\left( \frac{y - X_i}{h} \right) [f(X_i) - f(y)], \quad y \in M,   (1)\n\nwhere h is a bandwidth, X_1, ..., X_n is a finite point cloud sampled on M on which the values of f can be computed, and K is the Gaussian kernel: for y \in R^m,\n\nK(y) = \frac{1}{(4\pi)^{d/2}} e^{-\|y\|_m^2/4},\n\nwhere \|y\|_m is the Euclidean norm in the ambient space R^m.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this case, previous results (see for instance [GK06]) typically say that the bandwidth parameter h in \hat\Delta_h should be taken of the order of n^{-1/(d+2+\alpha)} for some \alpha > 0, but in practice, for a given point cloud, these asymptotic results are not sufficient to choose h efficiently. In the context of neighbor graphs, [THJ11] proposes self-tuning graphs by choosing h locally in terms of the distances to the k-nearest neighbor, but note that k still needs to be chosen and, as far as we know, there is no guarantee that such a method is rate-optimal. More recently, a data driven method for spectral clustering has been proposed in [Rie15]. Cross validation [AC+10] is the standard approach for tuning parameters in statistics and machine learning. Nevertheless, the problem of choosing h in \hat\Delta_h is not easy to rewrite as a cross validation problem, in particular because there is no obvious contrast corresponding to the problem (see [AC+10]).\nThe so-called Lepski's method is another popular method for selecting the smoothing parameter of an estimator. The method has been introduced by Lepski [Lep92b, Lep93, Lep92a] for kernel estimators and local polynomials for various risks, and several improvements of the method have since been proposed, see [LMS97, GL09, GL+08]. In this paper we adapt Lepski's method for selecting h in the graph Laplacian estimator \hat\Delta_h. Our method is supported by mathematical guarantees: first we obtain an oracle inequality (Theorem 3.1), and second we obtain the correct rate of convergence (Theorem 3.3), already proved in the asymptotic studies of [BN05] and [GK06] for non data-driven choices of the bandwidth.
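As an illustration of how the estimator of eq. (1) can be computed on a point cloud, here is a minimal sketch (the function and variable names are ours, not from the paper), evaluating \hat\Delta_h f at the sample points themselves:

```python
import numpy as np

def graph_laplacian(X, f, h, d):
    """Unnormalized graph Laplacian of eq. (1), evaluated at each sample point.

    X : (n, m) array, point cloud sampled on the manifold M
    f : callable R^m -> R, the smooth function
    h : bandwidth
    d : intrinsic dimension of M (assumed known, as in the paper)
    """
    n = X.shape[0]
    fX = np.array([f(x) for x in X])                       # f(X_i)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)  # ||X_j - X_i||^2
    # Gaussian kernel K(u) = (4 pi)^{-d/2} exp(-||u||^2 / 4) at u = (X_j - X_i)/h
    K = (4.0 * np.pi) ** (-d / 2.0) * np.exp(-sq / (4.0 * h ** 2))
    # (1 / (n h^{d+2})) sum_i K((y - X_i)/h) [f(X_i) - f(y)], with y = X_j
    return (K @ fX - K.sum(axis=1) * fX) / (n * h ** (d + 2))
```

For a constant f the estimate vanishes identically, which gives a cheap sanity check; choosing h well is precisely the problem addressed in what follows.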
Our approach follows the ideas recently proposed in [LM15], but for the specific problem of Laplacian operators on smooth manifolds. In this first work about the data-driven estimation of the Laplace-Beltrami operator, we focus as in [BN05] and [GK06] on the pointwise estimation problem: we consider a smooth function f on M and the aim is to estimate \Delta_P f for the L2-norm \| \cdot \|_{2,M} on M \subset R^m. The data driven method presented here may be adapted and generalized to other types of risks (uniform norms on functional families and convergence of the spectrum) and other types of graph Laplacian operators; this will be the subject of future work.\nThe paper is organized as follows: Lepski's method is introduced in Section 2. The main results are stated in Section 3 and a sketch of their proof is given in Section 4 (the complete proofs are given in the supplementary material). A numerical illustration and a discussion about the proposed method are given in Sections 5 and 6 respectively.\n\n2 Lepski's procedure for estimating the Laplace-Beltrami operator\n\nAll the Riemannian manifolds considered in the paper are smooth compact d-dimensional submanifolds (without boundary) of R^m endowed with the Riemannian metric induced by the Euclidean structure of R^m. Recall that, given a compact d-dimensional smooth Riemannian manifold M with volume measure \mu, its Laplace-Beltrami operator is the linear operator \Delta defined on the space of smooth functions on M as \Delta(f) = -div(\nabla f), where \nabla f is the gradient vector field and div the divergence operator.
In other words, using Stokes' formula, \Delta is the unique linear operator satisfying\n\n\int_M \|\nabla f\|^2 d\mu = \int_M \Delta(f) f d\mu.\n\nReplacing the volume measure \mu by a distribution P which is absolutely continuous with respect to \mu, the weighted Laplace-Beltrami operator \Delta_P is defined as\n\n\Delta_P f = \Delta f + \frac{1}{p} \langle \nabla p, \nabla f \rangle,   (2)\n\nwhere p is the density of P with respect to \mu. The reader may refer to classical textbooks such as, e.g., [Ros97] or [Gri09] for a general and detailed introduction to Laplace operators on manifolds.\nIn the following, we assume that we are given n points X_1, ..., X_n sampled on M according to the distribution P. Given a smooth function f on M, the aim is to estimate \Delta_P f by selecting an estimator in a given finite family of graph Laplacians (\hat\Delta_h f)_{h \in H}, where H is a finite family of bandwidth parameters.\nLepski's procedure is generally presented as a method for selecting a bandwidth in an adaptive way. More generally, this method can be seen as an estimator selection procedure.\n\n2.1 Lepski's procedure\n\nWe first shortly explain the ideas of Lepski's method. Consider a target quantity s, a collection of estimators (\hat s_h)_{h \in H} and a loss function \ell(\cdot, \cdot). A standard objective when selecting \hat s_h is to minimize the risk E \ell(s, \hat s_h) among the family of estimators. In most settings, the risk of an estimator can be decomposed into a bias part and a variance part. Of course neither the risk, the bias nor the variance of an estimator is known in practice. However, in many cases the variance term can be controlled quite precisely. Lepski's method requires that the variance of each estimator \hat s_h can be tightly upper bounded by a quantity v(h).
In most cases, the bias can be written as \ell(s, \bar s_h), where \bar s_h corresponds to some (deterministic) averaged version of \hat s_h. It thus seems natural to estimate \ell(s, \bar s_h) by \ell(\hat s_{h'}, \hat s_h) for some h' smaller than h. The latter quantity incorporates some randomness while the bias does not. The idea is to remove the "random part" of the estimation by considering [\ell(\hat s_{h'}, \hat s_h) - v(h) - v(h')]_+, where [\cdot]_+ denotes the positive part. The bias term is estimated by considering all pairs of estimators (\hat s_h, \hat s_{h'}) through the quantity \sup_{h' \le h} [\ell(\hat s_{h'}, \hat s_h) - v(h) - v(h')]_+. Finally, the estimator minimizing the sum of the estimated bias and variance is selected, see eq. (3) below.\nIn our setting, the control of the variance of the graph Laplacian estimators \hat\Delta_h is not tight enough to directly apply the above described method. To overcome this issue, we use a more flexible version of Lepski's method that involves some multiplicative coefficients a and b introduced in the variance and bias terms. More precisely, let V(h) = V_f(h) be an upper bound for E[\|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M}^2]. The bandwidth \hat h selected by our Lepski's procedure is defined by\n\n\hat h = \hat h_f = \arg\min_{h \in H} \{ B(h) + b V(h) \}   (3)\n\nwhere\n\nB(h) = B_f(h) = \max_{h' \le h, h' \in H} \left[ \|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 - a V(h') \right]_+   (4)\n\nwith 0 < a \le b.
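In code, the selection rule of eqs. (3)-(4) might look as follows. This is a hedged sketch, not the paper's implementation: `estimates[h]` stands for the values of \hat\Delta_h f at some evaluation points (so the squared L2 distance between two estimators is approximated by a mean of squared differences) and `V` is the variance bound:

```python
import numpy as np

def lepski_select(estimates, H, V, a, b):
    """Bandwidth selection of eqs. (3)-(4).

    estimates : dict mapping each bandwidth h to an array of estimator values
    H         : iterable of candidate bandwidths
    V         : callable h -> variance bound V(h)
    a, b      : constants with 0 < a <= b
    """
    H = sorted(H)
    def dist2(h1, h2):  # stand-in for the squared L2 distance between estimators
        return float(np.mean((estimates[h1] - estimates[h2]) ** 2))
    def B(h):  # eq. (4): positive part of the bias proxy, maximized over h' <= h
        return max(max(dist2(hp, h) - a * V(hp), 0.0) for hp in H if hp <= h)
    crit = {h: B(h) + b * V(h) for h in H}  # eq. (3)
    return min(crit, key=crit.get)
```

When all estimators agree, the bias proxy B(h) vanishes and the rule simply picks the bandwidth with the smallest variance bound.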
The calibration of the constants a and b in practice is beyond the scope of this paper, but we suggest a heuristic procedure inspired from [LM15] in Section 5.\n\n2.2 Variance of the graph Laplacian for smooth functions\n\nIn order to control the variance term, we consider for this paper the set F of smooth functions f : M \to R uniformly bounded up to the third order. For some constant C_F > 0, let\n\nF = \{ f \in C^3(M, R) : \|f^{(k)}\|_\infty \le C_F, k = 0, ..., 3 \}.   (5)\n\nWe introduce some notation before giving the variance term for f \in F. Define\n\nD_\alpha = \frac{1}{(4\pi)^{d/2}} \int_{R^d} \left( C \|u\|_d^{\alpha+2} + C_1 \|u\|_d^\alpha \right) e^{-\|u\|_d^2/4} du   (6)\n\n\tilde D_\alpha = \frac{1}{(4\pi)^{d/2}} \int_{R^d} \left( C \|u\|_d^{\alpha+2} + C_1 \|u\|_d^\alpha \right) e^{-\|u\|_d^2/8} du   (7)\n\nwhere C and C_1 are geometric constants that only depend on the metric structure of M (see Appendix). We also introduce the d-dimensional Gaussian kernel on R^d:\n\nK_d(u) = \frac{1}{(4\pi)^{d/2}} e^{-\|u\|_d^2/4}, \quad u \in R^d,\n\nand we denote by \| \cdot \|_{p,d} the L^p-norm on R^d. The next proposition provides an explicit bound V(h) on the variance term.\n\nProposition 2.1.
Given h \in H, for any f \in F, we have\n\nV(h) = \frac{C_F^2}{n h^{d+2}} \left( 2 \omega_d \|K_d\|_{2,d}^2 + \alpha_d(h) \right),\n\nwhere\n\n\alpha_d(h) = h^2 \left( 2 D_4 + D + 3 \omega_d \|K_d\|_{2,d}^2 \right) + h^4 \frac{D_6 + 3D}{2},   (8)\n\nwith D = \frac{3 \mu(M)}{(4\pi)^{d/2}} and \omega_d = 3 \times 2^{d/2-1}.\n\n3 Results\n\nWe now give the main result of the paper: an oracle inequality for the estimator \hat\Delta_{\hat h}, or in other words, a bound on the risk that shows that the performance of the estimator is almost as good as it would be if we knew the risks of each estimator. In particular it performs an (almost) optimal trade-off between the variance term V(h) and the approximation term\n\nD(h) = D_f(h) = \max\left\{ \|(p \Delta_P - E[\hat\Delta_h]) f\|_{2,M} , \sup_{h' \le h} \|(E[\hat\Delta_{h'}] - E[\hat\Delta_h]) f\|_{2,M} \right\} \le 2 \sup_{h' \le h} \|(p \Delta_P - E[\hat\Delta_{h'}]) f\|_{2,M}.\n\nTheorem 3.1. According to the notation introduced in the previous section, let \epsilon = \sqrt{a/2} - 1 and\n\n\delta(h) = \sum_{h' \le h} \max\left\{ \exp\left( -\min\{\epsilon^2, \epsilon\} \sqrt{n} / 24 \right) , \exp\left( -c \epsilon^2 \gamma_d(h') \right) \right\},\n\nwhere c > 0 is an absolute constant and\n\n\gamma_d(h') = \frac{1}{\|p\|_\infty h'^d} \, \frac{2 \omega_d \|K_d\|_{2,d}^2 + \alpha_d(h')}{(2 \omega_d \|K_d\|_{1,d} + \beta_d(h))^2},\n\nwith \alpha_d defined by (8) and\n\n\beta_d(h) = 2 h \omega_d \|K_d\|_{1,d} + h^2 (2 \tilde D_3 + D) + h^3 (\tilde D_4 + D).\n\nGiven f \in C^2(M, R), with probability at least 1 - 2 \sum_{h \in H} \delta(h),\n\n\|(p \Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le \inf_{h \in H} \left\{ 3 D(h) + (1 + \sqrt{2}) \sqrt{b V(h)} \right\}.   (9)\n\nBroadly speaking, Theorem 3.1 says that there exists an event of large probability on which the estimator selected by Lepski's method is almost as good as the best estimator in the collection. Note that the size of the bandwidth family H has an impact on the probability term 1 - 2 \sum_{h \in H} \delta(h). If H is not too large, an oracle inequality for the risk of \hat\Delta_{\hat h} f can be easily deduced from the latter result.\nHenceforth we assume that f \in F. We first give a control on the approximation term D(h).\n\nProposition 3.2. Assume that the density p is C^2. It holds that\n\nD(h) \le \gamma C_F h,\n\nwhere C_F is defined in eq. (5) and \gamma > 0 is a constant depending on \|p\|_\infty, \|p'\|_\infty, \|p''\|_\infty and on M.\n\nWe consider the following grid of bandwidths:\n\nH = \{ e^{-k} : \lceil \log \log(n) \rceil \le k \le \lfloor \log(n) \rfloor \}.   (10)\n\nThe previous results lead to the pointwise rate of convergence of the graph Laplacian selected by Lepski's method:\n\nTheorem 3.3. Assume that the density p is C^2. For any f \in F, we have\n\nE\left[ \|(p \Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \right] \lesssim n^{-1/(d+4)}.\n\n4 Sketch of the proof of Theorem 3.1\n\nWe observe that the following inequality holds:\n\n\|(p \Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} + \sqrt{2 (B(h) + b V(h))}.   (11)\n\nIndeed, for h \in H,\n\n\|(p \Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le \|(p \Delta_P - \Delta_h) f\|_{2,M} + \|(\Delta_h - \hat\Delta_h) f\|_{2,M} + \|(\hat\Delta_h - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \|(\Delta_h - \hat\Delta_h) f\|_{2,M} + \|(\hat\Delta_h - \hat\Delta_{\hat h}) f\|_{2,M}.\n\nBy definition of B(h), for any h' \le h,\n\n\|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 \le B(h) + a V(h') \le B(\max\{h, h'\}) + a V(\min\{h, h'\}),\n\nso that, according to the definition of \hat h in eq. (3) and recalling that a \le b,\n\n\|(\hat\Delta_{\hat h} - \hat\Delta_h) f\|_{2,M}^2 \le 2 [B(h) + a V(h)] \le 2 [B(h) + b V(h)],\n\nwhich proves eq. (11).\n\nWe are now going to bound the terms that appear in eq. (11). The bound for D(h) is already given in Proposition 3.2, so in the following we focus on B(h) and \|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M}. More precisely, the bounds we present in the next two propositions are based on the following lemma from [LM15].\n\nLemma 4.1. Let X_1, ..., X_n be an i.i.d. sequence of variables. Let \tilde S be a countable set of functions and let \eta(s) = \frac{1}{n} \sum_i [g_s(X_i) - E[g_s(X_i)]] for any s \in \tilde S. Assume that there exist constants \theta and v_g such that for any s \in \tilde S, \|g_s\|_\infty \le \theta and Var[g_s(X)] \le v_g. Denote H = E[\sup_{s \in \tilde S} \eta(s)]. Then for any \epsilon > 0 and any H' \ge H,\n\nP\left[ \sup_{s \in \tilde S} \eta(s) \ge (1 + \epsilon) H' \right] \le \max\left\{ \exp\left( -\epsilon^2 n H'^2 / (6 v_g) \right) , \exp\left( -\min\{\epsilon, 1\} \epsilon n H' / (24 \theta) \right) \right\}.\n\nProposition 4.2. Let \epsilon = \sqrt{a/2} - 1. Given h \in H, define\n\n\delta_1(h) = \sum_{h' \le h} \max\left\{ \exp\left( -\min\{\epsilon^2, \epsilon\} \sqrt{n} / 24 \right) , \exp\left( -\frac{2 \epsilon^2}{3} \gamma_d(h') \right) \right\}.\n\nWith probability at least 1 - \delta_1(h),\n\nB(h) \le 2 D(h)^2.\n\nProposition 4.3. Let \tilde\epsilon = \sqrt{a} - 1. Given h \in H, define\n\n\delta_2(h) = \max\left\{ \exp\left( -\min\{\tilde\epsilon^2, \tilde\epsilon\} \sqrt{n} / 24 \right) , \exp\left( -\frac{\tilde\epsilon^2}{24} \gamma_d(h) \right) \right\}.\n\nWith probability at least 1 - \delta_2(h),\n\n\|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} \le \sqrt{a V(h)}.\n\nCombining the above propositions with eq. (11), we get that, for any h \in H, with probability at least 1 - (\delta_1(h) + \delta_2(h)),\n\n\|(p \Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \sqrt{a V(h)} + \sqrt{4 D(h)^2 + 2 b V(h)} \le 3 D(h) + (1 + \sqrt{2}) \sqrt{b V(h)},\n\nwhere we have used the fact that a \le b. Taking a union bound on h \in H we conclude the proof.\n\n5 Numerical illustration\n\nIn this section we illustrate the results of the previous section on a simple example.
In Section 5.1, we describe a practical procedure when the data set X is sampled according to the uniform measure on M. A numerical illustration is given in Section 5.2 when M is the unit 2-dimensional sphere in R^3.\n\n5.1 Practical application of Lepski's method\n\nLepski's method presented in Section 2 cannot be directly applied in practice for two reasons. First, we cannot compute the L2-norm \| \cdot \|_{2,M} on M, the manifold M being unknown. Second, the variance terms involved in Lepski's method are not completely explicit.\nRegarding the first issue, we can approximate \| \cdot \|_{2,M} by splitting the data into two samples: an estimation sample X1 for computing the estimators and a validation sample X2 for evaluating this norm. More precisely, given two estimators \hat\Delta_h f and \hat\Delta_{h'} f computed using X1, the quantity \|(\hat\Delta_h - \hat\Delta_{h'}) f\|_{2,M}^2 / \mu(M) is approximated by the averaged sum \frac{1}{n_2} \sum_{x \in X2} |\hat\Delta_h f(x) - \hat\Delta_{h'} f(x)|^2, where n_2 is the number of points in X2. We use these approximations to evaluate the bias terms B(h) defined by (4).\nThe second issue comes from the fact that the variance terms involved in Lepski's method depend on the metric properties of the manifold and on the sampling density, which are both unknown. These variance terms are thus only known up to a multiplicative constant. This situation contrasts with more standard frameworks for which a tight and explicit control on the variance terms can be proposed, as in [Lep92b, Lep93, Lep92a]. To address this second issue, we follow the calibration strategy recently proposed in [LM15] (see also [LMR16]). In practice we remove all the multiplicative constants from V(h): all these constants are passed into the terms a and b.
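The bandwidth-jump heuristic detailed just below can be sketched as a simple scan over a. In this hedged sketch, `select` stands for any (hypothetical) routine returning \hat h(a, b), e.g. the constant-free Lepski criterion with V(h) replaced by 1/(n h^4):

```python
import numpy as np

def bandwidth_jump(a_grid, select):
    """Calibration heuristic of Section 5.1: record h_hat(a, a) along a
    decreasing grid of a values, locate the main bandwidth jump at a_0,
    and return the final model h_hat(a_0, 2 a_0).

    select : callable (a, b) -> selected bandwidth h_hat(a, b)
    """
    a_grid = np.sort(np.asarray(a_grid, dtype=float))[::-1]  # large a first
    path = np.array([select(a, a) for a in a_grid])          # h_hat(a, a)
    jumps = np.abs(np.diff(path))                            # size of each jump
    a0 = float(a_grid[1:][np.argmax(jumps)])                 # location of main jump
    return a0, select(a0, 2.0 * a0)                          # model h_hat(a0, 2 a0)
```

The scan returns both the jump location a_0 and the finally selected bandwidth, mirroring steps 1-3 of the heuristic below.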
This means that we rewrite Lepski's method as follows:\n\n\hat h(a, b) = \arg\min_{h \in H} \left\{ B(h) + b \frac{1}{n h^4} \right\}\n\nwhere\n\nB(h) = \max_{h' \le h, h' \in H} \left[ \|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 - a \frac{1}{n h'^4} \right]_+.\n\nWe choose a and b according to the following heuristic:\n\n1. Take b = a and consider the sequence of selected models \hat h(a, a);\n2. Starting from large values of a, make a decrease and find the location a_0 of the main bandwidth jump in the step function a \mapsto \hat h(a, a);\n3. Select the model \hat h(a_0, 2 a_0).\n\nThe justification of this calibration method is currently the subject of mathematical studies ([LM15]). Note that a similar strategy called the "slope heuristic" has been proposed for calibrating \ell_0 penalties in various settings and is supported by strong mathematical results, see for instance [BM07, AM09, BMM12].\n\n5.2 Illustration on the sphere\n\nIn this section we illustrate the complete method on a simple example with data points generated uniformly on the sphere S^2 in R^3. In this case, the weighted Laplace-Beltrami operator is equal to the (non weighted) Laplace-Beltrami operator on the sphere.\nWe consider the function f(x, y, z) = (x^2 + y^2 + z) \sin x \cos x. The restriction of this function to the sphere has the following representation in spherical coordinates:\n\n\tilde f(\theta, \phi) = (\sin^2 \phi + \cos \phi) \sin(\sin \phi \cos \theta) \cos(\sin \phi \cos \theta).\n\nIt is well known that the Laplace-Beltrami operator on the sphere satisfies (see Section 3 in [Gri09]):\n\n\Delta_{S^2} u = \frac{1}{\sin^2 \phi} \frac{\partial^2 u}{\partial \theta^2} + \frac{1}{\sin \phi} \frac{\partial}{\partial \phi} \left( \sin \phi \frac{\partial u}{\partial \phi} \right)\n\nfor any smooth polar function u.
This allows us to derive an analytic expression of \Delta_{S^2} \tilde f.\n\nWe sample n_1 = 10^6 points on the sphere for computing the graph Laplacians and we use n_2 = 10^3 points for approximating the norms \|(\hat\Delta_h - \hat\Delta_{h'}) \tilde f\|_{2,M}^2. We compute the graph Laplacians for bandwidths in a grid H between 0.001 and 0.8 (see Fig. 1). The risk of each graph Laplacian is estimated by a standard Monte Carlo procedure (see Fig. 2).\n\nFigure 1: Choosing h is crucial for estimating \Delta_{S^2} \tilde f: a small bandwidth overfits \Delta_{S^2} \tilde f whereas a large bandwidth leads to almost constant approximations of \Delta_{S^2} \tilde f.\n\nFigure 2: Estimation of the risk of each graph Laplacian operator: the oracle Laplacian is at approximately h = 0.15.\n\nFigure 3 illustrates the calibration method. On this picture, the x-axis corresponds to the values of a and the y-axis represents the bandwidths. The blue step function represents the function a \mapsto \hat h(a, a). The red step function gives the model selected by the rule a \mapsto \hat h(a, 2a). Following the heuristic given in Section 5.1, one could take for this example the value a_0 \approx 3.5 (location of the bandwidth jump for the blue curve), which leads to selecting the model \hat h(a_0, 2 a_0) \approx 0.2 (red curve).\n\nFigure 3: Bandwidth jump heuristic: find the location of the jump (blue curve) and deduce the selected bandwidth with the red curve.\n\n6 Discussion\n\nThis paper is a first attempt at a complete and well-founded data driven method for inferring Laplace-Beltrami operators from data points. Our results suggest various extensions and raise some questions of interest. For instance, other versions of the graph Laplacian have been studied in the literature (see e.g. [HAL07, BN08]), for instance when the data is not sampled uniformly.
It would be relevant to propose a bandwidth selection method for these alternative estimators as well.\nFrom a practical point of view, as explained in Section 5, there is a gap between the theory we obtain in the paper and what can be done in practice. To fill this gap, a first objective is to prove an oracle inequality in the spirit of Theorem 3.1 for a bias term defined in terms of the empirical norms computed in practice. A second objective is to propose mathematically well-founded heuristics for the calibration of the parameters a and b.\nTuning bandwidths for the estimation of the spectrum of the Laplace-Beltrami operator is a difficult but important problem in data analysis. We are currently working on the adaptation of our results to the case of operator norms and spectrum estimation.\n\nAppendix: the geometric constants C and C1\n\nThe following classical lemma (see, e.g., [GK06, Prop. 2.2 and Eq. 3.20]) relates the constants C and C_1 introduced in Equations (6) and (7) to the geometric structure of M.\n\nLemma 6.1. There exist constants C, C_1 > 0 and a positive real number r > 0 such that for any x \in M and any v \in T_x M such that \|v\| \le r,\n\n\|v\|_d^2 - C \|v\|_d^4 \le \|E_x(v) - x\|_m^2 \le \|v\|_d^2 \quad and \quad \left| \sqrt{\det(g_{ij})}(v) - 1 \right| \le C_1 \|v\|_d^2,   (12)\n\nwhere E_x : T_x M \to M is the exponential map and (g_{ij})_{i,j \in \{1, ..., d\}} are the components of the metric tensor in any normal coordinate system around x.\n\nAlthough the proof of the lemma is beyond the scope of this paper, notice that one can indeed give explicit bounds on r and C in terms of the reach and injectivity radius of the submanifold M.\n\nAcknowledgments\n\nThe authors are grateful to Pascal Massart for helpful discussions on Lepski's method. This work was supported by the ANR project TopData ANR-13-BS01-0008 and ERC Gudhi No. 339025.\n\nReferences\n\n[AC+10] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.\n\n[AM09] Sylvain Arlot and Pascal Massart. Data-driven calibration of penalties for least-squares regression. The Journal of Machine Learning Research, 10:245–279, 2009.\n\n[BM07] Lucien Birgé and Pascal Massart. Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33–73, 2007.\n\n[BMM12] Jean-Patrick Baudry, Cathy Maugis, and Bertrand Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470, 2012.\n\n[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.\n\n[BN04] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.\n\n[BN05] Mikhail Belkin and Partha Niyogi.
Towards a theoretical foundation for Laplacian-based manifold methods. In Learning Theory, pages 486–500. Springer, 2005.\n\n[BN07] Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. Advances in Neural Information Processing Systems, 19:129, 2007.\n\n[BN08] Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.\n\n[GK06] E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace–Beltrami operators: large sample results. In High Dimensional Probability, pages 238–259. IMS, 2006.\n\n[GL+08] Alexander Goldenshluger, Oleg Lepski, et al. Universal pointwise selection rule in multivariate function estimation. Bernoulli, 14(4):1150–1190, 2008.\n\n[GL09] A. Goldenshluger and O. Lepski. Structural adaptation via Lp-norm oracle inequalities. Probability Theory and Related Fields, 143(1-2):41–71, 2009.\n\n[Gri09] Alexander Grigoryan. Heat Kernel and Analysis on Manifolds, volume 47. American Mathematical Soc., 2009.\n\n[HAL07] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(6), 2007.\n\n[Lep92a] Oleg V. Lepski. On problems of adaptive estimation in white Gaussian noise. Topics in Nonparametric Estimation, 12:87–106, 1992.\n\n[Lep92b] O. V. Lepskii. Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.\n\n[Lep93] O. V. Lepskii. Asymptotically minimax adaptive estimation. II. Schemes without optimal adaptation: adaptive estimators. Theory of Probability & Its Applications, 37(3):433–448, 1993.\n\n[LM15] Claire Lacour and Pascal Massart. Minimal penalty for the Goldenshluger-Lepski method.
arXiv preprint arXiv:1503.00946, 2015.\n\n[LMR16] Claire Lacour, Pascal Massart, and Vincent Rivoirard. Estimator selection: a new method with applications to kernel density estimation. arXiv preprint arXiv:1607.05091, 2016.\n\n[LMS97] Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics, pages 929–947, 1997.\n\n[NLCK06] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21(1):113–127, 2006.\n\n[Rie15] Antonio Rieser. A topological approach to spectral clustering. arXiv preprint arXiv:1506.02633, 2015.\n\n[Ros97] Steven Rosenberg. The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. Number 31. Cambridge University Press, 1997.\n\n[THJ11] Daniel Ting, Ling Huang, and Michael Jordan. An analysis of the convergence of graph Laplacians. arXiv preprint arXiv:1101.5435, 2011.\n\n[VL07] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.\n\n[VLBB08] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555–586, 2008.", "award": [], "sourceid": 1976, "authors": [{"given_name": "Frederic", "family_name": "Chazal", "institution": "INRIA"}, {"given_name": "Ilaria", "family_name": "Giulini", "institution": "INRIA and Paris Diderot"}, {"given_name": "Bertrand", "family_name": "Michel", "institution": "UPMC"}]}