{"title": "Robust Nonparametric Regression with Metric-Space Valued Output", "book": "Advances in Neural Information Processing Systems", "page_first": 718, "page_last": 726, "abstract": "Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including a robust median type estimator and the classical Frechet mean. Depending on the choice of the output space and the chosen metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimator with experiments.", "full_text": "Robust Nonparametric Regression with Metric-Space valued Output\n\nMatthias Hein\n\nDepartment of Computer Science, Saarland University\n\nCampus E1 1, 66123 Saarbr\u00fccken, Germany\n\nhein@cs.uni-sb.de\n\nAbstract\n\nMotivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. 
We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments.\n\n1 Introduction\n\nIn recent years there has been an increasing interest in learning with output which differs from the case of standard classification and regression. The need for such approaches arises in several applications which possess more structure than the standard scenarios can model. In structured output learning, see [1, 2, 3] and references therein, one generalizes multiclass classification to more general discrete output spaces, in particular incorporating structure of the joint input and output space. These methods have been successfully applied in areas like computational biology, natural language processing and information retrieval. On the other hand there has been a recent series of work which generalizes regression with multivariate output to the case where the output space is a Riemannian manifold, see [4, 5, 6, 7], with applications in signal processing, computer vision, computer graphics and robotics. One can also see this branch as structured output learning if one thinks of a Riemannian manifold as isometrically embedded in a Euclidean space. Then the restriction that the output has to lie on the manifold can be interpreted as constrained regression in Euclidean space, where the constraints couple several output features together.\n\nIn this paper we propose a family of kernel estimators for regression with metric-space valued input and output motivated by estimators proposed in [6, 8] for manifold-valued regression. We discuss loss functions and the corresponding Bayesian decision theory for this general regression problem. Moreover, we show that this family of estimators has several well-known estimators as special cases for certain choices of the output space and its metric. 
However, our main emphasis lies on the problem of regression with manifold-valued input and output which includes the multivariate Euclidean case. In particular, we show for all our proposed estimators their pointwise and Bayes consistency, that is, in the limit as the sample size goes to infinity the estimated mapping converges to the Bayes optimal mapping. This includes estimators implementing several robust loss functions like the L1-loss, Huber loss or the \u03b5-insensitive loss. This generality is possible since our proof considers directly the functional which is minimized instead of its minimizer as it is usually done in consistency proofs of the Nadaraya-Watson estimator. Finally, we conclude with a toy experiment illustrating the robustness properties and difference of the estimators.\n\n2 Bayesian decision theory and loss functions for metric-space valued output\n\nWe consider the structured output learning problem where the task is to learn a mapping \u03c6 : M \u2192 N between two metric spaces M and N, where dM denotes the metric of M and dN the metric of N. We assume that both metric spaces M and N are separable^1. In general, we are in a statistical setting where the given input/output pairs (Xi, Yi) are i.i.d. samples from a probability measure P on M \u00d7 N.\n\nIn order to prove later on consistency of our metric-space valued estimator we first have to define the Bayes optimal mapping \u03c6\u2217 : M \u2192 N in the case where M and N are general metric spaces which depends on the employed loss function. In multivariate regression the most common loss function is L(y, f(x)) = \u2016y \u2212 f(x)\u2016_2^2. However, it is well known that this loss is sensitive to outliers. In univariate regression one therefore uses the L1-loss or other robust loss functions like the Huber or \u03b5-insensitive loss. 
For the L1-loss the Bayes optimal function f\u2217 is given as f\u2217(x) = Med[Y |X = x], where Med denotes the median of P(Y |X = x) which is a robust location measure. Several generalizations of the median for multivariate output have been proposed, see e.g. [9]. In this paper we refer to the minimizer of the loss function L(y, f(x)) = \u2016y \u2212 f(x)\u2016_{R^n} resp. L(y, f(x)) = dN(y, f(x)) as the (generalized) median, since this seems to be the only generalization of the univariate median which has a straightforward extension to metric spaces. In analogy to the Euclidean case, we will therefore use loss functions penalizing the distance between predicted output and desired output:\n\nL(y, \u03c6(x)) = \u0393(dN(y, \u03c6(x))), y \u2208 N, x \u2208 M,\n\nwhere \u0393 : R+ \u2192 R+. We will later on restrict \u0393 to a certain family of functions. The associated risk (or expected loss) is R\u0393(\u03c6) = E[L(Y, \u03c6(X))] and its Bayes optimal mapping \u03c6\u2217\u0393 : M \u2192 N can then be determined by\n\n\u03c6\u2217\u0393 := arg min_{\u03c6:M\u2192N, \u03c6 measurable} R\u0393(\u03c6) = arg min_{\u03c6:M\u2192N, \u03c6 measurable} E[\u0393(dN(Y, \u03c6(X)))] = arg min_{\u03c6:M\u2192N, \u03c6 measurable} EX[EY|X[\u0393(dN(Y, \u03c6(X))) | X]]. (1)\n\nIn the second step we used a result of [10] which states that a joint probability measure on the product of two separable metric spaces can always be factorized into a conditional probability measure and the marginal. In order that the risk is well-defined, we assume that there exists a measurable mapping \u03c6 : M \u2192 N so that E[\u0393(dN(Y, \u03c6(X)))] < \u221e. This holds always once N has bounded diameter.\n\nApart from the global risk R\u0393(\u03c6) we analyze for each x \u2208 M the pointwise risk\n\nR'\u0393(x, \u03c6(x)) = EY|X[\u0393(dN(Y, \u03c6(X))) | X = x],\n\nwhich measures the loss suffered by predicting \u03c6(x) for the input x \u2208 M. The total loss R\u0393(\u03c6) of the mapping \u03c6 is then R\u0393(\u03c6) = E[R'\u0393(X, \u03c6(X))]. As in standard regression the factorization allows to find the Bayes optimal mapping \u03c6\u2217 pointwise,\n\n\u03c6\u2217\u0393(x) = arg min_{p\u2208N} R'\u0393(x, p) = arg min_{p\u2208N} E[\u0393(dN(Y, p)) | X = x] = arg min_{p\u2208N} \u222b_N \u0393(dN(y, p)) d\u00b5x(y),\n\nwhere d\u00b5x is the conditional probability of Y conditioned on X = x. Later on we prove consistency for a set of kernel estimators each using a different loss function \u0393 from the following class of functions.\n\nDefinition 1 A convex function \u0393 : R+ \u2192 R+ is said to be (\u03b1, s)-bounded if\n\n\u2022 \u0393 : R+ \u2192 R+ is continuously differentiable, monotonically increasing and \u0393(0) = 0,\n\u2022 \u0393(2x) \u2264 \u03b1 \u0393(x) for x \u2265 s and \u0393(s) > 0 and \u0393'(s) > 0.\n\nSeveral functions \u0393 corresponding to standard loss functions in regression are (\u03b1, s)-bounded:\n\n\u2022 Lp-type loss: \u0393(x) = x^\u03b3 for \u03b3 \u2265 1 is (2^\u03b3, 1)-bounded,\n\u2022 Huber-loss: \u0393(x) = 2x^2/\u03b5 for x \u2264 \u03b5/2 and \u0393(x) = 2x \u2212 \u03b5/2 for x > \u03b5/2 is (3, \u03b5/2)-bounded,\n\u2022 \u03b5-insensitive loss: \u0393(x) = 0 for x \u2264 \u03b5 and \u0393(x) = x \u2212 \u03b5 if x > \u03b5 is (3, 2\u03b5)-bounded.\n\n^1 A metric space is separable if it contains a countable dense subset.\n\nWhile uniqueness of the minimizer of the pointwise loss functional R'\u0393(x,\u00b7) cannot be guaranteed anymore in the case of metric space valued output, the following lemma shows that R'\u0393(x,\u00b7) has reasonable properties (all longer proofs can be found in Section 7 or in the supplementary material). It generalizes a result provided in [11] for \u0393(x) = x^2 to all (\u03b1, s)-bounded losses.\n\nLemma 1 Let N be a complete and separable metric space such that d(x, y) < \u221e for all x, y \u2208 N and every closed and bounded set is compact. If \u0393 is (\u03b1, s)-bounded and R'\u0393(x, q) < \u221e for some q \u2208 N, then\n\n\u2022 R'\u0393(x, p) < \u221e for all p \u2208 N,\n\u2022 R'\u0393(x,\u00b7) is continuous on N,\n\u2022 The set of minimizers Q\u2217 = arg min_{q\u2208N} R'\u0393(x, q) exists and is compact.\n\nIt is interesting to have a look at one special loss, the case \u0393(x) = x^2. The minimizer of the pointwise risk,\n\nF(p) = arg min_{p\u2208N} \u222b_N dN^2(y, p) d\u00b5x(y),\n\nis called the Fr\u00e9chet mean^2 or Karcher mean in the case where N is a manifold. It is the generalization of a mean in Euclidean space to a general metric space. Unfortunately, it need no longer be unique as in the Euclidean case. A simple example is the sphere as the output space together with a uniform probability measure on it. In this case every point p on the sphere attains the same value F(p) and thus the global minimum is non-unique. We refer to [12, 13, 11] for more information under which conditions one can prove uniqueness of the global minimizer if N is a Riemannian manifold. The generalization of the median to Riemannian manifolds, that is \u0393(x) = x, is discussed in [9, 4, 8]. 
For a discussion of the computation of the median in general metric spaces see [14].\n\n3 A family of kernel estimators with metric-space valued input and output\n\nIn the following we provide the definition of the kernel estimator with metric-space valued output motivated by the two estimators proposed in [6, 8] for manifold-valued output. We use in the following the notation kh(x) = (1/h^m) k(x/h).\n\nDefinition 2 Let (Xi, Yi)_{i=1}^l be the sample with Xi \u2208 M and Yi \u2208 N. The metric-space-valued kernel estimator \u03c6l : M \u2192 N from metric space M to metric space N is defined for all x \u2208 M as\n\n\u03c6l(x) = arg min_{q\u2208N} (1/l) \u2211_{i=1}^l \u0393(dN(q, Yi)) kh(dM(x, Xi)), (2)\n\nwhere \u0393 : R+ \u2192 R+ is (\u03b1, s)-bounded and k : R+ \u2192 R+.\n\nIf the data contains a large fraction of outliers one should use a robust loss function \u0393, see Section 6. Usually the kernel function should be monotonically decreasing since the interpretation of kh(dM(x, Xi)) is to measure the similarity between x and Xi in M which should decrease as the distance increases. The computational complexity to determine \u03c6l(x) is quite high as for each test point one has to solve an optimization problem but comparable to structured output learning (see discussion below) where one maximizes for each test point the score function over the output space. For manifold-valued output we will describe in the next section a simple gradient-descent type optimization scheme in order to determine \u03c6l(x).\n\nIt is interesting to see that several well-known nonparametric estimators for classification and regression can be seen as special cases of this estimator (or a slightly more general form) for different choices of the output space, its metric and the loss function. In particular, the approach shows a certain analogy of a generalization of regression into a continuous space (manifold-valued regression) and regression into a discrete space (structured output learning).\n\n^2 In some cases the set of all local minimizers is denoted as the Fr\u00e9chet mean set and the Fr\u00e9chet mean is called unique if there exists only one global minimizer.\n\nMulticlass classification: Let N = {1, . . . , K} where K denotes the number of classes. If there is no special class-structure, then the discrete metric on N, dN(q, q') = 1 if q \u2260 q' and 0 else, leads for any \u0393 to the standard multiclass classification scheme using a majority vote. Cost-sensitive multiclass classification can be done by using dN(q, q') to model the cost of misclassifying class q by class q'. Since general costs can generally not be modeled by a metric, it should be noted that the estimator can be modified using a similarity function, s : N \u00d7 N \u2192 R,\n\n\u03c6l(x) = arg max_{q\u2208N} (1/l) \u2211_{i=1}^l s(q, Yi) kh(dM(x, Xi)). (3)\n\nThe consistency result below can be generalized to this case given that N has finite cardinality.\n\nMultivariate regression: Let N = R^n and M be a metric space. Then for \u0393(x) = x^2, one gets\n\n\u03c6l(x) = arg min_{q\u2208N} (1/l) \u2211_{i=1}^l \u2016q \u2212 Yi\u2016^2 kh(dM(x, Xi)),\n\nwhich has the solution \u03c6l(x) = [(1/l) \u2211_{i=1}^l kh(dM(x, Xi)) Yi] / [(1/l) \u2211_{i=1}^l kh(dM(x, Xi))]. This is the well-known Nadaraya-Watson estimator, see [15, 16], on a metric space. In [17] a related estimator is discussed when M is a closed Riemannian manifold and [18] discusses the Nadaraya-Watson estimator when M is a metric space.\n\nManifold-valued regression: In [6] the estimator \u03c6l(x) has been proposed for the case where N is a Riemannian manifold and \u0393(x) = x^2, in particular with the emphasis on N being the manifold of shapes. The discussion of a robust median-type estimator, that is \u0393(x) = x, has been done recently in [8]. While it has been shown in [7] that an approach using a global smoothness regularizer outperforms the estimator \u03c6l(x), it is a well working baseline with a simple implementation, see Section 4.\n\nStructured output: Structured output learning, see [1, 2, 3] and references therein, can be formulated using kernels k((x1, q1), (x2, q2)) on the product M \u00d7 N of input and output space, which are supposed to measure jointly the similarity and thus can capture non-trivial dependencies between input and output. Using such kernels [1, 2, 3] learn a score function s : M \u00d7 N \u2192 R, with\n\n\u03a8(x) = arg max_{q\u2208N} s(x, q)\n\nbeing the final prediction for x \u2208 M. The similarity to our estimator \u03c6l(x) in (2) becomes more obvious when we use that in the framework of [1] the learned score function can be written as\n\n\u03a8l(x) = arg max_{q\u2208N} (1/l) \u2211_{i=1}^l \u03b1i k((x, q), (Xi, Yi)), (4)\n\nwhere \u03b1 \u2208 R^l is the learned coefficient vector. Apart from the coefficient vector \u03b1 this has almost the form of the previously discussed estimator in Equation (3), using a joint similarity function on input and output space. Clearly, a structured output method where the coefficients \u03b1 have been optimized should perform better than \u03b1i = const. In cases where training time is prohibitive the estimator without \u03b1 is an alternative, at least it provides a useful baseline for structured output learning. Moreover, if the joint kernel factorizes, k((x1, q1), (x2, q2)) = kM(x1, x2) kN(q1, q2) on M and N, and kN(q, q) = const., then one can rewrite the problem in (4) as,\n\n\u03a8l(x) = arg min_{q\u2208N} (1/l) \u2211_{i=1}^l \u03b1i kM(x, Xi) dN^2(q, Yi),\n\nwhere dN is the induced (semi)-metric^3 of kN. Apart from the learned coefficients this is basically equivalent to \u03c6l(x) in (2) for \u0393(x) = x^2.\n\nIn the following we restrict ourselves to the case where M and N are Riemannian manifolds. In this case the optimization to obtain \u03c6l(x) can still be done very efficiently as the next section shows.\n\n^3 The kernel kN induces a (semi)-metric dN on N via: dN^2(p, q) = kN(p, p) + kN(q, q) \u2212 2kN(p, q).\n\n4 Implementation of the kernel estimator for manifold-valued output\n\nFor fixed x \u2208 M, the functional F(q) for q \u2208 N which is optimized in the kernel estimator \u03c6l(x) can be rewritten with wi = kh(dM(x, Xi)) as,\n\nF(q) = \u2211_{i=1}^l wi \u0393(dN(q, Yi)).\n\nThe covariant gradient of F(q) is given as \u2207F|q = \u2211_{i=1}^l wi \u0393'(dN(q, Yi)) vi, where vi \u2208 TqN is a tangent vector at q with \u2016vi\u2016_{TqN} = 1 given by the tangent vector at q of the minimizing^4 geodesic from Yi to q (pointing \u201caway\u201d from Yi). Denoting by exp_q : TqN \u2192 N the exponential map at q, the simple gradient descent based optimization scheme can be written as\n\n\u2022 choose a random point q0 from N,\n\u2022 while stopping criteria not fulfilled,\n1. compute gradient \u2207F at qk\n2. one has: qk+1 = exp_{qk}(\u2212\u03b1 \u2207F|qk)\n3. determine stepsize \u03b1 by Armijo rule [19].\n\nAs stopping criterion we use either the norm of the gradient or a threshold on the change of F. For the experiments in Section 6 we get convergence in 5 to 40 steps.\n\n5 Consistency of the kernel estimator for manifold-valued input and output\n\nIn this section we show the pointwise and Bayes consistency of the kernel estimator \u03c6l in the case where M and N are Riemannian manifolds. This case already subsumes several of the interesting applications discussed in [6, 8]. The proof of consistency of the general metric-space valued kernel estimator (for a restricted class of metric spaces including all Riemannian manifolds) requires considerable technical overhead, which is interesting in itself but would make the paper hard to follow. The consistency of \u03c6l will be proven under the following assumptions:\n\nAssumptions (A1):\n\n1. The loss \u0393 : R+ \u2192 R+ is (\u03b1, s)-bounded,\n2. (Xi, Yi)_{i=1}^l is an i.i.d. sample of P on M \u00d7 N,\n3. M and N are compact m- and n-dimensional manifolds,\n4. The data-generating measure P on M \u00d7 N is absolutely continuous with respect to the natural volume element,\n5. The marginal density on M fulfills: p(x) \u2265 pmin, \u2200 x \u2208 M,\n6. The density p(\u00b7, y) is continuous on M for all y \u2208 N,\n7. The kernel fulfills: a 1_{s\u2264r1} \u2264 k(s) \u2264 b e^{\u2212\u03b3 s^2} and \u222b_{R^m} \u2016x\u2016 k(\u2016x\u2016) dx < \u221e.\n\nNote that existence of a density is not necessary for consistency. However, in order to keep the proofs simple, we restrict ourselves to this setting. In the following dV = \u221a(det g) dx denotes the natural volume element of a Riemannian manifold with metric g, vol(S) denotes the volume of a set S and diam(N) the diameter of N. For the proof of our main theorem we need the following two propositions. 
The first one summarizes two results from [20].\n\nProposition 1 Let M be a compact m-dimensional Riemannian manifold. Then, there exists r0 > 0 and S1, S2 > 0 such that for all x \u2208 M the volume of the balls B(x, r) with radius r \u2264 r0 satisfies,\n\nS1 r^m \u2264 vol(B(x, r)) \u2264 S2 r^m.\n\nMoreover, the cardinality K of a \u03b4-covering of M is upper bounded as K \u2264 (vol(M)/S1) (2/\u03b4)^m.\n\n^4 The set of points where the minimizing geodesic is not unique, the so called cut locus, has measure zero and therefore plays no role in the optimization.\n\nMoreover, we need a result about convolutions on manifolds.\n\nProposition 2 Let the assumptions A1 hold, then if f is continuous we get for any x \u2208 M\\\u2202M,\n\nlim_{h\u21920} \u222b_M kh(dM(x, z)) f(z) dV(z) = Cx f(x),\n\nwhere Cx = lim_{h\u21920} \u222b_M kh(dM(x, z)) dV(z) > 0. If moreover f is Lipschitz continuous with Lipschitz constant L, then there exists a h0 > 0 such that for all h < h0(x),\n\n\u222b_M kh(dM(x, z)) f(z) dV(z) = Cx f(x) + O(h).\n\nThe following main theorem proves the almost sure pointwise convergence of the manifold-valued kernel estimator for all (\u03b1, s)-bounded loss functions \u0393.\n\nTheorem 1 Suppose the assumptions in A1 hold. Let \u03c6l(x) be the estimate of the kernel estimator for sample size l. 
If h \u2192 0 and l h^m / log l \u2192 \u221e, then for any x \u2208 M\\\u2202M,\n\nlim_{l\u2192\u221e} |R'\u0393(x, \u03c6l(x)) \u2212 min_{q\u2208N} R'\u0393(x, q)| = 0, almost surely.\n\nIf additionally p(\u00b7, y) is Lipschitz-continuous for any y \u2208 N, then\n\nlim_{l\u2192\u221e} |R'\u0393(x, \u03c6l(x)) \u2212 min_{q\u2208N} R'\u0393(x, q)| = O(h) + O(\u221a(log l/(l h^m))), almost surely.\n\nThe optimal rate is given by h = O((log l/l)^{1/(2+m)}) so that\n\nlim_{l\u2192\u221e} R'\u0393(x, \u03c6l(x)) \u2212 min_{q\u2208N} R'\u0393(x, q) = O((log l/l)^{1/(2+m)}), almost surely.\n\nNote that the condition l h^m / log l \u2192 \u221e for convergence is the same as for the Nadaraya-Watson estimator on an m-dimensional Euclidean space. This had to be expected as this condition still holds if one considers multivariate output, see [15, 16]. Thus, doing regression with manifold-valued output is not more \u201cdifficult\u201d than standard regression with multivariate output.\n\nNext, we show Bayes consistency of the manifold-valued kernel estimator.\n\nTheorem 2 Let the assumptions A1 hold. If h \u2192 0 and l h^m / log l \u2192 \u221e, then\n\nlim_{l\u2192\u221e} R\u0393(\u03c6l) \u2212 R\u0393(\u03c6\u2217) = 0, almost surely.\n\nProof: We have R\u0393(\u03c6l) \u2212 R\u0393(\u03c6\u2217) \u2264 E[|R'\u0393(X, \u03c6l(X)) \u2212 R'\u0393(X, \u03c6\u2217(X))|]. Moreover, we have almost everywhere, lim_{l\u2192\u221e} R'\u0393(x, \u03c6l(x)) = R'\u0393(x, \u03c6\u2217(x)) almost surely. Since E[R'\u0393(X, \u03c6(X))] < \u221e and E[R'\u0393(X, \u03c6\u2217(X))] < \u221e, an extension of the dominated convergence theorem proven by Glick, see [21], provides the result. \u25a1\n\n6 Experiments\n\nWe illustrate the differences of median and mean type estimator on a synthetic dataset with the task of estimating a curve on the sphere, that is M = [0, 1] and N = S1. The kernel used had the form k(|x \u2212 y|/h) = 1 \u2212 |x \u2212 y|/h. The parameter h was found by 5-fold cross validation from the set [5, 10, 20, 40] \u2217 10^{\u22123}. The results are summarized for different levels of outliers and different levels of van-Mises noise (note that the parameter k is inverse to the variance of the distribution) in Table 1. As expected the L1-loss and the Huber loss as robust loss functions outperform the L2-loss in the presence of outliers, whereas the L2-loss outperforms the robust versions when no outliers are present. Note that the Huber loss as a hybrid version between L1- and L2-loss is even slightly better than the L1-loss in the presence of outliers as well as in the outlier free case. Thus for a given dataset it makes sense not only to do cross-validation of the parameter h of the kernel function but also over different loss functions in order to adapt to possible outliers in the data.\n\nFigure 1: Regression problem on the sphere with 1000 training points (black points). The blue points are the ground truth disturbed by van Mises noise with parameter k = 100 and 20% (outliers) with k = 3. The estimated curves are shown in green. Left: Result of L1-loss, mean error (ME) 0.256, mean squared error (MSE) 0.165. Middle: Result of L2-loss: ME = 0.265, MSE = 0.169. Right: Result of Huber loss with \u03b5 = 0.1: ME = 0.255, MSE = 0.165. In particular, the curves found using L1 and Huber loss are very close to the ground truth.\n\nTable 1: Mean squared error (unit 10^{\u22121}) for regression on the sphere - for different noise levels k, number of labeled points, without and with outliers. 
Results are averaged over 10 runs.\n\nLoss | Noise level | No outliers, 100 samples | No outliers, 500 samples | No outliers, 1000 samples | 20% outliers, 100 samples | 20% outliers, 500 samples | 20% outliers, 1000 samples\nL1-Loss, \u0393(x) = x | k = 100 | 0.63 \u00b1 0.11 | 0.260 \u00b1 0.027 | 0.219 \u00b1 0.003 | 2.1 \u00b1 0.2 | 1.57 \u00b1 0.05 | 1.521 \u00b1 0.015\nL1-Loss, \u0393(x) = x | k = 1000 | 0.43 \u00b1 0.12 | 0.043 \u00b1 0.005 | 0.030 \u00b1 0.001 | 2.1 \u00b1 0.5 | 1.45 \u00b1 0.03 | 1.400 \u00b1 0.008\nL2-Loss, \u0393(x) = x^2 | k = 100 | 0.43 \u00b1 0.10 | 0.230 \u00b1 0.007 | 0.208 \u00b1 0.001 | 2.0 \u00b1 0.2 | 1.59 \u00b1 0.02 | 1.549 \u00b1 0.021\nL2-Loss, \u0393(x) = x^2 | k = 1000 | 0.28 \u00b1 0.16 | 0.032 \u00b1 0.003 | 0.025 \u00b1 0.001 | 2.0 \u00b1 0.4 | 1.51 \u00b1 0.03 | 1.447 \u00b1 0.015\nHuber-Loss with \u03b5 = 0.1 | k = 100 | 0.61 \u00b1 0.11 | 0.257 \u00b1 0.026 | 0.218 \u00b1 0.003 | 2.1 \u00b1 0.2 | 1.57 \u00b1 0.05 | 1.520 \u00b1 0.021\nHuber-Loss with \u03b5 = 0.1 | k = 1000 | 0.42 \u00b1 0.12 | 0.040 \u00b1 0.005 | 0.028 \u00b1 0.001 | 2.1 \u00b1 0.5 | 1.44 \u00b1 0.02 | 1.397 \u00b1 0.008\n\n7 Proofs\n\nLemma 2 Let \u03c6 : R+ \u2192 R be convex, differentiable and monotonically increasing. Then\n\nmin{\u03c6'(x), \u03c6'(y)} |y \u2212 x| \u2264 |\u03c6(y) \u2212 \u03c6(x)| \u2264 max{\u03c6'(x), \u03c6'(y)} |y \u2212 x|.\n\nProof of Theorem 1 We define R'\u0393,l(x, q) = [(1/l) \u2211_{i=1}^l \u0393(dN(q, Yi)) kh(dM(x, Xi))] / E[kh(dM(x, X))]. Note that \u03c6l(x) = arg min_{q\u2208N} R'\u0393,l(x, q) as we have only divided by a constant factor. We use the standard technique for the pointwise estimate,\n\nR'\u0393(x, \u03c6l(x)) \u2212 min_{q\u2208N} R'\u0393(x, q) \u2264 R'\u0393(x, \u03c6l(x)) \u2212 R'\u0393,l(x, \u03c6l(x)) + R'\u0393,l(x, \u03c6l(x)) \u2212 min_{q\u2208N} R'\u0393(x, q) \u2264 2 sup_{q\u2208N} |R'\u0393,l(x, q) \u2212 R'\u0393(x, q)|.\n\nIn order to bound the supremum, we will work on the event E, where we assume\n\n|[(1/l) \u2211_{i=1}^l kh(dM(x, Xi))] / E[kh(dM(x, X))] \u2212 1| < 1/2,\n\nwhich holds with probability 1 \u2212 2 e^{\u2212C l h^m} for some constant C. Moreover, we assume to have a \u03b4-covering of N with centers N\u03b4 = {q\u03b1}_{\u03b1=1}^K where using Proposition 1 we have K \u2264 (vol(N)/S1) (2/\u03b4)^n. Thus for each q \u2208 N there exists q\u03b1 \u2208 N\u03b4 such that dN(q, q\u03b1) \u2264 \u03b4. Introducing R^E_\u0393(x, q) = E[\u0393(dN(q, Y)) kh(dM(x, X))] / E[kh(dM(x, X))] and using the decomposition,\n\nR'\u0393,l(x, q) \u2212 R'\u0393(x, q) = R'\u0393,l(x, q) \u2212 R'\u0393,l(x, q\u03b1) + R'\u0393,l(x, q\u03b1) \u2212 R^E_\u0393(x, q\u03b1) + R^E_\u0393(x, q\u03b1) \u2212 R^E_\u0393(x, q) + R^E_\u0393(x, q) \u2212 R'\u0393(x, q),\n\nwe have to control four terms. For the first one,\n\n|R'\u0393,l(x, q) \u2212 R'\u0393,l(x, q\u03b1)| = |[(1/l) \u2211_{i=1}^l (\u0393(dN(q, Yi)) \u2212 \u0393(dN(q\u03b1, Yi))) kh(dM(x, Xi))] / E[kh(dM(x, X))]| \u2264 2 dN(q, q\u03b1) \u0393'(diam(N)) [(1/l) \u2211_{i=1}^l kh(dM(x, Xi))] / E[kh(dM(x, X))] \u2264 3 \u0393'(diam(N)) \u03b4,\n\nwhere we have used Lemma 2 and the fact that E holds. Then, there exists a constant C such that\n\nP(max_{1\u2264\u03b1\u2264K} |R'\u0393,l(x, q\u03b1) \u2212 R^E_\u0393(x, q\u03b1)| > \u03b5) \u2264 2 (vol(N)/S1) (2/\u03b4)^n e^{\u2212C l h^m \u03b5^2},\n\nwhich can be shown using Bernstein\u2019s inequality for (1/l) \u2211_{i=1}^l (Wi \u2212 E[Wi]) where Wi = \u0393(dN(q\u03b1, Yi)) kh(dM(x, Xi)) / E[kh(dM(x, X))] together with a union bound over the elements in the covering N\u03b4 using\n\n|Wi| \u2264 (b/a) \u0393(diam(N)) / (h^m S1 r1^m pmin), Var Wi \u2264 \u0393(diam(N))^2 E[kh^2(dM(x, X))] / (E[kh(dM(x, X))])^2 \u2264 (b/a) \u0393(diam(N))^2 / (h^m S1 r1^m pmin),\n\nwhere we used Proposition 1 to lower bound vol(B(x, h r1)) for small enough h. Third, we get for the third term using again Lemma 2,\n\n|R^E_\u0393(x, q\u03b1) \u2212 R^E_\u0393(x, q)| \u2264 2\u0393'(diam(N)) dN(q, q\u03b1) \u2264 2\u0393'(diam(N)) \u03b4.\n\nLast, we have to bound the approximation error R^E_\u0393(x, q) \u2212 R'\u0393(x, q). Under the continuity assumption on the joint density p(x, y) we can use Proposition 2. For every x \u2208 M\\\u2202M we get,\n\nlim_{h\u21920} \u222b_M kh(dM(x, z)) p(z, y) dV(z) = Cx p(x, y), lim_{h\u21920} \u222b_M kh(dM(x, z)) p(z) dV(z) = Cx p(x),\n\nwhere Cx > 0. Thus with\n\nfh = \u222b_M kh(dM(x, z)) p(z, y) dV(z), gh = \u222b_M kh(dM(x, z)) p(z) dV(z),\n\nwe get for every x \u2208 M\\\u2202M,\n\nlim_{h\u21920} |fh/gh \u2212 f/g| \u2264 lim_{h\u21920} |fh \u2212 f| / gh + lim_{h\u21920} f |gh \u2212 g| / (g gh) = 0,\n\nwhere we have used gh \u2265 a S1 r1 pmin > 0 and g = Cx p(x) > 0. Moreover, using results from the proof of Proposition 2 one can show fh < C for some constant C. Thus fh/gh < C for some constant and fh/gh \u2192 f/g as h \u2192 0. Using the dominated convergence theorem we thus get\n\nlim_{h\u21920} R^E_\u0393(x, q) = lim_{h\u21920} E[\u0393(dN(q, Y)) kh(dM(x, X))] / E[kh(dM(x, X))] = \u222b_N \u0393(dN(q, y)) p(x, y)/p(x) dy = R'\u0393(x, q).\n\nFor the case where the joint density is Lipschitz continuous one gets using Proposition 2, R^E_\u0393(x, q) = R'\u0393(x, q) + O(h).\n\nIn total, there exist constants A, B, C, D1, D2, such that for sufficiently small h one has with probability 1 \u2212 A e^{B n log(1/\u03b4) \u2212 C l h^m \u03b5^2},\n\nsup_{q\u2208N} |R'\u0393,l(x, q) \u2212 R^E_\u0393(x, q)| \u2264 2 D1 \u03b4 + \u03b5.\n\nWith \u03b4 = l^{\u2212s} for some s > 0 one gets convergence if l h^m / log l \u2192 \u221e together with lim_{h\u21920} R^E_\u0393(x, q) = R'\u0393(x, q). For the case where p(\u00b7, y) is Lipschitz continuous for all y \u2208 N we have R^E_\u0393(x, q) = R'\u0393(x, q) + O(h) and can choose s large enough so that the bound from the approximation error dominates the one of the covering. Under the condition l h^m / log l \u2192 \u221e the probabilistic bound is summable in l which yields almost sure convergence by the Borel-Cantelli-Lemma. The optimal rate in the Lipschitz continuous case is then determined by fixing h such that both terms of the bound are of the same order. \u25a1\n\nAcknowledgments\n\nWe thank Florian Steinke for helpful discussions about relations between generalized kernel estimators and structured output learning. This work has been partially supported by the Cluster of Excellence MMCI at Saarland University.\n\nReferences\n\n[1] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. 
JMLR, 6:1453\u20131484, 2005.\n\n[2] J. Weston, G. Bak\u0131r, O. Bousquet, B. Sch\u00f6lkopf, T. Mann, and W. S. Noble. Joint kernel maps. In Predicting Structured Data, pages 67\u201384. MIT Press, 2007.\n\n[3] E. Ricci, T. De Bie, and N. Cristianini. Magic moments for structured output prediction. JMLR, 9:2803\u20132846, 2008.\n\n[4] K. V. Mardia and P. E. Jupp. Directional Statistics. Wiley, New York, 2000.\n\n[5] Inam Ur Rahman, Iddo Drori, Victoria C. Stodden, David L. Donoho, and Peter Schr\u00f6der. Multiscale representations for manifold-valued data. Multiscale Modeling and Simulation, 4(4):1201\u20131232, 2005.\n\n[6] B. C. Davis, P. T. Fletcher, E. Bullitt, and S. Joshi. Population shape regression from random design data. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1\u20137, 2007.\n\n[7] F. Steinke and M. Hein. Non-parametric regression between Riemannian manifolds. In Advances in Neural Information Processing Systems (NIPS) 21, pages 1561\u20131568, 2009.\n\n[8] P. T. Fletcher, S. Venkatasubramanian, and S. Joshi. The geometric median on Riemannian manifolds with application to robust atlas estimation. NeuroImage, 45:143\u2013152, 2009.\n\n[9] C. G. Small. A survey of multidimensional medians. International Statistical Review, 58:263\u2013277, 1990.\n\n[10] D. Blackwell and M. Maitra. Factorization of probability measures and absolutely measurable sets. Proc. Amer. Math. Soc., 92(2):251\u2013254, 1984.\n\n[11] R. Bhattacharya and V. Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds I. Ann. Stat., 31(1):1\u201329, 2003.\n\n[12] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30:509\u2013541, 1977.\n\n[13] W. Kendall. Probability, convexity, and harmonic maps with small image. I. Uniqueness and fine existence. Proc. London Math. Soc., 61(2):371\u2013406, 1990.\n\n[14] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Symposium on Theory of Computing (STOC), pages 428\u2013434, 1999.\n\n[15] L. Gy\u00f6rfi, M. Kohler, A. Krzy\u017cak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, 2004.\n\n[16] W. Greblicki and M. Pawlak. Nonparametric System Identification. Cambridge University Press, Cambridge, 2008.\n\n[17] B. Pelletier. Nonparametric regression estimation on closed Riemannian manifolds. J. of Nonparametric Stat., 18:57\u201367, 2006.\n\n[18] S. Dabo-Niang and N. Rhomari. Estimation non parametrique de la regression avec variable explicative dans un espace metrique. C. R. Math. Acad. Sci. Paris, 1:75\u201380, 2003.\n\n[19] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Mass., 1999.\n\n[20] M. Hein. Uniform convergence of adaptive graph-based regularization. In G. Lugosi and H. Simon, editors, Proc. of the 19th Conf. on Learning Theory (COLT), pages 50\u201364, Berlin, 2006. Springer.\n\n[21] N. Glick. Consistency conditions for probability estimators and integrals of density estimators. Utilitas Math., 6:61\u201374, 1974.\n", "award": [], "sourceid": 975, "authors": [{"given_name": "Matthias", "family_name": "Hein", "institution": null}]}