{"title": "Universal Kernels on Non-Standard Input Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 406, "page_last": 414, "abstract": "In recent years support vector machines (SVMs) have been successfully applied even in situations where the input space $X$ is not necessarily a subset of $R^d$. Examples include SVMs using probability measures to analyse e.g. histograms or coloured images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) $H\\subset L_p(P_X)$ is dense, or if the SVM is based on a universal kernel $k$. So far, however, there are no RKHSs of practical interest known that satisfy these assumptions on $H$ or $k$ if $X \\not\\subset R^d$. We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subsets of $R^d$. We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing.", "full_text": "Universal Kernels on Non-Standard Input Spaces\n\nAndreas Christmann\nUniversity of Bayreuth\n\nDepartment of Mathematics\n\nD-95440 Bayreuth\n\nandreas.christmann@uni-bayreuth.de\n\nIngo Steinwart\n\nUniversity of Stuttgart\n\nDepartment of Mathematics\n\nD-70569 Stuttgart\n\ningo.steinwart@mathematik.uni-stuttgart.de\n\nAbstract\n\nIn recent years support vector machines (SVMs) have been successfully\napplied in situations where the input space X is not necessarily a subset of R^d.
Examples include SVMs for the analysis of histograms or colored images, SVMs for\ntext classification and web mining, and SVMs for applications from computational\nbiology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be\nconsistent to the Bayes risk, if either the input space is a complete separable metric\nspace and the reproducing kernel Hilbert space (RKHS) H \u2282 L_p(P_X) is dense,\nor if the SVM uses a universal kernel k. So far, however, there are no kernels of\npractical interest known that satisfy these assumptions, if X \u2284 R^d. We close this\ngap by providing a general technique based on Taylor-type kernels to explicitly\nconstruct universal kernels on compact metric spaces which are not subsets of R^d.\nWe apply this technique for the following special cases: universal kernels on the\nset of probability measures, universal kernels based on Fourier transforms, and\nuniversal kernels for signal processing.\n\n1 Introduction\n\nFor more than a decade, kernel methods such as support vector machines (SVMs) have been among\nthe most successful learning methods. Besides several other nice features, one key argument\nfor using SVMs has been the so-called \u201ckernel trick\u201d [22], which decouples the SVM optimization\nproblem from the domain of the samples, thus making it possible to use SVMs on virtually any input\nspace X. This flexibility is in strong contrast to more classical learning methods from both machine\nlearning and non-parametric statistics, which almost always require input spaces X \u2282 R^d. As a\nresult, kernel methods have been successfully used in various application areas that were previously\ninfeasible for machine learning methods. The following, by no means exhaustive, list illustrates this:\n\u2022 SVMs processing probability measures, e.g.
histograms, as input samples have been used to analyze\nhistogram data such as colored images, see [5, 11, 14, 12, 27, 29], and also [17] for\nnon-extensive information theoretic kernels on measures.\n\u2022 SVMs for text classification and web mining [15, 12, 16],\n\u2022 SVMs with kernels from computational biology, e.g. kernels for trees and graphs [23].\nIn addition, several extensions or generalizations of kernel methods have been considered, see\ne.g. [13, 26, 9, 16, 7, 8, 4]. Besides their practical success, SVMs nowadays also possess a rich\nstatistical theory, which provides various learning guarantees, see [31] for a recent account.\nInterestingly, in this analysis, the kernel and its reproducing kernel Hilbert space (RKHS) make it\npossible to completely decouple the statistical analysis of SVMs from the input space X. For\nexample, if one uses the hinge loss and a bounded measurable kernel whose RKHS H is separable\nand dense in L_1(\u00b5) for all distributions \u00b5 on X, then [31, Theorem 7.22] together with\n[31, Theorem 2.31] and the discussion on [31, page 267ff] shows that the corresponding SVM is universally\nclassification consistent even without an entropy number assumption if one picks a sequence (\u03bb_n)\nof positive regularization parameters that satisfy \u03bb_n \u2192 0 and n\u03bb_n / ln n \u2192 \u221e. In other words,\nindependently of the input space X, the universal consistency of SVMs is well-understood modulo\nan approximation theoretical question, namely that of the denseness of H in all L_1(\u00b5).\nFor standard input spaces X \u2282 R^d and various classical kernels, this question of denseness has been\npositively answered. For example, for compact X \u2282 R^d, [30] showed that, among a few others,\nthe RKHSs of the Gaussian RBF kernels are universal, that is, they are dense in the space C(X)\nof continuous functions f : X \u2192 R. With the help of a standard result from measure theory, see\ne.g.
[1, Theorem 29.14], it is then easy to conclude that these RKHSs are also dense in all L_1(\u00b5) for\nwhich \u00b5 has a compact support. This key result has been extended in a couple of different directions:\nFor example, [18] establishes universality for more classes of kernels on compact X \u2282 R^d, whereas\n[32] shows the denseness of the Gaussian RKHSs in L_1(\u00b5) for all distributions \u00b5 on R^d. Finally,\n[7, 8, 28, 29] show that universal kernels are closely related to so-called characteristic kernels that\ncan be used to distinguish distributions. In addition, all these papers contain sufficient or necessary\nconditions for universality of kernels on arbitrary compact metric spaces X, and [32] further shows\nthat the compact metric spaces are exactly the compact topological spaces on which there exist\nuniversal kernels.\nUnfortunately, however, it appears that neither the sufficient conditions for universality nor the proof\nof the existence of universal kernels can be used to construct universal kernels on compact metric\nspaces X \u2284 R^d. In fact, to the best of our knowledge, no explicit example of such kernels has so far\nbeen presented. As a consequence, it seems fair to say that, beyond the X \u2282 R^d-case, the theory of\nSVMs is incomplete, which is in contrast to the obvious practical success of SVMs for such input\nspaces X as illustrated above.\nThe goal of this paper is to close this gap by providing the first explicit and constructive examples\nof universal kernels that live on compact metric spaces X \u2284 R^d. To achieve this, our first step is to\nextend the definition of the Gaussian RBF kernels, or more generally, kernels that can be expressed\nby a Taylor series, from the Euclidean R^d to its infinite dimensional counterpart, that is, the space\n\u2113_2 of square summable sequences. Unfortunately, on the space \u2113_2 we face new challenges due to\nits infinite dimensional nature.
Indeed, the closed balls of \u2113_2 are no longer (norm)-compact subsets\nof \u2113_2 and hence we cannot expect universality on these balls. To address this issue, one may be\ntempted to use the weak\u2217-topology on \u2113_2, since in this topology the closed balls are both compact\nand metrizable, thus universal kernels do exist on them. However, the Taylor kernels do not belong\nto them, because, basically, the inner product \u27e8\u00b7 , \u00b7\u27e9_{\u2113_2} fails to be continuous with respect to the\nweak\u2217-topology as the sequence of the standard orthonormal basis vectors shows. To address this\ncompactness issue we consider only (norm)-compact subsets of \u2113_2. Since the inner product of \u2113_2 is\ncontinuous with respect to the norm by virtue of the Cauchy-Schwarz inequality, it turns out that the\nTaylor kernels are continuous with respect to the norm topology. Moreover, we will see that in this\nsituation the Stone-Weierstra\u00df argument of [30] yields a variety of universal kernels including the\ninfinite dimensional extensions of the Gaussian RBF kernels.\nHowever, unlike the finite dimensional Euclidean spaces R^d and their compact subsets, the compact\nsubsets of \u2113_2 can hardly be viewed as somewhat natural examples of input spaces X. Therefore,\nwe go one step further by considering compact metric spaces X for which there exist a separable\nHilbert space H and an injective and continuous map \u03c1 : X \u2192 H. If, in this case, we fix an analytic\nfunction K : R \u2192 R that can be globally expressed by its Taylor series developed at zero and\nthat has strictly positive Taylor coefficients, then k(x, x') := K(\u27e8\u03c1(x), \u03c1(x')\u27e9_H) defines a universal\nkernel on X and the same is true for the analogous definition of Gaussian kernels.
Although this\nsituation may look at first glance even more artificial than the \u2113_2-case, it turns out that quite a few\ninteresting explicit examples can be derived from this situation. Indeed, we will use this general\nresult to present examples of Gaussian kernels defined on the set of distributions over some input\nspace \u2126 and on certain sets of functions.\nThe paper has the following structure. Section 2 contains the main results and constructs examples\nfor universal kernels based on our technique. In particular, we show how to construct universal\nkernels on sets of probability measures and on sets of functions, the latter being interesting for\nsignal processing. Section 3 contains a short discussion and Section 4 gives the proofs of the main\nresults.\n\n2 Main result\nA kernel k on a set X is a function k : X \u00d7 X \u2192 R for which all matrices of the form\n$(k(x_i, x_j))_{i,j=1}^n$, n \u2208 N, x_1, . . . , x_n \u2208 X, are symmetric and positive semi-definite.\nEquivalently, k is a kernel if and only if there exist a Hilbert space $\\tilde H$ and a map $\\tilde\\Phi : X \\to \\tilde H$ such\nthat $k(x, x') = \\langle \\tilde\\Phi(x), \\tilde\\Phi(x') \\rangle_{\\tilde H}$ for all x, x' \u2208 X. While neither $\\tilde H$ nor $\\tilde\\Phi$ is uniquely determined,\nthe so-called reproducing kernel Hilbert space (RKHS) of k, which is given by\n\n$$H := \\bigl\\{ \\langle v, \\tilde\\Phi(\\cdot) \\rangle_{\\tilde H} : v \\in \\tilde H \\bigr\\}$$\n\nand $\\|f\\|_H := \\inf\\{ \\|v\\|_{\\tilde H} : f = \\langle v, \\tilde\\Phi(\\cdot) \\rangle_{\\tilde H} \\}$, is uniquely determined, see e.g. [31, Chapter 4.2]. For\nmore information on kernels, we refer to [31, Chapter 4]. Moreover, for a compact metric space\n(X, d), we write C(X) := {f : X \u2192 R | f continuous} for the space of continuous functions on X\nand equip this space with the usual supremum norm \u2016 \u00b7 \u2016_\u221e.
A kernel k on X is called universal, if\nk is continuous and its RKHS H is dense in C(X). As mentioned before, this notion, which goes\nback to [30], plays a key role in the analysis of kernel-based learning methods. Let r \u2208 (0,\u221e].\nThe kernels we consider in this paper are constructed by functions K : [\u2212r, r] \u2192 R that can be\nexpressed by their Taylor series, that is\n\n$$K(t) = \\sum_{n=0}^{\\infty} a_n t^n, \\qquad t \\in [-r, r]. \\tag{1}$$\n\nFor such functions [31, Lemma 4.8] showed that\n\n$$k(x, x') := K(\\langle x, x' \\rangle_{R^d}) = \\sum_{n=0}^{\\infty} a_n \\langle x, x' \\rangle_{R^d}^n, \\qquad x, x' \\in \\sqrt{r} B_{R^d}, \\tag{2}$$\n\ndefines a kernel on the closed ball $\\sqrt{r} B_{R^d} := \\{x \\in R^d : \\|x\\|_2 \\le \\sqrt{r}\\}$ with radius $\\sqrt{r}$, whenever all\nTaylor coefficients a_n are non-negative. Following [31], we call such kernels Taylor kernels. [30],\nsee also [31, Lemma 4.57], showed that Taylor kernels are universal, if a_n > 0 for all n \u2265 0, while\n[21] notes that strict positivity on certain subsets of indices n suffices.\nObviously, the definition (2) of k is still possible, if one replaces R^d by its infinite dimensional and\nseparable counterpart $\\ell_2 := \\{(w_j)_{j \\ge 1} : \\|(w_j)\\|_{\\ell_2}^2 := \\sum_{j \\ge 1} w_j^2 < \\infty\\}$. Let us denote the closed\nunit ball in \u2113_2 by B_{\u2113_2}, or more generally, the closed unit ball of a Banach space E by B_E, that is\nB_E := {v \u2208 E : \u2016v\u2016_E \u2264 1}. Our first main result shows that this extension leads to a kernel, whose\nrestrictions to compact subsets are universal, if a_n > 0 for all n \u2208 N_0 := N \u222a {0}.\nTheorem 2.1 Let K : [\u2212r, r] \u2192 R be a function of the form (1). Then we have:\n\ni) If a_n \u2265 0 for all n \u2265 0, then $k : \\sqrt{r} B_{\\ell_2} \\times \\sqrt{r} B_{\\ell_2} \\to R$ is a kernel, where\n\n$$k(w, w') := K(\\langle w, w' \\rangle_{\\ell_2}) = \\sum_{n=0}^{\\infty} a_n \\langle w, w' \\rangle_{\\ell_2}^n, \\qquad w, w' \\in \\sqrt{r} B_{\\ell_2}. \\tag{3}$$\n\nii) If a_n > 0 for all n \u2208 N_0, then the restriction k|_{W\u00d7W} : W \u00d7 W \u2192 R of k to an arbitrary\ncompact set $W \\subset \\sqrt{r} B_{\\ell_2}$ is universal.\n\nTo consider a first explicit example, let K := exp : R \u2192 R be the exponential function. Then\nK clearly satisfies the assumptions of Theorem 2.1 for all r > 0, and hence the resulting exponential\nkernel is universal on every compact subset W of \u2113_2. Moreover, for \u03c3 \u2208 (0,\u221e), the related\nGaussian-type RBF kernel k_\u03c3 : \u2113_2 \u00d7 \u2113_2 \u2192 R defined by\n\n$$k_\\sigma(w, w') := \\exp\\bigl(-\\sigma^2 \\|w - w'\\|_{\\ell_2}^2\\bigr) = \\frac{\\exp(2\\sigma^2 \\langle w, w' \\rangle_{\\ell_2})}{\\exp(\\sigma^2 \\|w\\|_{\\ell_2}^2) \\exp(\\sigma^2 \\|w'\\|_{\\ell_2}^2)} \\tag{4}$$\n\nis also universal on every compact W \u2282 \u2113_2, since modulo the scaling by \u03c3 it is the normalized\nversion of the exponential kernel, and thus it is universal by [31, Lemma 4.55].\nAlthough we have achieved our first goal, namely explicit, constructive examples of universal\nkernels on X \u2284 R^d, the result is so far not really satisfying. Indeed, unlike the finite dimensional\nEuclidean spaces R^d, the infinite dimensional space \u2113_2 rarely appears as the input space in\nreal-world applications.
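
As a small numerical illustration of identity (4), the following sketch compares both sides on two truncated sequences. It is only a sanity check under assumptions that are not part of the paper: elements of l_2 are represented by finitely many coordinates (all remaining coordinates are taken to be zero), and the helper names inner, gauss_direct and gauss_normalized as well as the sample sequences and the value sigma = 0.7 are hypothetical choices made for this illustration.

```python
import math

def inner(w, v):
    # Inner product of two finitely supported sequences
    # (coordinates beyond the shorter list are treated as zero).
    return sum(a * b for a, b in zip(w, v))

def gauss_direct(w, v, sigma):
    # Left-hand side of (4): exp(-sigma^2 * ||w - v||^2).
    n = max(len(w), len(v))
    w = list(w) + [0.0] * (n - len(w))
    v = list(v) + [0.0] * (n - len(v))
    dist2 = sum((a - b) ** 2 for a, b in zip(w, v))
    return math.exp(-sigma ** 2 * dist2)

def gauss_normalized(w, v, sigma):
    # Right-hand side of (4): exponential kernel divided by its diagonal terms.
    num = math.exp(2 * sigma ** 2 * inner(w, v))
    den = math.exp(sigma ** 2 * inner(w, w)) * math.exp(sigma ** 2 * inner(v, v))
    return num / den

# Two square summable sequences, truncated after 50 coordinates (assumption).
w = [1.0 / j for j in range(1, 51)]
v = [(-0.5) ** j for j in range(1, 51)]
lhs = gauss_direct(w, v, 0.7)
rhs = gauss_normalized(w, v, 0.7)
```

Up to floating-point rounding, lhs and rhs agree, which reflects the expansion of the squared norm of w - w' into the two squared norms and the inner product that underlies (4).
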
The following second result can be used to address this issue.\nTheorem 2.2 Let X be a compact metric space and H be a separable Hilbert space such that there\nexists a continuous and injective map \u03c1 : X \u2192 H. Furthermore, let K : R \u2192 R be a function of\nthe form (1). Then the following statements hold:\n\ni) If a_n \u2265 0 for all n \u2208 N_0, then k : X \u00d7 X \u2192 R defines a kernel, where\n\n$$k(x, x') := K\\bigl(\\langle \\rho(x), \\rho(x') \\rangle_H\\bigr) = \\sum_{n=0}^{\\infty} a_n \\langle \\rho(x), \\rho(x') \\rangle_H^n, \\qquad x, x' \\in X. \\tag{5}$$\n\nii) If a_n > 0 for all n \u2208 N_0, then k is a universal kernel.\niii) For \u03c3 > 0, the Gaussian-type RBF kernel k_\u03c3 : X \u00d7 X \u2192 R is a universal kernel, where\n\n$$k_\\sigma(x, x') := \\exp\\bigl(-\\sigma^2 \\|\\rho(x) - \\rho(x')\\|_H^2\\bigr), \\qquad x, x' \\in X. \\tag{6}$$\n\nIt seems possible that the latter result for the Gaussian-type RBF kernel can be extended to other\npositive non-constant radial basis function kernels such as $k_\\sigma(x, x') := \\exp(-\\sigma^2 \\|\\rho(x) - \\rho(x')\\|_H)$\nor the Student-type RBF kernels $k_\\sigma(x, x') := (1 + \\sigma^2 \\|\\rho(x) - \\rho(x')\\|_H^2)^{-\\alpha}$ for \u03c3^2 > 0 and \u03b1 \u2265 1.\nIndeed, [25] uses the fact that on R^d such kernels have an integral representation in terms of the\nGaussian RBF kernels to show, see [25, Corollary 4.9], that these kernels inherit approximation\nproperties such as universality from the Gaussian RBF kernels.
We expect that the same arguments\ncan be made for \u2113_2 and then, in a second step, for the situation of Theorem 2.2.\nBefore we provide some examples of situations in which Theorem 2.2 can be used to define explicit\nuniversal kernels, we point to a technical detail of Theorem 2.2, which may be overlooked, thus leading\nto wrong conclusions.\nTo this end, let (X, d_X) be an arbitrary metric space, H be a separable Hilbert space and \u03c1 : X \u2192 H\nbe an injective map. We write V := \u03c1(X) and equip this space with the metric defined by H.\nThus, \u03c1 : X \u2192 V is bijective by definition. Moreover, since H is assumed to be separable, it is\nisometrically isomorphic to \u2113_2, and hence there exists an isometric isomorphism I : H \u2192 \u2113_2. We\nwrite W := I(V) and equip this set with the metric defined by the norm of \u2113_2. For a function\nf : W \u2192 R, we can then consider the following diagram\n\n$$\\begin{array}{ccc} (X, d_X) & \\stackrel{f \\circ I \\circ \\rho}{\\longrightarrow} & (R, |\\cdot|) \\\\ \\rho \\updownarrow & & \\uparrow f \\\\ (V, \\|\\cdot\\|_H) & \\stackrel{I}{\\longleftrightarrow} & (W, \\|\\cdot\\|_{\\ell_2}) \\end{array} \\tag{7}$$\n\nSince both \u03c1 and I are bijective, it is easy to see that f not only defines a function g : X \u2192 R\nby g := f \u25e6 I \u25e6 \u03c1, but conversely, every function g : X \u2192 R has such a representation and this\nrepresentation is unique. In other words, there is a one-to-one relationship between the functions\nX \u2192 R and the functions W \u2192 R. Let us now assume that we have a kernel k_W on W with RKHS\nH_W and canonical feature map \u03a6_W : W \u2192 H_W. Then k_X : X \u00d7 X \u2192 R, given by\n\n$$k_X(x, x') := k_W(I \\circ \\rho(x), I \\circ \\rho(x')), \\qquad x, x' \\in X,$$\n\ndefines a kernel on X, since\n\n$$k_X(x, x') = k_W(I \\circ \\rho(x), I \\circ \\rho(x')) = \\langle \\Phi_W(I(\\rho(x'))), \\Phi_W(I(\\rho(x))) \\rangle_{H_W}, \\qquad x, x' \\in X,$$\n\nshows that \u03a6_W \u25e6 I \u25e6 \u03c1 : X \u2192 H_W is a feature map of k_X. Moreover, [31, Theorem 4.21] shows\nthat the RKHS H_X of k_X is given by\n\n$$H_X = \\bigl\\{ \\langle f, \\Phi_W \\circ I \\circ \\rho(\\cdot) \\rangle_{H_W} : f \\in H_W \\bigr\\}.$$\n\nSince, for f \u2208 H_W, the reproducing property of H_W gives f \u25e6 I \u25e6 \u03c1(x) = \u27e8f, \u03a6_W \u25e6 I \u25e6 \u03c1(x)\u27e9_{H_W}\nfor all x \u2208 X we thus conclude that H_X = {f \u25e6 I \u25e6 \u03c1 : f \u2208 H_W} =: H_W \u25e6 I \u25e6 \u03c1. Let us\nnow assume that X is compact and that k_W is one of the universal kernels considered in Theorem\n2.1 or the Gaussian RBF kernel (4). Then the proof of Theorem 2.2 shows that k_X is one of the\nuniversal kernels considered in Theorem 2.2. Moreover, if we consider the kernel k_V : V \u00d7 V \u2192 R\ndefined by k_V(v, v') := k_W(I(v), I(v')), then an analogous argument shows that k_V is a universal\nkernel. This raises the question whether we need the compactness of X, or whether it suffices\nto assume that \u03c1 is injective, continuous and has a compact image V. Surprisingly, the answer is\nthat it depends on the type of universality one needs. Indeed, if \u03c1 is as in Theorem 2.2, then the\ncompactness of X ensures that \u03c1 is a homeomorphism, that is, \u03c1^{\u22121} : V \u2192 X is continuous, too.\nSince I is clearly also a homeomorphism, we can easily conclude that C(X) = C(W) \u25e6 I \u25e6 \u03c1,\nthat is, we have the same relationship as we have for the RKHSs H_W and H_X. From this, the\nuniversality is easy to establish.
Let us now assume the compactness of V instead of the compactness\nof X. Then, in general, \u03c1 is not a homeomorphism and the sets of continuous functions on X\nand V are in general different, even if we consider the set of bounded continuous functions on X.\nTo see the latter, consider e.g. the map \u03c1 : [0, 1) \u2192 S^1 onto the unit sphere S^1 of R^2 defined\nby \u03c1(t) := (sin(2\u03c0t), cos(2\u03c0t)). Now this difference makes it impossible to infer the\nuniversality of k_X from the universality of k_V (or k_W). However, if \u03c4_V denotes the topology of V,\nthen \u03c1^{\u22121}(\u03c4_V) := {\u03c1^{\u22121}(O) : O \u2208 \u03c4_V} defines a new topology on X, which satisfies \u03c1^{\u22121}(\u03c4_V) \u2282 \u03c4_X.\nConsequently, there are, in general, fewer continuous functions with respect to \u03c1^{\u22121}(\u03c4_V). Now, it\nis easy to check that d_\u03c1(x, x') := \u2016\u03c1(x) \u2212 \u03c1(x')\u2016_H defines a metric that generates \u03c1^{\u22121}(\u03c4_V) and,\nsince \u03c1 is isometric with respect to this new metric, we can conclude that (X, d_\u03c1) is a compact\nmetric space. Consequently, we are back in the situation of Theorem 2.2, and hence k_X is universal\nwith respect to the space C(X, d_\u03c1) of functions X \u2192 R that are continuous with respect to d_\u03c1.\nIn other words, while H_X may fail to approximate every function that is continuous with respect\nto d_X, it does approximate every function that is continuous with respect to d_\u03c1. Whether the latter\napproximation property is enough clearly depends on the specific application at hand.\nLet us now present some universal kernels of practical interest.
Please note that, although the function\n\u03c1 in our examples is even linear, Theorem 2.2 only assumes \u03c1 to be continuous and injective.\nWe start with two examples where X is the set of distributions on some space \u2126.\nExample 1: universal kernels on the set of probability measures.\nLet (\u2126, d_\u2126) be a compact metric space, B(\u2126) be its Borel \u03c3-algebra, and X := M_1(\u2126) be the set of\nall Borel probability measures on \u2126. Then the topology describing weak convergence of probability\nmeasures can be metrized, e.g., by the Prohorov metric\n\n$$d_X(P, P') := \\inf\\bigl\\{ \\varepsilon > 0 : P(A) \\le P'(A^\\varepsilon) + \\varepsilon \\text{ for all } A \\in B(\\Omega) \\bigr\\}, \\qquad P, P' \\in X, \\tag{8}$$\n\nwhere A^\u03b5 := {\u03c9' \u2208 \u2126 : d_\u2126(\u03c9, \u03c9') < \u03b5 for some \u03c9 \u2208 A}, see e.g. [2, Theorem 6.8, p. 73].\nMoreover, (X, d_X) is a compact metric space if and only if (\u2126, d_\u2126) is a compact metric space, see\n[19, Thm. 6.4]. In order to construct universal kernels on (X, d_X) with the help of Theorem 2.2,\nit thus remains to find separable Hilbert spaces H and injective, continuous embeddings \u03c1 : X \u2192\nH. Let k_\u2126 be a continuous kernel on \u2126 with RKHS H_\u2126 and canonical feature map \u03a6_\u2126(\u03c9) :=\nk_\u2126(\u03c9,\u00b7), \u03c9 \u2208 \u2126. Note that k_\u2126 is bounded because it is continuous and \u2126 is compact. Then H_\u2126\nis separable and \u03a6_\u2126 is bounded and continuous, see [31, Lemmata 4.23, 4.29, 4.33]. Assume that\nk_\u2126 is additionally characteristic, i.e. the function \u03c1 : X \u2192 H_\u2126 defined by the Bochner integral\n\u03c1(P) := E_P \u03a6_\u2126 is injective. Then the next lemma, which is taken from [10, Thm. 5.1] and which is\na modification of a theorem in [3, p. III. 40], ensures the continuity of \u03c1.\nLemma 2.3 Let (\u2126, d_\u2126) be a complete separable metric space, H be a separable Banach space and\n\u03a6 : \u2126 \u2192 H be a bounded, continuous function. Then \u03c1 : M_1(\u2126) \u2192 H defined by \u03c1(P) := E_P \u03a6 is\ncontinuous, i.e., E_{P_n} \u03a6 \u2192 E_P \u03a6, whenever (P_n)_{n\u2208N} \u2282 M_1(\u2126) converges weakly in M_1(\u2126) to P.\nConsequently, the map \u03c1 : M_1(\u2126) \u2192 H_\u2126 satisfies the assumptions of Theorem 2.2, and hence the\nGaussian-type RBF kernel\n\n$$k_\\sigma(P, P') := \\exp\\bigl(-\\sigma^2 \\|E_P \\Phi_\\Omega - E_{P'} \\Phi_\\Omega\\|_{H_\\Omega}^2\\bigr), \\qquad P, P' \\in M_1(\\Omega), \\tag{9}$$\n\nis universal and obviously bounded. Note that this kernel is conceptually different from characteristic\nkernels on \u2126. Indeed, characteristic kernels live on \u2126 and their RKHSs consist of functions \u2126 \u2192\nR, while the new kernel k_\u03c3 lives on M_1(\u2126) and its RKHS consists of functions M_1(\u2126) \u2192 R.\nConsequently, k_\u03c3 can be used to learn from samples that are individual distributions, e.g. represented\nby histograms, densities or data, while characteristic kernels can only be used to check whether two\nof such distributions are equal or not.\nExample 2: universal kernels based on Fourier transforms of probability measures.\nConsider the set X := M_1(\u2126), where \u2126 \u2282 R^d is compact. Moreover, let \u03c1 be the Fourier transform\n(or characteristic function), that is $\\rho(P) := \\hat P$, where $\\hat P(t) := \\int e^{i \\langle z, t \\rangle} \\, dP(z) \\in C$, t \u2208 R^d. It is\nwell-known, see e.g. [6, Chap. 9], that, for all P \u2208 M_1(\u2126), $\\hat P$ is uniformly continuous on R^d and\n$\\|\\hat P\\|_\\infty \\le 1$.
Moreover, $\\rho : P \\mapsto \\hat P$ is injective, and if a sequence (P_n) converges weakly to some\nP, then $(\\hat P_n)$ converges uniformly to $\\hat P$ on every compact subset of R^d. Now let \u00b5 be a finite\nBorel measure on R^d with support(\u00b5) = R^d, e.g., \u00b5 can be any probability distribution on R^d with\nLebesgue density h > 0. Then the previous properties of the Fourier transform can be used to\nshow that \u03c1 : M_1(\u2126) \u2192 L_2(\u00b5) is continuous, and hence Theorem 2.2 ensures that the following\nGaussian-type kernel is universal and bounded:\n\n$$k_\\sigma(P, P') := \\exp\\bigl(-\\sigma^2 \\|\\hat P - \\hat P'\\|_{L_2(\\mu)}^2\\bigr), \\qquad P, P' \\in M_1(\\Omega). \\tag{10}$$\n\nIn view of the previous two examples, we mention that the probability measures P and P' are often\nnot directly observable in practice, but only corresponding empirical distributions can be obtained.\nIn this case, a simple standard technique is to construct histograms to represent these empirical\ndistributions as vectors in a finite-dimensional Euclidean space, although it is well-known that histograms\ncan yield bad estimates for probability measures. Our new kernels make it possible to directly plug\nthe empirical distributions into the kernel k_\u03c3, even if these distributions do not have the same length.\nMoreover, other techniques to convert empirical distributions to absolutely continuous distributions\nsuch as kernel estimators derived via weighted averaging of rounded points (WARPing) and (averaging)\nhistograms with different origins, [20, 24] can be used in k_\u03c3, too.
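
To make the remark on plugging empirical distributions directly into k_sigma more concrete, here is a minimal sketch of the kernel (9) for two samples of different sizes. It rests on assumptions that go beyond the text: the base kernel k_Omega is taken to be a Gaussian kernel on a bounded interval of the real line, the squared RKHS distance between the two empirical mean embeddings is expanded in the usual way into pairwise averages of k_Omega, and all function names, sample values and the parameters gamma and sigma are hypothetical.

```python
import math

def k_omega(s, t, gamma=1.0):
    # Base kernel on Omega (assumption: Gaussian kernel on an interval of R).
    return math.exp(-gamma * (s - t) ** 2)

def mean_kernel(xs, ys, gamma=1.0):
    # Average of k_omega over all pairs: the inner product of the two
    # empirical mean embeddings in the RKHS of k_omega.
    total = sum(k_omega(s, t, gamma) for s in xs for t in ys)
    return total / (len(xs) * len(ys))

def embedding_dist2(xs, ys, gamma=1.0):
    # Squared RKHS distance between the empirical mean embeddings.
    d2 = mean_kernel(xs, xs, gamma) - 2 * mean_kernel(xs, ys, gamma) + mean_kernel(ys, ys, gamma)
    return max(d2, 0.0)  # guard against tiny negative rounding errors

def k_sigma(xs, ys, sigma=1.0, gamma=1.0):
    # Gaussian-type kernel (9) evaluated on two empirical distributions.
    return math.exp(-sigma ** 2 * embedding_dist2(xs, ys, gamma))

sample_p = [0.1, 0.4, 0.35, 0.8]             # four observations
sample_q = [0.2, 0.9, 0.5, 0.55, 0.7, 0.3]   # six observations
value = k_sigma(sample_p, sample_q, sigma=2.0)
```

The two samples enter only through averages over pairs of observations, so no common discretization or histogram binning is needed and the sample sizes may differ.
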
Clearly, the preferred method\nwill most likely depend on the specific application at hand, and one benefit of our construction is\nthat it allows this flexibility.\nExample 3: universal kernels for signal processing.\nLet (\u2126, A, \u00b5) be an arbitrary measure space and L_2(\u00b5) be the usual space of square \u00b5-integrable\nfunctions on \u2126. Let us additionally assume that L_2(\u00b5) is separable, which is typically, but not\nalways, satisfied. In addition, let us assume that our input values x_i \u2208 X are functions taken from\nsome compact set X \u2282 L_2(\u00b5). A typical example, where this situation occurs, is signal processing,\nwhere the true signal f \u2208 L_2([0, 1]), which is a function of time, cannot be directly observed, but a\nsmoothed version g := T \u25e6 f of the signal is observable. This smoothing can often be described by a\ncompact linear operator T : L_2([0, 1]) \u2192 L_2([0, 1]), e.g., a convolution operator, acting on the true\nsignals. Hence, if we assume that the true signals are contained in the closed unit ball B_{L_2([0,1])},\nthen the observed, smoothed signals T \u25e6 f are contained in a compact subset X of L_2([0, 1]). Let\nus now return to the general case introduced above. Then the identity map \u03c1 := id : X \u2192 L_2(\u00b5)\nsatisfies the assumptions of Theorem 2.2, and hence the Gaussian-type kernel\n\n$$k_\\sigma(g, g') := \\exp\\bigl(-\\sigma^2 \\|g - g'\\|_{L_2(\\mu)}^2\\bigr), \\qquad g, g' \\in X, \\tag{11}$$\n\ndefines a universal and bounded kernel on X. As in the previous examples, note that the computation\nof k_\u03c3 does not require the functions g and g' to be in a specific format such as a certain discretization.\n\n3 Discussion\n\nThe main goal of this paper was to provide an explicit construction of universal kernels that are\ndefined on arbitrary compact metric spaces, which are not necessarily a subset of R^d.
There is\nstill increasing interest in kernel methods including support vector machines on such input spaces,\ne.g. for classification or regression purposes for input values being probability measures, histograms\nor colored images. As examples, we gave explicit universal kernels on the set of probability\ndistributions and for signal processing. One direction of further research may be to generalize our results\nto the case of non-compact metric spaces or to find quantitative approximation results.\n\n4 Proofs\n\nIn the following, we write $N_0^N$ for the set of all sequences $(j_i)_{i \\ge 1}$ with values in N_0 := N \u222a {0}.\nElements of this set will serve us as multi-indices with countably many components. For $j = (j_i) \\in N_0^N$,\nwe will therefore adopt the multi-index notation\n\n$$|j| := \\sum_{i \\ge 1} j_i.$$\n\nNote that |j| < \u221e implies that j has only finitely many components j_i with j_i \u2260 0.\nLemma 4.1 Assume that n \u2208 N is fixed and that for all $j \\in N_0^N$ with |j| = n, we have some\nconstant c_j \u2208 (0,\u221e). Then for all $j \\in N_0^N$ with |j| = n + 1, there exists a constant $\\tilde c_j \\in (0,\\infty)$\nsuch that for all summable sequences (b_i) \u2282 [0,\u221e) we have\n\n$$\\Bigl( \\sum_{j \\in N_0^N : |j| = n} c_j \\prod_{i=1}^{\\infty} b_i^{j_i} \\Bigr) \\Bigl( \\sum_{i=1}^{\\infty} b_i \\Bigr) = \\sum_{j \\in N_0^N : |j| = n+1} \\tilde c_j \\prod_{i=1}^{\\infty} b_i^{j_i}.$$\n\nProof: This can be shown by induction, where the induction step is similar to the proof for the\nCauchy product of series.\nLemma 4.2 Assume that n \u2208 N_0 is fixed.
Then for all $j \\in N_0^N$ with |j| = n, there exists a constant\nc_j \u2208 (0,\u221e) such that for all summable sequences (b_i) \u2282 [0,\u221e) we have\n\n$$\\Bigl( \\sum_{i=1}^{\\infty} b_i \\Bigr)^n = \\sum_{j \\in N_0^N : |j| = n} c_j \\prod_{i=1}^{\\infty} b_i^{j_i}.$$\n\nProof: This can be shown by induction using Lemma 4.1.\nGiven a non-empty countable set J and a family w := (w_j)_{j\u2208J} \u2282 R, we write $\\|w\\|_2^2 := \\sum_{j \\in J} w_j^2$,\nand, as usual, we denote the space of all families for which this quantity is finite by \u2113_2(J). Recall\nthat \u2113_2(J) together with \u2016 \u00b7 \u2016_2 is a Hilbert space and we denote its inner product by \u27e8\u00b7 , \u00b7\u27e9_{\u2113_2(J)}.\nMoreover, \u2113_2 := \u2113_2(N) is separable, and by using an orthonormal basis representation, it is further\nknown that every separable Hilbert space is isometrically isomorphic to \u2113_2. In this sense, \u2113_2 can be\nviewed as a generic model for separable Hilbert spaces.\nThe following result provides a method to construct Taylor kernels on closed balls in \u2113_2.\nProposition 4.3 Let r \u2208 (0,\u221e] and K : [\u2212r, r] \u2192 R be a function that can be expressed by its\nTaylor series given in (1), i.e. $K(t) = \\sum_{n=0}^{\\infty} a_n t^n$, t \u2208 [\u2212r, r]. Define $J := \\{j \\in N_0^N : |j| < \\infty\\}$.\nIf a_n \u2265 0 for all n \u2265 0, then $k : \\sqrt{r} B_{\\ell_2} \\times \\sqrt{r} B_{\\ell_2} \\to R$ defined by (3), i.e.\n\n$$k(w, w') := K(\\langle w, w' \\rangle_{\\ell_2}) = \\sum_{n=0}^{\\infty} a_n \\langle w, w' \\rangle_{\\ell_2}^n, \\qquad w, w' \\in \\sqrt{r} B_{\\ell_2},$$\n\nis a kernel. Moreover, for all j \u2208 J, there exists a constant c_j \u2208 (0,\u221e) such that $\\Phi : \\sqrt{r} B_{\\ell_2} \\to \\ell_2(J)$\ndefined by\n\n$$\\Phi(w) := \\Bigl( c_j \\prod_{i=1}^{\\infty} w_i^{j_i} \\Bigr)_{j \\in J}, \\qquad w \\in \\sqrt{r} B_{\\ell_2}, \\tag{12}$$\n\nis a feature map of k, where we use the convention 0^0 := 1.\n\nProof: For $w, w' \\in \\sqrt{r} B_{\\ell_2}$, the Cauchy-Schwarz inequality yields $|\\langle w, w' \\rangle| \\le \\|w\\|_2 \\|w'\\|_2 \\le r$\nand thus k is well-defined. Let w_i denote the i-th component of w \u2208 \u2113_2. Since (1) is absolutely\nconvergent, Lemma 4.2 then shows that, for all j \u2208 J, there exists a constant $\\tilde c_j \\in (0,\\infty)$ such\nthat\n\n$$k(w, w') = \\sum_{j \\in J} a_{|j|} \\tilde c_j \\prod_{i=1}^{\\infty} w_i^{j_i} \\prod_{i=1}^{\\infty} (w_i')^{j_i}.$$\n\nSetting $c_j := \\sqrt{a_{|j|} \\tilde c_j}$, we obtain that \u03a6 defined in (12) is indeed a feature map of k, and hence k is\na kernel.\n\nBefore we can prove our first main result we need to recall the following test of universality from\n[31, Theorem 4.56].\n\nTheorem 4.4 Let W be a compact metric space and k be a continuous kernel on W with k(w, w) >\n0 for all w \u2208 W. Suppose that we have an injective feature map \u03a6 : W \u2192 \u2113_2(J) of k, where J\nis some countable set. We write \u03a6_j : W \u2192 R for its j-th component, i.e., \u03a6(w) = (\u03a6_j(w))_{j\u2208J},\nw \u2208 W. If A := span{\u03a6_j : j \u2208 J} is an algebra, then k is universal.\n\nWith the help of Theorem 4.4 and Proposition 4.3 we can now prove our first main result.\n\nProof of Theorem 2.1: We have already seen in Proposition 4.3 that k is a kernel on $\\sqrt{r} B_{\\ell_2}$. Let us\nnow fix a compact $W \\subset \\sqrt{r} B_{\\ell_2}$. For every j \u2208 J, where J is defined in Proposition 4.3, there are\nonly finitely many components j_i with j_i \u2260 0. Consequently, there exists a bijection between J and\nthe set of all finite subsets of N.
Since the latter is countable, $J$ is countable. Furthermore, we have\n\n$$k(w, w) = \\sum_{n=0}^{\\infty} a_n \\|w\\|_{\\ell_2}^{2n} \\geq a_0 > 0$$\n\nfor all $w \\in W$, and it is obvious that the components of the feature map $\\Phi$ found in Proposition\n4.3 span an algebra. Finally, if we have $w, w' \\in W$ with $w \\neq w'$, there exists an $i \\geq 1$ such that\n$w_i \\neq w_i'$. For the multi-index $j \\in J$ that equals 1 at the $i$-th component and vanishes everywhere else\nwe then have $\\Phi_j(w) = c_j w_i \\neq c_j w_i' = \\Phi_j(w')$, and hence $\\Phi$ is injective.\n\nProof of Theorem 2.2: Since $H$ is a separable Hilbert space, there exists an isometric isomorphism\n$I : H \\to \\ell_2$. We define $V := \\rho(X)$, see also the diagram in (7). Since $\\rho$ is continuous, $V$ is\nthe image of a compact set under a continuous map, and thus $V$ is compact and the inverse of\nthe bijective map $I \\circ \\rho : X \\to W$ is continuous. Consequently, there is a one-to-one relationship\nbetween the continuous functions $f_X$ on $X$ and the continuous functions $f_W$ on $W$, namely $C(X) = C(W) \\circ I \\circ \\rho$,\nsee also the discussion following (7). Moreover, the fact that $I : H \\to \\ell_2$ is an\nisometric isomorphism yields $\\langle I(\\rho(x)), I(\\rho(x')) \\rangle_{\\ell_2} = \\langle \\rho(x), \\rho(x') \\rangle_H$ for all $x, x' \\in X$, and hence\nthe kernel $k$ considered in Theorem 2.2 is of the form $k_X = k_W(I \\circ \\rho(\\cdot), I \\circ \\rho(\\cdot))$, where $k_W$\nis the corresponding kernel defined on $W \\subset \\ell_2$ considered in Theorem 2.1. Now, the discussion\nfollowing (7) showed $H_X = H_W \\circ I \\circ \\rho$. 
Consequently, if we fix a function $g \\in C(X)$, then\n$f := g \\circ \\rho^{-1} \\circ I^{-1} \\in C(W)$ can be approximated by $H_W$ since $k_W$ is universal, that is, for all $\\varepsilon > 0$, there exists an\n$h \\in H_W$ such that $\\|h - f\\|_\\infty \\leq \\varepsilon$. Since $I \\circ \\rho : X \\to W$ is bijective and $f \\circ I \\circ \\rho = g$, we conclude\nthat $\\|h \\circ I \\circ \\rho - g\\|_\\infty \\leq \\varepsilon$. Now the assertion follows from $h \\circ I \\circ \\rho \\in H_X$.\n\nReferences\n[1] H. Bauer. Measure and Integration Theory. De Gruyter, Berlin, 2001.\n[2] P. Billingsley. Convergence of probability measures. John Wiley & Sons, New York, 2nd edition, 1999.\n[3] N. Bourbaki. Integration I. Chapters 1-6. Springer, Berlin, 2004. Translated from the 1959, 1965, and 1967 French originals by S.K. Berberian.\n[4] A. Caponnetto, C.A. Micchelli, M. Pontil, and Y. Ying. Universal multi-task kernels. J. Mach. Learn. Res., 9:1615\u20131646, 2008.\n[5] O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, 10:1055\u20131064, 1999.\n[6] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, 2002.\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73\u201399, 2005.\n[8] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel Dimension Reduction in Regression. Ann. Statist., 37:1871\u20131905, 2009.\n[9] K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Sch\u00f6lkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473\u2013480. 2009.\n[10] R. Hable and A. Christmann. Qualitative robustness of support vector machines. arXiv:0912.0874v1, 2009.\n[11] M. Hein and O. Bousquet. 
Kernels, associated structures and generalizations. Technical report, Max-Planck-Institute for Biological Cybernetics, 2004.\n[12] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Z. Ghahramani and R. Cowell, editors, AISTATS, pages 136\u2013143, 2005.\n[13] M. Hein, O. Bousquet, and B. Sch\u00f6lkopf. Maximal margin classification for metric spaces. Journal of Computer and System Sciences, 71:333\u2013359, 2005.\n[14] M. Hein, T. N. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in SVM\u2019s. In C. E. Rasmussen, H. H. B\u00fclthoff, M. Giese, and B. Sch\u00f6lkopf, editors, Pattern Recognition, Proceedings of the 26th DAGM Symposium, pages 270\u2013277, Berlin, 2004. Springer.\n[15] T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer Academic Publishers, Boston, 2002.\n[16] J. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. J. Mach. Learn. Res., 6:129\u2013163, 2005.\n[17] A.F.T. Martins, N.A. Smith, E.P. Xing, P.M.Q. Aguiar, and M.A.T. Figueiredo. Nonextensive information theoretic kernels on measures. J. Mach. Learn. Res., 10:935\u2013975, 2009.\n[18] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651\u20132667, 2006.\n[19] K. R. Parthasarathy. Probability Measures on Metric Spaces. Academic Press, New York, 1967.\n[20] E. Parzen. On estimating of a probability density and mode. Ann. Math. Statist., 35:1065\u20131076, 1962.\n[21] A. Pinkus. Strictly positive definite functions on a real inner product space. Adv. Comput. Math., 20:263\u2013271, 2004.\n[22] B. Sch\u00f6lkopf, A. J. Smola, and K.-R. M\u00fcller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299\u20131319, 1998.\n[23] B. Sch\u00f6lkopf, K. Tsuda, and J. P. Vert. Kernel Methods in Computational Biology. 
MIT Press, Cambridge, MA, 2004.\n[24] D. Scott. Averaged shifted histograms: Effective nonparametric density estimation in several dimensions. Ann. Statist., 13:1024\u20131040, 1985.\n[25] C. Scovel, D. Hush, I. Steinwart, and J. Theiler. Radial kernels and their reproducing kernel Hilbert spaces. Journal of Complexity, 2010, to appear.\n[26] A.J. Smola, A. Gretton, L. Song, and B. Sch\u00f6lkopf. A Hilbert Space Embedding for Distributions. In E. Takimoto, editor, Algorithmic Learning Theory, Lecture Notes on Computer Science. Springer, 2007. Proceedings of the 10th International Conference on Discovery Science, 40-41.\n[27] B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Sch\u00f6lkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750\u20131758. 2009.\n[28] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. On the relation between universality, characteristic kernels and RKHS embeddings of measures. In Yee Whye Teh and M. Titterington, editors, AISTATS 2010, Proc. of the 13th International Conference on Artificial Intelligence and Statistics, volume 9, pages 773\u2013780. 2010.\n[29] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, characteristic kernels and RKHS embeddings of measures. arXiv:1003.0887v1, 2010.\n[30] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67\u201393, 2001.\n[31] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.\n[32] I. Steinwart, D. Hush, and C. Scovel. Function classes that approximate the Bayes risk. 
In COLT\u201906, 19th Conference on Learning Theory, Pittsburgh, 2006.\n", "award": [], "sourceid": 966, "authors": [{"given_name": "Andreas", "family_name": "Christmann", "institution": null}, {"given_name": "Ingo", "family_name": "Steinwart", "institution": null}]}