{"title": "Spectral Regularization for Support Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 495, "abstract": "In this paper we consider the problem of learning from data the support of a probability distribution when the distribution {\\em does not} have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call {\\em ``completely regular''}. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.", "full_text": "Spectral Regularization for Support Estimation\n\nErnesto De Vito\n\nDSA, Univ. di Genova, and\n\nINFN, Sezione di Genova, Italy\n\nLorenzo Rosasco\n\nCBCL - MIT, - USA, and\n\nIIT, Italy\n\ndevito@dima.ungie.it\n\nlrosasco@mit.edu\n\nAlessandro Toigo\n\nPolitec. di Milano, Dept. of Math., and\n\nINFN, Sezione di Milano, Italy\n\ntoigo@ge.infn.it\n\nAbstract\n\nIn this paper we consider the problem of learning from data the support of a prob-\nability distribution when the distribution does not have a density (with respect to\nsome reference measure). We propose a new class of regularized spectral esti-\nmators based on a new notion of reproducing kernel Hilbert space, which we call\n\u201ccompletely regular\u201d. Completely regular kernels allow to capture the relevant\ngeometric and topological properties of an arbitrary probability space. 
In partic-\nular, they are the key ingredient to prove the universal consistency of the spectral\nestimators and in this respect they are the analogue of universal kernels for su-\npervised problems. Numerical experiments show that spectral estimators compare\nfavorably to state of the art machine learning algorithms for density support esti-\nmation.\n\n1 Introduction\n\nIn this paper we consider the problem of estimating the support of an arbitrary probability distribu-\ntion and we are more broadly motivated by the problem of learning from complex high dimensional\ndata. The general intuition that allows us to tackle these problems is that, though the initial repre-\nsentation of the data is often very high dimensional, in most situations the data are not uniformly\ndistributed, but are in fact confined to a small (possibly low dimensional) region. Making such an\nintuition rigorous is the key towards designing effective algorithms for high dimensional learning.\nThe problem of estimating the support of a probability distribution is of interest in a variety of ap-\nplications such as anomaly/novelty detection [8], or surface modeling [16]. From a theoretical point\nof view the problem has usually been considered in the setting where the probability distribution has\na density with respect to a known measure (for example the Lebesgue measure in R^d or the volume\nmeasure on a manifold). Among others we mention [22, 5] and references therein. Algorithms in-\nspired by Support Vector Machines (SVM), often called one-class SVM, have been proposed; see\n[17, 20] and references therein. Another kernel method, related to the one we discuss in this paper, is\npresented in [11]. More generally, one of the main approaches to learning from high dimensional data is\nthe one considered in manifold learning. In this context the data are assumed to lie on a low dimen-\nsional Riemannian sub-manifold embedded (that is represented) in a high dimensional Euclidean\nspace. 
This framework inspired algorithms to solve a variety of problems such as: semisupervised\nlearning [3], clustering [23], data parameterization/dimensionality reduction [15, 21], to name a few.\nThe basic assumption underlying manifold learning is often too restrictive to describe real data and\nthis motivates considering other models, such as the setting where the data are assumed to be es-\nsentially concentrated around a low dimensional manifold as in [12], or can be modeled as samples\nfrom a metric space as in [10].\n\n1\n\n\fIn this paper we consider a general scenario (see [18]) where the underlying model is a probability\nspace (X, \u03c1) and we are given a (similarity) function K which is a reproducing kernel. The available\ntraining set is an i.i.d sample x1, . . . , xn \u223c \u03c1. The geometry (and topology) in (X, \u03c1) is de\ufb01ned by\nthe kernel K. While this framework is abstract and poses new challenges, by assuming the similarity\nfunction to be a reproducing kernel we can make full use of the good computational properties of\nkernel methods and the powerful theory of reproducing kernel Hilbert spaces (RKHS) [2]. Interest-\ningly, the idea of using a reproducing kernel K to construct a metric on a set X is originally due to\nSchoenberg (see for example [4]).\nBroadly speaking, in this setting we consider the problem of \ufb01nding a model of the smallest region\nX\u03c1 containing all the data. A rigorous formalization of this problem requires: 1) de\ufb01ning the region\nX\u03c1, 2) specifying the sense in which we model X\u03c1. This can be easily done if the probability distri-\nbution has density p with respect to a known measure, in fact X\u03c1 = {x \u2208 X : p(x) > 0}, but is\notherwise a challenging question for a general distribution. Intuitively, X\u03c1 can be thought of as the\nregion where the distribution is concentrated, that is \u03c1(X\u03c1) = 1. However, there are many different\nsets having this property. 
If X is Rd (in fact any topological space), a natural candidate to de\ufb01ne the\nregion of interest, is the notion of support of a probability distribution\u2013 de\ufb01ned as the intersection\nof the closed subsets C of X, such that \u03c1(C) = 1. In an arbitrary probability space the support of\nthe measure is not well de\ufb01ned since no topology is given.\nThe reproducing kernel K provides a way to solve this problem and also suggests a possible ap-\nproach to model X\u03c1. The \ufb01rst idea is to use the fact that under mild assumptions the kernel de\ufb01nes\na metric on X [18], so that the concept of closed set, hence that of support, is well de\ufb01ned. The\nsecond idea is to use the kernel to construct a function F\u03c1 such that the level set corresponding to\none is exactly the support X\u03c1\u2013 in this case we say that the RKHS associated to K separates the\nsupport X\u03c1. By doing this we are in fact imposing an assumption on X\u03c1: given a kernel K, we can\nonly separate certain sets. More precisely, our contribution is two-fold.\n\n\u2022 We prove that F\u03c1 is uniquely de\ufb01ned by the null space of the integral operator associated\nto K. Given that the integral operator (and its spectral properties) can be approximated\nstudying the kernel matrix on a sample, this result suggests a way to estimate the support\nempirically. However, a further complication arises from the fact that in general zero is\nnot an isolated point of the spectrum, so that the estimation of a null space is an ill-posed\nproblem (see for example [9]). Then, a regularization approach is needed in order to \ufb01nd a\nstable (hence generalizing) estimator. 
In this paper, we consider a spectral estimator based\non a spectral regularization strategy, replacing the kernel matrix with its regularized version\n(Tikhonov regularization being one example).\n\n\u2022 We introduce the notion of completely regular RKHS, that answer positively to the ques-\ntion whether there exist kernels that can separate the support of any distribution. Examples\nof completely regular kernels are presented and results suggesting how they can be con-\nstructed are given. The concept of completely regular RKHS plays a role similar to the\nconcept of universal kernels in supervised learning, for example see [19].\n\nFinally, given the above results, we show that the regularized spectral estimator enjoys a universal\nconsistency property: the correct support can be asymptotically recovered for any problem (that is\nany probability distribution).\nThe plan of the paper is as follows. In Section 2 we introduce the notion of completely regular\nkernels and their basic properties. In Section 3 we present the proposed regularized algorithms. In\nSection 4 and 5 we provide a theoretical and empirical analysis, respectively. Proofs and further\ndevelopment can be found in the supplementary material.\n\n2 Completely regular reproducing kernel Hilbert spaces\n\nIn this section we introduce the notion of a completely regular reproducing kernel Hilbert space.\nSuch a space de\ufb01nes a geometry on a measurable space X which is compatible with the measurable\nstructure. Furthermore it shows how to de\ufb01ne a function F such that the one level set is the support\nof the probability distribution. The function is determined by the spectral projection associated with\nthe null eigenvalue of the integral operator de\ufb01ned by the reproducing kernel. All the proofs of this\nsection are reported in the supplementary material.\n\n2\n\n\fWe assume X to be a measurable space with a probability measure \u03c1. 
We fix a complex\u00b9 reproducing\nkernel Hilbert space H on X with a reproducing kernel K : X \u00d7 X \u2192 C [2]. The scalar product\nand the norm are denoted by \u27e8\u00b7,\u00b7\u27e9, linear in the first argument, and \u2016\u00b7\u2016, respectively. For all x \u2208 X,\nKx \u2208 H denotes the function K(\u00b7, x). For each function f \u2208 H, the reproducing property f (x) =\n\u27e8f, Kx\u27e9 holds for all x \u2208 X. When different reproducing kernel Hilbert spaces are considered, we\ndenote by HK the reproducing kernel Hilbert space with reproducing kernel K. Before giving the\ndefinition of completely regular RKHS, which is the key concept presented in this section, we need\nsome preliminary definitions and results.\nDefinition 1. A subset C \u2282 X is separated by H if, for any x0 \u2209 C, there exists f \u2208 H such that\n\n$$f(x_0) \\neq 0 \\quad \\text{and} \\quad f(x) = 0 \\quad \\forall x \\in C. \\qquad (1)$$\n\nFor example, if X = R^d and H is the reproducing kernel Hilbert space with linear kernel K(x, t) =\nx \u00b7 t, the sets separated by H are precisely the hyperplanes containing the origin. In Eq. (1) the\nfunction f depends on x0 and C, but Proposition 1 below will show that there is a function, possibly\nnot in H, whose one level set is precisely C (if K(x, x) = 1). Note that in [19] a different notion\nof separating property is given.\nWe need some further notation. For any set C, let PC : H \u2192 H be the orthogonal projection onto\nthe closure of the linear space generated by {Kx | x \u2208 C}, so that $P_C^2 = P_C$, $P_C^* = P_C$ and\n\n$$\\ker P_C = \\{K_x \\mid x \\in C\\}^{\\perp} = \\{f \\in H \\mid f(x) = 0, \\ \\forall x \\in C\\}.$$\n\nMoreover let FC : X \u2192 C be defined by FC (x) = \u27e8PC Kx, Kx\u27e9.\nProposition 1. 
For any subset C \u2282 X, the following facts are equivalent\n\n(i) the set C is separated by H;\n(ii) for all x \u2209 C, Kx \u2209 Ran PC;\n(iii) C = {x \u2208 X | FC (x) = K(x, x)}.\n\nIf one of the above conditions is satisfied, then K(x, x) \u2260 0 \u2200x \u2209 C.\nA natural and minimal requirement on H is to be able to separate any pair of distinct points and\nthis implies that Kx \u2260 Kt if x \u2260 t and K(x, x) \u2260 0. The first condition ensures the metric given\nby\n\n$$d_K(x, t) = \\|K_x - K_t\\|, \\qquad x, t \\in X, \\qquad (2)$$\n\nto be well defined. Then (X, dK) is a metric space and the sets separated by H are always dK-\nclosed, see Prop. 2 below. This last property is not enough to ensure that we can evaluate \u03c1 on the\nsets separated by the RKHS H. In fact the \u03c3-algebra generated by the metric dK might not be contained in\nthe \u03c3-algebra on X. The next result shows that assuming the kernel to be measurable is enough to\nsolve this problem.\nProposition 2. Assume that Kx \u2260 Kt if x \u2260 t, then the sets separated by H are closed with respect\nto dK. Moreover, if H is separable and the kernel is measurable, then the sets separated by H are\nmeasurable.\n\nGiven the above premises, the following is the key definition that characterizes the reproducing\nkernel Hilbert spaces which are able to separate the largest family of subsets of X.\nDefinition 2 (Completely Regular RKHS). A reproducing kernel Hilbert space H with reproducing\nkernel K such that Kx \u2260 Kt if x \u2260 t is called completely regular if H separates all the subsets\nC \u2282 X which are closed with respect to the metric (2).\nThe term completely regular is borrowed from topology, where a topological space is called com-\npletely regular if, for any closed subset C and any point x0 \u2209 C, there exists a continuous function f\nsuch that f (x0) \u2260 0 and f (x) = 0 for all x \u2208 C. 
In the supplementary material, several examples of\ncompletely regular reproducing kernel Hilbert spaces are given, as well as a discussion on how such\nspaces can be constructed. A particular case is when X is already a metric space with a distance\nfunction dX. If K is continuous with respect to dX, the assumption of complete regularity forces\nthe metrics dK and dX to have the same closed subsets. Then, the supports defined by dK and dX\nare the same. Furthermore, since the closed sets of X are independent of H, the complete regularity\nof H can be proved by showing that a suitable family of bump\u00b2 functions is contained in H.\n\n\u00b9 Considering complex valued RKHS allows us to use the theory of the Fourier transform; for practical problems\nwe can simply consider real valued kernels.\n\nCorollary 1. Let X be a separable metric space with respect to a metric dX. Assume that the kernel\nK is a continuous function with respect to dX and that the space H separates every subset C which\nis closed with respect to dX. Then\n\n(i) The space H is separable and K is measurable with respect to the Borel \u03c3-algebra generated by dX.\n\n(ii) The metric dK defined by (2) is equivalent to dX, that is, a set is closed with respect to dK\nif and only if it is closed with respect to dX.\n\n(iii) The space H is completely regular.\n\nAs a consequence of the above result, many classical reproducing kernel Hilbert spaces are com-\npletely regular. For example, if X = R^d and H is the Sobolev space of order s with s > d/2, then H\nis completely regular. This is due to the fact that the space of smooth compactly supported functions\nis contained in H. In fact, a standard result of analysis ensures that, for any closed set C and any\nx0 \u2209 C there exists a smooth bump function f such that f (x0) = 1 and its support is contained in\nthe complement of C. 
Interestingly enough, if H is the reproducing kernel Hilbert space with the\nGaussian kernel, it is known that the elements of H are analytic functions, see Cor. 4.44 in [19].\nClearly, H cannot be completely regular. Indeed, if C is a closed subset of R^d with nonempty inte-\nrior and f \u2208 H is such that f (x) = 0 for all x \u2208 C, a standard result of complex analysis implies\nthat f (x) = 0 for every x \u2208 R^d. Finally, the next result shows that the reproducing kernel can be\nnormalized to one on the diagonal under the mild assumption that K(x, x) \u2260 0 for all x \u2208 X.\nLemma 1. Assume that K(x, x) > 0 for all x \u2208 X. Then the reproducing kernel Hilbert space\nwith the normalized kernel\n\n$$K'(x, t) = \\frac{K(x, t)}{\\sqrt{K(x, x) K(t, t)}}$$\n\nseparates the same sets as H.\n\nFinally we briefly mention some examples and refer to the supplementary material for further de-\nvelopments. In particular, we prove that both the Laplacian kernel $K(x, y) = e^{-\\frac{\\|x - y\\|_2}{\\sqrt{2}\\sigma}}$\nand the \u21131-exponential kernel $K(x, y) = e^{-\\frac{\\|x - y\\|_1}{\\sqrt{2}\\sigma}}$\ndefined on R^d are completely regular for any \u03c3 > 0 and\nd \u2208 N.\n\n3 Spectral Algorithms for Learning the Support\n\nIn this section, we first discuss our framework and our main assumptions. Then we present the\nproposed regularized spectral algorithms.\nMotivated by the results in the previous section, we describe our framework which is given by a triple\n(X, \u03c1, K). We consider a probability space (X, \u03c1) and a training set x = (x1, . . . , xn) sampled\ni.i.d. with respect to \u03c1. Moreover we consider a reproducing kernel K satisfying the following\nassumption.\nAssumption 1. The reproducing kernel K is measurable and K(x, x) = 1, for all x \u2208 X. Moreover\nK defines a completely regular and separable RKHS H.\nWe endow X with the metric dK defined in (2), so that X becomes a separable metric space. 
The\nassumption of complete regularity ensures that any closed subset is separated by H and, hence, is\nmeasurable by Prop. 2. Then we can define the support X\u03c1 of the measure \u03c1, as the intersection of\nall the closed sets C \u2282 X, such that \u03c1(C) = 1. Clearly X\u03c1 is closed and \u03c1(X\u03c1) = 1 (note that this\nlast property depends on the separability of X, hence of H).\nSummarizing the key result in the previous section, under the above assumptions, X\u03c1 is the one level\nset of the function F\u03c1 : X \u2192 [0, 1]\n\n$$F_\\rho(x) = \\langle P_\\rho K_x, K_x \\rangle,$$\n\nwhere P\u03c1 is a short notation for $P_{X_\\rho}$. Since F\u03c1 depends on the unknown measure \u03c1, in practice\nit cannot be explicitly calculated. To design an effective empirical estimator we develop a novel\ncharacterization of the support of an arbitrary distribution that we describe in the next section.\n\n\u00b2 Given an open subset U and a compact subset C \u2282 U, a bump function is a continuous compactly supported\nfunction which is one on C and whose support is contained in U.\n\n3.1 A New Characterization of the Support\n\nThe key observation towards defining a learning algorithm to estimate X\u03c1 is that the projection P\u03c1\ncan be expressed in terms of the integral operator defined by the kernel K.\nTo see this, for all x \u2208 X, let Kx \u2297 Kx denote the rank one positive operator on H, given by\n\n$$(K_x \\otimes K_x)(f) = \\langle f, K_x \\rangle K_x = f(x) K_x, \\qquad f \\in H.$$\n\nMoreover, let T : H \u2192 H be the linear operator defined as\n\n$$T = \\int_X K_x \\otimes K_x \\, d\\rho(x),$$\n\nwhere the integral converges in the Hilbert space of Hilbert-Schmidt operators on H (see for example\n[7] for the proof). 
Using the reproducing property in H [2], it is straightforward to see that T is\nsimply the integral operator with kernel K with domain and range in H.\nThen, one can easily see that the null space of T is precisely (I \u2212 P\u03c1)H, so that\n\n$$P_\\rho = T^\\dagger T, \\qquad (3)$$\n\nwhere $T^\\dagger$ is the pseudo-inverse of T (see for example [9]). Hence\n\n$$F_\\rho(x) = \\langle T^\\dagger T K_x, K_x \\rangle.$$\n\nObserve that in general Kx does not belong to the domain of $T^\\dagger$ and, if \u03b8 denotes the Heaviside\nfunction with \u03b8(0) = 0, then spectral theory gives that $P_\\rho = T^\\dagger T = \\theta(T)$. The above observation\nis crucial as it gives a new characterization of the support of \u03c1 in terms of the null space of T and\nthe latter can be estimated from data.\n\n3.2 Spectral Regularization Algorithms\n\nFinally, in this section, we describe how to construct an estimator Fn of F\u03c1. As we mentioned above,\nEq. (3) suggests a possible way to learn the projection from finite data. In fact, we can consider the\nempirical version of the integral operator associated to K which is simply defined by\n\n$$T_n = \\frac{1}{n} \\sum_{i=1}^{n} K_{x_i} \\otimes K_{x_i}.$$\n\nThe latter operator is an unbiased estimator of T. Indeed, since Kx \u2297 Kx is a bounded random\nvariable into the separable Hilbert space of Hilbert-Schmidt operators, one can use concentration\ninequalities for random variables in Hilbert spaces to prove that\n\n$$\\lim_{n \\to +\\infty} \\frac{\\sqrt{n}}{\\log n} \\|T - T_n\\|_{HS} = 0 \\quad \\text{almost surely}, \\qquad (4)$$\n\nwhere $\\|\\cdot\\|_{HS}$ is the Hilbert-Schmidt norm (see for example [14] for a short proof). However, in\ngeneral $T_n^\\dagger T_n$ does not converge to $T^\\dagger T$ since 0 is an accumulation point of the spectrum of T or,\nequivalently, since $T^\\dagger$ is not a bounded operator. 
Hence, a regularization approach is needed.\nIn this paper we study a spectral filtering approach which replaces $T_n^\\dagger$ with an approximation g\u03bb(Tn)\nobtained filtering out the components corresponding to the small eigenvalues of Tn. The function g\u03bb\nis defined by spectral calculus. More precisely if Tn = \u2211j \u03c3j vj \u2297 vj is a spectral decomposition of\nTn, then g\u03bb(Tn) = \u2211j g\u03bb(\u03c3j) vj \u2297 vj. Spectral regularization defined by linear filters is classical in\nthe theory of inverse problems [9]. Intuitively, g\u03bb(Tn) is an approximation of the generalized inverse\n$T_n^\\dagger$ and it is such that the approximation gets better, but the condition number of g\u03bb(Tn) gets worse\nas \u03bb decreases. More formally these properties are captured by the following set of conditions.\nAssumption 2. For \u03c3 \u2208 [0, 1], let r\u03bb(\u03c3) := \u03c3g\u03bb(\u03c3), then\n\n\u2022 r\u03bb(\u03c3) \u2208 [0, 1], \u2200\u03bb > 0,\n\u2022 lim\u03bb\u21920 r\u03bb(\u03c3) = 1, \u2200\u03c3 > 0,\n\u2022 |r\u03bb(\u03c3) \u2212 r\u03bb(\u03c3\u2032)| \u2264 L\u03bb|\u03c3 \u2212 \u03c3\u2032|, \u2200\u03bb > 0, where L\u03bb is a positive constant depending on \u03bb.\n\nExamples of algorithms that fall into the above class include iterative methods, akin to boosting,\n$g_\\lambda(\\sigma) = \\sum_{k=0}^{m_\\lambda} (1 - \\sigma)^k$, spectral cut-off $g_\\lambda(\\sigma) = \\frac{1}{\\sigma} 1_{\\sigma > \\lambda}(\\sigma) + \\frac{1}{\\lambda} 1_{\\sigma \\leq \\lambda}(\\sigma)$, and Tikhonov regularization\n$g_\\lambda(\\sigma) = \\frac{1}{\\sigma + \\lambda}$. We refer the reader to [9] for more details and examples, and, given the space\nconstraints, will focus mostly on Tikhonov regularization in the following.\nFor a chosen filter, the regularized empirical estimator of F\u03c1 can be defined by\n\n$$F_n(x) = \\langle g_\\lambda(T_n) T_n K_x, K_x \\rangle. \\qquad (5)$$\n\nOne can see that the computation of Fn reduces to solving a simple finite dimensional problem\ninvolving the empirical kernel matrix defined by the training data. Towards this end, it is useful to\nintroduce the sampling operator $S_n : H \\to \\mathbb{C}^n$ defined by Snf = (f (x1), . . . , f (xn)), f \u2208 H,\nwhich can be interpreted as the restriction operator which evaluates functions in H on the training set\npoints. The adjoint $S_n^* : \\mathbb{C}^n \\to H$ of Sn is given by $S_n^* \\alpha = \\sum_{i=1}^{n} \\alpha_i K_{x_i}$, $\\alpha = (\\alpha_1, \\ldots, \\alpha_n) \\in \\mathbb{C}^n$,\nand can be interpreted as the out-of-sample extension operator. A simple computation shows that\n$T_n = \\frac{1}{n} S_n^* S_n$ and $S_n S_n^* = K_n$ is the n by n kernel matrix, where the (i, j)-entry is K(xi, xj).\nThen it is easy to see that $g_\\lambda(T_n) T_n = g_\\lambda(S_n^* S_n / n) S_n^* S_n / n = \\frac{1}{n} S_n^* g_\\lambda(K_n / n) S_n$, so that\n\n$$F_n(x) = \\frac{1}{n} k_x^T g_\\lambda(K_n / n) k_x, \\qquad (6)$$\n\nwhere kx is the n-dimensional column vector kx = SnKx = (K(x1, x), . . . , K(xn, x)). Note that\nEquation (6) plays the role of a representer theorem for the spectral estimator, in the sense that it\nreduces the problem of finding an estimator in an infinite dimensional space to a finite dimensional\nproblem.\n\n4 Theoretical Analysis: Universal Consistency\n\nIn this section we study the consistency property of spectral estimators. All the proofs of this section\nare reported in the supplementary material. 
We prove the results only for the filter corresponding to\nthe classical Tikhonov regularization though the same results hold for the class of spectral filters de-\nscribed by Assumption 2. To study the consistency of the methods we need to choose an appropriate\nperformance measure to compare Fn and F\u03c1. Note that there is no natural notion of risk, since we\nhave to compute the function on and off the support. Also note that the standard metrics used for support\nestimation (see for example [22, 5]) cannot be used in our analysis since they rely on the existence\nof a reference measure \u00b5 (usually the Lebesgue measure) and the assumption that \u03c1 is absolutely\ncontinuous with respect to \u00b5.\nThe following preliminary result shows that we can control the convergence of the Tikhonov esti-\nmator Fn, defined by $g_{\\lambda_n}(T_n) = (T_n + \\lambda_n I)^{-1}$, to F\u03c1 uniformly on any compact set of X, provided\na suitable sequence \u03bbn.\nTheorem 1. Let Fn be the estimator defined by Tikhonov regularization and choose a sequence \u03bbn\nso that\n\n$$\\lim_{n \\to \\infty} \\lambda_n = 0 \\quad \\text{and} \\quad \\limsup_{n \\to \\infty} \\frac{\\log n}{\\lambda_n \\sqrt{n}} < +\\infty, \\qquad (7)$$\n\nthen\n\n$$\\lim_{n \\to +\\infty} \\sup_{x \\in C} |F_n(x) - F_\\rho(x)| = 0 \\quad \\text{almost surely}, \\qquad (8)$$\n\nfor every compact subset C of X.\n\nWe add three comments. First, we note that, as we mentioned before, Tikhonov regularization\ncan be replaced by a large class of filters. Second, we observe that a natural choice would be the\nregularization defined by kernel PCA [11], which corresponds to truncating the generalized inverse\nof the kernel matrix at some cutoff parameter \u03bb. However, one can show that, in general, in this case\nit is not possible to choose \u03bb so that the sample error goes to zero. 
In fact, for KPCA the sample\nerror depends on the gap between the M-th and the (M + 1)-th eigenvalues of T [1], where the M-th\nand (M + 1)-th eigenvalues are those around the cutoff parameter. Such a gap can go to zero with an\narbitrary rate so that there exists no choice of the cut-off parameter ensuring convergence to zero\nof the sample error. Third, we note that the uniform convergence of Fn to F\u03c1 on compact subsets\ndoes not imply the convergence of the level sets of Fn to the corresponding level sets of F\u03c1, for\nexample with respect to the standard Hausdorff distance among closed subsets. In practice to have\nan effective decision rule, an off-set parameter \u03c4n can be introduced and the level set is replaced by\nXn = {x \u2208 X | Fn(x) \u2265 1 \u2212 \u03c4n} \u2013 recall that Fn takes values in [0, 1]. The following result will\nshow that for a suitable choice of \u03c4n the Hausdorff distance between Xn \u2229 C and X\u03c1 \u2229 C goes to\nzero for all compact sets C. We recall that the Hausdorff distance between two subsets A, B \u2282 X is\n\n$$d_H(A, B) = \\max\\{\\sup_{a \\in A} d_K(a, B), \\ \\sup_{b \\in B} d_K(b, A)\\}.$$\n\nTheorem 2. If the sequence (\u03c4n)n\u2208N converges to zero in such a way that\n\n$$\\limsup_{n \\to \\infty} \\frac{\\sup_{x \\in C} |F_n(x) - F_\\rho(x)|}{\\tau_n} \\leq 1 \\quad \\text{almost surely}, \\qquad (9)$$\n\nthen\n\n$$\\lim_{n \\to +\\infty} d_H(X_n \\cap C, X_\\rho \\cap C) = 0 \\quad \\text{almost surely},$$\n\nfor any compact subset C.\n\nWe add two comments. First, it is possible to show that, if the (normalized) kernel K is such that\nlimx\u2032\u2192\u221e Kx(x\u2032) = 0 for any x \u2208 X \u2013 as it happens for the Laplacian kernel \u2013 then Theorems 1\nand 2 also hold by choosing C = X. Second, note that the choice of \u03c4n depends on the rate of\nconvergence of Fn to F\u03c1 which will itself depend on some a-priori assumption on \u03c1. 
Developing\nlearning rates and finite sample bounds is a key question that we will tackle in future work.\n\n5 Empirical Analysis\n\nIn this section we describe some preliminary experiments aimed at testing the properties and the\nperformances of the proposed methods both on simulated and real data. Again for space constraints\nwe will only discuss spectral algorithms induced by Tikhonov regularization. Note that while com-\nputations can be made efficient in several ways, we consider a simple algorithmic protocol and leave\na more refined computational study for future work. Following the discussion in the last section,\nTikhonov regularization defines an estimator $F_n(x) = k_x^T (K_n + n \\lambda I)^{-1} k_x$ and a point is labeled\nas belonging to the support if Fn(x) \u2265 1 \u2212 \u03c4. The computational cost for the algorithm is, in the\nworst case, of order n^3, like standard regularized least squares, for training and order N n^2 if we\nhave to predict the value of Fn at N test points. In practice, one has to choose a good value for the\nregularization parameter \u03bb and this requires computing multiple solutions, a so called regularization\npath. As noted in [13], if we form the inverse using the eigendecomposition of the kernel matrix the\nprice of computing the full regularization path is essentially the same as that of computing a single\nsolution (note that the cost of the eigen-decomposition of Kn is also of order n^3 though the constant\nis worse). This is the strategy that we consider in the following. In our experiments we consid-\nered two data-sets: the MNIST data-set and the CBCL face database. For the digits we considered\na reduced set consisting of a training set of 5000 images and a test set of 1000 images. In the first\nexperiment we trained on 500 images for the digit 3 and tested on 200 images of digits 3 and 8. 
Each\nexperiment consists of training on one class and testing on two different classes and was repeated\nfor 20 trials over different training set choices. The performance is evaluated by computing the ROC curve\n(and the corresponding AUC value) for varying \u03c4, \u03c4\u2032, \u03c4\u2032\u2032. For all our experiments we considered the\nLaplacian kernel. Note that, in this case, the algorithm requires choosing 3 parameters: the regular-\nization parameter \u03bb, the kernel width \u03c3 and the threshold \u03c4. In supervised learning cross validation\nis typically used for parameter tuning, but cannot be used in our setting since support estimation is\nan unsupervised problem. Then, we considered the following heuristics. The kernel width is cho-\nsen as the median of the distribution of distances of the K-th nearest neighbor of each training set\npoint for K = 10. Having fixed the kernel width, we choose the regularization parameter in correspondence\nof the maximum curvature in the eigenvalue behavior (see Figure 1), the rationale being that after this\nvalue the eigenvalues are relatively small. For comparison we considered a Parzen window density\nestimator and one-class SVM (1CSVM) as implemented by [6]. 
For the Parzen window estimator\nwe used the same kernel used in the spectral algorithm, that is the Laplacian kernel, and use the\nsame width used in our estimator.\n\n[Figure 1 plots omitted; horizontal axis: Eigenvalues Index, vertical axis: Eigenvalues Magnitude, legend: Eigenvalues Decay, Regularization Parameter.]\n\nFigure 1: Decay of the eigenvalues of the kernel matrix ordered in decreasing magnitude and corre-\nsponding regularization parameter (Left) and a detail of the first 50 eigenvalues (Right).\n\nGiven a kernel width an estimate of the probability distribution\nis computed and can be used to estimate the support by fixing a threshold \u03c4\u2032. For the one-class\nSVM we considered the Gaussian kernel, so that we have to fix the kernel width and a regularization\nparameter \u03bd. We fix the kernel width to be the same used by our estimator and fixed \u03bd = 0.9. For\nthe sake of comparison, also for one-class SVM we considered a varying offset \u03c4\u2032\u2032. The ROC curves\non the different tasks are reported (for one of the trials) in Figure 2, Left. The mean and standard\ndeviation of the AUC for the 3 methods are reported in Table 1. Similar experiments were repeated\nconsidering other pairs of digits, see Table 1. Also in the case of the CBCL data set we considered\na reduced data-set consisting of 472 images for training and another 472 for test. 
On the different tests\nperformed on the MNIST data the spectral algorithm always achieves results which are better, and\noften substantially better, than those of the other methods. On the CBCL dataset the one-class SVM provides the\nbest result, but the spectral algorithm still provides competitive performance.\n\n6 Conclusions\n\nIn this paper we presented a new approach to estimating the support of an arbitrary probability\ndistribution. Unlike previous work, we drop the assumption that the distribution has a density with respect\nto a (known) reference measure and consider a general probability space. To handle this more general\nsetting we introduce a new notion of RKHS, which we call completely regular, that captures the relevant\ngeometric properties of the probability distribution. The support of the distribution can then be\ncharacterized as the null space of the integral operator defined by the kernel and can be estimated\nusing a spectral filtering approach. The proposed estimators are proven to be universally consistent\nand have good empirical performance on some benchmark data-sets. 
Future work will be devoted\nto deriving finite sample bounds, to developing strategies to scale up the algorithms to massive data-sets,\nand to a more extensive experimental analysis.\n\n[Figure 2 plots: three ROC panels titled MNIST 9vs4, MNIST 1vs7 and CBCL; axes: FalsePos (horizontal), TruePos (vertical); legend: Spectral, Parzen, OneClassSVM.]\n\nFigure 2: ROC curves for the different estimators in three different tasks: digit 9vs 4 (Left), digit 1vs 7\n(Center), CBCL (Right).\n\n          3vs 8            8vs 3            1vs 7                9vs 4            CBCL\nSpectral  0.8371 \u00b1 0.0056  0.7830 \u00b1 0.0026  0.9921 \u00b1 4.7283e-04  0.8651 \u00b1 0.0024  0.8682 \u00b1 0.0023\nParzen    0.7841 \u00b1 0.0069  0.7656 \u00b1 0.0029  0.9811 \u00b1 3.4158e-04  0.7244 \u00b1 0.0030  0.8778 \u00b1 0.0023\n1CSVM     0.7896 \u00b1 0.0061  0.7642 \u00b1 0.0032  0.9889 \u00b1 1.8479e-04  0.7535 \u00b1 0.0041  0.8824 \u00b1 0.0020\n\nTable 1: Average and standard deviation of the AUC for the different estimators on the considered\ntasks.\n\nReferences\n\n[1] P. M. Anselone. Collectively compact operator approximation theory and applications to\nintegral equations. Prentice-Hall Inc., Englewood Cliffs, N. J., 1971.\n\n[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337\u2013404, 1950.\n\n[3] M. Belkin, P. Niyogi, and V. Sindhwani. 
Manifold regularization: A geometric framework for\nlearning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399\u20132434, 2006.\n\n[4] C. Berg, J. Christensen, and P. Ressel. Harmonic analysis on semigroups, volume 100 of\nGraduate Texts in Mathematics. Springer-Verlag, New York, 1984.\n\n[5] G. Biau, B. Cadre, D. Mason, and B. Pelletier. Asymptotic normality in density support\nestimation. Electron. J. Probab., 14:no. 91, 2617\u20132635, 2009.\n\n[6] S. Canu, Y. Grandvalet, V. Guigue, and A. Rakotomamonjy. SVM and kernel methods Matlab\ntoolbox. Perception Syst\u00e8mes et Information, INSA de Rouen, Rouen, France, 2005.\n\n[7] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of\nintegrable functions and Mercer theorem. Anal. Appl. (Singap.), 4(4):377\u2013408, 2006.\n\n[8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv.,\n41(3):1\u201358, 2009.\n\n[9] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems, volume 375 of\nMathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.\n\n[10] M. Hein, O. Bousquet, and B. Sch\u00f6lkopf. Maximal margin classification for metric spaces.\nJournal of Computer and System Sciences, 71(3):333\u2013359, 10 2005.\n\n[11] H. Hoffmann. Kernel PCA for novelty detection. Pattern Recogn., 40(3):863\u2013874, 2007.\n\n[12] P. Niyogi, S. Smale, and S. Weinberger. A topological view of unsupervised learning from noisy\ndata. Preprint, Jan 2008.\n\n[13] R. Rifkin and R. Lippert. Notes on regularized least squares. Technical report, Massachusetts\nInstitute of Technology, 2007.\n\n[14] L. Rosasco, M. Belkin, and E. De Vito. On learning with integral operators. J. Mach. Learn.\nRes., 11:905\u2013934, 2010.\n\n[15] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding.\nScience, Jan 2000.\n\n[16] B. Sch\u00f6lkopf, J. Giesen, and S. Spalinger. 
Kernel methods for implicit surface modeling. In\nAdvances in Neural Information Processing Systems 17, pages 1193\u20131200, Cambridge, MA,\n2005. MIT Press.\n\n[17] B. Sch\u00f6lkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the support\nof a high-dimensional distribution. Neural Comput., 13(7):1443\u20131471, 2001.\n\n[18] S. Smale and D.X. Zhou. Geometry of probability spaces. Constr. Approx., 30(3):311\u2013323,\n2009.\n\n[19] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics.\nSpringer, New York, 2008.\n\n[20] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. J.\nMach. Learn. Res., 6:211\u2013232 (electronic), 2005.\n\n[21] J. Tenenbaum, V. Silva, and J. Langford. A global geometric framework for nonlinear\ndimensionality reduction. Science, Jan 2000.\n\n[22] A. B. Tsybakov. On nonparametric estimation of density level sets. Ann. Statist., 25(3):948\u2013969, 1997.\n\n[23] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.\n", "award": [], "sourceid": 781, "authors": [{"given_name": "Ernesto", "family_name": "Vito", "institution": null}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": null}, {"given_name": "Alessandro", "family_name": "Toigo", "institution": null}]}