{"title": "Measure Based Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1221, "page_last": 1228, "abstract": "", "full_text": "Measure Based Regularization

Olivier Bousquet, Olivier Chapelle, Matthias Hein

Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
{first.last}@tuebingen.mpg.de

Abstract

We address in this paper the question of how the knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.

1 Introduction

Most existing learning algorithms perform a trade-off between fit of the data and 'complexity' of the solution. The way this complexity is defined varies from one algorithm to the other and is usually referred to as a prior probability or a regularizer. The choice of this term amounts to having a preference for certain solutions, and there is no a priori best such choice since it depends on the learning problem to be addressed. This means that the right choice should be dictated by prior knowledge or assumptions about the problem or the class of problems to which the algorithm is to be applied. Let us consider the binary classification setting. A typical assumption that is (at least implicitly) used in many learning algorithms is the following:

    Two points that are close in input space should have the same label.

One possible way to enforce this assumption is to look for a decision function which is consistent with the training data and which does not change too much between neighboring points. This can be done in a regularization setting, using the Lipschitz norm as a regularizer.
For differentiable functions, the Lipschitz norm of a function is the supremum of the norm of the gradient. It is thus natural to consider algorithms of the form

    $\min_f \sup_x \|\nabla f(x)\|$  under constraints  $y_i f(x_i) \ge 1$.    (1)

Performing such a minimization on the set of linear functions leads to the maximum margin solution (since the gradient of $x \mapsto \langle w, x \rangle$ is $w$), whereas the 1-nearest-neighbor decision function is one of the solutions of the above optimization problem when the set of functions is unconstrained [13].

Although very useful because widely applicable, the above assumption is sometimes too weak. Indeed, most 'real-world' learning problems have more structure than what this assumption captures. For example, most data is located in regions where the label is constant (clusters), and regions where the label is not well-defined are typically of low density. This can be formulated via the so-called cluster assumption:

    Two points that are connected by a line that goes through high density regions should have the same label.

Another related way of stating this assumption is to say that the decision boundary should lie in regions of low density.

Our goal is to propose possible implementations of this assumption. It is important to notice that in the context of supervised learning, the knowledge of the joint probability P(x, y) is enough to achieve perfect classification (taking $\arg\max_y P(x, y)$ as decision function), while in semi-supervised learning, even if one knows the distribution P(x) of the instances, there is no unique or optimal way of using it. We will thus try to propose a principled approach to this problem.
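To make the opening example concrete, here is a minimal numerical sketch of problem (1) restricted to linear functions, where it reduces to the hard-margin SVM; the toy data points are hypothetical, and the generic SLSQP solver stands in for a dedicated QP solver.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data (purely illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# For linear f(x) = <w, x> + b the gradient is w everywhere, so
# sup_x ||grad f(x)|| = ||w||, and problem (1) becomes the hard-margin
# SVM: minimize ||w||^2 subject to y_i f(x_i) >= 1.
def objective(wb):
    return wb[:2] @ wb[:2]

constraints = [{"type": "ineq",
                "fun": lambda wb, i=i: y[i] * (X[i] @ wb[:2] + wb[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)  # all should end up >= 1
```

The same constrained problem over an unconstrained function class admits the 1-nearest-neighbor rule among its solutions, as noted above.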
A similar attempt was made in [10], but in a probabilistic context, where the decision function was modeled by a conditional probability distribution, while here we consider arbitrary real-valued functions and use the standard regularization approach.

We will use three methods for obtaining regularizers that depend on the distribution P(x) of the data. In section 2 we suggest to modify the regularizer in a general way by weighting it with the data density. Then in section 3 we adopt a geometric approach where we suggest to modify the distances in input space (in a local manner) to take into account the density (i.e. we stretch or blow up the space depending on the density). The third approach, presented in section 4, builds on spectral methods. The idea is to look for the analogue of graph-based spectral methods when the amount of available data is infinite. We show that these three approaches are related in various ways, and in particular we clarify the asymptotic behavior of graph-based regularization. Finally, in section 5 we give a practical method for implementing one of the proposed regularizers and show its application on a toy problem.

2 Density based regularization

The first approach we propose is to start with a gradient-based regularizer like $\|\nabla f\|$, which penalizes large variations of the function. Now, to implement the cluster assumption, one has to penalize the variations of the function more in high density regions and less in low density regions. A natural way of doing this is to replace $\|\nabla f\|$ by $\|p\,\nabla f\|$, where p is the density of the marginal distribution P. More generally, instead of the gradient, one can consider a regularization map $L : \mathbb{R}^X \to (\mathbb{R}_+)^X$, where $L(f)(x)$ is a measure of the smoothness of the function f at the point x, and then consider the following regularization term

    $\Omega(f) = \| L(f)\,\chi(p) \|$,    (2)

where $\chi$ is a strictly increasing function.

An interesting case is when the norm in (2) is chosen as the $L_2$ norm. Then $\Omega(f)$ can be the norm of a Reproducing Kernel Hilbert Space (RKHS), which means that there exist a Hilbert space H and a kernel function $k : X^2 \to \mathbb{R}$ such that

    $\sqrt{\langle f, f \rangle_H} = \Omega(f)$  and  $\langle f, k(x, \cdot) \rangle_H = f(x)$.    (3)

The reason for using an RKHS norm is the so-called representer theorem [5]: the function minimizing the corresponding regularized loss can be expressed as a linear combination of the kernel function evaluated at the labeled points.

However, it is not straightforward to find the kernel associated with an RKHS norm. In general, one has to solve equation (3). For instance, in the case $L(f) = (f^2 + \|\nabla f\|^2)^{1/2}$ and without taking the density into account ($\chi = 1$), it has been shown in [3] that the corresponding kernel is the Laplacian one, $k(x, y) = \exp(-\|x - y\|_{L_1})$, with associated inner product $\langle f, g \rangle_H = \langle f, g \rangle_{L_2} + \langle \nabla f, \nabla g \rangle_{L_2}$. Taking the density into account, this inner product becomes

    $\langle f, g \rangle_H = \langle f, \chi^2(p)\, g \rangle_{L_2} + \langle \nabla f, \chi^2(p)\, \nabla g \rangle_{L_2}$.

Plugging $g = k(x, \cdot)$ in the above and expressing that (3) should be valid for all $f \in H$, we find that k must satisfy

    $\chi^2(p)\, k(x, \cdot) - \nabla \cdot \left( \chi^2(p)\, \nabla k(x, \cdot) \right) = \delta(x - \cdot)$,

where $\delta$ is the Dirac delta function. However, solving this differential equation is not an easy task for arbitrary p.

Since finding the kernel function associated to a regularizer is, in general, a difficult problem, we propose to perform the minimization of the regularized loss on a fixed set of basis functions, i.e.
f is expressed as a linear combination of functions $\varphi_i$:

    $f(x) = \sum_{i=1}^{l} \alpha_i \varphi_i(x) + b$.    (4)

We will present in section 5 a practical implementation of this approach.

3 Density based change of geometry

We now try to adopt a geometric point of view. First we translate the cluster assumption into a geometric statement, then we explore how to enforce it by changing the geometry of our underlying space. A similar approach was recently proposed by Vincent and Bengio [12]. We will see that there exists such a change of geometry which leads to the same type of regularizer that was proposed in section 2.

Recall that the cluster assumption states that points are likely to be in the same class if they can be connected by a path through high density regions. Naturally this means that we have to weight paths according to the density they are going through. This leads to introducing a new distance measure on the input space (typically $\mathbb{R}^d$) defined as the length of the shortest weighted path connecting two points. With this new distance, we simply have to enforce that close points have the same label (we thus recover the standard assumption).

Let us make this more precise. We consider the euclidean space $\mathbb{R}^d$ as a flat Riemannian manifold with metric tensor $\delta$, denoted by $(\mathbb{R}^d, \delta)$. A Riemannian manifold (M, g) is also a metric space with the following path (or geodesic) distance:

    $d(x, y) = \inf_\gamma \left\{ L(\gamma) \mid \gamma : [a, b] \to M,\ \gamma(a) = x,\ \gamma(b) = y \right\}$,

where $\gamma$ is a piecewise smooth curve and $L(\gamma)$ is the length of the curve, given by

    $L(\gamma) = \int_a^b \sqrt{g_{ij}(\gamma(t))\, \dot{\gamma}^i \dot{\gamma}^j}\, dt$.    (5)

We now want to change the metric $\delta$ of $\mathbb{R}^d$ such that the new distance is the weighted path distance corresponding to the cluster assumption.
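Anticipating the conformal choice made below (weighting the Euclidean metric by the inverse density), the path length (5) can be approximated by simple discretization. In the sketch that follows, the two-bump density, the endpoints, and the candidate paths are all hypothetical illustrations; it checks that a detour through a high-density ridge is "shorter" than the straight Euclidean segment through a low-density valley.

```python
import numpy as np

# Hypothetical density: Gaussian bumps along the upper unit semicircle on
# top of a small uniform baseline (so p > 0 everywhere).
angles = np.linspace(0.0, np.pi, 9)
centers = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def p(x):
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return 0.01 + np.exp(-d2 / 0.1).sum(-1)

def weighted_length(path):
    # Discretization of (5) with the conformal metric g_ij = delta_ij / p:
    # each segment contributes its Euclidean length times p(midpoint)^(-1/2).
    mids = 0.5 * (path[1:] + path[:-1])
    segs = np.linalg.norm(path[1:] - path[:-1], axis=1)
    return (segs / np.sqrt(p(mids))).sum()

ts = np.linspace(0.0, 1.0, 400)[:, None]
straight = (1 - ts) * np.array([-1.0, 0.0]) + ts * np.array([1.0, 0.0])
arc = np.column_stack([np.cos(np.pi * (1 - ts[:, 0])),
                       np.sin(np.pi * (1 - ts[:, 0]))])

L_straight = weighted_length(straight)
L_arc = weighted_length(arc)
```

Although the semicircular detour is longer in Euclidean length, its density-weighted length is smaller, which is exactly the behavior the cluster assumption asks for.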
The only information we have is the local density p(x), which is a scalar at every point and as such can only lead to an isotropic transformation in the tangent space $T_x M$. Therefore we consider the following conformal transformation of the metric $\delta$:

    $\delta_{ij} \to g_{ij} = \frac{1}{\chi(p(x))}\, \delta_{ij}$,    (6)

where $\chi$ is a strictly increasing function. We denote by $(\mathbb{R}^d, g)$ the distorted euclidean space. Note that this kind of transformation also changes the volume element $\sqrt{g}\, dx^1 \cdots dx^d$, where g is the determinant of $g_{ij}$:

    $dx^1 \cdots dx^d \to \sqrt{g}\, dx^1 \cdots dx^d = \frac{1}{\chi(p)^{d/2}}\, dx^1 \cdots dx^d$.    (7)

In the following we will choose $\chi(x) = x$, which is the simplest choice that gives the desired properties.

The distance structure of the transformed space now implements the cluster assumption, since we see from (5) that all paths get weighted by the inverse density. Therefore we can use any metric-based classification method and it will automatically take into account the density of the data. For example, the nearest neighbor classifier in the new distance is equivalent to the Lipschitz regularization (1) weighted with the density proposed in the last section.

However, implementing such a method requires computing the geodesic distance in $(\mathbb{R}^d, g)$, which is non-trivial for arbitrary densities p. We suggest the following approximation, which is similar in spirit to the approach in [11]. Since we have a global chart of $\mathbb{R}^d$, we can give for each neighborhood $B_\epsilon(x)$ in the euclidean space the following upper and lower bounds for the geodesic distance:

    $\inf_{z \in B_\epsilon(x)} \sqrt{\tfrac{1}{p(z)}}\, \|x - y\| \;\le\; d(x, y) \;\le\; \sup_{z \in B_\epsilon(x)} \sqrt{\tfrac{1}{p(z)}}\, \|x - y\|, \qquad \forall\, y \in B_\epsilon(x)$.    (8)

Then we choose a real $\epsilon$ and set, for each x, the distance to all points in a $p(x)^{-1/2}\epsilon$-neighborhood of x as $d(x, y) = p\!\left(\frac{x+y}{2}\right)^{-1/2} \|x - y\|$.
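This neighborhood-graph construction, followed by a shortest-path computation over the graph, might look as follows. The two-cluster sample, the kernel width, and the fixed neighborhood radius are hypothetical choices for illustration; scipy's Dijkstra routine does the shortest-path work, with zero entries meaning "no edge".

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
# Hypothetical sample: two dense clusters joined by a sparse bridge.
pts = np.concatenate([
    rng.normal([-2.0, 0.0], 0.3, size=(40, 2)),
    rng.normal([2.0, 0.0], 0.3, size=(40, 2)),
    np.column_stack([np.linspace(-2.0, 2.0, 10), np.zeros(10)]),
])

def density(x):
    # Crude Parzen-style estimate, used here only to weight the edges.
    d2 = ((x[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * 0.3 ** 2)).mean(-1)

eucl = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
mids = 0.5 * (pts[:, None, :] + pts[None, :, :])
dens = density(mids.reshape(-1, 2)).reshape(eucl.shape)

# Edge weight p((x+y)/2)^(-1/2) ||x - y|| for pairs inside the neighborhood.
eps = 1.0
neighbors = (eucl > 0) & (eucl <= eps)
W = np.zeros_like(eucl)
W[neighbors] = eucl[neighbors] / np.sqrt(dens[neighbors])

# Approximate geodesics = shortest paths along the weighted graph.
geodesic = shortest_path(W, method="D", directed=False)
```

Low density along the bridge inflates the edge weights there, so inter-cluster distances grow relative to intra-cluster ones, as the cluster assumption requires.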
The geodesic distance can then be approximated by the shortest path along the obtained graph.

We now show the relationship to the regularization-based approach of the previous section. We denote by $\|\cdot\|_{L_2(\mathbb{R}^d, g, \Sigma)}$ the $L_2$ norm in $(\mathbb{R}^d, g)$ with respect to the measure $\Sigma$, and by $\mu$ the standard Lebesgue measure on $\mathbb{R}^d$. Let us consider the regularizer $\|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, \mu)}$, which is the standard $L_2$ norm of the gradient. Modifying this regularizer according to section 2 (by changing the underlying measure) gives $S(f) = \|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, P)}$. On the distorted space $(\mathbb{R}^d, g)$ we keep the Lebesgue measure $\mu$, which can be done by integrating on the manifold with respect to the density $\sigma = \frac{1}{\sqrt{g}} = p^{d/2}$, which then cancels with the volume element: $\sigma \sqrt{g}\, dx^1 \cdots dx^d = dx^1 \cdots dx^d$. Since on $(\mathbb{R}^d, g)$ we have $\|\nabla f\|^2 = p(x)\, \delta^{ij} \frac{\partial f}{\partial x^i} \frac{\partial f}{\partial x^j}$, we get equivalence of S(f):

    $S(f) = \|\nabla f\|^2_{L_2(\mathbb{R}^d, \delta, P)} = \int_{\mathbb{R}^d} p(x)\, \delta^{ij} \frac{\partial f}{\partial x^i} \frac{\partial f}{\partial x^j}\, dx^1 \cdots dx^d = \|\nabla f\|^2_{L_2(\mathbb{R}^d, g, \mu)}$.    (9)

This shows that modifying the measure and keeping the geometry, or modifying the geometry and keeping the Lebesgue measure, leads to the same regularizer S(f). However, there is a structural difference between the spaces $(\mathbb{R}^d, \delta, P)$ and $(\mathbb{R}^d, g, \mu)$ even if S(f) is the same. Indeed, for regularization operators corresponding to higher order derivatives the above correspondence is not valid any more.

4 Link with Spectral Techniques

Recently, there has been a lot of interest in spectral techniques for non-linear dimension reduction, clustering or semi-supervised learning. The general idea of these approaches is to construct an adjacency graph on the (unlabeled) points whose weights are given by a matrix W.
Then the first eigenvectors of a modified version of W give a more suitable representation of the points (taking into account their manifold and/or cluster structure). An instance of such an approach and related references are given in [1], where the authors propose to use the following regularizer:

    $\frac{1}{2} \sum_{i,j=1}^{m} (f_i - f_j)^2\, W_{ij} = f^\top (D - W)\, f$,    (10)

where $f_i$ is the value of the function at point $x_i$ (the index ranges over labeled and unlabeled points), D is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$, and $W_{ij}$ is chosen as a function of the distance between $x_i$ and $x_j$, for example $W_{ij} = K(\|x_i - x_j\| / t)$.

Given a sample $x_1, \ldots, x_m$ of m i.i.d. instances sampled according to P(x), it is possible to rewrite (10) after normalization as the following random variable:

    $U_f = \frac{1}{2m(m-1)} \sum_{i,j} (f(x_i) - f(x_j))^2\, K(\|x_i - x_j\| / t)$.

Under the assumption that f and K are bounded, the result of [4] (see Inequality (5.7) in that paper, which applies to U-statistics) gives

    $P\left[ U_f \ge E[U_f] + t \right] \le e^{-m t^2 / C^2}$,

where C is a constant which does not depend on m and t. This shows that for each fixed function, the normalized regularizer $U_f$ converges towards its expectation when the sample size increases.
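As a quick illustration, the identity in (10) and its U-statistic normalization can be checked numerically; the one-dimensional sample and the test function below are hypothetical, and K is the Gaussian choice discussed shortly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, t = 200, 0.5
x = rng.uniform(-1.0, 1.0, size=m)   # hypothetical 1-D sample
f = np.sin(np.pi * x)                # any bounded function

# W_ij = K(|x_i - x_j| / t) with the Gaussian choice K(u) = exp(-u^2 / 2).
W = np.exp(-((x[:, None] - x[None, :]) / t) ** 2 / 2)
np.fill_diagonal(W, 0.0)

# Identity (10): the pairwise sum equals the quadratic form f^T (D - W) f.
pairwise = 0.5 * ((f[:, None] - f[None, :]) ** 2 * W).sum()
D = np.diag(W.sum(axis=1))
quadratic = f @ (D - W) @ f

# U-statistic normalization of the same quantity.
U_f = pairwise / (m * (m - 1))
```

Both sides agree to floating-point precision, and $U_f$ is nonnegative, consistent with $D - W$ being positive semi-definite.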
Moreover, one can check that

    $E[U_f] = \frac{1}{2} \iint (f(x) - f(y))^2\, K(\|x - y\| / t)\, dP(x)\, dP(y)$.    (11)

This is the term that should be used as a regularizer if one knows the whole distribution, since it is the limit of (10).¹ The following proposition relates the regularizer (11) to the one defined in (2).

Proposition 4.1 If p is a density which is Lipschitz continuous and K is a continuous function on $\mathbb{R}_+$ such that $x^{2+d} K(x) \in L^2$, then for any function $f \in C^2(\mathbb{R}^d)$ with bounded Hessian,

    $\lim_{t \to 0}\ \frac{d}{C\, t^{2+d}} \iint (f(x) - f(y))^2\, K(\|x - y\| / t)\, p(x)\, p(y)\, dx\, dy = \int \|\nabla f(x)\|^2\, p^2(x)\, dx$,    (12)

where $C = \int_{\mathbb{R}^d} \|x\|^2\, K(\|x\|)\, dx$.

Proof: Let us fix x. Writing a Taylor-Lagrange expansion of f and p around x in terms of $h = (y - x)/t$ gives

    $\int (f(x) - f(y))^2\, K\!\left( \frac{\|x - y\|}{t} \right) p(y)\, dy$
    $= \int \left( t \langle \nabla f(x), h \rangle + O(t^2 \|h\|^2) \right)^2 K(\|h\|)\, \left( p(x) + O(t \|h\|) \right) t^d\, dh$
    $= t^{d+2}\, p(x) \int \langle \nabla f(x), h \rangle^2\, K(\|h\|)\, dh + O(t^{d+3})$.

To conclude the proof, we rewrite the last integral as $\nabla f(x)^\top \left( \int h h^\top K(\|h\|)\, dh \right) \nabla f(x) = \|\nabla f(x)\|^2\, \frac{C}{d}$. The last equality comes from the fact that, by symmetry considerations, $\int h h^\top K(\|h\|)\, dh$ is equal to a constant (let us call it $C_2$) times the identity matrix, and this constant can be computed from $C_2\, d = \mathrm{trace}\left( \int h h^\top K(\|h\|)\, dh \right) = \int h^\top h\, K(\|h\|)\, dh = C$. □

¹ We have shown that the convergence of $U_f$ towards $E[U_f]$ happens for each fixed f, but this convergence can be uniform over a set of functions, provided this set is small enough.

Note that different K lead to different affinity matrices: if we choose $K(x) = \exp(-x^2/2)$, we get a gaussian RBF affinity matrix as used in [7], whereas $K(x) = \mathbf{1}_{x \le 1}$ leads to an unweighted neighboring graph (at size t) [1].

So we have proved that if one takes the limit of the regularizer (10) when the sample size goes to
infinity and the scale parameter t goes to 0 (with appropriate scaling), one obtains the regularizer

    $\int \|\nabla f(x)\|^2\, p^2(x)\, dx = \left\langle f,\ \nabla^* D_p^2\, \nabla f \right\rangle$,

where $\nabla^*$ is the adjoint of $\nabla$, $D_p$ is the diagonal operator that maps f to $pf$, and $\langle \cdot, \cdot \rangle$ is the inner product in $L_2$.

In [2], the authors investigated the limiting behavior of the regularizer D − W obtained from the graph and claimed that it is the empirical counterpart of the Laplace operator defined on the manifold. However, this is true only if the distribution is uniform on the manifold. We have shown that, in the general case, the continuous equivalent of the graph Laplacian is $\nabla^* D_p^2\, \nabla$.

5 Practical Implementation and Experiments

As mentioned in section 2, it is difficult in general to find the kernel associated with a given regularizer, and instead we decided to minimize the regularized loss on a fixed basis of functions $(\varphi_i)_{1 \le i \le l}$, as expressed by equation (4). The regularizer we considered is of the form (2) and is

    $\Omega(f) = \left\|\, \|\nabla f\|\, \sqrt{p}\, \right\|^2_{L_2} = \int \nabla f(x) \cdot \nabla f(x)\, p(x)\, dx$.

Thus, the coefficients $\alpha$ and b in expansion (4) are found by minimizing the following convex regularized functional:

    $\underbrace{\frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)}_{R_{emp}(f)} \;+\; \lambda\, \underbrace{\sum_{i,j=1}^{l} \alpha_i \alpha_j \int \nabla \varphi_i(x) \cdot \nabla \varphi_j(x)\, p(x)\, dx}_{\| L(f) \sqrt{p} \|^2_{L_2}}$.    (15)

Introducing the $l \times l$ matrix $H_{ij} = \int \nabla \varphi_i(x) \cdot \nabla \varphi_j(x)\, p(x)\, dx$ and the $n \times l$ matrix K with $K_{ij} = \varphi_j(x_i)$, the minimization of the functional (15) is equivalent to the following one for the standard L1-SVM loss:

    $\min_{\alpha, b}\ \alpha^\top H \alpha + C \sum_{i=1}^{n} \xi_i$   under constraints   $\forall i,\ y_i \left( \sum_{j=1}^{l} K_{ij} \alpha_j + b \right) \ge 1 - \xi_i$.

The dual formulation of this optimization problem turns out to be the standard SVM one with a
modified kernel function (see also [9]):

    $\max_\beta\ \sum_{i=1}^{n} \beta_i - \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j y_i y_j L_{ij}$,

under constraints $0 \le \beta_i \le C$ and $\sum_i \beta_i y_i = 0$, with $L = K H^{-1} K^\top$. Once the vector $\beta$ has been found, the coefficients $\alpha$ of the expansion are given by

    $\alpha = H^{-1} K^\top \mathrm{diag}(Y)\, \beta$.

[Figure 1: Two moons toy problem: there are 2 labeled points (the cross and the triangle) and 200 unlabeled points. The gray level corresponds to the output of the function. The function was expanded on all unlabeled points (m = 200 in (4)) and the widths of the gaussians have been chosen as $\sigma = 0.5$ and $\sigma_p = 0.05$.]

In order to calculate the $H_{ij}$, one has to compute an integral. From now on, we consider a special case where this integral can be computed analytically:

- The basis functions are gaussian RBF, $\varphi_i(x) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)$, where the points $x_1, \ldots, x_l$ can be chosen arbitrarily. We decided to take the unlabeled points (or a subset of them) for this expansion.
- The marginal density p is estimated using a Parzen window with a Gaussian kernel, $p(x) = \frac{1}{m} \sum_{i=1}^{m} \exp\left( -\frac{\|x - x_i\|^2}{2\sigma_p^2} \right)$.

Defining $h = 1/\sigma^2$ and $h_p = 1/\sigma_p^2$, this integral turns out to be, up to an irrelevant constant factor,

    $H_{ij} = \sum_{k=1}^{m} \exp\left( -\frac{h^2}{2h + h_p} \frac{\|x_i - x_j\|^2}{2} - \frac{h\, h_p}{2h + h_p} \frac{\|x_i - x_k\|^2 + \|x_j - x_k\|^2}{2} \right) \left( -h_p^2\, (x_i - x_k) \cdot (x_k - x_j) - h (h + h_p)\, \|x_i - x_j\|^2 + d\,(2h + h_p) \right)$,

where d is the dimension of the input space.

After careful dataset selection [6], we considered the two moons toy problem (see figure 1). On this 2D example, the regularizer we suggested implements the cluster assumption perfectly: the function is smooth on high density regions and the decision boundary lies in a low density region.

We also tried some real-world experiments but were not successful. The reason might be that in dimension more than 2, the gradient does not yield a suitable regularizer: there exist discontinuous functions whose regularizer is 0. To avoid this, following the Sobolev embedding lemma, we consider derivatives of order at least d/2. More specifically, we are currently investigating the regularizer associated with a Gaussian kernel of width $\sigma_r$ [8, page 100]:

    $\sum_{p=1}^{\infty} \frac{\sigma_r^{2p}}{p!\, 2^p} \int \|\nabla^p f(x)\|^2\, p(x)\, dx$,   with $\nabla^{2p} \equiv \Delta^p$.

6 Conclusion

We have tried to make a first step towards a theoretical framework for semi-supervised learning. Ideally, this framework should be based on general principles which can then be used to derive new heuristics or justify existing ones. One such general principle is the cluster assumption.
Starting from the assumption that the distribution P(x) of the data is known, we have proposed several ideas to implement this principle and shown their relationships. In addition, we have shown the relationship to the limiting behavior of an algorithm based on the graph Laplacian.

We believe that this topic deserves further investigation. From a theoretical point of view, other types of regularizers, involving, for example, higher order derivatives, should be studied. From a practical point of view, we should derive efficient algorithms from the proposed ideas, especially by obtaining finite sample approximations of the limit case where P(x) is known.

References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

[2] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning journal, 2003. To appear.

[3] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. Technical Report Artificial Intelligence Memo 1430, Massachusetts Institute of Technology, 1993.

[4] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

[5] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82-95, 1971.

[6] Doudou LaLoudouana and Mambobo Bonouliqui Tarare. Data set selection. In Advances in Neural Information Processing Systems, volume 15, 2002.

[7] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, 2001.

[8] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[9] A. Smola and B. Schölkopf.
On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22:211-231, 1998.

[10] M. Szummer and T. Jaakkola. Information regularization with partially labeled data. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002.

[11] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

[12] P. Vincent and Y. Bengio. Density-sensitive metrics and kernels. Presented at the Snowbird Learning Workshop, 2003.

[13] U. von Luxburg and O. Bousquet. Distance-based classification with Lipschitz functions. In Proceedings of the 16th Annual Conference on Computational Learning Theory, 2003.
", "award": [], "sourceid": 2504, "authors": [{"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Matthias", "family_name": "Hein", "institution": null}]}