{"title": "Information Diffusion Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 391, "page_last": 398, "abstract": null, "full_text": "Information Diffusion Kernels\n\nJohn Lafferty\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213 USA\nlafferty@cs.cmu.edu\n\nGuy Lebanon\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213 USA\nlebanon@cs.cmu.edu\n\nAbstract\n\nA new family of kernels for statistical learning is introduced that ex-\nploits the geometric structure of statistical models. Based on the heat\nequation on the Riemannian manifold de\ufb01ned by the Fisher informa-\ntion metric, information diffusion kernels generalize the Gaussian kernel\nof Euclidean space, and provide a natural way of combining generative\nstatistical modeling with non-parametric discriminative learning. As a\nspecial case, the kernels give a new approach to applying kernel-based\nlearning algorithms to discrete data. Bounds on covering numbers for\nthe new kernels are proved using spectral theory in differential geometry,\nand experimental results are presented for text classi\ufb01cation.\n\n1 Introduction\n\nThe use of kernels is of increasing importance in machine learning. When \u201ckernelized,\u201d\nsimple learning algorithms can become sophisticated tools for tackling nonlinear data anal-\nysis problems. Research in this area continues to progress rapidly, with most of the activity\nfocused on the underlying learning algorithms rather than on the kernels themselves.\n\nKernel methods have largely been a tool for data represented as points in Euclidean space,\nwith the collection of kernels employed limited to a few simple families such as polynomial\nor Gaussian RBF kernels. However, recent work by Kondor and Lafferty [7], motivated\nby the need for kernel methods that can be applied to discrete data such as graphs, has\nproposed the use of diffusion kernels based on the tools of spectral graph theory. One\nlimitation of this approach is the dif\ufb01culty of analyzing the associated learning algorithms\nin the discrete setting. For example, there is no obvious way to bound covering numbers\nand generalization error for this class of diffusion kernels, since the natural function spaces\nare over discrete sets.\n\nIn this paper, we propose a related construction of kernels based on the heat equation. The\nkey idea in our approach is to begin with a statistical model of the data being analyzed, and\nto consider the heat equation on the Riemannian manifold de\ufb01ned by the Fisher information\nmetric of the model. The result is a family of kernels that naturally generalizes the familiar\nGaussian kernel for Euclidean space, and that includes new kernels for discrete data by\nbeginning with statistical families such as the multinomial. Since the kernels are intimately\nbased on the geometry of the Fisher information metric and the heat or diffusion equation\non the associated Riemannian manifold, we refer to them as information diffusion kernels.\n\n\fUnlike the diffusion kernels of [7], the kernels we investigate here are over continuous pa-\nrameter spaces even in the case where the underlying data is discrete. As a consequence,\nsome of the machinery that has been developed for analyzing the generalization perfor-\nmance of kernel machines can be applied in our setting. In particular, the spectral approach\nof Guo et al. 
[3] is applicable to information diffusion kernels, and in applying this approach it is possible to draw on the considerable body of research in differential geometry that studies the eigenvalues of the geometric Laplacian.

In the following section we review the relevant concepts that are required from information geometry and classical differential geometry, define the family of information diffusion kernels, and present two concrete examples, where the underlying statistical models are the multinomial and spherical normal families. Section 3 derives bounds on the covering numbers for support vector machines using the new kernels, adopting the approach of [3]. Section 4 describes experiments on text classification, and Section 5 discusses the results of the paper.

2 Information Geometry and Diffusion Kernels

Let $\mathcal{F} = \{ p(\cdot \mid \theta) \}_{\theta \in \Theta \subset \mathbb{R}^n}$ be an $n$-dimensional statistical model on a set $\mathcal{X}$. For each $x \in \mathcal{X}$, assume the mapping $\theta \mapsto p(x \mid \theta)$ is $C^\infty$ at each point in the interior of $\Theta$. Let $\partial_i = \partial / \partial \theta_i$ and $\ell_\theta(x) = \log p(x \mid \theta)$. The Fisher information matrix $[g_{ij}(\theta)]$ of $\mathcal{F}$ at $\theta \in \Theta$ is given by

    $g_{ij}(\theta) = E_\theta \left[ \partial_i \ell_\theta \, \partial_j \ell_\theta \right] = \int_{\mathcal{X}} p(x \mid \theta) \, \partial_i \log p(x \mid \theta) \, \partial_j \log p(x \mid \theta) \, dx$    (1)

or equivalently as

    $g_{ij}(\theta) = 4 \int_{\mathcal{X}} \partial_i \sqrt{p(x \mid \theta)} \; \partial_j \sqrt{p(x \mid \theta)} \, dx .$    (2)

In coordinates $\theta_i$, $g_{ij}(\theta)$ defines a Riemannian metric on $\Theta$, giving $\mathcal{F}$ the structure of an $n$-dimensional Riemannian manifold. One of the motivating properties of the Fisher information metric is that, unlike the Euclidean distance, it is invariant under reparameterization. For detailed treatments of information geometry we refer to [1, 6].

For many statistical models there is a natural way to associate to each data point $x$ a parameter vector $\hat{\theta}(x)$ in the statistical model. For example, in the case of text, under the multinomial model a document is naturally associated with the relative frequencies of the word counts. This amounts to the mapping which sends a document $x$ to its maximum likelihood model $\hat{\theta}(x)$. Given such a mapping, we propose to apply a kernel on parameter space, $K(x, x') = K_\Theta(\hat{\theta}(x), \hat{\theta}(x'))$.

More generally, we may associate a data point $x$ with a posterior distribution $p(\theta \mid x)$ under a suitable prior. In the case of text, this is one way of "smoothing" the maximum likelihood model, using, for example, a Dirichlet prior. Given a kernel on parameter space, we then average over the posteriors to obtain a kernel on data:

    $K(x, x') = \int_\Theta \int_\Theta K_\Theta(\theta, \theta') \, p(\theta \mid x) \, p(\theta' \mid x') \, d\theta \, d\theta' .$    (3)
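As a concrete worked instance of equation (1), which may help fix ideas (this example is ours, not spelled out in the original text): for a single draw from the multinomial with parameters $(\theta_1, \ldots, \theta_{n+1})$, take local coordinates $(\theta_1, \ldots, \theta_n)$ with $\theta_{n+1} = 1 - \sum_{i=1}^n \theta_i$. A direct computation then gives

    $\partial_i \ell_\theta(x) = \frac{x_i}{\theta_i} - \frac{x_{n+1}}{\theta_{n+1}}, \qquad g_{ij}(\theta) = E_\theta \left[ \partial_i \ell_\theta \, \partial_j \ell_\theta \right] = \frac{\delta_{ij}}{\theta_i} + \frac{1}{\theta_{n+1}},$

and the change of variables $z_i = 2 \sqrt{\theta_i}$ turns this metric into the Euclidean metric restricted to the sphere $\sum_i z_i^2 = 4$, which is exactly the picture developed in Section 2.1 below.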
It remains to define the kernel on parameter space. There is a fundamental choice: the kernel associated with heat diffusion on the parameter manifold under the Fisher information metric.

For a manifold $M$ with metric $g_{ij}$, the Laplacian $\Delta : C^\infty(M) \to C^\infty(M)$ is given in local coordinates by

    $\Delta f = \frac{1}{\sqrt{\det g}} \sum_{ij} \partial_i \left( \sqrt{\det g} \; g^{ij} \, \partial_j f \right)$    (4)

where $[g^{ij}] = [g_{ij}]^{-1}$, generalizing the classical operator $\mathrm{div} \, \nabla$. When $M$ is compact the Laplacian has discrete eigenvalues $0 = \mu_0 < \mu_1 \leq \mu_2 \leq \cdots$ with corresponding eigenfunctions $\phi_i$ satisfying $\Delta \phi_i = -\mu_i \phi_i$. When the manifold has a boundary, appropriate boundary conditions must be imposed in order that $\Delta$ is self-adjoint. Dirichlet boundary conditions set $\phi_i |_{\partial M} = 0$ and Neumann boundary conditions require $\frac{\partial \phi_i}{\partial \nu} |_{\partial M} = 0$, where $\nu$ is the outer normal direction. The following theorem summarizes the basic properties for the kernel of the heat equation $\frac{\partial f}{\partial t} = \Delta f$ on $M$.

Theorem 1. Let $M$ be a geodesically complete Riemannian manifold. Then the heat kernel $K_t(x, y)$ exists and satisfies (1) $K_t(x, y) = K_t(y, x)$, (2) $\lim_{t \to 0} K_t(x, y) = \delta_x(y)$, (3) $\left( \Delta - \frac{\partial}{\partial t} \right) K_t(x, y) = 0$, (4) $K_t(x, y) = \int_M K_{t-s}(x, z) \, K_s(z, y) \, dz$ for any $0 < s < t$, and (5) $K_t(x, y) > 0$.

We refer to [9] for a proof. Properties 2 and 3 imply that $K_t(x, y)$ solves the heat equation in $x$, starting from a point source at $y$. Integrating property 3 against a function $f$ shows that $(K_t f)(x) = \int_M K_t(x, y) \, f(y) \, dy$ solves the heat equation with initial condition $f$; therefore $K_t = e^{t \Delta}$. Property 4 with $s = t/2$ gives $K_t(x, y) = \int_M K_{t/2}(x, z) \, K_{t/2}(y, z) \, dz$, so that $K_t$ is a positive operator; thus $K_t$ is positive definite. Together, these properties show that $K_t$ defines a Mercer kernel.

Note that when using such a kernel for classification, the discriminant function $f(x) = \sum_i \alpha_i y_i K_t(x, x_i)$ can be interpreted as the solution to the heat equation with initial temperature $\alpha_i y_i$ on each labeled data point $x_i$, and initial temperature zero on unlabeled points.
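Before turning to the examples, the following minimal numerical sketch (ours, not from the paper) illustrates Theorem 1 and the Mercer property on the simplest compact manifold, the circle $S^1$, where the spectral data is elementary: $\mu_k = k^2$ with the Fourier eigenfunctions, so that $K_t(x, y) = \frac{1}{2\pi} \left( 1 + 2 \sum_{k \geq 1} e^{-k^2 t} \cos k(x - y) \right)$. The function name and truncation level are our choices.

```python
import numpy as np

def heat_kernel_circle(x, y, t, kmax=200):
    """Truncated spectral expansion of the heat kernel on the circle S^1:
    K_t(x, y) = sum_i exp(-mu_i t) phi_i(x) phi_i(y), with mu_k = k^2."""
    k = np.arange(1, kmax + 1)
    return (1.0 + 2.0 * np.sum(np.exp(-t * k**2) * np.cos(k * (x - y)))) / (2 * np.pi)

# Sanity checks mirroring Theorem 1: symmetry and positive definiteness
# (a positive semi-definite Gram matrix on random points).
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 2 * np.pi, size=8)
G = np.array([[heat_kernel_circle(xi, yj, t=0.5) for yj in pts] for xi in pts])
assert np.allclose(G, G.T)                     # property (1): K_t(x, y) = K_t(y, x)
assert np.linalg.eigvalsh(G).min() > -1e-10    # Mercer: Gram matrix is PSD
```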
The following two basic examples illustrate the geometry of the Fisher information metric and its associated diffusion kernel: the multinomial corresponds to a Riemannian manifold of constant positive curvature, and the spherical normal family to a space of constant negative curvature.

2.1 The Multinomial

The multinomial is an important example of how information diffusion kernels can be applied naturally to discrete data. For the multinomial family $\{ p(\cdot \mid \theta) \}$, $\theta$ is an element of the $n$-simplex, $\sum_{i=1}^{n+1} \theta_i = 1$. The representation of the Fisher information metric given in equation (2) suggests the geometry underlying the multinomial. In particular, the transformation $\theta \mapsto 2 \sqrt{\theta}$, applied componentwise, maps the $n$-simplex to the $n$-sphere of radius 2, so that the Fisher information corresponds to the inner product of tangent vectors to the sphere, and information geometry for the multinomial is the geometry of the positive orthant of the sphere. The geodesic distance between two points $\theta$ and $\theta'$ is given by

    $d(\theta, \theta') = 2 \arccos \left( \sum_i \sqrt{\theta_i \, \theta_i'} \right) .$    (5)

This metric places greater emphasis on points near the boundary, which is expected to be important for text problems, which have sparse statistics. In general for the heat kernel on a Riemannian manifold, there is an asymptotic expansion in terms of the parametrices; see for example [9]. This expands the kernel as

    $K_t(x, y) \approx (4 \pi t)^{-n/2} \exp \left( - \frac{d^2(x, y)}{4 t} \right) \sum_{i=0}^{m} \psi_i(x, y) \, t^i .$    (6)

Using the first order approximation and the explicit formula for the geodesic distance gives a simple formula for the approximate information diffusion kernel for the multinomial as

    $K_t(\theta, \theta') \approx (4 \pi t)^{-n/2} \exp \left( - \frac{1}{t} \arccos^2 \left( \sum_i \sqrt{\theta_i \, \theta_i'} \right) \right) .$    (7)

In Figure 1 this kernel is compared with the standard Euclidean space Gaussian kernel for the case of the trinomial model, $n = 2$.

Figure 1: Example decision boundaries using support vector machines with information diffusion kernels for trinomial geometry on the 2-simplex (top right) and spherical normal geometry (bottom right), compared with the standard Gaussian kernel (left).
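As an illustration of how equation (7) can be used directly, here is a small sketch (ours; the function name and the numerical clipping are our choices):

```python
import numpy as np

def multinomial_diffusion_kernel(theta, theta_prime, t):
    """First-order parametrix approximation to the heat kernel on the
    multinomial simplex, equation (7).  The geodesic distance (5) on the
    radius-2 sphere is plugged into the leading Gaussian-like term."""
    n = len(theta) - 1                              # dimension of the simplex
    # d(theta, theta') = 2 arccos( sum_i sqrt(theta_i theta'_i) ); clip for safety.
    affinity = np.clip(np.sum(np.sqrt(theta * theta_prime)), -1.0, 1.0)
    d = 2.0 * np.arccos(affinity)
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-d**2 / (4 * t))

# Two points on the 2-simplex (the trinomial model of Figure 1).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
print(multinomial_diffusion_kernel(p, q, t=0.1))
```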
2.2 Spherical Normal

Now consider the statistical family given by $p(\cdot \mid \theta) = \mathcal{N}(\mu, \sigma^2 I)$ with $\theta = (\mu, \sigma)$, where $\mu$ is the mean and $\sigma$ is the scale of the variance. A calculation shows that the Fisher information metric is, up to rescaling, $g_{ij}(\theta) = \sigma^{-2} \delta_{ij}$. Thus, the Fisher information metric gives $\Theta$ the structure of the upper half plane in hyperbolic space.

The heat kernel on hyperbolic space $\mathbb{H}^n$ has a closed form [2]. For $n = 2$ it is given by

    $K_t(x, x') = \frac{\sqrt{2}}{(4 \pi t)^{3/2}} \, e^{-\frac{t}{4}} \int_r^\infty \frac{s \, e^{-\frac{s^2}{4 t}}}{\sqrt{\cosh s - \cosh r}} \, ds$    (8)

and for $n = 3$ the kernel is given by

    $K_t(x, x') = \frac{1}{(4 \pi t)^{3/2}} \, e^{-t} \, \frac{r}{\sinh r} \, e^{-\frac{r^2}{4 t}}$    (9)

where $r = d(x, x')$ is the geodesic distance between the two points in $\mathbb{H}^n$. For $n = 1$ the kernel is identical to the Gaussian kernel on $\mathbb{R}$. If only the mean $\theta = \mu$ is unspecified, then the associated kernel is the standard Gaussian RBF kernel. In Figure 1 the kernel for hyperbolic space is compared with the Euclidean space Gaussian kernel for the case of a 1-dimensional normal model with unknown mean and variance, corresponding to $n = 2$. Note that the curved decision boundary for the diffusion kernel makes intuitive sense, since as the variance decreases the mean is known with increasing certainty.
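Equation (9) is simple enough to transcribe directly; the following sketch (ours) evaluates the $\mathbb{H}^3$ kernel as a function of geodesic distance, guarding the removable singularity at $r = 0$:

```python
import numpy as np

def heat_kernel_h3(r, t):
    """Closed-form heat kernel on 3-dimensional hyperbolic space, equation (9):
    K_t = (4 pi t)^(-3/2) exp(-t) (r / sinh r) exp(-r^2 / (4 t))."""
    r = np.asarray(r, dtype=float)
    # r / sinh(r) -> 1 as r -> 0 (removable singularity).
    ratio = np.where(r > 0, r / np.sinh(np.where(r > 0, r, 1.0)), 1.0)
    return (4 * np.pi * t) ** (-1.5) * np.exp(-t) * ratio * np.exp(-r**2 / (4 * t))

print(heat_kernel_h3([0.0, 0.5, 2.0], t=1.0))
```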
3 Spectral Bounds on Covering Numbers

In this section we prove bounds on the entropy and covering numbers for support vector machines that use information diffusion kernels; these bounds in turn yield bounds on the expected risk of the learning algorithms. We adopt the approach of Guo et al. [3], and make use of bounds on the spectrum of the Laplacian on a Riemannian manifold, rather than VC dimension techniques. Our calculations give an indication of how the underlying geometry influences the entropy numbers, which are inverse to the covering numbers.

We begin by recalling the main result of [3], modifying their notation slightly to conform with ours. Let $M \subset \mathbb{R}^d$ be a compact subset of $d$-dimensional Euclidean space, and suppose that $K : M \times M \to \mathbb{R}$ is a Mercer kernel. Denote by $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ the eigenvalues of $K$, i.e., of the mapping $f \mapsto \int_M K(\cdot, y) \, f(y) \, dy$, and let $\phi_j$ denote the corresponding eigenfunctions. We assume that $C_K \overset{\text{def}}{=} \sup_j \| \phi_j \|_\infty < \infty$.

Given $m$ points $x_1, \ldots, x_m \in M$, the SVM hypothesis class for $x = (x_1, \ldots, x_m)$ with weight vector bounded by $R$ is defined as the collection of functions

    $\mathcal{F}_R(x) = \left\{ f : f(x_i) = \langle w, \Phi(x_i) \rangle, \; \| w \| \leq R \right\}$    (10)

where $\Phi$ is the mapping from $M$ to feature space defined by the Mercer kernel, and $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ denote the corresponding Hilbert space inner product and norm. It is of interest to obtain uniform bounds on the covering numbers $\mathcal{N}(\epsilon, \mathcal{F}_R(x))$, defined as the size of the smallest $\epsilon$-cover of $\mathcal{F}_R(x)$ in the metric induced by the norm $\| f \|_{\ell_\infty^m} = \max_i | f(x_i) |$. The following is the main result of Guo et al. [3].

Theorem 2. Given an integer $n \geq 1$, let $j_n^*$ denote the smallest integer $j$ for which

    $\lambda_{j+1} < \left( \frac{\lambda_1 \cdots \lambda_j}{n^2} \right)^{\frac{1}{j}}$

and define

    $\epsilon_n^* = 6 \, C_K R \sqrt{ j_n^* \left( \frac{\lambda_1 \cdots \lambda_{j_n^*}}{n^2} \right)^{\frac{1}{j_n^*}} + \sum_{i = j_n^* + 1}^{\infty} \lambda_i } \; .$

Then $\sup_{x \in M^m} \mathcal{N}(\epsilon_n^*, \mathcal{F}_R(x)) \leq n$.

To apply this result, we will obtain bounds on the indices $j_n^*$ using spectral theory in Riemannian geometry. The following bounds on the eigenvalues of the Laplacian are due to Li and Yau [8].

Theorem 3. Let $M$ be a compact Riemannian manifold of dimension $d$ with non-negative Ricci curvature, and assume that the boundary of $M$ is convex. Let $0 < \mu_1 \leq \mu_2 \leq \cdots$ denote the eigenvalues of the Laplacian with Dirichlet boundary conditions. Then

    $c_1(d) \left( \frac{j}{V} \right)^{\frac{2}{d}} \leq \mu_j \leq c_2(d) \left( \frac{j + 1}{V} \right)^{\frac{2}{d}}$    (11)

where $V = \mathrm{vol}(M)$ is the volume of $M$ and $c_1(d)$ and $c_2(d)$ are constants depending only on the dimension.

Note that the manifold of the multinomial model satisfies the conditions of this theorem. Using these results we can establish the following bounds on covering numbers for information diffusion kernels. We assume Dirichlet boundary conditions; a similar result can be proven for Neumann boundary conditions. We include the dependence on the volume $V$ and the diffusion coefficient $t$ in order to indicate how the bounds depend on the geometry.

Theorem 4. Let $M$ be a compact Riemannian manifold, with volume $V$, satisfying the conditions of Theorem 3. Then the covering numbers for the Dirichlet heat kernel $K_t$ on $M$ satisfy

    $\log \mathcal{N}(\epsilon, \mathcal{F}_R(x)) = O \left( \frac{V}{t^{d/2}} \, \log^{\frac{d+2}{2}} \left( \frac{1}{\epsilon} \right) \right) .$    (12)
Proof. By the lower bound in Theorem 3, the Dirichlet eigenvalues of the heat kernel $K_t(x, y)$, which are given by $\lambda_j = e^{-t \mu_j}$, satisfy $\log \lambda_j \leq -t \, c_1 (j / V)^{2/d}$. Thus

    $- \frac{1}{j} \log \frac{\lambda_1 \cdots \lambda_j}{n^2} \;\geq\; \frac{t \, c_1}{j} \sum_{i=1}^{j} \left( \frac{i}{V} \right)^{\frac{2}{d}} + \frac{2 \log n}{j} \;\geq\; t \, c_1 \frac{d}{d+2} \left( \frac{j}{V} \right)^{\frac{2}{d}} + \frac{2 \log n}{j}$

where the second inequality comes from $\sum_{i=1}^{j} i^{2/d} \geq \frac{d}{d+2} \, j^{\frac{d+2}{d}}$. Now using the upper bound of Theorem 3, $\log \lambda_{j+1}^{-1} = t \mu_{j+1}$, and comparing the two sides shows that, up to constants depending only on $d$, the defining inequality of $j_n^*$ in Theorem 2 holds as soon as $t \, (j / V)^{2/d} \gtrsim \frac{2 \log n}{j}$; thus

    $j_n^* \leq C \left( \frac{V^{2/d} \log n}{t} \right)^{\frac{d}{d+2}}$

for a constant $C$ depending only on the dimension. Plugging this bound on $j_n^*$ into the expression for $\epsilon_n^*$ in Theorem 2 and inverting the resulting relation in $\log n$, we have after some algebra that

    $\log n = O \left( \frac{V}{t^{d/2}} \, \log^{\frac{d+2}{2}} \left( \frac{1}{\epsilon_n^*} \right) \right)$

which gives equation (12).

We note that Theorem 4 of [3] can be used to show that this bound does not, in fact, depend on $R$ and $C_K$. Thus, for fixed $t$ the covering numbers scale as $\log \mathcal{N}(\epsilon, \mathcal{F}) = O \left( \log^{\frac{d+2}{2}} \frac{1}{\epsilon} \right)$, and for fixed $\epsilon$ they scale as $O(t^{-d/2})$ in the diffusion time $t$.
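To see the machinery of Theorems 2 and 3 in action, the following rough numerical sketch (ours; the decay profile, constants, and function name are illustrative assumptions, not values from the paper) computes $j_n^*$ and $\epsilon_n^*$ for heat-kernel eigenvalues $\lambda_j = e^{-t \mu_j}$ with $\mu_j \propto (j / V)^{2/d}$:

```python
import numpy as np

def epsilon_star(lam, n, C_K=1.0, R=1.0):
    """Theorem 2 (Guo et al. [3]): find the smallest j with
    lambda_{j+1} < (lambda_1 ... lambda_j / n^2)^(1/j), then return
    eps* = 6 C_K R sqrt( j (lambda_1 ... lambda_j / n^2)^(1/j) + tail )."""
    log_lam = np.log(lam)
    for j in range(1, len(lam) - 1):
        geo = np.exp((log_lam[:j].sum() - 2.0 * np.log(n)) / j)
        if lam[j] < geo:                       # lam[j] is lambda_{j+1} (0-indexed)
            return 6.0 * C_K * R * np.sqrt(j * geo + lam[j:].sum())
    raise ValueError("need more eigenvalues")

# Heat-kernel spectrum lambda_j = exp(-t mu_j) with mu_j = (j / V)^(2/d),
# illustrative choices d = 2, V = 10, t = 0.5.
j = np.arange(1, 5001)
lam = np.exp(-0.5 * (j / 10.0) ** (2.0 / 2.0))
print(epsilon_star(lam, n=1000))
```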
4 Experiments

We compared the information diffusion kernel to linear and Gaussian kernels in the context of text classification using the WebKB dataset. The WebKB collection contains some 4000 university web pages that belong to five categories: course, faculty, student, project and staff. A "bag of words" representation was used for all three kernels, using only the word frequencies. For simplicity, all hypertext information was ignored. The information diffusion kernel is based on the multinomial model, which is the correct model under the (incorrect) assumption that the word occurrences are independent. The maximum likelihood mapping $x \mapsto \hat{\theta}(x)$ was used to map a document to a multinomial model, simply normalizing the counts to sum to one.

Figure 2 shows test set error rates obtained using support vector machines for linear, Gaussian, and information diffusion kernels for two binary classification tasks: faculty vs. course and faculty vs. student. The curves shown are the mean error rates over 20-fold cross validation and the error bars represent twice the standard deviation. For the Gaussian and information diffusion kernels we tested a range of values of the kernels' free parameter ($\sigma$ or $t$); the plots in Figure 2 use the best parameter value in the range tested.

Figure 2: Experimental results on the WebKB corpus, using SVMs for linear (dot-dashed) and Gaussian (dotted) kernels, compared with the information diffusion kernel for the multinomial (solid). Results for two classification tasks are shown, faculty vs. course (left) and faculty vs. student (right). The curves shown are the error rates averaged over 20-fold cross validation.
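In outline, the experimental pipeline looks like the following sketch. This is our reconstruction, not the authors' code; in particular the scikit-learn SVM and the function names are assumptions, and the constant prefactor of equation (7) is dropped since it only rescales the kernel.

```python
import numpy as np
from sklearn.svm import SVC

def docs_to_simplex(counts):
    """Maximum likelihood mapping: normalize bag-of-words count vectors
    (rows) to relative frequencies, i.e. points on the multinomial simplex."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def diffusion_gram(A, B, t):
    """Gram matrix of the approximate multinomial diffusion kernel (7)
    between rows of A and rows of B (both on the simplex)."""
    affinity = np.clip(np.sqrt(A) @ np.sqrt(B).T, -1.0, 1.0)
    d = 2.0 * np.arccos(affinity)               # geodesic distance, equation (5)
    return np.exp(-d**2 / (4.0 * t))

# Hypothetical usage with word-count matrices X_train, X_test and labels y_train:
#   T_tr, T_te = docs_to_simplex(X_train), docs_to_simplex(X_test)
#   svm = SVC(kernel="precomputed").fit(diffusion_gram(T_tr, T_tr, t=0.5), y_train)
#   y_pred = svm.predict(diffusion_gram(T_te, T_tr, t=0.5))
```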
Our results are consistent with previous experiments on this dataset [5], which have observed that the linear and Gaussian kernels result in very similar performance. However the information diffusion kernel significantly outperforms both of them, almost always obtaining lower error rate than the average error rate of the other kernels. For the faculty vs. course task, the error rate is halved. This result is striking because the kernels use identical representations of the documents, vectors of word counts (in contrast to, for example, string kernels). We attribute this improvement to the fact that the information metric places more emphasis on points near the boundary of the simplex.

5 Discussion

Kernel-based methods generally are "model free," and do not make distributional assumptions about the data that the learning algorithm is applied to. Yet statistical models offer many advantages, and thus it is attractive to explore methods that combine data models and purely discriminative methods for classification and regression. Our approach brings a new perspective to combining parametric statistical modeling with non-parametric discriminative learning. In this aspect it is related to the methods proposed by Jaakkola and Haussler [4]. However, the kernels we investigate here differ significantly from the Fisher kernel proposed in [4]. In particular, the latter is based on the Fisher score $\nabla_\theta \log p(x \mid \theta)$ evaluated at a single point $\hat{\theta}$ in parameter space, and in the case of an exponential family model it is given by a covariance $K(x, x') = \sum_i \left( x_i - E_{\hat{\theta}}[X_i] \right) \left( x_i' - E_{\hat{\theta}}[X_i] \right)$. In contrast, information diffusion kernels are based on the full geometry of the statistical family, and yet are also invariant under reparameterization of the family.

Bounds on the covering numbers for information diffusion kernels were derived for the case of positive curvature, which apply to the special case of the multinomial. We note that the resulting bounds are essentially the same as those that would be obtained for the Gaussian kernel on the flat $d$-dimensional torus, which is the standard way of "compactifying" Euclidean space to get a Laplacian having only discrete spectrum; the results of [3] are formulated for the case $d = 1$, corresponding to the circle $S^1$. Similar bounds for general manifolds with curvature bounded below by a negative constant should also be attainable.

While information diffusion kernels are very general, they may be difficult to compute in particular cases; explicit formulas such as equations (8-9) for hyperbolic space are rare. To approximate an information diffusion kernel it may be attractive to use the parametrices and geodesic distance $d(\theta, \theta')$ between points, as we have done for the multinomial. In cases where the distance itself is difficult to compute exactly, a compromise may be to approximate the distance between nearby points in terms of the Kullback-Leibler divergence, using the relation $d^2(\theta, \theta') \approx 2 \, KL \left( p(\cdot \mid \theta) \, \| \, p(\cdot \mid \theta') \right)$.
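A quick numerical check of this relation on the multinomial simplex (our sketch; the two nearby points are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) between points on the simplex."""
    return float(np.sum(p * np.log(p / q)))

def fisher_geodesic(p, q):
    """Fisher geodesic distance (5): d = 2 arccos( sum_i sqrt(p_i q_i) )."""
    return 2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

p = np.array([0.50, 0.30, 0.20])
q = np.array([0.48, 0.31, 0.21])
print(fisher_geodesic(p, q) ** 2, 2.0 * kl(p, q))   # both approximately 1.6e-3
```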
The primary "degree of freedom" in the use of information diffusion kernels lies in the specification of the mapping of data to model parameters, $x \mapsto \hat{\theta}(x)$. For the multinomial, we have used the maximum likelihood mapping $x \mapsto \arg\max_\theta p(x \mid \theta)$, which is simple and well motivated. As indicated in Section 2, there are other possibilities. This remains an interesting area to explore, particularly for latent variable models.

Acknowledgements

This work was supported in part by NSF grant CCR-0122581.

References

[1] S. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, 2000.

[2] A. Grigor'yan and M. Noguchi. The heat kernel on hyperbolic space. Bulletin of the London Mathematical Society, 30:643-650, 1998.

[3] Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. IEEE Transactions on Information Theory, 48(1), January 2002.

[4] T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, volume 11, 1998.

[5] T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorisation. In Proceedings of the International Conference on Machine Learning (ICML), 2001.

[6] R. E. Kass and P. W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley Series in Probability and Statistics. John Wiley & Sons, 1997.

[7] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the International Conference on Machine Learning (ICML), 2002.

[8] P. Li and S.-T. Yau. Estimates of eigenvalues of a compact Riemannian manifold. In Geometry of the Laplace Operator, volume 36 of Proceedings of Symposia in Pure Mathematics, pages 205-239, 1980.

[9] R. Schoen and S.-T. Yau. Lectures on Differential Geometry, volume 1 of Conference Proceedings and Lecture Notes in Geometry and Topology. International Press, 1994.
", "award": [], "sourceid": 2216, "authors": [{"given_name": "Guy", "family_name": "Lebanon", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}