{"title": "Halting in Random Walk Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1639, "page_last": 1647, "abstract": "Random walk kernels measure graph similarity by counting matching walks in two graphs. In their most popular form of geometric random walk kernels, longer walks of length $k$ are downweighted by a factor of $\\lambda^k$ ($\\lambda < 1$) to ensure convergence of the corresponding geometric series. We know from the field of link prediction that this downweighting often leads to a phenomenon referred to as halting: Longer walks are downweighted so much that the similarity score is completely dominated by the comparison of walks of length 1. This is a naive kernel between edges and vertices. We theoretically show that halting may occur in geometric random walk kernels. We also empirically quantify its impact in simulated datasets and popular graph classification benchmark datasets. Our findings promise to be instrumental in future graph kernel development and applications of random walk kernels.", "full_text": "Halting in Random Walk Kernels\n\nMahito Sugiyama\n\nISIR, Osaka University, Japan\n\nJST, PRESTO\n\nKarsten M. Borgwardt\nD-BSSE, ETH Z\u00a8urich\n\nBasel, Switzerland\n\nmahito@ar.sanken.osaka-u.ac.jp\n\nkarsten.borgwardt@bsse.ethz.ch\n\nAbstract\n\nRandom walk kernels measure graph similarity by counting matching walks in\ntwo graphs. In their most popular form of geometric random walk kernels, longer\nwalks of length k are downweighted by a factor of (cid:21)k ((cid:21) < 1) to ensure con-\nvergence of the corresponding geometric series. We know from the \ufb01eld of link\nprediction that this downweighting often leads to a phenomenon referred to as\nhalting: Longer walks are downweighted so much that the similarity score is\ncompletely dominated by the comparison of walks of length 1. This is a na\u00a8\u0131ve\nkernel between edges and vertices. 
We theoretically show that halting may occur in geometric random walk kernels. We also empirically quantify its impact in simulated datasets and popular graph classification benchmark datasets. Our findings promise to be instrumental in future graph kernel development and applications of random walk kernels.

1 Introduction

Over the last decade, graph kernels have become a popular approach to graph comparison [4, 5, 7, 9, 12, 13, 14], which is at the heart of many machine learning applications in bioinformatics, imaging, and social-network analysis. The first and best-studied instance of this family of kernels are random walk kernels, which count matching walks in two graphs [5, 7] to quantify their similarity. In particular, the geometric random walk kernel [5] is often used in applications as a baseline comparison method on graph benchmark datasets when developing new graph kernels. These geometric random walk kernels assign a weight $\lambda^k$ to walks of length $k$, where $\lambda < 1$ is set small enough to ensure convergence of the corresponding geometric series.

Related similarity measures have also been employed in link prediction [6, 10] as a similarity score between vertices [8]. However, there is one caveat regarding these approaches. Walk-based similarity scores with exponentially decaying weights tend to suffer from a problem referred to as halting [1]. They may downweight walks of length 2 and more so much that the similarity score is ultimately completely dominated by walks of length 1. In other words, they are almost identical to a simple comparison of edges and vertices, which ignores any topological information in the graph beyond single edges. Such a simple similarity measure could be computed more efficiently outside the random walk framework.
Therefore, halting may affect both the expressivity and efficiency of these similarity scores.

Halting has been conjectured to occur in random walk kernels [1], but its existence in graph kernels has never been theoretically proven or empirically demonstrated. Our goal in this study is to answer the open question of if and when halting occurs in random walk graph kernels.

We theoretically show that halting may occur in graph kernels and that its extent depends on properties of the graphs being compared (Section 2). We empirically demonstrate in which simulated datasets and popular graph classification benchmark datasets halting is a concern (Section 3). We conclude by summarizing when halting occurs in practice and how it can be avoided (Section 4). We believe that our findings will be instrumental in future applications of random walk kernels and the development of novel graph kernels.

2 Theoretical Analysis of Halting

We theoretically analyze the phenomenon of halting in random walk graph kernels. First, we review the definition of graph kernels in Section 2.1. We then present our key theoretical result regarding halting in Section 2.2 and clarify the connection to linear kernels on vertex and edge label histograms in Section 2.3.

2.1 Random Walk Kernels

Let $G = (V, E, \varphi)$ be a labeled graph, where $V$ is the vertex set, $E$ is the edge set, and $\varphi$ is a mapping $\varphi : V \cup E \to \Sigma$ with the range $\Sigma$ of vertex and edge labels. For an edge $(u, v) \in E$, we identify $(u, v)$ and $(v, u)$ if $G$ is undirected.
The degree of a vertex $v \in V$ is denoted by $d(v)$.

The direct (tensor) product $G_\times = (V_\times, E_\times, \varphi_\times)$ of two graphs $G = (V, E, \varphi)$ and $G' = (V', E', \varphi')$ is defined as follows [1, 5, 14]:

$$V_\times = \{\, (v, v') \in V \times V' \mid \varphi(v) = \varphi'(v') \,\},$$
$$E_\times = \{\, ((u, u'), (v, v')) \in V_\times \times V_\times \mid (u, v) \in E,\ (u', v') \in E', \text{ and } \varphi(u, v) = \varphi'(u', v') \,\},$$

and all labels are inherited, i.e., $\varphi_\times((v, v')) = \varphi(v) = \varphi'(v')$ and $\varphi_\times(((u, u'), (v, v'))) = \varphi(u, v) = \varphi'(u', v')$. We denote by $A_\times$ the adjacency matrix of $G_\times$ and denote by $\delta_\times$ and $\Delta_\times$ the minimum and maximum degrees of $G_\times$, respectively.

To measure the similarity between graphs $G$ and $G'$, random walk kernels count all pairs of matching walks on $G$ and $G'$ [2, 5, 7, 11]. If we assume a uniform distribution for the starting and stopping probabilities over the vertices of $G$ and $G'$, the number of matching walks is obtained through the adjacency matrix $A_\times$ of the product graph $G_\times$ [14]. For each $k \in \mathbb{N}$, the $k$-step random walk kernel between two graphs $G$ and $G'$ is defined as:

$$K_\times^k(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{l=0}^{k} \lambda_l A_\times^l \right]_{ij}$$

with a sequence of positive, real-valued weights $\lambda_0, \lambda_1, \lambda_2, \dots, \lambda_k$, assuming that $A_\times^0 = I$, the identity matrix.
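As a concrete illustration, the product graph and the $k$-step kernel above can be computed directly for small graphs. The following is a minimal sketch under simplifying assumptions (only vertex labels, dense adjacency matrices); it is not the authors' implementation, which is written in C++ with Eigen:

```python
import numpy as np

def product_adjacency(A, labels_a, B, labels_b):
    """Adjacency matrix of the direct (tensor) product G_x.

    Vertices of G_x are label-matching pairs (v, v'); two pairs are
    adjacent iff both coordinates are adjacent in their own graph.
    Edge labels are omitted in this sketch for brevity."""
    pairs = [(i, j) for i in range(len(labels_a))
             for j in range(len(labels_b)) if labels_a[i] == labels_b[j]]
    n = len(pairs)
    Ax = np.zeros((n, n))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            if A[i, k] and B[j, l]:
                Ax[a, b] = 1.0
    return Ax

def k_step_kernel(Ax, weights):
    """K^k_x(G, G') = sum_ij [ sum_{l=0}^k lambda_l A_x^l ]_ij,
    with weights = (lambda_0, ..., lambda_k) and A_x^0 = I."""
    S = np.zeros_like(Ax)
    P = np.eye(Ax.shape[0])  # current power A_x^l, starting at l = 0
    for lam in weights:
        S += lam * P
        P = P @ Ax
    return float(S.sum())
```

For two triangles with identical vertex labels, the product graph has $9$ vertices, and with $\lambda_0 = \lambda_1 = 1$ the one-step kernel counts $|V_\times|$ plus all entries of $A_\times$.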
Its limit $K_\times^\infty(G, G')$ is simply called the random walk kernel.

Interestingly, $K_\times^\infty$ can be directly computed if the weights are a geometric series, i.e., $\lambda_l = \lambda^l$, resulting in the geometric random walk kernel:

$$K_{\mathrm{GR}}(G, G') = \sum_{i,j=1}^{|V_\times|} \left[ \sum_{l=0}^{\infty} \lambda^l A_\times^l \right]_{ij} = \sum_{i,j=1}^{|V_\times|} \left[ (I - \lambda A_\times)^{-1} \right]_{ij}.$$

In the above equation, suppose that $(I - \lambda A_\times)x = 0$ for some vector $x$. Then $\lambda A_\times x = x$ and $(\lambda A_\times)^l x = x$ for any $l \in \mathbb{N}$. If $(\lambda A_\times)^l$ converges to $0$ as $l \to \infty$, then $x$ must be $0$, so $(I - \lambda A_\times)$ is invertible. Therefore, $(I - \lambda A_\times)^{-1} = \sum_{l=0}^{\infty} \lambda^l A_\times^l$ from the equation $(I - \lambda A_\times)(I + \lambda A_\times + \lambda^2 A_\times^2 + \cdots) = I$ [5]. It is well known that the geometric series of matrices, often called the Neumann series, $I + \lambda A_\times + (\lambda A_\times)^2 + \cdots$ converges only if the maximum eigenvalue of $A_\times$, denoted by $\mu_{\times,\max}$, is strictly smaller than $1/\lambda$. Therefore, the geometric random walk kernel $K_{\mathrm{GR}}$ is well-defined only if $\lambda < 1/\mu_{\times,\max}$.

There is a relationship involving the minimum and maximum degrees $\delta_\times$ and $\Delta_\times$ of $G_\times$ [3]: $\delta_\times \le \bar{d}_\times \le \mu_{\times,\max} \le \Delta_\times$, where $\bar{d}_\times$ is the average of the vertex degrees of $G_\times$, i.e., $\bar{d}_\times = (1/|V_\times|) \sum_{v \in V_\times} d(v)$. In practice, it is sufficient to set the parameter $\lambda < 1/\Delta_\times$.

In the inductive learning setting, since we do not know a priori the target graphs that a learner will receive in the future, $\lambda$ should be small enough that $\lambda < 1/\mu_{\times,\max}$ for any pair of unseen graphs. Otherwise, we need to re-compute the full kernel matrix and re-train the learner. In the transductive setting, we are given a collection $\mathcal{G}$ of graphs beforehand. We can explicitly compute the upper bound of $\lambda$, which is $(\max_{G,G' \in \mathcal{G}} \mu_{\times,\max})^{-1}$ with the maximum of the maximum eigenvalues over all pairs of graphs $G, G' \in \mathcal{G}$.

2.2 Halting

The geometric random walk kernel $K_{\mathrm{GR}}$ is one of the most popular graph kernels, as it can take walks of any length into account [5, 14]. However, the fact that it weights walks of length $k$ by the $k$th power of $\lambda$, together with the condition that $\lambda < (\mu_{\times,\max})^{-1} < 1$, immediately tells us that the contribution of longer walks is significantly lowered in $K_{\mathrm{GR}}$. If the contribution of walks of length 2 and more to the kernel value is completely dominated by the contribution of walks of length 1, we speak of halting: it is as if the random walks halt after one step.

Here, we analyze under which conditions this halting phenomenon may occur in geometric random walk kernels. We obtain the following key theoretical statement by comparing $K_{\mathrm{GR}}$ to the one-step random walk kernel $K_\times^1$.

Theorem 1 Let $\lambda_0 = 1$ and $\lambda_1 = \lambda$ in the random walk kernel. For a pair of graphs $G$ and $G'$,

$$K_\times^1(G, G') \le K_{\mathrm{GR}}(G, G') \le K_\times^1(G, G') + \varepsilon,$$

where

$$\varepsilon = |V_\times| \, \frac{(\lambda \Delta_\times)^2}{1 - \lambda \Delta_\times},$$

and $\varepsilon$ monotonically converges to $0$ as $\lambda \to 0$.

Proof. Let $d(v)$ be the degree of a vertex $v$ in $G_\times$ and $N(v)$ be the set of neighboring vertices of $v$, that is, $N(v) = \{\, u \in V_\times \mid (u, v) \in E_\times \,\}$.
Since $A_\times$ is the adjacency matrix of $G_\times$, the following relationships hold:

$$\sum_{i,j=1}^{|V_\times|} [A_\times]_{ij} = \sum_{v \in V_\times} d(v) \le |V_\times| \Delta_\times, \qquad \sum_{i,j=1}^{|V_\times|} [A_\times^2]_{ij} = \sum_{v \in V_\times} \sum_{v' \in N(v)} d(v') \le |V_\times| \Delta_\times^2,$$

$$\sum_{i,j=1}^{|V_\times|} [A_\times^3]_{ij} = \sum_{v \in V_\times} \sum_{v' \in N(v)} \sum_{v'' \in N(v')} d(v'') \le |V_\times| \Delta_\times^3, \quad \dots, \quad \sum_{i,j=1}^{|V_\times|} [A_\times^n]_{ij} \le |V_\times| \Delta_\times^n.$$

From the assumption that $\lambda \Delta_\times < 1$, we have

$$K_{\mathrm{GR}}(G, G') = \sum_{i,j=1}^{|V_\times|} [I + \lambda A_\times + \lambda^2 A_\times^2 + \dots]_{ij} = K_\times^1(G, G') + \sum_{i,j=1}^{|V_\times|} [\lambda^2 A_\times^2 + \lambda^3 A_\times^3 + \dots]_{ij}$$
$$\le K_\times^1(G, G') + |V_\times| \lambda^2 \Delta_\times^2 (1 + \lambda \Delta_\times + \lambda^2 \Delta_\times^2 + \dots) = K_\times^1(G, G') + \varepsilon.$$

It is clear that $\varepsilon$ monotonically goes to $0$ as $\lambda \to 0$. ■

Moreover, we can normalize $\varepsilon$ by dividing $K_{\mathrm{GR}}(G, G')$ by $K_\times^1(G, G')$.

Corollary 1 Let $\lambda_0 = 1$ and $\lambda_1 = \lambda$ in the random walk kernel. For a pair of graphs $G$ and $G'$,

$$1 \le \frac{K_{\mathrm{GR}}(G, G')}{K_\times^1(G, G')} \le 1 + \varepsilon',$$

where

$$\varepsilon' = \frac{(\lambda \Delta_\times)^2}{(1 - \lambda \Delta_\times)(1 + \lambda \bar{d}_\times)}$$

and $\bar{d}_\times$ is the average of vertex degrees of $G_\times$.

Proof.
Since we have

$$K_\times^1(G, G') = |V_\times| + \lambda \sum_{v \in V_\times} d(v) = |V_\times| (1 + \lambda \bar{d}_\times),$$

it follows that $\varepsilon / K_\times^1(G, G') = \varepsilon'$. ■

Theorem 1 can be easily generalized to any $k$-step random walk kernel $K_\times^k$.

Corollary 2 Let $\varepsilon(k) = |V_\times| (\lambda \Delta_\times)^k / (1 - \lambda \Delta_\times)$. For a pair of graphs $G$ and $G'$, we have

$$K_\times^k(G, G') \le K_{\mathrm{GR}}(G, G') \le K_\times^k(G, G') + \varepsilon(k + 1).$$

Our results imply that, in the geometric random walk kernel $K_{\mathrm{GR}}$, the contribution of walks of length 2 and longer diminishes for very small choices of $\lambda$. This can easily happen in real-world graph data, as $\lambda$ is upper-bounded by the inverse of the maximum degree of the product graph.

2.3 Relationships to Linear Kernels on Label Histograms

Next, we clarify the relationship between $K_{\mathrm{GR}}$ and basic linear kernels on vertex and edge label histograms. We show that halting causes $K_{\mathrm{GR}}$ to converge to such linear kernels.

Given a pair of graphs $G$ and $G'$, let us introduce two linear kernels on vertex and edge histograms. Assume that the range of labels is $\Sigma = \{1, 2, \dots, s\}$ without loss of generality. The vertex label histogram of a graph $G = (V, E, \varphi)$ is a vector $f = (f_1, f_2, \dots, f_s)$ such that $f_i = |\{\, v \in V \mid \varphi(v) = i \,\}|$ for each $i \in \Sigma$. Let $f$ and $f'$ be the vertex label histograms of graphs $G$ and $G'$, respectively. The vertex label histogram kernel $K_{\mathrm{VH}}(G, G')$ is then defined as the linear kernel between $f$ and $f'$:

$$K_{\mathrm{VH}}(G, G') = \langle f, f' \rangle = \sum_{i=1}^{s} f_i f'_i.$$

Similarly, the edge label histogram is a vector $g = (g_1, g_2, \dots, g_s)$ such that $g_i = |\{\, (u, v) \in E \mid \varphi(u, v) = i \,\}|$ for each $i \in \Sigma$.
The edge label histogram kernel $K_{\mathrm{EH}}(G, G')$ is defined as the linear kernel between $g$ and $g'$, the respective histograms:

$$K_{\mathrm{EH}}(G, G') = \langle g, g' \rangle = \sum_{i=1}^{s} g_i g'_i.$$

Finally, we introduce the vertex-edge label histogram. Let $h = (h_{111}, h_{211}, \dots, h_{sss})$ be a histogram vector such that $h_{ijk} = |\{\, (u, v) \in E \mid \varphi(u, v) = i, \varphi(u) = j, \varphi(v) = k \,\}|$ for each $i, j, k \in \Sigma$. The vertex-edge label histogram kernel $K_{\mathrm{VEH}}(G, G')$ is defined as the linear kernel between $h$ and $h'$ for the respective histograms of $G$ and $G'$:

$$K_{\mathrm{VEH}}(G, G') = \langle h, h' \rangle = \sum_{i,j,k=1}^{s} h_{ijk} h'_{ijk}.$$

Notice that $K_{\mathrm{VEH}}(G, G') = K_{\mathrm{EH}}(G, G')$ if vertices are not labeled.

From the definition of the direct product of graphs, we can confirm the following relationships between histogram kernels and the random walk kernel.

Lemma 1 For a pair of graphs $G$, $G'$ and their direct product $G_\times$, we have

$$K_{\mathrm{VH}}(G, G') = \frac{1}{\lambda_0} K_\times^0(G, G') = |V_\times|,$$

$$K_{\mathrm{VEH}}(G, G') = \frac{1}{\lambda_1} K_\times^1(G, G') - \frac{\lambda_0}{\lambda_1} K_\times^0(G, G') = \sum_{i,j=1}^{|V_\times|} [A_\times]_{ij}.$$

Proof.
The first equation $K_{\mathrm{VH}}(G, G') = |V_\times|$ can be proven from the following:

$$K_{\mathrm{VH}}(G, G') = \sum_{v \in V} |\{\, v' \in V' \mid \varphi(v) = \varphi'(v') \,\}| = |\{\, (v, v') \in V \times V' \mid \varphi(v) = \varphi'(v') \,\}| = |V_\times| = \frac{1}{\lambda_0} K_\times^0(G, G').$$

We can prove the second equation in a similar fashion:

$$K_{\mathrm{VEH}}(G, G') = 2 \sum_{(u,v) \in E} |\{\, (u', v') \in E' \mid \varphi(u, v) = \varphi'(u', v'),\ \varphi(u) = \varphi'(u'),\ \varphi(v) = \varphi'(v') \,\}|$$
$$= 2 \,\Big| \big\{\, ((u, v), (u', v')) \in E \times E' \;\big|\; \varphi(u, v) = \varphi'(u', v'),\ \varphi(u) = \varphi'(u'),\ \varphi(v) = \varphi'(v') \,\big\} \Big|$$
$$= 2 |E_\times| = \sum_{i,j=1}^{|V_\times|} [A_\times]_{ij} = \frac{1}{\lambda_1} K_\times^1(G, G') - \frac{\lambda_0}{\lambda_1} K_\times^0(G, G'). \;\blacksquare$$

Finally, let us define a new kernel

$$K_{\mathrm{H}}(G, G') := K_{\mathrm{VH}}(G, G') + \lambda K_{\mathrm{VEH}}(G, G') \qquad (1)$$

with a parameter $\lambda$.
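The three histogram kernels and the combination $K_{\mathrm{H}}$ can be computed from label counts alone. A minimal sketch (function and variable names are ours, not from the paper's C++ code; each undirected edge is assumed to be listed in both orientations):

```python
from collections import Counter

def vertex_hist_kernel(vlabels, vlabels2):
    """K_VH: linear kernel between vertex label histograms."""
    f, f2 = Counter(vlabels), Counter(vlabels2)
    return sum(f[i] * f2[i] for i in f)

def vertex_edge_hist_kernel(edges, edges2):
    """K_VEH: linear kernel between vertex-edge label histograms.
    Each edge is a triple (edge_label, label_u, label_v); for an
    undirected graph, list each edge in both orientations."""
    h, h2 = Counter(edges), Counter(edges2)
    return sum(h[t] * h2[t] for t in h)

def combined_kernel(vlabels, edges, vlabels2, edges2, lam):
    """K_H = K_VH + lam * K_VEH, Equation (1)."""
    return (vertex_hist_kernel(vlabels, vlabels2)
            + lam * vertex_edge_hist_kernel(edges, edges2))
```

By Lemma 1, for graphs whose product graph has adjacency matrix $A_\times$, the value returned by `combined_kernel` equals $|V_\times| + \lambda \sum_{i,j} [A_\times]_{ij}$, i.e., the one-step kernel $K_\times^1$ with $\lambda_0 = 1$ and $\lambda_1 = \lambda$.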
From Lemma 1, since $K_{\mathrm{H}}(G, G') = K_\times^1(G, G')$ holds if $\lambda_0 = 1$ and $\lambda_1 = \lambda$ in the one-step random walk kernel $K_\times^1$, we have the following relationship from Theorem 1.

Corollary 3 For a pair of graphs $G$ and $G'$, we have

$$K_{\mathrm{H}}(G, G') \le K_{\mathrm{GR}}(G, G') \le K_{\mathrm{H}}(G, G') + \varepsilon,$$

where $\varepsilon$ is given in Theorem 1.

To summarize, our results show that if the parameter $\lambda$ of the geometric random walk kernel $K_{\mathrm{GR}}$ is small enough, random walks halt, and $K_{\mathrm{GR}}$ reduces to $K_{\mathrm{H}}$, which finally converges to $K_{\mathrm{VH}}$. This limit is based on vertex histograms only and completely ignores the topological structure of the graphs.

3 Experiments

We empirically examine the halting phenomenon of the geometric random walk kernel on popular real-world graph benchmark datasets and semi-simulated graph data.

3.1 Experimental Setup

Environment. We used Amazon Linux AMI release 2015.03 and ran all experiments on a single core of a 2.5 GHz Intel Xeon CPU E5-2670 with 244 GB of memory. All kernels were implemented in C++ with the Eigen library and compiled with gcc 4.8.2.

Datasets. We collected five real-world graph classification benchmark datasets¹: ENZYMES, NCI1, NCI109, MUTAG, and D&D, which are popular in the graph-classification literature [13, 14]. ENZYMES and D&D are proteins, and NCI1, NCI109, and MUTAG are chemical compounds. Statistics of these datasets are summarized in Table 1, in which we also show the maximum of maximum degrees of product graphs, $\max_{G,G' \in \mathcal{G}} \Delta_\times$, for each dataset $\mathcal{G}$. We consistently used $\lambda_{\max} = (\max_{G,G' \in \mathcal{G}} \Delta_\times)^{-1}$ as the upper bound of $\lambda$ in geometric random walk kernels; the gap to the lower bound of $\lambda$ was less than one order of magnitude. The average degrees of the product graphs, which determine the lower bound of $\lambda$, were 18.17, 7.93, 5.60, 6.21, and 13.31 for ENZYMES, NCI1, NCI109, MUTAG, and D&D, respectively.

Kernels.
We employed the following graph kernels in our experiments: the linear kernels on vertex label histograms $K_{\mathrm{VH}}$, edge label histograms $K_{\mathrm{EH}}$, and vertex-edge label histograms $K_{\mathrm{VEH}}$, and the combination $K_{\mathrm{H}}$ introduced in Equation (1). We also included a Gaussian RBF kernel between vertex-edge label histograms, denoted as $K_{\mathrm{VEH,G}}$. From the family of random walk kernels, we used the geometric random walk kernel $K_{\mathrm{GR}}$ and the $k$-step random walk kernel $K_\times^k$. Only the number $k$ of steps was treated as a parameter in $K_\times^k$, and $\lambda_k$ was fixed to $1$ for all $k$. We used fixed-point iterations [14, Section 4.3] for efficient computation of $K_{\mathrm{GR}}$. Moreover, we employed the Weisfeiler-Lehman subtree kernel [13], denoted as $K_{\mathrm{WL}}$, as the state-of-the-art graph kernel; it has a parameter $h$, the number of iterations.

3.2 Results on Real-World Datasets

We first compared the geometric random walk kernel $K_{\mathrm{GR}}$ to other kernels in graph classification. The classification accuracy of each graph kernel was examined by 10-fold cross validation with multiclass C-support vector classification (libsvm² was used), in which the parameter $C$ for C-SVC and a parameter of each kernel (if one exists) were chosen by internal 10-fold cross validation (CV) on the training dataset only.
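The fixed-point computation of $K_{\mathrm{GR}}$ mentioned above can be sketched as follows. This is a simplified stand-in for the scheme in [14, Section 4.3], assuming a uniform start/stop vector of ones; the actual experiments use a C++/Eigen implementation:

```python
import numpy as np

def kgr_fixed_point(Ax, lam, tol=1e-12, max_iter=10000):
    """Geometric random walk kernel via fixed-point iteration.

    Iterates x <- 1 + lam * Ax x, whose fixed point is
    (I - lam * Ax)^{-1} 1; summing its entries gives
    K_GR = sum_ij [(I - lam * Ax)^{-1}]_ij. The iteration converges
    whenever lam is strictly below 1 / mu_max(Ax)."""
    ones = np.ones(Ax.shape[0])
    x = ones.copy()
    for _ in range(max_iter):
        x_new = ones + lam * (Ax @ x)
        if np.max(np.abs(x_new - x)) < tol:
            return float(x_new.sum())
        x = x_new
    raise RuntimeError("no convergence: is lam < 1/mu_max?")
```

On small inputs the result can be checked against the direct formula $\sum_{i,j} [(I - \lambda A_\times)^{-1}]_{ij}$; the iteration avoids the cubic-cost matrix inverse and needs only matrix-vector products.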
We repeated the whole experiment 10 times and reported average classification accuracies with their standard errors.

¹The code and all datasets are available at: http://www.bsse.ethz.ch/mlcb/research/machine-learning/graph-kernels.html
²http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Table 1: Statistics of graph datasets; $|\Sigma_V|$ and $|\Sigma_E|$ denote the number of vertex and edge labels.

Dataset   | Size | #classes | avg.$|V|$ | avg.$|E|$ | max$|V|$ | max$|E|$ | $|\Sigma_V|$ | $|\Sigma_E|$ | max $\Delta_\times$
ENZYMES   | 600  | 6 | 32.63  | 62.14  | 126  | 149   | 3  | 1  | 65
NCI1      | 4110 | 2 | 29.87  | 32.30  | 111  | 119   | 37 | 3  | 16
NCI109    | 4127 | 2 | 29.68  | 32.13  | 111  | 119   | 38 | 3  | 17
MUTAG     | 188  | 2 | 17.93  | 19.79  | 28   | 33    | 7  | 11 | 10
D&D       | 1178 | 2 | 284.32 | 715.66 | 5748 | 14267 | 82 | 1  | 50

Figure 1: Classification accuracy on real-world datasets (Means ± SD): (a) ENZYMES, (b) NCI1, (c) NCI109. In each row, panel (i) compares various graph kernels, panel (ii) compares $K_{\mathrm{GR}}$ with $K_{\mathrm{H}}$ across the parameter $\lambda$, and panel (iii) shows the $k$-step kernel $K_\times^k$ across the number of steps $k$.

The list of parameters optimized by the internal CV is as follows: $C \in \{2^{-7}, 2^{-5}, \dots, 2^5, 2^7\}$ for C-SVC, the width $\sigma \in \{10^{-2}, \dots, 10^2\}$ in the RBF kernel $K_{\mathrm{VEH,G}}$, the number of steps $k \in \{1, \dots, 10\}$ in $K_\times^k$, the number of iterations $h \in \{1, \dots, 10\}$ in $K_{\mathrm{WL}}$, and $\lambda \in \{10^{-5}, \dots, 10^{-2}, \lambda_{\max}\}$ in $K_{\mathrm{H}}$ and $K_{\mathrm{GR}}$, where $\lambda_{\max} = (\max_{G,G' \in \mathcal{G}} \Delta_\times)^{-1}$.

Results are summarized in the left column of Figure 1 for ENZYMES, NCI1, and NCI109. We present results on MUTAG and D&D in the Supplementary Notes, as different graph kernels do not give significantly different results on them (e.g., [13]). Overall, we could observe two trends.
First, the Weisfeiler-Lehman subtree kernel $K_{\mathrm{WL}}$ was the most accurate, which confirms results in [13]. Second, the two random walk kernels $K_{\mathrm{GR}}$ and $K_\times^k$ show greater accuracy than naïve linear kernels on edge and vertex histograms, which indicates that halting is not occurring in these datasets. It is also noteworthy that employing a Gaussian RBF kernel on vertex-edge histograms leads to a clear improvement over linear kernels on all three datasets. On ENZYMES, the Gaussian kernel is even on par with the random walks in terms of accuracy.

To investigate the effect of halting in more detail, we show the accuracy of $K_{\mathrm{GR}}$ and $K_{\mathrm{H}}$ in the center column of Figure 1 for various choices of $\lambda$, from $10^{-5}$ to its upper bound. We can clearly see that halting occurs for small $\lambda$, which greatly affects the performance of $K_{\mathrm{GR}}$. More specifically, if $\lambda$ is chosen to be very small (smaller than $10^{-3}$ in our datasets), the accuracies are close to the naïve baseline $K_{\mathrm{H}}$ that ignores the topological structure of graphs. However, accuracies are much closer to those reached by the Weisfeiler-Lehman kernel if $\lambda$ is close to its theoretical maximum. Of course, the theoretical maximum of $\lambda$ depends on unseen test data in reality. Therefore, we often have to set $\lambda$ conservatively so that we can apply the trained model to any unseen graph data.

Moreover, we also investigated the accuracy of the random walk kernel as a function of the number of steps $k$ of the random walk kernel $K_\times^k$. Results are shown in the right column of Figure 1. In all datasets, accuracy improves with each step, up to four to five steps. The optimal number of steps in $K_\times^k$ and the maximum $\lambda$ give similar accuracy levels. We also confirmed Theorem 1 in that conservative choices of $\lambda$ ($10^{-3}$ or less) give the same accuracy as a one-step random walk.

In addition, Figure 2 shows histograms of $\log_{10} \varepsilon'$, where $\varepsilon'$ is given in Corollary 1, for $\lambda = (\max \Delta_\times)^{-1}$ over all pairs of graphs in the respective datasets. The value $\varepsilon'$ can be viewed as the deviation of $K_{\mathrm{GR}}$ from $K_{\mathrm{H}}$ in percentages. Although $\varepsilon'$ is small on average (about 0.1 percent on the ENZYMES and NCI datasets), we confirmed the existence of relatively large values of $\varepsilon'$ in the plot (more than 1 percent), which might cause the difference between $K_{\mathrm{GR}}$ and $K_{\mathrm{H}}$.

Figure 2: Distribution of $\log_{10} \varepsilon'$, where $\varepsilon'$ is defined in Corollary 1, in real-world datasets: (a) ENZYMES, (b) NCI1, (c) NCI109.

3.3 Results on Semi-Simulated Datasets

To empirically study halting, we generated semi-simulated graphs from our three benchmark datasets (ENZYMES, NCI1, and NCI109) and compared the three kernels $K_{\mathrm{GR}}$, $K_{\mathrm{H}}$, and $K_{\mathrm{VH}}$. In each dataset, we artificially generated denser graphs by randomly adding edges, in which the number of new edges per graph was determined from a normal distribution with mean $m \in \{10, 20, 50, 100\}$; the distribution of edge labels was unchanged. Note that the accuracy of the vertex histogram kernel $K_{\mathrm{VH}}$ always stays the same, as we only added edges.

Figure 3: Classification accuracy on semi-simulated datasets (Means ± SD): (a) Sim-ENZYMES, (b) Sim-NCI1, (c) Sim-NCI109.

Results are plotted in Figure 3. There are two key observations. First, by adding new false edges to the graphs, the accuracy levels drop for both the random walk kernel and the histogram kernel. However, even after adding 100 new false edges per graph, they are both still better than a naïve classifier that assigns all graphs to the same class (accuracy of 16.6 percent on ENZYMES and approximately 50 percent on NCI1 and NCI109). Second, the geometric random walk kernel quickly approaches the accuracy level of $K_{\mathrm{H}}$ when new edges are added. This is a strong indicator that halting occurs. As graphs become denser, the upper bound for $\lambda$ gets smaller, and the accuracy of the geometric random walk kernel $K_{\mathrm{GR}}$ rapidly drops and converges to $K_{\mathrm{H}}$.
This result confirms Corollary 3, which says that both $K_{\mathrm{GR}}$ and $K_{\mathrm{H}}$ converge to $K_{\mathrm{VH}}$ as $\lambda$ goes to $0$.

4 Discussion

In this work, we show when and where the phenomenon of halting occurs in random walk kernels. Halting refers to the fact that similarity measures based on counting walks (of potentially infinite length) often downweight longer walks so much that the similarity score is completely dominated by walks of length 1, degenerating the random walk kernel to a simple kernel between edges and vertices. While it had been conjectured that this problem may arise in graph kernels [1], we provide the first theoretical proof and empirical demonstration of the occurrence and extent of halting in geometric random walk kernels.

We show that the difference between geometric random walk kernels and simple edge kernels depends on the maximum degree of the graphs being compared. With increasing maximum degree, the difference converges to zero. We empirically demonstrate on simulated graphs that the comparison of graphs with high maximum degrees suffers from halting. On real graph data from popular graph classification benchmark datasets, the maximum degree is so low that halting can be avoided if the decaying weight $\lambda$ is set close to its theoretical maximum. Still, if $\lambda$ is set conservatively to a low value to ensure convergence, halting can clearly be observed, even on unseen test graphs with unknown maximum degrees.

There is an interesting connection between halting and tottering [1, Section 2.1.5], a weakness of random walk kernels described more than a decade ago [11]. Tottering is the phenomenon that a walk of infinite length may go back and forth along the same edge, thereby creating an artificially inflated similarity score if two graphs share a common edge. Halting and tottering seem to be opposing effects: if halting occurs, the effect of tottering is reduced, and vice versa.
Halting downweights these tottering walks and counteracts the inflation of the similarity scores. An interesting point is that the strategies proposed to remove tottering from walk kernels did not lead to a clear improvement in classification accuracy [11], while we observed a strong negative effect of halting on classification accuracy in our experiments (Section 3). This finding stresses the importance of studying halting.

Our theoretical and empirical results have important implications for future applications of random walk kernels. First, if the geometric random walk kernel is used on a graph dataset with known maximum degree, $\lambda$ should be close to the theoretical maximum. Second, simple baseline kernels based on vertex and edge label histograms should be employed to check empirically whether the random walk kernel gives better accuracy than these baselines. Third, particularly in datasets with high maximum degree, we advise using a fixed-length-$k$ random walk kernel rather than a geometric random walk kernel. Optimizing the length $k$ by cross validation on the training dataset led to competitive or superior results compared to the geometric random walk kernel in all of our experiments. Based on these results and the fact that by definition the fixed-length kernel does not suffer from halting, we recommend using the fixed-length random walk kernel as a comparison method in future studies on novel graph kernels.

Acknowledgments. This work was supported by JSPS KAKENHI Grant Number 26880013 (MS), the Alfried Krupp von Bohlen und Halbach-Stiftung (KB), the SNSF Starting Grant 'Significant Pattern Mining' (KB), and the Marie Curie Initial Training Network MLPM2012, Grant No. 316861 (KB).

References

[1] Borgwardt, K. M. Graph Kernels. PhD thesis, Ludwig-Maximilians-University Munich, 2007.
[2] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl 1):i47–i56, 2005.
[3] Brualdi, R. A. The Mutually Beneficial Relationship of Graphs and Matrices. AMS, 2011.
[4] Costa, F. and Grave, K. D. Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th International Conference on Machine Learning (ICML), 255–262, 2010.
[5] Gärtner, T., Flach, P., and Wrobel, S. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines (LNCS 2777), 129–143, 2003.
[6] Girvan, M. and Newman, M. E. J. Community structure in social and biological networks. Proceedings of the National Academy of Sciences (PNAS), 99(12):7821–7826, 2002.
[7] Kashima, H., Tsuda, K., and Inokuchi, A. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML), 321–328, 2003.
[8] Katz, L. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
[9] Kriege, N., Neumann, M., Kersting, K., and Mutzel, P. Explicit versus implicit graph feature maps: A computational phase transition for walk kernels. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 881–886, 2014.
[10] Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[11] Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. Extensions of marginalized graph kernels. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
[12] Shervashidze, N. and Borgwardt, K. M. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems (NIPS) 22, 1660–1668, 2009.
[13] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[14] Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
", "award": [], "sourceid": 1009, "authors": [{"given_name": "Mahito", "family_name": "Sugiyama", "institution": "Osaka University"}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": "ETH Zurich"}]}