{"title": "Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 5723, "page_last": 5733, "abstract": "While graph kernels (GKs) are easy to train and enjoy provable theoretical guarantees, their practical performances are limited by their expressive power, as the kernel function often depends on hand-crafted combinatorial features of graphs. Compared to graph kernels, graph neural networks (GNNs) usually achieve better practical performance, as GNNs use multi-layer architectures and non-linear activation functions to extract high-order information of graphs as features. However, due to the large number of hyper-parameters and the non-convex nature of the training procedure, GNNs are harder to train. Theoretical guarantees of GNNs are also not well-understood. Furthermore, the expressive power of GNNs scales with the number of parameters, and thus it is hard to exploit the full power of GNNs when computing resources are limited. The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to \\emph{infinitely wide} multi-layer GNNs trained by gradient descent. GNTKs enjoy the full expressive power of GNNs and inherit advantages of GKs. Theoretically, we show GNTKs provably learn a class of smooth functions on graphs. Empirically, we test GNTKs on graph classification datasets and show they achieve strong performance.", "full_text": "Graph Neural Tangent Kernel:\n\nFusing Graph Neural Networks with Graph Kernels\n\nSimon S. 
Du
Institute for Advanced Study
ssdu@ias.edu

Kangcheng Hou
Zhejiang University
kangchenghou@gmail.com

Barnabás Póczos
Carnegie Mellon University
bapoczos@cs.cmu.edu

Ruslan Salakhutdinov
Carnegie Mellon University
rsalakhu@cs.cmu.edu

Ruosong Wang
Carnegie Mellon University
ruosongw@andrew.cmu.edu

Keyulu Xu
Massachusetts Institute of Technology
keyulu@mit.edu

Abstract

While graph kernels (GKs) are easy to train and enjoy provable theoretical guarantees, their practical performances are limited by their expressive power, as the kernel function often depends on hand-crafted combinatorial features of graphs. Compared to graph kernels, graph neural networks (GNNs) usually achieve better practical performance, as GNNs use multi-layer architectures and non-linear activation functions to extract high-order information of graphs as features. However, due to the large number of hyper-parameters and the non-convex nature of the training procedure, GNNs are harder to train. Theoretical guarantees of GNNs are also not well-understood. Furthermore, the expressive power of GNNs scales with the number of parameters, and thus it is hard to exploit the full power of GNNs when computing resources are limited. The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to infinitely wide multi-layer GNNs trained by gradient descent. GNTKs enjoy the full expressive power of GNNs and inherit advantages of GKs. Theoretically, we show GNTKs provably learn a class of smooth functions on graphs. Empirically, we test GNTKs on graph classification datasets and show they achieve strong performance.

1 Introduction

Learning on graph-structured data such as social networks and biological networks requires one to design methods that effectively exploit the structure of graphs. 
Graph Kernels (GKs) and Graph Neural Networks (GNNs) are two major classes of methods for learning on graph-structured data. GKs, explicitly or implicitly, build feature vectors based on combinatorial properties of input graphs. Popular choices of GKs include the Weisfeiler-Lehman subtree kernel [Shervashidze et al., 2011], the graphlet kernel [Shervashidze et al., 2009] and random walk kernels [Vishwanathan et al., 2010, Gärtner et al., 2003]. GKs inherit all benefits of kernel methods. GKs are easy to train, since the corresponding optimization problem is convex. Moreover, the kernel function often has an explicit expression, and thus we can analyze their theoretical guarantees using tools from learning theory. The downside of GKs, however, is that hand-crafted features may not be powerful enough to capture high-order information that involves complex interactions between nodes, which could lead to worse practical performance than GNNs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

GNNs, on the other hand, do not require explicitly hand-crafted feature maps. Similar to convolutional neural networks (CNNs), which are widely applied in computer vision, GNNs use multi-layer structures and convolutional operations to aggregate local information of nodes, together with non-linear activation functions, to extract features from graphs. Various architectures have been proposed [Xu et al., 2019a, 2018]. GNNs extract higher-order information of graphs, which leads to more powerful features compared to the hand-crafted combinatorial features used by GKs. As a result, GNNs have achieved state-of-the-art performance on a large number of tasks on graph-structured data. Nevertheless, there are also disadvantages of using GNNs. The objective function of GNNs is highly non-convex, and thus it requires careful hyper-parameter tuning to stabilize the training procedure. 
Meanwhile, due to the non-convex nature of the training procedure, it is also hard to analyze the learned GNNs directly. For example, one may ask whether GNNs can provably learn a certain class of functions. This question seems hard to answer given our limited theoretical understanding of GNNs. Another disadvantage of GNNs is that the expressive power of GNNs scales with the number of parameters. Thus, it is hard to learn a powerful GNN when computing resources are limited. Can we build a model that enjoys the best of both worlds, i.e., a model that extracts powerful features like GNNs and is easy to train and analyze like GKs?
In this paper, we give an affirmative answer to this question. Inspired by recent connections between kernel methods and over-parameterized neural networks [Arora et al., 2019b,a, Du et al., 2019, 2018, Jacot et al., 2018, Yang, 2019], we propose a class of new graph kernels, Graph Neural Tangent Kernels (GNTKs). GNTKs are equivalent to infinitely wide GNNs trained by gradient descent, where the word "tangent" corresponds to the training algorithm: gradient descent. While GNTKs are induced by infinitely wide GNNs, the prediction of GNTKs depends only on pairwise kernel values between graphs, for which we give an analytic formula that can be calculated efficiently. Therefore, GNTKs enjoy the full expressive power of GNNs, while inheriting the benefits of GKs.

Our Contributions. First, inspired by recent connections between over-parameterized neural networks and kernel methods Jacot et al. [2018], Arora et al. [2019a], Yang [2019], we present a general recipe which translates a GNN architecture to its corresponding GNTK. This recipe works for a wide range of GNNs, including the graph isomorphism network (GIN) [Xu et al., 2019a], the graph convolutional network (GCN) [Kipf and Welling, 2016], and GNNs with jumping knowledge [Xu et al., 2018]. Second, we conduct a theoretical analysis of GNTKs. 
Using the technique developed in Arora et al. [2019b], we show that for a broad range of smooth functions over graphs, a certain GNTK can learn them with a polynomial number of samples. To our knowledge, this is the first sample complexity analysis in the GK and GNN literature. Finally, we validate the performance of GNTKs on 7 standard benchmark graph classification datasets. On four of them, we find GNTK outperforms all baseline methods and achieves state-of-the-art performance. In particular, GNTKs achieve 83.6% accuracy on the COLLAB dataset and 67.9% accuracy on the PTC dataset, compared to the best of the baselines, 81.0% and 64.6% respectively. Moreover, in our experiments, we also observe that GNTK is more computationally efficient than its GNN counterpart.
This paper is organized as follows. In Section 2, we provide necessary background and review operations in GNNs that we will use to derive GNTKs. In Section 3, we present our general recipe that translates a GNN to its corresponding GNTK. In Section 4, we give our theoretical analysis of GNTKs. In Section 5, we compare GNTK with state-of-the-art methods on graph classification datasets. We defer technical proofs to the supplementary material.

2 Preliminaries

We begin by summarizing the most common models for learning with graphs and, along the way, introducing our notation. Let $G = (V, E)$ be a graph with node features $h_v \in \mathbb{R}^d$ for each $v \in V$. We denote the neighborhood of node $v$ by $N(v)$. In this paper, we consider the graph classification task, where, given a set of graphs $\{G_1, \ldots, G_n\} \subseteq \mathcal{G}$ and their labels $\{y_1, \ldots, y_n\} \subseteq \mathcal{Y}$, our goal is to learn to predict the labels of unseen graphs.
Graph Neural Network. GNN is a powerful framework for graph representation learning. Modern GNNs generally follow a neighborhood aggregation scheme Xu et al. [2019a], Gilmer et al. [2017], Xu et al. 
[2018], where the representation $h_v^{(\ell)}$ of each node $v$ (in layer $\ell$) is recursively updated by aggregating and transforming the representations of its neighbors. After iterations of aggregation, the representation of an entire graph is then obtained through pooling, e.g., by summing the representations of all nodes in the graph. Many GNNs, with different aggregation and graph readout functions, have been proposed under the neighborhood aggregation framework Xu et al. [2019a,b, 2018], Scarselli et al. [2009], Li et al. [2016], Kearnes et al. [2016], Ying et al. [2018], Velickovic et al. [2018], Hamilton et al. [2017], Duvenaud et al. [2015], Kipf and Welling [2016], Defferrard et al. [2016], Santoro et al. [2018, 2017], Battaglia et al. [2016].
Next, we formalize the GNN framework. We refer to the neighbor aggregation process as a BLOCK operation, and to graph-level pooling as a READOUT operation.
BLOCK Operation. A BLOCK operation aggregates features over a neighborhood $N(u) \cup \{u\}$ via, e.g., summation, and transforms the aggregated features with a non-linearity, e.g., a multi-layer perceptron (MLP) or a fully-connected layer followed by ReLU. We denote the number of fully-connected layers in each BLOCK operation, i.e., the number of hidden layers of the MLP, by $R$. When $R = 1$, the BLOCK operation can be formulated as
$$\mathrm{BLOCK}^{(\ell)}(u) = \sqrt{\frac{c_\sigma}{m}} \cdot \sigma\Big( W_\ell \cdot c_u \sum_{v \in N(u) \cup \{u\}} h_v^{(\ell-1)} \Big).$$
Here, $W_\ell$ are learnable weights, initialized as Gaussian random variables, $\sigma$ is an activation function such as ReLU, and $m$ is the output dimension of $W_\ell$. We set the scaling factor $c_\sigma$ to 2, following the initialization scheme in He et al. [2015]. $c_u$ is a scaling factor for neighbor aggregation. Different GNNs often have different choices for $c_u$. 
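To make the $R = 1$ BLOCK operation concrete, here is a minimal NumPy sketch. This is our own illustration, not the authors' code: the dense adjacency representation and all names are assumptions.

```python
import numpy as np

def block(H, adj, W, c_u, c_sigma=2.0):
    """One BLOCK operation with R = 1 fully-connected layer (hypothetical sketch).

    H     : (n, d) matrix of node features h_v^{(l-1)}
    adj   : (n, n) 0/1 adjacency matrix with self-loops added, so that
            row u of adj selects the neighborhood N(u) union {u}
    W     : (m, d) weight matrix W_l, Gaussian at initialization
    c_u   : (n,) per-node scaling factors, e.g. all ones for sum aggregation
    """
    agg = c_u[:, None] * (adj @ H)   # c_u * sum of h_v over N(u) union {u}
    z = agg @ W.T                    # linear transformation by W_l
    return np.sqrt(c_sigma / W.shape[0]) * np.maximum(z, 0.0)  # sqrt(c_sigma/m) * ReLU
```

Stacking $L$ such operations with fresh weights per layer, and then pooling the final node features, gives a network of the form described in the text.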
In the Graph Convolutional Network (GCN) [Kipf and Welling, 2016], $c_u = \frac{1}{|N(u)| + 1}$, and in the Graph Isomorphism Network (GIN) [Xu et al., 2019a], $c_u = 1$, which correspond to averaging and summing over neighbor features, respectively.
When the number of fully-connected layers is $R = 2$, the BLOCK operation can be written as
$$\mathrm{BLOCK}^{(\ell)}(u) = \sqrt{\frac{c_\sigma}{m}} \, \sigma\Big( W_{\ell,2} \cdot \sqrt{\frac{c_\sigma}{m}} \, \sigma\Big( W_{\ell,1} \cdot c_u \sum_{v \in N(u) \cup \{u\}} h_v^{(\ell-1)} \Big) \Big),$$
where $W_{\ell,1}$ and $W_{\ell,2}$ are learnable weights. Notice that here we first aggregate features over the neighborhood $N(u) \cup \{u\}$ and then transform the aggregated features with an MLP with $R = 2$ hidden layers. BLOCK operations can be defined similarly for $R > 2$. Notice that the BLOCK operation defined above is also known as the graph (spatial) convolutional layer in the GNN literature.

READOUT Operation. To get the representation $h_G$ of an entire graph after $L$ steps of aggregation, we take the summation over all node features, i.e.,
$$h_G = \mathrm{READOUT}\big( \{ h_u^{(L)} : u \in V \} \big) = \sum_{u \in V} h_u^{(L)}.$$
There are more sophisticated READOUT operations than a simple summation Xu et al. [2018], Zhang et al. [2018a], Ying et al. [2018]. The Jumping Knowledge Network (JK-Net) Xu et al. [2018] considers graph structures of different granularity, and aggregates graph features across all layers as
$$h_G = \mathrm{READOUT_{JK}}\big( \{ h_u^{(\ell)} : u \in V, \ell \in [L] \} \big) = \sum_{u \in V} \big[ h_u^{(0)}; \ldots; h_u^{(L)} \big].$$

Building GNNs using BLOCK and READOUT. Most modern GNNs are constructed using the BLOCK operation and the READOUT operation Xu et al. [2019a]. We denote the number of BLOCK operations (aggregation steps) in a GNN by $L$. For each $\ell \in [L]$ and $u \in V$, we define $h_u^{(\ell)} = \mathrm{BLOCK}^{(\ell)}(u)$. 
The graph-level feature is then $h_G = \mathrm{READOUT}(\{ h_u^{(L)} : u \in V \})$ or $h_G = \mathrm{READOUT_{JK}}(\{ h_u^{(\ell)} : u \in V, \ell \in [L] \})$, depending on whether jumping knowledge (JK) is applied or not.

3 GNTK Formulas

In this section we present our general recipe, which translates a GNN architecture to its corresponding GNTK. We first provide some intuition for neural tangent kernels (NTKs). We refer readers to Jacot et al. [2018], Arora et al. [2019a] for more comprehensive descriptions.

3.1 Intuition of the Formulas

Consider a general neural network $f(\theta, x) \in \mathbb{R}$, where $\theta \in \mathbb{R}^m$ denotes all the parameters in the network and $x$ is the input. Given a training dataset $\{(x_i, y_i)\}_{i=1}^n$, consider training the neural network by minimizing the squared loss over the training data,
$$\ell(\theta) = \frac{1}{2} \sum_{i=1}^n \big( f(\theta, x_i) - y_i \big)^2.$$
Suppose we minimize the squared loss $\ell(\theta)$ by gradient descent with infinitesimally small learning rate, i.e., $\frac{d\theta(t)}{dt} = -\nabla \ell(\theta(t))$. Let $u(t) = (f(\theta(t), x_i))_{i=1}^n$ be the network outputs. Then $u(t)$ follows the evolution
$$\frac{du}{dt} = -H(t)(u(t) - y), \quad \text{where} \quad H(t)_{ij} = \Big\langle \frac{\partial f(\theta(t), x_i)}{\partial \theta}, \frac{\partial f(\theta(t), x_j)}{\partial \theta} \Big\rangle \quad \text{for } (i, j) \in [n] \times [n].$$
Recent advances in the optimization of neural networks have shown that, for sufficiently over-parameterized neural networks, the matrix $H(t)$ remains almost unchanged during the training process Arora et al. [2019b,a], Du et al. [2019, 2018], Jacot et al. [2018], in which case the training dynamics are identical to those of kernel regression. Moreover, under a random initialization of parameters, the random matrix $H(0)$ converges in probability to a certain deterministic kernel matrix, which is called the Neural Tangent Kernel (NTK) Jacot et al. 
[2018] and corresponds to infinitely wide neural networks. See Figure 4 in the supplementary material for an illustration.
Explicit formulas for NTKs of fully-connected neural networks have been given in Jacot et al. [2018]. Recently, explicit formulas for NTKs of convolutional neural networks were given in Arora et al. [2019a]. The goal of this section is to give an explicit formula for NTKs that correspond to the GNNs defined in Section 2. Our general strategy is inspired by Arora et al. [2019a]. Let $f(\theta, G) \in \mathbb{R}$ be the output of the corresponding GNN under parameters $\theta$ and input graph $G$. For two given graphs $G$ and $G'$, to calculate the corresponding GNTK value, we need to calculate the expected value of
$$\Big\langle \frac{\partial f(\theta, G)}{\partial \theta}, \frac{\partial f(\theta, G')}{\partial \theta} \Big\rangle$$
in the limit that $m \to \infty$ and $\theta$ are all Gaussian random variables, which can be viewed as a Gaussian process. For each layer in the GNN, we use $\Sigma$ to denote the covariance matrix of the outputs of that layer, and $\dot\Sigma$ to denote the covariance matrix that corresponds to the derivative of that layer. Due to the multi-layer structure of GNNs, these covariance matrices can be naturally calculated via dynamic programming.

3.2 Formulas for Calculating GNTKs

Given two graphs $G = (V, E)$ and $G' = (V', E')$ with $|V| = n$, $|V'| = n'$, and a GNN with $L$ BLOCK operations and $R$ fully-connected layers with ReLU activation in each BLOCK operation, we give the GNTK formula for the pairwise kernel value $\Theta(G, G') \in \mathbb{R}$ induced by this GNN.
We first define the covariance matrix between input features of the two input graphs $G, G'$, which we denote by $\Sigma^{(0)}(G, G') \in \mathbb{R}^{n \times n'}$. For two nodes $u \in V$ and $u' \in V'$, $[\Sigma^{(0)}(G, G')]_{uu'}$ is defined to be $h_u^\top h_{u'}$, where $h_u$ and $h_{u'}$ are the input features of $u \in V$ and $u' \in V'$.

BLOCK Operation. A BLOCK operation in GNTK calculates a covariance matrix $\Sigma^{(\ell)}_{(R)}(G, G') \in \mathbb{R}^{n \times n'}$ using $\Sigma^{(\ell-1)}_{(R)}(G, G') \in \mathbb{R}^{n \times n'}$, and calculates intermediate kernel values $\Theta^{(\ell)}_{(R)}(G, G') \in \mathbb{R}^{n \times n'}$, which will later be used to compute the final output.
More specifically, we first perform a neighborhood aggregation operation
$$\big[\Sigma^{(\ell)}_{(0)}(G, G')\big]_{uu'} = c_u c_{u'} \sum_{v \in N(u) \cup \{u\}} \ \sum_{v' \in N(u') \cup \{u'\}} \big[\Sigma^{(\ell-1)}_{(R)}(G, G')\big]_{vv'},$$
$$\big[\Theta^{(\ell)}_{(0)}(G, G')\big]_{uu'} = c_u c_{u'} \sum_{v \in N(u) \cup \{u\}} \ \sum_{v' \in N(u') \cup \{u'\}} \big[\Theta^{(\ell-1)}_{(R)}(G, G')\big]_{vv'}.$$
Here we define $\Sigma^{(0)}_{(R)}(G, G')$ and $\Theta^{(0)}_{(R)}(G, G')$ as $\Sigma^{(0)}(G, G')$, for notational convenience. Next we perform $R$ transformations that correspond to the $R$ fully-connected layers with ReLU activation. Here $\sigma(z) = \max\{0, z\}$ is the ReLU activation function. 
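The two Gaussian expectations that appear in the ReLU transformation step defined next (equations (1) and (2)) need not be estimated by sampling: for ReLU they have standard closed forms, the arc-cosine kernel identities, which is what makes the GNTK recursion efficiently computable. A hypothetical sketch (the function name and scalar interface are our own, not the authors' code):

```python
import numpy as np

def relu_expectations(s11, s12, s22, c_sigma=2.0):
    """Closed forms for c_sigma * E[relu(a) relu(b)] and c_sigma * E[relu'(a) relu'(b)],
    where (a, b) ~ N(0, A) with A = [[s11, s12], [s12, s22]]:
    s11 = [Sigma(G, G)]_{uu}, s12 = [Sigma(G, G')]_{uu'}, s22 = [Sigma(G', G')]_{u'u'}.
    These are the standard arc-cosine kernel identities (a sketch, not the paper's code)."""
    denom = np.sqrt(s11 * s22)
    cos_phi = np.clip(s12 / denom, -1.0, 1.0)   # clip guards against rounding error
    phi = np.arccos(cos_phi)
    e_sigma = denom * (np.sin(phi) + (np.pi - phi) * cos_phi) / (2 * np.pi)
    e_dot = (np.pi - phi) / (2 * np.pi)
    return c_sigma * e_sigma, c_sigma * e_dot
```

The kernel update for $\Theta$ then multiplies the previous kernel value by the second expectation and adds the first, following the recursion given below.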
We denote by $\dot\sigma(z) = \mathbb{1}[z \ge 0]$ the derivative of the ReLU activation function. For each $r \in [R]$, we define:
• For $u \in V, u' \in V'$,
$$\big[A^{(\ell)}_{(r)}(G, G')\big]_{uu'} = \begin{pmatrix} \big[\Sigma^{(\ell)}_{(r-1)}(G, G)\big]_{uu} & \big[\Sigma^{(\ell)}_{(r-1)}(G, G')\big]_{uu'} \\ \big[\Sigma^{(\ell)}_{(r-1)}(G', G)\big]_{u'u} & \big[\Sigma^{(\ell)}_{(r-1)}(G', G')\big]_{u'u'} \end{pmatrix} \in \mathbb{R}^{2 \times 2}.$$
• For $u \in V, u' \in V'$,
$$\big[\Sigma^{(\ell)}_{(r)}(G, G')\big]_{uu'} = c_\sigma \, \mathbb{E}_{(a,b) \sim \mathcal{N}(0, [A^{(\ell)}_{(r)}(G, G')]_{uu'})} \big[ \sigma(a) \sigma(b) \big], \quad (1)$$
$$\big[\dot\Sigma^{(\ell)}_{(r)}(G, G')\big]_{uu'} = c_\sigma \, \mathbb{E}_{(a,b) \sim \mathcal{N}(0, [A^{(\ell)}_{(r)}(G, G')]_{uu'})} \big[ \dot\sigma(a) \dot\sigma(b) \big]. \quad (2)$$
• For $u \in V, u' \in V'$,
$$\big[\Theta^{(\ell)}_{(r)}(G, G')\big]_{uu'} = \big[\Theta^{(\ell)}_{(r-1)}(G, G')\big]_{uu'} \big[\dot\Sigma^{(\ell)}_{(r)}(G, G')\big]_{uu'} + \big[\Sigma^{(\ell)}_{(r)}(G, G')\big]_{uu'}.$$
Note that in the above we have shown how to calculate $\Theta^{(\ell)}_{(R)}(G, G')$ for each $\ell \in \{0, 1, \ldots, L\}$. These intermediate outputs will be used to calculate the final output of the corresponding GNTK.

READOUT Operation. 
Given these intermediate outputs, we can now calculate the final output of the GNTK using the following formula:
$$\Theta(G, G') = \begin{cases} \sum_{u \in V, u' \in V'} \big[ \Theta^{(L)}_{(R)}(G, G') \big]_{uu'} & \text{without jumping knowledge}, \\[4pt] \sum_{u \in V, u' \in V'} \big[ \sum_{\ell=0}^{L} \Theta^{(\ell)}_{(R)}(G, G') \big]_{uu'} & \text{with jumping knowledge}. \end{cases}$$
To better illustrate our general recipe, in Figure 1 we give a concrete example in which we translate a GNN with $L = 2$ BLOCK operations, $R = 1$ fully-connected layer in each BLOCK operation, and jumping knowledge, to its corresponding GNTK.

Figure 1: Illustration of our recipe that translates a GNN to a GNTK. For a GNN with $L = 2$ BLOCK operations, $R = 1$ fully-connected layer in each BLOCK operation, and jumping knowledge, the corresponding GNTK is calculated as follows. For two graphs $G$ and $G'$, we first calculate $[\Sigma^{(0)}(G, G')]_{uu'} = h_u^\top h_{u'}$. We follow the kernel formulas in Section 3 to calculate $\Sigma^{(\ell)}_{(0)}, \Theta^{(\ell)}_{(0)}$ using $\Sigma^{(\ell-1)}_{(R)}, \Theta^{(\ell-1)}_{(R)}$ (Aggregation), and to calculate $\Sigma^{(\ell)}_{(r)}, \dot\Sigma^{(\ell)}_{(r)}, \Theta^{(\ell)}_{(r)}$ using $\Sigma^{(\ell)}_{(r-1)}, \Theta^{(\ell)}_{(r-1)}$ (Nonlinearity). The final output is $\Theta(G, G') = \sum_{u \in V, u' \in V'} \big[ \sum_{\ell=0}^{L} \Theta^{(\ell)}_{(R)}(G, G') \big]_{uu'}$.

4 Theoretical Analysis of GNTK

In this section, we analyze the generalization ability of a GNTK that corresponds to a simple GNN. We consider the standard supervised learning setup. We are given $n$ training data $\{(G_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the underlying distribution $\mathcal{D}$, where $G_i$ is the $i$-th input graph and $y_i$ is its label. Consider a GNN with a single BLOCK operation, followed by the READOUT operation (without jumping knowledge). Here we set $c_u = \big( \big\| \sum_{v \in N(u) \cup \{u\}} h_v \big\|_2 \big)^{-1}$. We use $\Theta \in \mathbb{R}^{n \times n}$ to denote the kernel matrix, where $\Theta_{ij} = \Theta(G_i, G_j)$. Here $\Theta(G, G')$ is the kernel function that corresponds to the simple GNN; see Section 3 for the formulas for calculating $\Theta(G, G')$. Throughout the discussion, we assume that the kernel matrix $\Theta \in \mathbb{R}^{n \times n}$ is invertible.
For a testing point $G_{\mathrm{te}}$, the prediction of kernel regression using the GNTK on this testing point is
$$f_{\mathrm{ker}}(G_{\mathrm{te}}) = \big[ \Theta(G_{\mathrm{te}}, G_1), \Theta(G_{\mathrm{te}}, G_2), \ldots, \Theta(G_{\mathrm{te}}, G_n) \big]^\top \Theta^{-1} y.$$
The following result is a standard result for kernel regression, proved using Rademacher complexity. For a proof, see Bartlett and Mendelson [2002].
Theorem 4.1 (Bartlett and Mendelson [2002]). Given $n$ training data $\{(G_i, y_i)\}_{i=1}^n$ drawn i.i.d. from the underlying distribution $\mathcal{D}$, consider any loss function $\ell : \mathbb{R} \times \mathbb{R} \to [0, 1]$ that is 1-Lipschitz in the first argument such that $\ell(y, y) = 0$. 
With probability at least $1 - \delta$, the population loss of the GNTK predictor can be upper bounded by
$$L_{\mathcal{D}}(f_{\mathrm{ker}}) = \mathbb{E}_{(G, y) \sim \mathcal{D}} \big[ \ell(f_{\mathrm{ker}}(G), y) \big] = O\left( \frac{\sqrt{y^\top \Theta^{-1} y \cdot \mathrm{tr}(\Theta)}}{n} + \sqrt{\frac{\log(1/\delta)}{n}} \right).$$
Note that this theorem presents a data-dependent generalization bound, which is related to the kernel matrix $\Theta \in \mathbb{R}^{n \times n}$ and the labels $\{y_i\}_{i=1}^n$. Using this theorem, if we can bound $y^\top \Theta^{-1} y$ and $\mathrm{tr}(\Theta)$, then we can obtain a concrete sample complexity bound. We instantiate this idea to study the class of graph labeling functions that can be efficiently learned by GNTKs.
The following two theorems guarantee that if the labels are generated as described in (3), then the GNTK that corresponds to the simple GNN described above can learn this function with a polynomial number of samples. We first give an upper bound on $y^\top \Theta^{-1} y$.
Theorem 4.2. For each $i \in [n]$, if the labels $\{y_i\}_{i=1}^n$ satisfy
$$y_i = \alpha_1 \sum_{u \in V} \bar h_u^\top \beta_1 + \sum_{l=1}^{\infty} \alpha_{2l} \sum_{u \in V} \big( \bar h_u^\top \beta_{2l} \big)^{2l}, \quad (3)$$
where $\bar h_u = c_u \sum_{v \in N(u) \cup \{u\}} h_v$, $\alpha_1, \alpha_2, \alpha_4, \ldots \in \mathbb{R}$, $\beta_1, \beta_2, \beta_4, \ldots \in \mathbb{R}^d$, and $G_i = (V, E)$, then we have
$$y^\top \Theta^{-1} y \le 2 |\alpha_1| \cdot \|\beta_1\|_2 + \sum_{l=1}^{\infty} \sqrt{2\pi} (2l - 1) |\alpha_{2l}| \cdot \|\beta_{2l}\|_2^{2l}.$$
The following theorem gives an upper bound on $\mathrm{tr}(\Theta)$.
Theorem 4.3. If for all graphs $G_i = (V_i, E_i)$ in the training set, $|V_i|$ is upper bounded by $V$, then $\mathrm{tr}(\Theta) \le O(n V^2)$. Here $n$ is the number of training samples.
Combining Theorem 4.2 and Theorem 4.3 with Theorem 4.1, we know that if
$$2 |\alpha_1| \cdot \|\beta_1\|_2 + \sum_{l=1}^{\infty} \sqrt{2\pi} (2l - 1) |\alpha_{2l}| \cdot \|\beta_{2l}\|_2^{2l}$$
is bounded, and $|V_i|$ is bounded for all graphs $G_i = (V_i, E_i)$ in the training set, then the GNTK that corresponds to the simple GNN described above can learn functions of the form in (3) with a polynomial number of samples. To our knowledge, this is the first sample complexity analysis in the GK and GNN literature.

5 Experiments

In this section, we demonstrate the effectiveness of GNTKs using experiments on graph classification tasks. As an ablation study, we investigate how the performance varies with the architecture of the corresponding GNN. Following common practice for evaluating graph classification models Yanardag and Vishwanathan [2015], we perform 10-fold cross validation and report the mean and standard deviation of validation accuracies. More details about the experiment setup can be found in Section B of the supplementary material.

Datasets. The benchmark datasets include four bioinformatics datasets, MUTAG, PTC, NCI1 and PROTEINS, and three social network datasets, COLLAB, IMDB-BINARY and IMDB-MULTI. For each graph, we transform the categorical input features to one-hot encoding representations. For datasets where the graphs have no node features, i.e., where only the graph structure matters, we use degrees as input node features.

5.1 Results

We compare GNTK with various state-of-the-art graph classification algorithms: (1) the WL subtree kernel Shervashidze et al. [2011]; (2) state-of-the-art deep learning architectures, including the Graph Convolutional Network (GCN) Kipf and Welling [2016], GraphSAGE Hamilton et al. [2017], the Graph Isomorphism Network (GIN) Xu et al. [2019a], PATCHY-SAN Niepert et al. [2016] and Deep Graph CNN (DGCNN) Zhang et al. [2018a]; (3) graph kernels based on random walks, i.e., Anonymous Walk Embeddings Ivanov and Burnaev [2018] and RetGK Zhang et al. [2018b]. 
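Once the pairwise GNTK values are computed, prediction reduces to standard kernel regression with the predictor $f_{\mathrm{ker}}$ from Section 4 (in practice one can equally plug the precomputed Gram matrix into a kernel SVM). A minimal sketch with made-up matrices, not the authors' experiment code:

```python
import numpy as np

def gntk_predict(K_train, K_test, y_train):
    """Kernel-regression prediction with a precomputed GNTK Gram matrix (sketch).

    K_train : (n, n) Gram matrix with Theta_ij = Theta(G_i, G_j) on training graphs
    K_test  : (m, n) kernel values Theta(G_te, G_i) between test and training graphs
    y_train : (n,) training labels

    Returns f_ker(G_te) = [Theta(G_te, G_1), ..., Theta(G_te, G_n)] Theta^{-1} y.
    """
    alpha = np.linalg.solve(K_train, y_train)  # Theta^{-1} y; Theta assumed invertible
    return K_test @ alpha
```

Sanity check of the formula: if the test kernel rows equal the training Gram matrix itself, the predictor reproduces the training labels exactly.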
For the deep learning methods and random walk graph kernels, we report the accuracies reported in the original papers. The experiment setup is deferred to Section B.
The graph classification results are shown in Table 1, with the best results highlighted in bold. Our proposed GNTKs are powerful and achieve state-of-the-art classification accuracy on most datasets. On four of them, we find GNTKs outperform all baseline methods. In particular, GNTKs achieve 83.6% accuracy on the COLLAB dataset and 67.9% accuracy on the PTC dataset, compared to the best of the baselines, 81.0% and 64.6% respectively. Notably, GNTKs give the best performance on all social network datasets. Moreover, in our experiments, we also observe that with the same architecture, GNTK is more computationally efficient than its GNN counterpart. On the IMDB-B dataset, running GIN with the default setup (official implementation of Xu et al. [2019a]) takes 19 minutes on a TITAN X GPU, while running GNTK only takes 2 minutes.

      Method      | COLLAB     | IMDB-B     | IMDB-M     | PTC        | NCI1       | MUTAG      | PROTEINS
      ------------|------------|------------|------------|------------|------------|------------|-----------
 GNN  GCN         | 79.0 ± 1.8 | 74.0 ± 3.4 | 51.9 ± 3.8 | 64.2 ± 4.3 | 80.2 ± 2.0 | 85.6 ± 5.8 | 76.0 ± 3.2
      GraphSAGE   | –          | 72.3 ± 5.3 | 50.9 ± 2.2 | 63.9 ± 7.7 | 77.7 ± 1.5 | 85.1 ± 7.6 | 75.9 ± 3.2
      PatchySAN   | 72.6 ± 2.2 | 71.0 ± 2.2 | 45.2 ± 2.8 | 60.0 ± 4.8 | 78.6 ± 1.9 | 92.6 ± 4.2 | 75.9 ± 2.8
      DGCNN       | 73.7       | 70.0       | 47.8       | 58.6       | 74.4       | 85.8       | 75.5
      GIN         | 80.2 ± 1.9 | 75.1 ± 5.1 | 52.3 ± 2.8 | 64.6 ± 7.0 | 82.7 ± 1.7 | 89.4 ± 5.6 | 76.2 ± 2.8
 GK   WL subtree  | 78.9 ± 1.9 | 73.8 ± 3.9 | 50.9 ± 3.8 | 59.9 ± 4.3 | 86.0 ± 1.8 | 90.4 ± 5.7 | 75.0 ± 3.1
      AWL         | 73.9 ± 1.9 | 74.5 ± 5.9 | 51.5 ± 3.6 | –          | –          | 87.9 ± 9.8 | –
      RetGK       | 81.0 ± 0.3 | 71.9 ± 1.0 | 47.7 ± 0.3 | 62.5 ± 1.6 | 84.5 ± 0.2 | 90.3 ± 1.1 | 75.8 ± 0.6
      GNTK        | 83.6 ± 1.0 | 76.9 ± 3.6 | 52.8 ± 4.6 | 67.9 ± 6.9 | 84.2 ± 1.5 | 90.0 ± 8.5 | 75.6 ± 4.2

Table 1: Classification results (in %) for graph classification datasets. GNN: graph neural network based methods. GK: graph kernel based methods. GNTK: fusion of GNN and GK.

Figure 2: Effects of the number of BLOCK operations and the scaling factor $c_u$ on the performance of GNTK. (a) IMDB-BINARY; (b) NCI1. Each panel plots testing accuracy against the number of BLOCK operations, and each dot represents the performance of a particular GNTK architecture. We divide different architectures into groups by the number of BLOCK operations, and color these GNTK architecture points by the scaling factor $c_u$. We observe the test accuracy is correlated with the dataset and the architecture.

5.2 Relation between GNTK Performance and the Corresponding GNN

We conduct an ablation study to investigate how the performance of GNTK varies as we change the architecture of the corresponding GNN. We select two representative datasets: a social network dataset, IMDB-BINARY, and a bioinformatics dataset, NCI1. For IMDB-BINARY, we vary the number of BLOCK operations in {2, 3, 4, 5, 6}. For NCI1, we vary the number of BLOCK operations in {8, 10, 12, 14, 16}. For both datasets, we vary the number of MLP layers in {1, 2, 3}.

Effects of the Number of BLOCK Operations and the Scaling Factor $c_u$. We investigate how the performance of GNTKs is correlated with the number of BLOCK operations and the scaling factor $c_u$. First, on the bioinformatics dataset (NCI1), we observe that GNTKs with more layers perform better. This is perhaps because, for molecules and bio graphs, more global structural information is helpful, as it provides important information about the chemical/bio entity. 
On such graphs, GNTKs are particularly effective because GNTKs can easily scale to many layers, whereas the number of layers in GNNs may be restricted by computing resources.
Moreover, the performance of GNTK is correlated with that of the corresponding GNN. For example, on social networks, GNTKs with sum aggregation ($c_u = 1$) work better than those with average aggregation ($c_u = \frac{1}{|N(u)| + 1}$). A similar pattern holds for GNNs, because sum aggregation learns more graph structure information than average aggregation Xu et al. [2019a]. This suggests GNTK can indeed inherit the properties and advantages of the corresponding GNN, while also gaining the benefits of graph kernels.

Figure 3: Effects of jumping knowledge and the number of MLP layers on the performance of GNTK. (a) IMDB-BINARY; (b) NCI1. Each dot represents the test performance of a GNTK architecture. We divide all GNTK architectures into different groups, according to whether jumping knowledge is applied, or the number of MLP layers.

Effects of Jumping Knowledge and the Number of MLP Layers. In the GNN literature, the jumping knowledge network (JK) is expected to improve performance Xu et al. [2018], Fey [2019]. In Figure 3, we observe that a similar trend holds for GNTK. The performance of GNTK is improved on both the NCI1 and IMDB-BINARY datasets when jumping knowledge is applied. Moreover, increasing the number of MLP layers can increase the performance by ~0.8%. These empirical findings further confirm that GNTKs can inherit the benefits of GNNs, since improvements on GNN architectures are reflected in improvements in GNTKs.
We conclude that GNTKs are attractive for graph representation learning because they can combine the advantages of both GNNs and GKs.

Acknowledgments

S. S. 
Du and B. Póczos acknowledge support from AFRL grant FA8750-17-2-0212 and DARPA D17AP0000. R. Salakhutdinov and R. Wang are supported in part by NSF IIS-1763562, Office of Naval Research grant N000141812861, and an Nvidia NVAIL award. K. Xu is supported by NSF CAREER award 1553284 and a Chevron-MIT Energy Fellowship. This work was performed while S. S. Du was a Ph.D. student at Carnegie Mellon University and K. Hou was visiting Carnegie Mellon University.

References

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019a.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh.
Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.

David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

Matthias Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. arXiv preprint arXiv:1904.04849, 2019.

Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer, 2003.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272, 2017.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Sergey Ivanov and Evgeniy Burnaev. Anonymous walk embeddings. In International Conference on Machine Learning, 2018.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.

Thomas N. Kipf and Max Welling.
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations, 2016.

Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.

Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.

Adam Santoro, Felix Hill, David Barrett, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pages 4477–4486, 2018.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.

Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

S. Vichy N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels.
Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5449–5458, 2018.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019a.

Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? arXiv preprint arXiv:1905.13211, 2019b.

Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, 2018.

Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018a.

Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai. RetGK: Graph kernels based on return probabilities of random walks.
In Advances in Neural Information Processing Systems, 2018b.