{"title": "Diffusion-Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1993, "page_last": 2001, "abstract": "We present diffusion-convolutional neural networks (DCNNs), a new model for graph-structured data.  Through the introduction of a diffusion-convolution operation, we show how diffusion-based representations can be learned from graph-structured data and used as an effective basis for node classification. DCNNs have several attractive qualities, including a latent representation for graphical data that is invariant under isomorphism, as well as polynomial-time prediction and learning that can be represented as tensor operations and efficiently implemented on a GPU.  Through several experiments with real structured datasets, we demonstrate that DCNNs are able to  outperform probabilistic relational models and kernel-on-graph methods at relational node classification tasks.", "full_text": "Diffusion-Convolutional Neural Networks\n\nJames Atwood and Don Towsley\n\nCollege of Information and Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA, 01003\n\n{jatwood|towsley}@cs.umass.edu\n\nAbstract\n\nWe present diffusion-convolutional neural networks (DCNNs), a new model for\ngraph-structured data. Through the introduction of a diffusion-convolution oper-\nation, we show how diffusion-based representations can be learned from graph-\nstructured data and used as an effective basis for node classi\ufb01cation. 
DCNNs have\nseveral attractive qualities, including a latent representation for graphical data that\nis invariant under isomorphism, as well as polynomial-time prediction and learning\nthat can be represented as tensor operations and efficiently implemented on a GPU.\nThrough several experiments with real structured datasets, we demonstrate that\nDCNNs are able to outperform probabilistic relational models and kernel-on-graph\nmethods at relational node classification tasks.\n\n1 Introduction\n\nWorking with structured data is challenging. On one hand, finding the right way to express and\nexploit structure in data can lead to improvements in predictive performance; on the other, finding\nsuch a representation may be difficult, and adding structure to a model can dramatically increase the\ncomplexity of prediction.\nThe goal of this work is to design a flexible model for a general class of structured data that offers\nimprovements in predictive performance while avoiding an increase in complexity. To accomplish\nthis, we extend convolutional neural networks (CNNs) to general graph-structured data by introducing\na \u2018diffusion-convolution\u2019 operation. Briefly, rather than scanning a \u2018square\u2019 of parameters across a\ngrid-structured input like the standard convolution operation, the diffusion-convolution operation\nbuilds a latent representation by scanning a diffusion process across each node in a graph-structured\ninput.\nThis model is motivated by the idea that a representation that encapsulates graph diffusion can provide\na better basis for prediction than a graph itself. 
Graph diffusion can be represented as a matrix power\nseries, providing a straightforward mechanism for including contextual information about entities\nthat can be computed in polynomial time and efficiently implemented on a GPU.\nIn this paper, we present diffusion-convolutional neural networks (DCNNs) and explore their\nperformance on various classification tasks on graphical data. Many techniques include structural\ninformation in classification tasks, such as probabilistic relational models and kernel methods;\nDCNNs offer a complementary approach that provides a significant improvement in predictive\nperformance at node classification tasks.\nAs a model class, DCNNs offer several advantages:\n\n\u2022 Accuracy: In our experiments, DCNNs significantly outperform alternative methods for\nnode classification tasks and offer comparable performance to baseline methods for graph\nclassification tasks.\n\u2022 Flexibility: DCNNs provide a flexible representation of graphical data that encodes node\nfeatures, edge features, and purely structural information with little preprocessing. DCNNs\ncan be used for a variety of classification tasks with graphical data, including node\nclassification and whole-graph classification.\n\u2022 Speed: Prediction from a DCNN can be expressed as a series of polynomial-time tensor\noperations, allowing the model to be implemented efficiently on a GPU using existing\nlibraries.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: DCNN model definition for (a) node classification and (b) graph classification tasks.\n\nThe remainder of this paper is organized as follows. In Section 2, we present a formal definition of\nthe model, including descriptions of prediction and learning procedures. 
This is followed by several\nexperiments in Section 3 that explore the performance of DCNNs at node and graph classification\ntasks. We briefly describe the limitations of the model in Section 4; then, in Section 5, we present\nrelated work and discuss the relationship between DCNNs and other methods. Finally, conclusions\nand future work are presented in Section 6.\n\n2 Model\n\nConsider a situation where we have a set of T graphs G = {G_t | t \u2208 1...T}. Each graph G_t = (V_t, E_t)\nis composed of vertices V_t and edges E_t. The vertices are collectively described by an N_t \u00d7 F design\nmatrix X_t of features\u00b9, where N_t is the number of nodes in G_t, and the edges E_t are encoded by an\nN_t \u00d7 N_t adjacency matrix A_t, from which we can compute a degree-normalized transition matrix P_t\nthat gives the probability of jumping from node i to node j in one step. No constraints are placed on\nthe form of G_t; the graph can be weighted or unweighted, directed or undirected. Either the nodes or\ngraphs have labels Y associated with them, with the dimensionality of Y differing in each case.\nWe are interested in learning to predict Y; that is, to predict a label for each of the nodes in each\ngraph or a label for each graph itself. In each case, we have access to some labeled entities (be they\nnodes or graphs), and our task is to predict the values of the remaining unlabeled entities.\nThis setting can represent several well-studied machine learning tasks. If T = 1 (i.e. there is only\none input graph) and the labels Y are associated with the nodes, this reduces to the problem of\nsemisupervised classification; if there are no edges present in the input graph, this reduces further to\nstandard supervised classification. If T > 1 and the labels Y are associated with each graph, then\nthis represents the problem of supervised graph classification.\nDCNNs are designed to perform any task that can be represented within this formulation. 
A DCNN\ntakes G as input and returns either a hard prediction for Y or a conditional distribution P(Y|X). Each\nentity of interest (be it a node or a graph) is transformed to a diffusion-convolutional representation,\nwhich is an H \u00d7 F real matrix defined by H hops of graph diffusion over F features, and it is defined\nby an H \u00d7 F real-valued weight tensor W^c and a nonlinear differentiable function f that computes\nthe activations. So, for node classification tasks, the diffusion-convolutional representation of graph t,\nZ_t, will be an N_t \u00d7 H \u00d7 F tensor, as illustrated in Figure 1a; for graph classification tasks, Z_t will be\nan H \u00d7 F matrix, as illustrated in Figure 1b.\n\n\u00b9 Without loss of generality, we assume that the features are real-valued.\n\nThe model is built on the idea of a diffusion kernel, which can be thought of as a measure of the level\nof connectivity between any two nodes in a graph when considering all paths between them, with\nlonger paths being discounted more than shorter paths. Diffusion kernels provide an effective basis\nfor node classification tasks [1].\nThe term \u2018diffusion-convolution\u2019 is meant to evoke the ideas of feature learning, parameter tying, and\ninvariance that are characteristic of convolutional neural networks. The core operation of a DCNN is a\nmapping from nodes and their features to the results of a diffusion process that begins at that node. In\ncontrast with standard CNNs, DCNN parameters are tied according to diffusion search depth rather than\ntheir position in a grid. The diffusion-convolutional representation is invariant with respect to node\nindex rather than position; in other words, the diffusion-convolutional activations of two isomorphic\ninput graphs will be the same\u00b2. 
Unlike standard CNNs, DCNNs have no pooling operation.\n\nNode Classification: Consider a node classification task where a label Y is predicted for each input\nnode in a graph. Let P_t^* be an N_t \u00d7 H \u00d7 N_t tensor containing the power series of P_t, defined as\nfollows:\n\n$$P^{*}_{tijk} = P^{j}_{tik} \\qquad (1)$$\n\nThe diffusion-convolutional activation Z_{tijk} for node i, hop j, and feature k of graph t is given by\n\n$$Z_{tijk} = f\\left( W^{c}_{jk} \\cdot \\sum_{l=1}^{N_t} P^{*}_{tijl} X_{tlk} \\right) \\qquad (2)$$\n\nThe activations can be expressed more concisely using tensor notation as\n\n$$Z_t = f\\left( W^{c} \\odot P^{*}_{t} X_t \\right) \\qquad (3)$$\n\nwhere the $\\odot$ operator represents element-wise multiplication; see Figure 1a. The model only\nentails O(H \u00d7 F) parameters, making the size of the latent diffusion-convolutional representation\nindependent of the size of the input.\nThe model is completed by a dense layer that connects Z to Y. A hard prediction for Y, denoted $\\hat{Y}$,\ncan be obtained by taking the maximum activation and a conditional probability distribution P(Y|X)\ncan be found by applying the softmax function:\n\n$$\\hat{Y} = \\arg\\max\\left( f\\left( W^{d} \\odot Z \\right) \\right) \\qquad (4)$$\n\n$$P(Y|X) = \\mathrm{softmax}\\left( f\\left( W^{d} \\odot Z \\right) \\right) \\qquad (5)$$\n\nThis keeps the same form in the following extensions.\n\nGraph Classification: DCNNs can be extended to graph classification by taking the mean activation\nover the nodes\n\n$$Z_t = f\\left( W^{c} \\odot 1_{N_t}^{T} P^{*}_{t} X_t / N_t \\right) \\qquad (6)$$\n\nwhere 1_{N_t} is an N_t \u00d7 1 vector of ones, as illustrated in Figure 1b.\n\nPurely Structural DCNNs: DCNNs can be applied to input graphs with no features by associating\na \u2018bias feature\u2019 with value 1.0 with each node. 
Richer structure can be encoded by adding\nstructural node features such as Pagerank or clustering coefficient, although this does introduce some\nhand-engineering and pre-processing.\n\n\u00b2 A proof is given in the appendix.\n\nFigure 2: Learning curves (2a - 2b) and effect of search breadth (2c) for the Cora and Pubmed\ndatasets. Panels: (a) Cora Learning Curve, (b) Pubmed Learning Curve, (c) Search Breadth.\n\nLearning: DCNNs are learned via stochastic minibatch gradient descent on backpropagated error.\nAt each epoch, node indices are randomly grouped into several batches. The error of each batch is\ncomputed by taking slices of the graph definition power series and propagating the input forward to\npredict the output, then setting the weights by gradient descent on the back-propagated error. We also\nmake use of windowed early stopping; training is ceased if the validation error of a given epoch is\ngreater than the average of the last few epochs.\n\n3 Experiments\n\nIn this section we present several experiments to investigate how well DCNNs perform at node\nand graph classification tasks. In each case we compare DCNNs to other well-known and effective\napproaches to the task.\nIn each of the following experiments, we use the AdaGrad algorithm [2] for gradient descent with a\nlearning rate of 0.05. All weights are initialized by sampling from a normal distribution with mean\nzero and variance 0.01. We choose the hyperbolic tangent for the nonlinear differentiable function\nf and use the multiclass hinge loss between the model predictions and ground truth as the training\nobjective. The model was implemented in Python using Lasagne and Theano [3].\n\n3.1 Node Classification\n\nWe ran several experiments to investigate how well DCNNs classify nodes within a single graph. 
The\ngraphs were constructed from the Cora and Pubmed datasets, which each consist of scientific papers\n(nodes), citations between papers (edges), and subjects (labels).\n\nModel | Cora Accuracy | Cora F (micro) | Cora F (macro) | Pubmed Accuracy | Pubmed F (micro) | Pubmed F (macro)\nl1logistic | 0.7087 | 0.7087 | 0.6829 | 0.8718 | 0.8718 | 0.8698\nl2logistic | 0.7292 | 0.7292 | 0.7013 | 0.8631 | 0.8631 | 0.8614\nKED | 0.8044 | 0.8044 | 0.7928 | 0.8125 | 0.8125 | 0.7978\nKLED | 0.8229 | 0.8229 | 0.8117 | 0.8228 | 0.8228 | 0.8086\nCRF-LBP | 0.8449 | \u2013 | 0.8248 | \u2013 | \u2013 | \u2013\n2-hop DCNN | 0.8677 | 0.8677 | 0.8584 | 0.8976 | 0.8976 | 0.8943\n\nTable 1: A comparison of the performance between baseline \u21131 and \u21132-regularized logistic regression\nmodels, exponential diffusion and Laplacian exponential diffusion kernel models, loopy belief\npropagation (LBP) on a partially-observed conditional random field (CRF), and a two-hop DCNN on\nthe Cora and Pubmed datasets. The DCNN offers the best performance according to each measure,\nand the gain is statistically significant in each case. The CRF-LBP result is quoted from [4], which\nfollows the same experimental protocol.\n\nProtocol: In each experiment, the set G consists of a single graph G. During each trial, the input\ngraph\u2019s nodes are randomly partitioned into training, validation, and test sets, with each set having\nthe same number of nodes. During training, all node features X, all edges E, and the labels Y of\nthe training and validation sets are visible to the model. 
We report classification accuracy as well\nas micro- and macro-averaged F1; each measure is reported as a mean and confidence interval\ncomputed from several trials.\nWe also provide learning curves for the Cora and Pubmed datasets. In this experiment, the validation\nand test sets each contain 10% of the nodes, and the amount of training data is varied between 10%\nand 100% of the remaining nodes.\n\nBaseline Methods: \u2018l1logistic\u2019 and \u2018l2logistic\u2019 indicate \u21131 and \u21132-regularized logistic regression,\nrespectively. The inputs to the logistic regression models are the node features alone (i.e. the graph\nstructure is not used) and the regularization parameter is tuned using the validation set. \u2018KED\u2019 and\n\u2018KLED\u2019 denote the exponential diffusion and Laplacian exponential diffusion kernels-on-graphs,\nrespectively, which have previously been shown to perform well on the Cora dataset [1]. These kernel\nmodels take the graph structure as input (i.e. node features are not used) and the validation set is\nused to determine the kernel hyperparameters. \u2018CRF-LBP\u2019 indicates a partially-observed conditional\nrandom field that uses loopy belief propagation for inference. Results for this model are quoted from\nprior work [4] that uses the same dataset and experimental protocol.\n\nNode Classification Data: The Cora corpus [5] consists of 2,708 machine learning papers and the\n5,429 citation edges that they share. Each paper is assigned a label drawn from seven possible machine\nlearning subjects, and each paper is represented by a bit vector where each feature corresponds to\nthe presence or absence of a term drawn from a dictionary with 1,433 unique entries. We treat the\ncitation network as an undirected graph.\nThe Pubmed corpus [5] consists of 19,717 scientific papers from the Pubmed database on the subject\nof diabetes. Each paper is assigned to one of three classes. 
The citation network that joins the papers\nconsists of 44,338 links, and each paper is represented by a Term Frequency Inverse Document\nFrequency (TFIDF) vector drawn from a dictionary with 500 terms. As with the Cora corpus, we\nconstruct an adjacency-based DCNN that treats the citation network as an undirected graph.\n\nResults Discussion: Table 1 compares the performance of a two-hop DCNN with several baselines.\nThe DCNN offers the best performance according to each measure (classification\naccuracy and micro- and macro-averaged F1), and the gain is statistically significant in each case\nwith negligible p-values. For all models except the CRF, we assessed this via a one-tailed two-sample\nWelch\u2019s t-test. The CRF result is quoted from prior work, so we used a one-tailed one-sample test.\nFigures 2a and 2b show the learning curves for the Cora and Pubmed datasets. The DCNN\ngenerally outperforms the baseline methods on the Cora dataset regardless of the amount of training\ndata available, although the Laplacian exponential diffusion kernel does offer comparable performance\nwhen the entire training set is available. Note that the kernel methods were prohibitively slow to run\non the Pubmed dataset, so we do not include them in the learning curve.\nFinally, the impact of diffusion breadth on performance is shown in Figure 2c. Most of the performance\nis gained as the diffusion breadth grows from zero to three hops; performance then levels out as the diffusion\nprocess converges.\n\n3.2 Graph Classification\n\nWe also ran experiments to investigate how well DCNNs can learn to label whole graphs.\n\nProtocol: At the beginning of each trial, input graphs are randomly assigned to training, validation,\nor test, with each set having the same number of graphs. 
During the learning phase, the training and\nvalidation graphs, their node features, and their labels are made visible; the training set is used to\ndetermine the parameters and the validation set to determine hyperparameters. At test time, the test\ngraphs and features are made visible and the graph labels are predicted and compared with ground\ntruth. Table 2 reports the mean accuracy, micro-averaged F1, and macro-averaged F1 over several\ntrials.\nWe also provide learning curves for the MUTAG (Figure 3a) and ENZYMES (Figure 3b) datasets.\nIn these experiments, the validation and test sets each contain 10% of the graphs, and we report the\nperformance of each model as a function of the proportion of the remaining graphs that are made\navailable for training.\n\nFigure 3: Learning curves for the MUTAG (3a) and ENZYMES (3b) datasets as well as the effect of\nsearch breadth (3c). Panels: (a) MUTAG Learning Curve, (b) ENZYMES Learning Curve, (c) Search Breadth.\n\nBaseline Methods: As a simple baseline, we apply linear classifiers to the average feature vector of\neach graph; \u2018l1logistic\u2019 and \u2018l2logistic\u2019 indicate \u21131 and \u21132-regularized logistic regression applied as\ndescribed. \u2018deepwl\u2019 indicates the Weisfeiler-Lehman (WL) subtree deep graph kernel. Deep graph\nkernels decompose a graph into substructures, treat those substructures as words in a sentence, and fit\na word-embedding model to obtain a vectorization [6].\n\nGraph Classification Data: We apply DCNNs to a standard set of graph classification datasets\nthat consists of NCI1, NCI109, MUTAG, PTC, and ENZYMES. The NCI1 and NCI109 [7] datasets\nconsist of 4100 and 4127 graphs that represent chemical compounds. Each graph is labeled with\nwhether it has the ability to suppress or inhibit the growth of a panel of human tumor cell lines,\nand each node is assigned one of 37 (for NCI1) or 38 (for NCI109) possible labels. 
MUTAG [8]\ncontains 188 nitro compounds that are labeled as either aromatic or heteroaromatic with seven node\nfeatures. PTC [9] contains 344 compounds labeled with whether they are carcinogenic in rats with 19\nnode features. Finally, ENZYMES [10] is a balanced dataset containing 600 proteins with three node\nfeatures.\n\nResults Discussion: In contrast with the node classification experiments, there is no clear best\nmodel choice across the datasets or evaluation measures. In fact, according to Table 2, the only clear\nchoice is the \u2018deepwl\u2019 graph kernel model on the ENZYMES dataset, which significantly outperforms\nthe other methods in terms of accuracy and micro- and macro-averaged F measure. Furthermore,\nas shown in Figure 3c, there is no clear benefit to broadening the search breadth H. These results\nsuggest that, while diffusion processes are an effective representation for nodes, they do a poor job of\nsummarizing entire graphs. It may be possible to improve these results by finding a more effective\nway to aggregate the node operations than a simple mean, but we leave this as future work.\n\nDataset | Model | Accuracy | F (micro) | F (macro)\nNCI1 | l1logistic | 0.5728 | 0.5728 | 0.5711\nNCI1 | l2logistic | 0.5688 | 0.5688 | 0.5641\nNCI1 | deepwl | 0.6215 | 0.6215 | 0.5821\nNCI1 | 2-hop DCNN | 0.6250 | 0.5807 | 0.5807\nNCI1 | 5-hop DCNN | 0.6261 | 0.5898 | 0.5898\nNCI109 | l1logistic | 0.5555 | 0.5555 | 0.5411\nNCI109 | l2logistic | 0.5586 | 0.5568 | 0.5402\nNCI109 | deepwl | 0.5801 | 0.5801 | 0.5178\nNCI109 | 2-hop DCNN | 0.6275 | 0.5884 | 0.5884\nNCI109 | 5-hop DCNN | 0.6286 | 0.5950 | 0.5899\nMUTAG | l1logistic | 0.7190 | 0.7190 | 0.6405\nMUTAG | l2logistic | 0.7016 | 0.7016 | 0.5795\nMUTAG | deepwl | 0.6563 | 0.6563 | 0.5942\nMUTAG | 2-hop DCNN | 0.6635 | 0.7975 | 0.79747\nMUTAG | 5-hop DCNN | 0.6698 | 0.8013 | 0.8013\nPTC | l1logistic | 0.5470 | 0.5470 | 0.4272\nPTC | l2logistic | 0.5565 | 0.5565 | 0.4460\nPTC | deepwl | 0.5113 | 0.5113 | 0.4444\nPTC | 2-hop DCNN | 0.5660 | 0.0500 | 0.0531\nPTC | 5-hop DCNN | 0.5530 | 0.0 | 0.0526\nENZYMES | l1logistic | 0.1640 | 0.1640 | 0.0904\nENZYMES | l2logistic | 0.2030 | 0.2030 | 0.1110\nENZYMES | deepwl | 0.2155 | 0.2155 | 0.1431\nENZYMES | 2-hop DCNN | 0.1590 | 0.1590 | 0.0809\nENZYMES | 5-hop DCNN | 0.1810 | 0.1810 | 0.0991\n\nTable 2: A comparison of the performance between baseline methods and two and five-hop DCNNs\non several graph classification datasets.\n\n4 Limitations\n\nScalability: DCNNs are realized as a series of operations on dense tensors. Storing the largest tensor\n(P^*, the transition matrix power series) requires O(N_t^2 H) memory, which can lead to out-of-memory\nerrors on the GPU for very large graphs in practice. As such, DCNNs can be readily applied to graphs\nof tens to hundreds of thousands of nodes, but not to graphs with millions to billions of nodes.\n\nLocality: The model is designed to capture local behavior in graph-structured data. As a consequence\nof constructing the latent representation from diffusion processes that begin at each node,\nwe may fail to encode useful long-range spatial dependencies between individual nodes or other\nnon-local graph behavior.\n\n5 Related Work\n\nIn this section we describe existing approaches to the problems of semi-supervised learning, graph\nclassification, and edge classification, and discuss their relationship to DCNNs.\n\nOther Graph-Based Neural Network Models: Other researchers have investigated how CNNs can\nbe extended from grid-structured to more general graph-structured data. [11] propose a spatial method\nwith ties to hierarchical clustering, where the layers of the network are defined via a hierarchical\npartitioning of the node set. 
In the same paper, the authors propose a spectral method that extends\nthe notion of convolution to graph spectra. Later, [12] applied these techniques to data where\na graph is not immediately present but must be inferred. DCNNs, which fall within the spatial\ncategory, are distinct from this work because their parameterization makes them transferable; a\nDCNN learned on one graph can be applied to another. A related branch of work has focused on\nextending convolutional neural networks to domains where the structure of the graph itself is of direct\ninterest [13, 14, 15]. For example, [15] construct a deep convolutional model that learns real-valued\nfingerprint representations of chemical compounds.\n\nProbabilistic Relational Models: DCNNs also share strong ties to probabilistic relational models\n(PRMs), a family of graphical models that are capable of representing distributions over relational\ndata [16]. In contrast to PRMs, DCNNs are deterministic, which allows them to avoid the exponential\nblowup in learning and inference that hampers PRMs.\nOur results suggest that DCNNs outperform partially-observed conditional random fields, the\nstate-of-the-art probabilistic relational model for semi-supervised learning. Furthermore, DCNNs\noffer this performance at considerably lower computational cost. Learning the parameters of both\nDCNNs and partially-observed CRFs involves numerically minimizing a nonconvex objective: the\nbackpropagated error in the case of DCNNs and the negative marginal log-likelihood for CRFs.\nIn practice, the marginal log-likelihood of a partially-observed CRF is computed using a\ncontrast-of-partition-functions approach that requires running loopy belief propagation twice: once on the\nentire graph and once with the observed labels fixed [17]. This algorithm, and thus each step in\nthe numerical optimization, has exponential time complexity O(E_t N_t^{C_t}) where C_t is the size of\nthe maximal clique in G_t [18]. 
In contrast, the learning subroutine for a DCNN requires only one\nforward and backward pass for each instance in the training data. The complexity is dominated by\nthe matrix multiplication between the graph definition matrix A and the design matrix X, giving an\noverall polynomial complexity of O(N_t^2 F).\n\nKernel Methods: Kernel methods define similarity measures either between nodes (so-called\nkernels on graphs) [1] or between graphs (graph kernels) and these similarities can serve as a\nbasis for prediction via the kernel trick. The performance of graph kernels can be improved by\ndecomposing a graph into substructures, treating those substructures as words in a sentence, and\nfitting a word-embedding model to obtain a vectorization [6].\nDCNNs share ties with the exponential diffusion family of kernels on graphs. The exponential\ndiffusion graph kernel K_ED is a sum of a matrix power series:\n\n$$K_{ED} = \\sum_{j=0}^{\\infty} \\frac{\\alpha^{j} A^{j}}{j!} = \\exp(\\alpha A) \\qquad (7)$$\n\nThe diffusion-convolution activation given in (3) is also constructed from a power series. However,\nthe representations have several important differences. First, the weights in (3) are learned via\nbackpropagation, whereas the kernel representation is not learned from data. Second, the\ndiffusion-convolutional representation is built from both node features and the graph structure, whereas the\nexponential diffusion kernel is built from the graph structure alone. Finally, the representations have\ndifferent dimensions: K_ED is an N_t \u00d7 N_t kernel matrix, whereas Z_t is an N_t \u00d7 H \u00d7 F tensor that\ndoes not conform to the definition of a kernel.\n\n6 Conclusion and Future Work\n\nBy learning a representation that encapsulates the results of graph diffusion, diffusion-convolutional\nneural networks offer performance improvements over probabilistic relational models and kernel\nmethods at node classification tasks. 
We intend to investigate methods for a) improving DCNN\nperformance at graph classification tasks and b) making the model scalable in future work.\n\n7 Appendix: Representation Invariance for Isomorphic Graphs\n\nIf two graphs G_1 and G_2 are isomorphic, then their diffusion-convolutional activations are the same.\nProof by contradiction: assume that G_1 and G_2 are isomorphic and that their diffusion-convolutional\nactivations are different. The diffusion-convolutional activations can be written as\n\n$$Z_{1jk} = f\\left( W^{c}_{jk} \\odot \\sum_{v \\in V_1} \\sum_{v' \\in V_1} P^{*}_{1vjv'} X_{1v'k} / N_1 \\right)$$\n\n$$Z_{2jk} = f\\left( W^{c}_{jk} \\odot \\sum_{v \\in V_2} \\sum_{v' \\in V_2} P^{*}_{2vjv'} X_{2v'k} / N_2 \\right)$$\n\nNote that\n\n$$V_1 = V_2 = V$$\n\n$$X_{1vk} = X_{2vk} = X_{vk} \\quad \\forall v \\in V, k \\in [1, F]$$\n\n$$P^{*}_{1vjv'} = P^{*}_{2vjv'} = P^{*}_{vjv'} \\quad \\forall v, v' \\in V, j \\in [0, H]$$\n\n$$N_1 = N_2 = N$$\n\nby isomorphism, allowing us to rewrite the activations as\n\n$$Z_{1jk} = f\\left( W^{c}_{jk} \\odot \\sum_{v \\in V} \\sum_{v' \\in V} P^{*}_{vjv'} X_{v'k} / N \\right)$$\n\n$$Z_{2jk} = f\\left( W^{c}_{jk} \\odot \\sum_{v \\in V} \\sum_{v' \\in V} P^{*}_{vjv'} X_{v'k} / N \\right)$$\n\nThis implies that Z_1 = Z_2, which presents a contradiction and completes the proof.\n\nAcknowledgments\n\nWe would like to thank Bruno Ribeiro, Pinar Yanardag, and David Belanger for their feedback on\ndrafts of this paper. This work was supported in part by Army Research Office Contract W911NF-12-1-0385\nand ARL Cooperative Agreement W911NF-09-2-0053. This work was also supported by\nNVIDIA through the donation of equipment used to perform experiments.\n\nReferences\n\n[1] Fran\u00e7ois Fouss, Kevin Francoisse, Luh Yen, Alain Pirotte, and Marco Saerens. 
An experimen-\ntal investigation of kernels on graphs for collaborative recommendation and semisupervised\nclassi\ufb01cation. Neural Networks, 31:53\u201372, July 2012.\n\n8\n\n\f[2] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\n\nand stochastic optimization. The Journal of Machine Learning Research, 2011.\n\n[3] James Bergstra, Olivier Breuleux, Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume\nDesjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU\nmath expression compiler. In Proceedings of the Python for Scienti\ufb01c Computing Conference\n(SciPy), 2010.\n\n[4] P Sen and L Getoor. Link-based classi\ufb01cation. Technical Report, 2007.\n[5] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina\n\nEliassi-Rad. Collective Classi\ufb01cation in Network Data. AI Magazine, 2008.\n\n[6] Pinar Yanardag and S V N Vishwanathan. Deep Graph Kernels. In the 21th ACM SIGKDD\nInternational Conference, pages 1365\u20131374, New York, New York, USA, 2015. ACM Press.\n[7] Nikil Wale, Ian A Watson, and George Karypis. Comparison of descriptor spaces for chemical\ncompound retrieval and classi\ufb01cation. Knowledge and Information Systems, 14(3):347\u2013375,\nAugust 2007.\n\n[8] Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and\nCorwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic\nnitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of\nmedicinal chemistry, 34(2):786\u2013797, 1991.\n\n[9] Hannu Toivonen, Ashwin Srinivasan, Ross D King, Stefan Kramer, and Christoph Helma. Statis-\ntical evaluation of the predictive toxicology challenge 2000\u20132001. Bioinformatics, 19(10):1183\u2013\n1193, 2003.\n\n[10] Karsten M Borgwardt, Cheng Soon Ong, Stefan Sch\u00f6nauer, SVN Vishwanathan, Alex J Smola,\nand Hans-Peter Kriegel. 
Protein function prediction via graph kernels. Bioinformatics, 21(suppl\n1):i47\u2013i56, 2005.\n\n[11] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally\n\nconnected networks on graphs. arXiv.org, 2014.\n\n[12] M Henaff, J Bruna, and Y LeCun. Deep Convolutional Networks on Graph-Structured Data.\n\narXiv.org, 2015.\n\n[13] F Scarselli, M Gori, Ah Chung Tsoi, M Hagenbuchner, and G Monfardini. The Graph Neural\n\nNetwork Model. IEEE Transactions on Neural Networks, 2009.\n\n[14] A Micheli. Neural Network for Graphs: A Contextual Constructive Approach. IEEE Transac-\n\ntions on Neural Networks, 2009.\n\n[15] David K Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael G\u00f3mez-Bombarelli,\nTimothy Hirzel, Al\u00e1n Aspuru-Guzik, and Ryan P Adams. Convolutional Networks on Graphs\nfor Learning Molecular Fingerprints. NIPS, 2015.\n\n[16] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques.\n\nThe MIT Press, 2009.\n\n[17] Jakob Verbeek and William Triggs. Scene segmentation with crfs learned from partially labeled\n\nimages. NIPS, 2007.\n\n[18] Trevor Cohn. Ef\ufb01cient Inference in Large Conditional Random Fields. ECML, 2006.\n\n9\n\n\f", "award": [], "sourceid": 1073, "authors": [{"given_name": "James", "family_name": "Atwood", "institution": "UMass Amherst"}, {"given_name": "Don", "family_name": "Towsley", "institution": "UMass Amherst"}]}