{"title": "Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 3844, "page_last": 3852, "abstract": "In this work, we are interested in generalizing convolutional neural networks (CNNs) from low-dimensional regular grids, where image, video and speech are represented, to high-dimensional irregular domains, such as social networks, brain connectomes or words\u2019 embedding, represented by graphs. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being universal to any graph structure. Experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.", "full_text": "Convolutional Neural Networks on Graphs\n\nwith Fast Localized Spectral Filtering\n\nMicha\u00ebl Defferrard\n\nXavier Bresson\n\nPierre Vandergheynst\n\n{michael.defferrard,xavier.bresson,pierre.vandergheynst}@epfl.ch\n\nEPFL, Lausanne, Switzerland\n\nAbstract\n\nIn this work, we are interested in generalizing convolutional neural networks\n(CNNs) from low-dimensional regular grids, where image, video and speech are\nrepresented, to high-dimensional irregular domains, such as social networks, brain\nconnectomes or words\u2019 embedding, represented by graphs. We present a formu-\nlation of CNNs in the context of spectral graph theory, which provides the nec-\nessary mathematical background and ef\ufb01cient numerical schemes to design fast\nlocalized convolutional \ufb01lters on graphs. 
Importantly, the proposed technique of-\nfers the same linear computational complexity and constant learning complexity\nas classical CNNs, while being universal to any graph structure. Experiments on\nMNIST and 20NEWS demonstrate the ability of this novel deep learning system\nto learn local, stationary, and compositional features on graphs.\n\nIntroduction\n\n1\nConvolutional neural networks [19] offer an ef\ufb01cient architecture to extract highly meaningful sta-\ntistical patterns in large-scale and high-dimensional datasets. The ability of CNNs to learn local\nstationary structures and compose them to form multi-scale hierarchical patterns has led to break-\nthroughs in image, video, and sound recognition tasks [18]. Precisely, CNNs extract the local sta-\ntionarity property of the input data or signals by revealing local features that are shared across\nthe data domain. These similar features are identi\ufb01ed with localized convolutional \ufb01lters or kernels,\nwhich are learned from the data. Convolutional \ufb01lters are shift- or translation-invariant \ufb01lters, mean-\ning they are able to recognize identical features independently of their spatial locations. Localized\nkernels or compactly supported \ufb01lters refer to \ufb01lters that extract local features independently of the\ninput data size, with a support size that can be much smaller than the input size.\nUser data on social networks, gene data on biological regulatory networks, log data on telecommu-\nnication networks, or text documents on word embeddings are important examples of data lying on\nirregular or non-Euclidean domains that can be structured with graphs, which are universal represen-\ntations of heterogeneous pairwise relationships. 
Graphs can encode complex geometric structures\nand can be studied with strong mathematical tools such as spectral graph theory [6].\nA generalization of CNNs to graphs is not straightforward as the convolution and pooling operators\nare only de\ufb01ned for regular grids. This makes this extension challenging, both theoretically and\nimplementation-wise. The major bottleneck of generalizing CNNs to graphs, and one of the primary\ngoals of this work, is the de\ufb01nition of localized graph \ufb01lters which are ef\ufb01cient to evaluate and learn.\nPrecisely, the main contributions of this work are summarized below.\n\n1. Spectral formulation. A spectral graph theoretical formulation of CNNs on graphs built\non established tools in graph signal processing (GSP) [31].\n\n2. Strictly localized \ufb01lters. Enhancing [4], the proposed spectral \ufb01lters are provably\nstrictly localized in a ball of radius K, i.e. K hops from the central vertex.\n\n3. Low computational complexity. The evaluation complexity of our \ufb01lters is linear w.r.t. the\n\ufb01lter\u2019s support size K and the number of edges |E|. Importantly, as most real-world graphs\nare highly sparse, we have |E| \u226a n2 and |E| = kn for the widespread k-nearest neighbor\n(NN) graphs, leading to a linear complexity w.r.t. the input data size n. Moreover, this\nmethod avoids the Fourier basis altogether, thus the expensive eigenvalue decomposition\n(EVD) necessary to compute it as well as the need to store the basis, a matrix of size n2.\nThat is especially relevant when working with limited GPU memory. Besides the data, our\nmethod only requires storing the Laplacian, a sparse matrix of |E| non-zero values.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Architecture of a CNN on graphs and the four ingredients of a (graph) convolutional layer.\n\n4. Ef\ufb01cient pooling. 
We propose an ef\ufb01cient pooling strategy on graphs which, after a rear-\nrangement of the vertices as a binary tree structure, is analogous to the pooling of 1D signals.\n\n5. Experimental results. We present multiple experiments that ultimately show that our for-\nmulation is (i) a useful model, (ii) computationally ef\ufb01cient and (iii) superior both in accu-\nracy and complexity to the pioneering spectral graph CNN introduced in [4]. We also show\nthat our graph formulation performs similarly to a classical CNN on MNIST and study the\nimpact of various graph constructions on performance. The TensorFlow [1] code to repro-\nduce our results and apply the model to other data is available as open-source software.1\n\n2 Proposed Technique\nGeneralizing CNNs to graphs requires three fundamental steps: (i) the design of localized convolu-\ntional \ufb01lters on graphs, (ii) a graph coarsening procedure that groups together similar vertices and\n(iii) a graph pooling operation that trades spatial resolution for higher \ufb01lter resolution.\n2.1 Learning Fast Localized Spectral Filters\nThere are two strategies to de\ufb01ne convolutional \ufb01lters; either from a spatial approach or from a\nspectral approach. By construction, spatial approaches provide \ufb01lter localization via the \ufb01nite size\nof the kernel. However, although graph convolution in the spatial domain is conceivable, it faces\nthe challenge of matching local neighborhoods, as pointed out in [4]. Consequently, there is no\nunique mathematical de\ufb01nition of translation on graphs from a spatial perspective. On the other\nhand, a spectral approach provides a well-de\ufb01ned localization operator on graphs via convolutions\nwith a Kronecker delta implemented in the spectral domain [31]. The convolution theorem [22]\nde\ufb01nes convolutions as linear operators that diagonalize in the Fourier basis (represented by the\neigenvectors of the Laplacian operator). 
However, a \ufb01lter de\ufb01ned in the spectral domain is not\nnaturally localized and translations are costly due to the O(n2) multiplication with the graph Fourier\nbasis. Both limitations can however be overcome with a special choice of \ufb01lter parametrization.\n\n1https://github.com/mdeff/cnn_graph\n\nGraph Fourier Transform. We are interested in processing signals de\ufb01ned on undirected and\nconnected graphs G = (V,E, W ), where V is a \ufb01nite set of |V| = n vertices, E is a set of edges and\nW \u2208 Rn\u00d7n is a weighted adjacency matrix encoding the connection weight between two vertices.\nA signal x : V \u2192 R de\ufb01ned on the nodes of the graph may be regarded as a vector x \u2208 Rn\nwhere xi is the value of x at the ith node. An essential operator in spectral graph analysis is the\ngraph Laplacian [6], whose combinatorial de\ufb01nition is L = D \u2212 W \u2208 Rn\u00d7n where D \u2208 Rn\u00d7n is the\ndiagonal degree matrix with Dii = \u2211j Wij, and whose normalized de\ufb01nition is L = In \u2212 D\u22121/2W D\u22121/2\nwhere In is the identity matrix. As L is a real symmetric positive semide\ufb01nite matrix, it has a\ncomplete set of orthonormal eigenvectors {ul}_{l=0}^{n\u22121} \u2208 Rn, known as the graph Fourier modes, and\ntheir associated ordered real nonnegative eigenvalues {\u03bbl}_{l=0}^{n\u22121}, identi\ufb01ed as the frequencies of the\ngraph. The Laplacian is indeed diagonalized by the Fourier basis U = [u0, . . . , un\u22121] \u2208 Rn\u00d7n\nsuch that L = U \u039bU T where \u039b = diag([\u03bb0, . . . , \u03bbn\u22121]) \u2208 Rn\u00d7n. The graph Fourier transform of a\nsignal x \u2208 Rn is then de\ufb01ned as \u02c6x = U T x \u2208 Rn, and its inverse as x = U \u02c6x [31]. 
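The Laplacian construction and graph Fourier transform just defined can be sketched in a few lines of NumPy (a minimal illustration on a made-up 4-vertex graph, not the paper's implementation):

```python
import numpy as np

# Small undirected weighted graph: adjacency W, combinatorial Laplacian L = D - W.
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))            # diagonal degree matrix, D_ii = sum_j W_ij
L = D - W

# Eigendecomposition L = U Lambda U^T gives the graph Fourier basis U.
lam, U = np.linalg.eigh(L)            # eigenvalues ascending: the graph "frequencies"

x = np.array([1., 2., 3., 4.])        # a signal on the 4 vertices
x_hat = U.T @ x                       # graph Fourier transform
x_rec = U @ x_hat                     # inverse transform recovers x

assert np.allclose(x_rec, x)          # U is orthonormal, so U U^T = I
```

Since the graph is connected, the smallest eigenvalue is 0 and all others are positive, matching the positive-semidefiniteness stated above.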
As on Euclidean\nspaces, that transform enables the formulation of fundamental operations such as \ufb01ltering.\n\nSpectral \ufb01ltering of graph signals. As we cannot express a meaningful translation operator in\nthe vertex domain, the convolution operator on graph \u2217G is de\ufb01ned in the Fourier domain such that\nx \u2217G y = U ((U T x) \u2299 (U T y)), where \u2299 is the element-wise Hadamard product. It follows that a\nsignal x is \ufb01ltered by g\u03b8 as\n\ny = g\u03b8(L)x = g\u03b8(U \u039bU T )x = U g\u03b8(\u039b)U T x. (1)\n\nA non-parametric \ufb01lter, i.e. a \ufb01lter whose parameters are all free, would be de\ufb01ned as\n\ng\u03b8(\u039b) = diag(\u03b8), (2)\n\nwhere the parameter \u03b8 \u2208 Rn is a vector of Fourier coef\ufb01cients.\nPolynomial parametrization for localized \ufb01lters. There are however two limitations with non-\nparametric \ufb01lters: (i) they are not localized in space and (ii) their learning complexity is in O(n),\nthe dimensionality of the data. These issues can be overcome with the use of a polynomial \ufb01lter\n\ng\u03b8(\u039b) = \u2211_{k=0}^{K\u22121} \u03b8k \u039b^k, (3)\n\nwhere the parameter \u03b8 \u2208 RK is a vector of polynomial coef\ufb01cients. The value at vertex j of the\n\ufb01lter g\u03b8 centered at vertex i is given by (g\u03b8(L)\u03b4i)j = (g\u03b8(L))i,j = \u2211_k \u03b8k (L^k)i,j, where the kernel\nis localized via a convolution with a Kronecker delta function \u03b4i \u2208 Rn. By [12, Lemma 5.2],\ndG(i, j) > K implies (L^K)i,j = 0, where dG is the shortest path distance, i.e. the minimum number\nof edges connecting two vertices on the graph. Consequently, spectral \ufb01lters represented by Kth-\norder polynomials of the Laplacian are exactly K-localized. Besides, their learning complexity is\nO(K), the support size of the \ufb01lter, and thus the same complexity as classical CNNs.\nRecursive formulation for fast \ufb01ltering. 
While we have shown how to learn localized \ufb01lters\nwith K parameters, the cost to \ufb01lter a signal x as y = U g\u03b8(\u039b)U T x is still high with O(n2) op-\nerations because of the multiplication with the Fourier basis U. A solution to this problem is to\nparametrize g\u03b8(L) as a polynomial function that can be computed recursively from L, as K mul-\ntiplications by a sparse L cost O(K|E|) \u226a O(n2). One such polynomial, traditionally used\nin GSP to approximate kernels (like wavelets), is the Chebyshev expansion [12]. Another op-\ntion, the Lanczos algorithm [33], which constructs an orthonormal basis of the Krylov subspace\nKK(L, x) = span{x, Lx, . . . , LK\u22121x}, seems attractive because of the coef\ufb01cients\u2019 independence.\nIt is however more involved and thus left as future work.\nRecall that the Chebyshev polynomial Tk(x) of order k may be computed by the stable recurrence\nrelation Tk(x) = 2xTk\u22121(x) \u2212 Tk\u22122(x) with T0 = 1 and T1 = x. These polynomials form an\northogonal basis for L2([\u22121, 1], dy/\u221a(1 \u2212 y2)), the Hilbert space of square integrable functions with\nrespect to the measure dy/\u221a(1 \u2212 y2). A \ufb01lter can thus be parametrized as the truncated expansion\n\ng\u03b8(\u039b) = \u2211_{k=0}^{K\u22121} \u03b8k Tk(\u02dc\u039b), (4)\n\nof order K \u2212 1, where the parameter \u03b8 \u2208 RK is a vector of Chebyshev coef\ufb01cients and Tk(\u02dc\u039b) \u2208\nRn\u00d7n is the Chebyshev polynomial of order k evaluated at \u02dc\u039b = 2\u039b/\u03bbmax \u2212 In, a diagonal matrix\nof scaled eigenvalues that lie in [\u22121, 1]. The \ufb01ltering operation can then be written as y = g\u03b8(L)x =\n\u2211_{k=0}^{K\u22121} \u03b8k Tk( \u02dcL)x, where Tk( \u02dcL) \u2208 Rn\u00d7n is the Chebyshev polynomial of order k evaluated at the\nscaled Laplacian \u02dcL = 2L/\u03bbmax \u2212 In. 
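This Chebyshev filtering scheme, including its stable recurrence, can be sketched with NumPy/SciPy as follows (an illustrative sketch on a 3-vertex path graph, not the released TensorFlow code):

```python
import numpy as np
from scipy.sparse import csr_matrix, identity

def chebyshev_filter(L, x, theta, lmax):
    """Filter y = sum_k theta[k] * T_k(L_tilde) x via the Chebyshev recurrence,
    costing K sparse matrix-vector products and never forming the basis U."""
    K = len(theta)
    # Scaled Laplacian whose spectrum lies in [-1, 1].
    L_tilde = (2.0 / lmax) * L - identity(L.shape[0], format='csr')
    x_bars = [x]                                   # T_0(L~) x = x
    if K > 1:
        x_bars.append(L_tilde @ x)                 # T_1(L~) x = L~ x
    for _ in range(2, K):                          # T_k = 2 L~ T_{k-1} - T_{k-2}
        x_bars.append(2 * (L_tilde @ x_bars[-1]) - x_bars[-2])
    # y = [x_bar_0, ..., x_bar_{K-1}] theta
    return np.stack(x_bars, axis=1) @ np.asarray(theta)

# Tiny path graph (hypothetical data); lmax = 3 is its true largest eigenvalue.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
L = csr_matrix(np.diag(W.sum(axis=1)) - W)
y = chebyshev_filter(L, np.array([1., 0., -1.]), theta=[0.5, 0.2, 0.1], lmax=3.0)
```

With theta = [1, 0, 0] the filter is the identity, as T_0 = 1, which is a quick sanity check of the recurrence.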
Denoting \u00afxk = Tk( \u02dcL)x \u2208 Rn, we can use the recurrence\nrelation to compute \u00afxk = 2 \u02dcL\u00afxk\u22121 \u2212 \u00afxk\u22122 with \u00afx0 = x and \u00afx1 = \u02dcLx. The entire \ufb01ltering operation\ny = g\u03b8(L)x = [\u00afx0, . . . , \u00afxK\u22121]\u03b8 then costs O(K|E|) operations.\n\nLearning \ufb01lters. The jth output feature map of the sample s is given by\n\nys,j = \u2211_{i=1}^{Fin} g\u03b8i,j (L)xs,i \u2208 Rn, (5)\n\nwhere the xs,i are the input feature maps and the Fin \u00d7 Fout vectors of Chebyshev coef\ufb01cients\n\u03b8i,j \u2208 RK are the layer\u2019s trainable parameters. When training multiple convolutional layers with\nthe backpropagation algorithm, one needs the two gradients\n\n\u2202E/\u2202\u03b8i,j = \u2211_{s=1}^{S} [\u00afxs,i,0, . . . , \u00afxs,i,K\u22121]T \u2202E/\u2202ys,j and \u2202E/\u2202xs,i = \u2211_{j=1}^{Fout} g\u03b8i,j (L) \u2202E/\u2202ys,j, (6)\n\nwhere E is the loss energy over a mini-batch of S samples. Each of the above three computations\nboils down to K sparse matrix-vector multiplications and one dense matrix-vector multiplication for\na cost of O(K|E|FinFoutS) operations. These can be ef\ufb01ciently computed on parallel architectures\nby leveraging tensor operations. Eventually, [\u00afxs,i,0, . . . , \u00afxs,i,K\u22121] only needs to be computed once.\n2.2 Graph Coarsening\nThe pooling operation requires meaningful neighborhoods on graphs, where similar vertices are\nclustered together. Doing this for multiple layers is equivalent to a multi-scale clustering of the graph\nthat preserves local geometric structures. It is however known that graph clustering is NP-hard [5]\nand that approximations must be used. While there exist many clustering techniques, e.g. 
the pop-\nular spectral clustering [21], we are most interested in multilevel clustering algorithms where each\nlevel produces a coarser graph which corresponds to the data domain seen at a different resolution.\nMoreover, clustering techniques that reduce the size of the graph by a factor two at each level offer\nprecise control over the coarsening and pooling size. In this work, we make use of the coarsening\nphase of the Graclus multilevel clustering algorithm [9], which has been shown to be extremely ef-\n\ufb01cient at clustering a large variety of graphs. Algebraic multigrid techniques on graphs [28] and the\nKron reduction [32] are two methods worth exploring in future works.\nGraclus [9], built on Metis [16], uses a greedy algorithm to compute successive coarser versions of\na given graph and is able to minimize several popular spectral clustering objectives, from which we\nchose the normalized cut [30]. Graclus\u2019 greedy rule consists, at each coarsening level, in picking an\nunmarked vertex i and matching it with one of its unmarked neighbors j that maximizes the local\nnormalized cut Wij(1/di + 1/dj). The two matched vertices are then marked and the coarsened\nweights are set as the sum of their weights. The matching is repeated until all nodes have been ex-\nplored. This is a very fast coarsening scheme which divides the number of nodes by approximately\ntwo (there may remain a few singletons, i.e. non-matched nodes) from one level to the next coarser level.\n2.3 Fast Pooling of Graph Signals\nPooling operations are carried out many times and must be ef\ufb01cient. After coarsening, the vertices\nof the input graph and its coarsened versions are not arranged in any meaningful way. Hence, a\ndirect application of the pooling operation would need a table to store all matched vertices. That\nwould result in a memory inef\ufb01cient, slow, and hardly parallelizable implementation. 
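One level of the greedy Graclus-style matching described in Section 2.2 can be sketched as follows (a simplified dense-matrix illustration of the matching rule, not the actual Graclus/Metis implementation):

```python
import numpy as np

def coarsen_one_level(W):
    """One level of greedy coarsening: match each unmarked vertex i with the
    unmarked neighbor j maximizing the local normalized cut W_ij*(1/d_i + 1/d_j);
    the weights of merged vertices are summed."""
    n = W.shape[0]
    d = W.sum(axis=1)                              # vertex degrees
    marked = np.zeros(n, dtype=bool)
    clusters = []                                  # matched pairs or singletons
    for i in range(n):
        if marked[i]:
            continue
        marked[i] = True
        nbrs = [j for j in range(n) if W[i, j] > 0 and not marked[j]]
        if nbrs:
            j = max(nbrs, key=lambda j: W[i, j] * (1 / d[i] + 1 / d[j]))
            marked[j] = True
            clusters.append((i, j))
        else:
            clusters.append((i,))                  # singleton, left unmatched
    m = len(clusters)
    Wc = np.zeros((m, m))                          # coarsened adjacency
    for a in range(m):
        for b in range(a + 1, m):
            w = sum(W[p, q] for p in clusters[a] for q in clusters[b])
            Wc[a, b] = Wc[b, a] = w
    return Wc, clusters

# 4-cycle example (hypothetical): coarsens to 2 super-nodes joined by 2 edges.
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
Wc, clusters = coarsen_one_level(W)
```

As in the text, each level roughly halves the number of vertices, with occasional singletons when all of a vertex's neighbors are already marked.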
It is however\npossible to arrange the vertices such that a graph pooling operation becomes as ef\ufb01cient as a 1D\npooling. We proceed in two steps: (i) create a balanced binary tree and (ii) rearrange the vertices.\nAfter coarsening, each node has either two children, if it was matched at the \ufb01ner level, or one, if it\nwas not, i.e. the node was a singleton. From the coarsest to \ufb01nest level, fake nodes, i.e. disconnected\nnodes, are added to pair with the singletons such that each node has two children. This structure is\na balanced binary tree: (i) regular nodes (and singletons) have either two regular nodes (e.g. level\n1 vertex 0 in Figure 2) or (ii) one singleton and a fake node as children (e.g. level 2 vertex 0), and\n(iii) fake nodes always have two fake nodes as children (e.g. level 1 vertex 1). Input signals are\ninitialized with a neutral value at the fake nodes, e.g. 0 when using a ReLU activation with max\npooling. Because these nodes are disconnected, \ufb01ltering does not impact the initial neutral value.\nWhile those fake nodes do arti\ufb01cially increase the dimensionality and thus the computational cost, we\nfound that, in practice, the number of singletons left by Graclus is quite low. Arbitrarily ordering the\nnodes at the coarsest level, then propagating this ordering to the \ufb01nest levels, i.e. node k has nodes\n2k and 2k + 1 as children, produces a regular ordering in the \ufb01nest level. Regular in the sense that\nadjacent nodes are hierarchically merged at coarser levels. Pooling such a rearranged graph signal is\nanalogous to pooling a regular 1D signal. Figure 2 shows an example of the whole process. This regular\narrangement makes the operation very ef\ufb01cient and suits parallel architectures such as GPUs, as\nmemory accesses are local, i.e. matched nodes do not have to be fetched.\n\nFigure 2: Example of Graph Coarsening and Pooling. Let us carry out a max pooling of size 4\n(or two poolings of size 2) on a signal x \u2208 R8 living on G0, the \ufb01nest graph given as input. Note\nthat it originally possesses n0 = |V0| = 8 vertices, arbitrarily ordered. For a pooling of size 4,\ntwo coarsenings of size 2 are needed: let Graclus give G1 of size n1 = |V1| = 5, then G2 of size\nn2 = |V2| = 3, the coarsest graph. Sizes are thus set to n2 = 3, n1 = 6, n0 = 12 and fake nodes\n(in blue) are added to V1 (1 node) and V0 (4 nodes) to pair with the singletons (in orange), such that\neach node has exactly two children. Nodes in V2 are then arbitrarily ordered and nodes in V1 and\nV0 are ordered consequently. At that point the arrangement of vertices in V0 permits a regular 1D\npooling on x \u2208 R12 such that z = [max(x0, x1), max(x4, x5, x6), max(x8, x9, x10)] \u2208 R3, where\nthe signal components x2, x3, x7, x11 are set to a neutral value.\n\n3 Related Works\n3.1 Graph Signal Processing\nThe emerging \ufb01eld of GSP aims at bridging the gap between signal processing and spectral graph\ntheory [6, 3, 21], a blend between graph theory and harmonic analysis. A goal is to generalize\nfundamental analysis operations for signals from regular grids to irregular structures embodied by\ngraphs. We refer the reader to [31] for an introduction to the \ufb01eld. Standard operations on grids\nsuch as convolution, translation, \ufb01ltering, dilatation, modulation or downsampling do not extend\ndirectly to graphs and thus require new mathematical de\ufb01nitions while keeping the original intuitive\nconcepts. In this context, the authors of [12, 8, 10] revisited the construction of wavelet operators\non graphs and techniques to perform multi-scale pyramid transforms on graphs were proposed in\n[32, 27]. 
The works of [34, 25, 26] rede\ufb01ned uncertainty principles on graphs and showed that\nwhile intuitive concepts may be lost, enhanced localization principles can be derived.\n3.2 CNNs on Non-Euclidean Domains\nThe Graph Neural Network framework [29], simpli\ufb01ed in [20], was designed to embed each node in\na Euclidean space with an RNN and use those embeddings as features for classi\ufb01cation or regression\nof nodes or graphs. By setting their transition function f as a simple diffusion instead of a neural\nnet with a recursive relation, their state vector becomes s = f (x) = W x. Their point-wise output\nfunction g\u03b8 can further be set as \u02c6x = g\u03b8(s, x) = \u03b8(s \u2212 Dx) + x = \u03b8Lx + x instead of another\nneural net. The Chebyshev polynomials of degree K can then be obtained with a K-layer GNN, to\nbe followed by a non-linear layer and a graph pooling operation. Our model can thus be interpreted\nas multiple layers of diffusions and node-local operations.\nThe works of [11, 7] introduced the concept of constructing a local receptive \ufb01eld to reduce the\nnumber of learned parameters. The idea is to group together features based upon a measure of\nsimilarity so as to select a limited number of connections between two successive layers. While\nthis model reduces the number of parameters by exploiting the locality assumption, it did not attempt\nto exploit any stationarity property, i.e. no weight-sharing strategy. The authors of [4] used this\nidea for their spatial formulation of graph CNNs. They use a weighted graph to de\ufb01ne the local\nneighborhood and compute a multiscale clustering of the graph for the pooling operation. 
Inducing\nweight sharing in a spatial construction is however challenging, as it requires selecting and ordering the\nneighborhoods when a problem-speci\ufb01c ordering (spatial, temporal, or otherwise) is missing.\nA spatial generalization of CNNs to 3D-meshes, a class of smooth low-dimensional non-Euclidean\nspaces, was proposed in [23]. The authors used geodesic polar coordinates to de\ufb01ne the convolu-\ntion on mesh patches, and formulated a deep learning architecture which allows comparison across\ndifferent manifolds. They obtained state-of-the-art results for 3D shape recognition.\n\nModel                  Architecture            Accuracy\nClassical CNN          C32-P4-C64-P4-FC512     99.33\nProposed graph CNN     GC32-P4-GC64-P4-FC512   99.14\n\nTable 1: Classi\ufb01cation accuracies of the proposed graph CNN and a classical CNN on MNIST.\n\nThe \ufb01rst spectral formulation of a graph CNN, proposed in [4], de\ufb01nes a \ufb01lter as\n\ng\u03b8(\u039b) = B\u03b8, (7)\n\nwhere B \u2208 Rn\u00d7K is the cubic B-spline basis and the parameter \u03b8 \u2208 RK is a vector of control points.\nThey later proposed a strategy to learn the graph structure from the data and applied the model to\nimage recognition, text categorization and bioinformatics [13]. This approach however does not\nscale due to the necessary multiplications by the graph Fourier basis U. Despite the cost of\ncomputing this matrix, which requires an EVD on the graph Laplacian, the dominant cost is the need\nto multiply the data by this matrix twice (forward and inverse Fourier transforms) at a cost of O(n2)\noperations per forward and backward pass, a computational bottleneck already identi\ufb01ed by the\nauthors. 
Besides, as they rely on smoothness in the Fourier domain, via the spline parametrization,\nto bring localization in the vertex domain, their model does not provide a precise control over the\nlocal support of their kernels, which is essential to learn localized \ufb01lters. Our technique builds\non this work, and we show how to overcome these limitations.\n4 Numerical Experiments\nIn the sequel, we refer to the non-parametric and non-localized \ufb01lters (2) as Non-Param, the \ufb01lters\n(7) proposed in [4] as Spline and the proposed \ufb01lters (4) as Chebyshev. We always use the Graclus\ncoarsening algorithm introduced in Section 2.2 rather than the simple agglomerative method of [4].\nOur motivation is to compare the learned \ufb01lters, not the coarsening algorithms.\nWe use the following notation when describing network architectures: FCk denotes a fully con-\nnected layer with k hidden units, Pk denotes a (graph or classical) pooling layer of size and stride\nk, GCk and Ck denote a (graph) convolutional layer with k feature maps. All FCk, Ck and GCk\nlayers are followed by a ReLU activation max(x, 0). The \ufb01nal layer is always a softmax regression\nand the loss energy E is the cross-entropy with an \u21132 regularization on the weights of all FCk layers.\nMini-batches are of size S = 100.\n4.1 Revisiting Classical CNNs on MNIST\nTo validate our model, we applied it to the Euclidean case on the benchmark MNIST classi\ufb01cation\nproblem [19], a dataset of 70,000 digits represented on a 2D grid of size 28 \u00d7 28. For our graph\nmodel, we construct an 8-NN graph of the 2D grid which produces a graph of n = |V| = 976 nodes\n(28\u00b2 = 784 pixels and 192 fake nodes as explained in Section 2.3) and |E| = 3198 edges. 
Following\nstandard practice, the weights of a k-NN similarity graph (between features) are computed as\n\nWij = exp(\u2212\u2016zi \u2212 zj\u2016_2^2 / \u03c3^2), (8)\n\nwhere zi is the 2D coordinate of pixel i.\nThis is an important sanity check for our model, which must be able to extract features on any graph,\nincluding the regular 2D grid. Table 1 shows the ability of our model to achieve a performance very\nclose to a classical CNN with the same architecture. The gap in performance may be explained\nby the isotropic nature of the spectral \ufb01lters, i.e. the fact that edges in a general graph do not\npossess an orientation (like up, down, right and left for pixels on a 2D grid). Whether this is a\nlimitation or an advantage depends on the problem and should be veri\ufb01ed, as for any invariance.\nMoreover, rotational invariance has been sought: (i) many data augmentation schemes have used\nrotated versions of images and (ii) models have been developed to learn this invariance, like the\nSpatial Transformer Networks [14]. Other explanations are the lack of experience on architecture\ndesign and the need to investigate better suited optimization or initialization strategies.\nThe LeNet-5-like network architecture and the following hyper-parameters are borrowed from the\nTensorFlow MNIST tutorial2: dropout probability of 0.5, regularization weight of 5 \u00d7 10\u22124, initial\nlearning rate of 0.03, learning rate decay of 0.95, momentum of 0.9. Filters are of size 5 \u00d7 5 and\ngraph \ufb01lters have the same support of K = 25. All models were trained for 20 epochs.\n\n2https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros\n\nModel                      Accuracy\nLinear SVM                 65.90\nMultinomial Naive Bayes    68.51\nSoftmax                    66.28\nFC2500                     64.64\nFC2500-FC500               65.76\nGC32                       68.26\n\nTable 2: Accuracies of the proposed graph CNN and other methods on 20NEWS.\n\nFigure 3: Time to process a mini-batch of S = 100 20NEWS documents w.r.t. the number of words n.\n\nDataset   Architecture             Non-Param (2)   Spline (7) [4]   Chebyshev (4)\nMNIST     GC10                     95.75           97.26            97.48\nMNIST     GC32-P4-GC64-P4-FC512    96.28           97.15            99.14\n\nTable 3: Classi\ufb01cation accuracies for different types of spectral \ufb01lters (K = 25).\n\nModel                Architecture             CPU (ms)   GPU (ms)   Speedup\nClassical CNN        C32-P4-C64-P4-FC512      210        31         6.77x\nProposed graph CNN   GC32-P4-GC64-P4-FC512    1600       200        8.00x\n\nTable 4: Time to process a mini-batch of S = 100 MNIST images.\n\n4.2 Text Categorization on 20NEWS\nTo demonstrate the versatility of our model to work with graphs generated from unstructured data,\nwe applied our technique to the text categorization problem on the 20NEWS dataset which consists\nof 18,846 (11,314 for training and 7,532 for testing) text documents associated with 20 classes [15].\nWe extracted the 10,000 most common words from the 93,953 unique words in this corpus. Each\ndocument x is represented using the bag-of-words model, normalized across words. To test our\nmodel, we constructed a 16-NN graph with (8) where zi is the word2vec embedding [24] of word\ni, which produced a graph of n = |V| = 10,000 nodes and |E| = 132,834 edges. All models\nwere trained for 20 epochs by the Adam optimizer [17] with an initial learning rate of 0.001. The\narchitecture is GC32 with support K = 5. 
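The k-NN graph construction with Gaussian weights (8) used in both experiments can be sketched as follows (a simplified dense NumPy illustration on made-up embeddings; the released code uses its own sparse pipeline, and the bandwidth heuristic below is an assumption):

```python
import numpy as np

def knn_graph(Z, k, sigma2=None):
    """Build a weighted k-NN graph from feature vectors Z (one row per node),
    with Gaussian weights W_ij = exp(-||z_i - z_j||^2 / sigma^2) as in Eq. (8)."""
    n = Z.shape[0]
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    if sigma2 is None:
        sigma2 = d2[d2 > 0].mean()                        # a common bandwidth heuristic
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]              # k nearest neighbors, skip self
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    W[rows, idx.ravel()] = np.exp(-d2[rows, idx.ravel()] / sigma2)
    W = np.maximum(W, W.T)                                # symmetrize -> undirected graph
    return W

# e.g. a 10-node graph from random 2D points (hypothetical data)
Z = np.random.RandomState(0).rand(10, 2)
W = knn_graph(Z, k=3)
```

For the MNIST experiment Z would hold the 2D pixel coordinates (k = 8), and for 20NEWS the word2vec embeddings of the 10,000 words (k = 16).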
Table 2 shows decent performances: while the proposed\nmodel does not outperform the multinomial naive Bayes classi\ufb01er on this small dataset, it does\nbeat fully connected networks, which require many more parameters.\n4.3 Comparison between Spectral Filters and Computational Ef\ufb01ciency\nTable 3 reports that the proposed parametrization (4) outperforms (7) from [4] as well as non-\nparametric \ufb01lters (2) which are not localized and require O(n) parameters. Moreover, Figure 4\ngives a sense of how the validation accuracy and the loss E converge w.r.t. the \ufb01lter de\ufb01nitions.\nFigure 3 validates the low computational complexity of our model which scales as O(n) while [4]\nscales as O(n2). The measured runtime is the total training time divided by the number of gradient\nsteps. Table 4 shows a similar speedup as classical CNNs when moving to GPUs. This exempli\ufb01es\nthe parallelization opportunity offered by our model, which relies solely on matrix multiplications.\nThose are ef\ufb01ciently implemented by cuBLAS, the linear algebra routines provided by NVIDIA.\n4.4 In\ufb02uence of Graph Quality\nFor any graph CNN to be successful, the statistical assumptions of locality, stationarity, and compo-\nsitionality regarding the data must be ful\ufb01lled on the graph where the data resides. 
Therefore, the\nlearned \ufb01lters\u2019 quality and thus the classi\ufb01cation performance critically depend on the quality of\nthe graph.\n\nFigure 4: Plots of validation accuracy and training loss for the \ufb01rst 2000 iterations on MNIST.\n\nArchitecture            8-NN on 2D Euclidean grid   random\nGC32                    97.40                       96.88\nGC32-P4-GC64-P4-FC512   99.14                       95.39\n\nTable 5: Classi\ufb01cation accuracies with different graph constructions on MNIST.\n\nbag-of-words   word2vec pre-learned   word2vec learned   word2vec approximate   random\n67.50          66.98                  68.26              67.86                  67.75\n\nTable 6: Classi\ufb01cation accuracies of GC32 with different graph constructions on 20NEWS.\n\nFor data lying on Euclidean space, experiments in Section 4.1 show that a simple k-NN\ngraph of the grid is good enough to recover almost exactly the performance of standard CNNs. We\nalso noticed that the value of k does not have a strong in\ufb02uence on the results. We can witness the\nimportance of a graph satisfying the data assumptions by comparing its performance with a random\ngraph. Table 5 reports a large drop of accuracy when using a random graph, that is when the data\nstructure is lost and the convolutional layers are not useful anymore to extract meaningful features.\nWhile images can be structured by a grid graph, a feature graph has to be built for text documents\nrepresented as bag-of-words. We investigate here three ways to represent a word z: the simplest op-\ntion is to represent each word as its corresponding column in the bag-of-words matrix, while another\napproach is to learn an embedding for each word with word2vec [24] or to use the pre-learned em-\nbeddings provided by the authors. 
For larger datasets, an approximate nearest neighbors algorithm may be required, which is why we tried LSHForest [2] on the learned word2vec embeddings. Table 6 reports classification results which highlight the importance of a well-constructed graph.

5 Conclusion and Future Work

In this paper, we have introduced the mathematical and computational foundations of an efficient generalization of CNNs to graphs using tools from GSP. Experiments have shown the ability of the model to extract local and stationary features through graph convolutional layers. Compared with the first work on spectral graph CNNs introduced in [4], our model provides a strict control over the local support of filters, is computationally more efficient by avoiding an explicit use of the graph Fourier basis, and experimentally shows a better test accuracy. Besides, we addressed the three concerns raised by [13]: (i) we introduced a model whose computational complexity is linear with the dimensionality of the data, (ii) we confirmed that the quality of the input graph is of paramount importance, (iii) we showed that the statistical assumptions of local stationarity and compositionality made by the model are verified for text documents as long as the graph is well constructed.

Future work will investigate two directions. On the one hand, we will enhance the proposed framework with newly developed tools in GSP. On the other hand, we will explore applications of this generic model to important fields where the data naturally lies on graphs, which may then incorporate external information about the structure of the data rather than artificially created graphs whose quality may vary, as seen in the experiments.
Another natural direction, pioneered in [13], would be to alternate between learning the CNN parameters and learning the graph.

References

[1] M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. 2016.
[2] M. Bawa, T. Condie, and P. Ganesan. LSH Forest: Self-Tuning Indexes for Similarity Search. In International Conference on World Wide Web, pages 651–660, 2005.
[3] M. Belkin and P. Niyogi. Towards a Theoretical Foundation for Laplacian-based Manifold Methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
[4] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral Networks and Deep Locally Connected Networks on Graphs. arXiv:1312.6203, 2013.
[5] T.N. Bui and C. Jones. Finding Good Approximate Vertex and Edge Partitions is NP-hard. Information Processing Letters, 42(3):153–159, 1992.
[6] F.R.K. Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.
[7] A. Coates and A.Y. Ng. Selecting Receptive Fields in Deep Networks. In Neural Information Processing Systems (NIPS), pages 2528–2536, 2011.
[8] R.R. Coifman and S. Lafon. Diffusion Maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
[9] I. Dhillon, Y. Guan, and B. Kulis. Weighted Graph Cuts Without Eigenvectors: A Multilevel Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(11):1944–1957, 2007.
[10] M. Gavish, B. Nadler, and R. Coifman. Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi Supervised Learning. In International Conference on Machine Learning (ICML), pages 367–374, 2010.
[11] K. Gregor and Y. LeCun.
Emergence of Complex-like Cells in a Temporal Product Network with Local Receptive Fields. arXiv:1006.0448, 2010.
[12] D. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on Graphs via Spectral Graph Theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[13] M. Henaff, J. Bruna, and Y. LeCun. Deep Convolutional Networks on Graph-Structured Data. arXiv:1506.05163, 2015.
[14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial Transformer Networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[15] T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Carnegie Mellon University, Computer Science Technical Report, CMU-CS-96-118, 1996.
[16] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing (SISC), 20(1):359–392, 1998.
[17] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
[18] Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 521(7553):436–444, 2015.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated Graph Sequence Neural Networks.
[21] U. Von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4):395–416, 2007.
[22] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[23] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic Convolutional Neural Networks on Riemannian Manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations, 2013.
[25] B. Pasdeloup, R. Alami, V. Gripon, and M. Rabbat. Toward an Uncertainty Principle for Weighted Graphs. In Signal Processing Conference (EUSIPCO), pages 1496–1500, 2015.
[26] N. Perraudin, B. Ricaud, D. Shuman, and P. Vandergheynst. Global and Local Uncertainty Principles for Signals on Graphs. arXiv:1603.03030, 2016.
[27] I. Ram, M. Elad, and I. Cohen. Generalized Tree-based Wavelet Transform. IEEE Transactions on Signal Processing, 59(9):4199–4209, 2011.
[28] D. Ron, I. Safro, and A. Brandt. Relaxation-based Coarsening and Multiscale Graph Organization. SIAM Journal on Multiscale Modeling and Simulation, 9:407–423, 2011.
[29] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, and G. Monfardini. The Graph Neural Network Model. 20(1):61–80.
[30] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000.
[31] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to Networks and other Irregular Domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[32] D.I. Shuman, M.J. Faraji, and P. Vandergheynst. A Multiscale Pyramid Transform for Graph Signals. IEEE Transactions on Signal Processing, 64(8):2119–2134, 2016.
[33] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated Filtering on Graphs using Lanczos Method. arXiv:1509.04537, 2015.
[34] M. Tsitsvero and S. Barbarossa. On the Degrees of Freedom of Signals on Graphs.
In Signal Processing Conference (EUSIPCO), pages 1506–1510, 2015.