{"title": "Adaptive Sampling Towards Fast Graph Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4558, "page_last": 4567, "abstract": "Graph Convolutional Networks (GCNs) have become a crucial tool on learning representations of graph vertices. The main challenge of adapting GCNs on large-scale graphs is the scalability issue that it incurs heavy cost both in computation and memory due to the uncontrollable neighborhood expansion across layers. In this paper, we accelerate the training of GCNs through developing an adaptive layer-wise sampling method. By constructing the network layer by layer in a top-down passway, we sample the lower layer conditioned on the top one, where the sampled neighborhoods are shared by different parent nodes and the over expansion is avoided owing to the fixed-size sampling. More importantly, the proposed sampler is adaptive and applicable for explicit variance reduction, which in turn enhances the training of our method. Furthermore, we propose a novel and economical approach to promote the message passing over distant nodes by applying skip connections.\nIntensive experiments on several benchmarks verify the effectiveness of our method regarding the classification accuracy while enjoying faster convergence speed.", "full_text": "Adaptive Sampling\n\nTowards Fast Graph Representation Learning\n\nWenbing Huang1, Tong Zhang2, Yu Rong1, Junzhou Huang1\n\n1 Tencent AI Lab. ;\n\n2 Australian National University;\n\nhwenbing@126.com, tong.zhang@anu.edu.au,\n\nyu.rong@hotmail.com, joehhuang@tencent.com\n\nAbstract\n\nGraph Convolutional Networks (GCNs) have become a crucial tool on learning\nrepresentations of graph vertices. The main challenge of adapting GCNs on large-\nscale graphs is the scalability issue that it incurs heavy cost both in computation\nand memory due to the uncontrollable neighborhood expansion across layers. 
In this paper, we accelerate the training of GCNs by developing an adaptive layer-wise sampling method. Constructing the network layer by layer in a top-down manner, we sample the lower layer conditioned on the upper one, where the sampled neighborhoods are shared by different parent nodes and over-expansion is avoided owing to the fixed-size sampling. More importantly, the proposed sampler is adaptive and amenable to explicit variance reduction, which in turn enhances the training of our method. Furthermore, we propose a novel and economical approach to promote message passing over distant nodes by applying skip connections. Extensive experiments on several benchmarks verify the effectiveness of our method in terms of classification accuracy while enjoying faster convergence.

1 Introduction

Deep Learning, especially Convolutional Neural Networks (CNNs), has revolutionized various machine learning tasks with grid-like input data, such as image classification [1] and machine translation [2]. By making use of local connections and weight sharing, CNNs pursue translational invariance of the data. In many other contexts, however, the input data lie on irregular or non-Euclidean domains, such as graphs that encode pairwise relationships. Examples include social networks [3], protein interfaces [4], and 3D meshes [5]. How to define convolutional operations on graphs is still an ongoing research topic.

There have been several attempts in the literature to develop neural networks that handle arbitrarily structured graphs. Whereas learning the graph embedding is already an important topic [6, 7, 8], this paper mainly focuses on learning representations for graph vertices by aggregating their features/attributes. 
The closest work in this vein is the Graph Convolutional Network (GCN) [9], which applies the connections between vertices as convolution filters to perform neighborhood aggregation. As demonstrated in [9], GCNs have achieved state-of-the-art performance on node classification.

An obvious challenge in applying current graph networks is scalability. Calculating convolutions requires the recursive expansion of neighborhoods across layers, which is computationally prohibitive and demands a hefty memory footprint. Even for a single node, the expansion will quickly cover a large portion of the graph as it proceeds layer by layer, particularly if the graph is dense or has a power-law degree distribution. Conventional mini-batch training is unable to speed up the convolution computations, since every batch involves a large number of vertices even when the batch size is small.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Network construction by different methods: (a) the node-wise sampling approach; (b) the layer-wise sampling method; (c) the model considering the skip connection. To illustrate the effectiveness of the layer-wise sampling, we assume that the nodes denoted by the red circle in (a) and (b) have at least two parents in the upper layer. In the node-wise sampling, the neighborhoods of each parent are not seen by other parents, hence the connections between the neighborhoods and other parents are unused. In contrast, for the layer-wise strategy, all neighborhoods are shared by nodes in the parent layer, thus all between-layer connections are utilized.

To avoid the over-expansion issue, we accelerate the training of GCNs by controlling the size of the sampled neighborhoods in each layer (see Figure 1). Our method builds up the network layer by layer in a top-down way, where the nodes in the lower layer¹ are sampled conditionally on those of the upper layer. 
Such layer-wise sampling is efficient in two technical aspects. First, we can reuse the information of the sampled neighborhoods, since the nodes in the lower layer are visible to and shared by their different parents in the upper layer. Second, it is easy to fix the size of each layer to avoid over-expansion of the neighborhoods, as the nodes of the lower layer are sampled as a whole.

The core of our method is to define an appropriate sampler for the layer-wise sampling. A common objective in designing the sampler is to minimize the resulting variance. Unfortunately, the optimal variance-minimizing sampler is uncomputable due to the inconsistency between the top-down sampling and the bottom-up propagation in our network (see § 4.2 for details). To tackle this issue, we approximate the optimal sampler by replacing the uncomputable part with a self-dependent function, and then add the variance to the loss function. As a result, the variance is explicitly reduced by training the network parameters and the sampler.

Moreover, we explore how to enable efficient message passing across distant nodes. Current methods [6, 10] resort to random walks to generate neighborhoods of various steps and then integrate the multi-hop neighborhoods. Instead, this paper proposes a novel mechanism that adds a skip connection between the (l + 1)-th and (l − 1)-th layers. This short-cut connection reuses the nodes in the (l − 1)-th layer as the 2-hop neighborhoods of the (l + 1)-th layer, thus naturally maintaining the second-order proximity without incurring extra computations.

To sum up, we make the following contributions in this paper: I. We develop a novel layer-wise sampling method to speed up the GCN model, where the between-layer information is shared and the size of the sampled set of nodes is controllable. II. 
The sampler for the layer-wise sampling is adaptive and determined by explicit variance reduction in the training phase. III. We propose a simple yet efficient approach to preserve the second-order proximity by formulating a skip connection across two layers.

We evaluate the performance of our method on four popular benchmarks for node classification, including Cora, Citeseer, Pubmed [11] and Reddit [3]. Extensive experiments verify the effectiveness of our method in terms of classification accuracy and convergence speed.

2 Related Work

While graph structures are central tools for various learning tasks (e.g., semi-supervised learning in [12, 9]), how to design efficient graph convolutional networks has become a popular research topic. Graph convolutional approaches are often categorized into spectral and non-spectral classes [13]. The spectral approach, first proposed by [14], defines the convolution operation in the Fourier domain. Later, [15] enables localized filtering by applying efficient spectral filters, and [16] employs a Chebyshev expansion of the graph Laplacian to avoid the eigendecomposition. Recently, GCN was proposed in [9] to simplify previous methods with a first-order expansion and a re-parameterization trick. Non-spectral approaches define convolution on graphs by using the spatial connections directly. For instance, [17] learns a weight matrix for each node degree, the work in [18] defines multiple-hop neighborhoods by using the power series of a transition matrix, and other authors [19] extract normalized neighborhoods that contain a fixed number of nodes.

¹ Here, lower layers denote the ones closer to the input.

A recent line of research generalizes convolutions by making use of the patch operation [20] and self-attention [13]. 
As opposed to GCNs, these methods implicitly assign different importance weights to nodes within the same neighborhood, thus enabling a leap in model capacity. In particular, Monti et al. [20] present mixture model CNNs to build CNN architectures on graphs using the patch operation, while graph attention networks [13] compute the hidden representation of each node on the graph by attending over its neighbors following a self-attention strategy.

More recently, two sampling-based methods, GraphSAGE [3] and FastGCN [21], were developed for fast representation learning on graphs. To be specific, GraphSAGE computes node representations by sampling the neighborhoods of each node and then applying a specific aggregator for information fusion. The FastGCN model interprets graph convolutions as integral transforms of embedding functions and samples the nodes in each layer independently. While our method is closely related to these methods, we develop a different sampling strategy in this paper. Compared to GraphSAGE, which is node-wise, our method is based on layer-wise sampling, as all neighborhoods are sampled altogether, and thus allows neighborhood sharing as illustrated in Figure 1. In contrast to FastGCN, which constructs each layer independently, our model is capable of capturing the between-layer connections, as the lower layer is sampled conditionally on the upper one. We detail the comparisons in § 6. Another related work is the control-variate-based method of [22]. However, the sampling process of this method is node-wise, and the historical activations of nodes are required.

3 Notations and Preliminaries

Notations. This paper mainly focuses on undirected graphs. Let G = (V, E) denote the undirected graph with nodes v_i ∈ V and edges (v_i, v_j) ∈ E, and let N denote the number of nodes. The adjacency matrix A ∈ R^{N×N} represents the weight associated with each edge (v_i, v_j) by its element A_{ij}. 
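As a concrete illustration of the notation (a minimal numpy sketch; the toy graph and all variable names are ours), the re-normalized adjacency used by GCNs can be built with the re-parameterization trick of [9], i.e. \hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}:

```python
import numpy as np

def renormalized_adjacency(A):
    """Compute A_hat = D^{-1/2} (A + I) D^{-1/2}, the re-normalized
    adjacency matrix of Kipf and Welling [9]."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy undirected graph with 3 nodes and edges (0,1), (1,2).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = renormalized_adjacency(A)
```

Each entry \hat{a}(v_i, u_j) of `A_hat` then plays the role of the convolution weight between a node and its neighbor in the update below.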
We also have a feature matrix X ∈ R^{N×D}, with x_i denoting the D-dimensional feature of node v_i.

GCN. The GCN model developed by Kipf and Welling [9] is one of the most successful convolutional networks for graph representation learning. If we define h^{(l)}(v_i) as the hidden feature of the l-th layer for node v_i, the feed-forward propagation becomes

h^{(l+1)}(v_i) = \sigma\Big( \sum_{j=1}^{N} \hat{a}(v_i, u_j)\, h^{(l)}(u_j)\, W^{(l)} \Big), \quad i = 1, \cdots, N,   (1)

where \hat{A} = (\hat{a}(v_i, u_j)) ∈ R^{N×N} is the re-normalization of the adjacency matrix; \sigma(\cdot) is a nonlinear function; W^{(l)} ∈ R^{D^{(l)} \times D^{(l-1)}} is the filter matrix of the l-th layer; and we denote the nodes in the l-th layer as u_j to distinguish them from those in the (l + 1)-th layer.

4 Adaptive Sampling

Eq. (1) indicates that GCNs require the full expansion of neighborhoods for the feed-forward computation of each node. This makes learning on large-scale graphs containing more than hundreds of thousands of nodes computationally intensive and memory-consuming. To circumvent this issue, this paper speeds up the feed-forward propagation by adaptive sampling. The proposed sampler is adaptive and enables variance reduction.

We first re-formulate the GCN update in an expectation form and introduce the node-wise sampling accordingly. Then, we generalize the node-wise sampling to a more efficient framework termed layer-wise sampling. To minimize the resulting variance, we further propose to learn the layer-wise sampler by performing variance reduction explicitly. Lastly, we introduce the concept of skip connections and apply it to enable second-order proximity in the feed-forward propagation.

4.1 From Node-Wise Sampling to Layer-Wise Sampling

Node-Wise Sampling. 
We first observe that Eq. (1) can be rewritten in an expectation form, namely,

h^{(l+1)}(v_i) = \sigma_{W^{(l)}}\big( N(v_i)\, \mathbb{E}_{p(u_j|v_i)}[h^{(l)}(u_j)] \big),   (2)

where we have absorbed the weight matrix W^{(l)} into the function \sigma(\cdot) for concision; p(u_j|v_i) = \hat{a}(v_i, u_j)/N(v_i) defines the probability of sampling u_j given v_i, with N(v_i) = \sum_{j=1}^{N} \hat{a}(v_i, u_j).

A natural idea to speed up Eq. (2) is to approximate the expectation by Monte-Carlo sampling. To be specific, we estimate the expectation \mu_p(v_i) = \mathbb{E}_{p(u_j|v_i)}[h^{(l)}(u_j)] with \hat{\mu}_p(v_i) given by

\hat{\mu}_p(v_i) = \frac{1}{n} \sum_{j=1}^{n} h^{(l)}(\hat{u}_j), \quad \hat{u}_j \sim p(u_j|v_i).   (3)

By setting n \ll N, the Monte-Carlo estimation reduces the complexity of Eq. (1) from O(|E| D^{(l)} D^{(l-1)}) (|E| denotes the number of edges) to O(n^2 D^{(l)} D^{(l-1)}) if the numbers of sampling points for the (l + 1)-th and l-th layers are both n.

By applying Eq. (3) in a multi-layer network, we construct the network structure in a top-down manner: sampling the neighbors of each node in the current layer recursively (see Figure 1 (a)). However, such node-wise sampling is still computationally expensive for deep networks, because the number of nodes to be sampled grows exponentially with the number of layers. For a network of depth d, the number of sampled nodes in the input layer increases to O(n^d), leading to a significant computational burden for large d².

Layer-Wise Sampling. We equivalently transform Eq. (2) into the following form by applying importance sampling, i.e.,

h^{(l+1)}(v_i) = \sigma_{W^{(l)}}\Big( N(v_i)\, \mathbb{E}_{q(u_j|v_1,\cdots,v_n)}\Big[ \frac{p(u_j|v_i)}{q(u_j|v_1,\cdots,v_n)}\, h^{(l)}(u_j) \Big] \Big),   (4)

where q(u_j|v_1,\cdots,v_n) is defined as the probability of sampling u_j given all the nodes of the current layer (i.e., v_1,\cdots,v_n). Similarly, we can speed up Eq. 
(4) by approximating the expectation with a Monte-Carlo mean, namely, computing h^{(l+1)}(v_i) = \sigma_{W^{(l)}}( N(v_i)\, \hat{\mu}_q(v_i) ) with

\hat{\mu}_q(v_i) = \frac{1}{n} \sum_{j=1}^{n} \frac{p(\hat{u}_j|v_i)}{q(\hat{u}_j|v_1,\cdots,v_n)}\, h^{(l)}(\hat{u}_j), \quad \hat{u}_j \sim q(\hat{u}_j|v_1,\cdots,v_n).   (5)

We term the sampling in Eq. (5) the layer-wise sampling strategy. As opposed to the node-wise method in Eq. (3), where the nodes \{\hat{u}_j\}_{j=1}^{n} are generated for each parent v_i independently, the sampling in Eq. (5) needs to be performed only once per layer. Besides, in the node-wise sampling the neighborhoods of each node are not visible to other parents, while in the layer-wise sampling all sampled nodes \{\hat{u}_j\}_{j=1}^{n} are shared by all nodes of the current layer. This sharing property enhances the message passing to the utmost. More importantly, the size of each layer is fixed to n, and the total number of sampled nodes grows only linearly with the network depth.

4.2 Explicit Variance Reduction

The remaining question for the layer-wise sampling is how to define the exact form of the sampler q(u_j|v_1,\cdots,v_n). Indeed, a good estimator should reduce the variance caused by the sampling process, since high variance probably impedes efficient training. For simplicity, we concisely denote the distribution q(u_j|v_1,\cdots,v_n) as q(u_j) below.

According to the derivations of importance sampling in [23], we immediately conclude that

Proposition 1. The variance of the estimator \hat{\mu}_q(v_i) in Eq. (5) is given by

\text{Var}_q(\hat{\mu}_q(v_i)) = \frac{1}{n}\, \mathbb{E}_{q(u_j)}\Big[ \frac{\big( p(u_j|v_i)\, |h^{(l)}(u_j)| - \mu_q(v_i)\, q(u_j) \big)^2}{q^2(u_j)} \Big].   (6)

The optimal sampler that minimizes the variance \text{Var}_q(\hat{\mu}_q(v_i)) in Eq. 
(6) is given by

q^{*}(u_j) = \frac{p(u_j|v_i)\, |h^{(l)}(u_j)|}{\sum_{j=1}^{N} p(u_j|v_i)\, |h^{(l)}(u_j)|}.   (7)

Unfortunately, it is infeasible to compute this optimal sampler in our case. By its definition, the sampler q^{*}(u_j) is computed based on the hidden feature h^{(l)}(u_j), which is aggregated from its neighborhoods in previous layers. However, under our top-down sampling framework, the neural units of lower layers are unknown unless the network has been completely constructed by the sampling.

² One can reduce the complexity of the node-wise sampling by removing the repeated nodes. Even so, for dense graphs, the sampled nodes will still quickly fill up the whole graph as the depth grows.

To alleviate this chicken-and-egg dilemma, we learn a self-dependent function of each node to determine its importance for the sampling. Let g(x(u_j)) be the self-dependent function computed from the node feature x(u_j). Replacing the hidden function in Eq. (7) with g(x(u_j)) yields

q^{*}(u_j) = \frac{p(u_j|v_i)\, |g(x(u_j))|}{\sum_{j=1}^{N} p(u_j|v_i)\, |g(x(u_j))|}.   (8)

The sampler in Eq. (8) is node-wise and varies for different v_i. To make it applicable to the layer-wise sampling, we sum the computations over all nodes \{v_i\}_{i=1}^{n}, thus arriving at

q^{*}(u_j) = \frac{\sum_{i=1}^{n} p(u_j|v_i)\, |g(x(u_j))|}{\sum_{j=1}^{N} \sum_{i=1}^{n} p(u_j|v_i)\, |g(x(u_j))|}.   (9)

In this paper, we define g(x(u_j)) as a linear function, i.e., g(x(u_j)) = W_g x(u_j), parameterized by the matrix W_g ∈ R^{1×D}. Computing the sampler in Eq. (9) is efficient, since computing p(u_j|v_i) (i.e., the adjacency value) and the self-dependent function g(x(u_j)) is fast.

Note that applying the sampler given by Eq. (9) does not necessarily result in minimal variance. To fulfill variance reduction, we add the variance to the loss function and explicitly minimize it by model training. 
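To make the procedure concrete, here is a minimal numpy sketch of the layer-wise sampler of Eq. (9) combined with the importance-weighted estimate of Eq. (5); the toy sizes, random inputs and variable names are all illustrative, and g(x) = W_g x is the linear self-dependent function:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, n = 8, 4, 3                   # candidate nodes, feature dim, layer size (toy values)
X = rng.normal(size=(N, D))         # node features x(u_j)
P = rng.uniform(size=(n, N))        # p(u_j | v_i) for the n parent nodes
P /= P.sum(axis=1, keepdims=True)   # each row is a probability distribution
Wg = rng.normal(size=(1, D))        # parameters of the self-dependent function g

g = np.abs(X @ Wg.T).ravel()        # |g(x(u_j))| for every candidate node
scores = P.sum(axis=0) * g          # sum_i p(u_j|v_i) |g(x(u_j))|
q = scores / scores.sum()           # layer-wise sampler q*(u_j), Eq. (9)

idx = rng.choice(N, size=n, p=q)    # sample the whole lower layer in one draw
H = X                               # stand-in for the hidden features h^{(l)}
# importance-weighted Monte-Carlo estimate of mu(v_i), Eq. (5)
mu_hat = (P[:, idx] / q[idx]) @ H[idx] / n
```

Note that all n parents share the same sampled set `idx`, which is the neighborhood-sharing property discussed above.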
Suppose we have a mini-batch of data pairs \{(v_i, y_i)\}_{i=1}^{n}, where v_i is a target node and y_i its corresponding ground-truth label. By the layer-wise sampling (Eq. (9)), the nodes of the previous layer are sampled given \{v_i\}_{i=1}^{n}, and this process is called recursively layer by layer until we reach the input domain. Then we perform a bottom-up propagation to compute the hidden features and obtain the estimated activation \hat{\mu}_q(v_i) for node v_i. A nonlinearity and a softmax function are further applied to \hat{\mu}_q(v_i) to produce the prediction \bar{y}(\hat{\mu}_q(v_i)). Taking the classification loss and the variance (Eq. (6)) into account, we formulate a hybrid loss as

L = \frac{1}{n} \sum_{i=1}^{n} \Big[ L_c\big( y_i, \bar{y}(\hat{\mu}_q(v_i)) \big) + \lambda\, \text{Var}_q(\hat{\mu}_q(v_i)) \Big],   (10)

where L_c is the classification loss (e.g., the cross entropy) and \lambda is the trade-off parameter, fixed to 0.5 in our experiments. Note that the activations of the other hidden layers are also stochastic, and the resulting variances should be reduced as well. In Eq. (10) we only penalize the variance of the top layer for efficient computation and find this sufficient to deliver promising performance in our experiments.

Minimizing the hybrid loss in Eq. (10) requires gradient calculations. For the network parameters, e.g., W^{(l)} in Eq. (2), the gradient calculation is straightforward and can be easily derived by an automatic-differentiation platform, e.g., TensorFlow [24]. For the parameters of the sampler, e.g., W_g in Eq. (9), calculating the gradient is nontrivial, as the sampling process (Eq. (5)) is non-differentiable. Fortunately, we prove that the gradient of the classification loss with respect to the sampler is zero. 
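For reference, the hybrid objective of Eq. (10) can be sketched in a few lines (a toy numpy version; the batch, logits and the plugged-in variance value are illustrative, with λ = 0.5 as in the experiments):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hybrid_loss(logits, labels, var_estimate, lam=0.5):
    """Cross entropy over the batch plus the variance penalty of Eq. (10)."""
    probs = softmax(logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return ce + lam * var_estimate

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 3))   # predictions for a mini-batch of 4 nodes
labels = np.array([0, 2, 1, 1])
var_estimate = 0.2                 # stand-in for Var_q(mu_hat) of the top layer
loss = hybrid_loss(logits, labels, var_estimate)
```

Setting `lam=0` recovers the "no vr" variant examined in the experiments.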
We also derive the gradient of the variance term with respect to the sampler, and detail the gradient calculation in the supplementary material.

5 Preserving Second-Order Proximities by Skip Connections

The GCN update in Eq. (1) only aggregates messages passed from 1-hop neighborhoods. To allow the network to better exploit information across distant nodes, we could sample multi-hop neighborhoods for the GCN update in a similar way to random walks [6, 10]. However, random walks require extra sampling to obtain distant nodes, which is computationally expensive for dense graphs. In this paper, we instead propose to propagate information over distant nodes via skip connections.

The key idea of the skip connection is to reuse the nodes of the (l − 1)-th layer to preserve the second-order proximity (see the definition in [7]). For the (l + 1)-th layer, the nodes of the (l − 1)-th layer are actually its 2-hop neighborhoods. If we further add a skip connection from the (l − 1)-th to the (l + 1)-th layer, as illustrated in Figure 1 (c), the aggregation will involve both the 1-hop and 2-hop neighborhoods. The calculations along the skip connection are formulated as

h_{skip}^{(l+1)}(v_i) = \sum_{j=1}^{n} \hat{a}_{skip}(v_i, s_j)\, h^{(l-1)}(s_j)\, W_{skip}^{(l-1)}, \quad i = 1, \cdots, n,   (11)

where s = \{s_j\}_{j=1}^{n} denotes the nodes in the (l − 1)-th layer. Due to the 2-hop distance between v_i and s_j, the weight \hat{a}_{skip}(v_i, s_j) should be an element of \hat{A}^2. Here, to avoid the full computation of \hat{A}^2, we estimate the weight with the sampled nodes of the l-th layer, i.e.,

\hat{a}_{skip}(v_i, s_j) \approx \sum_{k=1}^{n} \hat{a}(v_i, u_k)\, \hat{a}(u_k, s_j).   (12)

Instead of learning a free W_{skip}^{(l-1)} in Eq. 
(11), we decompose it as

W_{skip}^{(l-1)} = W^{(l-1)} W^{(l)},   (13)

where W^{(l)} and W^{(l-1)} are the filters of the l-th and (l − 1)-th layers of the original network, respectively. The output of the skip connection is added to the GCN layer (Eq. (1)) before the nonlinearity.

With the skip connection, the second-order proximity is maintained without extra 2-hop sampling. Besides, the skip connection allows information to pass between two distant layers, thus enabling more efficient back-propagation and model training.

While the designs are similar, our motivation for applying the skip connection is different from that of the residual function in ResNets [1]. The purpose of the skip connection in [1] is to gain accuracy by increasing the network depth. Here, we apply it to preserve the second-order proximity. In contrast to the identity mappings used in ResNets, the calculation along the skip connection in our model has to be derived specifically (see Eq. (12) and Eq. (13)).

6 Discussions and Extensions

Relation to other sampling methods. We contrast our approach with GraphSAGE [3] and FastGCN [21] in the following aspects:

1. The proposed layer-wise sampling method is novel. GraphSAGE randomly samples a fixed-size neighborhood for each node, while FastGCN constructs each layer independently according to an identical distribution. In our layer-wise approach, the nodes in lower layers are sampled conditioned on the upper ones, which is capable of capturing the between-layer correlations.

2. Our framework is general. Both GraphSAGE and FastGCN can be categorized as specific variants of our framework. Specifically, GraphSAGE is recovered as the node-wise sampler in Eq. (3) if p(u_j|v_i) is defined as the uniform distribution; FastGCN can be considered a special layer-wise method applying a sampler q(u_j) in Eq. (5) that is independent of the nodes \{v_i\}_{i=1}^{n}.

3. 
Our sampler is parameterized and trainable for explicit variance reduction. The samplers of GraphSAGE and FastGCN involve no parameters and are not adaptive for minimizing the variance. In contrast, our sampler modifies the optimal importance-sampling distribution with a self-dependent function, and the resulting variance is explicitly reduced by fine-tuning the network and the sampler.

Taking attention into account. The GAT model [13] applies the idea of self-attention to graph representation learning. Concisely, it replaces the re-normalization of the adjacency matrix in Eq. (1) with specific attention values, i.e.,

h^{(l+1)}(v_i) = \sigma\Big( \sum_{j=1}^{N} a\big( h^{(l)}(v_i), h^{(l)}(u_j) \big)\, h^{(l)}(u_j)\, W^{(l)} \Big),

where a(h^{(l)}(v_i), h^{(l)}(u_j)) measures the attention value between the hidden features of v_i and u_j, derived as a(h^{(l)}(v_i), h^{(l)}(u_j)) = SoftMax(LeakyReLU(W_1 h^{(l)}(v_i), W_2 h^{(l)}(u_j))) using the LeakyReLU nonlinearity and SoftMax normalization with parameters W_1 and W_2.

It is impractical to apply the GAT-like attention mechanism directly in our framework, as the probability p(u_j|v_i) in Eq. (9) would become related to the attention value a(h^{(l)}(v_i), h^{(l)}(u_j)), which is determined by the hidden features of the l-th layer. As discussed in § 4.2, computing the hidden features of lower layers is impossible unless the network has already been built after sampling. To solve this issue, we develop a novel attention mechanism by applying the self-dependent function similarly to Eq. (9). The attention is computed as

a(x(v_i), x(u_j)) = \frac{1}{n}\, \text{ReLU}\big( W_1 g(x(v_i)) + W_2 g(x(u_j)) \big),   (14)

where W_1 and W_2 are learnable parameters.

7 Experiments

We evaluate the performance of our methods on the following benchmarks: (1) categorizing academic papers in the citation network datasets, i.e., Cora, Citeseer and Pubmed [11]; (2) predicting which

Figure 2: The accuracy curves of test data on Cora, Citeseer and Reddit. 
Here, one training epoch means a complete pass over all training samples.

community different posts belong to in Reddit [3]. These graphs vary in size from small to large. In particular, the numbers of nodes in Cora and Citeseer are of scale O(10^3), while Pubmed and Reddit contain more than 10^4 and 10^5 vertices, respectively. Following the supervised learning scenario of FastGCN [21], we use all labels of the training examples for training. More details of the benchmark datasets and further experimental evaluations are presented in the supplementary material.

Our sampling framework is inductive in the sense that it clearly separates test data from training. In contrast to transductive learning, where all vertices must be provided, our approach aggregates information from each node's neighborhoods to learn structural properties that generalize to unseen nodes. For testing, the embedding of a new node may be either computed by using the full GCN architecture or approximated through sampling, as is done in model training. Here we use the full architecture, as it is more straightforward and easier to implement. For all datasets, we employ networks with two hidden layers, as usual. The hidden dimensions for the citation network datasets (i.e., Cora, Citeseer and Pubmed) are set to 16. For the Reddit dataset, the hidden dimensions are set to 256, as suggested by [3]. The numbers of sampled nodes for all layers excluding the top one are set to 128 for Cora and Citeseer, 256 for Pubmed and 512 for Reddit. The sizes of the top layer (i.e., the stochastic mini-batch size) are chosen to be 256 for all datasets. We train all models using early stopping with a window size of 30, as suggested by [9], and report the results corresponding to the best validation accuracies. 
Further details on the network architectures and training settings are contained in the supplementary material.

7.1 Ablation Studies on the Adaptive Sampling

Baselines. The codes of GraphSAGE [3] and FastGCN [21] provided by the authors are implemented inconsistently; here we re-implement them within our framework to make the comparisons fairer³. In detail, we implement the GraphSAGE method by applying the node-wise strategy with a uniform sampler in Eq. (3), where the number of sampled neighbors for each node is set to 5. For FastGCN, we adopt the independent-and-identically-distributed (IID) sampler proposed by [21] in Eq. (5), where the number of sampled nodes per layer is the same as in our method. For consistency, the re-implementations of GraphSAGE and FastGCN are named Node-Wise and IID in our experiments. We also implement the Full GCN architecture as a strong baseline. All compared methods share the same network structure and training settings for fair comparison. We also apply the attention mechanism introduced in § 6 to all methods.

Comparisons with other sampling methods. The random seeds are fixed and no early stopping is used for the experiments here. Figure 2 reports the convergence behaviors of all compared methods during training on Cora, Citeseer and Reddit⁴. It demonstrates that our method, denoted Adapt, converges faster than the other sampling counterparts on all three datasets. Interestingly, our method even outperforms the Full model on Cora and Reddit. Similar to our method, the IID sampling is also layer-wise, but it constructs each layer independently. Thanks to the conditional sampling, our method achieves a more stable convergence curve than the IID method, as shown in Figure 2. It turns out that considering the between-layer information helps in stability and accuracy.

Moreover, we report the training time in Figure 3 (a). 
Clearly, all sampling methods run faster than the Full model. Compared to the Node-Wise method, our approach exhibits a higher training speed due to

³ We also perform experimental comparisons using the public codes of FastGCN in the supplementary material.
⁴ The results on Pubmed are provided in the supplementary material.

Table 1: Accuracy comparisons with state-of-the-art methods.

Methods         | Cora            | Citeseer        | Pubmed          | Reddit
KLED [25]       | 0.8229          | -               | 0.8228          | -
2-hop DCNN [18] | 0.8677          | -               | 0.8976          | -
FastGCN [21]    | 0.8500          | 0.7760          | 0.8800          | 0.9370
GraphSAGE [3]   | 0.8220          | 0.7140          | 0.8710          | 0.9432
Full            | 0.8664 ± 0.0011 | 0.7934 ± 0.0026 | 0.9022 ± 0.0008 | 0.9568 ± 0.0069
IID             | 0.8611 ± 0.0437 | 0.7387 ± 0.0078 | 0.8200 ± 0.0114 | 0.9449 ± 0.0026
Node-Wise       | 0.8506 ± 0.0048 | 0.7734 ± 0.0081 | 0.9002 ± 0.0017 | 0.9501 ± 0.0047
Adapt (no vr)   | 0.8202 ± 0.0133 | 0.7942 ± 0.0022 | 0.9060 ± 0.0024 | 0.8588 ± 0.0062
Adapt           | 0.8744 ± 0.0034 | 0.7966 ± 0.0018 | 0.9060 ± 0.0016 | 0.9627 ± 0.0032

Figure 3: (a) Training time per epoch on Pubmed and Reddit. (b) Accuracy curves of testing data on Cora for our Adapt method and its variant with skip connections added.

the more compact architecture. To show this, suppose the number of nodes in the top layer is n; then for the Node-Wise method the input, hidden and top layers are of sizes 25n, 5n and n, respectively, while the numbers of nodes in all layers are n for our model. 
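The layer-size comparison above can be checked with a few lines (a sketch; the fan-out of 5 matches the per-node neighborhood size used for the Node-Wise baseline, and the helper names are ours):

```python
def node_wise_sizes(n, fanout, depth):
    # layer sizes from top to input when each node samples `fanout` neighbors
    return [n * fanout ** l for l in range(depth + 1)]

def layer_wise_sizes(n, depth):
    # a fixed budget of n sampled nodes per layer, independent of depth
    return [n for _ in range(depth + 1)]

n = 256                                         # top-layer (mini-batch) size
print(node_wise_sizes(n, fanout=5, depth=2))    # [256, 1280, 6400] -> n, 5n, 25n
print(layer_wise_sizes(n, depth=2))             # [256, 256, 256]  -> n, n, n
```

The exponential versus constant growth of these two lists is exactly the O(n^d) versus O(nd) contrast discussed in § 4.1.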
Even with fewer sampled nodes, our model still surpasses the Node-Wise method, as the results in Figure 2 show.

How important is the variance reduction? To justify the importance of the variance reduction, we implement a variant of our model by setting the trade-off parameter in Eq. (10) to λ = 0. In this case, the parameters of the self-dependent function are randomly initialized and never trained. Figure 2 shows that removing the variance loss does decrease the accuracy of our method on Cora and Reddit. For Citeseer, the effect of removing the variance reduction is less significant. We conjecture that this is because the average degree of Citeseer (i.e. 1.4) is smaller than that of Cora (i.e. 2.0) and Reddit (i.e. 492), so the variance penalty matters less given the limited diversity of neighborhoods.

Comparisons with other state-of-the-art methods. We contrast the performance of our methods with the graph kernel method KLED [25] and the diffusion-convolutional neural network (DCNN) [18]. We use the results of KLED and DCNN on Cora and Pubmed as reported in [18]. We also summarize the results of GraphSAGE and FastGCN obtained with their original implementations. For GraphSAGE, we report the results of the mean aggregator with the default parameters. For FastGCN, we directly use the results provided by [21]. For the baselines and our approach, we run the experiments with random seeds over 20 trials and record the mean accuracies and standard deviations. All results are collected in Table 1. As expected, our method achieves the best performance on all datasets, which is consistent with the results in Figure 2. It is also observed that removing the variance reduction decreases the performance of our method, especially on Cora and Reddit.

7.2 Evaluations of the Skip Connection

We evaluate the effectiveness of the skip connection on Cora. For the experiments on other datasets, we present the details in the supplementary material.
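As a reference point for the comparison that follows, the re-normalized matrix Â and its explicit 2-hop expansion Â + Â² can be formed as below (a minimal dense NumPy sketch under our own naming; a real implementation would use sparse matrices):

```python
import numpy as np

def renormalized_adj(A):
    """GCN re-normalization trick: A_hat = D^{-1/2} (A + I) D^{-1/2},
    where D is the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_hop_adj(A_hat):
    """Explicit 2-hop variant: replace A_hat with A_hat + A_hat^2.
    The matrix square here is exactly the cost that the skip connection
    avoids on large and dense graphs."""
    return A_hat + A_hat @ A_hat
```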
The original network has two hidden layers. We further add a skip connection between the input and top layers, using the computations in Eq. (12) and Eq. (13). Figure 3 (b) displays the convergence curves of the original Adapt method and its variant with the skip connection, where the random seeds are shared and no early stopping is adopted. Although the improvement brought by the skip connection is modest in terms of final accuracy, it speeds up the convergence significantly: adding the skip connection reduces the number of epochs required to converge from around 150 to 100.

Table 2: Testing accuracies on Cora.

Adapt              Adapt+sc           Adapt+2-hop
0.8744 ± 0.0034    0.8774 ± 0.0032    0.8814 ± 0.0017

We run experiments with different random seeds over 20 trials and report in Table 2 the mean results obtained with early stopping. It is observed that the skip connection slightly improves the performance. Besides, we explicitly involve 2-hop neighborhood sampling in our method by replacing the re-normalization matrix Â with its second-order expansion, i.e. Â + Â². As displayed in Table 2, the explicit 2-hop sampling further boosts the classification accuracy. Although the skip-connection method is slightly inferior to the explicit 2-hop sampling, it avoids the computation of Â² and is thus more economical for large and dense graphs.

8 Conclusion

We present a framework to accelerate the training of GCNs by developing a sampling method that constructs the network layer by layer. The developed layer-wise sampler is adaptive and enables explicit variance reduction.
Extensive experiments show that our method outperforms its sampling-based counterparts, GraphSAGE and FastGCN, in both efficiency and accuracy. We also explore how to preserve the second-order proximity by using the skip connection. The experimental evaluations demonstrate that the skip connection further enhances our method in terms of both convergence speed and final classification accuracy.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.

[3] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.

[4] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6533–6542, 2017.

[5] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.

[6] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[7] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077.
International World Wide Web Conferences Steering Committee, 2015.

[8] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[9] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[10] Felipe Petroski Such, Shagan Sah, Miguel Alexander Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan D Cahill, and Raymond Ptucha. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(6):884–896, 2017.

[11] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

[12] Wei Liu, Jun Wang, and Shih-Fu Chang. Robust and scalable graph-based semisupervised learning. Proceedings of the IEEE, 100(9):2624–2638, 2012.

[13] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[14] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[15] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

[16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[17] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[18] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.

[19] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.

[20] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3, 2017.

[21] Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

[22] Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning, 2018.

[23] Art B. Owen. Monte Carlo Theory, Methods and Examples. 2013.

[24] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[25] François Fouss, Kevin Françoisse, Luh Yen, Alain Pirotte, and Marco Saerens. An experimental investigation of graph kernels on a collaborative recommendation task.
In Proceedings of the 6th International Conference on Data Mining (ICDM 2006), pages 863–868, 2006.