{"title": "Graph Transformer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 11983, "page_last": 11993, "abstract": "Graph neural networks (GNNs) have been widely used in representation learning on graphs and achieved state-of-the-art performance in tasks such as node classification and link prediction. However, most existing GNNs are designed to learn node representations on the fixed and homogeneous graphs. The limitations especially become problematic when learning representations on a misspecified graph or a heterogeneous graph that consists of various types of nodes and edges. In this paper, we propose Graph Transformer Networks (GTNs) that are capable of generating new graph structures, which involve identifying useful connections between unconnected nodes on the original graph, while learning effective node representation on the new graphs in an end-to-end fashion. Graph Transformer layer, a core layer of GTNs, learns a soft selection of edge types and composite relations for generating useful multi-hop connections so-call meta-paths. Our experiments show that GTNs learn new graph structures, based on data and tasks without domain knowledge, and yield powerful node representation via convolution on the new graphs. Without domain-specific graph preprocessing, GTNs achieved the best performance in all three benchmark node classification tasks against the state-of-the-art methods that require pre-defined meta-paths from domain knowledge.", "full_text": "Graph Transformer Networks\n\nSeongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang\u2217 , Hyunwoo J. Kim\u2217\n\nDepartment of Computer Science and Engineering\n\n{ysj5419, minbyuljeong, raehyun, kangj, hyunwoojkim}@korea.ac.kr\n\nKorea University\n\nAbstract\n\nGraph neural networks (GNNs) have been widely used in representation learning on\ngraphs and achieved state-of-the-art performance in tasks such as node classi\ufb01cation\nand link prediction. 
However, most existing GNNs are designed to learn node\nrepresentations on the \ufb01xed and homogeneous graphs. The limitations especially\nbecome problematic when learning representations on a misspeci\ufb01ed graph or\na heterogeneous graph that consists of various types of nodes and edges.\nIn\nthis paper, we propose Graph Transformer Networks (GTNs) that are capable of\ngenerating new graph structures, which involve identifying useful connections\nbetween unconnected nodes on the original graph, while learning effective node\nrepresentation on the new graphs in an end-to-end fashion. Graph Transformer layer,\na core layer of GTNs, learns a soft selection of edge types and composite relations\nfor generating useful multi-hop connections so-called meta-paths. Our experiments\nshow that GTNs learn new graph structures, based on data and tasks without\ndomain knowledge, and yield powerful node representation via convolution on the\nnew graphs. Without domain-speci\ufb01c graph preprocessing, GTNs achieved the\nbest performance in all three benchmark node classi\ufb01cation tasks against the state-\nof-the-art methods that require pre-de\ufb01ned meta-paths from domain knowledge.\n\n1\n\nIntroduction\n\nIn recent years, Graph Neural Networks (GNNs) have been widely adopted in various tasks over\ngraphs, such as graph classi\ufb01cation [11, 21, 40], link prediction [18, 30, 42] and node classi\ufb01cation\n[3, 14, 33]. The representation learnt by GNNs has been proven to be effective in achieving state-of-\nthe-art performance in a variety of graph datasets such as social networks [7, 14, 35], citation networks\n[19, 33], functional structure of brains [20], recommender systems [1, 27, 39]. 
The underlying graph\nstructure is utilized by GNNs to operate convolution directly on graphs by passing node features\n[12, 14] to neighbors, or perform convolution in the spectral domain using the Fourier basis of a\ngiven graph, i.e., eigenfunctions of the Laplacian operator [9, 15, 19].\nHowever, one limitation of most GNNs is that they assume the graph structure to operate GNNs on is\n\ufb01xed and homogeneous. Since the graph convolutions discussed above are determined by the \ufb01xed\ngraph structure, a noisy graph with missing/spurious connections results in ineffective convolution\nwith wrong neighbors on the graph. In addition, in some applications, constructing a graph to operate\nGNNs is not trivial. For example, a citation network has multiple types of nodes (e.g., authors,\npapers, conferences) and edges de\ufb01ned by their relations (e.g., author-paper, paper-conference),\nand it is called a heterogeneous graph. A na\u00efve approach is to ignore the node/edge types and\ntreat them as in a homogeneous graph (a standard graph with one type of nodes and edges). This,\napparently, is suboptimal since models cannot exploit the type information. A more recent remedy is\nto manually design meta-paths, which are paths connected with heterogeneous edges, and transform\n\n\u2217corresponding author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fa heterogeneous graph into a homogeneous graph de\ufb01ned by the meta-paths. Then conventional\nGNNs can operate on the transformed homogeneous graphs [37, 43]. This is a two-stage approach\nand requires hand-crafted meta-paths for each problem. The accuracy of downstream analysis can be\nsigni\ufb01cantly affected by the choice of these meta-paths.\nHere, we develop Graph Transformer Network (GTN) that learns to transform a heterogeneous input\ngraph into useful meta-path graphs for each task and learn node representation on the graphs in an\nend-to-end fashion. 
GTNs can be viewed as a graph analogue of Spatial Transformer Networks [16], which explicitly learn spatial transformations of input images or features. The main challenge in transforming a heterogeneous graph into a new graph structure defined by meta-paths is that meta-paths may have arbitrary lengths and edge types. For example, author classification in citation networks may benefit from meta-paths such as Author-Paper-Author (APA) or Author-Paper-Conference-Paper-Author (APCPA). Also, citation networks are directed graphs, on which relatively few graph neural networks can operate. To address these challenges, we require a model that generates new graph structures based on composite relations connected with softly chosen edge types in a heterogeneous graph and learns node representations via convolution on the learnt graph structures for a given problem.\n\nOur contributions are as follows: (i) We propose a novel framework, Graph Transformer Networks, to learn a new graph structure that involves identifying useful meta-paths and multi-hop connections for learning effective node representation on graphs. (ii) The graph generation is interpretable and the model is able to provide insight into effective meta-paths for prediction. (iii) We prove the effectiveness of the node representation learnt by Graph Transformer Networks, resulting in the best performance against state-of-the-art methods that additionally use domain knowledge, on all three benchmark node classification tasks on heterogeneous graphs.\n\n2 Related Works\n\nGraph Neural Networks. In recent years, many classes of GNNs have been developed for a wide range of tasks. They are categorized into two approaches: spectral [5, 9, 15, 19, 22, 38] and non-spectral methods [7, 12, 14, 26, 29, 33]. Based on spectral graph theory, Bruna et al. [5] proposed a way to perform convolution in the spectral domain using the Fourier basis of a given graph. 
Kipf et al. [19] simplified GNNs using the first-order approximation of the spectral graph convolution. On the other hand, non-spectral approaches define convolution operations directly on the graph, utilizing spatially close neighbors. For instance, Veli\u010dkovi\u0107 et al. [33] apply different weight matrices for nodes with different degrees, and Hamilton et al. [14] have proposed learnable aggregator functions which summarize neighbors\u2019 information for graph representation learning.\n\nNode classification with GNNs. Node classification has been studied for decades. Conventionally, hand-crafted features have been used, such as simple graph statistics [2], graph kernels [34], and engineered features from a local neighbor structure [23]. These features are not flexible and suffer from poor performance. To overcome this drawback, node representation learning methods via random walks on graphs have recently been proposed, such as DeepWalk [28], LINE [32], and node2vec [13], with tricks from deep learning models (e.g., skip-gram), and have gained some improvement in performance. However, all of these methods learn node representations solely based on the graph structure; the representations are not optimized for a specific task. As CNNs have achieved remarkable success in representation learning, GNNs learn a powerful representation for given tasks and data. To improve performance or scalability, generalized convolution based on spectral convolution [4, 26], attention mechanisms on neighbors [25, 33], subsampling [6, 7] and inductive representation for a large graph [14] have been studied. Although these methods show outstanding results, they share a common limitation: they only deal with homogeneous graphs. However, many real-world problems often cannot be represented by a single homogeneous graph. The graphs come as heterogeneous graphs with various types of nodes and edges. 
Since most GNNs are designed for a single homogeneous graph, one simple solution is a two-stage approach: using meta-paths, which are composite relations of multiple edge types, as a preprocessing step, it converts the heterogeneous graph into a homogeneous graph and then learns representations on it. metapath2vec [10] learns graph representations using meta-path based random walks, and HAN [37] performs graph representation learning by transforming a heterogeneous graph into a homogeneous graph constructed by meta-paths. However, these approaches rely on meta-paths manually selected by domain experts and thus might not be able to capture all meaningful relations for each problem. Also, performance can be significantly affected by the choice of meta-paths. Unlike these approaches, our Graph Transformer Networks can operate on a heterogeneous graph and transform the graph for tasks while learning node representation on the transformed graphs in an end-to-end fashion.\n\n3 Method\n\nThe goal of our framework, Graph Transformer Networks, is to generate new graph structures and learn node representations on the learned graphs simultaneously. Unlike most CNNs on graphs that assume the graph is given, GTNs seek new graph structures using multiple candidate adjacency matrices to perform more effective graph convolutions and learn more powerful node representations. Learning new graph structures involves identifying useful meta-paths, which are paths connected with heterogeneous edges, and multi-hop connections. Before introducing our framework, we briefly summarize the basic concepts of meta-paths and graph convolution in GCNs.\n\n3.1 Preliminaries\n\nOne input to our framework is multiple graph structures with different types of nodes and edges. Let T v and T e be the sets of node types and edge types, respectively. 
The input graphs can be viewed as a heterogeneous graph [31] G = (V, E), where V is a set of nodes and E is a set of observed edges, with a node type mapping function fv : V \u2192 T v and an edge type mapping function fe : E \u2192 T e. Each node vi \u2208 V has one node type, i.e., fv(vi) \u2208 T v. Similarly, for eij \u2208 E, fe(eij) \u2208 T e. When |T e| = 1 and |T v| = 1, it becomes a standard graph. In this paper, we consider the case of |T e| > 1. Let N denote the number of nodes, i.e., |V |. The heterogeneous graph can be represented by a set of adjacency matrices {Ak}Kk=1 where K = |T e|, and Ak \u2208 RN\u00d7N is an adjacency matrix where Ak[i, j] is non-zero when there is a k-th type edge from j to i. More compactly, it can be written as a tensor A \u2208 RN\u00d7N\u00d7K. We also have a feature matrix X \u2208 RN\u00d7D, where each row is the D-dimensional input feature of the corresponding node.\n\nMeta-path [37], denoted by P, is a path on the heterogeneous graph G that is connected with heterogeneous edges, i.e., v1 \u2212t1\u2192 v2 \u2212t2\u2192 . . . \u2212tl\u2192 vl+1, where tl \u2208 T e denotes the l-th edge type of the meta-path. It defines a composite relation R = t1 \u25e6 t2 \u25e6 . . . \u25e6 tl between nodes v1 and vl+1, where R1 \u25e6 R2 denotes the composition of relations R1 and R2. Given the composite relation R, or the sequence of edge types (t1, t2, . . . , tl), the adjacency matrix AP of the meta-path P is obtained by the multiplication of adjacency matrices as\n\nAP = Atl . . . At2At1.    (1)\n\nThe notion of meta-path subsumes multi-hop connections, and new graph structures in our framework are represented by adjacency matrices. For example, the meta-path Author-Paper-Conference (APC), which can be represented as A \u2212AP\u2192 P \u2212PC\u2192 C, generates an adjacency matrix AAPC by the multiplication of AAP and APC.\n\nGraph Convolutional Network (GCN). 
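Equation (1) is just a chain of matrix products. A minimal numpy sketch on a toy citation graph (the toy matrices, node names, and variable names below are illustrative assumptions, not data or code from the paper):

```python
import numpy as np

# Toy heterogeneous graph: 2 authors (a0, a1), 2 papers (p0, p1), 1 conference (c0).
# Following the convention above, A_t[i, j] is non-zero when there is a
# type-t edge from node j to node i.
A_AP = np.array([[1, 0],    # p0 is written by a0
                 [1, 1]])   # p1 is written by a0 and a1
A_PC = np.array([[1, 1]])   # both papers appear at c0

# Eq. (1): multiply the edge-type adjacencies in reverse order,
# A_P = A_tl ... A_t2 A_t1; here A_APC = A_PC A_AP.
A_APC = A_PC @ A_AP
print(A_APC)  # entry [c, a] counts Author-Paper-Conference paths from a to c
```

With these toy matrices, author a0 reaches c0 through both papers, so its entry is 2, while a1 reaches c0 only through p1.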
In this work, a graph convolutional network (GCN) [19] is used to learn useful representations for node classification in an end-to-end fashion. Let H(l) be the feature representations of the lth layer in GCNs; then the forward propagation becomes\n\nH(l+1) = \u03c3(\u02dcD\u22121/2 \u02dcA \u02dcD\u22121/2 H(l)W(l)),    (2)\n\nwhere \u02dcA = A + I \u2208 RN\u00d7N is the adjacency matrix A of the graph G with added self-connections, \u02dcD is the degree matrix of \u02dcA, i.e., \u02dcDii = \u2211j \u02dcAij, and W(l) \u2208 Rd\u00d7d is a trainable weight matrix. One can easily observe that the convolution operation across the graph is determined by the given graph structure and is not learnable except for the node-wise linear transform H(l)W(l). So the convolution layer can be interpreted as the composition of a fixed convolution and an activation function \u03c3 on the graph after a node-wise linear transformation. Since we learn graph structures, our framework benefits from different convolutions, namely \u02dcD\u22121/2 \u02dcA \u02dcD\u22121/2, obtained from multiple learned adjacency matrices. The architecture will be introduced later in this section. For a directed graph (i.e., an asymmetric adjacency matrix), \u02dcA in (2) can be normalized by the inverse of the in-degree diagonal matrix D\u22121 as H(l+1) = \u03c3(\u02dcD\u22121 \u02dcAH(l)W(l)).\n\nFigure 1: Graph Transformer Layer softly selects adjacency matrices (edge types) from the set of adjacency matrices A of a heterogeneous graph G and learns a new meta-path graph represented by A(1) via the matrix multiplication of two selected adjacency matrices Q1 and Q2. 
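As a concrete reference, the propagation rule in (2) and the in-degree-normalized variant for directed graphs can be sketched in a few lines of numpy (a simplified sketch: `gcn_layer` is a hypothetical helper name, and ReLU is assumed for the activation \u03c3):

```python
import numpy as np

def gcn_layer(A, H, W, directed=False):
    """One GCN propagation step, Eq. (2): H' = sigma(norm(A + I) H W).

    For a directed (asymmetric) adjacency matrix, D^-1 * A_tilde replaces
    the symmetric D^-1/2 * A_tilde * D^-1/2 normalization.
    """
    A_tilde = A + np.eye(A.shape[0])   # add self-connections
    d = A_tilde.sum(axis=1)            # degrees (positive thanks to self-loops)
    if directed:
        A_hat = A_tilde / d[:, None]   # D^-1 A_tilde
    else:
        d_is = 1.0 / np.sqrt(d)
        A_hat = A_tilde * d_is[:, None] * d_is[None, :]  # D^-1/2 A_tilde D^-1/2
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU assumed as sigma
```

Because self-connections are added before normalization, the degrees are strictly positive and no division-by-zero guard is needed.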
The soft adjacency matrix selection is a weighted sum of candidate adjacency matrices obtained by 1 \u00d7 1 convolution with non-negative weights from softmax(W 1\u03c6).\n\n3.2 Meta-Path Generation\n\nPrevious works [37, 43] require manually defined meta-paths and perform graph neural networks on the meta-path graphs. Instead, our Graph Transformer Networks (GTNs) learn meta-paths for the given data and task and operate graph convolution on the learned meta-path graphs. This gives a chance to find more useful meta-paths and leads to virtually diverse graph convolutions using multiple meta-path graphs.\n\nThe new meta-path graph generation in the Graph Transformer (GT) layer in Fig. 1 has two components. First, the GT layer softly selects two graph structures Q1 and Q2 from the candidate adjacency matrices A. Second, it learns a new graph structure by the composition of the two relations (i.e., the matrix multiplication of the two adjacency matrices, Q1Q2).\n\nIt computes the convex combination of adjacency matrices, \u2211tl\u2208T e \u03b1(l)tl Atl in (4), by 1 \u00d7 1 convolution as in Fig. 1 with the weights from a softmax function as\n\nQ = F(A; W\u03c6) = \u03c6(A; softmax(W\u03c6)),    (3)\n\nwhere \u03c6 is the convolution layer and W\u03c6 \u2208 R1\u00d71\u00d7K is the parameter of \u03c6. This trick is similar to channel attention pooling for low-cost image/action recognition in [8]. Given two softly chosen adjacency matrices Q1 and Q2, the meta-path adjacency matrix is computed by matrix multiplication, Q1Q2. For numerical stability, the matrix is normalized by its degree matrix as A(l) = D\u22121Q1Q2.\n\nNow, we need to check whether a GTN can learn an arbitrary meta-path with respect to edge types and path length. The adjacency matrix of arbitrary length-l meta-paths can be calculated by\n\nAP = (\u2211t1\u2208T e \u03b1(1)t1 At1)(\u2211t2\u2208T e \u03b1(2)t2 At2) . . . (\u2211tl\u2208T e \u03b1(l)tl Atl),    (4)\n\nwhere AP denotes the adjacency matrix of meta-paths, T e denotes the set of edge types and \u03b1(l)tl is the weight for edge type tl at the lth GT layer. When \u03b1 is not a one-hot vector, AP can be seen as a weighted sum of all length-l meta-path adjacency matrices. So a stack of l GT layers allows learning arbitrary length-l meta-path graph structures, as in the architecture of GTN shown in Fig. 2. One issue with this construction is that adding GT layers always increases the length of the meta-paths and does not allow the original edges. In some applications, both long meta-paths and short meta-paths are important. To learn short and long meta-paths including the original edges, we include the identity matrix I in A, i.e., A0 = I. This trick allows GTNs to learn any length of meta-paths up to l + 1 when l GT layers are stacked.\n\nFigure 2: Graph Transformer Networks (GTNs) learn to generate a set of new meta-path adjacency matrices A(l) using GT layers and perform graph convolution as in GCNs on the new graph structures. Multiple node representations from the same GCN on multiple meta-path graphs are integrated by concatenation and improve the performance of node classification. Q(l)1 and Q(l)2 \u2208 RN\u00d7N\u00d7C are intermediate adjacency tensors to compute meta-paths at the lth layer.\n\n3.3 Graph Transformer Networks\n\nWe here introduce the architecture of Graph Transformer Networks. To consider multiple types of meta-paths simultaneously, the output channels of the 1\u00d71 convolution in Fig. 1 are set to C. Then, the GT layer yields a set of meta-paths, and the intermediate adjacency matrices Q1 and Q2 become adjacency tensors Q1 and Q2 \u2208 RN\u00d7N\u00d7C as in Fig. 2. 
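Putting the soft selection of Eq. (3) and the composition step together, one channel of a GT layer can be sketched as follows (a single-channel numpy sketch under stated assumptions; `gt_layer` and the weight-vector names are hypothetical, and the candidate stack is assumed to include the identity matrix as described above):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def gt_layer(A, w1, w2):
    """One single-channel Graph Transformer layer (Fig. 1).

    A      : stack of K candidate adjacency matrices, shape (K, N, N)
             (A[0] may be the identity so shorter meta-paths survive).
    w1, w2 : length-K weights of the two 1x1 convolutions.
    Returns the new meta-path adjacency matrix D^-1 Q1 Q2.
    """
    alpha1, alpha2 = softmax(w1), softmax(w2)
    Q1 = np.tensordot(alpha1, A, axes=1)  # convex combination, Eq. (3)
    Q2 = np.tensordot(alpha2, A, axes=1)
    M = Q1 @ Q2                           # compose the two relations
    d = M.sum(axis=1)                     # degree normalization D^-1 M
    d = np.where(d == 0, 1.0, d)          # guard empty rows
    return M / d[:, None]
```

With strongly peaked weights the layer picks out a single product of edge-type adjacencies; with uniform weights it blends all length-2 compositions, matching the "weighted sum of meta-paths" view of Eq. (4).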
It is beneficial to learn different node representations via multiple different graph structures. After the stack of l GT layers, a GCN is applied to each channel of the meta-path tensor A(l) \u2208 RN\u00d7N\u00d7C and the multiple node representations are concatenated as\n\nZ = \u2225Ci=1 \u03c3(\u02dcD\u22121i \u02dcA(l)i XW),    (5)\n\nwhere \u2225 is the concatenation operator, C denotes the number of channels, \u02dcA(l)i = A(l)i + I is the adjacency matrix from the ith channel of A(l), \u02dcDi is the degree matrix of \u02dcA(l)i, W \u2208 Rd\u00d7d is a trainable weight matrix shared across channels, and X \u2208 RN\u00d7d is a feature matrix. Z contains the node representations from C different meta-path graphs with variable lengths of at most l + 1. On top of Z, two dense layers followed by a softmax layer are used for node classification. Our loss function is the standard cross-entropy on the nodes that have ground truth labels. This architecture can be viewed as an ensemble of GCNs on multiple meta-path graphs learnt by GT layers.\n\n4 Experiments\n\nIn this section, we evaluate the benefits of our method against a variety of state-of-the-art models on node classification. We conduct experiments and analysis to answer the following research questions: Q1. Are the new graph structures generated by GTN effective for learning node representation? Q2. Can GTN adaptively produce a variable length of meta-paths depending on datasets? Q3. How can we interpret the importance of each meta-path from the adjacency matrix generated by GTNs?\n\nTable 1: Datasets for node classification on heterogeneous graphs.\n\nDataset    # Nodes    # Edges    # Edge types    # Features    # Training    # Validation    # Test\nDBLP       18405      67946      4               334           800           400             2857\nACM        8994       25922      4               1902          600           300             2125\nIMDB       12772      37288      4               1256          300           300             2339\n\nDatasets. 
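The readout in Eq. (5) then amounts to running the same one-layer GCN over every channel of A(l) and concatenating the results. A hedged numpy sketch (using the D^-1 normalization the paper adopts for directed graphs, with ReLU assumed for the activation; `gtn_readout` is an illustrative name, not the authors' code):

```python
import numpy as np

def gtn_readout(A_meta, X, W):
    """Eq. (5): shared one-layer GCN on each of the C learned meta-path
    graphs, node representations concatenated along the feature axis.

    A_meta : (C, N, N) meta-path adjacency tensor A^(l)
    X      : (N, d) node features;  W : (d, d) weight shared across channels
    Returns Z of shape (N, C*d).
    """
    outs = []
    for Ai in A_meta:                    # one channel = one meta-path graph
        Ai_t = Ai + np.eye(Ai.shape[0])  # add self-connections
        d = Ai_t.sum(axis=1)
        H = (Ai_t / d[:, None]) @ X @ W  # D^-1 A_tilde X W
        outs.append(np.maximum(H, 0.0))  # ReLU assumed as sigma
    return np.concatenate(outs, axis=1)  # the concatenation operator in (5)
```

In the paper, Z then feeds two dense layers and a softmax for classification; sharing W across channels is what makes the architecture read as an ensemble of GCNs over the C meta-path graphs.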
To evaluate the effectiveness of meta-paths generated by Graph Transformer Networks,\nwe used heterogeneous graph datasets that have multiple types of nodes and edges. The main task\nis node classi\ufb01cation. We use two citation network datasets DBLP and ACM, and a movie dataset\nIMDB. The statistics of the heterogeneous graphs used in our experiments are shown in Table 1.\nDBLP contains three types of nodes (papers (P), authors (A), conferences (C)), four types of edges\n(PA, AP, PC, CP), and research areas of authors as labels. ACM contains three types of nodes (papers\n(P), authors (A), subject (S)), four types of edges (PA, AP, PS, SP), and categories of papers as labels.\nEach node in the two datasets is represented as bag-of-words of keywords. On the other hand, IMDB\ncontains three types of nodes (movies (M), actors (A), and directors (D)) and labels are genres of\nmovies. Node features are given as bag-of-words representations of plots.\nImplementation details. We set the embedding dimension to 64 for all the above methods for a\nfair comparison. The Adam optimizer was used and the hyperparameters (e.g., learning rate, weight\ndecay etc.) are respectively chosen so that each baseline yields its best performance. For random\nwalk based models, a walk length is set to 100 per node for 1000 iterations and the window size\nis set to 5 with 7 negative samples. For GCN, GAT, and HAN, the parameters are optimized using\nthe validation set, respectively. For our model GTN, we used three GT layers for DBLP and IMDB\ndatasets, two GT layers for ACM dataset. We initialized parameters for 1 \u00d7 1 convolution layers\nin the GT layer with a constant value. 
Our code is publicly available at https://github.com/seongjunyun/Graph_Transformer_Networks.\n\n4.1 Baselines\n\nTo evaluate the effectiveness of representations learnt by the Graph Transformer Networks in node classification, we compare GTNs with conventional random walk based baselines as well as state-of-the-art GNN-based methods.\n\nConventional network embedding methods have been studied extensively, and recently DeepWalk [28] and metapath2vec [10] have shown predominant performance among random walk based approaches. DeepWalk is a random walk based network embedding method originally designed for homogeneous graphs; here we ignore the heterogeneity of nodes/edges and perform DeepWalk on the whole heterogeneous graph. In contrast, metapath2vec is a heterogeneous graph embedding method that performs meta-path based random walks and utilizes skip-gram with negative sampling to generate embeddings.\n\nGNN-based methods. We used GCN [19], GAT [33], and HAN [37] as GNN-based methods. GCN is a graph convolutional network which utilizes a localized first-order approximation of the spectral graph convolution designed for symmetric graphs. Since our datasets are directed graphs, we modified the degree normalization for asymmetric adjacency matrices, i.e., \u02dcD\u22121 \u02dcA rather than \u02dcD\u22121/2 \u02dcA \u02dcD\u22121/2. GAT is a graph neural network which uses an attention mechanism on homogeneous graphs. We ignore the heterogeneity of nodes/edges and perform GCN and GAT on the whole graph. HAN is a graph neural network which exploits manually selected meta-paths. This approach requires a manual transformation of the original graph into sub-graphs by connecting vertices with pre-defined meta-paths. Here, we test HAN on the selected sub-graphs whose nodes are linked with meta-paths as described in [37].\n\n4.2 Results on Node Classification\n\nEffectiveness of the representation learnt from new graph structures. Table 2 shows the performances of GTN and other node classification baselines. By analysing these results, we answer research questions Q1 and Q2. We observe that our GTN achieves the highest performance on all the datasets against all network embedding methods and graph neural network methods.\n\nTable 2: Evaluation results on the node classification task (F1 score).\n\nDataset    DeepWalk    metapath2vec    GCN      GAT      HAN      GTN\u2212I    GTN (proposed)\nDBLP       63.18       85.53           87.30    93.71    92.83    93.91    94.18\nACM        67.42       87.61           91.60    92.33    90.96    91.13    92.68\nIMDB       32.08       35.21           56.89    58.14    56.77    52.33    60.92\n\nGNN-based methods, e.g., GCN, GAT, HAN, and the GTN, perform better than random walk-based network embedding methods. Furthermore, the GAT usually performs better than the GCN. This is because the GAT can assign different weights to neighbor nodes while the GCN simply averages over neighbor nodes. Interestingly, though the HAN is a modified GAT for a heterogeneous graph, the GAT usually performs better than the HAN. This result shows that using pre-defined meta-paths as the HAN does may cause adverse effects on performance. In contrast, our GTN model achieved the best performance compared to all other baselines on all the datasets even though the GTN model uses only one GCN layer whereas GCN, GAT and HAN use at least two layers. It demonstrates that the GTN can learn a new graph structure which consists of useful meta-paths for learning more effective node representations. Also, compared to baselines such as HAN that use a simple meta-path adjacency matrix with constant edge weights, the GTN is capable of assigning variable weights to edges.\n\nIdentity matrix in A to learn variable-length meta-paths. As mentioned in Section 3.2, the identity matrix is included in the candidate adjacency matrices A. To verify the effect of the identity matrix, we trained and evaluated another model, GTN\u2212I, as an ablation study. 
GTN\u2212I has exactly the same model structure as the GTN, but its candidate adjacency matrices A do not include the identity matrix. In general, GTN\u2212I consistently performs worse than the GTN. It is worth noting that the difference is greater on IMDB than on DBLP. One explanation is that the length of the meta-paths GTN\u2212I produces is not effective on IMDB. As we stacked 3 GT layers, GTN\u2212I always produces length-4 meta-paths, whereas shorter meta-paths (e.g., MDM) are preferable in IMDB.\n\n4.3 Interpretation of Graph Transformer Networks\n\nWe examine the transformation learnt by GTNs to discuss the interpretability question Q3. We first describe how to calculate the importance of each meta-path from our GT layers. For simplicity, we assume the number of output channels is one. To avoid notational clutter, we define a shorthand notation \u03b1 \u00b7 A = \u2211k \u03b1kAk for a convex combination of the K input adjacency matrices. The lth GT layer in Fig. 2 generates an adjacency matrix A(l) for a new meta-path graph using the previous layer\u2019s output A(l\u22121) and the input adjacency matrices \u03b1(l) \u00b7 A as follows:\n\nA(l) = (D(l\u22121))\u22121 A(l\u22121) (\u2211i \u03b1(l)i Ai),    (6)\n\nwhere D(l) denotes the degree matrix of A(l), Ai denotes the input adjacency matrix for edge type i, and \u03b1i denotes the weight of Ai. Since we have two convex combinations at the first layer as in Fig. 1, we denote \u03b1(0) = softmax(W 1\u03c6) and \u03b1(1) = softmax(W 2\u03c6). In our GTN, since the meta-path tensor from the previous layer is reused for Q(l)1, we only need \u03b1(l) = softmax(W 2\u03c6) at each layer to calculate Q(l)2. 
Then, the new adjacency matrix from the lth GT layer can be written as\n\nA(l) = (D(l\u22121))\u22121 . . . (D(1))\u22121 (\u03b1(0) \u00b7 A)(\u03b1(1) \u00b7 A)(\u03b1(2) \u00b7 A) . . . (\u03b1(l) \u00b7 A)    (7)\n= (D(l\u22121))\u22121 . . . (D(1))\u22121 (\u2211t0,t1,...,tl\u2208T e \u03b1(0)t0 \u03b1(1)t1 . . . \u03b1(l)tl At0At1 . . . Atl),    (8)\n\nwhere T e denotes the set of edge types and \u03b1(l)tl is an attention score for edge type tl at the lth GT layer. So, A(l) can be viewed as a weighted sum of all meta-paths, from length 1 (original edges) to length l. The contribution of a meta-path (t0, t1, . . . , tl) is obtained by the product \u03b1(0)t0 \u03b1(1)t1 . . . \u03b1(l)tl.\n\nTable 3: Comparison with predefined meta-paths and top-ranked meta-paths by GTNs. Our model found important meta-paths that are consistent with pre-defined meta-paths between target nodes (the type of nodes with labels for node classification). Also, new relevant meta-paths between all types of nodes are discovered by GTNs.\n\nDataset    Predefined meta-paths    Top 3 by GTNs (between target nodes)    Top 3 by GTNs (all)\nDBLP       APCPA, APA               APCPA, APAPA, APA                       CPCPA, APCPA, CP\nACM        PAP, PSP                 PAP, PSP                                APAP, APA, SPAP\nIMDB       MAM, MDM                 MDM, MAM, MDMDM                         DM, AM, MDM\n\nFigure 3: Attention scores for the candidate adjacency matrices (edge types), obtained by applying the softmax function to the 1\u00d71 convolution filter W i\u03c6 (i: index of layer) in Figure 1, visualized for the DBLP (left) and IMDB (right) datasets. (a) In DBLP, the edges respectively indicate (Paper-Author), (Author-Paper), (Paper-Conference), (Conference-Paper), and the identity matrix. 
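The contribution formula suggests a simple way to rank meta-paths: enumerate edge-type sequences and multiply the per-layer attention scores, as in Eq. (8). A small illustrative sketch (the attention vectors below are made up for illustration, not learned scores from the paper, and `metapath_scores` is a hypothetical helper):

```python
import itertools
import numpy as np

def metapath_scores(alphas, edge_types):
    """Rank meta-paths by the product of per-layer attention scores.

    alphas : list [alpha^(0), ..., alpha^(l)] of length-K softmax
             distributions over edge types, one per convex combination.
    Returns (meta-path tuple, weight) pairs sorted by the weight
    prod_i alpha^(i)_{t_i}, as in Eq. (8).
    """
    scored = {}
    for path in itertools.product(range(len(edge_types)), repeat=len(alphas)):
        scored[tuple(edge_types[t] for t in path)] = float(
            np.prod([alphas[i][t] for i, t in enumerate(path)]))
    return sorted(scored.items(), key=lambda kv: -kv[1])

# Illustrative (not learned) scores over edge types [PA, AP, I]:
a0 = np.array([0.7, 0.2, 0.1])
a1 = np.array([0.1, 0.8, 0.1])
top = metapath_scores([a0, a1], ["PA", "AP", "I"])
print(top[0])  # the highest-weighted length-2 edge-type sequence
```

Because the identity matrix is one of the candidates, sequences containing "I" correspond to the shorter meta-paths discussed above.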
(b) In IMDB, the edges respectively indicate (Movie-Director), (Director-Movie), (Movie-Actor), (Actor-Movie), and the identity matrix.\n\nNow we can interpret the new graph structures learnt by GTNs. The weight \u03b1(0)t0 \u03b1(1)t1 . . . \u03b1(l)tl for a meta-path (t0, t1, . . . , tl) is an attention score and it provides the importance of the meta-path in the prediction task. In Table 3 we summarize the predefined meta-paths that are widely used in the literature and the meta-paths with high attention scores learnt by GTNs.\n\nAs shown in Table 3, between target nodes, which have class labels to predict, the meta-paths predefined by domain knowledge are consistently top-ranked by GTNs as well. This shows that GTNs are capable of learning the importance of meta-paths for tasks. More interestingly, GTNs discovered important meta-paths that are not in the predefined meta-path set. For example, in the DBLP dataset the GTN ranks CPCPA as the most important meta-path, which is not included in the predefined meta-path set. It makes sense that an author\u2019s research area (the label to predict) is relevant to the venues where the author publishes. We believe that the interpretability of GTNs provides useful insight into node classification via the attention scores on meta-paths.\n\nFig. 3 shows the attention scores of the adjacency matrices (edge types) from each Graph Transformer layer. Compared to the result on DBLP, identity matrices have higher attention scores on IMDB. As discussed in Section 3.3, a GTN is capable of learning meta-paths shorter than the number of GT layers, which are more effective on IMDB. By assigning higher attention scores to the identity matrix, the GTN tries to stick to shorter meta-paths even in the deeper layers. 
This result demonstrates that GTNs have the ability to adaptively learn the most effective meta-path length depending on the dataset.

5 Conclusion

We proposed Graph Transformer Networks for learning node representations on a heterogeneous graph. Our approach transforms a heterogeneous graph into multiple new graphs defined by meta-paths with arbitrary edge types and arbitrary lengths up to one less than the number of Graph Transformer layers, while it learns node representations via convolution on the learnt meta-path graphs. The learnt graph structures lead to more effective node representations, resulting in state-of-the-art performance, without any predefined meta-paths from domain knowledge, on all three benchmark node classification tasks on heterogeneous graphs. Since our Graph Transformer layers can be combined with existing GNNs, we believe that our framework opens up new ways for GNNs to optimize graph structures by themselves, operating convolution depending on data and tasks without any manual effort. Interesting future directions include studying the efficacy of GT layers combined with different classes of GNNs rather than GCNs. Also, as several heterogeneous graph datasets have recently been studied for other network analysis tasks, such as link prediction [36, 41] and graph classification [17, 24], applying our GTNs to these other tasks is another interesting future direction.

6 Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1G1A1100626, NRF-2016M3A9A7916996, NRF-2017R1A2A1A17069645).

References

[1] R. v. d. Berg, T. N. Kipf, and M. Welling. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263, 2017.

[2] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.

[3] S. Bhagat, G.
Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages 115–148. Springer, 2011.

[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.

[6] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. arXiv preprint arXiv:1710.10568, 2017.

[7] J. Chen, T. Ma, and C. Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018.

[8] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. A^2-Nets: Double attention networks. In Advances in Neural Information Processing Systems, pages 352–361, 2018.

[9] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[10] Y. Dong, N. V. Chawla, and A. Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD '17, pages 135–144, 2017.

[11] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1263–1272, 2017.

[13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks.
In Proceedings of\nthe 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\n2016.\n\n[14] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs.\n\nCoRR, abs/1706.02216, 2017.\n\n[15] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data.\n\nCoRR, abs/1506.05163, 2015.\n\n[16] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in\n\nneural information processing systems, pages 2017\u20132025, 2015.\n\n[17] R. Kim, C. H. So, M. Jeong, S. Lee, J. Kim, and J. Kang. Hats: A hierarchical graph attention\n\nnetwork for stock movement prediction, 2019.\n\n[18] T. N. Kipf and M. Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep\n\nLearning, 2016.\n\n[19] T. N. Kipf and M. Welling. Semi-supervised classi\ufb01cation with graph convolutional networks.\n\nIn International Conference on Learning Representations (ICLR), 2017.\n\n[20] S. I. Ktena, S. Parisot, E. Ferrante, M. Rajchl, M. C. H. Lee, B. Glocker, and D. Rueckert.\nDistance metric learning using graph convolutional networks: Application to functional brain\nnetworks. CoRR, 2017.\n\n[21] J. Lee, I. Lee, and J. Kang. Self-attention graph pooling. CoRR, 2019.\n\n[22] T. Lei, W. Jin, R. Barzilay, and T. Jaakkola. Deriving neural architectures from sequence and\ngraph kernels. In Proceedings of the 34th International Conference on Machine Learning-\nVolume 70, pages 2024\u20132033, 2017.\n\n[23] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of\n\nthe American society for information science and technology, 58(7):1019\u20131031, 2007.\n\n[24] H. Linmei, T. Yang, C. Shi, H. Ji, and X. Li. Heterogeneous graph attention networks for\nsemi-supervised short text classi\ufb01cation. In Proceedings of the 2019 Conference on Empirical\nMethods in Natural Language Processing (EMNLP), 2019.\n\n[25] Z. Liu, C. Chen, L. Li, J. 
Zhou, X. Li, L. Song, and Y. Qi. GeniePath: Graph neural networks with adaptive receptive paths. arXiv preprint arXiv:1802.00910, 2018.

[26] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. CoRR, abs/1611.08402, 2016.

[27] F. Monti, M. Bronstein, and X. Bresson. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pages 3697–3707, 2017.

[28] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.

[29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.

[30] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607, 2018.

[31] C. Shi, Y. Li, J. Zhang, Y. Sun, and S. Y. Philip. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1):17–37, 2016.

[32] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In WWW, 2015.

[33] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

[34] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

[35] D. Wang, P. Cui, and W. Zhu. Structural deep network embedding.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1225–1234, 2016.

[36] X. Wang, X. He, Y. Cao, M. Liu, and T.-S. Chua. KGAT: Knowledge graph attention network for recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 950–958, 2019.

[37] X. Wang, H. Ji, C. Shi, B. Wang, P. Cui, P. Yu, and Y. Ye. Heterogeneous graph attention network. CoRR, abs/1903.07293, 2019.

[38] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng. Graph wavelet neural network. In International Conference on Learning Representations, 2019.

[39] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.

[40] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. CoRR, abs/1806.08804, 2018.

[41] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 793–803, 2019.

[42] M. Zhang and Y. Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165–5175, 2018.

[43] Y. Zhang, Y. Xiong, X. Kong, S. Li, J. Mi, and Y. Zhu. Deep collective classification in heterogeneous information networks.
In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 399–408, 2018.