Paper ID: | 6458 |
---|---|
Title: | Graph Transformer Networks |

The paper is clearly written. The interpretation of the Graph Transformer Network (Section 4.3) provides useful insight into the model design. The experiments are somewhat limited, as the method is evaluated on only two citation datasets and one movie dataset. It would be valuable to report results on additional datasets from other domains.

The paper addresses the construction of appropriate models for heterogeneous graphs, in which nodes and arcs are of different types. This is a fundamental topic for the broader development of Graph Neural Networks. The paper shows how to generate new graph structures and how to learn node representations on those structures. The proposed Graph Transformer Network (GTN) learns to transform a heterogeneous input graph into useful meta-path graphs for each task. Overall, the paper reports remarkable results and tackles a problem that is common in many interesting applications. This reviewer very much liked the idea behind meta-paths and the central idea behind the "attention scores". The presentation is acceptable, but it is hard for the curious reader to grasp a clear foundation for the model from it. Here are some questions that I would very much like to see addressed:

1. Page 3, line 132: Directed graphs are supposedly handled simply by changing Eq. 2 so that there is no right-multiplication by \tilde{D}^{-1/2}. As far as I understand, this gives rise to the same computation as for undirected graphs in any case. However, for directed graphs there are data-flow computational schemes that make it possible to determine a state-based representation of the graph. Basically, the same feedforward computation that takes place in the neural map can be carried out at the data level in directed graphs.
2. While learning the attention scores is a distinctive feature of the paper, the nature of their convex combination is a possible concern, since it requires the attention scores to sum to one. Looking at Eq. 8 (page 7), the product of the \alpha terms is worrying, because each term is well below one. Isn't there a critical vanishing-gradient problem?
3. Page 4, line 157 and page 7, line 235 refer to the idea of "choosing" A_0 = I (the identity matrix). It is claimed that this yields better results. The explanation on page 7 is not fully satisfactory. The identity matrix has an obvious meaning in terms of graph structure, but what I am missing is the reason why it should be included in the generation of the meta-paths. Could this be connected to my previous comment (2)?
4. Eq. (5) is supposed to express, through Z, the node representation emerging from different meta-paths. Why does the D matrix only left-multiply the adjacency matrix? Maybe I am missing an important technical detail, since my question 1 on directed graphs also seems to concern a related issue.
5. The authors might also better discuss, and give references to, earlier attempts to deal with heterogeneous graphs characterized by different types of nodes and arcs. Since the early studies on GNNs, it has been recognized that the stationarity of GNNs can be overcome by using different sets of parameters associated with different node and edge data types.
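To make the concern in question 2 concrete, here is a minimal numerical sketch. It is not the authors' implementation. It assumes that each layer applies a softmax over the edge types and that the score of a meta-path is the product of the per-layer scores, which is how I read Eq. 8. The function name `metapath_weights`, the logit values, and the choice of 4 edge types and 3 layers are all hypothetical and chosen only for illustration.

```python
import itertools

import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def metapath_weights(logits):
    """Map every edge-type sequence to the product of its per-layer scores.

    logits: array of shape (num_layers, num_edge_types); hypothetical
    stand-in for the learned selection parameters of each layer.
    """
    alphas = np.array([softmax(row) for row in logits])
    num_layers, num_types = alphas.shape
    weights = {}
    for path in itertools.product(range(num_types), repeat=num_layers):
        # Score of this meta-path: one attention score per layer, multiplied.
        weights[path] = float(np.prod([alphas[l, t] for l, t in enumerate(path)]))
    return weights


# Assumed setting: 3 stacked layers, 4 edge types, uniform scores.
uniform = metapath_weights(np.zeros((3, 4)))
# Assumed setting: every layer strongly prefers edge type 0.
peaked = metapath_weights(np.array([[4.0, 0.0, 0.0, 0.0]] * 3))

print("number of meta-paths:", len(uniform))
print("largest weight (uniform):", max(uniform.values()))  # (1/4) ** 3
print("largest weight (peaked):", max(peaked.values()))
print("sum of all weights (uniform):", sum(uniform.values()))
```

Under uniform scores every individual meta-path weight shrinks as (1/K)^l with K edge types and l layers, even though the weights over all meta-paths still sum to one. This is the regime in which I would expect small gradients. Note also that if the identity is included as one of the edge types, any sequence containing it reduces to a shorter matrix product, which may be relevant to question 3.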

Originality: the proposed idea is inspired by spatial transformer networks. Clarity: the notation in this paper is somewhat confusing, especially in Sec. 3.1, which describes heterogeneous graphs and meta-paths. Significance: the proposed method provides a good way to interpret the generation of meta-paths and helps in understanding the problem.