NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2372
Title:Understanding Attention and Generalization in Graph Neural Networks

Reviewer 1

UPDATE: I have increased the score to 6, provided the authors revise the paper as promised in their response. === This paper covers more than one topic. The first part mostly discusses the attention mechanism, the second section introduces a new model, ChebyGIN, and the third section proposes a weakly-supervised attention training approach. Overall, the paper is not entirely about what its title, "Understanding Attention in Graph Neural Networks", suggests. In 2.3 the paper says "the performance of both GCNs and GINs is quite poor and, consequently, it is also hard for the attention subnetwork to learn", and thus proposes ChebyGIN as a stronger model. In the experiments we only see a few results from non-ChebyGIN models. This raises a concern: do most of the statements and observations on attention hold only for stronger models? Or is ChebyGIN the only strong model that works well with attention? To summarize, it is concerning that only a limited number of models are used to understand attention. In the section "Why is the variance of some results so high?", the paper raises an interesting issue with attention, namely that poorly initialized attention cannot be recovered. This is an important issue, since initialization-sensitive training is something practitioners would like to avoid. However, there is no further discussion of it or attempt to solve it. For other deep learning models, for instance CNNs, initialization has been studied in several works, and the choice of random distribution can have a substantial impact on the initialization. The proposed weakly-supervised attention training borrows the attention coefficients from Eq. 6, which uses a model's prediction y_i in the computation. The paper is not clear on (or we may have overlooked) how the model predicting y_i is trained.
Is it suggesting that we first train a model in an unsupervised manner, use this first version to compute the attention coefficients, and then train the model again with these coefficients in the weakly-supervised manner? In Table 2 the variance of the results is still high enough to make the improvement insignificant, especially when comparing unsupervised vs. weakly-supervised training. It is questionable whether the proposed weakly-supervised training approach is meaningful without a good initialization.
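The significance concern could be checked directly from per-run results. The following is a minimal sketch, assuming the authors have the individual accuracies of each run for both training modes; the function name `welch_t` and the numbers below are illustrative placeholders, not values from the paper.

```python
import math
import statistics
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t-statistic for two independent samples with unequal variances.

    Hypothetical helper for illustration; each sample needs at least two runs.
    """
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    if standard_error == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Placeholder per-run scores, NOT results reported in the paper.
unsupervised = [1.0, 2.0, 3.0]
weakly_supervised = [2.0, 3.0, 4.0]

t = welch_t(weakly_supervised, unsupervised)
print(f"Welch t-statistic: {t:.4f}")
```

The statistic would still need to be compared against a t-distribution with the Welch-Satterthwaite degrees of freedom to obtain a p-value; that step is omitted here.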

Reviewer 2

The authors extensively cite the prior work that they are extending here and place their work in context. The idea of pruning nodes on the basis of the attention scores they obtain is novel compared to previous work, as is supervising said attention (at least in the context of graph neural networks). The synthetic datasets provide reasonable test cases for testing future algorithms in this domain. I have a few questions that need to be resolved and/or explained in more detail. First, is the attention only used to prune the nodes in the graph, or are the representations passed to the subsequent layers of the model weighted by the attention scores? If it is the first case, then how is the loss function of the final task backpropagated through the layer? That is, if the input to the next layer is X[X.alpha < threshold], then I don't see an obvious way of passing the gradient through the alpha values. If it is the second case, then I would assume the representations passed to subsequent layers are on a reduced scale (which depends on how peaky or uniform the attention is). In general, the authors do a good job of evaluating their design decisions. Some examples are: 1) Generalisation performance: in the Colors and Triangles tasks, they test on larger graphs than those appearing in the training set. On the Colors and MNIST data, they test on inputs whose feature space is extended to unseen colors. A question I have here is why GIN performs better on the Colors task while ChebyGIN performs better on the Triangles task. Please also clarify how AUROC is being used to quantify attention correctness. Do we consider attention as providing a binary decision about a node's importance, i.e., whether the node is important or not? If ranking needs to be evaluated, the authors might use a direct measure of rank agreement such as Spearman's rho or Kendall's tau. 2) Evaluating the GIN and ChebyGIN models on all 3 datasets with unsupervised, weakly supervised and supervised attention.
The results here show that supervised attention is required for improving performance and that unsupervised attention doesn't give any reasonable improvement over previous baselines. A question I have here is how weak supervision is generated for the model. The authors mention that weak supervision is generated by removing a node from a graph and seeing how much the model's output changes. But which model are we talking about here? Are the authors training an unsupervised attention model and generating weak supervision on the basis of that? In general, Section 3.4 needs to be expanded to detail how weak supervision is generated since in most cases, as the authors mention, we don't have ground truth attention, and from Table 1 we see that attention only helps if supervised or weakly supervised. 3) Multiple ablation studies on how attention initialization, thresholding and input dimensionality affect the model output. In the section "How results differ depending on to which layer we apply the attention model?", I am not sure which results/graphs are being used to back this claim (please clarify).
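To make the weak-supervision question under 2) concrete, here is a minimal sketch of one possible reading of the node-removal procedure. It is an assumption about what Section 3.4 intends, not the authors' implementation: the helper `node_removal_attention`, the absolute-difference score, the sum normalization, and the toy model `count_ones` are all hypothetical.

```python
from typing import Callable, List, Sequence


def node_removal_attention(
    predict: Callable[[Sequence[int]], float],
    node_features: Sequence[int],
) -> List[float]:
    """Score each node by how much the prediction changes when it is removed.

    Hypothetical helper: `predict` stands in for whichever (pre-trained?) model
    the paper uses, which is exactly the point the review asks to be clarified.
    """
    if not node_features:
        return []
    y = predict(node_features)  # prediction on the full graph
    deltas = []
    for i in range(len(node_features)):
        # Drop node i; a real graph model would also drop its edges.
        reduced = list(node_features[:i]) + list(node_features[i + 1:])
        deltas.append(abs(y - predict(reduced)))
    total = sum(deltas)
    if total == 0:
        # No node changes the output: fall back to uniform attention.
        return [1.0 / len(deltas)] * len(deltas)
    return [d / total for d in deltas]  # normalize so the scores sum to 1


def count_ones(nodes: Sequence[int]) -> float:
    """Toy stand-in model: counts the nodes whose feature equals 1."""
    return float(sum(1 for x in nodes if x == 1))


alpha = node_removal_attention(count_ones, [1, 0, 1, 0, 0])
print(alpha)  # → [0.5, 0.0, 0.5, 0.0, 0.0]
```

Under this reading, the quality of the resulting targets depends entirely on the model passed as `predict`, which is why the paper should state how that model is trained.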

Reviewer 3

This paper studies the limitations of the attention mechanism in GNNs when conducting graph classification tasks. It does so through an empirical study on graphs with ground truth attention, graphs with Gaussian noise, and graphs with unseen node features. Overall, it is an interesting work with empirical analysis, while the technical contribution is limited. It provides insights and interesting discussion on the capability of the attention mechanism in GNNs; for example, the main strength of attention over nodes in GNNs is the ability to generalize to more complex or noisy graphs at test time. Also, the factors influencing the performance of GNNs with attention are: initialization of the attention model, strength of the main GNN model, and other hyperparameters of the attention and GNN models. The authors finally suggest that GNNs with supervised training of attention are significantly more accurate and robust, even with weakly-supervised training, which matters since ground truth attention is often not available. The supervised and weakly-supervised strategies they introduce require setting important hyperparameters, which is not discussed. Most of the results are reported on datasets which have ground truth attention values. Only Table 2 shows evaluation on other real-world datasets, without comparison to GIN and GCN, which had good performance in Table 1.