NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 8876
Title: Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology

Reviewer 1


This paper presents a theoretical analysis of the representational power of graph convolutional network (GCN) models. It shows that these models are in many cases incapable of learning topological features of the graph such as graph moments, and these shortcomings are used to motivate new GCN architectures that combine several propagation rules and incorporate multiple residual connections. The analysis seems to be sound (albeit distant from my area of expertise), and the obtained empirical results seem to support it. I believe that this paper could serve to improve understanding of GCNs' representational power, and therefore I would vote for (marginal) acceptance.

I have a few comments that could be used to improve the work:
- In its current form, it is somewhat unclear what "GCN" exactly refers to. The authors seem to use it interchangeably for the GCN model of Kipf and Welling and for convolutional layers on graphs more generally. I would recommend a thorough pass through the paper to make clear what the authors are referring to at different points.
- The point that multiple propagation models can ease learning has already been made empirically in related work. For example, the graph attention network (GAT) paper (ICLR 2018) finds it critical to use multi-head attention on a graph to achieve strong performance. This effect is probably more pronounced in attention-based models, where the propagation model itself is learnable, but it might be worth positioning the proposed architecture in the context of such works as well.
- Very similarly, the idea of using residual connections resonates heavily with the jumping knowledge network architecture (Xu et al., ICML 2018), and it would be good to position the work with respect to it as well.
- Minor: when referring to the propagation models in sect. 4, I believe the references for Hamilton et al. and Kipf and Welling are flipped with respect to their propagation models.
- One thing I found unclear in the graph classification setup is how exactly the graph-level features are computed. The propagation rule explains how to obtain node-level representations, and it is clear how these can be used for node-level classification, but how are they aggregated into a graph-level representation? Is simple mean-pooling used? I invite the authors to clarify (a minimal sketch of what I have in mind appears after this review).
- The experimental analysis is good, thorough, and seems to support the theoretical observations. I believe it could be interesting to test the method on more "real-world" graphs as well. Would it make sense to run the architecture on the standard graph classification datasets (as used in the GIN paper)?

--- After rebuttal ---
I thank the authors for constructively addressing my comments and for improving the clarity of the setup. I am still unable to fully gauge the contribution of the paper, and the lack of additional benchmarks makes it hard for me to position the proposed operator against prior work (such as GIN). Especially regarding the point the rebuttal makes about GATs, I believe the authors might be overlooking a relevant detail: each of the K attention heads is *learnable*, and thus can and often will learn K separate propagation rules. The reason for suggesting real-world benchmarks is simply to show the benefit of choosing K different "hard-coded" propagation models, as the authors do here, as opposed to learning them (as GAT/GIN do).
I still think the work contains a worthy theoretical contribution on an overlooked problem, and I will retain my score.
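A minimal sketch of the kind of graph-level readout I have in mind, assuming node representations have already been produced by the GCN layers; the function name and shapes are hypothetical illustrations, not the authors' actual implementation.

```python
import numpy as np

def mean_pool_readout(node_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-node representations into one graph-level vector.

    node_embeddings: array of shape (num_nodes, hidden_dim), e.g. the output
    of the last GCN layer. Mean-pooling is one common permutation-invariant
    readout; sum- or max-pooling are drop-in alternatives.
    """
    return node_embeddings.mean(axis=0)

# Hypothetical usage: pool 16-dim embeddings of a 5-node graph, then feed the
# pooled vector to a graph-level classifier.
h = np.random.randn(5, 16)
graph_repr = mean_pool_readout(h)  # shape (16,)
```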

Reviewer 2


This paper studies an important problem: the representation power of GNNs when learning graph topology. The paper first introduces graph moments, a key indicator for characterizing the random process of graph generation. It then shows that a particular variant of GNN with a linear activation function and the graph filter D^{-1}A has issues in learning the degree of a graph. The authors attribute this limitation to the requirement of node permutation invariance and then propose a new modular GCN design. Some theoretical analysis is also provided regarding graph moments, and some case studies are shown to justify the claims.

In general, this paper studies an important problem - understanding the representation power of GNNs - since the GNN area has been growing fast in recent years. However, there are severe limitations to this paper:

1) It looks to me that the authors try to use GNNs for graph generation, in which graph moments play a key role in characterizing the process. The authors point out that a particular GNN variant with a linear activation function, the graph filter D^{-1}A, and no node attributes has limitations in learning graph moments. However, the failure of this particular choice does not mean that no GNN can do it. Moreover, the chosen GNN variant seems like a very uncommon one.

2) The authors try to connect the failure of this GNN variant in learning graph moments with the property that GNNs need to enforce node permutation invariance, and it seems to me they claim that \hat{A} = D^{-1}A causes this issue. However, I do not see how this conclusion is obtained. Not to mention that no GNN paper uses D^{-1}A (they use D^{-1/2}AD^{-1/2}); why would D^{-1}A be associated with node permutation invariance? In fact, node permutation invariance is typically enforced by the node aggregation function (well discussed in the GraphSAGE paper).

3) In lines 181-182, the authors list three popular GCN modules and the papers that used them (GCN, GraphSAGE, and GGNN). However, I did not see those papers use three different graph operations; instead they only used f3 = D^{-1/2}AD^{-1/2} (the three operators are sketched after this review). I am a little surprised how the authors reach this conclusion. For the proposed GCN module, I do not see why these designs help learn graph moments. Also, this design is just a simple variant of the original GNN, and I do not see anything novel here; these designs have been used in many application papers.

4) The experimental evaluation is weak. There are many graph generation papers in the literature, both conventional ones and recent GNN papers, but I did not see any baselines.
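For concreteness, a minimal sketch of the three propagation operators I am referring to; only f3 = D^{-1/2}AD^{-1/2} is stated explicitly above, so taking f1 = D^{-1}A and f2 = AD^{-1} as the other two is my assumption about the paper's choices.

```python
import numpy as np

def propagation_operators(A: np.ndarray):
    """Return three common normalizations of an adjacency matrix A.

    f1: row-normalized (random-walk) operator  D^{-1} A
    f2: column-normalized operator             A D^{-1}
    f3: symmetrically normalized operator      D^{-1/2} A D^{-1/2}
    """
    deg = A.sum(axis=1)
    D_inv = np.diag(1.0 / np.maximum(deg, 1e-12))            # guard against isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    f1 = D_inv @ A
    f2 = A @ D_inv
    f3 = D_inv_sqrt @ A @ D_inv_sqrt
    return f1, f2, f3
```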

Reviewer 3


GCNs are widely used and well studied in both theoretical and application works. Although this work is not the first to analyze the representation power of GCNs in learning graph topology, it still brings some fresh air to the field of graph neural networks. It has potential impact on researchers working on GCNs and related applications, as a guideline for how to design GCN architectures to improve their representation power. The paper is well written and easy to follow. Experiments demonstrate that the proposed modular design can increase the representation power of GCNs.

Reviewer 4


Summary
=======
The authors investigate the capability of graph convolutional networks (GCNs) to approximate graph moments, where a graph moment is a polynomial function of the graph's adjacency matrix (a minimal sketch of one such moment appears at the end of this review). In a theoretical section they analyze necessary conditions for GCNs to be capable of learning such moments. The main insights are: (1) an exact depth configuration, i.e., a deep enough architecture, is necessary to learn a graph moment of a certain order (degree of the polynomial); (2) residual connections also allow learning moments of lower order than the maximum order achievable with the selected architecture's depth. In light of these considerations, the authors propose a unified residual architecture combining three different message passing paradigms. Experimentally, the authors conduct an ablation study in which they evaluate the impact of depth, width, and choice of activation function on the performance of regressing different graph moments. The theoretical findings support the results of those experiments. In another experimental part, the authors evaluate similar architectural design choices in a classification setup on synthetic data (i.e., random graph generation).

Originality
===========
The theoretical results are interesting and original.

Quality
=======
Except for one issue, the overall quality of this work is good; it could, however, be polished a little more to improve readability and clarity (see below). The main concern is the lack of related work concerning graph moments. Are they introduced by the authors? If not, state where they were originally defined. Do they have applications so far? Given that this work is built around graph moments, this has to be handled one way or the other.

Clarity
=======
Notation: the mathematical notation needs some improvement. The main concern is the lack of an explicit introduction of notational conveniences, which makes things unnecessarily hard to understand, e.g.:
1) p3 l112: what is M_p? Shouldn't this be M_p(A)?
2) Sometimes you use an index such as in f(A)_i; I think you mean the i-th coordinate function?
3) The moments are matrix valued, right?
4) p3 l92: why \hat{A} twice?
5) (other things of a similar nature)

Contribution: it seems that the theoretical contribution is the core of this work. However, the proposed unified layer (using three message passing paradigms simultaneously) and the introduction of residual connections are very strongly (over?) stated and even get an experimental section. I do not understand the motivation to present this here as a contribution. For me, it is not clear how the theoretical findings motivate gathering several message passing paradigms in one *module*.

Significance
============
The work is potentially significant, even more so if the importance of moments for graphs could be stated more clearly (see the related-work comment under Quality).
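To make the object under discussion concrete, a minimal sketch of node-level graph moments, assuming the definition M_p(A)_i = (A^p \mathbf{1})_i (the number of length-p walks starting at node i); the paper's exact formulation may differ, so this is an illustration rather than the authors' definition.

```python
import numpy as np

def graph_moment(A: np.ndarray, p: int) -> np.ndarray:
    """Node-level graph moment of order p (assumed definition).

    M_p(A)_i = sum_j (A^p)_{ij}, i.e. the number of length-p walks starting
    at node i. For p = 1 this reduces to the node degree.
    """
    Ap = np.linalg.matrix_power(A, p)
    return Ap.sum(axis=1)

# Hypothetical usage on a 3-node path graph: order-1 moments recover degrees.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(graph_moment(A, 1))  # [1. 2. 1.] - node degrees
print(graph_moment(A, 2))  # counts of length-2 walks from each node
```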