Review for NeurIPS paper: Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks

NeurIPS 2020

Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks

Review 1

Summary and Contributions: This paper analyzes a boosting algorithm that is related to multi-scale GNN, for optimization and generalization bounds. It also provides empirical observations for the boosting algorithm.

Strengths: The claims are sound, the paper is not too incremental, and relevant to the NeurIPS community.

Weaknesses: The claims are sound, but somewhat trivial given previous results. The proofs are simply the combinations of standard known results. So, this paper does not provide a new proof technique or insights in terms of theory. But, it applies that to an interesting question about multi-scale GNN, which is good.

Correctness: The claims seem to be correct overall. The claim on "test error bound monotonically decreasing with depth" should be deleted for sure. This is not supported by the results here. This claim requires the several assumptions implicitly made in this paper that are not validated and would be invalidated in the future. First, the test error bound in theorem 2 contains D^(t), which contains 2^L factor, which increases exponentially as the depth increases. There are also other terms that can increase as depth increase. Therefore, "the test error bound monotonically decreasing with depth" is not valid, and I suspect the opposite: this test error bound indeed increases with depth for practical cases. But, I understand that this seems to stem from discrepancy in terminology at first glance. Depth in this paper means depth in aggregation functions over t, instead of depth for classification networks L. But, depth in aggregation functions do not contain any nonlinear activation, which is the reason why this bound does not contain 2^T factor or other factor that grows exponentially in t in the test error bound. If we use non-linearity in-between aggregation functions G, we have the same issue for depth in aggregation functions over t too. Therefore, this is not just due to the discrepancy in terminology, but fundamental in proof.

Clarity: The paper is well written. It also provides previous related papers well.

Relation to Prior Work: This paper clearly discusses how this work differs from previous contributions.

Reproducibility: Yes

Additional Feedback: I read the author feedback. It answers my question well and was consistent with what I assumed in the original review. Therefore, I remain my positive evaluation. In my understanding, in standard multi-scale GNN, there are nonlinear activation in-between aggregation functions G. In this paper, there is no nonlinear activation in-between aggregation functions G. Nonlinear activation is only in B. Therefore, "graph" part G is always linear. Is there such multi-scale GNN in the literature? Why do you think this "linear version" works better than standard multi-scale GNN with non-linearity in-between aggregations? Is there any reference or additional empirical study to demonstrate the effect of "linear version vs nonlinear version"? The answer to these questions affect my final evaluation a lot, since this determines how relevant this paper is for GNN. After author response on this, I may decrease or increase my evaluation a lot (my current evaluation is very tentative). Theorem 1 is trivial to me given Assumption 1, even without any previous papers. But, I have worked on this field, and what I thought trivial turned out to be not trivial to others in many times previously. Theorem 2 (with the correction in appendix) is more interesting to me, but this also follows standard proof of Rademacher complexity bounds. This bound can increase exponentially as depth increase, as I mentioned above. The trade-off issue mentioned after Theorem 2 is not resolved in this paper. But, it is good to discuss the limitation explicitly, unlike many other papers.

Review 2

Summary and Contributions: The paper presents a novel boosting style method to optimize graph NNs and tries to pull together various pieces from learning theory to shed light on the optimization and generalization accuracy in GNNs, in particular as it relates to a phenomenon called "over-smoothing" in the DNN community.

Strengths: The paper digests quite some literature and results, which it tries to adapt to the chosen setting.

Weaknesses: I find the paper extremely hard to read. I can't state that I feel comfortable, I have understood the main claims and results. There is very little empirical evidence or clean argumentation that would motivate the reader to follows the suggested path that combines "muddy" fundings from GNNs with established learning theory. As it stands, I find the paper needs a major revision to work out the actual claims and results.

Correctness: (Unable to check due to complexity of write-up.)

Clarity: No, surely not. See above.

Relation to Prior Work: A lot of work is cited, but I still feel there is a lack of clarity in putting this work into the context of other work in GNNs.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper proposed GB-GNN, a multi-scale graph neural network with the boosting theory to deal with the over-smoothing problem. On top of the generalization gap, the authors also studied the optimization guarantee under the weak learning condition. The upper bound of test error is derived under the special case of MLPs as the transformation function and GCNs as the aggregation function. Both theoretical analysis and experimental results indicate an improved performance on test error when deepening the model. In terms of the prediction accuracy, the framework (with several variants) has been implemented on the standard node classification datasets and performs comparable results to the conventional GNNs.

Strengths: The proposed idea combines the idea of multi-scale GNNs and boosting theory in ensemble methods. Under the condition of weak learnability, the proposed model can derive the optimization and generalization guarantees. With an increased depth (the training iteration), the model's generalization bound is monotonically decreasing and thus avoid the over-smoothing problem in general GNN frameworks. Detailed proofs are formulated to fill the insufficient discussion in literature, and empirical evidence is provided with three node classification benchmark datasets, which suggests the comparable performance of the underlying GB-GNN models.

Weaknesses: 1. According to the notations in Equation (3), I believe that the model fits a separate transformation function $b^{(t)}$ by an L-layer MLP at each iteration~$t$. If that is the case, the weight to be estimated and the number of hyper-parameters to be tuned could be large, especially when increasing the iteration number and layers. I would expect a small section concerning the efficiency of the algorithm and the performance sensitivity to the parameter settings. 2. According to Section 5, the performance of this multi-scale ensemble model relies heavily on the depth of the model. However, the wlc condition could easily break when it goes too deep. That is, for a larger dataset with a more complex structure, the model could fail before learning sufficient information. 3. Although theoretically appealing, the GB-GNN model does not outperform many competitors in experiments, according to Table 5 of Section G. Even for the three standard datasets that are baselines for node classification, yet still one of the models cannot run properly. Given this, the model's capability on large datasets is questionable.

Correctness: The proof seems correct.

Clarity: Generally, the paper is pretty readable; however, the notation could be improved in a couple of places. 1. It would be easier to follow if all the notations are properly defined right beside the equations. For example, the definitions of $\tilde{B}^{(t)}$ at line 198? The constant $\tilde{C}^{(t)}$ in Equation 4 are not clear. I can see it is a positive number and an upper bound constraint for $||W_C||_1$, but what's the meaning of it? What's the difference between $C^{(t)}$ (for example, in line 668 - 669) and $\tilde{C}^{(t)}$? 2. It is hard to understand Equation 4 fully. I assume the model uses one layer of inactivated GCN in each iteration with the same weight set $W$ and adjacency matrix $\tilde{P}$ to transform $X^{(t-1)}$ to $X^{(t)}$. Please correct me if I am wrong; otherwise, it would be clearer to add superscript $(t-1)$ and $(t)$ to X.

Relation to Prior Work: The authors stated in Supplementary that several existing works in a similar field. However, their analysis either restricted to a specific model, or they only considered the generalization gap but left the optimization guarantee aside. This work extends the idea in [67] to multi-layered and multi-scale GNNs, and include optimization analysis on top of [25].

Reproducibility: Yes

Additional Feedback: The w.l.c params ($\alpha_t, \beta_t$) are quite confusion to me. According to Algorithm 1, they are inputs in each iteration. However, they are not hyper-parameters to tune with empirical loss (at least not shown in Table 3 of Section E4), and they are not part of the loss functions. I wonder how these parameters influence the mode performance, and how are they valued? 2. In Equation 4, I wonder why applying $\tilde{P}$ could exclude neighbouring information (line 42)? 3. According to Section G1 and Section G2, especially for the Cora dataset, one hidden layer MLPs significantly outperforms other 0-4 hidden layer choices, for both training and test losses. The models also perform more stable in terms of the accuracy score. Does it indicate a `weaker' learner is favourable than deep `stronger' learners? What are the possible explanations of that?