__ Summary and Contributions__: This paper tackles the over-smoothing problem of graph neural networks (GNNs) and aims to enable the training of deep GNNs. The authors propose two over-smoothing metrics, Group Distance Ratio and Instance Information Gain, which quantify over-smoothing from global (graph communities) and local (individual nodes) views. In addition, a normalization mechanism called Differentiable Group Normalization (DGN) is proposed to address the over-smoothing problem. Experiments were conducted on node classification tasks with models of varying depth.

__ Strengths__: The problem that the authors study is important for graph neural networks.

__ Weaknesses__: (1) The empirical results seem weak compared to other works [1] aiming at tackling the over-smoothing problem. According to Table 1, deep GNNs with DGN outperform those with other normalization mechanisms; however, the performance degradation still exists when the GNNs are made deeper. [1] proposes methods that are not from the normalization view but nevertheless address the performance degradation, especially for very deep GNNs.
(2) Regarding the proposed over-smoothing metrics, the Group Distance Ratio is a simple extension of existing work [2] focusing on node-pair distances. Though the idea is somewhat incremental, the proposed Differentiable Group Normalization is indeed closely tied to it. The Instance Information Gain, however, employs the mutual information between the input features and the output representations as a metric, which seems questionable. According to Appendix F, the output representation is taken from the final prediction layer, i.e., the result of a linear transformation applied to the top hidden features. Thus, the quality of the final prediction layer influences the calculation of the metric. In other words, various factors beyond the over-smoothing behavior of deep GNNs directly influence this metric.
(3) The proposed Differentiable Group Normalization first learns a differentiable clustering into potential groups and then softly normalizes the features within each group. However, the training process of GNNs with DGN at different depths is not shown. Intuitively, the cluster assignment matrix changes dynamically during the training of the GNNs. It would be better to show the influence of DGN on the training of GNNs as a supplement to the experiments.
[1] Chen, Ming, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. "Simple and Deep Graph Convolutional Networks." In Proceedings of the International Conference on Machine Learning (ICML), 2020.
[2] Zhao, Lingxiao, and Leman Akoglu. "PairNorm: Tackling Oversmoothing in GNNs." 2019.
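For concreteness, here is a minimal numpy sketch of the soft-clustering-then-normalize mechanism criticized in (3), as I understand it from the paper. This is my own illustrative paraphrase with hypothetical parameter names (`W`, `b`, a balancing weight `lam`); it is not the authors' code.

```python
import numpy as np

def dgn_sketch(H, W, b, eps=1e-5, lam=0.01):
    """Illustrative sketch of differentiable group normalization.

    H: (n, d) node features; W: (d, g) and b: (g,) parameterize the soft
    cluster assignment; lam balances the normalized groups against H.
    """
    # Soft assignment of each node to g groups (row-wise softmax).
    logits = H @ W + b
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)                # (n, g), rows sum to 1

    out = H.copy()
    for k in range(S.shape[1]):
        w = S[:, [k]]                                # (n, 1) soft membership
        Hk = w * H                                   # group-k slice of features
        mu = Hk.mean(axis=0, keepdims=True)
        sigma = Hk.std(axis=0, keepdims=True)
        out = out + lam * (Hk - mu) / (sigma + eps)  # add normalized group back
    return out
```

Because `S` depends on the learned `W` and `b`, the effective grouping (and hence the normalization) shifts throughout training, which is exactly why I would like to see training curves at different depths.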

__ Correctness__: The claims and methods seem to be correct. However, regarding the empirical methodology, the necessity of evaluating GNNs of varying depth on datasets with missing features is not justified. It would be better to discuss what evaluating methods for the over-smoothing problem under the missing-feature condition offers researchers or practitioners.

__ Clarity__: This paper is easy to follow, but the left panel of Figure 3 is hard to read and could be improved by converting it to a table.

__ Relation to Prior Work__: This paper is missing a related work section; it only compares the proposed differentiable group normalization with several other methods for the over-smoothing problem in GNNs in Section 3.1. It would be better to first review the over-smoothing problem in GNNs and then introduce the existing understanding of and methods for this problem.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: Over-smoothing is one of the limiting factors in scaling deep GNNs for discriminative learning. Motivated by this issue, this work proposes interleaving GNN layers with a differentiable group normalisation (DGN) layer. DGN normalises node features within the same group while distinguishing the feature distributions of different groups. This is shown to slow the performance drop as more layers are stacked. Further, in order to study the over-smoothing issue, the paper introduces two metrics. Experiments on standard datasets show that DGN improves the performance of GNNs on classification tasks.

__ Strengths__: 1. This is the first work to introduce precise metrics for measuring over-smoothing in GNNs - Group Distance Ratio and Instance Information Gain. Until recently, over-smoothing had been identified only empirically by studying the distance between node pairs.
2. Unlike previous work, which tackles the over-smoothing issue using a specific regularizing loss, this work proposes a new differentiable parametric layer. This layer indirectly reduces over-smoothing.
3. Relative to prior work, deeper GNNs using DGN layers show improved performance.
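For readers unfamiliar with the first metric, the Group Distance Ratio can be paraphrased roughly as the ratio of between-group to within-group pairwise distances, so a small value indicates collapsed (over-smoothed) representations. A minimal numpy sketch under that reading (my own paraphrase, not the authors' exact formula):

```python
import numpy as np

def group_distance_ratio(H, labels, eps=1e-8):
    """Rough paraphrase of the Group Distance Ratio: the average pairwise
    distance between nodes of *different* labels divided by the average
    pairwise distance within labels. Small values suggest over-smoothing."""
    # Pairwise Euclidean distances between all node representations.
    diff = H[:, None, :] - H[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = D[same & off_diag].mean()   # within-group distance
    inter = D[~same].mean()             # between-group distance
    return inter / (intra + eps)
```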

__ Weaknesses__: 1. For a fixed column of Table 1, it is not clear whether the total improvement is due only to the normalisation operation in DGN or to the increased number of parameters w.r.t. PN and BN.
2. Please discuss runtime impact.
3. Although I see a significant performance improvement for deeper stacks compared to prior work, the overall performance of deep stacks still lags behind shallow GNNs + DGN.
4. Currently, the novel contribution is defining two precise over-smoothing metrics. The DGN layer definition is fairly obvious, and it is unconnected to these metrics. Moreover, the differentiable clustering process is a simple extension of the differentiable graph pooling (DiffPool) module [A].
[A] Ying, Zhitao, et al. "Hierarchical Graph Representation Learning with Differentiable Pooling." Advances in Neural Information Processing Systems, 2018.

__ Correctness__: Yes.

__ Clarity__: The paper is very well written and easy to follow.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: 1. For the S update in eq. (7), have you tried using Gumbel-Softmax or a softmax with a temperature parameter?
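To make the suggestion concrete, here is a small numpy sketch of both options. These are the standard formulations of temperature softmax and Gumbel-Softmax, written independently of the paper's eq. (7), purely for illustration:

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Softmax with a temperature parameter: tau < 1 sharpens the
    assignment toward one-hot, tau > 1 softens it."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Gumbel-Softmax sample: adds Gumbel noise to the logits before the
    temperature softmax, giving a differentiable approximation of a hard
    (one-hot) group assignment."""
    rng = np.random.default_rng() if rng is None else rng
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    return softmax_with_temperature(logits + g, tau)
```

Annealing `tau` toward zero during training would push the soft assignment matrix toward hard cluster memberships.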
############
Post Rebuttal: I agree with the other reviewers that the current work should be compared to other recent work. Moreover, I am not convinced of the applicability of this work, which seems useful mostly in the missing-feature case.
At the same time, I find merit in this work and believe the other recent works should be considered contemporaneous with it. Hence, I keep my score unchanged.

__ Summary and Contributions__: This paper proposes differentiable group normalization (DGN) to mitigate the over-smoothing issue inherent in graph neural networks. To guide methods tackling the over-smoothing issue, two metrics (i.e., Group Distance Ratio and Instance Information Gain) are proposed to evaluate how severely the node representations are smoothed. Furthermore, the paper shows that the proposed DGN effectively mitigates the over-smoothing issue and significantly improves performance.

__ Strengths__: The paper is well written. In particular, it provides very good motivation for group-wise normalization, which makes it easy to follow the core idea of the paper.
The two proposed metrics are important for guiding the future development of methods for tackling the over-smoothing issue.
The paper validates the performance of DGN with multiple datasets and multiple GNNs. The experiments are very solid.

__ Weaknesses__: The paper has several weaknesses which I will detail below:
# Contributions:
(1) The main technical contribution of this paper is the differentiable group normalization (DGN). However, it seems that the exact same normalization has already been proposed -- it was called attentive normalization [1]. The idea of attentive normalization is to split features into multiple groups and normalize each group with its own statistics (i.e., mean and std). From my point of view, attentive normalization and DGN basically share the same idea and even the same formulation. The minor difference would be that attentive normalization in [1] is trained with additional regularization tricks to avoid mode collapse.
(2) Also, as mentioned in [1], it is possible that some groups get zero assignments in the assignment matrix -- e.g., no node is assigned to some groups. I think it would be straightforward to investigate this issue and perhaps improve performance with the same regularization tricks as in [1].
(3) There are also other closely related works, such as attentive context normalization [2] and mode normalization [3]. Both normalizations learn an assignment matrix, as the proposed DGN does -- they split the inputs into multiple (or single) groups and then normalize accordingly. I think the paper should discuss or even compare DGN with them.
References:
[1] Wang et al. "Attentive Normalization for Conditional Image Generation." CVPR 2020.
[2] Sun et al. "ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning." CVPR 2020.
[3] Deecke et al. "Mode Normalization." ICLR 2020.

__ Correctness__: The novelty claim of the proposed differentiable group normalization (DGN) might not hold, as explained in the weaknesses.

__ Clarity__: Yes, the writing of this paper is excellent.

__ Relation to Prior Work__: The paper clearly relates to prior work on the over-smoothing issue of GNNs, but it fails to relate to works on normalization, as mentioned in the weaknesses.

__ Reproducibility__: Yes

__ Additional Feedback__: AFTER REBUTTAL:
I have gone through the other reviewers' comments and the authors' feedback. The other reviewers' concerns are valid to me, and I don't think the rebuttal has resolved my main concern about the overlap between attentive normalization (AN) and the proposed DGN.
While the authors put the performance of AN in a table in the rebuttal, they did not clarify the difference between AN and DGN. From my point of view, they basically share the same form, as mentioned in my review. While the performances of AN and DGN differ in the rebuttal table, I doubt whether the rebuttal has properly implemented AN, given that there are several regularization tricks in AN's paper.
Therefore, I would keep my original ratings.

__ Summary and Contributions__: The paper targets the over-smoothing issue in GNNs by considering the community structures in a graph, in terms of two proposed over-smoothing metrics and a differentiable group normalization. Experimental results on several datasets have validated the effectiveness of the proposed method.

__ Strengths__: The soundness of this research is good from both the theoretical and the empirical evaluation perspectives. The claim is quite valuable for GNN research and very relevant to the NeurIPS community.

__ Weaknesses__: It would be better if the paper gave a few case studies (typical examples) with explanations and analyses. For instance, for SGC on the Citeseer dataset in Table 1, #K is 30, which is very large compared to other cases (in some cases #K is 2). It is not easy to interpret this: what happens, and why? Further, why is #K always 1 for all three models on Coauthors, whereas in Table 2 #K becomes quite large, ranging from 15 to 25? This might be confusing for readers.

__ Correctness__: The claims and method are correct, and the empirical methodology is sound.

__ Clarity__: Yes, the paper is well written.

__ Relation to Prior Work__: Yes, the paper clearly discussed how this work differs from previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: A more concrete analysis of the experimental results is expected.