NeurIPS 2020

Graph Meta Learning via Local Subgraphs

Review 1

Summary and Contributions: The authors propose a novel algorithm named G-meta that applies a meta-learning algorithm to node classification or link prediction for the fast adaptation of the new task. The proposed method estimates the prototype based on the subgraph centered at each node and the entire process is optimized by using the optimization-type meta-learning algorithm, MAML. The authors also show that the information of the subgraph is rich enough to reduce the graph influence loss if the node influence between two apart nodes diminishes.

Strengths: The strength is that the authors propose a novel combination of MAML and the graph neural network that estimates the prototype constructed from local subgraph to quickly adapt a new task of the node classification task or edge prediction task, and show its effectiveness by the experiments on five datasets. Theoretical analysis is great and it also supports the validity of their approach. The validity of the proposed method is confirmed by comparing with the nine baseline methods.

Weaknesses: Although the authors show the top performance in the experiments, the experiment is not well designed to reveal which condition the proposed method works best and what part is most essential for the success. For example, meta-learning algorithm will work well if the training data in the test task is few and it is trained on the other several tasks. So if the meta-learning based method is compared with the conventional supervised learning method, it would be better to see how few the dataset in the test task the meta-learning based method can take advantage over the conventional supervised learning method when varying the size of the training dataset. Also it would be nice to see the ablation study how much the proposed method benefit from the optimization-type meta leaning algorithm by comparing the proposed algorithm with and without MAML. As for the algorithm, meta-GNN already proposed the combination of MAML and graph neural network. Meta-GNN is close to the proposed G-Meta algorithm, and the difference is the usage of the metric learning to cope with Multiple Graphs and Disjoint Labels problem. I feel the difference remains moderate.

Correctness: Although the authors claim that the proposed method covers the wider class of graph problem, and it is true when compared with some existing methods. However, it only covers node classification task and link prediction task where the target is discrete class label. It does not include the molecular prediction problems discussed in [15] that is referred in the introduction of the paper. To safely avoid the confusion, it would be better to explicitly mention the problems the proposed method solves in the paper and Figure 1.

Clarity: I think the paper is overall well written and enough clear except for the problems the proposed method solves as mentioned above.

Relation to Prior Work: Existing methods are described in the paper, but the difference with meta-Graph remains moderate.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper proposes a novel and general method for meta-learning inductive biases for graph-neural networks. The proposed method is amenable to a greater set of problems than prior methods that tend to specialise on certain cases, and show substantial improvements over prior works on several benchmarks. --- Post rebuttal update --- Thank you for a great rebuttal that addressed my main comments, especially with ablations on sub-graph size and the relative importance of metric- and gradient-based meta-learning.

Strengths: This paper is generally well-written and easy to follow. It presents a novel method for solving an important class of meta-learning problems, and demonstrates superior performance to baselines that are less general than the proposed method. As such, I recommend acceptance. In particular, the core idea of this paper is to represent each node with a local subgraph, as opposed to a full, global graph as in prior works. The degree of locality is measured by a hard cutoff threshold, and the authors motivate this locality by showing that the degree to which a node can influence another decays exponentially with distance in the graph. This suggests a bias-variance trade-off from a meta-learning point of view, and the authors results suggest that this trade-off is not trivial.

Weaknesses: My main suggestion for further improvement would be to provide a deeper analysis of some key algorithmic choices to both support claims made in the paper and help the reader understand what seems to be driving performance. At a high level, the proposed method makes use of both metric meta-learning and gradient-based meta-learning, but it is unclear how these relate and contribute to the method’s overall performance.

Correctness: Theoretical claims and empirical results appear correct to the best of my knowledge.

Clarity: This paper is well-written and easy to follow. If anything I would say that Section 3.1 feels out of place - this sits more naturally within the experimental section.

Relation to Prior Work: The related work section needs to be expanded. A single paragraph is no good. As for the choice of a meta-learned metric for node classification, this is taken from Matching Nets and Prototypical Nets, but this algorithmic choice is only briefly discussed without much of a motivation. It would strengthen the paper to discuss this choice and its relationship with prior works in greater depth.

Reproducibility: Yes

Additional Feedback: I would recommend the authors to provide an explicit ablation study to demonstrate both how sensitive the method is to the choice of this cutoff threshold and what kind of relationship we see between generalisation and subgraph locality. I would expect an inverted U-shape, where too local a subgraph would not capture sufficient information, and too large a subgraph would lead to overfitting, but this remains to be demonstrated. Similarly, it would be illuminating to see what kind of subgraphs are being extracted, in particular in the synthetic experiment.

Review 3

Summary and Contributions: The paper proposes a general framework for a variety of graph meta-learning problems. The G-Meta represents every node with a local subgraph and uses local subgraphs to transfer subgraph-specific information. Experimental results on seven datasets show promising performance, compared with several baselines. ----------------- After read all the comments from other reviews and the rebuttal of the authors. The authors addressed my concerns about ablation analysis. However, for the main concern about the subgraph, I still think there is nothing wrong with my previous understanding. Meta-GNN also computes the nodes representations using a sub-graph based on the number of aggregation layers (for l layer GNN, it computes nodes representations based on the l-hop neighbors, rather than the entire graph). Furthermore, I suppose the performance improvement of G-Meta compared with Meta-GNN comes from the usage of metric learning. Drawing conclusions from such comparisons are not fair. Therefore, I will remain my score unchanged.

Strengths: +The idea of using local subgraphs information to facilitate essential knowledge transferring via meta gradients is sound. +The theoretical analysis is substantial and reasonable. +Comprehensive experiments on seven datasets show considerable improvement. + The paper is well written in general and easy to read. The figures are illustrative.

Weaknesses: 1. The idea of using local subgraphs to compute the node representation is not novel. In practice, GCN also uses node’s neighborhood to define a computation graph and generates node embedding based on the local network neighborhoods. 2. There is a lack of experimental analysis on the influence of the different subgraph sizes on performance. 3. It seems like the baseline (ProtoNet, MAML) setting is not explained clearly. From the current description, it seems like that the ProtoNet uses the same encoder as G-Meta to get the subgraph representation h, and then uses the prototypical loss as G-Meta, which can be seen as an ablation model of G-Meta. Also, the exact setting of MAML needs to be explained clearly. 4. Comparing the performance of G-Meta and ProtoNet in Table 2, I am wondering about the reason why ProtoNet gets the comparable results on Single Graph Disjoint Labels and Multiple Graphs Disjoint Labels, but get much worse results on Multiple Graphs Shared labels.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: typos: 1. line 236 a varying distribution “are” -> “is” 2. line 264 use -> uses

Review 4

Summary and Contributions: --- Update after rebuttal --- I thank the authors for their detailed rebuttal, and in particular regarding the additional ablation experiments and clarifications. I maintain my recommendation to accept the paper and strongly encourage authors to integrate the additional content in the final version of their paper. --- The authors propose a generic meta-learning framework for node graph classification and link prediction when label annotations are scarce. Their key contribution is to decompose the graph inputs into a set of local subgraphs centered around the node of interest. The provide theoretical motivation that information lost by this decomposition is minimal. The approach itself combines concepts from few-shot learning prototypical networks and MAML. Main contributions are: 1- theoretical motivation for their subgraph decomposition strategy 2- applicability to a variety of settings and scalability 3- state of the art performance on multiple datasets

Strengths: - The paper is well written and motivated. In particular proofs in appendix are clear with all steps explained and justified. - The idea of exploiting only local information to learn graph representations is appealing and results suggests that it is sufficient to learn accurate representations. The increased scalability aspect is particularly appealing. Could authors comment on the impact of local decompositions on the oversmoothing problem , could this strategy alleviate the problem? - Experiments are provided on multiple datasets with state of the art performance, and show that the strategy can be exploited in multiple settings. - Authors promise to release code and datasets after the review period.

Weaknesses: - The main weakness of the work relates to the computational complexity of 1) computing the local subgraphs (are shortest paths computed ahead of the training process?), 2) evaluating each node's label individually. Can authors comment on the impact on training/evaluation time? - Another important missing element from the paper is the value of neighborhood size h, as well as an analysis of its influence over the model's performance. This is the key parameter of the proposed strategy and providing readers with intuitive knowledge of the value of h to use, and the robustness of the method with respect to larger or smaller neighborhoods is essential. Similarly, different hyperparameter sets are used per dataset, which is not ideal. Can authors provide insights into how performance varies with a constant set of parameters? - Certain aspects of the training set-up needs clarifying. Mainly the task generation process (what constitutes a task, can one task contain multiple graphs, are local substructures randomly sampled regardless of the original graph, are all nodes labelled in the training set, etc) - Certain sections are too condensed and would be much clearer and informative if expanded (e.g. related work, training setup, testing setup, baseline methods). In the interest of space, table 1 could be moved to supplementary.

Correctness: I have not seen issues in the proofs provided in supplementary. A few comments claims can be clarified in the results sections, e.g. l 276, please specify the setting/dataset and which baseline model is referred to (e.g KNN model or meta-GNN type methods). - Can authors please comment on the claim that methods typically require large amounts of annotated data and clarify the distinction between the studied set-up and the standard semi-supervised setting when only a small fraction of nodes is labelled ?

Clarity: The paper is for the most part clear and well written, albeit a little condensed at times. As mentioned in the weaknesses section, the experimental set-up needs additional attention: -Are all graph nodes labelled in the training set? Are K shot settings simulated as in standard episode training? -How are tasks constructed, especially with respect to the settings with multiple graphs? -Can authors provide a clear description of the testing setting? In particular with respect to the query set structure, and evaluation: is each node labeled individually? -When working in 5-shot settings, is it 5 nodes across 5 graphs? Regarding baseline methods compared to, am I correct to assume that this constitutes some form of ablative experiment? If so, please highlight that this is a variant of the proposed work and highlight the main difference with the complete method.

Relation to Prior Work: The related work section is very condensed but reviews main works and main differences. The paragraph at line 257 provides an additional description of the most similar methods and compares to main works exploring few-shot/meta-learning on graphs.

Reproducibility: Yes

Additional Feedback: l. 104: it is mentioned that the label set size over the whole data set is N, how does that relate to M? Considering it is mentioned next that this constitutes a N-way K shot setting, does this mean that N is the number of labels per task? If so, please clarify the distinction in the text. l. 142: please briefly clarify the notation x^(inf). l. 147: the proof makes the assumption that graph weights are non negative, this should be mentioned here. l. 154: neighborhood size is often referred to as the number of nodes in the neighborhood of a given nodes, I would suggest to make the distinction clear here that h is the number of hops from node u.