NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 1487 The Multiscale Laplacian Graph Kernel

### Reviewer 1

#### Summary

The paper presents a graph kernel algorithm combines Graph Laplacian style global graph kernel with the feature representations coming from nested neighbourhoods, often found in subgraph and other local feature-based kernels. For the base kernel, a Graph Laplacian is combined with feature maps of the graph vertices in order to achieve invariance of the number and identifies of the vertices, and kernelized. The base kernel is then iteratively for the induced subgrahps in larger and large neighbourhoods always using the kernel values of the smaller neighborhood as the basis of computing the kernel for the next level. To speed up the kernel the authors consider computation of a global basis for the feature representations of all vertices in all graphs of the dataset. A random projection approach is further used to decrease the computational overhead. Experiments show promising behaviour compared to other approaches, except the NCI molecule datasets.

#### Qualitative Assessment

The paper is an interesting combination of spectral (Graph Laplacian) and subgraph approaches for graph kernels. I feel the methods is computationally rather demanding (e.g. compared to WL kernels) that may preclude its use in very large data. Few notes: - It seems that the kernel has double regularization, first adding \eta to the diagonal of the Graph Laplacian L, then \gamma to the diagonal of the diagonal of the inverse of the above matrix L. I wonder if this is indeed necessary? - I found (8) redundant since it has exactly the same appearance as (5). I would be better to formulate this bit in some other way than repeating the same formula. - In beginning of section 4, I would propose going immediately to "business" with Definition 5, without giving the slightly ambiguous intro as the numbered list 1-3. The problem with the list is that it is not really understandable before you explain the detail (e.g. why would the subgraphs have vertices as centers - this only makes sense after Def. 5) - line 210 onwards: I find this discussion somewhat odd. Why would the two graphs NOT have the same base kernels? Is there any benefit of not having the same base kernels. One would imagine that since the graphs are embedded into the feature space of the vertices, it really does not matter what the origin of the graphs is. Thus, it is rather confusing to make a big point about the fact that you can compare any two graphs. Basically, I don't see why Definition 6 is really needed. - 3.1.: You make an issue of avoiding the recursive computation here. However, I think, it is rather clear for everybody that you can do this iteratively starting from the small neighbourhoods. - line 270: It is not clear to me why the big Gram matrix would be needed to store. As far as I can tell, the methods rely on the feature representations of the vertices, not the identities of them.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

The paper proposes a new graph kernel called the Multiscale Laplacian Graph (MLG) Kernel, which is itself based on the Feature space Laplacian Graph (FLG) kernel, still presented by the authors. The MLP graph kernel relies on the graph spectral decomposition, having the advantage to account for learning similarities at multiple different scales, considering both the overall and local structures. The mathematical formulations and intuitive steps leading to the novel approach are well presented in the paper. Experiments are also conducted for comparison with the state of the art. The MLG outperforms existing methods in 4 out of 6 datasets analysed, and it is competitive in the remaining 2.

#### Qualitative Assessment

---------------- Overall Opinion ---------------- The paper is overall well written, and the methods are technically well presented. My main concerns are relative to the missing intuition and unclear explanations in some parts of the paper, as well as to provide more details in the experimental section. More specific information are given in the comments below. ------- Novelty ------- The authors propose a new graph kernel, which appears to be competitive with the state of the art. I would overall consider the paper as based on a novel idea. ----------------- Minor comments ----------------- Style and language issues: - Line 20: "have been appeared" is formally incorrect, the authors could use (for instance) "have appeared" or "have been proposed" - Line 46: the meaning of "it can take structure into account at multiple different scales" is not fully clear, possibly the authors could correct with "it can take into account structures at multiple different scales" - Line 218: the neighbourhoods ... "have" to be defined, rather than "are". References: - Reference [10] is poorly written: rather than "jmlr" a clear statement as in reference [5] would be preferable. ----------------- Major comments ----------------- Include unclear definitions, notation, technical incongruences and poor intuitive explanations. Section 2: - It would be nice to better clarify the notation, in particular regarding graphs, before describing the method. For instance: are the words node's attributes, features and labels used with the same meaning? How are these exactly defined? - Lines 81, 84 and 95: the "low eigenvalue eigenvectors" is unclear, one could just say the "low eigenvalues" or "the low eigenvalues with corresponding eigenvectors". This might have been considered as minor concern, but since is often repeated it makes its meaning unclear; I therefore consider it as a major comment. - Line 149-152: the sentence is not clear. Though the mathematics formulation well defines the differences between the two kernels, some more effort should be done here to better explain the intuition behind. Section 3: - It would be nice to give more intuitive explanation on what the "level L" represents and how this affects the algorithm. Possibly, also the similarity with the WL kernel should be highlighted. Section 4: - The pseudocode algorithm is provided in the Supplementary Materials. I would suggest to include it in the main paper, as it provides the reader with an intuitive and detailed description of the method proposed. Section 5: - Are the authors using a stratified cross validation? If this is the case, it should be explicitly stated in the paper, otherwise it is advised to repeat the experiments with a stratified approach to avoid overfitting. - It would be nice to include the FLG kernel, defined by the author themselves, as a further baseline comparison. This would help to strengthen the real advantages of the proposed MLG kernel. ----------- Next Steps ----------- I would suggest to revise the paper according to the comments and modifications that I stated above. After that, I would consider the paper to be at an appropriate level for a NIPS publication.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

This paper introduces a novel kernel which is based on the idea of comparing the structure of a pair of graphs at multiple levels of resolution. This is achieved by building a hierarchy of subgraphs, which are compared using the FLG kernel (which makes use of local problem-dependent vertex features). A randomised projection procedure is used to overcome the computational burden of the original recursive definition of the kernel. The experiments show that the proposed kernel outperforms state-of-the-art kernels in most of the dataset considered.

#### Qualitative Assessment

This is a well written and sound paper, which, unlike other "incremental" papers on graph kernels, provides a refreshingly new perspective on the problem. The content is very "dense", and somehow 9 pages feel not enough for the amount of information the paper contains. In fact, most of the discussion on the experiments has been moved to the supplementary material. Also, having some extra space, it would be nice to add some illustrations to the paper, to help the reader understand the ideas underpinning the proposed kernel. Overall, anyway, this is a well written paper that I recommend for acceptance. A couple of minor comments: I would add reference to the MRF literature when referring to known results/theorems, as the readers interested in graph kernels may not be familiar with some of them. I am referring in particular to the text in the first part of section 2. It would have been nice to have the FLG kernel included as a separate entry in the experiments. 60: Graph Kernel(FLG kernel) -> Graph Kernel (FLG kernel)

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

A new graph kernel is proposed that - rather than globally or locally - compares two graphs at multiple scales, so as to capture similarities between graphs at multiple scales. The core idea is to break down the graph into mutliple subgraphs and compute pairwise kernels on these. In addition, a random projection method for this graph kernel is proposed, enabling more efficient computation. The experiments show that it is usually of similar runtime as competing graph kernels, but shows substantial improvements in accuracy.

#### Qualitative Assessment

This seems to be solid work. A novel graph kernel that shows very good experimental results. The technical content - although not containing a theoretical analysis - is not straightforward and the technical treatment is sufficiently deep. The random projection method might pave the way for more efficient computation of related graph kernels. In summary, novel, sufficiently deep material together with good empirical results; good paper.

#### Confidence in this Review

1-Less confident (might not have understood significant parts)

### Reviewer 5

#### Summary

A graph kernel based on the Laplacian of the input graphs is proposed, which supports the comparison of vertices by a kernel. This kernel is employed recursively comparing neighborhood subgraphs of differing size to obtain the multiscale Laplacian graph kernel. A method of computation using low rank approximation is proposed. The article contains an experimental evaluation showing promising accuracy results.

#### Qualitative Assessment

In recent years a variety of graph kernels have been proposed. This article provides a convincing motivation for the proposed approach and is on a high technical level. The presented results are non-trivial and the method of computation appears to be competitive. The paper is comprehensible and well written. The approach is applied to graphs with discrete labels by employing the extremely sparse feature vectors of the Dirac kernel to compare vertices. It would be interesting to discuss other vertex feature maps and how they affect the computation. Can the approach, for example, be applied to attributed graphs (with multidimensional continuous labels)? It is not clear which graph kernels in the comparison take vertex labels into account. The Graphlet kernel, e.g., has been proposed for unlabelled graphs. The accuracy of 10.95 of the Graphlet kernel on the data set ENZYMES is remarkable and not in line with previous findings, see ref. [9] (maybe a bug?). Nevertheless, in summary, the experimental comparison is convincing. The idea of considering the neighbourhoods of increasing size centred at individual vertices is not new. The same principle is, e.g., used by Weisfeiler-Lehman kernels (WL) or the propagation kernels. The authors justifiably note that WL uses a hashing technique which does not give rise to well-behaved summaries. However, this argument does not seem to hold for propagation kernels, which compare distributions of vertex labels. Therefore I think it should be interesting to discuss (or even compare to) propagation kernels: Propagation kernels: efficient graph kernels from propagated information Neumann, M.; Garnett, R.; Bauckhage, C. & Kersting, K. Machine Learning, 2015, 1-37 Since the used techniques and the method of computation are non-trivial and the description is on a high technical level, the immediate impact of the approach may be limited (such graph kernels unfortunately are not often used in following experimental comparisons). It would be extremely helpful to make the implementation publicly available. Minor remarks: l20: have been appeared l62: the the l105: "graphs of n vertices" -> with/on n vertices l361: The number of citations (and the award) should not part of the reference

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 6

#### Summary

This paper proposes a graph kernel, MLG, that addresses the multiscale structure similarity of graphs. The kernel is designed based on a recursive scheme that lifts a base kernel between graph vertices to a kernel between subgraphs. A linearization approach is proposed to make the computation feasible. Experimental results show that the proposal is effective.

#### Qualitative Assessment

The idea is interesting and elegant. I think the paper deserves publication in NIPS. The following contains some comments that the authors may consider for extending the work. 1. The proposed kernel depends on a base kernel for graphs. In the experiments, the authors use a base kernel that makes use of node labels. In general, if node labels are not present, what would be good kernels to use for the base case? 2. Because of linearization, the proposed kernel becomes dataset dependent, in the sense that if the dataset includes one additional graph, then the kernel for a pair of existing graphs changes. Moreover, such a characteristic means that if the kernel is used for classification, the validation and testing set must be known and utilized for training. 3. Whereas the experimental results confirm superiority of the proposed kernel for four data sets, it in fact does not yield as good results as do the WL kernels for the two data sets of larger sizes. There is a natural suspicion that the proposed kernel cannot compete with the WL kernels for large data. Would increasing the dimension $\tilde{P}$ and the sample size $\tilde{N}$ in the Nystrom approximation help?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)