Paper ID: | 5460 |
---|---|

Title: | Efficient Near-Optimal Testing of Community Changes in Balanced Stochastic Block Models |

Overall I enjoyed reading this paper, and it is of pretty high quality, and moderately original. Even for a non-specialist of hypothesis testing problems (which are classical questions in statistics) the paper is accessible. It is well written, concise and yet clear (if one trusts the proofs, that are in appendix). The hypothesis testing problem studied seems not knew (I'm not a specialist of this literature), but the application to the SBM is. This is not too much "incremental", as theoretical analysis of the SBM is a notoriously difficult problem. The paper provides rigorous bounds, and convincing numerical experiments. My main concern is related to the interest by the ML audience: I wonder how much of a concern is the problem to a pure ML audience, and if that paper would fit better in a good statistics conference. This is not necessary the case. Some detailed comments: _My main concern: There is something that I do not understand and that the authors should clarify (it is actually a crucial point): It is claimed, in particular at line 46, 47 that the case of sparse graphs with bounded degree (bounded in average) is considered, which in the notations of the paper means to me (and in general in the SBM literature) a,b=O(1). But later in the definittio of the SBM it is written Kline 134 that a,b=O(ln n). So they diverge. So formally, even if ln n grows slowly, the graph is not sparse (and for non-sparse case the analysis is much more easy in general). Do I miss something? Even if yes, this point is misleading and should be clarified. _please define the acronyms FA and MD(x) in eq (1) _I think the definition of R_{TST} in (2) is wrong: as you restrict the sup over x,y s.t. d(x,y) >= s, there is an error when the test is different than 1, while you put an indicator 1(x = y), which is 0 all the time as you consider x,y s.t. d(x,y) >= s. _line 179: please specify that you allow sel edges (which is unusual), otherwise the second Bin(n^2/4, a/n) is not correct _lines 183, 184 are very mysterious, please clarify what you mean here _lnie 308: "and" is reputed twice

The paper presents a solid contribution into the statistical literature related to the stochastic block model. The related work on the Goodness of Fit and Two Sample Testing is up to my knowledge very well covered. In my opinion the paper present a couple of drawbacks that I list below: I believe that the reason SBM is so widely studied largely because some of the constants, that in other statistical models might be very hard to determine, can be determined in this model. This being the main attraction of the model, and given that the present paper does not specify the constants, this limits the significance or the contribution. The manuscript is not consistent in terms of the scaling of the degree that is considered in various parts. The abstract states "we consider the challenging sparse SBM regime, where degree-per-node is constant". Then for instance the results on [DAM17] are used, but those apply to degree growing with n, analogous results of \Lamba < 2 for constant degrees are due to Mossel, Neeman, and Sly, "Stochastic Block Models and Reconstruction", 2012. Towards the end of page 5 they state: "a, b = O(log n) considered in this paper", contradicting the abstract. Are the obtained result useful in practice? The experimental part contains results in the blog network, but it is also stated that "TST plot suffers since the political blogs graph is not completely described as a 2-community SBM". Real networks are indeed rarely well described by the 2-community SBM, the authors should comment more on this issue. (minor) The papers studied SBM with two equal-sized groups, this should be specified from the beginning as this is a particular case of the more general model. And several of the informally stated claims in the first 3 pages do not apply to the general case. -------- post-feedback: I have read the authors answer and the other reviews. While the feedback clarified some issues (sparsity for instance), it also made it clear that applicability of the theory in its present form is limited. In my opinion the paper remains of borderline originality.

The question addressed in this paper is interesting and the results non trivial in my opinion. The mathematical methods seem quite standard - Bernstein inequalities and Le Cam's method. Experiments are also provided to validate the results on real data sets. But I did not look at this part. I have not looked at this part and my feeling is that much more investigation would be needed to really validate the modeling on real datta sets. The paper is clearly written. The introduction gives a detailed review of the literature. Review update: in their answers to the reviewers the authors have essentially clarified technical concerns. The question of the relevance of the theory to experimental situations has come up. I am happy to support the paper even if the applicability of the theory is questionable since there seem to be new serious theoretical results w.r.t the literature. However if the experiments show that the theory has unclear applicability the authors should honestly state it, and underline that the main value of the paper is its theoretical contribution.