NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
The proposed algorithm seems impractical. The problem in Equation (1) is not only nonconvex (for the MCP regularizer \rho_\lambda), but also combinatorial (for any regularizer \rho_\lambda). The combinatorial aspect comes from the constraint B \in D, where D is the set of p*p matrices representing the weighted adjacency matrix of a DAG. This problem is NP-hard. This is why [21] and [23] argue for an identifiability condition that allows recovering the node ordering in poly-time. The paper is overselling its contributions. Results in [23] do not require faithfulness, equal variance or Gaussianity. The submitted paper needs Gaussianity and equal variance. Furthermore, the submited paper is not the first to avoid faithfulness. Experimental evidence is mandatory in this particular case. Since the problem in Equation (1) is combinatorial/NP-hard, it is important to perform synthetic experiments to assess the time complexity. Minor comment: Some references labeled as Arxiv seem to actually be conference papers, for instance [21] [22] [23]. === AFTER REBUTTAL I am not satisfied with the authors response. The authors argue that reference [1] solves the problem in equation Equation (1). The authors in [1] use a heuristics, not exact optimization. Thus, my comment on practicality still stands, given that Equation (1) is NP-hard. The authors argue that their method is the first score-based method (for Gaussian, equal-variance data) to avoid faithfulness. Theoretical results in [23] do not require faithfulness, equal variance or Gaussianity. Thus, the "score-based" part does not bring any advantage. My comment about experiments was clearly due to the NP-hardness of exact optimization (not heuristics), and it still stands.
Reviewer 2
The paper is very relevant to the community. It is clearly written, though I would use fewer superlatives. I didn’t check all the proofs, but what I did check seemed solid. Regarding significance, I do believe the contribution an analysis here are excellent. Some aspects are a bit oversold, like saying “no faithfulness required”, but using a set of assumptions which cannot be cleanly untangled from such conditions. *** After authors' response *** I read the authors' response and my assessment is unchanged.
Reviewer 3
Update: The authors gave a good rebuttal, I have increased my score to 6. Original comments: In this paper, the authors considered the problem of learning directed acyclic graphs via optimizing a score. In particular, they have developed a new approach that requires O(s log p) samples to learn a DAG from the data. The proposed a approach is an optimization based approach that learns a DAG via optimizing a nonconvex scoring function. Pros. 1. The theoretical analysis of this paper is complete. In addition, the analysis techniques developed in this paper seems to be helpful to solve other related problems in structure learning and high-dimensional statistics. Cons. 1. In this paper, the authors claimed that the proposed approach is guaranteed to learn the minimum-trace DAG of the underlying distribution. However, the authors did not give any comments about the importance of learning a minimum-trace DAG in practice. It would be helpful if the authors can provide any example to show that the minimum-trace DAG is very important in applications, such as applications in biology, sociology or economics. In applications why are people interested in learning a minimum-trace DAG to represent the distribution? I understand if we assume the underlying data generating system is equivariance, then the minimum-trace DAG is the same as the DAG with equal variances, which is a particular scenario where learning the minimum-trace DAG is useful. However, the equivariance case seems too restrictive in practice. 2. Since the possible application of the minimum-trace DAG seems unclear, it is important for the author to provide real data analysis to show that minimum-trace DAG learned from the estimator (1) is really able to discover some informative results in practice. I regret that such information has not been provided in this paper. 3. It seems to me that the gap condition stated in Section 3.2 is a reasonable assumption only in the scenario where p = O(n^2). If p is beyond than that, the gap condition would become very restrictive. Is that correct? For example, if p is bigger than O(n^2), then one cannot come up with a choice of $a$ that allows gap(\Sigma) = o(1) while tolerating the average degree to grow with $n$. Hence, the gap condition is not applicable in the real high-dimensional setting where p = O(exp^n). If this is the case, perhaps the authors should state this limitation more explicitly in the paper. In conclusion, although the theoretical analysis shown in this paper seems interesting, the possible applications this estimator can be applied to is unclear.