NeurIPS 2020

Decision trees as partitioning machines to characterize their generalization properties


Review 1

Summary and Contributions: This paper introduces partitions as a framework for studying the VC dimension, obtains expressions for the partition function of decision stumps and decision trees (leading to VC bounds), and performs numerical experiments that use the resulting bounds in a structural risk minimization manner. Here, the partitions are partitions of the sample points, and the partitioning function is the largest number of a-partitions (i.e., partitions into a parts) that can be realized by the (tree) classifier class. The VC dimension is then related specifically to 2-partitions.
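For concreteness, a minimal sketch of how I understand this relation, in my own notation (the symbol \pi_T^a(m) and the exact form of the bound are my assumptions, not necessarily the paper's): writing \pi_T^a(m) for the largest number of distinct a-partitions of m sample points realizable by the tree class T, each realizable 2-partition yields two binary labelings and the trivial 1-partition yields the two constant labelings, so the growth function satisfies

    \tau_T(m) \le 2\,\pi_T^2(m) + 2.

Since a set of m points admits 2^{m-1} - 1 distinct 2-partitions, shattering m points amounts to realizing all of them (together with the constant labelings), which gives

    \mathrm{VCdim}(T) = \max\{\, m : \pi_T^2(m) = 2^{m-1} - 1 \,\}.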

Strengths: The results seem to be interesting and the proofs non-trivial. Additionally, there is extensive empirical validation suggesting that structural risk minimization with these bounds is useful.

Weaknesses: (1) Although I have not read the proofs in detail, this paper seems like a good contribution. The main question I have is whether this partition framework is useful more broadly. For example, can it be used for multiclass results or with other types of classifiers?

Correctness: As far as I can tell.

Clarity: Yes.

Relation to Prior Work: Yes, prior work is discussed.

Reproducibility: Yes

Additional Feedback: (2) As someone who hasn't studied the VC dimension of trees before, one of my initial questions is how these results compare to VC bounds based on parametrized function classes (which usually count the number of arithmetic operations). While I don't think those lead to interesting bounds in this case, a short comment to this effect would be helpful. I've read the author response and the other reviews, and I will stick with my original score.


Review 2

Summary and Contributions: The authors present a different viewpoint on decision trees based on partitions of the data. It is shown that this notion of partitioning relates to the growth function and, consequently, to the VC dimension. The authors then provide a pruning algorithm that does not require cross-validation and attains better performance than CART.

Strengths: 1. The theoretical results are sound, although I did not check all proofs thoroughly. The empirical setup is correct, although somewhat limited. 2. The novelty of this work lies in relating the VC dimension to the proposed partition function. It is an interesting proposition and theoretically appealing, but the practical significance is more limited. 3. The topic and ideas are relevant to the theoretical community, which makes this a good paper for acceptance.

Weaknesses: * From the theoretical side, the main weakness might be that the analysis is restricted to purely continuous features. Nowadays, this scenario is very unlikely in the most challenging machine learning problems, so the theoretical implications may be limited as a result. * As a non-expert in the world of decision trees, I am curious, regarding the empirical evaluations, why the method was compared only to a very old algorithm such as CART. Is it the only reasonable decision tree algorithm out there that is comparable to the proposed method?

Correctness: Yes, the theory developed seems correct, although I skipped some proofs. The empirical methodology is simple and correct.

Clarity: The paper is very well written. It is easy to follow and to grasp the main ideas behind it.

Relation to Prior Work: There is a good comparison to the existing literature, where the authors stress the differences with the closest prior work.

Reproducibility: Yes

Additional Feedback: UPDATE: I am keeping my score of 7 after reading the authors' response. It is certainly a good submission as is, although it would have been stronger had the analysis of categorical features been included in the paper.


Review 3

Summary and Contributions: The paper analyzes the VC dimension of hypothesis classes that consist of binary decision trees with a fixed topology and varying node labels (split criteria, predictions). It provides an exact result for decision stumps and an asymptotic result for larger trees, which both depend only on the data dimension (l) and the number of split vertices in the tree.

Strengths: The paper provides the exact VC dimension of decision stumps given the data dimension, and the asymptotic behavior of the VC dimension of decision trees given their number of internal nodes and the data dimension. These are interesting results, as no such results seem to exist in the literature.

Weaknesses: As the authors already mention, it would be interesting to have results for categorical data. It would be interesting to get at least a small clue about what problems lie ahead here, and why the existing analysis does not generalize readily to that scenario.

Correctness: I could not spot any obvious mistakes in the authors' arguments. The bounds for data dimensions 1, 2, 3, 4 and decision stumps seem correct/plausible.

Clarity: The paper is clearly written and allows the reader to follow the authors' train of thought.

Relation to Prior Work: It seems worth mentioning that Ozlem Aslan, Olcay Taner Yildiz, and Ethem Alpaydin, "Calculating the VC-dimension of decision trees" (2009), give an algorithm to compute the VC dimension of binary decision trees. Is there a connection between their algorithmic approach and your proof of Corollary 10? I agree with the authors' rebuttal that the above paper is limited in scope and solves a slightly different problem. However, it still seems reasonably related.

Reproducibility: Yes

Additional Feedback: Regarding the broader impact of the paper, I feel that a more detailed sentence on the benefits to practitioners using decision trees might be in order. In particular, your work has implications for the power of the learner. How could a practitioner benefit from this knowledge?