NeurIPS 2020

Revisiting Frank-Wolfe for Polytopes: Strict Complementarity and Sparsity

Meta Review

All reviewers agreed that this paper presented a strong theoretical contribution to NeurIPS by providing the first global convergence analysis of a projection-free method on generic polytopes which depends only on the dimensionality of the optimal face rather than the ambient dimension like previous analysis. Given the significant amount of practical and theoretical interest that Frank-Wolfe variants have had in the machine learning literature, this result is quite significant. The authors also provided a nice motivation for the addition of the strict complementary condition by showing a \Omega(d) lower bound on Frank-Wolfe variants when it does not hold. R1 was originally negative about the write-up as they thought that its (somewhat dry style) was lacking appeal to the broader community, and its technical impact was not sufficiently described. I have also read the paper in details and discussed it with the reviewers. I think the introduction and Section 2 did a decent job to motivate the results in the paper and present a broader appeal. While Section 3 is indeed on the technical side and the paper was lacking a discussion / conclusion, the reviewers agreed that this could be easily fixed in the camera ready version (and also thanks to the 9th page). I also appreciated the fact that the main paper is self-contained without the need for an appendix, unlike the recent trend on technical papers. R1 has thus upgraded their recommendation to 6, and I recommend the paper for acceptance. After my detailed read, I give further suggestions for the camera ready below. The authors should carefully implement the suggestions from the reviewers in the camera ready version, in particular adding a broader discussion about their Section 3 in conclusion / discussion, as well as providing further intuition / discussion of the strict complementary condition. == Other detailed comments == - L57 and elsewhere in the paper -- I agree with R3 that the several mentions of the low-rank condition in this paper is confusing (and slightly misleading) given that all the results only apply to polytopes and not the matrix cones where these low-rank conditions are applicable. I suggest that either it is removed from the paper; or only mentioned once (for context) with a clear caveat that this paper does not provide any theoretical guarantees for these kinds of constraint set. - L79 - I suggest to edit with something along the lines of "See also a follow-up work by B and Z [1] which later extended this result, but still had a dependence on the ambient dimension for generic polytopes." (Given that [1] has "general polytopes" in their title and that their abstract (somewhat misleadingly) claim that "similar rates" to Garber & Meshi [13] "can be generalized to arbitrary polytopes", one could easily assume that the nice dependence on the dimension of the optimal face from [13] could hold in [1] as well, so it is useful to be quite clear that this is not the case after all. - Table 1 is a bit confusing given that [1] *does give* a rate for general polytopes for a decomposition invariant version of AFW with line-search (AFW-2 in their paper), with a better rate for the unit simplex. I think for clarity and proper coverage of the related work, it would good to add a row for [1] with their generic polytope rate applied (naively) to the unit simplex (actually, I'd suggest to just put 3 rows in the table as the rate for [12],[20], naive [1] are the same here). Then in the caption, say something like "Illustrative comparison of assumptions and convergence rates over the unit simplex" (...) "For [1], we used their rate for generic polytopes applied to the unit simplex for illustration; they also provide the tighter rate with dim(F*) (like [13] for a different algorithm), but these rates do not apply to generic polytopes." Or, alternatively, it would be much more interesting to provide an example of a simple polytopes where the rates would be different and expressible in simple form (though I understand that this is not easy given that the pyramidal width [20] / facial distance [25] is not that easy to compute, so sticking to the unit simplex is fine).