Review for NeurIPS paper: Revisiting Frank-Wolfe for Polytopes: Strict Complementarity and Sparsity

NeurIPS 2020

Revisiting Frank-Wolfe for Polytopes: Strict Complementarity and Sparsity

Review 1

Summary and Contributions: This paper is a follow-up on the recent works of Lacoste-Julien & Jaggi (2015) and Garber & Hazan (2016). These prior works presented “away-step Frank-Wolfe” variants for minimization of a smooth convex objective function over a polytope with provable linear rates when the objective function satisfies a quadratic growth condition. These rates, however, explicitly depend on the problem dimensions, and they fail to explain the good empirical performance when in large dimensions. This paper investigates the following fundamental question, copied from the lines 55-56 verbatim: “Can explicit dependence on the dimension be avoided when the set of optimal solutions lies on a low-dimensional face of the polytope?” Here is a summary of the contributions: This paper 1- presents a simple example which proves that the answer for the central question above is negative in general, 2- revisits the strict complementarity condition introduced in the early literature of the FW method (and abandoned in the modern literature) and relates this condition some notion of robustness, 3- gives a new analysis that gives linear rates that do not depend on the dimension “d” but only to the dimension “dimF*” of the face of the polytope containing the solution set. [update after author feedback] I kindly disagree with the authors’ comments on the presentation. The three reviewers are not a good representative of all audiences in NeurIPS since they are assigned for their specific expertise. Also, there is a big difference in terms of the presentation style of this paper and other theoretical papers appeared in NeurIPS in previous years referred to in the authors’ feedback. One of the main goals of these big ML conferences is the information exchange between different fields of ML, so the authors are expected to present the work to the broader community and discuss the impact from various perspectives. I still have considerations about the presentation. I raise my score to 6 but I strongly recommend the authors to address a broader audience in the revision if the paper gets accepted.

Strengths: The paper takes a step to close the gap between the theoretical guarantees and the empirical performance of away-step Frank-Wolfe methods in large-scale problems. To my knowledge, the results are novel. The paper is well-written, and the analysis is easy to follow.

Weaknesses: For this work, the potential audience in the NeurIPS community is limited. The paper presents new convergence guarantees for an existing algorithm to tighten the dimension dependence of the constants of the convergence rate under strict complementarity assumption. The algorithm is already known to guarantee linear convergence, and the strict complementarity assumption is not easy to verify a priori. I do not see a big practical impact of the new results, as it is not discussed in the paper. The technical impact is not discussed either. Can this observation lead to new results in other FW variants or settings?

Correctness: The claims look correct. I have read the paper carefully and did not encounter any issues.

Clarity: The paper is well written, in the sense the novelty and the contributions are very clear. The abstract is concise and it provides a complete summary of the work. Technical results are presented in a systematic and clear way. The ideas and observations behind the analysis are well presented. All these aspects simplify the reading. On the negative side, the presentation is unusually technical for machine learning venues. It lacks discussions on the impacts of the new results, especially from the practical perspective. The paper finishes right after the proof of the main result, which feels like the paper is incomplete. I recommend the authors to include a concluding section that summarizes the takeaways and discusses the potential impacts.

Relation to Prior Work: Relation to prior work and the contributions of this work is clearly discussed in the introduction. In particular, Table 1 compares this paper with the three most related results from the existing literature.

Reproducibility: Yes

Additional Feedback: - Table 1: Although delta and dimF* are not available a priori, it is possible to construct these bounds a posteriori. That would be interesting to see a comparison of the standard bound that depends on the dimension “d”, the new bound, and the actual performance in some realistic numerical experiments. - Theorem 2: This example is clearly related to the examples in Jaggi (2013, Lemma 3) and Lan (arxiv1309.5550v2, 2014, Theorem 1), with the modification of “down-closed” unit simplex to produce a problem where the solution is at a vertex. I think this connection should be mentioned. - Line 142-147: This paragraph needs polishing. - Line 155: From the convexity of f^tilde we have -> From the convexity of f and f^tilde we have - Line 178-179: I recommend adding a small discussion to introduce Theorem 5 here. - Typo line 209: solution closet in Euclidean distance -> solution CLOSEST in Euclidean distance - Typo line 264: while FORM Theorem 2 -> while FROM Theorem 2 - Line 276: I recommend adding a Conclusions section.

Review 2

Summary and Contributions: This paper shows that the convergence rate of *Frank-Wolfe algorithm with away-step and line search* can actually have a better constant dependency. Assume that the optimal point lies on a lower-dimensional face of the polytope. The authors show that the algorithm's convergence rate depends on the dimension of the face, which can be much smaller than $d$ (i.e. the dimension of the polytope).

Strengths: A new result regarding *Frank-Wolfe algorithm with away-step and line search* is presented in this paper. Previous linear-rate results are of the form \exp( - t / d), while this paper shows that the rate can be improved to \exp( - t / dim(F*)) under strict complementary condition [Wolfe 1970], where dim(F*) is the dimension of the face that contains the optimal solution which can be much smaller than $d$. I believe that the result is significant as it shows when the algorithm can be faster ---- the result improves the previous theoretical guarantees of the algorithm.

Weaknesses: I understand that this is a theory paper. But it will be better if a proof-of-concept experiment is provided.

Correctness: I read the proof and the analysis thoroughly. It seems sound and correct. Typos: line 200: "\eta \in [0,1]" but \eta_{\max} could be larger than 1 line 210: "x^* = \sum_{i \in [n] } (lambda_i - Delta_i) + ... " v_i is missing .

Clarity: yes.

Relation to Prior Work: yes.

Reproducibility: Yes

Additional Feedback: *** after rebuttal *** I've read the authors' rebuttal and I maintain my score.

Review 3

Summary and Contributions: This paper revisits a classic result from Wolfe's book and (1) shows that in general, the dependency on the dimension in the rate is that of the ambient space. This is already extremely important as the sparsity of the solution does not improve this dependency. (2) it proves that under the strict complementarity condition the dependency can be improved to that of the face where the solution lays. ------------- Post Rebuttal: I decided to keep my score but I would like the authors to add a conclusion section that is used to position their work and discuss the implications of their results for the neurips community. Note that I am not asking for more experiments (as answered in the rebuttal).

Strengths: The paper is crisp and well written. It addresses an important open problem in optimization, unifying Wolfe's conjecture with the modern linear convergence analysis. I believe this paper may have a significant practical relevance as it significantly improves the understanding of the FW linear convergence rates and bridges a gap between theoretical analysis and previous experimental evidence.

Weaknesses: These are not weaknesses but rather questions. 1) Is there a general relation between the strict complementarity, F*, and the pyramidal width? I understand it in the case of the simplex, I wonder if something can be said in general. 2) It would be useful to discuss some practical applications (for example in sparse recovery) and the implication of the analysis to those. In general, I found the paper would be stronger if better positioned wrt particular practical applications. 3) I found the motivation in the introduction with the low-rank factorization unnecessary given that the main result is about polytopes. If the result has implications for low-rank matrix factorization I would like to see them explicitly discussed.

Correctness: I believe so

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: