Paper ID: 92
Title: Polar Operators for Structured Sparse Estimation
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Comments:

This paper proposes a new method for structured sparse learning based on the generalized conditional gradient (GCG) framework. Several polar operators are proposed for different structured sparsity patterns, e.g. group sparsity and path coding. The comparisons of the proposed method with existing ones in the experiments seem promising.

The proposed algorithm is very interesting and seems useful for several applications. There are two major concerns. Theoretically, GCG has a slower convergence rate than APG; with the introduced local optimization (Line 5), it is claimed to converge faster in practice, but without solid theoretical support. As for the experimental comparisons, the experiments are overall more illustrative than conclusive: the proposed method is compared with standard solvers only on special applications. It is therefore hard to conclude that the proposed algorithm is a serious alternative for structured sparsity problems. State-of-the-art algorithms for these applications should be included.

Some additional comments:

1. In line 125, it is stated that “Although GCG has a slower theoretical convergence rate than APG, the introduction of local optimization (Line 5) often yields faster convergence in practice”. What is the support for this conclusion? The paper would be significantly strengthened by solid theoretical support for this claim.

2. In Algorithm 1, Line 4 finds the optimal step sizes for combining the current iterate wk with the direction vk, and Line 5 locally improves the objective (1) by maintaining the same support pattern while re-optimizing the parameters. Both of these steps require access to the loss function f(x). If the loss function is complex, the computational cost cannot be ignored. On the other hand, APG does not need to access the loss function f(x) in each iteration. Can you discuss this point?

3. In Section 5.3, the proposed algorithm has been validated on the latent fused lasso problem. However, it is not compared with state-of-the-art algorithms for this problem, for example, J. Liu, et al., "An efficient algorithm for a class of fused lasso problems," KDD 2010. Without a fair comparison against state-of-the-art algorithms, it is hard to assess the benefit of the proposed algorithm.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a new method for structured sparse learning based on the generalized conditional gradient (GCG) framework. However, there are two major concerns about the current version of the paper.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors build their work on top of the generalized conditional gradient (GCG) method for sparse optimization. In particular, GCG methods require computation of the polar operator of the sparse regularization function (an example is the dual norm if the regularization function is an atomic norm). In this work, the authors identify a class of regularization functions based on an underlying subset cost function. The key idea is to 'lift' the regularizer into a higher-dimensional space together with some constraints in that space, where it has the property of 'marginalized modularity', allowing it to be reformulated as a linear program. Finally, the approach is extended to general proximal objectives. The results demonstrate that the method is able to achieve better objective values in much less CPU time when compared with another polar operator method and accelerated proximal gradient (APG) on group Lasso and path coding problems.

While I am not an expert in this field and did not understand many of the technical details, I found that the paper was not written clearly and used an unnecessarily large amount of notation. In addition, the motivation is slightly unclear from the start: on one hand an efficient method is desired, while on the other hand APG has significantly better convergence guarantees than this GCG-based method (O(1/\sqrt{\epsilon}) vs. O(1/\epsilon) iterations). While the results show a very large efficiency improvement, it is unclear to me whether this generalizes beyond these two cases. The lifting construction and the reformulation as an LP are interesting in their own right, but I do not feel sufficiently informed to evaluate them.
Q2: Please summarize your review in 1-2 sentences
The authors propose (1) a class of regularizers inducing structured sparsity and (2) an efficient optimization method for these regularizers, which shows impressive performance improvements over existing methods. However, the theoretical guarantees for this method do not match those of existing methods, and it is unclear whether the performance improvements generalize to other data sets.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
-- first-round comments --

This paper proposes new methods to compute the polar operator for a class of regularizers based on marginalized modular functions such as group sparsity.

Pros:
- well-written, and to-the-point. Logically organized
- subject is topical (group sparsity, sparsity on DAGs)
- Polar operators have not been studied too much (with respect to practical use) compared to proximal operators. The proposed method seems to be quite good
- Recent interest in conditional gradient/Frank Wolfe means that polar operators are of interest
- Nice secondary results (e.g., prox of lp + TV)

Cons:
- Prop. 2 allows one to compute the polar via the prox, due to the link between the dual functions and polar functions (and between a function and its conjugate, via Moreau's Thm, eq (43)). By itself, the proposition is not too surprising, and I strongly suspect that a "converse" can also be shown (i.e., one can compute the prox via the polar). If this is true, then does the advantage of the GCG over APG still hold? In the experiments, the proxes used in APG were very slow, but couldn't these be sped up to within a factor of 5 of the time to compute the polar? (since going the other way, it took about 5 evaluations of the prox to compute the polar). Also, it would be nice to say something more about the 1D line search; e.g., can one use Newton's method, as in SPGL1 (Friedlander and van den Berg)?

- Actually, is it not possible to find the polar via the prox in another manner? The polar operator for a set S at the point "d" is polar(d) = argmin_{x in S} -d^T x. The prox operator at the point "y" is prox(y) = argmin_{x in S} .5||x-y||^2. Then by expanding the quadratic, we roughly get lim_{alpha -> infinity} prox(alpha d) = polar(d). Computationally there are still issues, but it may be an interesting alternative.

- Regarding the exact solution of 1D TV that improves on [25]: I do not think this is new; please check the recent literature, e.g., L. Condat, "A direct algorithm for 1D total variation denoising," preprint hal-00675043, 2011 (which also mentions other exact, direct algorithms, such as the taut string algorithm).

- The complexity results use Nesterov's smoothing method, but the experiments use L-BFGS. Then, at line 362, the paper says "...the method proposed in this paper is not only faster... in practice, it also possesses a strictly better worst case time complexity..." The practical results use L-BFGS while the worst-case time complexity uses Nesterov's APG, so this statement is not correct unless you have a complexity bound for L-BFGS.

- The appendix is rather long and contains new propositions. I prefer appendices to be reserved for small details and lemmas. If such a long appendix is necessary, is NIPS really the right venue?

Neither a "pro" nor a "con", this paper will likely be hard to read for most readers, since it combines combinatorial-like functions (modularity), convex optimization (polars and duals), and statistics/machine learning (lasso, etc.)

In my opinion, the pros outweigh the cons considerably. I think this is a significant paper.

------------------------
new comments after reading feedback:

- In response to the other reviewer, you wrote "APG also needs to evaluate the loss function if the Lipschitz constant requires adaptive estimation". Actually, it is possible to do this check using just the iterates and the gradients: if the gradient is Lipschitz, it is also firmly non-expansive (when scaled by the Lipschitz constant), so you can use the firmly non-expansive bound as the check (see the old Eckstein-Bertsekas paper on the Douglas-Rachford iteration).
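Concretely, the check being alluded to is presumably the co-coercivity bound: for convex f with L-Lipschitz gradient, ||\nabla f(x) - \nabla f(y)||^2 <= L <\nabla f(x) - \nabla f(y), x - y> for all x, y. If a pair of iterates and their gradients violates this inequality for the current estimate of L, the estimate is too small and can be increased, and the test uses only quantities the algorithm has already computed.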

Also, this comment
"- complexity results used Nesterov's smoothing method..."
was not addressed. You are claiming to be better in theory and in practice, but your theory uses APG and your practice uses L-BFGS, so theory and practice do not match.
Q2: Please summarize your review in 1-2 sentences
Novel paper that considers a relevant but not much explored topic: polar operators. Since not many methods have been proposed for these outside the trivial cases, the method in this paper is a significant improvement.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their valuable comments.

====
Reviewer_4:

Q1: GCG has slower theoretical rate but converges faster in practice.

A: True, the current analysis indicates that GCG in theory converges slower than APG. However, the former possesses some special properties that are not shared by APG:

1. We get to control the sparsity of intermediate iterates in GCG, because each iteration adds one "atom".

2. GCG naturally maintains an active set of variables that can be locally optimized. As observed by many groups on a variety of problems, e.g. [14, 15], local optimization over an active set can significantly accelerate convergence. Indeed, Liu et al. [25] also used the active-set technique (via local optimization and gradient descent) when computing the proximal map for the fused lasso, and obtained significant acceleration over APG. However, unlike Liu et al., who applied the active-set technique in the inner loop (i.e. the proximal operator), we combine GCG with local optimization in the outer loop.

3. Incorporating local optimization in the *outer* loop of APG is challenging, because APG maintains two sequences of iterates. Locally optimizing one sequence will break the synergy with the other.

Therefore, GCG shows promise and is worth studying, despite the slower rate of its vanilla form. The main goal of this paper is not to replace APG with GCG, but to point out that in some cases, by carefully leveraging the special properties of GCG, it can be made very competitive. Considering the great interest that conditional gradient methods have received in communities such as machine learning, and the relatively little work on computing the polar operator, we believe our results can help clarify the relative merits of GCG and APG.


Q2: Comparison with Liu et al.

A: In the fused lasso experiment, we applied our Algorithm 2 (Appendix I) to solve the proximal map that is used by both APG and GCG. During the rebuttal period, we complemented Figure 3 by adding APG with the proximal algorithm of Liu et al. (denoted APG-Liu). We observed that within APG, our proximal map allows the overall optimization to converge in slightly less CPU time than APG-Liu, while GCG further reduces the time by 50%.

Also during the rebuttal period, we conducted a direct comparison between our proximal algorithm and that of Liu et al. (SLEP 4.1). Both methods were invoked on an m-dimensional vector whose components were drawn independently from unit Gaussians (m = 10^4 to 10^6), and we varied \lambda in {0.01, 0.1, ..., 100}. The computational cost of our algorithm exhibits clear linear growth in m with an even slope. It is 2 to 6 times faster than Liu et al., with the margin growing wider as \lambda increases. See the detailed plots at the link above.


Q3: The proposed method is compared with standard solvers only on special applications.... The state-of-the-art algorithm on such applications should be included.

A: We did compare with the state-of-the-art algorithms for CUR-like matrix factorization and path coding, which are typical sparse coding problems. In fact, as stated in the paper, the implementation of APG was exactly the SPAMS package. These did not show up in Figures 1 and 2 because the time cost of a single call of the proximal update in APG was already enough for GCG to converge to a final solution. Other packages such as SLEP could also be tested. For the fused lasso comparison, please see Q2.


Q4: Local optimization requires access to the function value. If the loss function is complex, the computational costs cannot be ignored. On the other hand, it is not necessary to access the loss function f(x) in each iteration of APG.

A: All first-order algorithms, including APG and GCG, require access to the gradient of the loss function. APG also needs to evaluate the loss function itself if the Lipschitz constant requires adaptive estimation. Moreover, in most applications, evaluating the function value is no more expensive than evaluating the gradient.

In practice, our overall scheme indeed requires fewer function/gradient evaluations than APG, because (1) local optimization substantially reduces the number of outer iterations required by GCG, and (2) the local optimization itself, which is based on quasi-Newton methods, empirically converges in a small number of iterations, thanks to the small support size and the low accuracy required.
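To make the evaluation count concrete, a minimal schematic sketch of such an outer loop is given below. This is not the actual implementation: f, grad_f and polar_oracle are placeholder callables, the regularization term and atom-weight bookkeeping are omitted, and the local step is an off-the-shelf L-BFGS restricted to the current support.

    import numpy as np
    from scipy.optimize import minimize, minimize_scalar

    def gcg_sketch(f, grad_f, polar_oracle, n, iters=50):
        # Schematic GCG loop: polar step, bounded 1-D line search, then local
        # re-optimization (L-BFGS) over the fixed support of the iterate.
        w = np.zeros(n)
        for _ in range(iters):
            v = polar_oracle(-grad_f(w))           # new atom / descent direction
            # "Line 4": 1-D search over the combination (1 - t) * w + t * v
            ls = minimize_scalar(lambda t: f((1 - t) * w + t * v),
                                 bounds=(0.0, 1.0), method='bounded')
            w = (1 - ls.x) * w + ls.x * v
            # "Line 5": local re-optimization restricted to the current support
            support = np.flatnonzero(w)
            if support.size:
                def embed(z):
                    full = np.zeros(n)
                    full[support] = z
                    return full
                res = minimize(lambda z: f(embed(z)), w[support],
                               jac=lambda z: grad_f(embed(z))[support],
                               method='L-BFGS-B',
                               options={'maxiter': 10})  # low accuracy suffices
                w = embed(res.x)
        return w

Each outer iteration touches f and grad_f only in the 1-D line search and in a handful of L-BFGS steps on a small subproblem, which is the source of the savings claimed above.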


====
Reviewer_5:

Q1: GCG has a slower rate than APG

A: Please see Q1 for Reviewer_4.


====
Reviewer_6:

Q1: Converse of polar via prox operator.

A: Indeed, it is possible to reduce the prox to a sequence of polar evaluations; for instance, one can simply apply the GCG algorithm to the prox objective. However, for all reductions that we are aware of, reducing the prox to the polar is usually less efficient than the other way around.
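As a minimal sketch of one such reduction: applying GCG to the proximal objective min_x (1/2)||x - y||^2 + \lambda \Omega(x) only requires the polar of \Omega at y - x (the negative gradient of the smooth part), so each inner GCG iteration costs one polar call; reaching \epsilon accuracy on the prox then takes on the order of 1/\epsilon polar evaluations, which is why this direction of the reduction tends to be the less efficient one.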

Q2: Can the prox be sped up?

A: In the first part of the paper (Section 3), we do not reduce the polar to the prox but compute the polar directly through a sequence of derivations. So far it is not clear whether one can further improve the prox or the polar in these examples (either would be a nice contribution). Also, we do not claim that the polar is always cheaper than the prox, or the converse; it certainly depends on the regularizer. What we try to say is that the polar, if not cheaper, costs at most a couple of prox evaluations (using the reduction). (Note that the l_inf norm costs O(n), not O(n log n).)
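If the parenthetical refers to the proximal operator of the l_inf norm, the O(n) cost can be seen from the Moreau decomposition prox_{\lambda||.||_inf}(y) = y - P_\lambda(y), where P_\lambda is the Euclidean projection onto the l_1-ball of radius \lambda; that projection is computable in expected O(n) time by randomized pivoting (Duchi et al., 2008) rather than O(n log n) sorting. A self-contained illustration of this standard construction (our own sketch, not the paper's code):

    import numpy as np

    def project_simplex(v, z):
        # Projection of v onto {x >= 0, sum(x) = z}, expected O(n) time
        # (randomized-pivot algorithm of Duchi et al., 2008).
        U, s, rho = v.copy(), 0.0, 0
        while U.size > 0:
            pivot = U[np.random.randint(U.size)]
            G = U[U >= pivot]
            if (s + G.sum()) - (rho + G.size) * pivot < z:
                s, rho = s + G.sum(), rho + G.size   # all of G lies above the threshold
                U = U[U < pivot]
            else:
                U = np.delete(G, np.flatnonzero(G == pivot)[0])  # drop one copy of the pivot
        theta = (s - z) / rho
        return np.maximum(v - theta, 0.0)

    def prox_linf(y, lam):
        # prox of lam * ||.||_inf via Moreau decomposition:
        # y minus the projection of y onto the l1-ball of radius lam.
        if np.abs(y).sum() <= lam:
            return np.zeros_like(y)
        return y - np.sign(y) * project_simplex(np.abs(y), lam)

The randomized pivot plays the role of a median selection, which is how the O(n log n) sorting step is avoided.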

Q3: 1D line search

A: One can certainly use Newton's method, although we have not tried it yet.

Q4: Other way to reduce polar operator to prox operator.

A: This is a very interesting proposal; we will try it out.
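As a quick sanity check of the proposal (using the reviewer's definitions and assuming S is bounded): prox(\alpha d) = argmin_{x in S} (1/2)||x - \alpha d||^2 = argmin_{x in S} (1/(2\alpha))||x||^2 - d^T x, so as \alpha -> infinity the quadratic term vanishes and the minimizer approaches argmin_{x in S} -d^T x = polar(d), though a finite \alpha only yields an approximation.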

Q5: 1D total variation.

A: Thanks for pointing out the reference, which is also based on dynamic programming. Our derivation is completely different, leading to a rather different algorithm. We will cite the reference.