Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
******** Added after authors' feedback *********
Please include a simulation where the loss is not strongly convex but the
PG algorithm converges linearly (for different levels of stability), and,
importantly, discuss when the theory fails. Please also discuss how this
can be extended to the analysis of ADMM.
*************************************************
This paper proves the linear convergence of the proximal gradient method
applied to trace-norm regularized learning problems in which the loss
function has the form f(X) = h(A(X)), where A is a linear operator and h is
strongly convex on any compact set and has a Lipschitz continuous gradient.
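For concreteness, a minimal sketch of the iteration under discussion
(illustrative code, not the authors' implementation; the proximal map of the
trace norm is singular value soft-thresholding, and the step size is taken
here as 1/L for a Lipschitz constant L of the gradient of f):

```python
# Proximal gradient for  min_X  f(X) + tau * ||X||_*,  with f(X) = h(A(X)).
import numpy as np

def svt(Y, t):
    """prox_{t * ||.||_*}(Y): soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def proximal_gradient(grad_f, X0, tau, L, n_iter=1000):
    """Iterate X_{k+1} = prox_{alpha*tau*||.||_*}(X_k - alpha*grad_f(X_k)), alpha = 1/L."""
    X, alpha = X0.copy(), 1.0 / L
    for _ in range(n_iter):
        X = svt(X - alpha * grad_f(X), alpha * tau)
    return X
```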
This paper extends to the trace norm the results of Tseng [20], of Tseng and
Yun, "A coordinate gradient descent method for nonsmooth separable
minimization", and of Zhang et al. [22], which established the same result
for lasso and group lasso using the "error-bound condition". This is a
non-trivial extension, but the contribution seems purely technical.
The presentation of the proofs is mostly clear.

Strength:
- Shows the linear convergence of the proximal gradient algorithm, extending the result of Tseng et al.

Weaknesses:
- The contribution is purely technical.
- I would like to see a numerical example showing the linear convergence.
More details:
1. The outlines of the proofs of Theorem 3.1 and Lemma 3.2 seem very similar to those of Tseng. The authors should refer to the original work more precisely to make their contribution clearer.
2. I would like to see a numerical example that indeed shows the linear convergence (a semilog plot of function value vs. the number of iterations).
3. It is not clear how the constants \kappa_1, ..., \kappa_4 depend on the choice of \underbar{\alpha} and \bar{\alpha}.
Minor issue:
4. The sequence of inequalities before inequality (13) is confusing. The last inequality follows from the convexity of the trace norm and the fact that -\bar{G} \in \tau\partial\|X\|_{\ast}, not from the intermediate inequality (see p. 289 in [20]).
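For concreteness, the subgradient step presumably being invoked (writing \bar{X} for the point at which -\bar{G} \in \tau\partial\|\bar{X}\|_{\ast}; notation inferred from the review) is

\tau\|X\|_{\ast} \ge \tau\|\bar{X}\|_{\ast} + \langle -\bar{G}, X - \bar{X} \rangle for all X,

which rearranges to \tau\|X\|_{\ast} - \tau\|\bar{X}\|_{\ast} + \langle \bar{G}, X - \bar{X} \rangle \ge 0 without appealing to the intermediate inequality.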
Q2: Please summarize your review
in 1-2 sentences
Extension of previous linear convergence result based on the "error-bound
condition" from lasso and group lasso to the trace norm.

Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proves linear convergence of the standard
proximal gradient method for trace norm regularized problems, under weaker
assumptions than strong convexity of the loss function, which is a crucial
problem for many ML and signal processing applications.
The paper is carefully and clearly written. It closely follows the
Lipschitzian error bound technique of Tseng [20], whose application to the
nuclear norm case had been left as an open question. The main part of the
proof seems correct, to the best of my understanding.
* What I miss in this paper is a clear comparison/analogy to the vector case
of l1-norm regularization, as e.g. in [20,22] and related work. It would be
very helpful to relate to the results and techniques used in that case, and
to explain if and why the l1-case techniques apparently do not translate
directly to the trace norm. (Also, the l1-norm case should now follow from
the trace-norm one as a corollary.)
As for judging the value of the paper, this should be taken into account, as
some people might find it not extremely surprising that the l1-result also
holds for the trace norm (I still find it a very valuable contribution,
though).
* Independent of that, I would suggest a slight change to the title of the
paper, as the main new contribution concerns the loss part more than making
the trace norm regularization a *structured* one. (The newly gained
flexibility from the linear operator A is on the loss side, not in the
regularizer.)
*** Minor issues
- Notation: should P be called a sampling or a mask operator? (Sampling might hint at randomness, but here P is fixed.)
- l337: broken reference.
- The meanings of "Q-" and "R-" linear convergence are maybe not that widely known, and could be recalled for clarity.
- The residual R(X) defined in (6) could additionally be explained in relation to the original definition (46) in [20] for better interpretation.
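For reference, the standard definitions (stated here in generic notation; the paper's exact scaling of R may differ): a sequence {X^k} converges Q-linearly to X^{\ast} if there is a q \in (0,1) with \|X^{k+1} - X^{\ast}\| \le q\|X^k - X^{\ast}\| for all sufficiently large k, and R-linearly if \|X^k - X^{\ast}\| \le C q^k for some C > 0 and q \in (0,1), i.e., the errors are dominated by a geometrically decaying sequence. The residual in (6) is presumably the proximal residual R(X) = prox_{\tau\|\cdot\|_{\ast}}(X - \nabla f(X)) - X, which vanishes exactly at the optimal solutions and plays the role of definition (46) in [20].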
In general, I'd vote to include a bit more discussion, and perhaps
conclusions, in place of the long technical part of the proof of Lemma 3.2
in Section 3.
*** Update after author feedback: Thanks for answering most of the
questions. As for the title, I was referring to the word *structured*, which
I found unnecessary in this context, but of course I leave that to the
authors.

Q2: Please summarize your review in 1-2 sentences
The paper proves linear convergence of the standard
proximal gradient method for trace norm regularized problems, under weaker
assumptions than strong convexity of the loss function, which is a crucial
problem for many ML and signal processing applications.
Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
--------added after authors' rebuttal--------
The proof of |R_k| <= kappa |X_k - X_{k+1}| has been explicitly established in "Scalable nonconvex inexact proximal splitting", Suvrit Sra, NIPS'12 (Lemma 4).
--------end--------
For trace norm regularized risk minimization, a popular method is the
proximal gradient algorithm. The sublinear global convergence rate of this
algorithm is well understood, and it is also known that the rate is linear
if the loss is additionally strongly convex. This paper relaxes the strong
convexity assumption by following the framework established by Luo and
Tseng. In particular, the authors prove a local error bound that allows them
to show that the proximal gradient method converges asymptotically at a
linear rate for trace norm regularization. The paper more or less follows
the usual pattern for proving the local error bound, with a new observation
that allows simultaneous diagonalization of the matrices.
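For reference (standard statements, in generic notation with F the objective and F^{\ast} its minimum): with step size 1/L, the proximal gradient method always guarantees the sublinear rate F(X^k) - F^{\ast} \le L\|X^0 - X^{\ast}\|_F^2 / (2k), whereas a linear rate means F(X^k) - F^{\ast} \le c\rho^k for some \rho \in (0,1); the paper establishes the latter only for all sufficiently large k, hence "asymptotically".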
Quality: This paper appears to be technically sound. The only gap I had in
mind is in the appendix, line 085: please explain why such a choice of
\kappa_3 (say, \alpha < 1) leads to the claimed inequality
R(X^k) <= \kappa_3 |X^k - X^{k+1}|_F.
Clarity: I could follow the paper without much difficulty, although, given
its technical nature, it might be a good idea to lay out the main claims
first and then go into the details one by one. In particular, after stating
the local error bound, the authors might want to postpone its very technical
proof and turn immediately to proving the linear rate (which, by the way,
contains hardly anything new).
Originality: The originality of this paper is moderate. The observation that
allows the authors to simultaneously diagonalize matrices seems new and
interesting; other than that, the authors more or less follow the existing
approach to establish the local error bound. The remaining steps are not new.
Significance: The significance of this paper is mainly theoretical, with
little impact on algorithm design. While it is certainly great to extend the
linear convergence rate of the proximal gradient algorithm to more practical
scenarios (i.e., a relaxed form of strong convexity), the significance is
nevertheless lowered by the asymptotic nature of the proof (and claim). The
way the authors handle the diagonalization of matrices might be of some
independent interest.

Q2: Please summarize your review in 1-2 sentences
This work proves that the proximal gradient algorithm, when applied to trace
norm regularized risk minimization, converges asymptotically at a linear
rate (under a relaxed form of strong convexity). The authors mostly follow a
well-established framework due to Luo and Tseng, with a new observation that
allows them to (simultaneously) diagonalize matrices and hence (more or
less) reduce to the vector case. Overall, this work is incremental and
moderately interesting from a theoretical perspective.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
First of all, we would like to thank the reviewers for
their careful reading of the manuscript and insightful comments.
Let us begin by clarifying the nature of our contribution and its
relation to existing literature. The main contribution of this work is to
establish, for the first time, an error bound condition for a class of
trace norm minimization problems. As such, the paper indeed has a more
technical/theoretical flavor. Nevertheless, we believe that such a
theoretical study is important and interesting for the following reasons:
1. The existence of the aforementioned error bound has been a
fundamental and intriguing open question in the optimization community
since the work of Tseng [20] about 4 years ago. Our paper resolves this
question in the affirmative.
2. The error bound proven in this
paper allows us to have a better understanding of the convergence behavior
of various first-order methods that are widely used in the machine
learning community. For instance, it allows us to prove the linear
convergence of the proximal gradient method when applied to a class of
trace norm-regularized regression problems. As already noted by the
reviewers, the objective functions in these problems are not necessarily
strongly convex, and hence previous linear convergence results in the
literature cannot be applied to these problems. In addition, our error
bound can be used to establish the linear convergence of the alternating
direction method of multipliers (ADMM) for solving the widely studied
robust principal component analysis (RPCA) problem. Due to the page limit,
we are not able to include this result in our NIPS submission, but we will
do so in the full version.
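For reference, the RPCA problem referred to here is usually formulated (in generic notation) as \min_{L,S} \|L\|_{\ast} + \lambda\|S\|_1 subject to L + S = M; ADMM applied to this splitting alternates a singular value thresholding step in L, an entrywise soft-thresholding step in S, and a dual update.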
Now, in view of the \ell_1 regularization error bounds (such as those for
LASSO and group LASSO) discussed or established in Tseng [20], the error
bound for trace norm regularization may not itself seem surprising. However,
its proof, as given in this paper, is quite non-trivial and requires some
new ideas, despite the apparent similarities with Tseng's proof in [20]:

1. For LASSO, the proof of the error bound condition relies heavily on the
fact that the level sets of \|x\|_1 are polyhedral (the unit ball
\{x : \|x\|_1 \le 1\} is the convex hull of the 2n points \pm e_i). This
property does not hold for the level sets of \|X\|_{\ast}: the unit ball
\{X : \|X\|_{\ast} \le 1\} is the convex hull of the rank-one matrices uv^T
with \|u\|_2 = \|v\|_2 = 1 and has infinitely many extreme points.
2. Tseng [20] considered the group LASSO regularizer and developed a
framework to handle its non-polyhedral structure. His argument is mainly
based on coordinate splitting, which is motivated by the structure of the
proximity operator for group LASSO. However, such a technique is not
directly applicable to our problem because of the non-commutativity of
matrix multiplication (the two proximity operators are recalled after point
3 below). We observe that, instead of dealing with the whole optimal set, we
can first bound the distance from the current iterate to a subset of the
optimal set in which all the optimal solutions and the optimal gradient are
simultaneously diagonalizable (Proposition 3.3). This allows us to bypass
the non-commutativity issue. Moreover, we need to use matrix perturbation
theory and the structure of the proximity operator for the trace norm to
obtain a small-o(\gamma_k), rather than big-O, limiting behavior of the
iterates (see eq. (20)), which is crucial in our subsequent analysis.
3. From an optimization-theoretic perspective, the results in
Tseng [20] and Zhang [22] correspond to error bounds for certain conic
quadratic inequalities (CQIs). In contrast, our error bound is for a class
of linear matrix inequalities (LMIs), which includes those in [20,22] as
special cases.
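For reference, the two proximity operators contrasted in point 2 (standard formulas, in generic notation): the proximity operator of the group LASSO regularizer \tau\sum_g\|x_g\|_2 acts blockwise, mapping each block x_g to \max(1 - \tau/\|x_g\|_2, 0)\,x_g, whereas the proximity operator of the trace norm acts on the singular values, prox_{\tau\|\cdot\|_{\ast}}(Y) = U\,\mathrm{diag}((\sigma_i - \tau)_+)\,V^T for a singular value decomposition Y = U\,\mathrm{diag}(\sigma)\,V^T. The latter couples all entries of Y through its singular vectors, which is the source of the non-commutativity issue mentioned above.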
Detailed replies to Reviewer_4:
1.
Actually, once Lemma 3.2 is available, the proof of Theorem 3.1 is
completely standard (see, e.g., [9,20]). The hard part is to prove the
lemma, especially part (c) (this has also been pointed out in Tseng [20]).
Although Tseng proved a similar lemma for group LASSO, his techniques could
not handle the case of trace norm minimization due to the non-commutativity
of matrix multiplication. Our approach, as explained above, exploits several
new observations to tackle these difficulties.
2. Thank you for your suggestion. We shall include numerical examples to
demonstrate the linear convergence in the final version (an illustrative
sketch of such an experiment is given after these replies).
3. For Theorem 4.1, we allow the step sizes to vary across iterations merely
for the sake of generality. In an implementation of the algorithm, one could
use a constant step size \alpha = 1/(2L) throughout the iterations. The
constants \kappa_1, \kappa_2, \kappa_4 are related to \alpha, and the
details can be found in the appendix; for example,
\kappa_1 = (1/\alpha - L)/2, so the constant choice \alpha = 1/(2L) gives
\kappa_1 = L/2. The constant \kappa_3 has no relation to \alpha; however,
other factors, such as the condition number of the given matrix A, affect
its value.
4. Thank
you for your comment. The intermediate inequality can indeed be removed
using your observations.
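An illustrative sketch of the kind of experiment requested in point 2 (a synthetic trace-norm regularized least-squares instance; all sizes and parameters are placeholders rather than the experiments that will appear in the final version). On a semilog scale, the optimality gap F(X^k) - F^{\ast} should eventually decrease along a straight line, reflecting the linear rate:

```python
# Synthetic trace-norm regularized least squares solved by proximal gradient;
# the semilog plot of the optimality gap illustrates the (asymptotic) linear rate.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m, n, p, tau = 20, 15, 150, 1.0
A = rng.standard_normal((p, m * n))                 # linear map A, stored as a p x (mn) matrix
X_true = rng.standard_normal((m, 3)) @ rng.standard_normal((3, n))
b = A @ X_true.ravel()

F = lambda X: 0.5 * np.sum((A @ X.ravel() - b) ** 2) + tau * np.linalg.norm(X, 'nuc')
grad_f = lambda X: (A.T @ (A @ X.ravel() - b)).reshape(m, n)

L = np.linalg.norm(A, 2) ** 2                       # Lipschitz constant of grad_f
alpha, X, vals = 1.0 / L, np.zeros((m, n)), []
for _ in range(3000):
    U, s, Vt = np.linalg.svd(X - alpha * grad_f(X), full_matrices=False)
    X = U @ np.diag(np.maximum(s - alpha * tau, 0.0)) @ Vt   # singular value soft-thresholding
    vals.append(F(X))

gap = np.array(vals) - min(vals) + 1e-16            # F(X^k) - F^*, with F^* approximated by the best value
plt.semilogy(gap)
plt.xlabel('iteration k')
plt.ylabel('F(X^k) - F^*')
plt.show()
```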
Detailed replies to Reviewer_8:
1. Thank you for your positive comments on our paper. For the
comparison of our techniques with those for the \ell_1 case, please see
the third paragraph above (right above the replies to Reviewer_4).
2. Thank you for your suggestion on the title. We will change
“Structured Trace Norm Regularization” to “Structured Trace Norm
Minimization” or variants thereof.
3. We will fix the minor issues
in the final version.
Detailed replies to Reviewer_9:
1.
Thank you for your comment. The inequality in line 085 of the appendix is
indeed not as direct as we have presented. We will update it in the final
version. Its proof actually follows similar lines to the proof of (A.1) in
Luo and Tseng [9], p. 174.
2. Regarding originality, we would like
to refer the reviewer to the third paragraph above (right above the
replies to Reviewer_4). As argued there, the problem considered in this
paper does contain challenges that are not present in existing work, and
our proof requires some new ideas.