Paper ID: 1357
Title: A New Convex Relaxation for Tensor Completion
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new convex lower approximation to the average of the mode-n ranks of a tensor, which is useful for learning low-rank tensors from a set of linear measurements. The approximation is shown to be tighter (at some points) than the commonly used tensor trace norms extended from the matrix case. Efficient optimization algorithms are also designed based on the alternating direction method of multipliers, together with an efficient algorithm for computing the proximal operator. Experiments on one synthetic dataset and two real datasets show that, compared with conventional trace norm regularization, the proposed relaxation yields statistically significantly lower RMSE for tensor completion, with a tolerable increase in computational cost.

The idea that motivates the new norm is very interesting, especially the key property of invariance to matricization in different modes. The experimental results are also encouraging. I would like to recommend accepting this paper for publication.

I am wondering if the l_2 ball in Lemma 1 can be extended to other norm balls. Based on the proof, the following two conditions will be necessary:
a) the resulting f**_\alpha(\sigma(W_(n))) in Eq 6 is independent of n,
b) \alpha is large enough such that \Gcal_\infty \subseteq \alpha \Gcal_2.

Here \Gcal stands for \mathcal{G}.

Lemma 1 can be stated in this generalized setting as:

Define f_\alpha as card(x) restricted to the ball of ||.|| with radius \alpha, so that its biconjugate f**_\alpha is the convex envelope of the cardinality on that ball. Then for any x \in R^d such that ||x|| = \alpha, it holds that f**_\alpha(x) = card(x).

The proof is almost the same as in the paper. In equation (7), change ||s_{1:r}||_2 to ||s_{1:r}||_*, where ||.||_* is the dual norm of ||.||. Then pick s = k \alpha y, where y = argmax_{||z||_* = 1} <z, x>, so that ||y||_* = 1 and <y, x> = ||x|| = \alpha by the definition of the dual norm. Since ||s||_* = k \alpha, for k large enough we get f**_\alpha(x) >= <k \alpha y, x> - (\alpha ||s||_* - card(x)) = k \alpha^2 - k \alpha^2 + card(x) = card(x).
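(For the matching upper bound, note that the biconjugate never exceeds the function itself: f**_\alpha(x) <= f_\alpha(x) = card(x) whenever ||x|| <= \alpha, so combined with the inequality above this gives f**_\alpha(x) = card(x) on the sphere ||x|| = \alpha.)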

Note the proof doesn’t need the condition a) or b).

Second, Proposition 2 can be extended as:

There exists \Wcal \in \Gcal_\infty such that F_\alpha(\Wcal) > |||\Wcal|||_tr.
Furthermore, for every \Wcal in the unit ball of ||.||, it holds that F_1(\Wcal) >= |||\Wcal|||_tr.

Indeed, the second clause is trivial from the definition of the Fenchel biconjugate (it is the pointwise supremum of all convex functions that minorize \Omega(\Wcal) on the ball). The first clause does need the above assumptions a) and b).

I suspect the l_2 norm used in Lemma 1 is the only norm that satisfies these two conditions; it would be interesting to prove or disprove this. Of course, these conditions are only sufficient, not necessary, for the existence of a \Wcal such that the induced regularizer at \Wcal is greater than |||\Wcal|||_tr.

Minor comments:
1. Line 139: \Wcal \in \Gcal_2 => \Wcal \in \sqrt{p_min} \Gcal_2
2. In various places, ||\Wcal||_tr should be changed into |||\Wcal|||_tr.
3. ADM: isn’t Alternating Direction Method of Multipliers more commonly abbreviated as ADMM?
4. It may be useful to highlight that the proposed F_\alpha(\Wcal) is NOT uniformly greater than \Omega(\Wcal), i.e., it does not dominate \Omega(\Wcal).
5. What is the value of the relative RMSE in the experiments, that is, the RMSE divided by the ground-truth value?
Q2: Please summarize your review in 1-2 sentences
This paper proposes an interesting convex relaxation of the tensor rank, which is tighter than the conventional trace norm at some points. Efficient optimization algorithms are provided and the experimental results are encouraging.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new convex relaxation for the tensor rank minimization problem. The new convex regularizer is proved to be tighter than the trace norm, which is defined as the average of the nuclear norms of the matricizations. The idea is interesting. The main problem is that computing the proximal mapping of the new regularizer requires an iterative subgradient method, which may take many iterations and thus be computationally expensive. In the numerical part, the authors should discuss the effect of choosing different values of alpha, and also report the CPU times for the different problems.
Q2: Please summarize your review in 1-2 sentences
The authors should discuss the effect of choosing different values of alpha.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper is about tensor estimation under a low-rank assumption. The authors claim that the trace norm suggested in references [7,15,20] (Gandy et al., 2011; Liu et al., 2009; Signoretto et al.) does not provide a tight relaxation of the quantity of interest, a generalization of the matrix rank to tensors. The authors support their argument with a counterexample (page 4), a theoretical result (Lemma 1, page 3), and empirical results on both real and synthetic data. They suggest an ADMM scheme (please write it with two "M"s, as is the convention, to make clear that it is the same algorithm) for optimization with the suggested penalty. To obtain the penalty they use biconjugation, which is the standard way to do it.

I am not sure that the convex relaxation should be done on the Euclidean ball, as the authors do. In the case of the trace norm it is done on the operator (spectral) norm ball. For the max norm it is done on the element-wise ell_infty norm ball, which is then approximated by the max norm. Why should the Euclidean norm define the ball on which to relax the rank?
Minor question: what is the cost of computing the prox (Algorithm 1)? Each iteration has cost p^2, which is large; what does the stopping criterion imply about the overall computational cost?
Q2: Please summarize your review in 1-2 sentences
The paper suggests a convex relaxation of the tensor rank over the Euclidean norm ball, which is intriguing but not convincing to me.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
**** Updated after authors' feedback ****
My main concerns are still that
- the improvement compared to the L1 norm (or trace norm) comes from the discontinuity at the boundary of the domain, where the solution almost never lies (as the authors point out).
- on the other hand, the proposed regularizer probably agrees with the L1 norm when ||w||_1 ≤ alpha, because it agrees with the L1 norm at all points on the L2 ball that have cardinality one, and the convex hull of those points is the ball ||w||_1 ≤ alpha.
*****************************************


This paper proposes a new convex relaxation of the tensor multi-linear rank based on the convex envelope of the cardinality function restricted to an L2-norm ball of some radius alpha.

The proposed convex relaxation follows the line of previous studies [7,15,19,20,22] that used the trace norm of the unfoldings (matricizations) of a tensor along different modes. The current paper points out that the tensor trace norm used in previous studies is not the tightest convex relaxation with respect to the Frobenius norm (the L2 norm of the spectrum). To fix this, the authors apply the convex envelope of the cardinality function restricted to an L2-norm ball to the spectra of the unfoldings of a given tensor.

The authors show that the proposed convex envelope (of the cardinality) is tight for any vector whose norm equals the prespecified radius alpha. The trace norm is the tightest convex relaxation of the rank over the spectral-norm ball. The authors argue, however, that the spectral norm is not invariant to the choice of the mode along which a tensor is unfolded, and neither is the tightness of the trace norm. The Frobenius norm, on the other hand, is invariant, and so is the tightness of the proposed convex relaxation.
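(Concretely, every matricization W_(n) contains exactly the same entries as the original tensor, so ||W_(n)||_F is the same for every mode n, whereas the spectral norm of W_(n) can differ from mode to mode.)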

Some numerical experiments show that the proposed convex relaxation outperforms tensor trace norm in tensor completion.

Strength:
- The proposed convex relaxation is new and it seems to work.
- The idea might be used for inducing sparsity in general.
Weakness:
- The result of Lemma 1 seems applicable to other sparsity problems (vector sparsity, group sparsity, low-rank matrices). The authors should clarify how much of the contribution is specific to low-rank tensors.
- The authors do not prove that the proposed regularizer is the tightest convex relaxation of the combinatorial regularizer Omega(W) (they only show the tightness of the convex envelope function used to construct the regularizer in Lemma 1, and that it is *tighter* than the trace norm in Prop 2).
- The proposed convex relaxation seems rather hard to compute. It is infinity for ||x|| > alpha and it has many discontinuities for ||x|| = alpha. This probably makes the optimization, and also proving guarantees for the estimator, hard.
- It would be nice if the authors could discuss any connection between this work and the k-support norm proposed in Argyriou et al. (NIPS 2012).


Minor comments:
- The statement "if F_alpha(W) ≤ Omega(W) for every W \in G_2 then F_alpha(W) ≤ Omega(W) for every W \in G_infty" (line 138) seems false, because G_2 is a smaller set than G_infty.
- The choice of alpha as in line 321 may not be the best, because the proposed regularizer is infinity for ||W||_F > alpha. It should be chosen more carefully.
- It was not clear how tight the proposed regularizer is when ||W||_F < alpha. I guess it equals the trace norm when ||W||_tr ≤ alpha (Correct me if I am wrong).
- What kind of performance guarantee can you prove for this regularizer? How does that compare to Tomioka et al. "Statistical Performance of Convex Tensor Decomposition" (NIPS 2011)?
Q2: Please summarize your review in 1-2 sentences
This paper proposes a new convex relaxation of the tensor multi-linear rank. The new regularizer seems interesting and appears to work, but it was not clear how much of it is specific to tensors. In addition, though convex, the regularizer might be ill-behaved (not continuous).
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Rev 5
Your question about norms other than l_2 that could generalize the content of our paper is very interesting and requires more investigation. To summarize our selection of the l_2 ball:
Any matrix norm that can be written as a vector norm of the matrix entries is clearly matricization invariant (e.g., the element-wise l_p norms). Among them, the l_2 (Frobenius) norm is perhaps the only one that is also a spectral function, which allows us to use von Neumann's trace theorem to compute the convex envelope of the rank of a matrix.
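For reference, von Neumann's trace theorem states that <A, B> <= \sum_i \sigma_i(A) \sigma_i(B) for any two matrices A and B of the same size, with equality when A and B admit a simultaneous singular value decomposition; this is what allows the conjugate (and hence the convex envelope) of an orthogonally invariant function to be computed through the corresponding function of the singular values.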

We agree with the first four minor comments (regarding the 4th one, we will highlight that F_\alpha(\Wcal) \leq \Omega(\Wcal) for all \Wcal in the l_2 ball of radius \alpha).
Regarding the last comment, these are the relative RMSE results:
Synthetic dataset:
Tr Norm Our Regularizer
0.8168 0.7867
0.8243 0.7890
0.8674 0.8347
0.9513 0.9445
1.0000 1.0000
School dataset:
Tr Norm Our Regularizer
0.6132 0.6050
0.5923 0.5829
0.5755 0.5642
0.5622 0.5520
0.5566 0.5472
0.5494 0.5410
0.5449 0.5367
0.5422 0.5342
0.5363 0.5293
Video completion:
Tr Norm Our Regularizer
0.3227 0.3226
0.3194 0.3169
0.3121 0.2998
0.2989 0.2717
0.2802 0.2446
0.2588 0.2245
0.2412 0.2176
0.2273 0.2135
0.2159 0.2105

Rev 6
We have performed one more experiment varying the value of alpha on the synthetic dataset (for sigma^2 = 10^-4 and a training set composed of 10% of the tensor elements), where we know that the ground-truth alpha = 1. The results are as follows:
log_10(alpha) RMSE
-1 0.0111
-0.5 0.0110
-0.25 0.0099
-0.15 0.0091
-0.1 0.0090
-0.05 0.0089
0 0.0089
0.05 0.0089
0.1 0.0090
0.15 0.0092
0.25 0.0093
0.5 0.0093
1 0.0093
whereas the trace norm's RMSE is 0.0093. Note that our approach performs better than the trace norm for values of alpha around the ground truth. It does not perform well when alpha is much smaller than the ground truth, whereas it obtains better or identical RMSE when alpha is overestimated. Experiments on the School dataset (resp. video completion) show similar trends, namely our approach performs better than the trace norm for values of alpha in the range [10^-0.5, 10^1] (resp. [10^-0.25, 10^0.25]) times the value of alpha given in line 321.

Note that in lines 361-372 and Fig. 1 (right) we report experiments on the CPU time for different tensor sizes.

Rev 7
We employ the Frobenius norm because it is invariant with respect to the different matricizations of a tensor, a property the spectral norm lacks (see also the answer to Rev. 5).
Regarding the minor question, each iteration of the prox is quadratic because of the projection algorithm, which in the worst case (i.e., when the input is sorted in the opposite order) is quadratic. In practice, we have observed that the input to the projection algorithm is usually very far from the worst case; the intuition is that the input is obtained by moving along a negative subgradient from an already sorted vector.
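To illustrate the kind of behaviour we mean, here is a minimal sketch (in Python) of a naive pool-adjacent-violators projection onto the set of nonincreasing vectors; this is only an illustration, not the exact projection implemented in Algorithm 1. It degrades to quadratic time only when the input is sorted in the opposite order and is cheap on nearly sorted inputs:

def project_nonincreasing(v):
    # Euclidean projection of v onto {w : w[0] >= w[1] >= ... >= w[-1]}
    # by naive pool-adjacent-violators: adjacent violating blocks are
    # merged into their weighted average. The list splice makes each
    # merge O(p), so an increasing (reverse-sorted) input triggers
    # p - 1 merges and costs O(p^2), while an already sorted input costs O(p).
    blocks = [(x, 1) for x in v]               # (block value, block length)
    i = 0
    while i < len(blocks) - 1:
        (a, la), (b, lb) = blocks[i], blocks[i + 1]
        if a >= b:
            i += 1                             # ordering satisfied, move on
        else:
            blocks[i:i + 2] = [((a * la + b * lb) / (la + lb), la + lb)]
            i = max(i - 1, 0)                  # a merge may create a new violation to the left
    out = []
    for val, length in blocks:
        out += [val] * length
    return out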
Furthermore, we have performed one more experiment measuring the average (over 1000 trials) number of iterations (ANI) required by the prox algorithm to terminate (we stop Algorithm 1 when \hat{w} has not been updated for more than 250 consecutive iterations) for different values of p; the results indicate that this number grows gracefully with p:
p ANI
20 340.96
40 386.00
80 451.61
160 472.87
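As a rough back-of-the-envelope worst-case estimate, combining the O(p^2) per-iteration cost with the measured ANI gives about 473 * 160^2 ≈ 1.2 * 10^7 elementary operations for p = 160; since in practice the projection is far from its worst case, the actual cost is considerably lower.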

Rev 8
Weaknesses:
- The result in Lemma 1 can indeed be applied to other sparsity problems, and studying this could be an interesting extension of this paper. Here we focus on tensors, where the advantage of using the Frobenius norm over the spectral norm is clear, namely its invariance to the choice of the mode along which a tensor is unfolded.
- Because of the composite nature of the combinatorial regularizer Omega, it seems hopeless to find the tightest convex relaxation of Omega.
- The discontinuity on the boundary does not seem to be an issue for optimization. Indeed, we verified that in our simulations the optimal solution found always has norm smaller than alpha.
- Thank you for bringing to our attention the paper by Argyriou et al., which we will cite in the revised version. There are some important differences between our work and that paper: a) they consider the convex envelope of the *indicator function* of the cardinality of a vector on the L2 ball, whereas we consider the convex envelope of the cardinality itself on the L2 ball; b) they focus entirely on vector problems, whereas we consider tensor problems; c) perhaps most importantly, their regularizer does not seem easily applicable to tensors because it requires specifying N hyperparameters (a bound on the rank of each matricization involved in the convex relaxation), which makes adapting their regularizer to the tensor setting computationally challenging.

Minor comments:
- You are right, this should be modified to "F_alpha(W) <= Omega(W) for every W \in sqrt{p_min} G_2 for every..."
- The choice of alpha is based on a reasonable heuristic which seems to work well in practice (see also the answer to Rev. 6). Choosing alpha more carefully (e.g., by cross-validation) could lead to further improvements in performance, at the cost of additional computation.
- Since the level sets of F_alpha are closed, there exists W such that ||W||_F < alpha and F_alpha(W) > |||W|||_tr. Indeed, f^**_alpha is lower semi-continuous (l.s.c.), see Thm. 11.1 in Rockafellar and Wets, and consequently the function F_alpha is l.s.c. as well (Thm. 1.39), i.e., liminf_{W' -> W} F_alpha(W') >= F_alpha(W). In turn, this implies that for every eps > 0 there exists a point W in the interior of the alpha ball such that F_alpha(W) >= Omega(W) - eps.
- We have not yet carried out a statistical analysis of the method; Rademacher bounds could possibly be derived, but this is left for future work.